Users who have enabled the WAF have no easy way to see what it is blocking or allowing, or how much traffic it processes. This lack of visibility makes it harder to determine how to configure, tune, and evaluate the WAF.
Intended users
Users will view this after initially creating a cluster and installing the WAF, to confirm it is seeing traffic
Security team members will view this to see the distribution of blocked vs. allowed traffic
Further details
Reporting statistics and information about the WAF's behavior could be a very deep experience if we invested a lot of time in it. For this iteration, I'd like us to view it as an MVC and find a way to provide visibility with minimal product changes. We can then get feedback on what is useful or missing and build out a deeper experience in future iterations.
Proposal
Minimal
Display to users who have the WAF installed:
The amount of anomalous traffic the WAF identifies
The total amount of traffic it processes
Default time frame is 30 days.
I propose we put this in the Security tab as a set of text boxes, similar to how we show the number of vulnerabilities in the Security Dashboard. I'd like input from others on this.
Next steps
Table with detailed event listings
View raw logs themselves in the pod logs
Permissions and Security
Users should be required to have the same permissions to view this as they have for the security dashboard.
What does success look like, and how can we measure that?
Of all users who have the WAF installed, at least 75% view the provided information at least once within 30 days.
Of all users who have the WAF installed, at least 90% view the provided information within 90 days.
This demonstrates that users who install the WAF are actually looking at the results rather than installing it and ignoring it.
Documentation
Explain how to access the statistics.
Explain what the statistics are showing.
Explain what different results could be indicating.
@stkerr Thanks for creating this issue a few milestones out so we have plenty of time to consider design and research tasks!
@andyvolpe and I created a few JTBDs, and your feedback would be most valuable:
MVC:
When I activate my WAF for the first time, I want an indication that it is working, so that I can feel confident I set it up successfully.
When I am reviewing blocked/allowed traffic, I want to see this information visualized in real time, so that I can take action if I see an anomaly.
Post-MVC:
When I am reviewing blocked/allowed traffic, I want to be able to filter my data, so that I can focus and take action on anomalies.
When I am reviewing blocked/allowed traffic, I want to see historical info, so that I can spot trends and plan ahead for defensive action.
Re:
I propose we put this in the Security tab as a set of text boxes, similar to how we show the number of vulnerabilities in the Security Dashboard. I'd like input from others on this.
Can you clarify what you mean? Are you thinking of it becoming an item under the Security & Compliance menu at the same level as the Security Dashboard and Dependency List?
Or should it say something else; e.g. "WAF Statistics Reporting"?
Current assumptions:
We will need to communicate the path to the WAF Statistics after the user creates a cluster and installs the WAF (on the Configuration page); e.g. a button appears after installation is complete that says something along the lines of "View WAF Statistics".
For MVC, we won't have any control over the visual branding of the graph and will not be adding functionality (e.g. filters) on it. In other words, the graph will be pulled from a 3rd party (ModSecurity) and aside from general framework of the page (page title, padding) there won't be anything to design or configure (at least not until a future iteration, in which case another issue can look into added features and functionality).
Or should it say something else; e.g. "WAF Statistics Reporting"?
Will this screen grow to eventually house more than WAF stats? What do you think of renaming it as "WAF"/"Web Application Firewall"/something more generic so if we want to add things like configuration or WAF-specific reports, they could live there?
We will need to communicate the path to the WAF Statistics after the user creates a cluster and installs the WAF (on the Configuration page); e.g. a button appears after installation is complete that says something along the lines of "View WAF Statistics".
Correct.
For MVC, we won't have any control over the visual branding of the graph and will not be adding functionality (e.g. filters) on it. In other words, the graph will be pulled from a 3rd party (ModSecurity) and aside from general framework of the page (page title, padding) there won't be anything to design or configure (at least not until a future iteration, in which case another issue can look into added features and functionality).
To add some nuance here, ModSecurity is text-only output - there is no graph (that I'm aware of). We'll need to parse the log files it is producing, sum up the results, and display that ourselves. So I think we'll need design in the sense of understanding how we want to present the information, but agree with you that we don't need filtering and sorting for this iteration.
@beckalippert I think we could generalize the term based on what the WAF is reporting. Maybe something like Threat monitoring, Attack monitoring, Threat logs, or something along those lines.
@beckalippert @andyvolpe @stkerr I'm concerned about us over-specifying here and wanted to get your thoughts on a more generic approach to this.
It looks like the current direction is built around a custom dashboard for WAF reporting/stats. I'm seeing a number of similar ideas across our different stages but this could lead to a lot of UI debt rather than a more unified approach.
If the basic problem is that we need to aggregate data from the WAF (and future orchestration tools) and display it in some way, it might make sense to coordinate with Category:Logging around their work in #3711 (closed) and #30729 (closed). Elasticsearch and/or fluentd are good candidates for an aggregation tool, and we could leverage an existing tool like Kibana on top of that to visualize data.
To add some nuance here, ModSecurity is text-only output - there is no graph (that I'm aware of). We'll need to parse the log files it is producing, sum up the results, and display that ourselves. So I think we'll need design in the sense of understanding how we want to present the information, but agree with you that we don't need filtering and sorting for this iteration.
So the minimal MVC would be log availability. This would either result in something very close to duplicating the current functionality of Kubernetes Pod Logs (plus the enhancement in #3711 (closed)) or a manual process of parsing the log file to generate graphs (a basic feature of a visualization tool such as Kibana built on top of Elasticsearch)
So, I guess my basic question is: what do we gain by building a separate UI here and how does this approach fit into similar initiatives with our Category:Logging friends?
So, I guess my basic question is: what do we gain by building a separate UI here and how does this approach fit into similar initiatives with our Category:Logging friends?
@theoretick there are a few assumptions we have that are guiding our approach. Note these aren't validated yet as research is ongoing.
Competitors have a dedicated UI for metrics specific to logging and monitoring for security events (not always a good reason but we have to acknowledge that this is reflected in the industry)
Users who are viewing the WAF output are different than those who are viewing other metrics and have different needs/goals/expectations
We want to build for the eventuality of including the log data in a list format to accompany the metrics and statistics #13555 (closed)
Direction-wise I'm not sure if we want to link to an external service like Kibana, but I don't want to speak for business decisions; it's just something I'm unsure of.
So the minimal MVC would be log availability. This would either result in something very close to duplicating the current functionality of Kubernetes Pod Logs (plus the enhancement in #3711 (closed)) or a manual process of parsing the log file to generate graphs (a basic feature of a visualization tool such as Kibana built on top of Elasticsearch)
This issue is targeting a slightly different value point than what more usable log availability would address. Users can already view the raw logs today after going through some additional steps, but what this issue is trying to do is help them to more easily understand if the WAF is turned on, seeing traffic, and how much, if any, traffic it is blocking.
This same info could be deduced by a user processing the logs themselves, but doing this for them is really the key point this issue addresses, rather than making the logs themselves more readily visible.
We do need to make the logs more readily visible though - not diminishing the need for that at all.
It looks like the current direction is built around a custom dashboard for WAF reporting/stats. I'm seeing a number of similar ideas across our different stages but this could lead to a lot of UI debt rather than a more unified approach.
It's a fair point and a good observation.
I feel fairly confident that all the various security capabilities & data sources we collect from will end up in a centralized "Security" location together, since we want to provide a curated security-specific experience, regardless of where the information comes from in terms of underlying technology (e.g. combining pod logs and a SAST log in one place). We should be mindful as to how this will grow over time to not end up creating more work for ourselves in the future or duplicating efforts multiple times.
So, I guess my basic question is: what do we gain by building a separate UI here and how does this approach fit into similar initiatives with our Category:Logging friends?
Building it in this way is the minimal step we can use to deliver new value and gather additional feedback for future iteration. It's possible we could end up going in slightly different directions, based on what other groups are doing in a similar problem space.
Competitors have a dedicated UI for metrics specific to logging and monitoring for security events (not always a good reason but we have to acknowledge that this is reflected in the industry)
@andyvolpe which competitors? If we are considering AWS and Azure as similar enablement platforms they both use a unified logging system:
Users who are viewing the WAF output are different than those who are viewing other metrics and have different needs/goals/expectations
Potentially, but advocating for a "shift-left" approach, this should have some overlap with the same developers/ops people deploying apps. If we enable blocking mode, I (as an application deployer) want to see why requests are being blocked.
We want to build for the eventuality of including the log data in a list format to accompany the metrics and statistics #13555 (closed)
...
Building it in this way is the minimal step we can use to deliver new value and gather additional feedback for future iteration. It's possible we could end up going in slightly different directions, based on what other groups are doing in a similar problem space.
Thoughts?
@stkerr Currently there's no collective understanding of what we're using to build these charts. The most obvious approach I can think of is some log exporter, like fluentd/elasticsearch feeding these charts.
Monitor is already looking at adding Elasticsearch (#30729 (closed)) as part of the logging vision. However, Serverless just had a similar proposal that changed direction in order to align with Monitor as well. To quote @kencjohnston from this issue:
I've got some concern about this approach.
It appears to be at cross purposes with our broader logging and apm work (#30729 (closed))
It breaks our pattern of avoiding popping out to other UIs and breaking the single application value.
Please consider another approach.
So I just want to ensure that we are building something "Highly aligned, but loosely coupled", whereas this currently only feels "loosely-coupled". If we are looking at a way of bringing cluster-based data logging into our UI, it could result in a lot of redundancy and duplicate effort from both a product and implementation perspective.
Perhaps this doesn't change the UI, but it very well might. If we bring Monitor into the conversation this should give us time to figure out what the logging vision is and whether we're staying aligned with that vision or not.
@theoretick @twoodham to re-state some of our voice conversation from earlier today:
@stkerr Currently there's no collective understanding of what we're using to build these charts. The most obvious approach I can think of is some log exporter, like fluentd/elasticsearch feeding these charts.
We should build on top of what other teams have built already, but if there isn't something, we shouldn't block ourselves and wait.
We could either pick up the issue from their group to work ourselves or implement a different "boring" solution. In the future, when there is a definite direction & mechanism for all groups to use, we could then convert over. We need to make sure the implementation we do for this iteration is a two-way door approach in case changes are needed later.
One thought we discussed was grep'ing the WAF logs & counting the number of messages blocked that way.
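As a rough sketch of that idea only (the log path and match string are assumptions about the ModSecurity setup, not the actual implementation):

```typescript
import { createReadStream } from 'fs';
import { createInterface } from 'readline';

// Placeholder path and pattern; the real audit log location and message
// format depend on how the ingress/ModSecurity deployment is configured.
const AUDIT_LOG = '/var/log/modsec/modsec_audit.log';
const BLOCKED_PATTERN = /Access denied/;

async function countBlocked(path: string): Promise<number> {
  const lines = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  let blocked = 0;
  for await (const line of lines) {
    if (BLOCKED_PATTERN.test(line)) blocked += 1;
  }
  return blocked;
}

countBlocked(AUDIT_LOG).then((n) => console.log(`${n} blocked requests in the current log`));
```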
If we bring Monitor into the conversation this should give us time to figure out what the logging vision is and whether we're staying aligned with that vision or not.
Great point. Adding @dhershkovitch here for his thoughts and if there are longer-term logging vision items we should align to in the short term.
We had a call last week with @dhershkovitch on a path forward to stay aligned with devopsmonitor's vision for unified logging here.
Eventually we should plan on backing our interface with modsec logs stored within Elasticsearch, a deployment and process that's being actively worked on in the upcoming releases by groupapm. This should provide a single approach to log aggregation and storage on gitlab.
Since we don't want to stay blocked in the meantime, our proposed way forward would be to look at leveraging the existing prometheus integration for data aggregation and storage. This should work in a similar manner to our existing integration with GitLab for stat aggregation based on the existing Prometheus GitLab Managed App and Prometheus::ProxyService.
The only backend component needed will then be pushing the logs off our ingress controller by exposing an exporter to prometheus. We will need this exporter to parse the existing modsec_audit.log or look further into the remote logging capabilities of modsecurity (there are several, which can be explored within #32459 (closed)).
To ensure we can make a seamless transition in the future to ES, we can build our exporter using some basic log broker; i.e. a logstash/fluentd/fluentbit sidecar which we can later retarget away from the prometheus instance and towards ES.
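As a very rough sketch of the exporter idea (assuming the prom-client npm package; the metric name, the log matching, and the wiring that feeds lines in from the log broker are placeholders, not the actual design):

```typescript
import { createServer } from 'http';
import { Counter, register } from 'prom-client';

// Placeholder metric; a real exporter would likely expose more detail (rule IDs, severities).
const anomalies = new Counter({
  name: 'modsecurity_anomalous_requests_total',
  help: 'Requests for which ModSecurity recorded a rule match',
});

// Wherever the log broker delivers audit log lines, feed them through here.
// The string match is an assumption about the audit log format.
export function onAuditLogLine(line: string): void {
  if (line.includes('ModSecurity:')) anomalies.inc();
}

// Expose /metrics for Prometheus to scrape.
createServer(async (_req, res) => {
  res.setHeader('Content-Type', register.contentType);
  res.end(await register.metrics());
}).listen(9145);
```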
For the frontend we should be able to render any charts using either the existing echarts integration or the upcoming Grafana JSON embedded rendering (#31376 (closed)) if a user has an active grafana instance.
The other item that came up and needs further consideration is how we want to handle data-loss during the eventual transition to an Elasticsearch-backed system. There were a couple of points raised here:
If we configure our initial time window for stat rendering to correlate with our server log rotation, then we should be able to easily re-scan the active modsec log when swapping primary data stores, and the user should experience no data loss as we preserve the same timeframe retroactively. This rotation window may be small, however, so it will require further research.
From a usage perspective, going with a similar UI to the existing monitor page (or embedding within) initially would allow us to later swap to a dedicated standalone UI. This could be considered a future feature release and therefore make "data loss" acceptable since it's not technically the same feature.
Ability to show not only blocked but also logged actions by the WAF (so it will show useful data for users not in blocking mode)
Ability to graph blocked and/or logged traffic on a per WAF rule basis (so that users can identify both what is being attacked most and also what may be causing false positives)
Ability to show latency introduced by the WAF (in general, WAFs will introduce latency, and this can be significant on a per app / per rule basis).
@stkerr While this is scheduled for %12.5, there are some base changes we might want to make to the logging configuration to better enable our exporter, primarily changing the base logging format from Native to JSON logging. Since our existing documentation only provides directions for tailing the log file, we can probably change the format without a deprecation warning, but if you think that's necessary we could consider adding one within %12.4 in anticipation of this featureset.
@theoretick thanks for pointing it out. I don't think we need a deprecation notice for this in advance, as long as we document it properly in 12.5 as part of this issue.
It sounds like we're all on board moving forward with a dedicated page for Threat Monitoring (title open for discussion in case anyone would like to propose an alternative), which, as discussed with Sam earlier in this issue, would live under Security & Compliance > Threat Monitoring. Please let me know if I've interpreted the above discussion erroneously.
WIP wireframes for MVC look something along these lines:
This covers the need to display to users how many times the WAF has:
Allowed traffic
Blocked traffic
Total amount of traffic.
In addition, we'll be providing the instructions for viewing the WAF logs. (This part is the WIP and variations have been included in #13555 (closed).)
@beckalippert My recommendation is to let all on the issue know the copy is ready for review, but also to explicitly add @axil (our Defend Technical Writing counterpart) and ask for a review.
@theoretick Unless I've misunderstood, it sounds like the backend will expose one proxied Prometheus endpoint, which can be queried for the graph data. The actual Prometheus query itself needs to also be provided by the backend by some means. Perhaps the simplest approach this iteration is to just provide both in the rails template via data attribute(s).
What about the single statistics? Do you expect they can be fetched via the same endpoint, just with a different query, or will they be exposed via a different endpoint? Or could these just be the most recent data points from the graph data?
We should be able to expose all stats through the existing prometheus proxy endpoint. Since the "total requests" will just be the existing dataset returned on the Metrics page we can reuse the same request; i.e.
I'll still need to figure out how we're getting the data for blocked requests into prometheus at which point I can update this issue with the appropriate query params, but populating allowed responses using the aforementioned query would be a good start
It seems it will be more straightforward to expose the modsec stats as a separate endpoint; see the updated description and WIP MR !19789 (closed).
Obviously splitting queries between the two services isn't ideal. I'll see if we can make this a bit cleaner by proxying the prometheus metrics on the backend and returning them as a single payload.
@markrian can you provide some insight/preference in how we want to expose this? It seems like we have 2 really extreme approaches to stat payloads.
The vulnerabilities/summary style is basically: { "high": 12, "critical": 0, ...} and I think a hardcoded time range.
On the other hand, Operations / Metrics is proxying the full query. From your earlier comment ("The actual Prometheus query itself needs to also be provided by the backend by some means") I guess we do provide this for metrics? I couldn't actually find where this is done.
I think I'd push for something in the middle; an endpoint providing both aggregates and time-series that's queryable to proxy requests; something like:
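As an illustrative sketch only, with placeholder field names rather than a settled contract:

```typescript
// Hypothetical response shape: aggregate counts plus a time series,
// scoped by environment and time window passed as query params.
interface WafStatisticsResponse {
  totalTraffic: number;          // aggregate over the requested window
  anomalousTraffic: number;      // aggregate over the requested window
  history: {
    nominal: Array<[string, number]>;    // [ISO 8601 bucket start, request count]
    anomalous: Array<[string, number]>;
  };
  from: string;                  // resolved window start (ISO 8601)
  to: string;                    // resolved window end (ISO 8601)
  interval: 'minute' | 'hour' | 'day';
}
```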
But I'd prefer we stick to an existing approach if possible and I'm curious what's reusable on the frontend. What are your thoughts on keeping this easy to build out vs maintain?
Here's the current WIP raw output from elasticsearch that I'll keep updating as I normalize it. I'm wondering if I should just shim an adapter to make the queries a bit more prometheus-y, assuming that code is reusable. If we're just injecting the actual query somewhere, I suppose it doesn't matter.
@lkerr CC'ing you for visibility on my last comment, as well.
tl;dr - we're approaching a point where we need more consensus on negotiating the API contract. I'd like some feedback on what ~"technical debt" is already there prior to writing up a swagger doc.
How far are we from getting designs from the wireframe?
I have some questions/thoughts about the wireframe, as it is now.
Single statistics
Are we planning to display percentages in general, or counts? If percentages, then traffic allowed is by definition 100% - % traffic blocked. Does it make sense to display both, in that case?
What would "Gross Traffic" represent? Presumably this is the "Total amount of traffic" in the description, but I'm not clear on how that's quantified as a single statistic. Number of requests per unit of time, perhaps?
Chart
What are we hoping to convey with the chart? Proportion of blocked requests over time compared to allowed requests?
If we're displaying these as percentages, then they would by definition sum to 100%, so it's perhaps worth showing only % blocked. If, however, we're displaying counts (i.e., number of requests?) per unit time interval, should this be a stacked bar chart, or something else?
For the first iteration, the chart won't be configurable or filterable at all, as I understand it. What's a sensible default time frame to graph from and to, then? The last 30 minutes, 8 hours, 24 hours, 7 days, a month, or something else? The Metrics dashboard appears to default to the last 8 hours.
I was hoping the specifics of the data that we a) are capable of displaying and b) think are most useful for the WAF user would be based on collaborative input from the team. I created a research issue to learn more about our user and their needs, but until that research is underway I look to our security experts internally (cc @theoretick @whaber @stkerr @twoodham @leipert) and lean on competitive analyses ((1) Statistics Reporting, (2) Rule Management) to see what data other WAFs show. From the Statistics Reporting one, it seems pretty common to show attacks blocked (as a %, a #, or both) and page requests or allowed requests (as a %, a #, or both), but that said, there doesn't seem to be much standardization of this data across tools, so it seems like we can make an educated guess for MVC and collect feedback from there.
The wireframe was merely to display components that we could use and to start the conversation about the specifics we want to show. I know we use echarts and will want to use the area chart for the graph on the bottom but don't know what defaults are associated with them WRT filters and time frames. Taking a look at competitors, it seems like a month could be a good default (which appears to be the default for Sqreen and Demisto). @andyvolpe might have more thoughts?
Now that #33782 (closed) has been merged I've started exploring both elasticsearch and filebeat for our log aggregator. I was originally considering fluentd but there's no good reason to deviate from the direction devopsmonitor has been exploring towards #3711 (closed).
@markrian One note: this is a deviation from the previous direction, since I think it will be better long-term to fetch these from ES rather than prometheus as previously discussed. The number of dependencies is the same, ES is more in line with our strategic direction, and with the combination of !18961 (merged) and gitlab-foss#66362 (moved) we will be able to easily render the raw audit log via pod logs directly with no additional effort.
With the work in !19600 (merged) I've exposed a logging sidecar for pushing the audit log into elasticsearch. Next step will be writing a service to parse and return these results within the rails application. I'll start by defining an API schema we can collaborate on, so frontend work won't be blocked until I figure out how we want to expose this.
For the traffic allowed block, will the green color change based on some percentage of allowed/blocked traffic? At first glance, green makes me think "everything is all good," but not sure if that is the intent here.
One minor comment on the enabled Threat Monitoring page, can we use different language from Number of attacks? I believe we'll want something more conservative such as "Threat count" or "Anomaly count" to capture the potential without overselling confidence
@beckalippert @markrian one oversight I missed: we need an environment toggle for users to choose which environment they wish to view logs for. This can likely reuse the same dropdown we use for the existing metrics page, but I forgot about it until now:
This also begs the question: what should our default date range be? I have no strong opinion on this, so I'm happy to go with whatever interval suits others.
This also begs the question: what should our default date range be? I have no strong opinion on this, so I'm happy to go with whatever interval suits others.
30 days/one month is a good choice here, as Becka mentioned in this comment
Per @stkerr's comment about the user potentially conflating green with "everything is all good", I changed it to blue (which is our "informational" color and works in this context).
Per @theoretick's comment about the incorrect y-axis labeling, I changed it to "Traffic". If we need to get more specific maybe we could include a metric, e.g. "Traffic (Requests/ sec)". Please comment!
@theoretick: Question: Did you mean that the traffic graph needs filters or the pod logs will need filters?
@theoretick: Question: Did you mean that the traffic graph needs filters or the pod logs will need filters?
@beckalippert both; in either case we must scope to an environment. A project can have multiple environments (i.e. staging vs production), so both logs and stats would only make sense in the context of a given one. We can default to production (or the first environment) but we need some kind of basic environment filtering.
@theoretick Do we want the type to be filterable too (see design below), or do we keep the WAF Statistics Reporting header (as the previous design had)?
@andyvolpe Let me know what you think about the alert + "?" icon used here instead (per your Slack comment: "We can utilize a banner or alert as a first run experience with help text and a link to docs then rely on the (?) for the main way to get to docs from this page.").
The alert could go at the top of the page and push the title "Threat Monitoring" below it. We do this in other places, but it's more of a preference thing than a rule. The gray background seems to be competing a bit with the filter area.
The alert style might change slightly when gitlab-ui!832 (merged) gets merged. More or less an FYI here. If the new style is available then I'd transfer your content into the new style.
Page content. Maybe @axil can have a quick review of your copy to see if they have any thoughts / suggestions. Looks good to me but @axil is a pro
Dropdowns, will there be any options in the first dropdown 'Type' for the user to select other than WAF statistics?
Dropdowns, Show last. Do you think the time toggle at the bottom of the chart does this task too? Will the count metrics (Traffic allowed, Traffic blocked, and Total traffic) change when this filter is changed?
What do you think the options in this dropdown could be aside from 30 days?
Dropdowns, Show last. Do you think the time toggle at the bottom of the chart does this task too? Will the count metrics (Traffic allowed, Traffic blocked, and Total traffic) change when this filter is changed?
They have overlapping functionality and when the time-window toggle at the bottom changes the dropdown does as well. This matches the existing Metrics page functionality, so it's an existing pattern FWIW.
What do you think the options in this dropdown could be aside from 30 days?
As a developer working on this, it sure is handy viewing data from the last 30 minutes instead of 30 days for testing. That said, comparing time windows from a deployment today vs. yesterday could be quite useful.
That said, if it makes things at all simpler for the initial implementation I'm more than happy to drop one or both of these filters and simplify things cc @markrian
@theoretick Do we want the type to be filterable too (see design below), or do we keep the WAF Statistics Reporting header (as the previous design had)?
@beckalippert yes, filterable. That was what my previous comment was referring to: I think we'll eventually need to toggle between views such as Network Policy Statistics. Of course that means it's temporarily a dropdown of 1, so short-term roughness for long-term commonality.
@beckalippert Should we add a help question mark to the Type select for now to explain why the dropdown only has one option? It is something we have done way back with the security dashboards:
Thank you @leipert the popover sets expectations much better than clicking on the "?" and having it go to the docs without context! Let's go with that.
Other updates made here, if you could also look over once more and give a thumbs up:
"WAF statistics" now sentence case
Alert copy changed from "The graph below is an overview of traffic coming to your application." to "The graph below is an overview of traffic coming to your application as tracked by the Web Application Firewall (WAF)."
I've changed some of the copy to replace instances of docs with documentation, as I think abbreviations are not preferred.
I'm ok with this pending approval by @axil. There's a button here (on the "Releases" zero state page) that uses the word "Documentation" (but note, for (2) below, that "Documentation" is capitalized).
I've used sentence case in the empty state link button: changed "Learn More" to "Learn more".
Sounds good, but we should probably get the Release page (noted above in (1)) to match for consistency.
I've removed the sentence "This link is also accessible by clicking the '?' icon next to the title below" from the alert, since that "?" is a common UX pattern in GitLab, so users should know that it links to relevant documentation.
I'd like to keep this sentence because "Threat Monitoring" will eventually include other features outside of the WAF, so it won't be intuitive as to where the link would take the user and I'd prefer to set some context first. What do you think @markrian?
Also, @leipert, I now realize I misunderstood your comment, and that you were proposing to add another popover to the "Type" filter. What do you think it might say? Something like, "More features coming soon"? Maybe also a question for @axil. I don't think we need an external link here.
It looks like a blue "?" icon is used if it goes to an external link, and a gray "?" is used if it opens a tooltip on hover. Examples:
blue "?"
gray "?"
found in Security & Compliance > Dependency List. Clicking goes to external link
found in Project overview > Cycle Analytics. Hover opens tooltip.
Therefore, this would be the final product (assuming the "Type" would be a tooltip, no external link):
With a popover for the blue Threat Monitoring icon:
@beckalippert this is really coming along! a few thoughts:
Title
The blue icon works well here since it is meant to emulate a link and will have that action embedded within it. Maybe the popover can allude to the fact that more features are coming soon, as well as provide a link to the documentation.
"At this time, threat monitoring only supports WAF data" View WAF Documentation
Filters
TBH I think we can remove the type filter altogether until we add another feature into this page. The filter area may seem a little empty but our alternatives might add more friction than they intend to remove. By removing the dropdown/filter we won't have to communicate "more features coming soon" and we won't have to disable the dropdown (if we keep it we should disable it since there is nothing else to select).
If you look at the page from 10,000 ft, we are providing the user with a lot of assistance with the banner and title (?) + popover. I think we can lighten up the page by removing the type filter, thus allowing us to remove the (?) next to the type filter.
We can regroup when IDS and/or NetworkPolicies make their MVC debut and focus on where + how to add their data into the page.
Thanks for the feedback, @andyvolpe! Responding to your comments:
The alert could go at the top of the page and push the title "Threat Monitoring" below it. We do this in other places, but it's more of a preference thing than a rule. The gray background seems to be competing a bit with the filter area.
Moved!
The alert style might change slightly when gitlab-ui!832 (merged) gets merged. More or less an FYI here. If the new style is available then I'd transfer your content into the new style.
Looks like it was merged but isn't available in the Sketch library yet. Can FE grab the updated component to use? @markrian
Page content. Maybe @axil can have a quick review of your copy to see if they have any thoughts / suggestions. Looks good to me but @axil is a pro
@axil yes please do make suggestions for the copy in the alert!
Dropdowns, will there be any options in the first dropdown 'Type' for the user to select other than WAF statistics?
It will be the only type until Network Policy Statistics and Intrusion Detection System are ready to be introduced. After collecting feedback from other designers, I've decided to keep the single option in the dropdown rather than the alternative of taking it out, which looks really funny. By keeping it, we a) maintain consistency with the other dropdowns in the row and b) set an expectation that more options are to come. As @theoretick stated, "short-term roughness for long-term commonality".
Dropdowns, Show last. Do you think the time toggle at the bottom of the chart does this task too? Will the count metrics (Traffic allowed, Traffic blocked, and Total traffic) change when this filter is changed?
After our conversation with Sam yesterday and in the Secure & Defend UX meeting this morning, let's remove the time filter at the bottom of the graph. After thinking it through more, I realized that this might interfere with the expectations set by "Show last" and by the stats above the graph; i.e., scenario: I've set "Show last" to 30 days but narrowed down to 1 day in the time filter in the graph, yet my total traffic (gray box) remains the same. Let's do more research on the expected interaction of these 2 filters before implementing, to avoid confusion.
What do you think the options in this dropdown could be aside from 30 days?
@theoretick mentioned that 30 minutes would be useful, which aligns with what I've seen elsewhere, i.e. Sqreen has:
I propose we use those as options, minus the custom (which I assume adds a layer of complexity we don't need for MVC).
@beckalippert @andyvolpe here's my version. My main comment is that we mention WAF without explaining what it is in the first place.
The graph below is an overview of traffic coming to your application as tracked by the
Web Application Firewall (WAF). View the docs for instructions on how to access the WAF
logs to see what type of malicious traffic is trying to access your app. The docs link
is also accessible by clicking the "?" icon below.
PS1. Can you post the transcript in the future? It makes it easier to propose something :)
PS2. @beckalippert there was another thread going on in #14707 (comment 242242573); it would be nice to keep commenting there, as the discussion is now split into two! Something to keep in mind :)
@beckalippert I meant it would be useful to have the text to be used written somewhere in the description! Now I had to read it from the image and propose my changes :)
@stkerr @axil @beckalippert I don't think we have any good place, currently. So I'd assume we either add this to the existing Application Security section or create a new section for Defend and/or threat monitoring.
@stkerr @theoretick Application Security works. We may want to consider adding a Threat Monitoring page in those docs at the same level as Application Security for Defend features, though. This would match the naming convention and groupings that we're working on within the Configuration page (which separates Secure from Defend features, without using those exact titles).
I don't think the Application Security top-level page is a good place for it as written - it is directly speaking to GitLab Secure, which isn't what WAF stats are. Maybe under Security Dashboard, but that still seems like a bit of a stretch.
We may want to consider adding a Threat Monitoring page in those docs at the same level as Application Security for Defend features, though.
This seems like a good approach to me and gives us room to add future security tools.
Since the public environments API returns all environments, some of which aren't relevant for WAF (e.g., stopped ones), the private project_environments_path is probably best for this MVC.
Since the public environments API returns all environments, some of which aren't relevant for WAF (e.g., stopped ones), the private project_environments_path is probably best for this MVC.
We could filter the public list by state (available or stopped) in the response, but it doesn't look like there's a param currently. Since review apps are each individual environments, gitlab-org/gitlab has 10k active ones currently, so let's stick with the private one unless I find time to add a quick filter to the public API.
Any future changes here will likely require significant differences, so I'd leave this off. This effectively is specifying which controller we should be hitting ATM.
start= # start of the period we wanna look at. We will have the dropdown data
Assuming we default to day for the unit, I suppose this would be the start of the timeframe? That doesn't seem static per unit size, however, so I'm not sure how wide we would want to render this data.
tl;dr - if we were to drop the time-frame, all that is currently needed is environment_id. If we keep the timeframe, we can specify a unit and need to decide on the window; i.e. hourly for a week vs hourly for a month.
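To make that concrete, here's a rough sketch of the request the frontend could make under those params (the path and param names are placeholders, not the final contract; only environment_id is settled so far):

```typescript
// Placeholder query; a 30-day window bucketed by day is just an example default.
const params = new URLSearchParams({
  environment_id: '42',
  interval: 'day',
  from: new Date(Date.now() - 30 * 24 * 60 * 60 * 1000).toISOString(),
  to: new Date().toISOString(),
});

fetch(`/namespace/project/-/threat_monitoring/waf_statistics.json?${params}`)
  .then((response) => response.json())
  .then((stats) => console.log(stats));
```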
No environment exists: @markrian and I think this is equivalent to the current empty state design. @theoretick The proposal would be to query on the HAML side whether the project has at least one environment.
If environments exist, but have no data, should there be another empty state?
No environment exists: @markrian and I think this is equivalent to the current empty state design. @theoretick The proposal would be to query on the HAML side whether the project has at least one environment.
that makes sense to me!
If environments exist, but have no data, should there be another empty state?
I would say no, but no strong opinion on the matter.
For (2), yes, there should be a different empty state, because (1) specifically states that the WAF is not enabled.
As far as I can tell, it doesn't look like echarts has any built-in zero states. What would be a reason the user doesn't have any data? Could we surface that reason to them?
As far as I can tell, it doesn't look like echarts has any built-in zero states. What would be a reason the user doesn't have any data? Could we surface that reason to them?
If the deployed application never receives traffic, there will be no data to display. Additionally, if they do not install the Elasticsearch managed app we will not be aggregating that data.
I would say we just render zero-data charts for this use case because it's a bit complex to determine each of these substates. For example, we can determine if there is a deployed environment and ES is installed, but it's quite expensive to determine whether the WAF has been enabled or not; see the related discussion on implementation metrics.
@beckalippert @andyvolpe @markrian so I had a realization last night about one of our bigger oversights here: the configuration of the WAF is contingent on whether or not a user has enabled blocking mode. By default we do not enable it, yet this UI assumes blocking is enabled.
We have discussed whether it's possible to detect whether a user has blocking mode enabled, and we do have some progress with #32358 (closed), but for now I think we need to modify this UI to default to passive logging language, as this screen should be used regardless of whether the ~WAF is in blocking mode or not.
This would mean that rather than Traffic Blocked we should consider alternative wording; any ideas? I've been referring to this as "anomaly statistics", but I'm happy to discuss ideas.
Not sure what you mean. If no traffic is being blocked, there wouldn't be any blocked traffic on the graph, right? And to further drive home that blocking mode is disabled, the stats would show that 100% of traffic is allowed, plus the tooltip over the blocked traffic:
Not sure what you mean. If no traffic is being blocked, there wouldn't be any blocked traffic on the graph, right? And to further drive home that blocking mode is disabled, the stats would show that 100% of traffic is allowed, plus the tooltip over the blocked traffic:
What we are not surfacing by only showing truly blocked traffic is how much traffic has the potential to be blocked. By default that number will be 0 and/or n/a in logging-only mode, so there is no clear indicator of what will happen once blocking mode is enabled.
IMO this screen should show all rule violations, regardless of whether they are blocks or passes. This way we are informing the user about the contents of their application traffic and giving them the tools to determine whether to enable blocking mode.
I think to make this have the base functionality we need to show:
% traffic blocked (definitely malicious - but might be false positives and blocking legitimate traffic)
% traffic allowed (definitely not malicious)
% traffic logged but not blocked (possibly malicious and the security team for the customer may want to investigate)
The %'s will add up to 100%.
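A toy illustration of how those three figures partition the total (the names are illustrative, not the real data model):

```typescript
// Derive the three percentages from raw counts; allowed is whatever remains.
function trafficBreakdown(total: number, blocked: number, loggedOnly: number) {
  const pct = (n: number) => (total === 0 ? 0 : (n / total) * 100);
  const allowed = total - blocked - loggedOnly;
  return {
    blockedPct: pct(blocked),     // flagged and denied
    loggedPct: pct(loggedOnly),   // flagged but not denied
    allowedPct: pct(allowed),     // everything else; the three sum to 100
  };
}
```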
In the future we will want:
A version of the above information for "blocked" and "logged but not blocked", broken out by which WAF rule fired
Average latency (in milliseconds) that the WAF introduced (so customers can determine if the WAF is introducing a performance issue). It would be great as well to show what rules (whether they are firing or not) make up that latency; however, I am not sure if modsecurity logs that or something similar.
So you're saying that in blocking mode, we can show allowed traffic, blocked traffic, and logged traffic? And that in logging-only mode we can show allowed traffic and logged traffic. Is that right?
I think to make this have the base functionality we need to show:
% traffic blocked (definitely malicious - but might be false positives and blocking legitimate traffic)
% traffic allowed (definitely not malicious)
% traffic logged but not blocked (possibly malicious and the security team for the customer may want to investigate)
The %'s will add up to 100%.
That seems the cleanest from a UI perspective, but I'd defer distinguishing between blocking/logging for now as we do not yet have a clean way of evaluating that condition. If we simply called it Traffic Violations we could cover both cases with the existing UI. We should make that distinction eventually with improvements to our data model, but it will increase the scope.
So you're saying that in blocking mode, we can show allowed traffic, blocked traffic, and logged traffic? And that in logging-only mode we can show allowed traffic and logged traffic. Is that right?
Since we use anomaly-scoring mode, requests can be logged matching a rule violation, but their anomaly score may still be lower than the threshold triggering a "block". So we can distinguish between rule violations and rule violations that are great enough to be blocked, but it is currently difficult to evaluate whether a blocking action is carried out
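To illustrate the anomaly-scoring semantics (the threshold and names below are assumptions loosely based on OWASP CRS defaults, not our actual configuration):

```typescript
// Assumed threshold; the CRS inbound anomaly threshold commonly defaults to 5,
// but the actual deployment may differ.
const BLOCKING_THRESHOLD = 5;

function classify(anomalyScore: number, blockingMode: boolean): 'allowed' | 'logged' | 'blocked' {
  if (anomalyScore === 0) return 'allowed';                                 // no rule matched
  if (blockingMode && anomalyScore >= BLOCKING_THRESHOLD) return 'blocked'; // score crossed the threshold
  return 'logged'; // rule violations recorded, but below the threshold or in logging-only mode
}
```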
So Traffic Violations would encompass [blocked] (malicious) and [logged, not blocked] (possibly malicious) in logging mode but only the latter in logging-only mode?
ModSecurity will only have the capability to block when it is blocking mode. (In logging only mode it will never block.)
ModSecurity can choose to block/log, allow/log, or allow/not log when in blocking mode (blocking mode can also decide to just log on some requests based on the rules / other configuration).
So Traffic Violations would encompass [blocked] (malicious) and [logged, not blocked] (possibly malicious) in logging mode but only the latter in logging-only mode?
I'm still struggling with this term since they aren't necessarily violations. Other colorful phrases: Traffic Anomalies (the term modsec itself uses), Traffic Divergence, Traffic Abnormalities, Traffic Irregularities...
But yes, as @whaber pointed out it's more of a spectrum, so assuming we get to distinguishing the two I think it would be better to treat Blocked/Logged-Only as subsets of the total we come up with.
Note: The alert bar has also changed from gray to blue from the last iteration, as @andyvolpe and I discussed that gray is for "tips" and blue is for "info", and the button to the logs here falls into more of the "informational" category.
Just wanted to capture the conversation from Slack last week on why the team considers the latest solution as MVC-worthy:
Becka: @WayneHaber Wondering what your thoughts are on MVCs. Re: the WAF Threat Monitoring page, I'm questioning whether the graph is useful enough for an MVC after the most recent iterations it went through? I understand it's confirming that the WAF is working (monitoring traffic, some potentially malicious). But if I'm the WAF user and I have blocking mode enabled, I can see that it's showing me anomalous requests, but I don't know whether or not those requests are being blocked. At the least, do you think we should consider a note on the page saying something about "If blocking mode is enabled, these requests will be blocked"?
In handbook terms under "Content of an MVC":
A new feature also adds a tax on the user experience due to additional complexity in the product. Make sure that the benefits gained by users from the new feature more than cover the tax incurred, accounting for how frequently the new feature will be used. In particular, the new feature should satisfy this inequality: (benefits * frequency) / (tax * (1-frequency)) > 1
Lucas: IMO Yes, it's still useful.
But if I'm the WAF user and I have blocking mode enabled, I can see that it's showing me anomalous requests, but I don't know whether or not those requests are being blocked.
The utility for me relates to it still being time-series data. So an event 2 days ago (deployment, Black Friday sale, blog post) could cause an uptick in anomalies. It is far easier to track and discover trends with the graph than with a simple percentage.
So, I would consider this a screen that helps surface whether traffic is abnormal rather than how the WAF is handling the traffic. The way the WAF addresses traffic (block or not) can be an enhancement but right now I think it's useful in raising a user's awareness to abnormal traffic patterns.
Wayne: It is truly minimal, but accurate and (somewhat) useful as is.
@matt Wilson @david Any thoughts on this?
Derek: I agree with @theoretick. I think that it is useful to an extent the way that it is, simply for being able to see events over time. I think that visualizations like this when we are talking about data are almost always helpful, even if they provide a minimal amount of usefulness at the moment. Just having the number is less useful than seeing what has happened over time.
Therefore, I opened a UX Debt issue (Improve WAF Statistics Reporting page) to address the functionality that we dropped from the original scope of this issue.
marked the checklist item frontend Align UI to latest designs (alert variant/copy, popover in header) (partially done in !21689 (merged), !21916 (merged)) as completed
Paired with @markrian this morning to sync up the frontend and backend work. Things are working but noticed a couple issues that need to be resolved.
Polling only occurs when data is not yet loaded. This is consistent elsewhere within the UI (i.e. "Metrics" page). We should consider auto-refresh as a future improvement (will open issue)
Mark noticed an auto-interval date histogram that could be useful for more easily determining buckets. This won't currently work for us since we're polling 2 separate responses to coordinate, but it's worth considering in the future.
And for the sake of completeness here's a screenshot of our pairing session showing realtime data displaying within the UI. Note the primary issue to be investigated (point 3 above) is a discrepancy between the datasets of total and anomalous requests:
backend investigate why anomalies are over 100% (seeing inconsistent nominal and anomalous data)
This is due to a failure to scope modsecurity logs to the relevant upstream. Unfortunately, modsecurity's logs do not seem to distinguish upstream servers, which means we aren't properly segmenting requests to multiple environments within a single cluster (or, technically, behind a single ingress-controller). This is a significant blocker and we'll need to investigate workarounds.
With the ongoing complexity of tuning the elasticsearch query (both improving indexing, as discussed in the open service MR, and the outstanding issue of segmenting modsec logs appropriately, see above), I'm switching directions and prioritizing the controller work (!22437 (closed)) over the key service object work (!19789 (closed)). By focusing on getting the skeleton controller merged first, we can isolate the complexity of the query until we solve the remaining issues.
This is due to a failure to scope modsecurity logs to the relevant upstream. Unfortunately, modsecurity's logs do not seem to distinguish upstream servers, which means we aren't properly segmenting requests to multiple environments within a single cluster (or, technically, behind a single ingress-controller). This is a significant blocker and we'll need to investigate workarounds.
@beckalippert @markrian I almost missed this, but it looks like devopsmonitor just updated the default retention period for Elasticsearch to 15 days: !22971 (merged) with #35181 (closed). The rationale isn't captured there, but I imagine this is an effort to cut down on ballooning indices, which could get quite large given the data they are capturing.
Would we be happy to go with 15 days or do we need to reach out and coordinate on an increased retention period?
@akohlbecker we discussed this briefly the other day but just FYI this issue is one for your awareness since we are depending on the elasticsearch logs for querying request aggregations.
@theoretick Honestly I don't think we have any specific research on this to justify making a case for pushing to 30, so I vote for launching with 15 and seeing if we get any specific feedback from customers to increase the time window.
So, what does this mean for the available time windows we have? Should the frontend change the 30 days option to 15? That's a trivial change, and can be done in %12.8.
Otherwise, I don't think #195524 (closed) (re-using the same time windows as in the Monitoring Dashboard) can be fitted in %12.8.
Otherwise, I don't think #195524 (closed) (re-using the same time windows as in the Monitoring Dashboard) can be fitted in %12.8.
I think we have to go with the trivial change. If the time windows don't work for us then that's a decision point we can't overcome unless we do some additional filtering on the common logic to drop 30 days with #195524 (closed)
We also didn't have any specific research to justify 15 days being the default; it is just a number I picked. We can certainly bump it to 30 if that would be helpful!
@beckalippert assigning back. It's preferable that designers stay assigned to issues they've done work for, helps to know who to direct questions to in the future.
I wasn't aware of this practice. I've been unassigning UX team members as issues make it to the build phase. I will stop doing that ... sorry if that created any confusion.
Per the related discussion in !24907 (comment 286589965), the documentation step will need to include a note that this feature is initially not compatible with CI-based managed applications.
@markrian I've been testing !25273 (merged) and I'm seeing a possible bug; would you mind taking a look? When changing the time window, I'm not seeing the graph timeframe reflected correctly, only the Total Request count. Note the 30-day timeframe vs. the x-axis.
Thanks @theoretick! I've not looked into this properly yet (will hopefully find some time tomorrow), but at first glance, that response seems a bit odd:
The from timestamp is 2020-02-21T20:04:57.478Z, which is after the to timestamp 2020-01-22T20:04:57.478Z; I would expect the from to be before the to
The single data point has a timestamp that falls outside of the from and to timestamps anyway: 2020-02-21T00:00:00.000Z
Those issues aside, IIRC the x-axis range is inferred from the available datapoints. Given there's only one datapoint here, it makes sense that the x-axis doesn't reflect the full (though empty) range.
So, I've kind of been able to replicate this. I say kind of, because the odd from/to fields in the response above are a separate issue.
Currently, on the frontend, the x-axis range is inferred from the available datapoints. If there are only some datapoints within the requested range, only that range is displayed. For that matter, if datapoints outside the requested range are returned, the returned range is displayed in full anyway.
Another way to put this: the requested range is completely ignored when rendering the returned datapoints.
The fix is relatively simple, and can be achieved in a couple of ways. The frontend could:
Set the x-axis range to the requested range (sketched below).
If the backend returns datapoints outside of the requested range (as happened above), then those datapoints will not be displayed; but, at least the user will see (roughly) what they asked to see.
Set the x-axis range to the response's from/to fields.
These may not correspond exactly to the requested range, depending on how the backend buckets the data. If so, then the x-axis again won't quite correspond to what the user requested, but simply what the backend returns. However, again, if the from and to fields don't contain some of the datapoints, as in the above comment, those won't be displayed.
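A minimal echarts-flavoured sketch of option 1, assuming a time-type x-axis; the series shape and data here are placeholders, not our actual chart config:

```typescript
import * as echarts from 'echarts';

// Pin the x-axis to the window the user asked for, so a sparse response
// doesn't silently shrink the visible range.
function renderTrafficChart(
  el: HTMLElement,
  requestedFrom: string, // ISO 8601
  requestedTo: string,   // ISO 8601
  anomalous: Array<[string, number]>,
) {
  const chart = echarts.init(el);
  chart.setOption({
    xAxis: { type: 'time', min: requestedFrom, max: requestedTo },
    yAxis: { type: 'value', name: 'Traffic' },
    series: [{ type: 'line', areaStyle: {}, data: anomalous }],
  });
  return chart;
}
```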
Thoughts, @theoretick? I think option 1 might be best, though as it happens 2 is more trivial to implement. In particular, what are your thoughts on the weird from and to fields in the response I highlighted above, and how does that affect this? Was that request generated by the frontend?
Thoughts, @theoretick? I think option 1 might be best, though as it happens 2 is more trivial to implement. In particular, what are your thoughts on the weird from and to fields in the response I highlighted above, and how does that affect this? Was that request generated by the frontend?
@markrian I've been playing with this again and haven't been able to generate the same field issue. That was originally generated by the frontend, but I don't feel like I have enough data to properly test this. I'm setting up a demo instance on staging with some scheduled daily pipelines to host more realistic historical data we can keep an eye on. Aside from fixing the x-axis issue, I'd say we can go ahead and merge the outstanding work and use the staging environment for testing these data point conditions.
The test environment is up for the staging project's production env, with scheduled pipelines triggering normal traffic every 5th and 9th minute, plus an anomalous request every 13 minutes (and however often random bots hit our public endpoint).
The initial work for !25273 (merged) has been merged, so we can run the schedulers for a few days, and once auto-deploy pushes that to staging we can take a look at the chart data.
If the backend returns datapoints outside of the requested range (as happened above), then those datapoints will not be displayed; but, at least the user will see (roughly) what they asked to see.
Our system looks broken and unreliable if we're not showing what they requested. They can narrow down if they only see data for a few days within the requested time range (e.g. if they set the filter to 30 days but there's only data available for 10 of those days).
FYI @andyvolpe is taking over all things WAF design
Now that we have the latest deployed to staging, you can see this behavior more clearly. Here's a sample screenshot where I implemented the above logic 5 days ago; when requesting a 7 day window the chart auto-scopes to the returned data. It doesn't look bad with a fuller dataset, but it would still likely benefit from pinning the x-axis to the requested range (March 4th instead of the 6th).
That said, there's ongoing work on the time windows in !26431 (merged) to make the available ranges the same as those in Monitoring, which contradicts the first point. The second point is still valid, though to the 1 week range rather than 7 days. CCing @zmartins as the author of that MR.
That said, there's ongoing work on the time windows in !26431 (merged) to make the available ranges the same as those in Monitoring, which contradicts the first point. The second point is still valid, though to the 1 week range rather than 7 days. CCing @zmartins as the author of that MR.
happy with DRY-first and we can reconsider the scoping later with an improvement ticket for both charts
@markrian Thanks for bringing !26431 (merged) up. I am hoping that having a single component for time ranges will help us in the long run. In this case we have the potential of getting more features for the threat management dashboard with fewer lines of code.
Just a note here: the devopsmonitor team is looking at an upgrade to Elasticsearch with #209442 (closed) to fix an issue with k8s 1.16 compatibility. According to the changelog there are no breaking changes but this would be worth keeping on our radar and testing once that issue is prioritized
I've moved this into ~"workflow::In review", since I believe the only remaining MR that needs to be merged is !22911 (merged). Feel free to relabel if that's not the case, @theoretick!
Lucas Charles marked the checklist item documentation Add docs for new page, and enable feature by default !22911 (merged), including limitation around CI-based ES installations as completed
Closing out as this has shipped and verified on Staging.
I found a minor bug with review apps which I'll fix in a subsequent issue. Since it only affects certain environments, I'm assigning it a lower severity and we can push this into %12.10 with the WIP MR I have: #211615 (closed).