Users who have enabled the WAF cannot easily see what the WAF is blocking or allowing, or how much traffic it processes. This lack of visibility makes it more difficult to determine how to configure, tune, and evaluate the WAF.
Intended users
Users will view this after initially creating a cluster and installing the WAF, to confirm they are seeing traffic
Security team members will view this to see what the distribution of blocked vs. allowed traffic is
Further details
Reporting statistics and information about the WAF's behavior could be a very deep experience if we invested a lot of time in it. For this iteration, I'd like us to view it as an MVC and find a way to provide visibility with a minimal amount of product changes. We can then get feedback on what is useful or missing and build out a deeper experience in future iterations.
Proposal
Minimal
Display to users who have the WAF installed how much traffic the WAF:
Identifies as anomalous
Processes in total
Default time frame is 30 days.
Proposal that we put this in the security tab as a set of text boxes, similar to how we have # of vulns in the security dashboard. Would like input from others on this.
Next steps
Table with detailed event listings
View raw logs themselves in the pod logs
Permissions and Security
Users should be required to have the same permissions to view this as they need for the security dashboard.
What does success look like, and how can we measure that?
Of all users who have the WAF installed, at least 75% view the provided information at least once within 30 days.
Of all users who have the WAF installed, at least 90% view the provided information within 90 days.
This demonstrates that users who install the WAF are actually looking at the results, not just installing it and ignoring it.
Documentation
Explain how to access the statistics.
Explain what the statistics are showing.
Explain what different results could be indicating.
@stkerr Thanks for creating this issue a few milestones out so we have plenty of time to consider design and research tasks!
@andyvolpe and I created a few JTBDs, and your feedback would be most valuable:
MVC:
When I activate my WAF for the first time, I want an indication that it is working, so that I can feel confident I set it up successfully.
When I am reviewing blocked/allowed traffic, I want to see this information visualized in real time, so that I can take action if I see an anomaly.
Post-MVC:
When I am reviewing blocked/allowed traffic, I want to be able to filter my data, so that I can focus and take action on anomalies.
When I am reviewing blocked/allowed traffic, I want to see historical info, so that I can spot trends and plan ahead for defensive action.
Re:
Proposal that we put this in the security tab as a set of text boxes, similar to how we have # of vulns in the security dashboard. Would like input from others on this.
Can you clarify what you mean? Are you thinking of it becoming an item under the Security & Compliance menu at the same level as the Security Dashboard and Dependency List?
Or should it say something else; e.g. "WAF Statistics Reporting"?
Current assumptions:
We will need to communicate the path to the WAF Statistics after the user creates a cluster and installs the WAF (on the Configuration page); e.g. a button appears after installation is complete that says something along the lines of "View WAF Statistics".
For MVC, we won't have any control over the visual branding of the graph and will not be adding functionality (e.g. filters) on it. In other words, the graph will be pulled from a 3rd party (ModSecurity) and aside from general framework of the page (page title, padding) there won't be anything to design or configure (at least not until a future iteration, in which case another issue can look into added features and functionality).
Or should it say something else; e.g. "WAF Statistics Reporting"?
Will this screen grow to eventually house more than WAF stats? What do you think of renaming it as "WAF"/"Web Application Firewall"/something more generic so if we want to add things like configuration or WAF-specific reports, they could live there?
We will need to communicate the path to the WAF Statistics after the user creates a cluster and installs the WAF (on the Configuration page); e.g. a button appears after installation is complete that says something along the lines of "View WAF Statistics".
Correct.
For MVC, we won't have any control over the visual branding of the graph and will not be adding functionality (e.g. filters) on it. In other words, the graph will be pulled from a 3rd party (ModSecurity) and aside from general framework of the page (page title, padding) there won't be anything to design or configure (at least not until a future iteration, in which case another issue can look into added features and functionality).
To add some nuance here, ModSecurity is text-only output - there is no graph (that I'm aware of). We'll need to parse the log files it is producing, sum up the results, and display that ourselves. So I think we'll need design in the sense of understanding how we want to present the information, but agree with you that we don't need filtering and sorting for this iteration.
@beckalippert I think we could generalize the term based on what the WAF is reporting. Maybe something like Threat monitoring, Attack monitoring, Threat logs, or something along those lines.
@beckalippert @andyvolpe @stkerr I'm concerned about us over-specifying here and wanted to get your thoughts on a more generic approach to this.
It looks like the current direction is built around a custom dashboard for WAF reporting/stats. I'm seeing a number of similar ideas across our different stages but this could lead to a lot of UI debt rather than a more unified approach.
If the basic problem is that we need to aggregate data from the WAF (and future orchestration tools) and display it in some way, it might make sense to coordinate with Category:Logging around their work in #3711 (closed) and #30729 (closed). Elasticsearch and/or fluentd are good candidates for an aggregation tool, and we could leverage an existing tool like Kibana on top of that to visualize the data.
To add some nuance here, ModSecurity is text-only output - there is no graph (that I'm aware of). We'll need to parse the log files it is producing, sum up the results, and display that ourselves. So I think we'll need design in the sense of understanding how we want to present the information, but agree with you that we don't need filtering and sorting for this iteration.
So the minimal MVC would be log availability. This would either result in something very close to duplicating the current functionality of Kubernetes Pod Logs (plus the enhancement in #3711 (closed)) or a manual process of parsing the log file to generate graphs (a basic feature of a visualization tool such as Kibana built on top of Elasticsearch).
So, I guess my basic question is: what do we gain by building a separate UI here and how does this approach fit into similar initiatives with our Category:Logging friends?
So, I guess my basic question is: what do we gain by building a separate UI here and how does this approach fit into similar initiatives with our Category:Logging friends?
@theoretick there are a few assumptions we have that are guiding our approach. Note these aren't validated yet as research is ongoing.
Competitors have a dedicated UI for metrics specific to logging and monitoring for security events (not always a good reason but we have to acknowledge that this is reflected in the industry)
Users who are viewing the WAF output are different than those who are viewing other metrics and have different needs/goals/expectations
We want to build for the eventuality of including the log data in a list format to accompany the metrics and statistics #13555 (closed)
Direction-wise, I'm not sure if we want to link to an external service like Kibana, but I don't want to speak for business decisions; it's just something I'm unsure of.
So the minimal MVC would be log availability. This would either result in something very close to duplicating the current functionality of Kubernetes Pod Logs (plus the enhancement in #3711 (closed)) or a manual process of parsing the log file to generate graphs (a basic feature of a visualization tool such as Kibana built on top of Elasticsearch).
This issue is targeting a slightly different value point than what more usable log availability would address. Users can already view the raw logs today after going through some additional steps, but what this issue is trying to do is help them more easily understand whether the WAF is turned on, whether it is seeing traffic, and how much traffic, if any, it is blocking.
This same info could be deduced by a user processing the logs themselves, but doing this for them is really the key point this issue addresses, rather than making the logs themselves more readily visible.
We do need to make the logs more readily visible though - not diminishing the need for that at all.
It looks like the current direction is built around a custom dashboard for WAF reporting/stats. I'm seeing a number of similar ideas across our different stages but this could lead to a lot of UI debt rather than a more unified approach.
It's a fair point and a good observation.
I feel fairly confident that all the various security capabilities & data sources we collect from will end up in a centralized "Security" location together, since we want to provide a curated security-specific experience, regardless of where the information comes from in terms of underlying technology (e.g. combining pod logs and a SAST log in one place). We should be mindful as to how this will grow over time to not end up creating more work for ourselves in the future or duplicating efforts multiple times.
So, I guess my basic question is: what do we gain by building a separate UI here and how does this approach fit into similar initiatives with our Category:Logging friends?
Building it in this way is the minimal step we can use to deliver new value and gather additional feedback for future iteration. It's possible we could end up going in slightly different directions, based on what other groups are doing in a similar problem space.
Competitors have a dedicated UI for metrics specific to logging and monitoring for security events (not always a good reason but we have to acknowledge that this is reflected in the industry)
@andyvolpe which competitors? If we are considering AWS and Azure as similar enablement platforms, they both use a unified logging system:
Users who are viewing the WAF output are different than those who are viewing other metrics and have different needs/goals/expectations
Potentially, but advocating for a "shift-left" approach, this should have some overlap with the same developers/ops people deploying apps. If we enable blocking mode, I (as an application deployer) want to see why requests are being blocked.
We want to build for the eventuality of including the log data in a list format to accompany the metrics and statistics #13555 (closed)
...
Building it in this way is the minimal step we can use to deliver new value and gather additional feedback for future iteration. It's possible we could end up going in slightly different directions, based on what other groups are doing in a similar problem space.
Thoughts?
@stkerr Currently there's no collective understanding of what we're using to build these charts. The most obvious approach I can think of is some log exporter, like fluentd/elasticsearch feeding these charts.
Monitor is already looking at adding Elasticsearch (#30729 (closed)) as part of the logging vision. However, Serverless just had a similar proposal which recently changed direction in order to align with Monitor as well. To quote @kencjohnston from this issue:
I've got some concern about this approach.
It appears to be at cross purposes with our broader logging and apm work (#30729 (closed))
It breaks our pattern of avoiding popping out to other UIs and breaking the single application value.
Please consider another approach.
So I just want to ensure that we are building something "Highly aligned, but loosely coupled", whereas this currently only feels "loosely coupled". If we are looking at a way of bringing cluster-based data logging into our UI, it could result in a lot of redundancy and duplicate effort from both a product and implementation perspective.
Perhaps this doesn't change the UI, but it very well might. If we bring Monitor into the conversation, this should give us time to figure out what the logging vision is and whether we're staying aligned with that vision or not.
@theoretick @twoodham to re-state some of our voice conversation from earlier today:
@stkerr Currently there's no collective understanding of what we're using to build these charts. The most obvious approach I can think of is some log exporter, like fluentd/elasticsearch feeding these charts.
We should build on top of what other teams have built already, but if there isn't something, we shouldn't block ourselves and wait.
We could either pick up the issue from their group to work ourselves or implement a different "boring" solution. In the future, when there is a definite direction & mechanism for all groups to use, we could then convert over. We need to make sure the implementation we do for this iteration is a two-way door approach in case changes are needed later.
One thought we discussed was grep'ing the WAF logs & counting the number of messages blocked that way.
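To make that concrete, here's a rough, hypothetical sketch of that "boring" counting approach. The log path, the transaction boundary pattern, and the "Access denied" marker are all assumptions that depend on how ModSecurity and its audit logging are configured, so treat this as illustration rather than the implementation:

```python
# Rough sketch only: approximate WAF counts by scanning a serial-format
# ModSecurity audit log. The path, the "-A--" boundary marker, and the
# "Access denied" message are assumptions about the local configuration.
import re

AUDIT_LOG = "/var/log/modsec/modsec_audit.log"  # assumed location

def summarize(path=AUDIT_LOG):
    total = blocked = 0
    boundary = re.compile(r"^--[A-Za-z0-9]+-A--")  # start of one transaction's entry
    with open(path, errors="replace") as log:
        for line in log:
            if boundary.match(line):
                total += 1      # one "-A--" header per audited transaction
            elif "Access denied" in line:
                blocked += 1    # denial message, only present in blocking mode
    return {"total": total, "blocked": blocked, "allowed": total - blocked}

if __name__ == "__main__":
    print(summarize())
```

Note that depending on how audit logging is configured (e.g. ModSecurity's SecAuditEngine set to RelevantOnly), the audit log may not contain every request, so a count like this could understate total traffic.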
If we bring Monitor into the conversation this should give us time to figure out what the logging vision is and whether we're staying aligned with that vision or not.
Great point. Adding @dhershkovitch here for his thoughts and if there are longer-term logging vision items we should align to in the short term.
We had a call last week with @dhershkovitch on a path forward to stay aligned with devopsmonitor's vision for unified logging here.
Eventually we should plan on backing our interface with modsec logs stored within Elasticsearch, a deployment and process that's being actively worked on in the upcoming releases by groupapm. This should provide a single approach to log aggregation and storage on GitLab.
Since we don't want to stay blocked in the meantime our proposed way forward would be to look at leveraging the existing prometheus integration for data aggregation and storage. This should work in a similar manner to our existing integration with GitLab for stat aggregation based on the existing Prometheus GitLab Managed App and Prometheus::ProxyService.
The only backend component needed will then be pushing the logs off our ingress controller by exposing an exporter to prometheus. We will need this exporter to parse the existing modsec_audit.log or look further into the remote logging capabilities of modsecurity (there are several, which can be explored within #32459 (closed)).
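To make the exporter idea a bit more concrete, here is a minimal, hypothetical sketch using the Python prometheus_client library. The metric names, scrape port, log path, and the "Access denied" marker are assumptions, not an agreed-upon design:

```python
# Hypothetical sketch: tail modsec_audit.log and expose counters for Prometheus
# to scrape. Metric names, port, log path, and the "Access denied" marker are
# assumptions, not the agreed-upon implementation.
import time
from prometheus_client import Counter, start_http_server

REQUESTS = Counter("waf_requests_total", "Audited transactions seen by the WAF")
BLOCKED = Counter("waf_blocked_requests_total", "Transactions denied by the WAF")

def follow(path):
    """Yield lines appended to the file, like `tail -f` (rotation not handled here)."""
    with open(path, errors="replace") as log:
        log.seek(0, 2)  # start at the end of the file
        while True:
            line = log.readline()
            if not line:
                time.sleep(1)
                continue
            yield line

if __name__ == "__main__":
    start_http_server(9145)  # arbitrary scrape port
    for line in follow("/var/log/modsec/modsec_audit.log"):
        if line.startswith("--") and line.rstrip().endswith("-A--"):
            REQUESTS.inc()
        elif "Access denied" in line:
            BLOCKED.inc()
```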
To ensure we can make a seamless transition in the future to ES, we can build our exporter using some basic log broker; i.e. a logstash/fluentd/fluentbit sidecar which we can later retarget away from the prometheus instance and towards ES.
For the frontend we should be able to render any charts using either the existing echarts integration or the upcoming Grafana JSON embedded rendering (#31376 (closed)) if a user has an active grafana instance.
The other item that came up and needs further consideration is how we want to handle data-loss during the eventual transition to an Elasticsearch-backed system. There were a couple of points raised here:
If we configure our initial time window for stat rendering to correlate with our server log rotation, then we should be able to easily re-scan the active modsec log when swapping primary data stores, and the user should experience no data loss as we preserve the same timeframe retroactively. This rotation window may be small, however, so it will require further research.
From a usage perspective, going with a similar UI to the existing monitor page (or embedding within) initially would allow us to later swap to a dedicated standalone UI. This could be considered a future feature release and therefore make "data loss" acceptable since it's not technically the same feature.
Ability to show not only blocked but also logged actions by the WAF (so it will show useful data for users not in blocking mode)
Ability to graph blocked and/or logged traffic on a per WAF rule basis (so that users can identify both what is being attacked most and also what may be causing false positives)
Ability to show latency introduced by the WAF (in general, WAFs will introduce latency, and this can be significant on a per-app / per-rule basis).
@stkerr While this is scheduled for %12.5, there are some base changes we might want to make to the logging configuration to better enable our exporter, primarily changing the base logging format from Native to JSON logging. Since our existing documentation only provides directions for tailing the log file, we can probably change the format without a deprecation warning, but if you think that's necessary we could consider adding one within %12.4 in anticipation of this feature set.
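For context on why the JSON format helps: it lets the exporter parse entries structurally instead of with regexes. A hedged sketch follows; the exact key layout of ModSecurity's JSON audit log varies by version, so the field names here are assumptions:

```python
# Sketch only: parse one JSON-format audit log entry. The "transaction" /
# "audit_data" / "messages" key path is an assumption and depends on the
# ModSecurity version producing the log.
import json

def parse_entry(raw_line):
    entry = json.loads(raw_line)
    transaction = entry.get("transaction", {})
    messages = entry.get("audit_data", {}).get("messages", [])  # assumed key path
    return {
        "time": transaction.get("time"),
        "blocked": any("Access denied" in m for m in messages),
    }
```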
@theoretick thanks for pointing it out. I don't think we need a deprecation notice for this in advance, as long as we document it properly in 12.5 as part of this issue.
It sounds like we're all on board moving forward with a dedicated page for Threat Monitoring (title open for discussion in case anyone would like to propose an alternative), which, as discussed with Sam earlier in this issue, would live under Security & Compliance > Threat Monitoring. Please let me know if I've interpreted the above discussion erroneously.
WIP wireframes for MVC look something along these lines:
This covers the need to display to users how many times the WAF has:
Allowed traffic
Blocked traffic
Total amount of traffic.
In addition, we'll be providing the instructions for viewing the WAF logs. (This part is the WIP and variations have been included in #13555 (closed).)
@beckalippert My recommendation is to let all on the issue know the copy is ready for review, but also to explicitly add @axil (our Defend Technical Writing counterpart) and ask for a review.
@theoretick Unless I've misunderstood, it sounds like the backend will expose one proxied Prometheus endpoint, which can be queried for the graph data. The actual Prometheus query itself needs to also be provided by the backend by some means. Perhaps the simplest approach this iteration is to just provide both in the rails template via data attribute(s).
What about the single statistics? Do you expect they can be fetched via the same endpoint, just with a different query, or will they be exposed via a different endpoint? Or could these just be the most recent data points from the graph data?
We should be able to expose all stats through the existing prometheus proxy endpoint. Since the "total requests" will just be the existing dataset returned on the Metrics page we can reuse the same request; i.e.
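As a purely illustrative example (assuming the Metrics page is charting the NGINX ingress controller's nginx_ingress_controller_requests series, and hitting Prometheus's standard query_range API directly rather than the eventual proxied path):

```python
# Illustrative only: pull a "total requests" time series for the last 30 days from
# Prometheus's query_range API. The metric name and endpoint are assumptions.
import time
import requests

PROMETHEUS = "http://prometheus.example.com"  # or the proxied endpoint the backend exposes
QUERY = "sum(rate(nginx_ingress_controller_requests[5m]))"  # assumed ingress metric

now = time.time()
resp = requests.get(
    f"{PROMETHEUS}/api/v1/query_range",
    params={
        "query": QUERY,
        "start": now - 30 * 24 * 3600,  # 30-day default window from the proposal
        "end": now,
        "step": "1h",
    },
)
resp.raise_for_status()
series = resp.json()["data"]["result"]  # [{"metric": {...}, "values": [[ts, value], ...]}]
```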
I'll still need to figure out how we're getting the data for blocked requests into prometheus, at which point I can update this issue with the appropriate query params, but populating allowed responses using the aforementioned query would be a good start.
It seems it will be more straightforward to expose the modsec stats as a separate endpoint; see the updated description and WIP MR !19789 (closed).
Obviously, splitting queries between the two services isn't ideal; I'll see if we can make this a bit cleaner by proxying the prometheus metrics on the backend and returning them as a single payload.
@markrian can you provide some insight/preference in how we want to expose this? It seems like we have 2 really extreme approaches to stat payloads.
The vulnerabilities/summary style is basically: { "high": 12, "critical": 0, ...} and I think a hardcoded time range.
On the other hand, Operations / Metrics is proxying the full query. From your earlier comment, "The actual Prometheus query itself needs to also be provided by the backend by some means", I guess we do provide this for metrics? I couldn't actually find where this is done.
I think I'd push for something in the middle; an endpoint providing both aggregates and time-series that's queryable to proxy requests; something like:
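(Purely hypothetical shape for illustration; none of these field names are agreed upon:)

```python
# Hypothetical response shape only; field names and structure are not agreed upon.
{
    "interval": "30d",
    "totals": {"total": 12345, "anomalous": 67},
    "history": {
        "total": [["2019-11-01T00:00:00Z", 400], ["2019-11-02T00:00:00Z", 395]],
        "anomalous": [["2019-11-01T00:00:00Z", 2], ["2019-11-02T00:00:00Z", 5]],
    },
}
```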
But I'd prefer we stick to an existing approach if possible and I'm curious what's reusable on the frontend. What are your thoughts on keeping this easy to build out vs maintain?
Here's the current WIP raw output from elasticsearch that I'll keep updating as I normalize it. I'm wondering if I should just shim an adapter to make the queries a bit more prometheus-y, assuming that code is reusable. If we're just injecting the actual query somewhere, I suppose it doesn't matter.
@lkerr CC'ing you for visibility on my last comment, as well.
tl;dr - we're approaching a point where we need more consensus on negotiating the API contract. I'd like some feedback on what ~"technical debt" is already there prior to writing up a swagger doc.
How far are we from getting designs from the wireframe?
I have some questions/thoughts about the wireframe, as it is now.
Single statistics
Are we planning to display percentages in general, or counts? If percentages, then traffic allowed is by definition 100% - % traffic blocked. Does it make sense to display both, in that case?
What would "Gross Traffic" represent? Presumably this is the "Total amount of traffic" in the description, but I'm not clear on how that's quantified as a single statistic. Number of requests per unit of time, perhaps?
Chart
What are we hoping to convey with the chart? Proportion of blocked requests over time compared to allowed requests?
If we're displaying these as percentages, then they would by definition sum to 100%, so it's perhaps worth showing only % blocked. If, however, we're displaying counts (i.e., number of requests?) per unit time interval, should this be a stacked bar chart, or something else?
For the first iteration, the chart won't be configurable or filterable at all, as I understand it. What's a sensible default time frame to graph from and to, then? The last 30 minutes, 8 hours, 24 hours, 7 days, a month, or something else? The Metrics dashboard appears to default to the last 8 hours.
I was hoping the specifics of the data that we a) are capable of displaying and b) think are most useful for the WAF user would be based on collaborative input from the team. I created a research issue to learn more about our user and their needs, but until that research is underway I look to our security experts internally (cc @theoretick @whaber @stkerr @twoodham @leipert) and lean on competitive analyses (1. Statistics Reporting, 2. Rule Management) to see what data other WAFs show. From the Statistics Reporting one, it seems pretty common to show attacks blocked (either as a % or #, or both) and page requests or allowed requests (either as a % or #, or both), but that said, there doesn't seem to be much standardization of this data across tools, so it seems like we can make an educated guess for MVC and collect feedback from there.
The wireframe was merely to display components that we could use and to start the conversation about the specifics we want to show. I know we use echarts and will want to use the area chart for the graph on the bottom but don't know what defaults are associated with them WRT filters and time frames. Taking a look at competitors, it seems like a month could be a good default (which appears to be the default for Sqreen and Demisto). @andyvolpe might have more thoughts?