The New Members Chart will likely require us to start by implementing a new Generic Metrics API endpoint for storing data. We will track discussion related to persistence in this new issue.
Another piece of required work is on the BE: figuring out how to fetch the data and get it into that API. Since we started that discussion below, we can keep it here.
We will also need to enhance the Report Pages configuration and data-fetching capability to support the new Generic Metrics API endpoint so the chart can read and display the data. I have opened Report Pages Fetches Generic Data to track that discussion and work.
The most important acceptance criterion for this story is that we demonstrate that it is generic. We can show that by posting arbitrary data associated with a slug to the new endpoint, updating a config file, and displaying it by navigating to the report page URL with the new slug.
Implementation issues
Back End
Figure out how to fetch new members data efficiently (this issue)
@npost Please see the acceptance criteria that came out of further conversations with you, @djensen, and @wortschi. When you've done your magic, please ping @wortschi and @djensen for estimation.
@djensen suggested that we simplify the breadcrumb to just Back so that we don't have to pass parameters into the page to handle that in the first iteration. I have captured that in point 9 above. Can we remove the more robust breadcrumb from the image for clarity?
The back link will display only Back and will fire the browser back button, so no context needs to be passed to the page. We will revisit a richer breadcrumb later.
One problem I can think of when using history.back() is that it conflicts with any query parameters. So when you are already on the generic page and apply any filters (which could be added to the URL as query params to support deep links), history.back() would remove the params as opposed to redirecting to the previously visited page. So we might need to persist the root page (which the user came from) either in a cookie or localStorage.
This thread discusses clarifications and requirements for the Data Set Definition that the Dev team asked for during planning. Please respond with any further questions. Note that the intent of these clarifications is to create confidence so that we can move ahead.
As we said in point two above, the engineering team has room to maneuver here. The key acceptance criterion is number eleven in the description above.
Clarification RE: Acceptance Criteria Three (Above)
@djensen @wortschi One of the questions you asked during our planning meeting was around whether our approach to the Data Set Definition was to be query-parameter based or more like a YAML configuration file such as the one used by Insights. @wortschi pointed out that these seem to favor different use cases--for example, query params would be good for supporting navigation from many places in the app, while the YAML configuration would be good for user customization.
My answer to this is that in acceptance criterion three above I pointed out that for the scope of this story we can assume that this is a system-provided Data Set Definition which cannot be changed by the user.
So one approach would be to say that we are going to go with a YAML approach, but that the YAML files are going to be stored in a system configuration directory inaccessible to users. A segment of the URL allows us to look up the applicable Data Set Definition in this configuration directory. In a future iteration the same YAML file format could be used but stored in the .gitlab directory as with Insights. This approach would also allow us to put things in this file for features that are baked into the product but which aren't ready to expose to users.
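To make that lookup concrete, here is a minimal sketch. Every path and name below is hypothetical, purely to illustrate the idea of resolving a URL segment to a system-owned file:

```ruby
# Hypothetical lookup of a system-provided Data Set Definition by URL slug.
# Files live in a config directory that end users cannot modify.
require 'yaml'

DSD_CONFIG_DIR = Rails.root.join('config', 'data_set_definitions')

def find_data_set_definition(slug)
  # Sanitize the slug so a crafted URL segment cannot escape the directory.
  safe_slug = slug.to_s.gsub(/[^a-z0-9\-_]/, '')
  path = DSD_CONFIG_DIR.join("#{safe_slug}.yml")
  return nil unless File.exist?(path)

  YAML.safe_load(File.read(path))
end

# A report page route would then call something like:
# definition = find_data_set_definition(params[:slug])
```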
YAML is not a requirement. My intent above is just to illustrate that the scope of the problem we are solving in this issue (cf. point 3 above) is system-defined Data Set Definitions--not yet user-defined nor user-customizable.
My instinct is that some document, whether YAML, JSON, etc., stored on the file system or in the database, that describes the Data Set Definition and is identified by a URL segment will be the most versatile and easiest approach to start with.
Eventually we may decide that we want to allow parts of the Data Set Definition to be overridden via query params, but that seems out of scope for now.
The only downside to this approach that I am aware of is that it does depend on something being stored ahead of time, whether in the database, filesystem, etc., and so limits how dynamic interactions could be around entirely new Data Set Definitions. That is a limitation I am very happy to have for now, as there are also benefits that come from having a pre-defined catalog of these Data Set Definitions, whether defined by users or the system.
I earlier expressed hesitation to @djensen regarding using raw SQL or SQL snippets, given sharp edges we may expose to users in terms of security and performance / system stability. Given that we are only talking about system-defined measures in the scope of this story, those remarks should not be interpreted as limiting our options. Raw SQL or SQL snippets are permissible.
What must a Data Set Definition do in the long term?
The fundamental idea of the Data Set Definition is to include one or more data series with hints or options about how they can be visualized to answer a specific question or help solve a particular problem.
It provides a catalog of existing metrics available to the user.
It provides a mechanism for abstracting the display of data so that front-end code doesn't need to be written when a new measure is required.
Computation & Aggregation
The Data Set Definition may or may not be the source of truth about the computation or aggregation of this data. It may specify a query in SQL or a DSL or both that handles the aggregation or computation, or it may simply point to data which has been precomputed by some other means. The point being, we don't have to solve all computation or aggregation problems with a query of some kind in the Data Set Definition.
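Purely as an illustration of those two shapes (all field names here are invented, not a proposal for the actual format):

```ruby
# 1. The DSD owns the aggregation: it carries a query of some kind.
query_backed = {
  id: 'new-members-count',
  source: { query: 'SELECT created_at::date, COUNT(*) FROM members GROUP BY 1' }
}

# 2. The DSD only points at data precomputed elsewhere
#    (a background job, or data pushed in via an API call).
precomputed = {
  id: 'new-members-count',
  source: { data_points: 'new-members-count' } # key into a data points store
}
```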
Dashboard and user preferences
The Data Set Definition does NOT store information about the layout of dashboards, selected options for the chart, or preferences about the display beyond constraining choices to those that are reasonable defaults or meaningful possibilities. Dashboard layouts and user preferences around displays are the job of a separate mechanism.
No such mechanism is required for this story as scoped.
My instinct is that some document, whether YAML, JSON, etc., stored on the file system or in the database, that describes the Data Set Definition and is identified by a URL segment will be the most versatile and easiest approach to start with.
@jshackelford: I think we can be less opinionated about implementation and just lean in on the state of the world that we will find ourselves in once this is solved. All of the above seems to be motivated by:
Performance. We want the views we present to be fast. If a user browses to a URL, they should be presented with information quickly.
Flexibility. We anticipate lots of different reporting needs, so we don't want to reinvent the wheel every time. The ideal implementation here is agnostic to the specific data that we're trying to visualize.
I personally don't care how we handle this as long as those needs are met. What may help engineering iterate toward a global optimum is a few bullet points on the end result, if we had a magic wand:
Vision:
We anticipate users wanting a wide variety of reports.
We want them to visualize these reports in GitLab. This keeps them in the application and lets us integrate these insights into other areas of the product.
We also anticipate users sharing these reports with each other, so they should be able to share them easily. One way may be sending a URL to another person. That person should see the same data that I do.
A report should be able to incorporate:
One or multiple charts
Data filtering
Custom grouping
Scoping to multiple projects and groups
This information would be available in real-time (not pre-computed).
Rendering the page should take less than 1 second per chart.
@djensen @wortschi: What do you think? Would it be helpful if we focused on an approach like the above?
I'd love to get some back-and-forth in this issue on how we approach the engineering work. Do you have what you need to propose an implementation plan?
@jeremy Thanks for contributing! The MVC you have proposed is what we have in mind.
I think we can be less opinionated about implementation and just lean in on the state of the world that we will find ourselves in once this is solved
Agree. I was writing in response to a specific question raised in planning where Dev seemed to be looking for clarification.
RE: Vision, I agree with all but one point: multiple charts. Multiple charts will be handled by a dashboard which sits on top of the report pages. We will cover that in another epic later. The report pages themselves provide not only the chart but a table with the underlying data for further exploration. The data table is not in scope for this MVC.
@jeremy Thanks for sharing your thoughts. The scope of the MVC is clear to me and makes sense from a FE perspective.
One of the questions that was brought up in a recent discussion was about the potential scenarios (like multiple series, different chart types) that we wanted to cover in upcoming iterations. Understanding those requirements would help to build a solid foundation (especially from a backend perspective), which I believe should be one of the key features of this MVC.
@pshutsin @ahegyi @m_frankiewicz this issue is intended to be the first step toward a generic Analytics reports page. The chart in this MVC is relatively simple, but there is a twist: this needs to be based on some kind of generic approach that could work for other data sets like issues, merge_requests, todos, users, etc. There have been some ideas about formal data set definitions (like insights.yml, or something stored in the database), passing query parameters, etc. It would be great to learn your thoughts and suggestions!
I like the second example given by Adam, but wonder if GraphQL would be too powerful a tool for the task. Therefore I would lean more toward having a YAML file for the Data Set Definition.
Something like
```yaml
id: members-count-by-week
time_interval:
  unit: year  # or month, day, etc.
  number: 1
group_by: week  # or month, day, etc.
time_serieses:
  - source: group-members  # this can be one of the "existing" sources of time series data
    aggregation: count
```
Here the user could choose only from a given set of parameters.
Having the time series as an array would enable a graph to use more than one time series at a time (if that's desired).
As far as I understood, a Data Set Definition can be prebuilt by us or created by users (maybe in future iterations). So there can be cases where aggregation/filtering logic is stored on our side, and cases where it's stored on the user's side. So I'd not expect any kind of aggregation logic to be defined in YAML.
General flow picture as I see it:
we have Data Set Definitions defined as Ruby classes, exposing their metadata to users as "available data sets" in a separate API endpoint. Those classes can also contain the actual logic for how to aggregate data and save it as data points (see the sketch after this list).
DSD classes are invoked in the background to refresh data nightly.
The generic metrics chart page looks up the DSD and applies generic filtering to retrieve a data set.
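A minimal sketch of what such a DSD class could look like, under that flow. All class and method names here are invented for illustration, not actual GitLab code:

```ruby
# Hypothetical base shape for a Data Set Definition implemented as a Ruby class.
class DataSetDefinition
  def self.metadata
    { id: id, title: title, chart_type: chart_type }
  end
end

class NewMembersCountDefinition < DataSetDefinition
  def self.id
    'new-members-count'
  end

  def self.title
    'New members per week'
  end

  def self.chart_type
    'column'
  end

  # Called from a nightly background job to refresh stored data points.
  def self.refresh!
    Member.where(created_at: Date.yesterday.all_day)
          .group(:source_id)
          .count
          .each do |group_id, count|
      # persist (id, group_id, Date.yesterday, count) as a data point here
    end
  end
end

# An "available data sets" endpoint could then simply return
# [NewMembersCountDefinition, ...].map(&:metadata).
```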
Open questions:
So far the structure of Data Points is not clear. How will we store different primitives, and where? According to &3105 (closed) a data point can be an integer, 2 integers, multiple key-value pairs, or a float.
Who will define the filtering options for a given Data Set Definition, and where? Will they vary from DSD to DSD? I guess they will all be limited by our column structure for data points.
According to the design issue, the same metric can be aggregated per day/week/month. Assuming we store data aggregated per day, we'll have to either save 3 aggregation levels or perform aggregation on the fly for week and month.
For simplicity of the MVC I'd recommend limiting the "generic graph page" to one graph only, and also excluding user-created DSDs and user-written data.
According to the design issue, the same metric can be aggregated per day/week/month. Assuming we store data aggregated per day, we'll have to either save 3 aggregation levels or perform aggregation on the fly for week and month.
Several aggregation levels/tables (day, week, month) are common in data warehousing. Scheduling and executing aggregations in a performant manner will be challenging (GL.com).
Concern:
I'm not sure how we can enforce permissions and visibility rules on aggregated data. For specific aggregations we might not need to (AVG, median); however, enforcing permissions on COUNT is a must. We had several security issues related to leaky issue/MR counts.
@ahegyi I'd be interested in hearing more about the scenarios you have in mind where security of aggregated data was a problem.
My current thinking is that any stored data set can be tied to a group or project, and those who have access to that group or project can see that data set. This MVC is all about new activity in group-level counts, and I don't think the numbers should change based on what projects a user can or cannot see in the group. We may have to do some filtering in the data table below it when we get there and display a message like "three items not shown due to user's permissions", but we don't have to solve that problem now.
There may be other scenarios where we are looking at a sum of all project-level data sets and only those projects the user has access to are shown, but that isn't our MVC scenario.
Note: this is just an example, not related to this issue.
A group can have several subgroups with different permissions. Projects can be also configured to have different access levels.
Aggregating data for a project is not a problem.
Aggregating data (MRs, Issues) on the group level is tricky.
Consider the following group structure:
Root Group
  - Project 1
  - Project 2
  - Nested Group 1
    - Nested Project 1
  - Nested Group 2
    - Nested Project 2
User is part of the Root Group as reporter.
User is maintainer in Nested Group 1.
User is guest in Nested Group 2.
Generally we allow users to see analytics pages when they have reporter permission.
Example 1: Number of issues opened (last 90 days) for a group.
Getting the counts from projects in the Root Group and Nested Group 1 is easy. Getting the counts from Nested Group 2 requires an extra filter to exclude confidential issues.
If we just showed a count, that would allow the user to "guess" how many issues/confidential issues are in Nested Project 2.
Thanks for the example @ahegyi. You mentioned past customer concerns about exposing counts of confidential issues. Are there specific issues or conversations you have in mind there?
Data Set Definition can be prebuilt by us or created by users (maybe in future iterations)
Yes!
I'd not expect any kind of aggregation logic to be defined in YAML.
In my original conception of this (as described in &3105 (closed)), all data is pre-computed either by us via background jobs or by the user via API calls. @djensen raised concerns about how this would scale and suggested we take an approach more like Insights, where the DSD specifies a query.
In the end, we realized that these aren't mutually exclusive and there may be cases when we need to show data that is precomputed along with data from a query as different series in the same chart.
In my view we don't need to solve for all of that in this story and the two follow-ons. We've deliberately kept the scope narrow so we can pick an approach that works for these and iterate as we go.
I suggest picking either query or pre-compute -- whatever is going to be simple and performant for this story and the next two -- and going with that.
With regard to metadata in the DSD I suggest keeping the format fairly consistent across user-defined and system-defined (though there may be specific data retrieval approaches you allow only in system-defined for safety), as the most important success criterion of the story is that we can see new reports with no code change. (In other words, we can either define a new query in a YAML file or push new data via an API call and see it displayed.)
as the most important success criterion of the story is that we can see new reports with no code change. (In other words, we can either define a new query in a YAML file or push new data via an API call and see it displayed.)
As for user-written data, I don't see a problem here because we don't calculate anything on our own; we just store user-calculated data and display it with our UI components. As for "GitLab-designed" metrics, I'm concerned about the performance of queries that are built from a YAML definition. Those will be hard to debug and estimate. Also, it will take time to build something that safely converts a YAML definition of a query to Ruby code. So I'd personally vote for a solution that requires manual definition of aggregation logic, with all the appropriate tests and performance evaluations, for each new metric we calculate on our own.
what do you think we are likely to find problematic in terms of the queries?
@jshackelford The process of developing and verifying those. Who will know the available options for query definition in YAML? Who will verify that the defined queries are safe (no injections, no access violations) and performant (proper indexes, fast query plans), and how? What will translate YAML to Ruby code? I see only one candidate for this job => a backend engineer, and in that case I don't see the value of defining query logic in YAML.
Do we actually need aggregation at the API level? I thought pre-calculation and aggregation would happen in the background, so the API exposes filtered data points only.
E.g. avg_members_count is one DSD, total_members_count is another DSD, etc., and they are all precalculated in the background. So the data points table will contain rows like (avg_members_count, 2020-05-20, 900) and we pass them to the FE.
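If we went that way, the storage could be as simple as this sketch of a migration (table and column names invented here, not an agreed schema):

```ruby
# Hypothetical migration for a generic data points table.
class CreateAnalyticsDataPoints < ActiveRecord::Migration[6.0]
  def change
    create_table :analytics_data_points do |t|
      t.string :dsd_id,      null: false # e.g. 'avg_members_count'
      t.bigint :group_id,    null: false
      t.date   :recorded_on, null: false # e.g. 2020-05-20
      t.float  :value,       null: false # e.g. 900

      t.index [:dsd_id, :group_id, :recorded_on],
              unique: true,
              name: 'index_data_points_on_dsd_group_date'
    end
  end
end
```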
Update: Hiding a crazy long comment to tidy up this issue.
Curious about the WHY? This might help.
@pshutsin @ahegyi @m_frankiewicz In the following comment I am not trying to nudge you toward queries executed at runtime or toward pre-computed aggregations for this MVC, though I am happy to engage there if you need/want me to, but I do want to be very clear about the problem we are trying to solve. We don't need to solve it all in this MVC--but I do want to be clear about direction. Please forgive the length of this comment but I hope it will help.
When we are successful in analytics we will have many thousands of users who are using our capabilities on a daily or weekly basis. At scale it is not realistic to expect that our small team will be able to conceive of or implement all of the various ways our users will need to apply the data we have in GitLab to understand and manage their own software products, projects, value streams, etc. We will provide signature experiences that express an opinion and which we think will be broadly applicable--but those will meet only some of the needs that customers have.
We could take the approach that all needs beyond those signature experiences need to be met by an external reporting tool like Tableau or Sisense. Most enterprises have these tools and often they ask us about making it easy to get data into them. Some of our competitors have taken the approach--for at least initial versions of their products' reporting capabilities--of OEMing Tableau for their reporting solution. We could choose to do the same, but I and others in product leadership at GitLab believe that taking that approach is not the right move for GitLab for several reasons.
First, being a single application is a core part of our value proposition. As soon as we say you need to add on another tool, whether provided by us or others, we significantly compromise on what is a key differentiator for us. Second, the typical user and consumer of tools such as Sisense or Tableau is not the user profile we are after. While they are often used in corporate governance scenarios, they are rarely used day-to-day by developers, project managers, product managers, operations teams, etc. They are much more likely to use the dashboarding capabilities in JIRA or whatever agile issue tracking system they are using, open source tools like Hygieia, sometimes even fancy add-ons like easyBI. The scenarios these users want to solve do not require the flexibility and power of Tableau or Sisense, but they do require some openness. Our mission is to provide a level of openness without opening a tarpit of complexity in the implementation, and we will do that by constraining the problem and placing limits on that flexibility. I am very confident that we can work together to find that balance. This is not a new problem for me and we've got a great team.
Today, Insights gets more use internally than most of our other features, and it is used in places and for scenarios we wouldn't have necessarily thought of. The marketing dept., for example, uses it to track their work. We need to put into the hands of our technical sellers and service teams the power to resolve customer requests -- at least come close to what the customer wants even if not all the way there -- without raising a feature request with us. Their solution may not always be elegant, may sometimes involve performing computations and inserting data with scripts, etc., but there is a big difference between being able to show the customer something and having to say there is nothing we can do -- talk to product management or use another tool. Open source helps, but opening an MR just to get a chart one wants is also too high a bar when compared to the flexibility our customers see in competing tools such as JIRA, etc. We don't have to have everything they do, but we do have to have something.
As another example, today one of the most important assets we have in describing our vision for Value Stream Management to analysts and customers is this snippet of video https://youtu.be/oG0VESUOFAI?t=75 which is prominently featured on our website at https://about.gitlab.com/solutions/value-stream-management/ -- where we advertise features that more directly relate to the Analytics team than any other. Interestingly, the capability we are showing here is not a scenario that anyone coded for. The flexibility of an open system in the dashboarding used by the ops side of the house made it possible to implement this. Now it is true that they are using Prometheus for this scenario, but it isn't Prometheus that was important here. It was its openness. For the kinds of aggregation we need to do over months and years, PG is fine.
Today we have data for every part of the life cycle in GitLab, but when our sellers go to show off the product it is very hard to tell a good story because charts and data are spread across far too many different pages. What we need to be able to do is allow teams to publish metrics into a library that our sellers and users can add into a dashboard that tells the story that is important to them. When I say that I do not necessarily mean that that data needs to be rewritten to another store or table. In some cases additional aggregation may help; in others we just need metadata that tells us how to run a query to get it and display it. But the point here is that we need to move away from thinking that we have to do everything as the analytics team and toward a way of thinking that is oriented toward giving others the tools they need to contribute data and ways of arranging it into stories that help teams understand their own environment and deliver faster.
So now I come to the main point of all of this. There are two fundamental things we need to solve in this MVC. We need to start down the road of having one UI for report pages that is decoupled from specific knowledge about data retrieval and rendering and instead is informed by metadata. This way we have one place to enhance and refine and one consistent experience for users. The second is that we want to start down the path of establishing patterns for defining the data we want to display in a way that will lead toward an open system so that everyone can contribute.
It seems very likely that in order to cover the breadth of scenarios we will need to address, we will need both an approach where we execute queries at runtime based on a pre-defined query (as Insights does) and an approach where data can be pushed in via API calls. It doesn't matter much to me where we start.
What is really important is that we use the open system we build to be sure that it actually works. There will be steps along the way where we use an approach that isn't safe to release to the world, but we have to view those as temporary steps which we will change to make safer and then adopt and expose. There will also be edge cases where we want to do something special that we aren't going to give our user community, sellers, or other GitLab teams the ability to do, or to do easily. But those have to be the edge cases, 5% of what we do when we are talking about basic aggregations. For most of what we do for basic aggregation charts going forward, we need to move toward implementing features in ways that fit the model we would expect others to use.
(Please note that I am not here talking about the complex stuff like calculating lead time and cycle time from event data. The complexity of that stuff is on a completely different level than what I am talking about here and I do not expect ever to expose to customers an approach to querying or computing that kind of stuff beyond the kind of constrained configuration options we have in VSA or pointing them to our APIs and giving them a place to put their results.)
What does that mean for this MVC?
It means that if we are saying our initial approach to openness is pre-computed data stored via an API for that purpose, this story needs to build the jobs that populate that store and retrieve it through those APIs. Our customers might not define jobs inside our code base but they could do essentially the same thing via API calls and a script running as a cron job.
Or if we are saying our initial approach to openness is to use a document that defines a query which is stored in a config file or in a blob in the database, then we need to do the same for this story.
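For the first option, a customer-side equivalent could be as simple as this sketch of a cron-driven script. The endpoint, payload, and token variable are all hypothetical; nothing like this API exists yet:

```ruby
#!/usr/bin/env ruby
# Hypothetical cron job pushing one precomputed data point into GitLab.
require 'net/http'
require 'json'
require 'uri'
require 'date'

uri = URI('https://gitlab.example.com/api/v4/groups/42/metrics/new-members-count')
request = Net::HTTP::Post.new(uri,
                              'Content-Type'  => 'application/json',
                              'PRIVATE-TOKEN' => ENV['GITLAB_TOKEN'])
request.body = { date: Date.today.to_s, value: 17 }.to_json

Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
  puts http.request(request).code # expect 201 on success
end
```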
We will not advertise any APIs or place configuration files where users can mess with them in this MVC so we can change the format in the next milestone to get closer to what would be safer for end-user consumption. We also don't have to provide a mechanism that works with every kind of query or expose every table in the DB. It's okay to constrain the problem. Insights has a very limited set of capabilities too, but even with those limitations its flexibility is useful.
There are likely hybrid solutions that we haven't even thought of. But the key thing is that we need to start the journey toward building a more open system and we need to be consumers of that system if it is really going to properly mature.
Also remember that this is an MVC. The approach we take to foundations here needs to solve the problem that is in the scope of our first three stories. We do not need to solve for every scenario now and we should not try to.
I understand your vision; I'm concerned about the constraints and complexity we get with some of the points you suggest.
Generic reports page
No problems with the page itself. All the questions are about the metrics calculation process.
Signatures a.k.a. "Metrics that we own and calculate"
No problems with this approach.
Since we own the logic here we can write it in Ruby without limitations. We guarantee efficiency, safety etc.
External user metrics pushed via API a.k.a. "Metrics that we don't own and don't calculate"
No problems with this approach.
These are just "store and display" metrics. We can restrict data to be in our supported format and allow users to retrieve the data they need for calculations through a common API (e.g. /api/merge_requests or /api/milestones); then it's up to users how they want to transform the data before loading it back into GitLab. We guarantee the efficiency and safety of the API as well as the safety and efficiency of the "generic reports page". Standalone instances can even access the DB directly instead of the generic listing API if they want.
External user metrics defined via some query config a.k.a. "Metrics that we don't own and we calculate"
This is where all the concerns arise.
If we really want to make it happen then we need to:
Develop a configuration engine that is safe and efficient for any (or at least the majority of) combinations of supported options.
Provide verbose documentation about configuration options available.
Provide load balancing of the background jobs that need to be executed to calculate each user-defined metric.
Provide "safety net" when configured query actually fails, when server load is too big. Inform users when their data will be available and why it's not available at the moment, what's wrong with their query.
That's what comes to mind as first thoughts.
The first point is the biggest concern; the others are just time-consuming. That's quite a big piece of work and I believe it will spread across multiple releases and teams. I'd suggest cutting this configuration framework development out of the MVC and thinking about it more in a separate issue. That effectively means limiting the MVC to the "generic reports page" + "signatures" only.
@pshutsin Thanks for the quick response and for the detail in Metrics that we don't own and we calculate.
user metrics defined via some query config
I share your concerns about the potential complexity here, which is why my first proposal was the store-and-display approach. Layers that sit on top of the database can have plenty of complexity and one doesn't have to look too far to see examples: Hibernate comes to mind, and even it isn't consumer-facing. On the other hand, we actually are already doing most or all of the four things you mention in Insights today. The scope of the problems it solves is limited, but it isn't hard to imagine how in incremental steps it could be broader, gradually adding new tables or objects on which computations would become available. Our system will never do everything we could do with a SQL statement, but that isn't the goal.
As to the amount of effort this takes, it isn't hard to imagine scopes or approaches that would require a year's work or more. But it also can be done incrementally. When I've done this kind of thing in the past I've seen a single engineer go from no prior experience implementing a grammar / parser to an ANTLR expression parser that implemented an expression language for basic searches and could retrieve data in just a few weeks. And so far we are not really proposing a grammar. The only suggestions have been around options in a YAML file. Another team I worked with, smaller than ours, built a UI and backend for visually constructing queries, adding tables to a canvas, connecting the table widgets with lines that produced joins, etc. in just three months. That is well beyond what I have in mind for the capability we are discussing. The key thing is that we limit the scope and constrain the problem in a way where we gain more flexibility than we have today.
Just to take an example for this MVC: if we were to use the runtime query approach, we would not need to solve the background job problem since we aren't writing data. We could add raw SQL to a file, put it in a config directory not accessible to end users, and execute it without protections. In the next iteration we could break it into separate terms stored in YAML or JSON and whitelist values. We aren't trying to support every query. In another iteration we can add timeouts, etc. Or another approach would be to take the kernel of Insights and start adding capabilities. Today it runs queries on MRs and issues and can already handle counting new opens. We would need the ability to run queries on group members. In future iterations we can add more options for the kinds of objects we are counting, where-clause options, etc. For more on evolving Insights please see &3170 (closed).
Metrics that we own and calculate
These seem to me to fall into a few categories:
Complex queries executed at runtime for very specialized calculations that perform well without pre-computing.
Complex calculations that we make simpler or more performant by calculating / aggregating via background jobs and storing results.
Relatively simple aggregations such as those we have in this MVC that execute at runtime.
For complex queries at runtime, if we think this is a calculation that users will want to display in their own dashboard and the report page, we create a DSD document for it (however we store it) storing metadata for display parameters, etc. and we include an attribute that identifies the Ruby class responsible for the complex query.
For background aggregation / calculations we should store these results in the same metrics store we would expect users to employ and display them in the same way.
For simple aggregations we should use the DSD in the same way we expect end-users to use it as a way of dogfooding our system and providing examples to others. If we are not storing pre-aggregated data for this MVC, I would expect us to use this approach. The DSD and our processing of it will not be mature enough to open to the world at this point, but we can start on that journey. And as always, I'd be happy to break this into multiple steps if it allows us to iterate faster. For example if we want to have one story that implements the MVC with the complex calculation approach and then immediately follow up with a story that moves the implementation to stored results or simple aggregations I am happy to see us split it up.
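For the complex-query category above, the DSD document might simply delegate to code, along these lines (attribute and class names invented here for illustration):

```ruby
# Hypothetical DSD document for a complex, code-backed calculation.
dsd = {
  id: 'mean-time-to-merge',
  title: 'Mean time to merge',
  chart_type: 'line',
  query_class: 'Analytics::MeanTimeToMergeQuery' # Ruby class that owns the query
}

# The report page could then resolve and invoke it generically, e.g.:
# dsd[:query_class].constantize.new(group, params).execute
```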
across multiple releases and teams
I agree that this MVC is the first of a number of releases we will be investing in this area. Please tell me more about multiple teams.
IMHO storing counts gives more flexibility: you can easily get averages or counts (or median).
I think we are talking about different levels of aggregation. I believe you can't get "median merge time" for MRs if you have "total MRs merged" or "total merge time" for a given date. Those are different metrics.
I think averages, medians, or any other trend lines for aggregated data can easily be done on the FE. The API returns a set of 90 data points and the FE builds any trend line it wants.
@pshutsin Yes. If we keep counts of daily arrivals, departures, inventory (total open), we can do quite a lot. With cumulative open and cumulative closed we can do more. But this is a different problem than durations like MTTM, etc. Those are more complicated aggregations because you have to look at each item. I don't expect that we'd expose a query language for computing durations anytime soon.
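To illustrate why daily counts go a long way, a tiny sketch (data shape is hypothetical):

```ruby
# Given daily arrival/departure counts, derive the running inventory (total open).
days = [
  { date: '2020-05-18', arrivals: 5, departures: 2 },
  { date: '2020-05-19', arrivals: 3, departures: 4 },
  { date: '2020-05-20', arrivals: 0, departures: 1 }
]

inventory = 0
series = days.map do |d|
  inventory += d[:arrivals] - d[:departures]
  [d[:date], inventory] # cumulative open count per day
end
# => [["2020-05-18", 3], ["2020-05-19", 2], ["2020-05-20", 1]]

# A duration metric like MTTM cannot be derived this way: it needs per-item
# timestamps, which is why it is a fundamentally different aggregation.
```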
I think we'll need quite a bit of time to figure out how to aggregate data in a scalable manner.
Things to consider:
Aggregating data for all groups is very expensive.
Keeping new member counts daily/monthly/yearly (id, date, group_id) per group is about 7 KB / year (worst case).
It'd take about 50GB database storage (estimated based on the worst case scenario) for GL.com including all root namespaces.
Moving computation into background jobs is going to add significant load to Sidekiq. We'll need to be able to estimate how many jobs we're going to execute. We might need to reach out to the scalability team.
Distribute and limit the concurrent aggregation jobs to avoid overloading the database.
Reliability: currently we retry failed background jobs 3 times and then they're dropped.
For now we should focus on Group Activity - MRs Chart w/ Single Series, since we can use Insights to help us with that bit as we are discussing here. Once we have a plan and issues for that, we can circle back to the New Members problem, which is clearly going to be more complex to break down and figure out.
If we can define a small spike that will help us answer some key questions for New Members for our next milestone, that will be fine.
It'd take about 50GB database storage (estimated based on the worst case scenario) for GL.com including all root namespaces.
The Recent Activity feature is currently listed in the docs as for starter/bronze and above. Does your figure for groups include Free groups?
Do you know what the daily DB growth rate in GBs on dot-com is today, to help put that 50 GB figure in perspective?
We can always move the drill-down for new members up a tier if we need to do so to contain costs / load. Another option would be to do this for top-level groups only.
Yes, all root groups including free groups. Collecting data for licensed groups only will drop the data usage significantly. One concern: large self-hosted installations.
Growth: About 10GB / day. We have 3-4 large tables where we store a lot of text (comments, diff files). I think there is a plan to "fix" this.
If we're looking for tables related to analytics (they're in the top 15 largest tables):
We'll need to be able to estimate how many jobs we're going to execute. We might need to reach out to the scalability team.
Perhaps the question here is how many times a new member gets added to or removed from a group in any given day? (Are there batch add operations where more than one member is added at once?)
If the number of change events is smaller than the number of target groups, then we can at least build a dataset that has values for days on which there is a change. This gives us enough data to show the adds and removes on the chart. The inventory line will have missing values for days where there was no activity, but these can be carried forward when the chart is retrieved. These can be displayed on the front end and not persisted, or saved in a batch back to the database when computed.
The problem of carrying forward results to days with no activity is a general problem in DevOps reporting and applies to things like test case counts or test code coverage line counts when aggregated across many applications since some applications have days without commits / builds.
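A sketch of the carry-forward idea, done at read time with nothing persisted (data shape hypothetical):

```ruby
require 'date'

# Fill gap days by carrying the last known inventory value forward.
# sparse: { Date => value } containing only days that had activity.
def carry_forward(sparse, from, to)
  last_value = nil
  (from..to).map do |day|
    last_value = sparse[day] if sparse.key?(day)
    [day, last_value] # nil until the first recorded value
  end
end

sparse = { Date.new(2020, 5, 18) => 3, Date.new(2020, 5, 21) => 5 }
carry_forward(sparse, Date.new(2020, 5, 18), Date.new(2020, 5, 22))
# => values 3, 3, 3, 5, 5 for the 18th through the 22nd
```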
Do we have to update the table on each new record? I thought we could do it overnight => one job per day which pulls out "all new members that joined today" and updates daily/weekly/yearly metric values. Do we accept a daily lag in our analytics metrics? As far as I understood we are not trying to build live state monitoring, but more of a retrospective overview analytics.
Yes. Real-time update is not a requirement. The lag is acceptable.
"all new members that joined today"
If it is possible to query new members that joined today across many groups, this may work. If we have to check each group, then it appears the load will be too high given the number of groups, since most will have no membership change. (@ahegyi am I understanding your point?)
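For what it's worth, a single cross-group query looks feasible in principle, something like this untested sketch (assuming the members table's source columns work the way I think they do):

```ruby
# Hypothetical nightly job body: one query across all groups rather than
# one query per group, returning counts only for groups that changed.
new_members_per_group =
  Member.where(source_type: 'Namespace', created_at: Date.yesterday.all_day)
        .group(:source_id)
        .count
# => { group_id => count, ... } -- groups with no new members simply don't appear.
```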
The same metric can be aggregated per day/week/month. Assuming we store data aggregated per day, we'll have to either save 3 aggregation levels or perform aggregation on the fly for week and month.
Perhaps we could begin by assuming that if we have counts by day, we could compute week and month on the fly, and then just run some tests that tell us how many data points are too many for this to work. In general, I rather like the approach of saying, "We are going to take approach X and it will have limitation Y." I can then validate that the assumption is likely to be safe for the future, or we can revisit to address it.
Note that we specifically split off the user selection of day/week/month into its own story so that you don't have to bother with it now if it adds too much complexity.
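A sketch of that on-the-fly rollup from daily counts (the data shape is hypothetical, and `beginning_of_week` assumes ActiveSupport is loaded):

```ruby
# Roll daily counts up to weekly buckets at read time.
# daily: { Date => count }
def weekly_rollup(daily)
  daily.group_by { |date, _count| date.beginning_of_week }
       .transform_values { |pairs| pairs.sum { |_date, count| count } }
end

# Monthly would be the same with date.beginning_of_month. For a 90-day window
# this is at most ~90 points per series, so the API or FE can do it cheaply.
```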
As discussed on the Analytics call, here is the example data format used for the column chart:
[ [ "Joe", // x-axis 1220. // y-axis ], ...
Data returned from the API can have a different format, as transformation can easily be applied on the FE. We'll just need to ensure that the format is standardised (if possible) in order to have generic transformations applied without requiring additional configuration metadata.
The minimum that we’ll need in order to begin work on the frontend is a route for the page.
It will be a good idea to introduce this behind a feature flag as it will allow us to iterate while preventing users from browsing directly to the page. We can then also remove the clickable links which direct users to the page based on the flag.
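In GitLab that would presumably look something like the usual feature flag pattern; a minimal sketch, with the flag and controller names invented:

```ruby
# Hypothetical controller guard using GitLab's feature flag mechanism.
class Groups::Analytics::ReportPagesController < ApplicationController
  before_action :check_feature_flag!

  private

  def check_feature_flag!
    render_404 unless Feature.enabled?(:generic_report_pages)
  end
end
```

The same flag check can gate the clickable links that lead to the page.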
This URL sits above the group level because eventually we may have measures that span multiple groups.
analytics, because we will have other analytics pages at this level in the future, such as dashboards, which may also span groups.
rp for report page in a nice short URL.
Counterpoints:
If there is always a top-level group in an instance (even in self-managed), then it would make sense to include group in the URL, since one is always in the context of some group.
Proposal: Use Insights as the foundation for Generic Reports. That would require modifying Insights to be able to query the members table (currently it can only query "issuable" tables). The Insights feature offers answers to many of the questions we are discussing in this issue, like:
Basically, it seems like using Insights could minimize the amount of backend and frontend work that needs to be done to implement this issue (and others). I'm very interested in your thoughts, please comment here: &3214 (comment 349660566)
It is no longer our MVC since Group Activity - MRs Chart w/ Single Series is much easier to implement with Insights as our starting place. I will move acceptance criteria more appropriate to the MVC to that issue.
Going forward, we can use this issue to discuss fetching data for the New Members chart, picking up the conversation in this thread.
We can keep this issue clean by tracking discussion related to persistence of data via a new Generic Metrics API endpoint in this new issue.
If someone thinks that this issue is too much of a mess to save, close it and create a new one. You won't hurt my feelings. :)