Aggregating Gitter community data into reports (stats)
Data to be gathered
The following views of Gitter data would greatly benefit both the Gitter team and the wider Gitter community:
List most active communities on Gitter (messages/time period)
It would be useful to see which types of communities thrive on Gitter. We could also ask these communities to test Gitter features, for example through feature toggles.
This is a "simple" aggregation grouping chat messages by rooms.
Query production DB
The following query returns a list of rooms sorted by message count, covering the period since the timestamp encoded in the _id.
- To change the period, use https://steveridout.github.io/mongo-object-time/ to generate an ObjectId for a different start time.
Query script: https://gitlab.com/gitlab-org/gitter/webapp/snippets/1915718
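As a rough illustration of the grouping described above (this is not the linked snippet itself), the aggregation could look something like the sketch below. The function name mostActiveRoomsPipeline and the result field count are made up for this example; the collection name chatmessages and the field toTroupeId are taken from the index definitions later in this issue.

```javascript
// Hypothetical helper: builds an aggregation pipeline that counts messages per room.
// sinceId is an ObjectId marking the start of the period; limit caps the result size.
function mostActiveRoomsPipeline(sinceId, limit) {
  return [
    // Only look at messages created after sinceId (uses the default _id index).
    { $match: { _id: { $gt: sinceId } } },
    // One output document per room, with the number of messages in the period.
    { $group: { _id: "$toTroupeId", count: { $sum: 1 } } },
    // Most active rooms first.
    { $sort: { count: -1 } },
    { $limit: limit },
  ];
}

// Possible use in the mongo shell (assumes the chatmessages collection):
// db.chatmessages.aggregate(mostActiveRoomsPipeline(ObjectId("<recentId>"), 20))
```

Matching on _id first keeps the scan limited to recent messages before any grouping happens.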
List most populous communities on Gitter
Troupes in the database have a userCount property, but there is no index on it, and sorting by userCount would mean walking through 770199 rooms in our production DB.
The solution is to reduce the number of rooms to walk through. The final script only considers rooms that had a message sent to them in a given time period:
https://gitlab.com/gitlab-org/gitter/webapp/snippets/1916776
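A minimal sketch of that two-step idea, which may differ from the linked snippet: first collect the ids of rooms that received a message in the period, then sort only those rooms by userCount. The function name mostPopulousActiveRooms and the collection name troupes are assumptions made for this example; db stands for the mongo shell's database handle.

```javascript
// Hypothetical helper for the mongo shell; collection names are assumptions.
function mostPopulousActiveRooms(db, sinceId, limit) {
  // Step 1: ids of rooms that had at least one message after sinceId.
  const activeRoomIds = db.chatmessages
    .aggregate([
      { $match: { _id: { $gt: sinceId } } },
      { $group: { _id: "$toTroupeId" } },
    ])
    .toArray()
    .map(function (doc) {
      return doc._id;
    });

  // Step 2: sort only the active rooms by userCount instead of all rooms.
  return db.troupes
    .find({ _id: { $in: activeRoomIds } })
    .sort({ userCount: -1 })
    .limit(limit)
    .toArray();
}
```

The sort still runs without an index on userCount, but over a far smaller set of rooms.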
Public message keyword trends
Look through public messages and count keyword occurrences: a very lightweight version of https://trends.google.com/
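The counting itself could be as simple as the following sketch. It assumes each message is an object with a text field and uses case-insensitive substring matching, which is only one of several possible definitions of a "mention". The function name countKeywordMentions and the sample keywords are hypothetical.

```javascript
// Hypothetical helper: counts how many messages mention each keyword.
// A message is counted once per keyword, however many times the keyword appears in it.
function countKeywordMentions(messages, keywords) {
  const counts = {};
  for (const keyword of keywords) {
    counts[keyword] = 0;
  }
  for (const message of messages) {
    // Assumption: the message body lives in a `text` field.
    const text = (message.text || "").toLowerCase();
    for (const keyword of keywords) {
      if (text.includes(keyword.toLowerCase())) {
        counts[keyword] += 1;
      }
    }
  }
  return counts;
}

// Example with made-up messages and keywords:
const sampleCounts = countKeywordMentions(
  [{ text: "Trying out Foo today" }, { text: "foo or bar?" }, { text: "hello" }],
  ["foo", "bar"]
);
// sampleCounts is { foo: 2, bar: 1 }
```

Substring matching also counts keywords embedded in longer words, so a real report might prefer word-boundary matching.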
Gitter activity plotted over time
Click the chart to open an interactive version.
- ★ - March 2017 - Gitter acquired by GitLab
The above chart shows 3 important metrics sampled each month since 2016. The exact method of counting each dataset can be found in !1668 (merged).
The important takeaway is that the number of active rooms is slightly decreasing, the number of active users is decreasing a bit more rapidly, and mentions of our competitors in messages are gradually increasing, with some staggering numbers in the last two months.
Iterations
- Manually obtaining the data by running DB commands
- Running a script periodically and automatically and producing a text report
- Interactive interface for exploring the data
This issue is only concerned with a way of manually compiling the reports.
Database querying
ssh mongo-replica-02.gitter.prod # make sure you connect to a secondary node
mongo
use gitter
rs.slaveOk()
The key concept is to only use fields that we have indexes on. For chatmessages, for example, that means only:
ChatMessageSchema.index({ toTroupeId: 1, sent: -1 });
ChatMessageSchema.index({ parentId: 1, sent: -1 }, { background: true });
ChatMessageSchema.index({ fromUserId: 1 }, { background: true });
plus the default _id
Even when querying a secondary, it is a good idea to limit the initial data size by asking only for recent objects, using a filter such as {_id: {$gt: ObjectId(<recentId>)}}.
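The first 4 bytes of a MongoDB ObjectId hold the creation time in seconds since the Unix epoch, so a boundary id for a given date can be built by hand. The helper below is a small sketch of that; the name objectIdHexFromDate is made up for this example.

```javascript
// Hypothetical helper: returns the 24-character hex string of the smallest
// ObjectId whose embedded timestamp equals the given date (to the second).
function objectIdHexFromDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  // 4-byte timestamp as 8 hex characters, left-padded with zeros.
  const timestampHex = ("00000000" + seconds.toString(16)).slice(-8);
  // The remaining 8 bytes are zeroed so the id sorts before any real id from that second.
  return timestampHex + "0000000000000000";
}

// Possible use in the mongo shell:
// db.chatmessages.find({ _id: { $gt: ObjectId(objectIdHexFromDate(new Date("2020-01-01T00:00:00Z"))) } })
```

For 2020-01-01T00:00:00Z this yields 5e0be1000000000000000000.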