Aggregating Gitter community data into reports (stats)

Data to be gathered

The following views on Gitter data would benefit both the Gitter team and the wider Gitter community:

List most active communities on Gitter (messages/time period)

It would be great to see what types of communities thrive on Gitter. We could possibly ask these communities to test Gitter features such as feature toggles.

This is a "simple" aggregation that groups chat messages by room.

Query production DB

The following query returns a sorted list of rooms with their message counts for the period since the timestamp encoded in _id. Use https://steveridout.github.io/mongo-object-time/ to generate an _id for a different start time.

Query script: https://gitlab.com/gitlab-org/gitter/webapp/snippets/1915718
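For orientation, the aggregation described above could look roughly like the sketch below. This is not the linked snippet, only a minimal illustration that assumes the chatmessages collection and the toTroupeId field from the indexes listed later in this issue. The function name, the messageCount field, and the limit parameter are made up for the example.

```javascript
// Sketch only: builds an aggregation pipeline that counts messages per room.
// `sinceId` is assumed to be an ObjectId created in the mongo shell.
function buildMostActiveRoomsPipeline(sinceId, limit) {
  return [
    // Only look at messages created after the cut-off encoded in _id.
    { $match: { _id: { $gt: sinceId } } },
    // One bucket per room, counting its messages.
    { $group: { _id: '$toTroupeId', messageCount: { $sum: 1 } } },
    // Busiest rooms first.
    { $sort: { messageCount: -1 } },
    // Keep only the top `limit` rooms.
    { $limit: limit }
  ];
}

// Hypothetical usage in the mongo shell (the cut-off id is a placeholder):
// db.chatmessages.aggregate(buildMostActiveRoomsPipeline(ObjectId('<recentId>'), 50))
```

The $match stage comes first so that the rest of the pipeline only touches the recent slice of the collection selected through the default _id index.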

List most populous communities on Gitter

Troupes in the database have a userCount property, but there is no index on it, so sorting by userCount would mean walking through all 770199 rooms in our production DB.

The solution is to reduce the number of rooms to walk through. The final script only considers rooms that had a message sent to them in a given time period:

https://gitlab.com/gitlab-org/gitter/webapp/snippets/1916776
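One way to express the idea of only considering recently active rooms is sketched below. It is an assumption-laden illustration, not the linked script: the troupes collection name, the uri field, and the function name are guesses, and the real script may well use separate queries instead of a $lookup.

```javascript
// Sketch only: rank rooms by userCount, restricted to rooms that received
// at least one message after the cut-off encoded in `sinceId`.
function buildMostPopulousRoomsPipeline(sinceId, limit) {
  return [
    // Recent messages only.
    { $match: { _id: { $gt: sinceId } } },
    // Collapse to one document per active room.
    { $group: { _id: '$toTroupeId' } },
    // Fetch the room document ('troupes' is an assumed collection name).
    { $lookup: { from: 'troupes', localField: '_id', foreignField: '_id', as: 'room' } },
    { $unwind: '$room' },
    // Sorting now only walks the active rooms, not the whole collection.
    { $sort: { 'room.userCount': -1 } },
    { $limit: limit },
    // 'uri' is an assumed field used here to label the room.
    { $project: { uri: '$room.uri', userCount: '$room.userCount' } }
  ];
}

// Hypothetical usage in the mongo shell (the cut-off id is a placeholder):
// db.chatmessages.aggregate(buildMostPopulousRoomsPipeline(ObjectId('<recentId>'), 50))
```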

Public message keyword trends

This view looks at public messages and counts keyword occurrences. It is a very lightweight version of https://trends.google.com/
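A minimal sketch of the counting step, assuming the messages have already been fetched and that each one carries its body in a text field (an assumption, as is the function name). It counts how many messages mention each keyword, case-insensitively, rather than every repeated occurrence within a message.

```javascript
// Sketch only: count how many messages mention each keyword.
// `messages` is an array of objects with an assumed `text` field.
function countKeywordMentions(messages, keywords) {
  const counts = {};
  for (const keyword of keywords) {
    counts[keyword] = 0;
  }
  for (const message of messages) {
    // Tolerate messages without text.
    const text = (message.text || '').toLowerCase();
    for (const keyword of keywords) {
      if (text.includes(keyword.toLowerCase())) {
        counts[keyword] += 1;
      }
    }
  }
  return counts;
}

// Example with made-up messages:
const sample = [{ text: 'Try Slack' }, { text: 'slack or discord?' }, { text: 'hello' }];
console.log(countKeywordMentions(sample, ['slack', 'discord']));
```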

Gitter activity plotted over time

You can click on the chart to open the interactive version

★ - March 2017 - Gitter acquired by GitLab

The above chart shows 3 important metrics sampled each month since 2016. The exact method of counting each dataset can be found in !1668 (merged).

The important takeaway is that the number of active rooms is slightly decreasing, the number of active users is decreasing somewhat faster, and mentions of our competitors in messages are gradually increasing, with some staggering numbers in the last two months.

Iterations

Manually obtaining the data running DB commands
Running a script periodically and automatically and producing a text report
Interactive interface for exploring the data

This issue is only concerned with a way of manually compiling the reports.

Database querying

ssh mongo-replica-02.gitter.prod # make sure you connect to a secondary node
mongo
use gitter
rs.slaveOk()

The key concept is to only use fields that we have indexes on. For example, for chatmessages that would be only:

ChatMessageSchema.index({ toTroupeId: 1, sent: -1 });
ChatMessageSchema.index({ parentId: 1, sent: -1 }, { background: true });
ChatMessageSchema.index({ fromUserId: 1 }, { background: true });

plus the default _id index.

Even when using a secondary, it is a good idea to limit the initial data size by asking only for recent objects with {_id: {$gt: ObjectId(<recentId>)}}.
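The _id filter works because the first 4 bytes of an ObjectId hold its creation time in seconds since the Unix epoch. As a sketch of how a cut-off id could be built by hand, the helper below (its name is made up for this example) turns a Date into the 24-character hex string of the smallest ObjectId for that second.

```javascript
// Sketch only: hex string of the smallest ObjectId created at `date`.
// First 8 hex chars = seconds since the Unix epoch, remaining 16 = zeros.
function objectIdHexFromDate(date) {
  const seconds = Math.floor(date.getTime() / 1000);
  return seconds.toString(16).padStart(8, '0') + '0'.repeat(16);
}

// Hypothetical usage in the mongo shell:
// const recentId = ObjectId(objectIdHexFromDate(new Date('2019-12-01T00:00:00Z')));
// db.chatmessages.find({ _id: { $gt: recentId } }).limit(10)
```

Padding the remaining bytes with zeros means the resulting id sorts before every real ObjectId generated in that second, so a $gt comparison includes all of them.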

Edited Dec 11, 2019 by Eric Eastwood