Aggregate statistics for the contribution calendar
The contribution calendar currently aggregates data directly from the events
table. This prevents us from further cutting down the table size as removing data would lead to incorrect contribution statistics.
To deal with this we should change the calendar to instead use a separate table, e.g. user_contributions
. This table would store a single row per user per day containing the number of contributions for that date. This table would thus have the following columns:
- id (primary key)
- user_id
- date (YYYY-MM-DD)
- contributions
This table would have the following indexes:
-
(id)
primary key index -
(user_id, date)
UNIQUE
The application in turn would take the following steps to insert or update a row:
- Start a transaction
- Check if today's row exists
- If it does not, insert it and rescue duplicate key errors. Upon a duplicate key error jump to the next step
- If it does, increment
contributions
by 1 usingUPDATE user_contributions SET contributions = contributions + 1 WHERE user_id = X AND date = Y
This would lead to something like:
today = Time.now.strftime('%Y-%m-%d')
transaction do
# THIS MUST BE IN SQL, do **not** load data into Ruby to do the update as this
# will lead to incorrect contribution counts being set
updated = user.contributions.where(date: today).
update_all('contributions = contributions + 1')
# Row didn't exist, let's try inserting it
if updated == 0
begin
user.contributions.insert(date: today, contributions: 1)
# Somebody beat us to it, let's update the row instead.
rescue ActiveRecord::DuplicateKeyError
user.contributions.where(date: today).update_all('contributions = contributions + 1')
end
end
end
Do not use find_or_create
and similar methods as these are not atomic. The reason for performing the update first is that this operation will happen in 99% of all cases this code is called. The lack of a SELECT means that in the best case we only need to run a single UPDATE, instead of having to run a SELECT and an UPDATE. This however requires that we use some Rails method that returns the number of updated rows. The begin/rescue
is to take care of concurrent inserts. This will work as expected since there's a UNIQUE index on (user_id, date)
.
I can't stress how important it is here to rely on database indexes and the above setup, don't use any of Rails' validates_unique
helpers since these are really bad at enforcing uniqueness without any UNIQUE indexes being present.