User curated collections of projects
Exploring projects on a GitLab instance is currently a bit of a pain, primarily because there are really only 3 ways to "find" a project:
- The trending page (which isn't very accurate / useful)
- Projects with the most stars (this is likely to stay the same over time, so again not very useful)
- Simply going through all existing projects (again not very useful)
To aid with this I would like to propose the addition of user curated collections of projects. The idea is pretty simple: every user can create a project list and add any projects (that they have access to) to it. Other users can browse through the existing lists on a dedicated page. Each list is scoped to the user (e.g. lists/yorickpeterse/ruby-projects or something like that), so you don't get conflicting lists. Lists can be starred just like projects. The page displaying these lists would show, for example, the top 10 most starred lists. Lists can also be marked as "staff picks", in which case they're also displayed on the "Explore Lists" page (or whatever we call it). Users can also search for lists based on words in the title.
The important detail here is that these are user curated lists, not lists solely maintained by GitLab administrators. I feel this better promotes the idea of a community and it also reduces the amount of work that would have to be done by administrators.
Ultimately the idea behind these lists is that one can more easily find projects they may be interested in by going to a certain list for their interest.
Technical Details
Backend wise this is fairly easy to implement. We'd have a table called project_lists with at least the following columns:
- id (integer)
- title (varchar(X))
- title_html (text)
- description (varchar(X))
- description_html (text) (this can't be varchar since the HTML might be larger than the markdown)
- stars (integer)
- staff_pick (boolean)
A separate table called project_lists_tags would map tags to project lists and would have at least the following columns:
- project_list_id (foreign key to project_lists.id)
- tag_id (foreign key to whatever our tags table is called)
A third table would be used to store the people who starred the lists, with at least the following columns:
- project_list_id (foreign key to project_lists.id, with a cascading delete)
- user_id (foreign key to users.id, with a cascading delete)
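The three tables above could be sketched roughly as follows, using SQLite through Python purely for illustration. The column sizes, the project_list_stars table name, and the stand-in users and tags tables are all assumptions; the proposal only fixes the column names and the cascading deletes.

```python
import sqlite3

# Illustrative schema only. Sizes (255 / 1000), the name "project_list_stars",
# and the stub "users" and "tags" tables are assumptions, not part of the proposal.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

CREATE TABLE project_lists (
    id INTEGER PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    title_html TEXT,
    description VARCHAR(1000),
    description_html TEXT,          -- rendered HTML can be larger than the Markdown
    stars INTEGER NOT NULL DEFAULT 0,
    staff_pick BOOLEAN NOT NULL DEFAULT 0
);

CREATE TABLE project_lists_tags (
    project_list_id INTEGER NOT NULL REFERENCES project_lists(id),
    tag_id INTEGER NOT NULL REFERENCES tags(id)
);

CREATE TABLE project_list_stars (
    project_list_id INTEGER NOT NULL REFERENCES project_lists(id) ON DELETE CASCADE,
    user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    PRIMARY KEY (project_list_id, user_id)
);
"""


def create_schema(conn):
    # SQLite only enforces foreign keys (and cascades) when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)


conn = sqlite3.connect(":memory:")
create_schema(conn)
```

The composite primary key on the stars table is an addition of this sketch; it prevents the same user from starring a list twice.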
The number of stars is also cached in project_lists.stars so we don't need this additional table to sort lists. Keeping this in sync is pretty straightforward using database transactions.
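A minimal sketch of keeping the cached counter in sync inside a transaction, again using SQLite via Python. The star_list and unstar_list helpers and the project_list_stars table name are hypothetical. This version recounts the rows on every change rather than incrementing, which is one possible choice: it means any later star or unstar also repairs a counter that has drifted.

```python
import sqlite3

# Trimmed-down schema; "project_list_stars" is an assumed table name.
SCHEMA = """
CREATE TABLE project_lists (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    stars INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE project_list_stars (
    project_list_id INTEGER NOT NULL REFERENCES project_lists(id) ON DELETE CASCADE,
    user_id INTEGER NOT NULL,
    PRIMARY KEY (project_list_id, user_id)
);
"""

# Recompute the cached counter from the stars table for a single list.
RECOUNT = """
UPDATE project_lists
SET stars = (SELECT COUNT(*) FROM project_list_stars WHERE project_list_id = :id)
WHERE id = :id
"""


def star_list(conn, list_id, user_id):
    """Add a star and refresh the cached counter in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT OR IGNORE INTO project_list_stars (project_list_id, user_id) "
            "VALUES (?, ?)",
            (list_id, user_id),
        )
        conn.execute(RECOUNT, {"id": list_id})


def unstar_list(conn, list_id, user_id):
    """Remove a star and refresh the cached counter in one transaction."""
    with conn:
        conn.execute(
            "DELETE FROM project_list_stars WHERE project_list_id = ? AND user_id = ?",
            (list_id, user_id),
        )
        conn.execute(RECOUNT, {"id": list_id})


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO project_lists (id, title) VALUES (1, 'ruby-projects')")
star_list(conn, 1, 42)
```

Sorting the "Explore Lists" page then only needs to read project_lists.stars and never touches the stars table.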
Searching should be limited to just the title (project_lists.title to be exact) to remove the need for also having to index the description. This is necessary because we already struggle to search for issues (where we also index the description) and I'd rather not introduce something new that suffers from the same problems.
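A title-only search could look something like the sketch below. The search_lists helper, the LIKE-based matching, and the ordering by stars are assumptions made for illustration; a real implementation would use whatever search and indexing strategy the application already has. The point is only that the description column never appears in the query.

```python
import sqlite3

SCHEMA = """
CREATE TABLE project_lists (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT,
    stars INTEGER NOT NULL DEFAULT 0
);
"""


def search_lists(conn, query, limit=20):
    """Return (id, title) pairs whose title contains the query text.

    Only project_lists.title is consulted; the description is never searched.
    """
    # Escape LIKE wildcards so user input is matched literally.
    escaped = query.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
    pattern = f"%{escaped}%"
    return conn.execute(
        "SELECT id, title FROM project_lists "
        "WHERE title LIKE ? ESCAPE '\\' "
        "ORDER BY stars DESC, id "
        "LIMIT ?",
        (pattern, limit),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute(
    "INSERT INTO project_lists (title, description) VALUES ('Ruby projects', 'gems')"
)
matches = search_lists(conn, "ruby")
```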
Descriptions should be limited to a reasonable number of characters so users don't go crazy and post entire spam advertisements into them. We also need to validate the title/description using Akismet.
For performance reasons the title and description should only support a limited subset of our Markdown. For example, issue links shouldn't be supported as these tend to be quite expensive to retrieve / redact / etc.
The page that displays project lists should not use numbered pagination, instead only using "Next" and "Previous" buttons; again for performance reasons.
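One way to get "Next" and "Previous" without page numbers is keyset pagination: each request carries the ID at the edge of the current page, so no total count and no large OFFSET is needed. The proposal does not specify this technique; the page_after and page_before helpers and the ordering by id are assumptions of this sketch.

```python
import sqlite3

SCHEMA = """
CREATE TABLE project_lists (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
"""


def page_after(conn, after_id=0, per_page=20):
    """Rows for the "Next" button: lists with an id greater than after_id."""
    rows = conn.execute(
        "SELECT id, title FROM project_lists WHERE id > ? ORDER BY id LIMIT ?",
        (after_id, per_page + 1),  # fetch one extra row to detect a further page
    ).fetchall()
    has_next = len(rows) > per_page
    return rows[:per_page], has_next


def page_before(conn, before_id, per_page=20):
    """Rows for the "Previous" button: lists with an id smaller than before_id."""
    rows = conn.execute(
        "SELECT id, title FROM project_lists WHERE id < ? ORDER BY id DESC LIMIT ?",
        (before_id, per_page + 1),
    ).fetchall()
    has_previous = len(rows) > per_page
    # Rows were read newest-first; flip them back to ascending order for display.
    return list(reversed(rows[:per_page])), has_previous


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany(
    "INSERT INTO project_lists (id, title) VALUES (?, ?)",
    [(i, f"list-{i}") for i in range(1, 6)],
)
first_page, more = page_after(conn, after_id=0, per_page=2)
```

The "Next" link would pass the last id on the current page as after_id, and the "Previous" link would pass the first id as before_id.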
Caching Stars
In the above setup there's a subtle bug: removing a user with a cascading delete would remove the rows from the project list stars table, but it wouldn't reduce the cached counters. Fixing the counters as part of the removal could be very expensive, especially if the user has starred thousands of lists. As such the cached counter should be treated as an approximate count, not an exact count. The next time somebody stars the list the counter will be updated correctly, so this shouldn't be a huge issue.
An alternative is to take the lists the user had starred and schedule a Sidekiq job to refresh their counts. This ensures the counter is more or less accurate, but again there's some time between removing the user and updating the counter. A downside of this is that frequently removing users can lead to many Sidekiq jobs being scheduled.
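The drift and the background refresh can be sketched as below, with a plain function standing in for the Sidekiq job. The remove_user and refresh_star_counts helpers and the project_list_stars table name are hypothetical. The key detail is that the affected list IDs have to be collected before the user is deleted, because the cascade removes the rows that identify them.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE project_lists (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    stars INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE project_list_stars (
    project_list_id INTEGER NOT NULL REFERENCES project_lists(id) ON DELETE CASCADE,
    user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
    PRIMARY KEY (project_list_id, user_id)
);
"""


def remove_user(conn, user_id):
    """Delete a user and return the IDs of the lists whose counters are now stale."""
    # Collect the list IDs first: the cascading delete wipes the star rows.
    list_ids = [
        row[0]
        for row in conn.execute(
            "SELECT project_list_id FROM project_list_stars WHERE user_id = ?",
            (user_id,),
        )
    ]
    with conn:
        conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
    return list_ids  # in the proposal these would be handed to a background job


def refresh_star_counts(conn, list_ids):
    """Stand-in for the background job: recount stars for the given lists."""
    with conn:
        for list_id in list_ids:
            conn.execute(
                "UPDATE project_lists SET stars = ("
                "  SELECT COUNT(*) FROM project_list_stars WHERE project_list_id = ?"
                ") WHERE id = ?",
                (list_id, list_id),
            )


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for cascades in SQLite
conn.executescript(SCHEMA)
```

Between remove_user returning and refresh_star_counts running, project_lists.stars still includes the removed user's star, which is the window of inaccuracy described above.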