CommonMark Migration Plan
This issue is to focus on how to migrate GitLab to using CommonMark for all Markdown rendering.
Merge request https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14835 focuses on replacing our current markdown renderer, Redcarpet, with commonmarker, a CommonMark implementation. Thanks to @blackst0ne for all the work he's doing on this.
Unfortunately, we can't simply turn on CommonMark rendering, as it has the potential to break various aspects of how our current markdown is rendered.
Hopefully below we can flesh out the specifics of what we need to do and begin moving in that direction. Already some comments were made in the MR, starting around https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14835#note_48751971
Why?
Markdown was created by John Gruber as a plain text formatting syntax, a text-to-HTML conversion tool for web writers. Markdown allows you to write using an easy-to-read, easy-to-write plain text format, then convert it to structurally valid XHTML (or HTML).
Over the years, several implementations and "flavors" have been created: Redcarpet, kramdown, pandoc, php-markdown, GitHub Flavored Markdown, etc. You can get a list of many of the different flavors and parsers at the babelmark-registry, which powers the Babelmark3 site for comparing Markdown implementations.
The CommonMark website says it best:
In the absence of a spec, early implementers consulted the original Markdown.pl code to resolve these ambiguities. But Markdown.pl was quite buggy, and gave manifestly bad results in many cases, so it was not a satisfactory replacement for a spec. Markdown.pl was last updated December 17th, 2004.
Because there is no unambiguous spec, implementations have diverged considerably over the last 10 years. As a result, users are often surprised to find that a document that renders one way on one system (say, a GitHub wiki) renders differently on another (say, converting to docbook using Pandoc). To make matters worse, because nothing in Markdown counts as a “syntax error,” the divergence often isn’t discovered right away.
There’s no standard test suite for Markdown; MDTest is the closest thing we have. The only way to resolve Markdown ambiguities and inconsistencies is Babelmark, which compares the output of 20+ implementations of Markdown against each other to see if a consensus emerges.
CommonMark's goal is to create "a strongly defined, highly compatible specification of Markdown":
We propose a standard, unambiguous syntax specification for Markdown, along with a suite of comprehensive tests to validate Markdown implementations against this specification. We believe this is necessary, even essential, for the future of Markdown.
By adopting CommonMark as a standard, we move closer to having Markdown files that are consistently rendered across applications. In general, a Markdown file should render the same on GitHub, GitLab, or any other Markdown-aware application.
Pros
- Supporting Markdown files that adhere to a defined, well tested specification
- Compatibility with the original Markdown implementation has been a major focus of CommonMark, so the majority of documents should be rendered the same
- GitHub has standardized on CommonMark. This means that documents imported from GitHub (and vice versa) will be rendered the same - this is good for the user.
- Various GitLab issues will be resolved by upgrading to CommonMark: #3301 (closed), #30087 (closed), #804 (closed), #3470 (closed), #5873 (closed), #8112 (closed), #13257 (closed), #13344 (closed), #17647 (closed), #18073 (closed), #18389 (closed), #18888 (closed), #18889 (closed), #19557 (closed), #20006 (closed), #20784 (closed), #25749 (closed), #27797 (closed), #30152 (closed), #31872 (closed), #32672 (closed), #32695 (closed), #33430 (closed), #33471 (closed), #35187 (closed), #35249 (closed), #36858 (closed), #39575 (closed), #26375 (closed)
- Performance will be improved: Redcarpet is a pure Ruby library, while commonmarker utilizes GitHub's version of cmark, the C implementation of CommonMark.
- As pointed out by @yuchi, "Most tooling (e.g. Prettier) and editors (e.g. Typora) now output CommonMark and they break miserably on GitLab..."
Cons
- Some extensions (tables, task lists, etc.) are not currently part of the initial CommonMark specification. However, due to the popularity of GitHub Flavored Markdown, many of these have been implemented and have specifications defined.
- The syntax is stricter regarding spacing and various edge cases, so some documents would require changes in order to render the same. GitHub has found that approximately 1% of documents required changes. While this is a Con, it is also a Pro due to the low percentage.
- Will require processes that scan our database to upgrade/normalize existing Markdown content. But we feel any impact of this will be manageable.
Proposal
The end goal is to replace our current Redcarpet rendering with CommonMark, with the least amount of user/customer disruption.
The current thinking is:
- add the ability, for now, to support both Redcarpet and CommonMark running at the same time
- add a feature flag that allows a new installation to be run completely with CommonMark
- turn on CommonMark rendering for all new content for issues, merge requests, and notes (feature flag?)
- write a tool that scans the database and normalizes existing markdown content (comment/notes, etc.) so that they render correctly
- rendering repository content
add the ability, for now, to support both Redcarpet and CommonMark running at the same time
In order to roll this out in a controlled fashion, we want the ability to run both versions at the same time. This would allow us to turn on CommonMark for new projects as a first step.
https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/14835 should be enhanced to support both Redcarpet and CommonMark, as opposed to simply replacing one with the other.
The cached_markdown_version field in the database can be used to determine which renderer is used. CacheMarkdownField currently sets this value to 3:
# Increment this number every time the renderer changes its output
CACHE_VERSION = 3
We can choose the renderer by checking whether cached_markdown_version >= 10: if it's 10 or higher, we use CommonMark; otherwise we use Redcarpet.
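A minimal sketch of that version check follows. The `COMMONMARK_VERSION_START` constant and the `renderer_for` helper are hypothetical names used only for illustration, and treating a missing version as legacy content is an assumption.

```ruby
# Hypothetical threshold: cached versions at or above this use CommonMark.
COMMONMARK_VERSION_START = 10

# Returns a symbol naming the renderer for a record's cached_markdown_version.
# A nil version (never rendered) is treated as legacy content; that is an assumption.
def renderer_for(cached_markdown_version)
  if cached_markdown_version.to_i >= COMMONMARK_VERSION_START
    :commonmark
  else
    :redcarpet
  end
end

renderer_for(3)   # => :redcarpet
renderer_for(10)  # => :commonmark
```

Leaving a gap between the current value of 3 and the threshold of 10 would let Redcarpet output still be versioned independently while both renderers coexist.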
For reference, here are the current tables that have the cached_markdown_version column defined: abuse_reports, appearances, application_settings, broadcast_messages, epics, issues, labels, merge_requests, milestones, namespaces, notes, projects, releases, snippets
add a feature flag that allows a new installation to be run completely with CommonMark
This new feature flag would enable new installations to run fully in CommonMark mode, from issue content to repository content. All new content would be created with a cached_markdown_version of 10, and repository content would also be rendered with CommonMark.
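One way to picture how such a flag might combine with the cached version is sketched below. The `full_commonmark` flag name, the `:repository` and `:database` source symbols, and the `pick_renderer` helper are all hypothetical. Repository files have no cached version, so this sketch assumes they follow the flag alone.

```ruby
# Hypothetical version number at which cached content is treated as CommonMark.
COMMONMARK_CACHE_VERSION = 10

# source: :database for fields with a cached_markdown_version, :repository for files.
# full_commonmark: stand-in for the proposed installation-wide feature flag.
def pick_renderer(source:, cached_version: nil, full_commonmark: false)
  # With the flag on, everything renders with CommonMark.
  return :commonmark if full_commonmark
  # Assumption: without the flag, repository files stay on the legacy renderer.
  return :redcarpet if source == :repository

  cached_version.to_i >= COMMONMARK_CACHE_VERSION ? :commonmark : :redcarpet
end

pick_renderer(source: :repository, full_commonmark: true)  # => :commonmark
pick_renderer(source: :database, cached_version: 3)        # => :redcarpet
```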
turn on CommonMark rendering for all new content for issues, merge requests, and notes (feature flag?)
All new markdown content (issues, notes, milestones, etc) would be created and rendered with CommonMark. This would put a limit on the amount of conversion that will need to be done for existing content.
write a tool that scans the database and normalizes existing markdown content (comment/notes, etc.) so that they render correctly
Our hope is that the majority of content already stored in the system will render correctly with CommonMark. This was apparently the experience of GitHub, as outlined in their blog post[^ghspec]. However, we need to process and normalize each markdown field to ensure that is the case, and mark each field as having been upgraded.
Our migration tool needs to grab each markdown field, normalize it (run a set of rules that convert it to conform to CommonMark rules), then compare the Redcarpet HTML (already stored in the _html field?) to a newly rendered version with CommonMark. If it's equivalent, then store the new version (both markdown and HTML?) and update the cached_markdown_version. If not, then we do something yet to be determined.
This process can run in the background, updating the DB over time (days maybe?). Ideally, there will only be a small percentage of content that cannot be safely converted and rendered.
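The normalize, render, and compare step could look roughly like the sketch below. Everything here is hypothetical: `migrate_field`, `equivalent_html?`, and `MigrationResult` are illustrative names, and the normalizer and CommonMark renderer are passed in as callables so the sketch runs without any markdown gem. The toy rule shown (inserting a space after leading `#` characters) reflects one known difference: CommonMark requires a space after the `#` of a heading, which Redcarpet does not require by default.

```ruby
# Outcome of attempting to upgrade one markdown field.
MigrationResult = Struct.new(:status, :markdown, :html, :version)

# Hypothetical version stamped on successfully upgraded fields.
TARGET_VERSION = 10

# Compares HTML while ignoring whitespace between tags and at the ends.
# A real comparison would likely need to be more tolerant; this is an assumption.
def equivalent_html?(left, right)
  squash = ->(html) { html.gsub(/>\s+</, '><').strip }
  squash.call(left) == squash.call(right)
end

# normalizer and commonmark are callables supplied by the caller.
def migrate_field(markdown, legacy_html, normalizer:, commonmark:)
  normalized = normalizer.call(markdown)
  new_html = commonmark.call(normalized)

  if equivalent_html?(legacy_html, new_html)
    MigrationResult.new(:upgraded, normalized, new_html, TARGET_VERSION)
  else
    # Left untouched; the plan does not yet define what happens here.
    MigrationResult.new(:needs_review, markdown, legacy_html, nil)
  end
end

# Toy stand-ins for the rule set and the CommonMark renderer.
normalizer = ->(md) { md.gsub(/^(#+)(?=[^#\s])/, '\1 ') }
commonmark = ->(md) { md.start_with?('# ') ? "<h1>#{md[2..-1]}</h1>\n" : "<p>#{md}</p>\n" }

result = migrate_field('#Title', '<h1>Title</h1>', normalizer: normalizer, commonmark: commonmark)
result.status    # => :upgraded
result.markdown  # => "# Title"
```

Returning a result object instead of writing to the database keeps the comparison logic separate from persistence, so a background job could batch the updates and collect the `:needs_review` cases for whatever follow-up is decided later.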
rendering repository content
We can't normalize or otherwise change user data stored in repositories. We should, in general, be able to enable rendering for repository data, with only a small percentage of users affected.
I do think there is an issue with our own GitLab documentation. It is all written in markdown, and a rough scan of some of the files shows rendering issues. I will be looking more closely at this as we go forward.