Migrating WebPlatform.org MediaWiki into Git as Markdown files

The WebPlatform project ran on about 20 VMs, with availability managed in part by Monit. At the end of the project, the infrastructure “had to go away”.

On July 1st, 2015, the WebPlatform project was discontinued, its sponsors having withdrawn from the project.

I had been hired by the W3C solely to maintain the WebPlatform project. With the funding gone, and no budget at W3C/MIT to transfer me to another team, my contract was coming to an end.

As the only full-time person on the project with knowledge of the infrastructure, and a believer in “don’t break the web”, I wanted to keep everything online.

When I inherited the project, I migrated the full infrastructure, about 20 Linux VMs, from one cloud provider to another, and worked on a few projects such as an attempt at Single Sign-On and better compatibility tables.

Since the project was being closed, most of what I had worked on lost its use. The infrastructure was also going to become a burden.

While planning my last weeks, I agreed with the W3C systems team (“w3t-systeam”) to convert as much of the site as possible into a static site.

The following are notes about the software I wrote to migrate a WordPress blog and multiple MediaWiki namespaces into a Git repository.

Priorities

The priorities were to keep the documentation pages up, preserve the contribution history and attributions, keep the blog contents, and have everything served as static HTML hosted on GitHub Pages.

I wanted to have this done before I left the W3C.

Since the SysTeam would keep control of the webplatform.org domain, they decided to support some redirects from the original domain to webplatform.github.io.

Once everything was migrated, we added a note:

The WebPlatform project, supported by various stewards between 2012 and 2015, has been discontinued

Source: webplatform.github.io

Outcome

The conversion work took my last two months.

All links in this article to pages showing the WebPlatform logo are the result of what I’m describing here, since all servers were shut down and decommissioned in 2016 once the migration was complete.

The source of the site was generated from webplatform/generator-docs, and we created webplatform/webplatform.github.io to host the site on GitHub Pages. Eventually, the W3C SysTeam set the repository as read-only.

There were other things we had initially planned but couldn’t do (see Requirements we couldn’t meet).

Here is what I could migrate:

| From | To | Repository | Comment | Commits | Documents | Deleted |
|------|----|------------|---------|---------|-----------|---------|
| docs.webplatform.org/wiki/* | webplatform.github.io/docs/* | webplatform/docs | The main docs pages | 37,000 | 4,675 | 418 |
| docs.webplatform.org/wiki/Meta:* | webplatform.github.io/docs/Meta/* | webplatform/docs-meta | Archived content that needed to be moved during initial mass imports. | 300 | 58 | 170 |
| docs.webplatform.org/wiki/WPD:* | webplatform.github.io/docs/WPD/* | webplatform/docs-wpd | Community and notes section. Example: /wiki/WPD:Infrastructure into /docs/WPD/Infrastructure (source) | 5,700 | 358 | 323 |
| blog.webplatform.org | webplatform.github.io/blog/* | webplatform/blog | The blog content | 253 | N/A | |

Migration

While building the solution, a few requirements emerged; they’re listed under Requirements below.

Blog

Migrating the blog was done manually: I just converted the HTML into Markdown using Pandoc.
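For illustration, the conversion was along these lines; the file names are made up and the exact flags I used are not recorded:

    # Convert one exported blog post from HTML to Markdown.
    pandoc --from html --to markdown post.html --output post.md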

MediaWiki

For migrating MediaWiki, the process was a bit more complex.

To migrate, we needed the history of each page, from creation to the last edit.

With MediaWiki, we have Manual:DumpBackup.php, which creates an XML file (a “MediaWiki XML dump”) of the full history of all pages. We made one per namespace; for WebPlatform Docs’ wikis, we had: /wiki/Meta:*, /wiki/WPD:*, /wiki/*.
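As a sketch, a full-history dump limited to one namespace can be produced like this (the WPD namespace ID below is an assumption; custom namespaces get site-specific IDs):

    # Run from the MediaWiki installation directory; one dump per namespace.
    php maintenance/dumpBackup.php --full --filter=namespace:0 > main.xml
    php maintenance/dumpBackup.php --full --filter=namespace:3000 > wpd.xml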

A big part of processing and converting history is the data model: we read from one format and use the data model as the data source for the conversion.

For this, I created webplatform/content-converter, an abstract library specialized in data manipulation to help “Transform CMS content from a format into another [format]”. It’s written in PHP, and might still work even though it hasn’t been touched since 2015.

webplatform/content-converter takes care of manipulating data for the following (a sketch follows this list):

  • Date of contribution and author information

    Content management systems like MediaWiki have a way to tell who made an edit and when.

  • Take each edit’s contents as it was

    Do not change the contents of any edit, even though it is in a format understood only by the original engine, here MediaWiki.
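To make that concrete, here is a minimal sketch of the kind of revision model this implies. The names are illustrative, not webplatform/content-converter’s actual API:

    <?php
    // Illustrative only; not the actual webplatform/content-converter API.
    // One object per edit, carrying exactly what a Git commit needs.
    class Revision
    {
        private $title;   // e.g. "WPD:Infrastructure"
        private $author;  // contributor identity, mapped to "Name <email>"
        private $date;    // timestamp of the edit
        private $content; // raw wikitext, kept exactly as it was

        public function __construct($title, $author, \DateTimeInterface $date, $content)
        {
            $this->title   = $title;
            $this->author  = $author;
            $this->date    = $date;
            $this->content = $content;
        }

        public function getAuthor()  { return $this->author; }
        public function getDate()    { return $this->date; }
        public function getContent() { return $this->content; }
    }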

To convert each page history edit into Git commits, I created webplatform/mediawiki-conversion.

This is the utility that takes care of making the conversion; see the comments made here.

  1. Handle deleted pages

    For each edit, create a Git commit, up until the final commit that deletes the file. At the end of that step, the Git repository should have a history, but no apparent files. At this step, we can also use the entries to make a redirect map.

  2. Handle pages that weren’t deleted in history

    Do the same as above. At the end of this step, we should have a repository of text files where each file has exactly the same content as in the source history, added on top of the deleted pages following the same process as above. We can also use those as the list of current pages we’ll want converted into Markdown.

  3. Convert content

    For all pages that weren’t deleted in history, query the MediaWiki API to retrieve the full HTML, including transclusions, and pass it to Pandoc to convert it into Markdown. Then commit the rewritten file contents. (A sketch of these mechanics follows this list.)
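As an illustration of the mechanics involved, under made-up names, dates, and page titles (the tool does all of this programmatically, and the API endpoint path as well as the use of curl and jq here are assumptions):

    # Steps 1 and 2: each MediaWiki edit becomes one Git commit, with the
    # original contributor and timestamp preserved.
    git add concepts/Internet_and_Web.txt
    GIT_COMMITTER_DATE='2013-04-02T15:04:05Z' git commit \
        --author='Jane Doe <jane@example.org>' \
        --date='2013-04-02T15:04:05Z' \
        -m 'Edit from MediaWiki history'

    # Step 3: fetch the rendered HTML (transclusions expanded) from the
    # MediaWiki parse API, then convert it to Markdown with Pandoc.
    curl 'https://docs.webplatform.org/w/api.php?action=parse&page=html/elements/div&format=json' \
        | jq -r '.parse.text["*"]' \
        | pandoc --from html --to markdown --output html/elements/div.md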

By design, webplatform/mediawiki-conversion allows re-running the third step, so we could keep the production site active and capture every edit until we locked the database (i.e. the cut-off date).

Requirements

Along with what's described in the Migration steps, we also wanted to ensure we properly supported the following.

  1. Keep author attributions but have them anonymized

    We didn’t want contributors’ account email addresses to be published without their consent. So I set up a .mailmap so that contributors could send a PR to advertise their identity publicly on GitHub (see the .mailmap sketch after this list).

  2. Keep code examples in Markdown so we can colourize them

  3. Golden standard pages should look the same

    The "Golden Standard" pages were ones we were regularly referring to see how they would look and see if things are broken.

    Also, since we were removing infrastructure, some features that were enabled at the time of the Web.Archive.org snapshots would be omitted in the static version (i.e. webplatform.github.io):

    • Compatibility data was removed
    • The contents block on the right was removed
    • The overview table looks different
    • On the static version, code blocks are colorized differently (it uses a different process)

    Examples:

  4. Support MediaWiki special URL patterns

    MediaWiki is pretty relaxed about what it allows in its URLs.

    Since we were migrating into a filesystem, we wanted only valid filesystem file names, and decided to normalize all paths into ASCII characters.

    For example, the namespaces (e.g. /wiki/WPD:*) add a colon character, which would be URL-encoded as %3A.

    There were many other discrepancies, like:

    • Case insensitivity — because some URLs in pages were not in the same casing
    • Supporting spaces — sometimes written with an underscore "_", other times with %20
    • Normalizing the URL path (spec) (e.g. anything that would be URL-encoded, such as ()!@:)

    (A sketch of this normalization appears after this list.)
  5. Support MediaWiki special URL patterns: A page in another language should be migrated

  6. Support MediaWiki special URL patterns: A Page with () in URL should be normalized

  7. Support MediaWiki special URL patterns: Pages within WPD namespace should be migrated

    The image comes from /WPD/assets/, which means the image is hosted in the webplatform/docs-wpd repository, in the assets/ folder.

  8. Support MediaWiki special URL patterns: Pages in Meta namespace should be migrated
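Here is the shape of a .mailmap entry supporting requirement 1; the names and addresses are placeholders:

    # .mailmap — a contributor who wants public attribution sends a PR
    # mapping their anonymized commit identity to their real one.
    Jane Doe <jane@example.org> <anonymous-jdoe@docs.webplatform.org>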
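And a minimal sketch of the path normalization from requirement 4; this is illustrative, the real rules in webplatform/mediawiki-conversion are more involved:

    <?php
    // Illustrative normalization: wiki title -> filesystem-safe path.
    function normalizeTitle($title)
    {
        $path = urldecode($title);            // "%20", "%28", ... -> literal characters
        $path = str_replace(' ', '_', $path); // spaces -> underscores
        $path = str_replace(':', '/', $path); // "WPD:Infrastructure" -> "WPD/Infrastructure"
        $path = str_replace(array('(', ')', '!', '@'), '', $path); // strip ()!@
        return $path;
    }

    echo normalizeTitle('WPD:Example page (draft)'); // "WPD/Example_page_draft"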

Requirements we couldn’t meet

  1. Ensure ALL assets uploads are displayed properly

    Previously, we used GlusterFS to store images for MediaWiki. Work had been done to serve them from static.webplatform.org, and all the pages use that link, but the SysTeam decided not to keep the redirects.

    Now that the site is archived, I’m unsure this can be fixed.

    If you want to see them, add this bookmarklet to your bookmarks bar: WebPlatform.github.io Show images.

    The project http://webplatform.github.io/docs-assets/ was meant to be used as an origin for static.webplatform.org.

    Should we want to fix the issue, we could set up an HTTP redirect from static.webplatform.org to webplatform.github.io/docs-assets/ (see the NGINX sketch after this list).

    The following (should be) the same file:

  2. Make sure page links with different casing are redirected properly

    For example, the page for "Internet_and_Web" is sometimes linked to with different casing, e.g. as "Internet_and_Web/The_History_of_the_Web" or "internet_and_web/The_History_of_the_Web"; it was migrated once, from docs.webplatform.org/wiki/concepts/Internet_and_Web/The_History_of_the_Web to webplatform.github.io/docs/concepts/Internet_and_Web/The_History_of_the_Web.

    Some pages link to it with the different casing "internet_and_web", as we can see by comparing docs.webplatform.org/wiki/concepts with webplatform.github.io/docs/concepts.

    MediaWiki would redirect those, but on GitHub Pages, nothing has been put in place for them. If we wanted to support this, we could have used the same entries that created the NGINX rewrite rules, and added empty HTML redirect pages the way GitHub Pages does. In the case of the WebPlatform project, it was decided to leave things as-is.

  3. Enforce all redirects from *.webplatform.org to properly redirect to webplatform.github.io.

    This migration process supported creating NGINX rewrite rules; I’m unsure whether they were ever used (a sketch follows this list).
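For completeness, here is roughly what such NGINX rules could look like. The server names are the real ones; the rules themselves are a sketch, not necessarily what was deployed:

    # Sketch of the redirects discussed above.
    server {
        server_name docs.webplatform.org;
        rewrite ^/wiki/(.*)$ https://webplatform.github.io/docs/$1 permanent;
    }

    server {
        server_name static.webplatform.org;
        rewrite ^/(.*)$ https://webplatform.github.io/docs-assets/$1 permanent;
    }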


Screenshots

webplatform/mediawiki-conversion is made to run in three steps; here we see it running the third step, where we take the latest version of each page, get the HTML from the MediaWiki API, and pass it to Pandoc to get Markdown.

Making sure during development that the conversion was effectively the same

We can see directly on GitHub.com the converted page's Git history

We can see directly on GitHub.com the converted Markdown contents.

The final output should support code blocks with syntax highlighting.

MediaWiki history converted into Git Commits

The end result is that each MediaWiki XML dump can be converted into text files, where each edit becomes a Git commit that keeps the author and the date of the edit.

Not implemented: a screenshot of an experiment to display the page's Git history by calling the GitHub API. There is no trace of this left, to my recollection, and it was effectively the first time I used Vue.js.