Converting a dynamic site into static HTML documents

Screenshot of webat25.org taken in May 2015

Twice now, I've been asked to take a website running on a CMS and make it static.

This is a useful practice if you want to keep a site's content for posterity without having to maintain the underlying CMS. It also makes migrations easier: a site you know you won't add content to anymore becomes simply a bunch of HTML files in a folder.

My end goal was to make an EXACT copy of what the site looked like when generated by the CMS, but stored as plain HTML files. When I say EXACT, I mean it: each document had to stay at its original location even though it is now a static file. In other words, the links inside each HTML document keep their original value, and it's up to the web server to find the matching file. For example, if a link points to /foo, the link in the page remains as-is; the content now lives in a static file at /foo.html, and the web server serves /foo.html anyway.
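How the web server does that mapping is up to your setup. As a minimal sketch, assuming the static copy is served by nginx out of /var/www/static (both the server and the path are assumptions, not part of the original setup), a try_files rule can answer a request for /foo with the file /foo.html:

# Hypothetical nginx site file; adjust server_name and root to your environment.
cat > /etc/nginx/conf.d/static-archive.conf <<'EOF'
server {
    listen 80;
    server_name www.example.org;
    root /var/www/static/www.example.org;

    # A request for /foo is answered with /foo.html when that file exists.
    location / {
        try_files $uri $uri/ $uri.html =404;
    }
}
EOF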

Here are the steps I took to achieve just that. Note that your mileage may vary; these steps worked for me. I've done this once for a WordPress blog and once for the webat25.org website, which was running on ExpressionEngine.

Steps

1. Browse the site and list all pages you think could be missed by the scraper

We want a simple text file with one page per line, each as a full address; this will help the crawler not to miss pages. One way to build it is to browse the site with your browser's network panel recording, export the session as a HAR file (site.har below), and extract the URLs from it:

npm install -g underscore-cli
cat site.har | underscore select '.entries .request .url' > list.txt
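If you have jq installed, the same extraction can be done directly against the HAR's log.entries array; this is an alternative suggestion, not the command originally used. The -r flag prints raw strings, so you get one clean URL per line without the JSON quoting that underscore's output may carry:

# HAR files wrap everything in a top-level "log" object.
jq -r '.log.entries[].request.url' site.har | sort -u > list.txt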

2. Let’s scrape it all

We'll do two scrapes. The first one grabs all the assets it can get; then we go over the site again with different options.

The following are the commands I ran on my last successful attempt to replicate the site I was working on.
This is not to say that this method is the most efficient one;
please feel free to improve the document as you see fit.

First, a quick TL;DR of the wget options used below:

-i list.txt: read the URLs to fetch from the file built in step 1
-nc (--no-clobber): don't overwrite files that were already downloaded
--random-wait: randomize the pause between requests
--mirror: recursive download with timestamping and infinite depth
-e robots=off: ignore robots.txt
--no-cache: ask for fresh copies rather than cached ones
-k (--convert-links): rewrite links in the saved documents so they work locally
-K (--backup-converted): keep a .orig copy of each file before its links are converted
-E (--adjust-extension): save HTML documents with an .html extension
--page-requisites: also fetch the images, CSS, and scripts the pages need
--no-parent: never ascend above the starting directory

Notice that we're also sending headers as if we were a web browser. That part is up to you.

wget -i list.txt -nc --random-wait --mirror -e robots=off --no-cache -k -E --page-requisites \
  --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
  --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
  --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  http://www.example.org/

Then, a second pass, this time dropping --no-clobber, --random-wait, and --page-requisites, and adding -K and --no-parent:

wget -i list.txt --mirror -e robots=off -k -K -E --no-cache --no-parent \
  --user-agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36' \
  --header='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2' \
  --header='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' \
  http://www.example.org/
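Before cleaning anything up, a quick sanity check doesn't hurt. The commands below are only a rough sketch; wget --mirror writes its output into a directory named after the host, www.example.org in this example.

cd www.example.org

# Serve the copy locally and click around.
python3 -m http.server 8000

# In another terminal: spot documents whose links still point at the live site.
grep -rl 'http://www.example.org' . | head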

3. Do some cleanup on the fetched files

Finally, the fetched files will likely need a bit of cleanup.
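What needs cleaning varies from one crawl to the next, so treat the following as an illustrative sketch rather than the exact commands used here; the patterns and filenames are assumptions.

# Remove the .orig backups that wget -K leaves next to converted documents.
find . -name '*.orig' -delete

# wget keeps query strings in asset filenames (e.g. style.css?v=3).
# List them first, check for collisions, then strip the suffixes.
find . -name '*\?*' -print
find . -name '*\?*' | while read -r f; do mv "$f" "${f%%\?*}"; done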
