Converting a dynamic site into static HTML documents

[Screenshot of the Web 25th anniversary web site]

In March 2014, the W3C and the Web Foundation celebrated the World Wide Web's 25th anniversary. As a W3C Team Member, I was asked to help the systems team and host the event's web site. After the event, I was asked to convert the web site into static HTML documents so the systems team wouldn't have to maintain the CMS it was using.

It's now happened twice that I've been asked to take a website running on a CMS and make it static.

This is a useful practice if you want to keep a site's content for posterity without having to maintain the underlying CMS. It also makes migrations easier: a site you know you won't add content to anymore becomes simply a bunch of HTML files in a folder.

My end goal was to make an EXACT copy of the site as generated by the CMS, but stored as plain HTML files. When I say EXACT, I mean it, down to keeping every document at its original location even though it is now served from a static file. Each link in the pages keeps its original URL, but a corresponding file exists on disk and the web server knows how to find it. For example, if a link points to /foo, the link in the page remains as-is even though the content now lives in a static file at /foo.html, and the web server serves /foo.html anyway.
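How the web server maps /foo to /foo.html depends on your setup; the following is only a minimal sketch, assuming Apache with mod_rewrite (not necessarily what the W3C systems team uses), that serves /foo.html whenever /foo is requested and no file or directory of that name exists.

# .htaccess sketch (assumption: Apache with mod_rewrite enabled)
RewriteEngine On
# If the requested path is not an existing file or directory...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# ...but a .html file of the same name exists, serve it instead.
RewriteCond %{REQUEST_FILENAME}.html -f
RewriteRule ^(.*)$ $1.html [L]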

Here are the steps I took to achieve just that. Your mileage may vary; these are the steps I followed and they worked for me.

I've done this procedure a few times with WordPress blogs, as well as with webat25.org, which was running on ExpressionEngine and is now hosted as the w3.org/webat25/ website.

Steps

1. Browse the site and gather all the pages you think could be missed by the scraper

We want a simple file with one web page per line, each with its full address. This will help the crawler not to miss any pages.

  • Use your web browser's developer tools Network inspector, and keep it open with "Preserve log" enabled.
  • Once you have browsed the site a bit, list all documents in the Network inspector and export them using the "Save as HAR" feature.
  • Extract the URLs from the HAR file using underscore-cli

npm install underscore-cli
cat site.har | underscore select '.entries .request .url' > workfile.txt
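At this point workfile.txt is still a JSON array; as a hypothetical example (your URLs will obviously differ), it looks something like this, and the cleanup steps below turn it into a plain list of paths:

[
  "http://www.example.org/",
  "http://www.example.org/about",
  "http://www.example.org/news/march"
]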

  • Remove the first and last lines (it's a JSON array and we want one document per line)
  • Remove the hostname from each line so that it starts with /path; in vim you can do %s/http:\/\/www\.example.org//g
  • Remove the leading " and the trailing ", from each line; in vim you can do %s/",$//g for the trailing part and %s/^\s*"//g for the leading quote
  • On the last line, make sure the closing " is removed too; it has no trailing comma, so the previous regex missed it
  • Remove duplicate lines; in vim you can do :sort u
  • Save this file as list.txt for the next step (a non-interactive version of this cleanup is sketched below).
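If you prefer to do the same cleanup non-interactively, here is a minimal sketch using sed and sort, assuming GNU sed and the same www.example.org hostname as above (adjust it to your site); it drops the JSON array brackets, strips the hostname and the surrounding quotes and commas, then de-duplicates the result:

# Non-interactive sketch of the cleanup above (assumes GNU sed and the
# www.example.org hostname used in this article).
sed -e '1d' -e '$d' \
    -e 's|http://www\.example\.org||' \
    -e 's/^[[:space:]]*"//' \
    -e 's/",\{0,1\}$//' \
    workfile.txt | sort -u > list.txt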

2. Let's scrape everything

We'll do two scrapes. The first one grabs all the assets it can get; then we'll go over the site again with different options.

The following are the commands I ran on my last successful attempt at replicating the site I was working on. I'm not claiming this is the most efficient technique; please feel free to improve the document as you see fit.

First, a quick TL;DR of the wget options:

  • -m is the same as --mirror
  • -k is the same as --convert-links
  • -K is the same as --backup-converted, which keeps a .orig copy of every converted file
  • -E is the same as --adjust-extension, which appends .html to downloaded HTML files that don't already have it
  • -p is the same as --page-requisites, which makes wget fetch ALL the requirements of each page (images, stylesheets, and so on)
  • -nc ensures we don't download the same file twice and end up with duplicates (e.g. file.html AND file.1.html)
  • --cut-dirs would prevent creating directories and mix things up; do not use it.

Notice that we're sending headers as if we were a regular web browser; whether you do the same is up to you.

export UA='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.94 Safari/537.36'
export ACCEPTL='Accept-Language: fr-FR,fr;q=0.8,fr-CA;q=0.6,en-US;q=0.4,en;q=0.2'
export ACCEPTT='Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
wget -i list.txt -nc --random-wait \
     --mirror \
     -e robots=off \
     --no-cache \
     -k -E --page-requisites \
     --user-agent="$UA" \
     --header="$ACCEPTT" \
     http://www.example.org/
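Between the two passes, it can be worth a quick sanity check of what the first pass fetched. This is only a sketch, and assumes wget's default behaviour of mirroring into a directory named after the host:

# Count the fetched files and peek at the page requisites (the directory
# name assumes wget's default of using the host name).
find www.example.org -type f | wc -l
find www.example.org -type f \( -name '*.css' -o -name '*.js' \) | head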

Then, another pass

wget -i list.txt --mirror \
     -e robots=off \
     -k -K -E --no-cache --no-parent \
     --user-agent="$UA" \
     --header="$ACCEPTL" \
     --header="$ACCEPTT" \
     http://www.example.org/
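The -K/--backup-converted option in this second pass leaves a .orig copy next to every page whose links wget rewrote; those copies are what step 3 below works from. A quick sketch to confirm they are there:

# Sanity check: count how many .orig backups the second pass produced.
find . -type f -name '*.orig' | wc -l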

3. Do some cleanup on the fetched files

Here are a few commands I ran to clean up the files a bit:

  • Strip the carriage returns (Windows line endings) from every .orig file. They're the ones we'll use in the end after all

    find . -type f -regextype posix-egrep -regex '.*\.orig$' -exec sed -i 's/\r//' {} \;
    
  • Rename the .orig files to .html. The sed "p;s/orig/html/" prints each path twice, once as-is and once with the substitution applied, and xargs -n2 then feeds each pair to mv as source and destination; the second command collapses the resulting .html.html names back to .html

    find . -name '*orig' | sed -e "p;s/orig/html/" | xargs -n2 mv
    
    find . -type f -name '*\.html\.html' | sed -e "p;s/\.html//" | xargs -n2 mv
    
  • Many folders might contain only an index.html file. Let's turn each of those into a plain file without the directory, e.g. /foo/index.html becomes /foo.html (the leftover empty directories can be dropped afterwards; see the last step below)

    find . -type f -name 'index.html' | sed -e "p;s/\/index\.html/.html/" | xargs -n2 mv
    
  • Remove files that have a .1 (or any other number) in their name; they are most likely duplicates anyway

    find . -type f -name '*\.1\.*' -exec rm -rf {} \;
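
  • The index.html flattening above leaves the directories themselves in place, now empty when index.html was all they contained. If you want to drop them, this sketch assumes GNU find's -empty and -delete options

    find . -type d -empty -delete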