Replacing Postlight’s Mercury scraping service with your self-hosted copy

Mercury by Postlight is a tool for scraping web pages. It “transforms web pages into clean text. Publishers and programmers use it to make the web make sense, and readers use it to read any web article comfortably.”
It was a fast, free, web-based service, accessible via an API that worked rather beautifully, without a fuss.

Then, in February 2019, Postlight announced they were discontinuing the hosted service and open-sourcing the code. A mixed blessing: to keep using Mercury, users now had to set up their own instance, either on their own servers or on Amazon’s.
Some guidance was provided, but few of those posting in the discussion group were (easily?) able to resolve the issues surrounding setting up their own instance. For me, since Mercury uses technology I’m not very familiar with, this was a bit of a hit-and-miss process that, eventually, did pay off.

Stephen Bradley wrote up a guide on how to get Mercury running on an Amazon server, but I like to steer clear of Amazon.

Here’s what I did to get Mercury running on my own server. Your mileage may vary.

  1. I’m using DreamHost. When setting up a (sub-)domain, I can select the option ‘Passenger’, which is for running Ruby, NodeJS and Python apps.
  2. SSH to the new domain and install NVM as described here. 
Specifically:
    curl -o- https://raw.githubusercontent.com/creationix/nvm/v0.34.0/install.sh | bash
  3. After running that script, a message claims that you only have to close and open the terminal to have nvm running. That did not work for me and I had to run this (in the terminal):
    export NVM_DIR="$HOME/.nvm"
    [ -s "$NVM_DIR/nvm.sh" ] && \. "$NVM_DIR/nvm.sh" # This loads nvm
    [ -s "$NVM_DIR/bash_completion" ] && \. "$NVM_DIR/bash_completion" # This loads nvm bash_completion
  4. Use nvm to install the minimum required version of Node, in the terminal (based on this post):
    nvm install 8.10
  5. Install the mercury parser in the terminal (based on this):
    npm install @postlight/mercury-parser
  6. Create a JavaScript file (called myfile.js in the next step), for example with:
    const Mercury = require('@postlight/mercury-parser');
    const url = 'https://en.wikipedia.org/wiki/John_von_Neumann';
    Mercury.parse(url).then(result => { console.log(result); } );
  7. Execute the file from the command line with:
    node myfile.js
    This prints the result in the terminal, but does not make it available through the web.
  8. Install expressjs as detailed here.
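  9. Create /myapp/app.js with this:
    const express = require('express');
    const Mercury = require('@postlight/mercury-parser');

    const app = express();
    const port = 8888;

    // Parse the page given in the ?url= query parameter and send back the result.
    app.get('/myapp/', function (req, res) {
      const url = req.query.url;
      Mercury.parse(url)
        .then(result => { res.send(result); })
        // Without a catch, a failed parse would leave the request hanging.
        .catch(err => { res.status(500).send(String(err)); });
    });

    app.listen(port);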
  10. Run the app from the command line:
    node app.js
  11. Visit the page in the browser:
    http://mysite.com:8888/myapp/?url=https://en.wikipedia.org/wiki/John_von_Neumann
  12. But, you want this app to run forever. Install forever:
    npm install forever -g
  13. Start forever:
    forever start app.js
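
The app in step 9 passes whatever arrives in the url parameter straight to Mercury. A minimal sketch of how that could be tightened, using only Node’s built-in url module: validateTargetUrl and makeParseHandler are hypothetical helper names, not part of Express or Mercury, and the status codes are just one reasonable choice.

```javascript
const { URL } = require('url');

// Hypothetical helper: returns the normalized URL if it is a well-formed
// http(s) address, or null otherwise.
function validateTargetUrl(raw) {
  if (typeof raw !== 'string' || raw.length === 0) return null;
  let parsed;
  try {
    parsed = new URL(raw);
  } catch (err) {
    return null; // not parseable as an absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null;
  return parsed.href;
}

// Hypothetical helper: wraps any promise-returning parse function in an
// Express-style (req, res) handler that rejects bad input early.
function makeParseHandler(parse) {
  return function (req, res) {
    const target = validateTargetUrl(req.query.url);
    if (target === null) {
      res.status(400).send({ error: 'Missing or invalid url parameter' });
      return Promise.resolve();
    }
    return parse(target)
      .then(result => { res.send(result); })
      .catch(err => { res.status(502).send({ error: String(err) }); });
  };
}

// In app.js this could be wired up as (assumption, untested on DreamHost):
//   app.get('/myapp/', makeParseHandler(url => Mercury.parse(url)));
```

With this in place, a request without a url parameter, or with something like a file:// address, gets a 400 response instead of reaching the parser.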

Now, I can replace this:

$postlightUrl = "https://mercury.postlight.com/parser?url=".$url;

With this:

$postlightUrl = "http://mysite.com:8888/myapp/?url=".$url;
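
One caveat when building the request this way: if the article address has a query string of its own, it needs to be encoded before it is appended, or everything after the first & will be read as a separate parameter. A small sketch in JavaScript, where buildParserUrl is a hypothetical helper and the example.com address is made up; in PHP, urlencode() plays the same role as encodeURIComponent here.

```javascript
// Hypothetical helper: builds the request URL for the self-hosted parser,
// encoding the article address so its own ?, & and = survive the trip.
function buildParserUrl(base, articleUrl) {
  return base + '?url=' + encodeURIComponent(articleUrl);
}

const requestUrl = buildParserUrl(
  'http://mysite.com:8888/myapp/',
  'https://example.com/article?id=1&page=2' // made-up address for illustration
);
console.log(requestUrl);
```

Express decodes the parameter again on the way in, so req.query.url in app.js receives the original address.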

Broken

This worked, but a few weeks later it no longer did. Calling a script on the command line while logged in to my server via SSH worked fine; the problem was related to Mercury.
Rather than trying to find the cause, I threw together a quick PHP-based solution that essentially does the same as the Mercury parser: extracting basic details from a webpage.
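
To give an idea of what “extracting basic details” can look like, here is a rough sketch. It is not the PHP code mentioned above; it is written in JavaScript to match the rest of this post, extractBasicDetails is a hypothetical name, and the regular expressions are a deliberately crude stand-in for a real HTML parser.

```javascript
// Hypothetical helper: pulls the <title> and the meta description out of
// an HTML string. Regex-based, so it will miss unusual markup and does not
// decode HTML entities.
function extractBasicDetails(html) {
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const metaTag = /<meta[^>]+name=["']description["'][^>]*>/i.exec(html);
  // Capture the opening quote so an apostrophe inside a double-quoted
  // value does not end the match early.
  const contentMatch = metaTag
    ? /content=(["'])([\s\S]*?)\1/i.exec(metaTag[0])
    : null;
  return {
    title: titleMatch ? titleMatch[1].trim() : null,
    description: contentMatch ? contentMatch[2].trim() : null,
  };
}

const sampleHtml =
  '<html><head><title> John von Neumann - Wikipedia </title>' +
  '<meta name="description" content="A mathematician\'s life"></head></html>';
const details = extractBasicDetails(sampleHtml);
console.log(details);
```

The sample page here is an inline string; fetching the HTML from a live address is left out on purpose.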