TL;DR: This article describes the wiring of a tool that turns a webpage into a self-sufficient ePub (for reading offline). If you want to try the tool, you can grab a binary from GitHub.
To oversimplify my need, I will quote this from the Readability Project:
Reading anything on the Internet has become a full-on nightmare. As media outlets attempt to eke out as much advertising revenue as possible, we’re left trying to put blinders on to mask away all the insanity that surrounds the content we’re trying to read.
It’s almost like listening to talk radio, except the commercials play during the program in the background. It’s a pretty awful experience. Our friend to date has been the trusty “Print View” button. Click it and all the junk goes away. I click it all the time and rarely print. It’s really become the “Peace & Quiet” button for many.
In a recent post, I blogged about a tool I am building for my reMarkable. In this post, I will describe a new tool that converts any webpage into an ePub file.
The goals of this tool are:
- to keep track of the articles I like without fearing any broken links
- to extract the content, and read the articles without distraction
- to be able to read the articles offline on devices such as ebook readers or my reMarkable
This feature already exists if you are using a Kobo and the getPocket service. The problem is that the offline experience is tightly linked to my Kobo device. On top of that, getPocket does not offer any way to download the cleaned version of the articles.
We, as developers, have superpowers: we can build the tools we want.
Let’s explain the features I am building step by step.
Disclaimer: at the time this post is written, the tool is the result of various experiments, and neither the architecture nor the code is clean and maintainable. Take this post as the validation of a proof of concept.
First part: extracting the content
The most important part of this journey is the tool’s ability to extract the content of a webpage. The first idea would be to query the getPocket service that does this, but the documentation of their API mentions that:
Pocket’s Article View API will return article content and relevant meta data on any provided URL.
The Pocket Article View API is currently only open to partners that are integrating Pocket specific features or full-fledged Pocket clients. For example, building a Pocket client for X platform.
If you are looking for a general text parser or to provide “read now” functionality in your app - we do not currently support that. There are other companies/products that provide that type of API, for example: Diffbot.
They mention Diffbot, but it is a web service that requires a subscription; I’d like to build a simple tool, free of charge, for my usage, and therefore this is not an option.
Readability / Arc90
I looked into the open-source initiatives that power the reading modes of browsers (I am/was a fan of Safari's reading mode), and I found that some of them are based on an experiment made by Arc90. This experiment led to the (now discontinued) Readability service.
Feel free to skip this part if you are not interested in the code
The API of the readability library is straightforward. First, we create a Readability object with an HTML parser that reads and extracts the relevant content. Then, we call the Parse method on this object, feeding it an io.Reader that contains the page to analyze. The result is an object of type Article that contains some metadata and the cleaned content. This content is an HTML tree, accessible via a top-level node.
The problem with reactive content and Medium articles
When the Arc90 project made this experiment, there was not much reactive content on the web.
The picture below is a screenshot of a reader view of the page with Safari:
The code below is the content extracted by a curl request:
Inside the <figure> element, we can see that the first image (https://miro.medium.com/max/60/1*RSH2vh_xgQtjB68Zb7oBaA.jpeg?q=20) is a thumbnail acting as a placeholder. A <noscript> tag is also present and exposes the complete sources of the image. As the Arc90 library removes all the <noscript> elements, the only options are:
- to pre-process the HTML file before feeding the Arc90 algorithm
- to amend the Arc90 library
So far, the behavior we are addressing seems specific to articles hosted on Medium. Amending the Arc90 algorithm to handle this particular use case does not seem like a good idea.
So let's go for a pre-processing step on the document before feeding it to the Arc90 algorithm. Showing and commenting on the complete code to do that is beyond the scope of this article.
In a nutshell, the HTML content is parsed into a tree of *html.Node elements; then, the processing step walks the tree via a recursive function seeking <figure> elements. Within processFigure, we once again walk through the subtree, seeking the primary img node and replacing its attributes with those of the image found in the <noscript> tag. You can find the complete code in this gist.
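The actual implementation walks an *html.Node tree from golang.org/x/net/html; as a self-contained illustration of the same recursive technique, here is a sketch over a minimal node type of my own (the tag names and the attribute copy mirror the description above, but this is not the tool's code):

```go
package main

import "fmt"

// node is a minimal stand-in for *html.Node, used only to
// illustrate the recursive pre-processing walk.
type node struct {
	tag      string
	attrs    map[string]string
	children []*node
}

// walk recursively looks for <figure> elements and hands
// them to processFigure, mirroring the pre-processing step.
func walk(n *node) {
	if n.tag == "figure" {
		processFigure(n)
		return
	}
	for _, c := range n.children {
		walk(c)
	}
}

// processFigure copies the attributes of the <img> found under
// <noscript> onto the primary <img>, so the full-resolution
// source survives the Arc90 cleanup.
func processFigure(fig *node) {
	var img, noscriptImg *node
	var find func(n *node, inNoscript bool)
	find = func(n *node, inNoscript bool) {
		if n.tag == "img" {
			if inNoscript {
				noscriptImg = n
			} else if img == nil {
				img = n
			}
		}
		for _, c := range n.children {
			find(c, inNoscript || n.tag == "noscript")
		}
	}
	find(fig, false)
	if img != nil && noscriptImg != nil {
		img.attrs = noscriptImg.attrs
	}
}

func main() {
	fig := &node{tag: "figure", children: []*node{
		{tag: "img", attrs: map[string]string{"src": "thumb.jpeg"}},
		{tag: "noscript", children: []*node{
			{tag: "img", attrs: map[string]string{"src": "full.jpeg"}},
		}},
	}}
	walk(&node{tag: "body", children: []*node{fig}})
	fmt.Println(fig.children[0].attrs["src"]) // full.jpeg
}
```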
Once the HTML tree is adapted, it can go through the Arc90 Algorithm.
Note: as of today, the tree is rendered back into HTML to match the API of Arc90. This is unoptimized. I will eventually submit a PR or fork the project to add a new API that applies the Arc90 algorithm directly to an *html.Node.
Second part: generating the ePub
Now that we have proper content, let’s turn it into an ePub.
An ePub is a set of XHTML files carrying the content, along with images and other local files. All of the content is self-sufficient and packaged in a zip file.
To generate the ePub in the tool, I rely on the go-epub library. This library is stable, and the author welcomes contributions.
The ePub generation is made in two steps:
- building an Epub structure holding the content of the epub;
- generating the epub file with self-sufficient content.
First step: crafting the ePub
In the first step, we create the HTML content. The content is the HTML tree processed previously by the Arc90 algorithm.
The content is added as a single section in the ePub for convenience. A better way would be to parse the HTML tree and create a section for each h1 tag. But as the target is a single downloaded page, there should typically be a single h1 tag inside the page.
To be self-sufficient, the tool needs to parse this tree, seeking remote content (in essence, the images) and downloading it locally.
The go-epub library provides a set of methods to handle the content and do this task smoothly. The AddImage method, for example, creates an entry in a map that references the online content and provides a reference to a local file.
As the library's documentation shows, calling AddImage with the URL of an online image returns the local reference to use inside the ePub.
We need to call this method for every image element in order to populate the image map. On top of that, every src attribute must be changed to point to the local file. We use the same technique as before: a recursive function applied to the root node of the HTML tree.
Back to Medium’s image problem
The img source we have set in the HTML tree relies on the getURL function. In this function, we implement a logic that sets the default source value present in the src attribute. If the function finds a srcset attribute, it parses and sorts it, so the first element holds the largest picture (we want the best possible resolution). To do the sorting, we implement the sort.Interface on a newly created structure. I will not display the whole getURL function, as its implementation is straightforward and available on the project's GitHub.
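As an illustration of the sorting logic (the type and function names below are my own, not the ones from the project):

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// candidate is one entry of a srcset attribute, e.g. "img.jpeg 640w".
type candidate struct {
	url   string
	width int
}

// bySize implements sort.Interface to order candidates
// from the largest width descriptor to the smallest.
type bySize []candidate

func (s bySize) Len() int           { return len(s) }
func (s bySize) Swap(i, j int)      { s[i], s[j] = s[j], s[i] }
func (s bySize) Less(i, j int) bool { return s[i].width > s[j].width }

// parseSrcset splits a srcset value into candidates, reading
// the "Nw" width descriptor when present, and sorts them so
// the widest picture comes first.
func parseSrcset(srcset string) []candidate {
	var out []candidate
	for _, entry := range strings.Split(srcset, ",") {
		fields := strings.Fields(strings.TrimSpace(entry))
		if len(fields) == 0 {
			continue
		}
		c := candidate{url: fields[0]}
		if len(fields) > 1 && strings.HasSuffix(fields[1], "w") {
			c.width, _ = strconv.Atoi(strings.TrimSuffix(fields[1], "w"))
		}
		out = append(out, c)
	}
	sort.Sort(bySize(out))
	return out
}

func main() {
	srcset := "small.jpeg 320w, large.jpeg 1400w, medium.jpeg 640w"
	best := parseSrcset(srcset)[0]
	fmt.Println(best.url) // large.jpeg
}
```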
Second step: creating the ePub
Now that the structure of the ePub is correct, we simply call the Write method, which will:
- download the assets listed in the Epub structure;
- add some metadata;
- create the zip file.
This method ends the process and produces the expected ePub file.
Third part: adding fancy features
Now that we have an ePub file, let's add some features to improve the reading experience.
Grabbing meta information
The Article structure produced by the Arc90 parser references a title, an author, and a front cover for the site.
But, as explained before, Arc90 is quite old, and those pieces of information are nowadays provided by OpenGraph elements.
Arc90 cleans those elements; therefore, we will grab them in the pre-processing step.
To read them, we rely on the Go opengraph library. The opengraph entry point reads the content from an io.Reader.
To optimize the memory, we implement the getOpenGraph method as a middleware: it reads the HTML file from the io.Reader, processes it, and tees the original stream into another reader thanks to an io.TeeReader.
Generating a cover
Now that we have some information, we can generate a cover for the ePub. A cover is an XHTML file that references a single picture.
On the picture, we would like to see:
- the front image of the article as displayed on the social media;
- the title of the article;
- the author of the article;
- the origin of the article;
Using the image/draw package of the standard library, we create an RGBA image and compose the front cover.
The code of the cover generation is here. Then, the methods of the go-epub library add it to the ePub.
To complete the work, we can create a getPocket integration to grab all the elements of the getPocket reading list and convert them to ePub. The integration is straightforward, as the getPocket API allows retrieving a structure holding:
- the original URL
- the title of the file
- the front image
- the authors
But a target could be to run a daemon on the eReader (for example, a reMarkable); therefore, the internal library handles a daemon mode to fetch the articles regularly (as well as when the device wakes up).
Dealing with MathJax
Another feature missing from the getPocket integration on my Kindle is the ability to render LaTeX formulas. I added one more processing step to find MathJax content and create a PNG image of each formula.
To do that, I use the github.com/go-latex/latex package.
The principle is to find a TextNode holding a MathJax element thanks to a regular expression. The processMathTex function then analyzes the formulae and renders them into a PNG-encoded file. The file is inserted into the HTML tree in an img tag whose src attribute references the inline content of the formula, encoded following the data URL principle.
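As a self-contained illustration of the detection and the data URL embedding (the regular expression and the placeholder payload are simplified assumptions; the real rendering goes through go-latex):

```go
package main

import (
	"encoding/base64"
	"fmt"
	"regexp"
)

// mathjaxRe finds inline MathJax/LaTeX content between $$ delimiters.
var mathjaxRe = regexp.MustCompile(`\$\$(.+?)\$\$`)

// toDataURL embeds a (rendered) PNG directly in the src attribute,
// following the data URL scheme, so the image needs no extra file.
func toDataURL(png []byte) string {
	return "data:image/png;base64," + base64.StdEncoding.EncodeToString(png)
}

func main() {
	text := `The energy is $$E = mc^2$$ as stated above.`
	m := mathjaxRe.FindStringSubmatch(text)
	fmt.Println(m[1]) // E = mc^2

	// In the real tool the formula is rendered with go-latex;
	// here we embed a placeholder payload instead.
	fakePNG := []byte{0x89, 'P', 'N', 'G'}
	fmt.Println(toDataURL(fakePNG))
}
```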
Conclusion and future work
I don’t use the getPocket integration very often, but I use the toEpub tool to convert web pages daily.
The getPocket integration will be helpful once I encode the output file into a format suitable for the reMarkable. It sounds pretty straightforward, but I have not taken the time to do it yet.
So far, my workflow is:
- grabbing the URL on my laptop
- running toEpub locally
- sending the result to the reMarkable with rmapi (and now gdrive)
The problem is that this requires a laptop with the tool installed. I am currently hacking the go-epub library so that it no longer needs a filesystem; this will allow compiling the tool to WebAssembly and ease the deployment.