Introduction

I don’t like reading on the computer screen.

So I frequently convert web pages or other texts to pdf, to read on e-readers. And this program helps me do that.

The program divides the job into three uncontroversial stages:

  1. Download html

  2. Extract and format the contents to suit small paged media

  3. Convert to pdf

And it basically does what people expect it to do, using the most general technologies.

The major constraint is, as the program's tagline states, to get the job done 'in a few minutes'.

I certainly don’t want to spend more time than that, but I find it doesn’t work well if I simplify things further.

So the program positions itself against both 'in a few seconds' solutions and 'in a few hours' solutions.

Some points to consider:

  • The objective is not to produce a beautiful pdf. It is to manage a marginally readable pdf with as little work as possible.

  • Many related applications seem to concentrate on major news sites, parsing specific article pages or RSS feeds for the target htmls. This program doesn’t have special functionality for these sites. It is rather for more static reading contents, like personal blogs and technical documents.

  • Extraction is the most important part of the program, but it is done by a very simple and predetermined method.

    1. Parse the html text into a DOM element tree (by lxml).

    2. Select the elements you want, discarding others.

    3. Exclude undesirable sub-elements from the selected elements.

    4. Process the resultant tree (apply arbitrary functions to the tree).

    5. Clean the resultant tree (strip some tags, attributes, javascript and css).

    So the program is only suitable for htmls simple enough for this method to handle.

  • I’ve been using KOReader, in recent Kobo e-readers.
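The five extraction steps above can be sketched roughly in code. This is a simplified illustration, not the program's actual code: it uses the stdlib xml.etree.ElementTree in place of lxml (so XPath support is limited), and the sample html and the extract helper are made up for the example.

```python
# A simplified sketch of the extraction stages, not the program's actual
# code: stdlib ElementTree stands in for lxml, so XPath support is limited.
import xml.etree.ElementTree as ET

HTML = """\
<html><body>
  <div id="main"><p>article text</p>
    <div class="comment"><p>spam</p></div>
  </div>
  <div class="sidemenu"><p>links</p></div>
</body></html>
"""

def extract(html, select, exclude):
    root = ET.fromstring(html)               # 1. parse into an element tree
    selected = root.findall(select)          # 2. keep only the wanted elements
    for elem in selected:
        for bad in elem.findall(exclude):    # 3. drop undesirable sub-elements
            elem.remove(bad)                 #    (direct children only, here)
    body = ET.Element('body')                # 4./5. rebuild a cleaned tree
    body.extend(selected)
    return ET.tostring(body, encoding='unicode')

result = extract(HTML, ".//div[@id='main']", "div[@class='comment']")
```

Here the sidemenu div is never selected and the comment div is excluded, so only the article text survives into the new tree.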

Installation

It is an ordinary pure python package, so you should be able to do:

$ pip install tosixinch

Or rather:

$ pip install --user tosixinch

The command will only install the tosixinch package. (You will need other external libraries, but they are not installed automatically.)

Python 3.6 and above are supported.

Note

  • Windows filesystems are not supported.

  • UTF-8 encoding is presupposed for the system and files.

  • I don’t hesitate to change APIs. But please feel free to email me if changes break your configuration; I haven’t provided the clearest documentation, and I’d like to know and help.

Requirements

Technically it has no library requirements, because each action in this program is independent, and therefore optional.

But in general, if you are in the mood to try this program, installing lxml and at least one of the pdf converters is recommended. (That way you can do all of -1, -2 and -3 below.)

  • lxml is used for html DOM manipulations.

Converters are:

  • prince: the converter I mainly use; (a semblance of) software testing tends to concern only prince. It is free of charge for non-commercial use.

  • weasyprint: it has some limitations, notably it is unbearably slow (for our usage, it is not rare that a pdf consists of hundreds or thousands of pages). But it is written in Python, by great authors, and I want to keep it as a reference.

Basic Usage

The main commandline options of the program are:

  • -i INPUT, --input INPUT

  • -f FILE, --file FILE (read file to get INPUTS)

  • -1, --download

  • -2, --extract

  • -3, --convert

INPUT:

either a URL or a local system path.

(Outside the commandline, it is referred to as rsrc (resource).)

-1:

If INPUTS are URLs, downloads them, and creates dfiles (downloaded files).

-2:

reads and edits dfiles, and creates new efiles (extracted files).

-3:

converts efiles, and creates one pdf file.

Note -1, -2 and -3 take the same INPUT as argument. You don’t need to change that part of the commandline (see Example below).

The files the program creates are always in the current directory; dfiles and efiles are always in the '_htmls' sub directory.

Samples

The program includes a sample ini file (site.sample.ini), and reads it into the configuration.

https://*.wikipedia.org/wiki/*
https://*.wikibooks.org/wiki/*
https://wiki.mobileread.com/wiki/*
https://news.ycombinator.com/item*
https://news.ycombinator.com/threads?*
https://old.reddit.com/r/*
https://stackoverflow.com/questions/*
https://docs.python.org/*
https://www.python.org/dev/peps/*
https://bugs.python.org/issue*
https://github.com/* (for https://github.com/*/README*)
https://github.com/*/issues/*
https://github.com/*/pull/*
https://github.com/*/wiki/*
https://gist.github.com/*

For URLs that match one of these patterns, you can test the program without preparing any configuration.

An example:

$ tosixinch -i https://en.wikipedia.org/wiki/XPath -123

Note

  • You need to set the converter if it is not the default (prince). See Programs.

$ [...] --weasyprint
  • If you installed the converter in an unusual place (not in PATH), you need to set the full path. See cnvpath.

$ [...] --cnvpath /home/john/build/bin/prince
  • The sample css uses DejaVu Sans and DejaVu Sans Mono fonts if installed, and is optimized for them. Otherwise generic sans-serif and monospace are used. You may need to adjust the fonts and layout configuration.

  • These commands may create temporary files other than the pdf file in the current directory. You can delete them as you like.

Besides the sample sites, some non-html texts, local or remote, may work fine with the default configuration.

$ tosixinch -i https://raw.githubusercontent.com/python/cpython/master/Lib/textwrap.py -123

Example

You are browsing some website, and you want to bundle some articles in a pdf file.

Move to some working directory.

$ cd ~/Downloads/tosixinch    # an example

Test with one rsrc. If it is a URL like this one, you have to download it first.

$ tosixinch -i https://somesite.com/article/aaa.html -1

Look into the site structure, using e.g. the browser’s developer tools, and write extraction settings for the site.

# in '~/.config/tosixinch/site.ini'
[somesite]
match=    https://somesite.com/article/*
select=   //div[@id="main"]
exclude=  //div[@class="sidemenu"]
          //div[@class="comment"]

Note

The values of select and exclude are XPaths. In software, the html tag structure is parsed into a tree of objects (DOM or Elements). XPath is one way to get parts of that tree.

The value above means: get, from anywhere ('//'), the div tags whose id attribute is 'main' (including every sub-element inside them).

Multiple lines are interpreted as connected with '|' (equivalent to 'or').
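As a rough illustration of that convention, the joined expression can be built like this (the program's actual config parsing may differ):

```python
# A minimal illustration of the multi-line convention above: the lines of
# an option value are joined with '|' into one XPath expression.
# (The program's actual config parsing may differ.)
exclude = '''\
//div[@class="sidemenu"]
//div[@class="comment"]
'''

xpath = ' | '.join(line.strip() for line in exclude.splitlines() if line.strip())
```

The resulting expression matches elements satisfying either line, which is what 'or' means here.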

Generate a new (extracted) html, applying the site config to the local html.

$ tosixinch -i https://somesite.com/article/aaa.html -2

Optionally, check the extracted html in the browser.

$ tosixinch -i https://somesite.com/article/aaa.html -b
  • '-b' or '--browser' opens the efile.

Try -2 several times if necessary, editing the site configuration as you go (it overwrites the same efile).

And

$ tosixinch -i https://somesite.com/article/aaa.html -3
  • It generates ./somesite-aaa.pdf.

Next, build an rsrcs list by some means.

# in './rsrcs.txt'
https://somesite.com/article/aaa.html
https://somesite.com/article/bbb.html
https://somesite.com/article/zzz.html

And

$ tosixinch -123
  • If inputs are not specified (no -i and no -f), the program defaults to 'rsrcs.txt' in the current directory.

  • It generates ./somesite.pdf, with the three htmls as chapters.

Additionally, if you have configured it so:

$ tosixinch -4
  • It opens the pdf with a pdf viewer.

Features

rsrc strings can be pre-processed by regular expressions before mainline processing. Replace.
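As a hypothetical illustration of the kind of rewriting this enables (the program's actual replace syntax is described in its own section), here a mobile Wikipedia URL is redirected to the desktop site before processing:

```python
# A hypothetical illustration of pre-processing rsrc strings with a regular
# expression: rewriting a mobile Wikipedia URL to the desktop site.
# (The program's actual replace configuration syntax may differ.)
import re

rsrc = 'https://en.m.wikipedia.org/wiki/XPath'
rewritten = re.sub(r'^https://en\.m\.', 'https://en.', rsrc)
```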

You can specify multiple encodings for documents, including the html5prescan encoding declaration parser and the ftfy UTF-8 encoding fixer. option: encoding.

The program has very basic headless browser downloading functions using Selenium. So if you are lucky, you may get javascript-generated html contents. option: headless. (Note: Selenium requires the selenium package and Firefox or Chrome webdrivers.)

Users can define additional instructions for browsers. option: dprocess, but I recommend you read process first.

As already mentioned, you can manipulate html elements, by adding arbitrary functions. option: process.

One custom XPath syntax is added, to make selecting class attributes easier. double equals.
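Matching class attributes with plain XPath is awkward because @class may hold several space-separated tokens, so a plain equality test misses multi-class elements; a class-selecting shorthand addresses this. The following is a minimal Python sketch of the token-level matching involved, not the program's implementation:

```python
# Why class matching needs help: @class may contain several space-separated
# tokens, so neither plain equality nor a substring test is right.
# This token check sketches the semantics such a shorthand targets
# (it is not the program's implementation).
def has_class(class_attr, name):
    return name in (class_attr or '').split()

assert has_class('main', 'main')
assert has_class('main wide', 'main')     # equality '@class="main"' fails here
assert not has_class('mainmenu', 'main')  # a substring test would wrongly pass
```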

If you install Pygments and ctags (Universal Ctags or Exuberant Ctags), you can add pdf bookmarks and links for source code definitions. _pcode.

As a builtin, it has similar but simpler capabilities, only for python source code. code.

It can convert man pages. _man.

For other texts, it can also convert them with some formatting (experimental). Text Format. See also option: ftype.

It has a simple TOC (table of contents) rebounding feature, adding one level of structure. So if you have downloaded e.g. the entire contents of some blog site (sorry for the guy), you might be able to get a pdf with annual chapters like 2011, 2012, 2013, with the articles inside them. TOC.
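The annual-chapter idea can be sketched as grouping dated articles into one yearly level. This is a hypothetical illustration of the grouping concept only; the program's actual TOC feature and formats are described in its own documentation:

```python
# A hypothetical sketch of the annual-chapter idea: grouping dated articles
# into one added level of yearly sections. (Not the program's TOC format.)
from itertools import groupby

articles = [  # (date, title), already sorted by date as groupby requires
    ('2011-03-01', 'first post'),
    ('2011-09-12', 'another post'),
    ('2012-01-05', 'new year post'),
]

toc = {year: [title for _, title in items]
       for year, items in groupby(articles, key=lambda a: a[0][:4])}
```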

Users can create their own css files with simple templates, expanding configuration values. CSS Template Values.

As already mentioned, it can open the pdf with a pdf viewer. Viewcmd.

It has pre and post hooks for each (sequential) action. For each, users can call external commands or python modules, adding to or bypassing some of the program’s capabilities. Hookcmds.

As a last resort, it can print out the file names to be created. They are determined mostly uniquely from the rsrc inputs, so users can do some of the program’s jobs outside of the program. commandline: printout.

A basic bash completion script is included. _tosixinch.bash.