Overview¶
Actions¶
The program consists of several, independent actions (subcommands).
An action is either a sequential action or a single action.
sequential actions are, in this order: download ('-1'), extract ('-2') and convert ('-3').
You can call more than one sequential action in one invocation.
Irrespective of the arguments order,
the program executes actions in the above order.
So the three invocations below are equivalent.
$ tosixinch -1 # download
$ tosixinch -2 # extract
$ tosixinch -3 # convert
$ tosixinch -123
$ tosixinch -321
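The ordering rule can be sketched as follows. This is only an illustration, not the program's actual code; the names `SEQUENTIAL_ORDER`, `FLAG_TO_ACTION` and `ordered_actions` are hypothetical, and only the flag-to-action mapping is taken from the invocations above.

```python
# Hypothetical names; only the fixed execution order comes from the text.
SEQUENTIAL_ORDER = ("download", "extract", "convert")
FLAG_TO_ACTION = {"1": "download", "2": "extract", "3": "convert"}


def ordered_actions(flags):
    """Return the requested sequential actions in their fixed order."""
    requested = {FLAG_TO_ACTION[ch] for ch in flags.lstrip("-")}
    # Iterate over the fixed order, not over the argument order.
    return [action for action in SEQUENTIAL_ORDER if action in requested]


print(ordered_actions("-123"))
print(ordered_actions("-321"))
```

Both calls yield the same list, which is why the argument order does not matter.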
single actions are:
- appcheck ('-a'):
Print application settings, and exit. They are global, with commandline evaluation, but without site-specific option evaluation.
- browser ('-b')
- check ('-c'):
Print matched rsrc settings, and exit.
You have to supply rsrc some way (-i or -f). If there is only one input rsrc, print all the option values, with site-specific option evaluation. Otherwise, print only the section name and the match option.
- Do something, according to the site config option inspect.
The program executes only the first single action
(if there are many, or mixed with sequential ones),
and exits.
download¶
Downloads rsrc, and saves it in a local file (dfile).
If force_download is False (default),
the program skips downloading if the file already exists.
If rsrc is a local file path, it also does nothing.
In that case dfile is the same as rsrc.
For the actual downloading, it just uses urllib.request (python standard library). user-agent and encoding are configurable.
If the headless option is True,
the program uses selenium instead of urllib
(see Features).
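A minimal sketch of the skip logic described in this section, assuming a plain urllib download. The function names, the default user agent string and the "starts with http" test for a URL are assumptions for illustration, not the program's API.

```python
import os
import urllib.request


def needs_download(rsrc, dfile, force_download=False):
    """Decide whether rsrc has to be fetched (hypothetical helper)."""
    if not rsrc.startswith(("http://", "https://")):
        return False  # local file path: nothing to download
    if force_download:
        return True
    return not os.path.exists(dfile)  # skip if already downloaded


def download(rsrc, dfile, user_agent="Mozilla/5.0", force_download=False):
    """Fetch rsrc into dfile with urllib.request; return True if fetched."""
    if not needs_download(rsrc, dfile, force_download):
        return False
    request = urllib.request.Request(rsrc, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        data = response.read()
    os.makedirs(os.path.dirname(dfile) or ".", exist_ok=True)
    with open(dfile, "wb") as f:
        f.write(data)
    return True
```

The sketch leaves out encoding handling and the selenium path.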
extract¶
Opens dfile, and generates a new file (efile).
It always writes, overwriting existing efile if any.
(If rsrc is a local file path, dfile is not created,
but efile is created).
The extraction procedure is predetermined, and follows the options in site.ini. See the select, exclude, process and clean options.
convert¶
Opens efile, and generates PDF_File.
It always writes, overwriting existing PDF_File if any.
The converter option
decides which converter to use.
Target Files¶
- rsrc¶
Input resource location. URL or system path. Only http, https and file schemes are supported for URL.
Example:
https://en.wikipedia.org/wiki/XPath
Note
file urls are only for regular local files, starting from one of:
file:/
file://localhost/
file:///
They are converted immediately to regular filepaths,
so e.g. --printout 0 returns the latter.
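The conversion from the three accepted file URL forms to a regular path can be sketched with urllib.parse. This is an illustration for POSIX paths only; the function name `file_url_to_path` and the percent-decoding step are assumptions.

```python
from urllib.parse import unquote, urlsplit


def file_url_to_path(url):
    """Convert file:/, file://localhost/ and file:/// URLs to a path."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        return url  # not a file URL: leave it untouched
    if parts.netloc not in ("", "localhost"):
        raise ValueError("only regular local files are supported: %r" % url)
    return unquote(parts.path)


for url in ("file:/home/john/aaa.txt",
            "file://localhost/home/john/aaa.txt",
            "file:///home/john/aaa.txt"):
    print(file_url_to_path(url))
```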
- rfile¶
The required argument of the commandline option -f or --file. It should be a file containing rsrcs. rfile defaults to ‘rsrcs.txt’.
The file’s syntax is:
Each line is parsed as a rsrc.
When action is not toc, the lines starting with '#' or ';' are ignored.
When action is toc, the lines starting with '#' are interpreted as chapters. The lines starting with ';' are ignored.
When there are multiple rsrcs, if a rsrc has an extension that looks like binary, this rsrc is ignored (according to add_binary_extensions option).
Note if input rsrc is single, whether -i or -f, this add_binary_extensions filter is not applied.
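For the non-toc case, the line rules can be sketched like this. The function name is hypothetical, and two details are assumptions not stated in the text: blank lines are skipped, and leading whitespace is stripped before the comment check. The add_binary_extensions filter is left out.

```python
def parse_rfile(text):
    """Return the rsrcs listed in an rfile (action other than toc)."""
    rsrcs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no rsrc
        if line.startswith(("#", ";")):
            continue  # comment lines are ignored
        rsrcs.append(line)
    return rsrcs


sample = """\
# a comment
https://en.wikipedia.org/wiki/XPath
; another comment
/home/john/script/aaa.txt
"""
print(parse_rfile(sample))
```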
- dfile¶
If rsrc is URL, dfile is created inside _htmls directory, with URL authority and path segments as subdirectories.
Example:
./_htmls/en.wikipedia.org/wiki/XPath
Note
As an exception, if the original URL is too long for file name conversion (a path segment of more than 255 characters), the whole URL is sha1-hashed, and the name takes a _htmls/_hash/<sha1-hexdigit> form.
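A rough sketch of this mapping, including the sha1 fallback. The function name and the `base` parameter are hypothetical, and the sketch ignores query strings and character escaping, which the real conversion presumably has to handle.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit

MAX_SEGMENT = 255  # longest path segment accepted as a file name


def url_to_dfile(url, base="_htmls"):
    """Map a URL to a dfile path under base (simplified sketch)."""
    parts = urlsplit(url)
    segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
    if any(len(s) > MAX_SEGMENT for s in segments):
        # Too long for a file name: hash the whole URL instead.
        digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
        return posixpath.join(base, "_hash", digest)
    return posixpath.join(base, *segments)


print(url_to_dfile("https://en.wikipedia.org/wiki/XPath"))
```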
- efile¶
If rsrc is URL, efile is the same as dfile, but dfile itself is renamed with suffix '.orig'.
Example:
./_htmls/en.wikipedia.org/wiki/XPath.orig (dfile)
./_htmls/en.wikipedia.org/wiki/XPath (efile)
If rsrc is a local file path, the path components of efile are created by the same process as dfile.
Example:
/home/john/script/aaa.txt (rsrc)
/home/john/script/aaa.txt (dfile)
./_htmls/home/john/script/aaa.txt (efile)
- PDF_File¶
When --pdfname option is not provided, the program auto-creates the pdf filename. The name is made up from rsrc’s last path, query, section name and host name of the first rsrc.
Example:
./wikipedia-XPath.pdf (from single input)
./wikipedia.pdf (from multiple input)
Even if rsrcs are from multiple domains (e.g. wikipedia and reddit), the pdf is named after the first one (just wikipedia). So, it is not always appropriate.
Config Files¶
- rsrcs.txt¶
It is the default filename for --file, and used when no other file or input rsrc is specified.
- tocfile¶
It is the toc version of rfile.
It is generated automatically in current directory, when action is toc, and processed automatically when action is convert.
See TOC for details.
- userdir¶
The user configuration directory is specified by the environment variable TOSIXINCH_USERDIR. For example:
export TOSIXINCH_USERDIR=~/etc/tosixinch # (in ~/.bashrc)
Reloading files or rebooting system might be needed. For example:
$ source ~/.bashrc
If the program cannot find the variable, a basic search is done for the most common configuration directories.
Mac:
~/Library/Application Support/tosixinch
Others:
$XDG_CONFIG_HOME/tosixinch
~/.config/tosixinch
(So, if this is OK for you, you don’t have to export the environment variable).
If this also fails, no user directory is set, and just default application config and sample site config are read.
If commandline argument --userdir is given, it overrides all the above.
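The lookup order described for userdir can be sketched as below. The function and parameter names are hypothetical. Whether the program checks that the directory named by the environment variable exists is not stated, so the sketch returns it as is.

```python
import os
import sys


def find_userdir(cli_userdir=None, environ=None, platform=None):
    """Return the user config directory, or None if nothing is found."""
    environ = os.environ if environ is None else environ
    platform = sys.platform if platform is None else platform

    if cli_userdir:  # --userdir overrides everything else
        return cli_userdir
    if environ.get("TOSIXINCH_USERDIR"):
        return os.path.expanduser(environ["TOSIXINCH_USERDIR"])

    if platform == "darwin":  # Mac
        candidates = ["~/Library/Application Support/tosixinch"]
    else:
        candidates = []
        if environ.get("XDG_CONFIG_HOME"):
            candidates.append(
                os.path.join(environ["XDG_CONFIG_HOME"], "tosixinch"))
        candidates.append("~/.config/tosixinch")

    for candidate in candidates:
        path = os.path.expanduser(candidate)
        if os.path.isdir(path):
            return path
    return None  # only default application and sample site config are read
```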
- tosixinch.ini¶
If there are files that glob match tosixinch*.ini in userdir, it reads all of them in alphabetical order, and sets application settings accordingly.
- site.ini¶
If there are files that glob match site*.ini in userdir, it reads all of them in alphabetical order, and sets site specific settings accordingly.
- css directory¶
userdir should have css sub directory. For example ~/.config/tosixinch/css
- css files¶
The program searches css files ('*.css') in css directory (or current directory) when convert.
Each file name must be specified for each converter in tosixinch.ini (see option css).
By default, the program uses sample.css for all converters. It is generated from the template sample.t.css (see below).
- css template files¶
If css file names match '*.t.css', they are rendered by a template engine templite.py (included).
(For the syntax and values, see CSS Template Values).
When convert, the program always renders them, and the resultant css files are placed in css directory (or current directory), overwriting older ones, if any.
The css filenames are made by stripping '.t' from the template. (For example, sample.t.css generates sample.css.)
- process directory¶
userdir can also have process sub directory. For example ~/.config/tosixinch/process
- process files¶
When action is extract, you can apply arbitrary functions to the html DOM elements, before writing to efile.
(For the details, see process option).
The program searches process functions in python files ('*.py') in process directory.
If it cannot find one, it searches in the application’s tosixinch.process directory.
Config Format¶
Configuration files are parsed by a customized version of configparser (Python standard library). So in general, the syntax follows it.
[section]
option= value
more_option= more value
Comment¶
Comment markers are '#' or ';', in the first non-whitespace column.
Inline comments are not possible.
But if the value function of the option is CMD, the value is parsed by
shlex
(Python standard library),
so in the option value, you can use inline comments
(only '#' character). For example:
[section]
command= find . -name '*.py' # TODO: more suitable command example
ConfigParser reads the entire string after '=',
but it is passed to shlex, which strips '#' and everything after it.
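The effect can be reproduced with the standard library alone. The snippet assumes the value is split with comment handling enabled, which is what the description above implies; how the program itself calls shlex is not shown here.

```python
import shlex

# The raw option value, exactly as ConfigParser would return it.
value = "find . -name '*.py' # TODO: more suitable command example"

# comments=True makes shlex drop '#' and the rest of the line.
args = shlex.split(value, comments=True)
print(args)  # ['find', '.', '-name', '*.py']
```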
Structure¶
There are two types of configuration files.
- tosixinch.ini (application config)
- site.ini (sites configs)
tosixinch.ini consists of three types of sections.
- general
- style
- each converter sections (prince and weasyprint)
site.ini consists of sections for each specific website,
and they all have the same options.
site.ini shares some options with tosixinch.ini,
and overrides the latter's values if specified.
The commandline also shares some options with tosixinch.ini,
and overrides site.ini and tosixinch.ini values if specified.
The names of common commandline options can be obtained
by replacing '_' with '-'.
E.g., the commandline option for the config option user_agent is --user-agent.
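The precedence (commandline over site.ini over tosixinch.ini) behaves like a chain of mappings. The sketch below models it with collections.ChainMap; the function names and the sample values are made up for illustration and say nothing about how the program stores its settings.

```python
from collections import ChainMap


def to_cli_flag(option):
    """Derive the commandline flag from a config option name."""
    return "--" + option.replace("_", "-")


def resolve(cli, site, app):
    """Earlier mappings win: commandline, then site.ini, then tosixinch.ini."""
    return ChainMap(cli, site, app)


app = {"user_agent": "app-agent", "encoding": "utf-8"}  # tosixinch.ini
site = {"user_agent": "site-agent"}                      # site.ini
cli = {}                                                 # nothing on the commandline

settings = resolve(cli, site, app)
print(settings["user_agent"], settings["encoding"])
print(to_cli_flag("user_agent"))
```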
Section Inheritance¶
In site.ini, you can use simple section inheritance syntax.
' : ' in section names is specially handled,
so that [aa : bb] means [aa],
but falls back to [bb]. For example:
[aa : bb]
x=aaa
[bb]
x=bbb
y=bbb
In this config, in aa section,
x option is ‘aaa’, and y option is ‘bbb’.
aa doesn’t have y option,
so it searches the parent section (bb).
(If even the parent section doesn’t have the option,
it falls back to the ordinary mechanism:
DEFAULT section search or NoOptionError.)
The purpose is to avoid duplicating options. For example, wiki pages of mobileread.com use the same layout as wikipedia.org. So the options for the program are also the same.
[wikipedia]
match= https://*.wikipedia.org/wiki/*
select= ...
exclude= ...
...
[mobileread : wikipedia]
match= http://wiki.mobileread.com/wiki/*
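The fallback from child to parent section can be sketched on top of the standard configparser. This is not the program's customized parser; `get_inherited` is a hypothetical helper, and the DEFAULT section is not modelled separately (the sample has none).

```python
import configparser

SAMPLE = """\
[aa : bb]
x=aaa
[bb]
x=bbb
y=bbb
"""


def get_inherited(parser, name, option):
    """Look up option in section name, then in its parent, if any."""
    for section in parser.sections():
        child, sep, parent = section.partition(" : ")
        if child != name:
            continue
        # Note: has_option() also consults DEFAULT; unused in this sample.
        if parser.has_option(section, option):
            return parser.get(section, option)
        if sep:  # the section declared a parent: try it next
            return get_inherited(parser, parent, option)
        break
    raise configparser.NoOptionError(option, name)


parser = configparser.ConfigParser()
parser.read_string(SAMPLE)
print(get_inherited(parser, "aa", "x"), get_inherited(parser, "aa", "y"))
```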
Value Functions¶
Each option value field has predetermined transformation rules. Users have to fill the value accordingly, if setting.
- (Nothing)¶
If nothing is specified, it is an ordinary ConfigParser value. String value as you write it. Leading and ending spaces are stripped. Newlines are preserved if indented.
- BOOL¶
'1', 'yes', 'true' and 'on' are interpreted as True. '0', 'no', 'false' and 'off' are interpreted as False. Case insensitive.
- INT¶
Integer number string, no dot
- FLOAT¶
Float number string
- COMMA¶
Values are comma separated list. For example:
[section]
...
comma_option= one, two, three
Leading and ending spaces and newlines are stripped. So the value is a list of 'one', 'two' and 'three'. Single value with no commas is OK.
- LINE¶
Values are line separated list. For example:
[section]
...
line_option= one
    two, three
    four five,
Leading and ending spaces and commas are stripped. So the value is a list of 'one', 'two, three' and 'four five'. Single line with no newlines is OK.
- CMD¶
Value is for a commandline string. You write value string as you would write in the shell. So words with spaces need quotes, and special characters need escapes.
- CMDS¶
List of CMD, separated by newlines as in LINE.
- PLUS¶
Values are comma separated list as COMMA, and add to or subtract from some default values. If the first character of an item is '+', it is a plus item. If '-', it is a minus item.
For example, if initial value is 'one, two, three':
+four -> (one, two, three, four)
-two, -three, +five -> (one, four, five)
If already added or no items to subtract, it does nothing.
+one, -six -> (one, four, five)
As a special case, if all items are neither plus item nor minus item, the list itself overwrites previous value.
six, seven -> (six, seven)
So items must be either some combination of plus items and minus items, or none of them. Mixing these raises Error.
You can pass minus item in the same way in commandline.
... --plus-option -one
Multiple items in commandline should be quoted.
... --plus-option '-two, -three, +four'
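The PLUS rules can be sketched as a small function. The name `apply_plus` and the use of ValueError for the "mixing" case are assumptions; the text only says that an error is raised.

```python
def apply_plus(current, value):
    """Apply a PLUS option value to the current list and return a new list."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    signed = [item for item in items if item[0] in "+-"]

    if not signed:
        return items  # plain list: overwrite the previous value
    if len(signed) != len(items):
        raise ValueError("cannot mix plain items with plus/minus items")

    result = list(current)
    for item in items:
        name = item[1:].strip()
        if item[0] == "+":
            if name not in result:  # already added: do nothing
                result.append(name)
        elif name in result:        # nothing to subtract: do nothing
            result.remove(name)
    return result


value = ["one", "two", "three"]
for change in ("+four", "-two, -three, +five", "+one, -six", "six, seven"):
    value = apply_plus(value, change)
    print(change, "->", value)
```

The loop replays the four examples from the text in sequence, each step starting from the previous result.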
Clean¶
When extract, the program cleans html head and body
according to clean option.
clean head¶
If clean option is either both or head,
before select, exclude and process,
the program cleans the html head.
This means the program replaces the original head with the minimal short head (the same one for all htmls).
clean body¶
If clean option is either both or body,
after select, exclude and process,
the program cleans the resultant html body.
- tags:
According to add_clean_tags.
- attributes:
According to add_clean_attrs.
- javascript:
All inline javascript and javascript source references are unconditionally stripped.
- css:
All style attributes and css source references are stripped, with one exception.
If a tag has 'tsi-keep-style' in class attributes, style attributes are kept intact. It can be used in process functions. If you want to keep or create some inline style, add this class attribute.
# removed (becomes just '<div>')
<div style="font-weight:bold;">
# not removed
<div class="tsi-keep-style other-values" style="font-weight:bold;">
- skip tags:
According to elements_to_keep_attrs. The program skips cleaning the matched elements (and all sub-elements), if the elements are not already removed by
add_clean_tags.
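The 'tsi-keep-style' exception can be illustrated with a small sketch. The program works on lxml elements; to stay self-contained, this example uses the standard library's xml.etree.ElementTree as a stand-in on a well-formed fragment. The function name is hypothetical, and the sketch treats the class as applying to the element itself only.

```python
import xml.etree.ElementTree as ET

KEEP_CLASS = "tsi-keep-style"


def strip_styles(root):
    """Remove style attributes unless the element carries KEEP_CLASS."""
    for el in root.iter():
        classes = el.get("class", "").split()
        if KEEP_CLASS not in classes:
            el.attrib.pop("style", None)
    return root


doc = ET.fromstring(
    "<body>"
    '<div style="font-weight:bold;">a</div>'
    '<div class="tsi-keep-style other-values" style="font-weight:bold;">b</div>'
    "</body>"
)
strip_styles(doc)
print(ET.tostring(doc, encoding="unicode"))
```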
CSS Template Values¶
In css template files,
you can look up option values in style section.
Syntax¶
{{ option }} is replaced with value.
For example, {{ font_size }} becomes 9px.
Conditional block {% if option %} ... {% endif %}
is rendered if the option is evaluated to True
(not None, False, 0, '', or []).
For example, you can write prince specific css rules
inside {% if prince %} ... {% endif %} block.
Values¶
Some extra values are defined for convenience.
size variable is added.
It is automatically set from either
portrait_size
or landscape_size,
according to the value of
orientation.
width and height variables are added (derived from size).
font_scale option is made into scale function.
Use it like {{ font_serif|scale }}.
percent80, percent81 … percent99 functions are added.
Use it like {{ height|percent98 }} (98 % of the height length).
It is OK if the previous value, here height, includes units like px or mm.
Bool variables prince, weasyprint are added.
They are True or False
according to the currently selected converter.
toc_depth is transformed to variables
bm1, bm2, bm3, bm4, bm5 and bm6.
For example, if toc_depth is 3,
they are 1, 2, 3, none, none and none.
In sample.t.css, it is used like:
h1 { prince-bookmark-level: {{ bm1 }} }
h2 { prince-bookmark-level: {{ bm2 }} }
h3 { prince-bookmark-level: {{ bm3 }} }
h4 { prince-bookmark-level: {{ bm4 }} }
h5 { prince-bookmark-level: {{ bm5 }} }
h6 { prince-bookmark-level: {{ bm6 }} }
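The derivation of bm1 to bm6 from toc_depth can be sketched in a few lines. The function name is hypothetical; only the resulting values follow the description above.

```python
def bookmark_levels(toc_depth, max_level=6):
    """Map toc_depth to the template variables bm1 .. bm6."""
    return {
        "bm%d" % level: (level if level <= toc_depth else "none")
        for level in range(1, max_level + 1)
    }


print(bookmark_levels(3))
```

With toc_depth 3, headings h4 to h6 get 'none' and so are left out of the PDF bookmarks.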
lxml.html.HtmlElement¶
The program uses a lightly customized version of lxml.html.HtmlElement,
which means mainly two things.
It prints out a bit more helpful Error message.
You can use a custom XPath syntax,
double equals.
Double Equals¶
When using XPath, it is inconvenient to select elements from class attributes.
For example, if you want to select <div class="aa bb cc"> using 'aa',
you cannot simply write '@class="aa"'
(XPath compares 'aa' as a literal string,
and 'aa bb cc' and 'aa' are different strings).
So you have to write:
div[contains(concat(" ", normalize-space(@class), " "), " aa ")]
(See e.g. When selecting by class, be as specific as necessary, for the explanation).
To ease this problem, the program introduces a custom syntax double equals ('==').
In configuration options expecting XPath strings,
or arguments in .xpath() method in user python modules,
if the string matches:
<tag>[@class==<value>]
in which
<tag> is some tag name or '*'
<value> is some value with optional quotes (' or ")
It is rewritten to:
<tag>[contains(concat(" ", normalize-space(@class), " "), " <value> ")]
So you can write e.g.:
[somesite]
...
select= //div[@class=="aa"]
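The rewriting step can be approximated with a regular expression. This is a sketch of the idea, not the program's implementation; the pattern and the function name `expand_double_equals` are assumptions, and the pattern only covers simple tag names and values without quotes or brackets inside them.

```python
import re

# <tag>[@class==<value>], with optional matching quotes around <value>.
DOUBLE_EQUALS = re.compile(
    r"""(?P<tag>[\w-]+|\*)\[@class==(?P<quote>['"]?)(?P<value>[^'"\]]+)(?P=quote)\]"""
)


def expand_double_equals(xpath):
    """Rewrite the '==' shorthand into standard XPath 1.0."""
    def repl(match):
        return '%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]' % (
            match.group("tag"), match.group("value"))
    return DOUBLE_EQUALS.sub(repl, xpath)


print(expand_double_equals('//div[@class=="aa"]'))
```

Expressions without the shorthand pass through unchanged, so the function can be applied to every XPath string.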