Config Options¶
Note
The default value is given in parentheses on the first line of each option.
The value function is given in brackets on the first line of each option.
tosixinch.ini¶
Note
Options marked with a star (*) are shared with site.ini.
You can use them in site.ini to override the application-wide configuration.
General Section¶
- downloader *¶
(urllib)
Specify the default downloader, either urllib or headless.
- extractor *¶
(lxml)
Specify the default extractor. Currently the only option is lxml.
- converter¶
(prince)
Specify the default converter, either prince or weasyprint.
- user_agent *¶
(some arbitrary browser user agent. Run 'tosixinch -a' to actually see it.)
Specify the user agent for the downloader (only for urllib).
- browser_engine *¶
(selenium-firefox)
Specify the browser engine when the headless option is True, either selenium-chrome or selenium-firefox.
- selenium_chrome_path *¶
- (None)
Specify the path of chromedriver for selenium (passed to the executable_path argument). Normally unnecessary.
- selenium_firefox_path *¶
- (None)
Specify the path of geckodriver for selenium (passed to the executable_path argument). Normally unnecessary.
- encoding *¶
- (utf-8, cp1252, latin_1) [COMMA]
Specify the preferred encoding or encodings. The first one that decodes successfully is used. Encoding names are those defined in the codecs library, plus ‘html5prescan’ and ‘ftfy’ if those libraries are installed.
If the name is html5prescan, html5prescan tries to get a valid encoding declaration from the html. (The library strictly follows the html5 spec, and usually it is neither necessary nor useful. It is intended for occasional debugging.)
If the list includes ftfy, the ftfy.fixes.fix_encoding function is called with the decoded text after one of the encodings has succeeded. It may be able to fix some ‘mojibake’. (It is always called last, so its place in the list is irrelevant.)
Note
The included bash completion only completes canonical codec names (with underscores changed to dashes). But you can use any other names or aliases, as long as they are valid in your Python environment.
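The "first successful one is used" rule can be sketched as below. This is only an illustration, not the program's own code: the function name decode_with_fallback is hypothetical, and the special names html5prescan and ftfy are left out.

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252", "latin_1"), errors="strict"):
    """Return (text, encoding_name) for the first encoding that decodes `data`."""
    for name in encodings:
        try:
            return data.decode(name, errors), name
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: the codec name is unknown to this Python.
            continue
    raise ValueError("none of the encodings could decode the data")


# b"caf\xe9" is not valid utf-8, so the next encoding in the list is tried.
text, used = decode_with_fallback(b"caf\xe9")
print(text, used)
```

Passing errors="replace" makes the first encoding succeed with replacement characters, which is the effect the encoding_errors option relies on.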
- encoding_errors *¶
- (strict)
Specify the codec Error Handler.
If you can’t run extract because of decoding errors, one solution is to change this option to ‘replace’ or ‘backslashreplace’.
- parts_download *¶
- (True) [BOOL]
Web pages may have component content. The most important components are images, and currently the program only handles images (the html tag <img src=...>). The value specifies whether the program downloads these components when extract.
Note that downloading may occur anyway by pdf converters.
If this option is True, download links are rewritten to point to local dfiles, so downloading doesn’t happen when convert.
In general, pre-downloading is useful for multiple trials and layout checking.
If force_download is False (default), the program skips downloading if the file already exists.
- force_download *¶
- (False) [BOOL]
By default, the program does not download if the destination file exists.
If this option is True:
In case of -1, it (re-)downloads the URL even if the dfile exists.
In case of -2, it (re-)downloads component files (images etc.) even if they exist.
But in one invocation, this re-downloading happens only once for each URL. (The program doesn’t download the same icon files again and again.)
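As a rough sketch of the rules above (skip existing files unless forced, and fetch each URL at most once per invocation), one could write something like the following. The class DownloadPlanner and its method are hypothetical names used only for illustration. They are not part of the program.

```python
import os


class DownloadPlanner:
    """Decide whether a URL should be fetched, following the rules above."""

    def __init__(self, force_download=False):
        self.force_download = force_download
        self._done = set()  # URLs already fetched during this invocation

    def should_download(self, url, dest):
        if url in self._done:
            # Never fetch the same URL twice in one run, even when forced.
            return False
        if os.path.exists(dest) and not self.force_download:
            # The destination file exists and re-downloading was not requested.
            return False
        self._done.add(url)
        return True
```

The per-invocation set is what keeps a shared icon from being fetched once for every page that references it.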
- guess¶
- (
    //div[@itemprop="articleBody"]
    //div[@role="main"]
    //div[@id="main"]
    //div[@id="content"]
    //div[@class="body"]
    //article
  ) [LINE]
If an html rsrc doesn’t match any match option in site.ini, select is done according to this value.
The procedure:
The XPaths in this value are searched in order, line by line.
If a match is found and the match is a single element (not multiple occurrences), the program selects the element.
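The procedure can be sketched as follows. To keep the example free of third-party dependencies it uses the standard library's xml.etree.ElementTree as a stand-in for lxml, so each XPath is prefixed with '.' to fit ElementTree's limited XPath subset. The function name guess_element is hypothetical, and it is an assumption that an XPath matching several elements is simply skipped in favor of the next line.

```python
import xml.etree.ElementTree as ET

GUESS = """\
//div[@itemprop="articleBody"]
//div[@role="main"]
//div[@id="main"]
//div[@id="content"]
//article
"""


def guess_element(root, guess=GUESS):
    """Return the first element that is the only match of one of the XPaths."""
    for line in guess.splitlines():
        xpath = line.strip()
        if not xpath:
            continue
        found = root.findall("." + xpath)  # ElementTree needs a relative path
        if len(found) == 1:
            return found[0]
    return None


page = ET.fromstring(
    "<html><body>"
    '<div id="main"><p>text</p></div>'
    "<article>other</article>"
    "</body></html>"
)
print(guess_element(page).get("id"))
```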
- defaultprocess *¶
- (add_h1, youtube_video_to_thumbnail, convert_permalink_sign) [LINE]
Before site specific process functions, the program applies default process functions to all html rsrc, according to this value.
The syntax is the same as the process option in site.ini.
About the default functions:
add_h1: If there is no <h1>, make an <h1> tag from the <title> tag text. This makes better pdf bookmarks (TOC).
youtube_video_to_thumbnail: Change an embedded youtube video object to a thumbnail image.
convert_permalink_sign: Remove permalink signs (’¶’) for a few classes (‘headerlink’ etc.). Python documents tend to use them, and on pdf they are always visible and rather noisy.
When the default functions are undesirable for some site, override this option in user site.ini.
- full_image *¶
- (200) [INT]
If the width or height of an image, in pixels, is equal to or above this value, the class attribute tsi-tall or tsi-wide is added to the image tag: tsi-tall if the height/width ratio is greater than that of the e-reader display, tsi-wide otherwise.
By itself, it does nothing. However, in sample.css, it is used to make medium sized images expand to almost the full display size, while leaving small images (icons, logos, etc.) intact. The layout gets a bit uglier, but I think it is necessary for small e-reader displays.
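The classification can be sketched like this. The function name image_class is hypothetical, and taking the display ratio from the default portrait_size (90mm x 118mm) is an assumption made for the example.

```python
def image_class(width, height, full_image=200, display=(90, 118)):
    """Return 'tsi-tall', 'tsi-wide', or None for an image of the given pixel size."""
    if width <= 0 or height <= 0:
        return None
    if width < full_image and height < full_image:
        # Small images (icons, logos) are left untouched.
        return None
    display_ratio = display[1] / display[0]  # height / width of the display
    if height / width > display_ratio:
        return "tsi-tall"
    return "tsi-wide"


for size in [(32, 32), (400, 300), (300, 900)]:
    print(size, image_class(*size))
```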
- add_binary_extensions¶
(
3dm 3ds 3g2 3gp 7z a aac adp ai aif aiff alz ape apk appimage ar arj asf au avi
bak baml bh bin bk bmp btif bz2 bzip2 cab caf cgm class cmx cpio cr2 cur
dat dcm deb dex djvu dll dmg dng doc docm docx dot dotm dra DS_Store dsk dts dtshd dvb dwg dxf
ecelp4800 ecelp7470 ecelp9600 egg eol eot epub exe
f4v fbs fh fla flac flatpak fli flv fpx fst fvt g3 gh gif graffle gz gzip
h261 h263 h264 icns ico ief img ipa iso jar jpeg jpg jpgv jpm jxr key ktx
lha lib lvp lz lzh lzma lzo
m3u m4a m4v mar mdi mht mid midi mj2 mka mkv mmr mng mobi mov movie mp3 mp4 mp4a mpeg mpg mpga mxu
nef npx numbers nupkg o odp ods odt oga ogg ogv otf ott
pages pbm pcx pdb pdf pea pgm pic png pnm pot potm potx ppa ppam ppm pps ppsm ppsx ppt pptm pptx psd pya pyc pyo pyv
qt rar ras raw resources rgb rip rlc rmf rmvb rpm rtf rz
s3m s7z scpt sgi shar snap sil sketch slk smv snk so stl suo sub swf
tar tbz tbz2 tga tgz thmx tif tiff tlz ttc ttf txz
udf uvh uvi uvm uvp uvs uvu viv vob
war wav wax wbmp wdp weba webm webp whl wim wm wma wmv wmx woff woff2 wrm wvx
xbm xif xla xlam xls xlsb xlsm xlsx xlt xltm xltx xm xmind xpi xpm xwd xz z zip zipx
) [PLUS]
The program ignores rsrcs with binary-looking extensions, but only when multiple rsrcs are provided.
This option value adds to or subtracts from the default add_binary_extensions list above.
The list is taken from Sindre Sorhus’ binary-extensions.
This is for user convenience. If you copy and paste many rsrcs, checking for strange extensions is a bit of work. But I’m afraid it sometimes gets in the way.
(An example I found: some old unix software uses the doc extension for text, like README.doc.)
- add_clean_tags *¶
- (None)
[PLUS] After select, exclude and process in extract, the program cleans the resultant html.
The tags in this option are stripped. The current default is none.
- add_clean_attrs *¶
- (color, width, height) [PLUS]
After select, exclude and process in extract, the program cleans the resultant html.
The attributes in this option are stripped. The current default is color, width and height.
Most e-readers are black and white. Colors just make fonts harder to read.
Width and height conflict with user css rules.
- elements_to_keep_attrs *¶
- (
    self::math
    self::svg
    self::node()[starts-with(@class, "MathJax")]
  ) [LINE]
After select, exclude and process in extract, the program cleans the resultant html.
The program skips cleaning attributes for the elements that match one of the XPaths in this option.
The default is math, svg and some MathJax related tags. They have inter-related width and height information, which we usually want to keep.
Note that XPaths are evaluated from each element, not from the root document. So the selectors are like the above (not like e.g. '//math').
- styles_to_retain *¶
- (
    content:;
    display: none;
    text-decoration-line: line-through;
    text-decoration: line-through;
  ) [LINE]
Specify particular inline styles you want to retain (all other styles are removed).
If only the property is provided (no values after ':'), then all values are retained as is.
The ‘;’ before each line end is optional.
Note: This is for the simplest css manipulations. The program gives up at any indication of complexity (currently the existence of any of the characters ‘,’, ‘/’, ‘(’).
- ftype¶
- (None)
Specify the file type when extract.
Valid values are:
'html', 'prose', 'nonprose', 'python'
It needs improvement.
- textindent¶
(' --> ')
Set the logical line continuation marker for nonprose texts. See nonprose.
ConfigParser strips leading and trailing whitespace. So if you want actual whitespace, quote it as the default does. The quotes are stripped by the program in turn.
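The whitespace behavior is easy to check with the standard library. In the sketch below, the section name general and the helper unquote are assumptions made for the example. The helper only shows what "quotes are stripped in turn" could look like, and is not the program's own code.

```python
import configparser

# One quoted value and one bare value with surrounding spaces.
INI = "[general]\nquoted = ' --> '\nbare =    -->   \n"


def unquote(value):
    """Strip one pair of matching quotes, if present (hypothetical helper)."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


parser = configparser.ConfigParser()
parser.read_string(INI)
section = parser["general"]

print(repr(section["bare"]))             # surrounding spaces are gone
print(repr(unquote(section["quoted"])))  # spaces inside the quotes survive
```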
- trimdirs *¶
- (3) [INT]
Shorten the PDF table of contents title if the rsrc is a local text file.
The PDF toc title for a local text file is made from its full path. If this trimdirs value has no sign, that number of leading path segments is removed. If it has a minus sign, leading path segments are removed until that number of segments remains.
    --trimdirs 0
    aaa/bbb/ccc/ddd/eee/fff
    --trimdirs 2    # remove two segments
    ccc/ddd/eee/fff
    --trimdirs -2   # reduce to two segments
    eee/fff

    # c.f. no bounding errors
    --trimdirs 100
    fff
    --trimdirs -100
    aaa/bbb/ccc/ddd/eee/fff
Note that html files always use the html title (actual, or the placeholder notitle). Remote text (non-html) files use the URL with the scheme (’https://’) stripped.
C.f. the --check commandline option prints out these shortened names for local files. They include local html files, so it is not perfect, but it can be useful for checking and adjusting this trimdirs option.
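A small function that reproduces the trimdirs examples above might look like this. The name trim_path is hypothetical, and the sketch only mirrors the documented results. It is not the program's implementation.

```python
def trim_path(path, trimdirs=3):
    """Shorten a '/'-separated path following the trimdirs rules above."""
    parts = path.split("/")
    if trimdirs >= 0:
        # Remove that many leading segments, but always keep the last one.
        cut = min(trimdirs, len(parts) - 1)
    else:
        # Keep only the last -trimdirs segments (or everything, if fewer).
        cut = max(len(parts) + trimdirs, 0)
    return "/".join(parts[cut:])


path = "aaa/bbb/ccc/ddd/eee/fff"
for value in (0, 2, -2, 100, -100):
    print(value, trim_path(path, value))
```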
- raw¶
- (False) [BOOL]
If True, when convert, the program processes rsrcs. Normally (if it is False), it processes efile.
- css *¶
- (sample) [COMMA]
CSS file names to be used, in order. The names are referenced, in order, in efiles ('<link ... rel="stylesheet">').
You can only use filenames (not full paths).
The filenames are searched in the css directory, the application css directory and the current directory, in order.
The program includes a sample css, sample.t.css, and as a special case it can be abbreviated as sample (default).
- pdfname¶
- (None)
Specify the output PDF file name. If not provided (default), the program makes up some name. See PDF_File.
—
Note
For hookcmds below, see Hookcmds.
- precmd1¶
- (None)
[LINE][CMDS] Run arbitrary commands before the download action.
- postcmd1¶
- (None)
[LINE][CMDS] Run arbitrary commands after the download action.
- precmd2¶
- (None)
[LINE][CMDS] Run arbitrary commands before the extract action.
- postcmd2¶
- (None)
[LINE][CMDS] Run arbitrary commands after the extract action.
- precmd3¶
- (None)
[LINE][CMDS] Run arbitrary commands before the convert action.
- postcmd3¶
- (None)
[LINE][CMDS] Run arbitrary commands after the convert action.
- viewcmd¶
- (None)
[LINE][CMDS] Run arbitrary commands when specified in commandline options (-4 or --view).
- pre_each_cmd1¶
- (None)
[LINE][CMDS] Run arbitrary commands before each download.
- post_each_cmd1¶
- (None)
[LINE][CMDS] Run arbitrary commands after each download.
- pre_each_cmd2¶
- (None)
[LINE][CMDS] Run arbitrary commands before each extract.
- post_each_cmd2¶
- (None)
[LINE][CMDS] Run arbitrary commands after each extract.
- browsercmd¶
- (None)
[CMD] When the action is --browser, the default is just to call the Python stdlib webbrowser to open a browser. If that is not desirable, specify the open command here, e.g.:
    firefox 'site.slash_efile'
You have to use the magic word site.slash_efile for the filename. It evaluates to the intended URL version of efile (percent encoding etc.).
Style Section¶
The options in style section are used for css template files.
Note that users can always choose (static) css files
rather than css template files.
In that case, the style options have no effect.
So the options have no meaning in themselves.
The following explains their roles in the sample file
(sample.t.css).
- orientation¶
(portrait)
Specify the page orientation, portrait or landscape.
- portrait_size¶
(90mm 118mm)
Specify the portrait page size (width and height). The program uses this value when orientation is portrait.
The display size of common 6-inch e-readers seems to be around 90mm x 120mm. The default clips the height slightly, for versatility.
- landscape_size¶
(118mm 90mm)
Specify the landscape page size (width and height). The program uses this value when orientation is landscape.
- toc_depth¶
- (3) [INT]
Specify the (max) tree level of pdf bookmarks (Table of Contents). It uses html headings for structuring, so valid values are 0 to 6.
- font_family¶
("DejaVu Sans", sans-serif)
Specify the default font to use.
- font_mono¶
("Dejavu Sans Mono", monospace)
Specify the default monospaced font to use.
- font_serif¶
(None)
Not used.
- font_sans¶
(None)
Not used.
- font_size¶
(9px)
Specify the default font size.
- font_size_mono¶
(8px)
Specify the default monospaced font size.
- font_scale¶
(1.0)
Specify the scaling factor for css font_size and font_size_mono.
It makes it easier to test font sizes.
- line_height¶
(1.3)
Specify the default line height.
Converter Sections¶
The sections prince and weasyprint are converter sections.
They have common options.
When convert, only one converter is active,
and only the options of that converter’s section are active.
The commandline has the same options, to override them.
Note
To see the current values for each converter:
$ tosixinch -a --prince
$ tosixinch -a --weasyprint
- cnvpath¶
(prince)
The name or full path of the command, as you would type it in the shell. For ordinarily installed ones, the name alone suffices.
- css2¶
- (None)
[COMMA] Extra css files, just to pass to the converter commandline options.
It may be useful for converter specific features or troubles, although normally you can do that better with the css option and the template.
You can only use filenames (not full paths).
The filenames are searched in the css directory and the current directory, in order.
- cnvopts¶
- (None)
[CMD] Additional options to pass to the command, besides the css file option (which is added by the css2 option above if it is specified).
site.ini¶
site.ini should have many sections,
each holding the settings for some specific site or a part of the site.
They all have the same options.
The common options (the same ones as in tosixinch.ini)
are not described here.
Each section must have a match option.
This option is used as a glob string to match input rsrcs,
and consequently to select which section to use.
So section names themselves can be arbitrary.
- match¶
(None)
Glob string to match against the input rsrc.
The path separator ('/') is not special for wildcards (*?[]!). So, e.g. '*' matches any string, including all subdirectories. (Actually, it uses the fnmatch module, not the glob module.)
The program tries the values of this option from all the sections. The section whose match option matches the rsrc is used for the settings.
If there are multiple matches, the one with the most path separator characters ('/') is used (the scheme separator '//' in 'https?://' is not counted). If there are still multiple matches, the last one is used.
If there is no match, default settings are used, and the guess option is tried. In this case, a placeholder value http://tosixinch.example.com is set. (This imaginary site is used to make file paths in download and extract.)
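The selection rule can be sketched with the standard fnmatch module. The function names and section names below are made up for the example. Two points are assumptions: that the slashes are counted in the match value itself, and that matching is case-sensitive (fnmatchcase).

```python
import fnmatch
import re


def count_separators(pattern):
    """Count '/' in a match value, ignoring the '//' of the scheme separator."""
    return re.sub(r"^[^/]*://", "", pattern, count=1).count("/")


def choose_section(rsrc, sections):
    """Return the name of the best matching section, or None if none matches."""
    best, best_count = None, -1
    for name, pattern in sections.items():
        if fnmatch.fnmatchcase(rsrc, pattern):
            count = count_separators(pattern)
            if count >= best_count:  # '>=' lets the last one win ties
                best, best_count = name, count
    return best


sections = {
    "site": "https://example.com/*",
    "blog": "https://example.com/blog/*",
    "news": "https://example.com/news/*",
}
print(choose_section("https://example.com/blog/2020/post.html", sections))
```

Note that '*' in the first pattern also matches the blog URL, since '/' is not special. The more specific section wins only because its match value has more separators.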
- select¶
- (None)
[LINE] XPath strings to select elements from the dfile when extract. Only selected elements are included in the <body> tag of the new efile, discarding the others.
Each line in the value is joined with a bar string ('|') when evaluating.
- exclude¶
- (None)
[LINE] XPath strings to remove elements from the new efile after select. So you don’t need to exclude elements already excluded by select. As in select, each line in the value is joined with a bar string ('|').
- process¶
- (None)
[LINE] After select and exclude, arbitrary functions can be called if this option is specified.
Selection:
The functions must be top level ones.
A function is searched for in the user process directory and the application process directory, in order.
If the function name is found in multiple modules in the user process directory, the program raises an Error.
In that case, you can use dot notation. If the function name includes one dot ('.'), the program interprets it as <module name>.<function name>. Two or more dots are not supported.
Invocation:
The first argument of the functions is always doc, which the program provides. It is an lxml.html DOM object (HtmlElement), corresponding to the resultant efile after select and exclude.
The function can have additional arguments. A string after '?' (and before the next '?') is interpreted as an argument.
For example, 'aaa?bb?cc' is made into code
if 'aaa.py' is found in the user process directory:
    process.aaa(doc, bb, cc)
or if it is found in the application process directory:
    tosixinch.process.aaa(doc, bb, cc)
You don’t have to return anything, just manipulate doc as you like. The program uses the resultant doc subsequently.
See process.sample for included sample functions.
Example:
Let’s say you want to change h3 tags to div for http://somesite.com.
First, create a file in the process directory, e.g. ~/.config/tosixinch/process/myprocess.py.
Second, create a top level function, e.g.
    def heading_to_div(doc, heading):
        """Change some heading to div from argument e.g. 'h3'."""
        for el in doc.xpath('//' + heading):
            el.tag = 'div'
Third, write the configuration accordingly.
    [somesite]
    match=      http://somesite.com/*
    select=     ...
    process=    myprocess.heading_to_div?h3
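How a process value splits into module, function and arguments can be sketched as follows. The function parse_process_spec is a hypothetical illustration of the syntax described above, not the program's own parser.

```python
def parse_process_spec(spec):
    """Split '<module>.<function>?arg?arg' into (module, function, args)."""
    name, *args = spec.split("?")
    if name.count(".") > 1:
        # Two or more dots are not supported.
        raise ValueError("too many dots in function name: %r" % name)
    module, _, function = name.rpartition(".")
    return (module or None), function, args


print(parse_process_spec("aaa?bb?cc"))
print(parse_process_spec("myprocess.heading_to_div?h3"))
```

The arguments stay plain strings here, which is why heading_to_div above can concatenate heading directly into an XPath.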
- cookie¶
- (None)
[LINE] Some sites require confirmation before providing the documents (‘Are you over 18?’, ‘Agree to terms of service?’),
and urllib cannot handle these interactive communications.
By adding cookie data here (e.g. from your browser), you may be able to bypass them.
Note that it is not secure. Do not provide sensitive data.
- dprocess¶
- (None)
[LINE] When download, the program runs the functions specified by this option after getting the http response, and before serializing to html text.
For completeness, it also runs when the downloader is urllib, but the intended usage is for headless browsers.
For example, some webpages have folded contents, which users need to click (running javascript) to expand.
The mechanism is similar to process. Users define a function in a python file in the user dprocess directory, with agent as the first argument, and modify it. If necessary, they can define other arguments by using '?' (see process).
But what comes as agent depends on the actual downloader:
    urllib      http.client.HTTPResponse
    selenium    selenium.webdriver.remote.webdriver.WebDriver
So users should be careful. (For example, when you define dprocess in site.ini, it is advisable to also define downloader.)
Example:
    import time

    from selenium.webdriver.common.by import By

    def sitefoo_click(agent):
        # for selenium
        path = '//div[@class="see_details"]'
        # find_elements_by_xpath is gone in recent Selenium 4 releases;
        # find_elements(By.XPATH, ...) also works in older versions.
        elements = agent.find_elements(By.XPATH, path)
        for element in elements:
            element.click()
            time.sleep(1)
- inspect¶
- (get_links) [LINE]
When the action is inspect, the program runs the functions this option specifies.
This is similar to the extract action’s process, but inspect does not do anything before and after (select, exclude …, write to file).
Create Python functions in the same folder as process. The original, non-extracted html object is provided as the first argument doc, and the user does something with it, mostly printing something.
See process.inspect_sample for a few sample functions.