Config Options¶
Note
Default Value is specified in parentheses in the first lines.
Value Function is specified in brackets in the first lines.
tosixinch.ini¶
Note
Options with a star (*) are common options, shared with site.ini.
You can use them there to override the application-wide configuration.
General Section¶
- downloader *¶
(urllib)
Specify default downloader. Either urllib or headless.
- extractor *¶
(lxml)
Specify default extractor. Currently the only option is lxml.
- converter¶
(prince)
Specify default converter. Either prince or weasyprint.
- user_agent *¶
(some arbitrary browser user agent. Run 'tosixinch -a' to actually see.)
Specify user agent for downloader (only for urllib).
- browser_engine *¶
(selenium-firefox)
Specify browser engine when headless option is True. Either selenium-chrome or selenium-firefox.
- selenium_chrome_path *¶
- (None)
Specify the path of chromedriver for selenium (passed to the executable_path argument). Normally unnecessary.
- selenium_firefox_path *¶
- (None)
Specify the path of geckodriver for selenium (passed to the executable_path argument). Normally unnecessary.
- encoding *¶
- (utf-8, cp1252, latin_1)
[COMMA]
Specify preferred encoding or encodings. The first successful one is used. Encoding names are as specified in the codecs library, or ‘html5prescan’ or ‘ftfy’ if they are installed.
If the name is html5prescan, html5prescan tries to get a valid encoding declaration from the html. (The library strictly follows the html5 spec, and usually it is neither necessary nor useful. It is intended for occasional debugging.)
After successful decoding by one of the encodings, if the list includes ftfy, the ftfy.fixes.fix_encoding function is called with the decoded text. It may be able to fix some ‘mojibake’. (So it is always called last; its place in the list is irrelevant.)
Note
The included bash completion only completes canonical codec names (with underscores changed to dashes). But you can use any other names or aliases, as long as they are legal in your Python environment.
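The "first successful one is used" rule can be sketched as below. This is only an illustration, not the program's actual code: the function name decode_first is hypothetical, and the html5prescan and ftfy steps are left out.

```python
def decode_first(data, encodings, errors='strict'):
    """Return (text, encoding) for the first encoding that decodes data."""
    for encoding in encodings:
        try:
            return data.decode(encoding, errors), encoding
        except (UnicodeDecodeError, LookupError):
            # Decoding failed, or the codec name is unknown: try the next one.
            continue
    raise ValueError('no encoding in %r could decode the data' % (encodings,))


# b'caf\xe9' is not valid utf-8, so the second encoding is used.
text, used = decode_first(b'caf\xe9', ['utf-8', 'cp1252', 'latin_1'])
```

Note that with an error handler such as 'replace', the first encoding always succeeds, so the later entries in the list are never tried.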
- encoding_errors *¶
- (strict)
Specify codec Error Handler.
If you can’t run extract because of decoding errors, one solution is to change this option to ‘replace’ or ‘backslashreplace’.
- parts_download *¶
- (True)
[BOOL]
Web pages may have some component content. The most important ones are images, and currently the program is only concerned with images (in the html tag <img src=...>). The value specifies whether it downloads these components when extract.
Note downloading may occur anyway by pdf converters.
If this option is True, download links are rewritten to point to local dfiles. So downloading doesn’t happen when convert.
In general, pre-downloading is useful for multiple trials and layout checking.
If force_download is False (default), the program skips downloading if the file already exists.
- force_download *¶
- (False)
[BOOL]
By default, the program does not download if the destination file exists.
If this option is True:
In case of -1, it (re-)downloads URL even if dfile exists.
In case of -2, it (re-)downloads component files (images etc.) even if they exist.
But in one invocation, this re-downloading happens only once for one URL. (The program doesn’t download the same icon files again and again.)
- guess¶
- (
//div[@itemprop="articleBody"]
//div[@role="main"]
//div[@id="main"]
//div[@id="content"]
//div[@class="body"]
//article
)
[LINE]
If html rsrc doesn’t match any match option in site.ini, select is done according to this value.
The procedure:
The XPaths in this value are searched in order, line by line.
If a match is found and the match is a single element (not multiple occurrences), the program selects the element.
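The procedure above can be sketched roughly as follows. This is an illustration under assumptions, not the program's code: it uses the standard library ElementTree as a stand-in for lxml, the name guess_select is hypothetical, and it assumes that an XPath with multiple matches is simply skipped.

```python
import xml.etree.ElementTree as ET

GUESS = [
    '//div[@itemprop="articleBody"]',
    '//div[@role="main"]',
    '//div[@id="main"]',
    '//div[@id="content"]',
    '//article',
]


def guess_select(root, xpaths=GUESS):
    """Return the element matched uniquely by the first such XPath, or None."""
    for xpath in xpaths:
        # ElementTree only accepts relative paths, so '//x' becomes './/x'.
        matches = root.findall('.' + xpath)
        if len(matches) == 1:
            return matches[0]
    return None


html = ('<html><body><div id="main"><p>text</p></div>'
        '<article/><article/></body></html>')
selected = guess_select(ET.fromstring(html))
```

Here the div with id "main" is selected, because it is the first XPath in the list with exactly one match.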
- defaultprocess *¶
- (add_h1, youtube_video_to_thumbnail, convert_permalink_sign)
[LINE]
Before site specific process functions, the program applies default process functions to all html rsrc, according to this value.
The syntax is the same as the process option in site.ini.
About default functions:
add_h1: If there is no <h1>, make an <h1> tag from the <title> tag text. It is to make better pdf bookmarks (TOC).
youtube_video_to_thumbnail: Change embedded youtube video object to thumbnail image.
convert_permalink_sign: Remove permalink sign (‘¶’), for a few classes (‘headerlink’ etc.). Python documents tend to use them, and on pdf they are always visible, rather noisy.
When the default functions are undesirable for some site, please override this option in user site.ini.
- full_image *¶
- (200)
[INT]
If the width or height of an image (in pixels) is equal to or above this value, the class attribute tsi-tall or tsi-wide is added to the image tag: tsi-tall if the height/width ratio is greater than the ratio of the e-reader display, tsi-wide if the opposite.
By itself, it does nothing. However, in sample.css, it is used to make medium sized images expand to almost full display size, with small images (icon, logo, etc.) intact. The layout gets a bit uglier, but I think it is necessary for small e-reader displays.
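As a rough sketch of the rule above: the function name classify_image is hypothetical, and the display ratio is assumed here to come from the default portrait page size (90mm x 118mm), which may not be how the program computes it.

```python
def classify_image(width, height, full_image=200, display=(90, 118)):
    """Return 'tsi-tall', 'tsi-wide', or None for small images."""
    if width < full_image and height < full_image:
        # Small images (icons, logos) are left alone.
        return None
    display_ratio = display[1] / display[0]
    if height / width > display_ratio:
        return 'tsi-tall'
    return 'tsi-wide'
```

For example, a 300x600 image is classified as tsi-tall, a 600x300 image as tsi-wide, and a 100x100 image gets no class.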
- add_binary_extensions¶
(
3dm
3ds
3g2
3gp
7z
a
aac
adp
ai
aif
aiff
alz
ape
apk
appimage
ar
arj
asf
au
avi
bak
baml
bh
bin
bk
bmp
btif
bz2
bzip2
cab
caf
cgm
class
cmx
cpio
cr2
cur
dat
dcm
deb
dex
djvu
dll
dmg
dng
doc
docm
docx
dot
dotm
dra
DS_Store
dsk
dts
dtshd
dvb
dwg
dxf
ecelp4800
ecelp7470
ecelp9600
egg
eol
eot
epub
exe
f4v
fbs
fh
fla
flac
flatpak
fli
flv
fpx
fst
fvt
g3
gh
gif
graffle
gz
gzip
h261
h263
h264
icns
ico
ief
img
ipa
iso
jar
jpeg
jpg
jpgv
jpm
jxr
key
ktx
lha
lib
lvp
lz
lzh
lzma
lzo
m3u
m4a
m4v
mar
mdi
mht
mid
midi
mj2
mka
mkv
mmr
mng
mobi
mov
movie
mp3
mp4
mp4a
mpeg
mpg
mpga
mxu
nef
npx
numbers
nupkg
o
odp
ods
odt
oga
ogg
ogv
otf
ott
pages
pbm
pcx
pdb
pdf
pea
pgm
pic
png
pnm
pot
potm
potx
ppa
ppam
ppm
pps
ppsm
ppsx
ppt
pptm
pptx
psd
pya
pyc
pyo
pyv
qt
rar
ras
raw
resources
rgb
rip
rlc
rmf
rmvb
rpm
rtf
rz
s3m
s7z
scpt
sgi
shar
snap
sil
sketch
slk
smv
snk
so
stl
suo
sub
swf
tar
tbz
tbz2
tga
tgz
thmx
tif
tiff
tlz
ttc
ttf
txz
udf
uvh
uvi
uvm
uvp
uvs
uvu
viv
vob
war
wav
wax
wbmp
wdp
weba
webm
webp
whl
wim
wm
wma
wmv
wmx
woff
woff2
wrm
wvx
xbm
xif
xla
xlam
xls
xlsb
xlsm
xlsx
xlt
xltm
xltx
xm
xmind
xpi
xpm
xwd
xz
z
zip
zipx
)
[PLUS]
The program ignores rsrcs with binary-looking extensions, only when multiple rsrcs are provided.
This option value adds to or subtracts from the default add_binary_extensions list above.
The list is taken from Sindre Sorhus’ binary-extensions.
This is for user convenience. If you copy and paste many rsrcs, checking strange extensions is a bit of work. But I’m afraid sometimes it gets in the way.
(An example I found: some old unix software uses the doc extension for text, like README.doc.)
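The filtering described above might look like the following sketch. The function name filter_rsrcs is hypothetical, only a small excerpt of the extension list is used, and the exact matching rules (for example, case sensitivity) are assumptions.

```python
import os
from urllib.parse import urlsplit

# A small excerpt of the default list, for illustration only.
BINARY_EXTENSIONS = {'jpg', 'png', 'pdf', 'zip', 'doc'}


def filter_rsrcs(rsrcs, extensions=BINARY_EXTENSIONS):
    """Drop rsrcs with binary-looking extensions, if there are several rsrcs."""
    if len(rsrcs) < 2:
        # A single rsrc is always kept, whatever its extension.
        return list(rsrcs)
    kept = []
    for rsrc in rsrcs:
        path = urlsplit(rsrc).path
        ext = os.path.splitext(path)[1].lstrip('.')
        if ext not in extensions:
            kept.append(rsrc)
    return kept
```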
- add_clean_tags *¶
- (None)
[PLUS]
After select, exclude and process in extract, the program cleans the resultant html.
The tags in this option are stripped. The current default is none.
- add_clean_attrs *¶
- (color, width, height)
[PLUS]
After select, exclude and process in extract, the program cleans the resultant html.
The attributes in this option are stripped. The current default is color, width and height.
Most e-readers are black and white. Colors just make fonts harder to read.
Width and height conflict with user css rules.
- elements_to_keep_attrs *¶
- (
self::math
self::svg
self::node()[starts-with(@class, "MathJax")]
)
[LINE]
After select, exclude and process in extract, the program cleans the resultant html.
The program skips cleaning attributes for the elements that match one of the XPaths in this option.
The default is math, svg and some MathJax related tags. They have inter-related width and height information, which we usually want to keep.
Note XPaths are checked from each element, not from the root document. So the selectors are like above (not like e.g. '//math').
- ftype¶
- (None)
Specify file type when extract.
Valid values are: 'html', 'prose', 'nonprose', 'python'.
It needs improvement.
- textindent¶
(' --> ')
Set logical line continuation marker for nonprose texts.
See nonprose.
ConfigParser strips leading and trailing whitespace. So if you want actual whitespace, quote it as the default does. Quotes are stripped by the program in turn.
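The whitespace behavior can be checked with the standard library configparser. The section name and the helper strip_quotes below are hypothetical; the helper only imitates the quote stripping that the program is said to do.

```python
import configparser

parser = configparser.ConfigParser()
parser.read_string("[general]\nbare =   -->   \nquoted = '  -->  '\n")

# ConfigParser strips the whitespace around an unquoted value.
bare = parser['general']['bare']
# A quoted value keeps its inner whitespace (and, at this point, its quotes).
quoted = parser['general']['quoted']


def strip_quotes(value):
    """Remove one pair of matching surrounding quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in '\'"':
        return value[1:-1]
    return value


marker = strip_quotes(quoted)
```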
- trimdirs *¶
- (3)
[INT]
Shorten PDF table of contents title, if it is a local text file.
The PDF toc title for a local text file is made from its full path. If this trimdirs option value has no sign, that number of leading path segments is removed. If it has a minus sign, leading path segments are removed until that number of segments remains.
--trimdirs 0
    aaa/bbb/ccc/ddd/eee/fff
--trimdirs 2    # remove two segments
    ccc/ddd/eee/fff
--trimdirs -2   # reduce to two segments
    eee/fff
# c.f. no bounding errors
--trimdirs 100
    fff
--trimdirs -100
    aaa/bbb/ccc/ddd/eee/fff
Note html files always use html title (actual, or placeholder notitle). Remote text (non-html) files use the URL with the scheme (‘https://’) stripped.
C.f. the --check commandline option prints out these shortened names for local files. They include local html files, so it is not perfect, but it can be useful for checking and adjusting this trimdirs option.
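The trimming rule, including the "no bounding errors" cases, can be reproduced with a short function. This is a sketch written from the examples above, not the program's implementation; the name trim_dirs is hypothetical.

```python
def trim_dirs(path, trimdirs):
    """Shorten a '/'-separated path according to a trimdirs value."""
    segments = path.split('/')
    if trimdirs >= 0:
        # Remove leading segments, but always keep the last one.
        n = min(trimdirs, len(segments) - 1)
        return '/'.join(segments[n:])
    # Negative value: keep only that many trailing segments.
    keep = min(-trimdirs, len(segments))
    return '/'.join(segments[-keep:])
```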
- raw¶
- (False)
[BOOL]
If True, when convert, the program processes rsrcs. Normally (if it is False), it processes efile.
- css *¶
- (sample)
[COMMA]
CSS file names to be used in order. The names are referenced, in order, in efiles ('<link ... rel="stylesheet">').
You can only use the filenames (not full paths).
The filenames are searched in css directory, application css directory and current directory, in order.
The program includes sample css sample.t.css, and as a special case, it can be abbreviated as sample (default).
- pdfname¶
- (None)
Specify output PDF file name. If not provided (default), the program makes up some name. See PDF_File.
—
Note
For hookcmds below, see Hookcmds.
- precmd1¶
- (None)
[LINE][CMDS]
Run arbitrary commands before download action.
- postcmd1¶
- (None)
[LINE][CMDS]
Run arbitrary commands after download action.
- precmd2¶
- (None)
[LINE][CMDS]
Run arbitrary commands before extract action.
- postcmd2¶
- (None)
[LINE][CMDS]
Run arbitrary commands after extract action.
- precmd3¶
- (None)
[LINE][CMDS]
Run arbitrary commands before convert action.
- postcmd3¶
- (None)
[LINE][CMDS]
Run arbitrary commands after convert action.
- viewcmd¶
- (None)
[LINE][CMDS]
Run arbitrary commands when specified in commandline options (-4 or --view).
- pre_each_cmd1¶
- (None)
[LINE][CMDS]
Run arbitrary commands before each download.
- post_each_cmd1¶
- (None)
[LINE][CMDS]
Run arbitrary commands after each download.
- pre_each_cmd2¶
- (None)
[LINE][CMDS]
Run arbitrary commands before each extract.
- post_each_cmd2¶
- (None)
[LINE][CMDS]
Run arbitrary commands after each extract.
- browsercmd¶
- (None)
[CMD]
When action is --browser, the default is just to call the Python stdlib webbrowser to open a browser. If that is not desirable, specify the open command here, e.g.:
firefox 'site.slash_efile'
You have to use the magic word site.slash_efile for the filename. It evaluates to the intended URL version of efile (percent encoding etc.).
Style Section¶
The options in style section are used for css template files.
Note that users can always choose (static) css files rather than css template files.
In that case, the style options have no effect.
So the options have no meaning by themselves.
In the following, their roles in the sample file (sample.t.css) are explained.
- orientation¶
(portrait)
Specify page orientation, portrait or landscape.
- portrait_size¶
(90mm 118mm)
Specify portrait page size (width and height). The program uses this value when orientation is portrait.
The display size of common 6-inch e-readers seems to be around 90mm x 120mm. Here the default thinly clips on height, for versatility.
- landscape_size¶
(118mm 90mm)
Specify landscape page size (width and height). The program uses this value when orientation is landscape.
- toc_depth¶
- (3)
[INT]
Specify (max) tree level of pdf bookmarks (Table of Contents). It uses html headings for structuring, so valid values are 0 to 6.
- font_family¶
("DejaVu Sans", sans-serif)
Specify default font to use.
- font_mono¶
("Dejavu Sans Mono", monospace)
Specify default monospaced font to use.
- font_serif¶
(None)
Not used.
- font_sans¶
(None)
Not used.
- font_size¶
(9px)
Specify default font size.
- font_size_mono¶
(8px)
Specify default monospaced font size.
- font_scale¶
(1.0)
Specify scaling factor for css font_size and font_size_mono.
It is to make it easier to test font sizes.
- line_height¶
(1.3)
Specify default line height.
Converter Sections¶
Sections prince and weasyprint are converter sections.
They have common options.
When convert, only one converter is active, and only the options of that converter’s section are active.
The commandline has the same options, to override them.
Note
To see the current values for each converter:
$ tosixinch -a --prince
$ tosixinch -a --weasyprint
- cnvpath¶
(prince)
The name or full path of the command as you type it in the shell. For ordinarily installed ones, the name alone would suffice.
- css2¶
- (None)
[COMMA]
Extra css files just to pass to converter commandline options.
It may be useful for converter specific features or troubles. Normally, though, you can do that better with the css option and the template.
You can only use the filenames (not full paths).
The filenames are searched in css directory and current directory, in order.
- cnvopts¶
- (None)
[CMD]
Additional options to pass to the command, besides the css file option (which is added by the css2 option above if it is specified).
site.ini¶
site.ini should have many sections, each holding the settings for some specific site or a part of a site.
They all have the same options. The common options (the same ones as in tosixinch.ini) are not described here.
Each section must have a match option.
This option is used as a glob string to match input rsrcs, and consequently to select which section to use.
So section names themselves can be arbitrary.
- match¶
(None)
Glob string to match against input rsrc.
Path separator ('/') is not special for wildcards (*?[]!). So, e.g. '*' matches any strings including all subdirectories. (Actually, it uses the fnmatch module, not the glob module.)
The program tries the values of this option from all the sections. The section whose match option matches the rsrc is used for the settings.
If there are multiple matches, the one with the most path separator characters ('/') is used (the scheme separator '//' in 'https?://' is not counted). If there are still multiple matches, the last one is used.
If there is no match, default settings are used, and the guess option is tried. In this case, a placeholder value http://tosixinch.example.com is set. (This imaginary site is used to make file paths in download and extract.)
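The matching rules above can be sketched with the standard library fnmatch module. This is an illustration only: the function names are hypothetical, the sections are given as a plain list of (name, pattern) pairs, and counting the separators in the match pattern (rather than in the rsrc) is an assumption.

```python
import fnmatch
import re


def count_separators(pattern):
    """Count '/' in a pattern, ignoring the '//' of the scheme separator."""
    rest = re.sub(r'^[^/]*://', '', pattern)
    return rest.count('/')


def choose_section(rsrc, sections):
    """Return the name of the best matching section, or None.

    sections is a list of (section name, match pattern), in file order.
    """
    best, best_count = None, -1
    for name, pattern in sections:
        if not fnmatch.fnmatchcase(rsrc, pattern):
            continue
        count = count_separators(pattern)
        # '>=' makes the last section win when the counts are equal.
        if count >= best_count:
            best, best_count = name, count
    return best


sections = [
    ('site', 'https://example.com/*'),
    ('blog', 'https://example.com/blog/*'),
    ('blog2', 'https://example.com/blog/*'),
]
```

With these sections, 'https://example.com/blog/post.html' matches all three patterns, and the last of the two most specific ones (blog2) is chosen.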
- select¶
- (None)
[LINE]
XPath strings to select elements from dfile when extract. Only selected elements are included in the <body> tag of the new efile, discarding others.
Each line in the value will be connected with a bar string ('|') when evaluating.
- exclude¶
- (None)
[LINE]
XPath strings to remove elements from the new efile after select. So you don’t need to exclude elements already excluded by select. As in select, each line in the value will be connected with a bar string ('|').
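For select and exclude, connecting the lines of a value into one XPath union expression could look like this. The function name join_xpaths is hypothetical, and skipping blank lines is an assumption.

```python
def join_xpaths(value):
    """Join the non-empty lines of a [LINE] value with '|'."""
    lines = [line.strip() for line in value.splitlines() if line.strip()]
    return '|'.join(lines)


xpath = join_xpaths('''
//div[@id="main"]
//article
''')
```

The result is a single expression, '//div[@id="main"]|//article', which an XPath engine evaluates as the union of both selections.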
- process¶
- (None)
[LINE]
After select and exclude, arbitrary functions can be called if this option is specified.
Selection:
The functions must be top level ones.
They are searched for in user process directory and application process directory, in order.
If the function name is found in multiple modules in user process directory, the program raises an Error.
In that case, you can use dot notation. If the function name includes one dot ('.'), the program interprets it as <module name>.<function name>. Two or more dots are not supported.
Invocation:
The first argument of the functions is always doc, which the program provides. It is an lxml.html DOM object (HtmlElement), corresponding to the resultant efile after select and exclude.
The function can have additional arguments. A string after '?' (and before the next '?') is interpreted as an argument.
For example, 'aaa?bb?cc' is made into the following code, if 'aaa.py' is found in user process directory:
process.aaa(doc, bb, cc)
or if it is found in application process directory:
tosixinch.process.aaa(doc, bb, cc)
You don’t have to return anything, just manipulate doc as you like. The program uses the resultant doc subsequently.
See process.sample for included sample functions.
Example:
Let’s say you want to change h3 tags to div for http://somesite.com.
First, create a file in process directory, e.g. ~/.config/tosixinch/process/myprocess.py.
Second, create a top level function, e.g.
def heading_to_div(doc, heading):
    """Change some heading to div from argument e.g. 'h3'."""
    for el in doc.xpath('//' + heading):
        el.tag = 'div'
Third, write configuration accordingly.
[somesite]
match=      http://somesite.com/*
select=     ...
process=    myprocess.heading_to_div?h3
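The dot and '?' syntax of a process value can be illustrated with a small parser. This is a sketch of the documented syntax only, not the program's code: the name parse_process and the returned tuple are hypothetical, and looking up and calling the function is left out.

```python
def parse_process(value):
    """Split a process value into (module, function, arguments)."""
    name, *args = value.split('?')
    if name.count('.') > 1:
        raise ValueError('two or more dots are not supported: %r' % name)
    module, _, function = name.rpartition('.')
    # module is None when no dot notation is used.
    return (module or None), function, args
```

For the configuration above, 'myprocess.heading_to_div?h3' is split into the module myprocess, the function heading_to_div and the single argument 'h3'.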
- clean¶
(Note there is no option named clean. Here I’m just describing what it does.)
After select, exclude and process in extract, the program cleans the resultant html.
- tags:
According to add_clean_tags.
- attributes:
According to add_clean_attrs.
- javascript:
All inline javascript and javascript source references are unconditionally stripped.
(In download, we occasionally need javascript, and in that case we might use headless browsers. In extract, javascript has already rendered the contents. So we shouldn’t need it any more.)
- css:
All style attributes and css source references are stripped, with one exception.
If a tag has 'tsi-keep-style' in class attributes, style attributes are kept intact. It can be used in process functions. If you want to keep or create some inline style, add this class attribute.
# removed (becomes just '<div>')
<div style="font-weight:bold;">
# not removed
<div class="tsi-keep-style other-values" style="font-weight:bold;">
- skip tags:
According to elements_to_keep_attrs. The program skips cleaning the matched elements (and all sub-elements), if the elements are not already removed by add_clean_tags.
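The 'tsi-keep-style' exception can be illustrated as below. This is a simplified sketch with the standard library ElementTree instead of lxml; the function name clean_styles is hypothetical, and the other cleaning steps (tags, attributes, javascript, css references) are not shown.

```python
import xml.etree.ElementTree as ET


def clean_styles(root):
    """Remove style attributes, except on elements marked to keep them."""
    for el in root.iter():
        classes = el.get('class', '').split()
        if 'tsi-keep-style' in classes:
            continue
        el.attrib.pop('style', None)


root = ET.fromstring(
    '<body>'
    '<div style="font-weight:bold;">a</div>'
    '<div class="tsi-keep-style other-values" style="font-weight:bold;">b</div>'
    '</body>')
clean_styles(root)
first, second = list(root)
```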
- cookie¶
- (None)
[LINE]
Some sites require confirmation before providing the documents. (‘Are you over 18?’, ‘Agree to terms of service?’)
And urllib cannot handle these interactive communications.
By adding cookie data here (e.g. from your browsers), you may be able to bypass them.
Note it is not secure. Do not provide sensitive data.
- dprocess¶
- (None)
[LINE]
When download, the program runs functions specified by this option after getting the http response, and before serializing to html text.
For completeness, it also runs when downloader is urllib, but the supposed usage is for headless browsers.
For example, some webpages have folded contents which users need to click and run javascript to expand.
The mechanism is similar to process. Users define a function in a python file in user dprocess directory, with agent as the first argument, and modify it. If necessary, they can define other arguments by using '?' (see process).
But what comes as agent depends on what the actual downloader is:
urllib      http.client.HTTPResponse
selenium    selenium.webdriver.remote.webdriver.WebDriver
So users should be careful. (For example, when you define dprocess in site.ini, it is advisable to also define downloader.)
Example:
import time

def sitefoo_click(agent):
    # for selenium
    path = '//div[@class="see_details"]'
    # Recent Selenium 4 releases removed find_elements_by_xpath;
    # there, use agent.find_elements('xpath', path) instead.
    elements = agent.find_elements_by_xpath(path)
    for element in elements:
        element.click()
        # give the page a moment to expand the content
        time.sleep(1)
- inspect¶
- (get_links)
[LINE]
When action is inspect, the program runs the functions this option specifies.
This is similar to extract action’s process, but inspect does not do anything before and after (select, exclude …, write to file).
Create Python functions in the same folder as process. The original non-extracted html object is provided as the first argument doc, and users do something with it, mostly print something.
See process.inspect_sample for a few sample functions.