Topics¶
‘Advanced’ subjects are discussed here.
Text Format¶
(Experimental)
When extract runs, the program first checks whether the content is really html
(before the main extract procedures: select, exclude, process, and clean).
Currently, only the existence of an '<html>' tag is checked.
Even this is rather strict. I assume that loose or partial htmls are for presentation or software testing, and they are normally expected to be read as raw text.
If the program judges that the content is not html, the html extraction is skipped.
The text extraction procedure begins instead, which basically puts all the text content inside a <pre> tag in an html file.
The program separates text into three types:
- prose
- non-prose
- code
It adds some informative attributes to the pre tag it creates:
- For prose, class="tsi-text tsi-prose".
- For nonprose, class="tsi-text tsi-nonprose".
- For code, class="tsi-text tsi-code".
- For python code, class="tsi-text tsi-code tsi-python".
In the case of code, it also adds the same attributes to the other new tags it creates (h2, h3, and span).
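As a rough illustration of the steps above (not the program's actual code), the following Python sketch tests for an html tag and otherwise wraps the text in a pre tag carrying the class attributes listed above. The names FTYPE_CLASSES, looks_like_html and wrap_text are made up for this example, and the case-insensitive substring test is only an assumption about how such a check might be written.

```python
import html

# Class attributes as listed above, keyed by a hypothetical type name.
FTYPE_CLASSES = {
    "prose": "tsi-text tsi-prose",
    "nonprose": "tsi-text tsi-nonprose",
    "code": "tsi-text tsi-code",
    "python": "tsi-text tsi-code tsi-python",
}


def looks_like_html(text):
    """Return True if the text contains an opening html tag (assumed check)."""
    return "<html" in text.lower()


def wrap_text(text, ftype="prose"):
    """Escape raw text and put it inside a single pre tag in an html document."""
    body = html.escape(text, quote=False)
    return '<html><body><pre class="%s">%s</pre></body></html>' % (
        FTYPE_CLASSES[ftype], body)


if __name__ == "__main__":
    source = "if a < b:\n    print(a)"
    if not looks_like_html(source):
        print(wrap_text(source, "python"))
```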
prose¶
prose is supposed to be the general content type.
The paragraph should be the main unit, and adding extra line wraps should not change the major semantic structure.
In most cases, you should have a pre tag rule like this in your css:
pre {
    white-space: pre-wrap;
}
non-prose¶
non-prose is the type in which line breaks are significant (source code, verse, play scripts etc.).
We should keep newlines, so css shouldn’t wrap long lines.
But this is not possible in many cases because e-reader screens are so small.
As a solution, the program pre-processes the text so that no line exceeds a fixed length, and attaches a label to wrapped lines, according to the settings textwidth and textindent respectively, so that readers can tell source line breaks from editorial layout line breaks.
For a short example:
nums = [1, 2, 3, 4, 5, 6]
becomes:
nums = [1, 2, 3, 4,
>>> 5, 6]
when textwidth is around 20 and textindent is ' >>> '.
Note that you have to test and adjust textwidth carefully, depending on the font, margins, device etc.
If it is too short, the e-reader display will have big blanks on the right. If it is too long, css either triggers auto-wrap and introduces confusing line breaks, or just puts the rest of the text outside of the display (invisible).
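The wrapping described above can be sketched in Python as follows. This is a minimal illustration, not the program's own algorithm: the function name wrap_line is hypothetical, it breaks at the last space that fits, and it does not try to preserve leading indentation of the source line.

```python
def wrap_line(line, textwidth=20, textindent=" >>> "):
    """Split one source line into display lines of at most textwidth chars."""
    if len(textindent) >= textwidth:
        raise ValueError("textindent must be shorter than textwidth")
    pieces = []
    prefix = ""  # the first piece carries no label
    rest = line
    while len(prefix) + len(rest) > textwidth:
        room = textwidth - len(prefix)
        cut = rest.rfind(" ", 0, room + 1)  # last space that still fits
        if cut <= 0:
            cut = room  # no usable space: hard break
        pieces.append(prefix + rest[:cut].rstrip())
        rest = rest[cut:].lstrip()
        prefix = textindent  # label every wrapped piece
    pieces.append(prefix + rest)
    return pieces


if __name__ == "__main__":
    for piece in wrap_line("nums = [1, 2, 3, 4, 5, 6]"):
        print(piece)
```

With the default arguments, the sample line is split after "4," and the remainder is prefixed with the label, matching the example above.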
code¶
Note
_pcode is now the recommended method to format source code.
See _pcode.
code is a special case of non-prose, and currently only for Python source code.
It adds pdf bookmarks and references for some identifiers.
For this purpose, class and function names are wrapped in h2, h3 and span.
Since this is a very special usage of tags, you need to create very special css rules (using the 'tsi-code' class attribute).
See sample.t.css for an example.
TOC¶
TOC means Table of Contents, or pdf bookmarks.
Concept¶
Given the following rsrcs.txt:
https://somesite.com/index.html (1)
# Alice's articles (2)
https://somesite.com/alice/article/aaa.html (3)
https://somesite.com/alice/article/bbb.html (4)
https://somesite.com/alice/article/ccc.html (5)
# Bob's articles (6)
https://somesite.com/bob/article/xxx.html (7)
https://somesite.com/bob/article/yyy.html (8)
The program ordinarily creates top level pdf bookmarks like this:
-- index
-- aaa
-- bbb
-- ccc
-- xxx
-- yyy
The TOC feature helps create pdf bookmarks with one more level of structure, like this:
-- index
-- Alice's articles
    -- aaa
    -- bbb
    -- ccc
-- Bob's articles
    -- xxx
    -- yyy
To do that, the --toc action creates:
- new htmls. The h1 strings are made from the hash comment lines (2 and 6), and the contents are made from the children htmls (3, 4 and 5, and 7 and 8).
- a new rfile (tocfile), made to refer to the newly created htmls instead of the now redundant children htmls.
The --convert action, in turn, reads tocfile instead of the original rfile, if tocfile exists and its mtime is newer.
So if you run
$ tosixinch -12
$ tosixinch --toc
$ tosixinch -3
or
$ tosixinch -123 --toc
the program creates a more structured version of the pdf file.
Rules¶
The toc action treats '#' as a special chapter directive.
So comments in rfile are only the lines beginning with ';'.
(In other actions, both '#' and ';' are comments.)
The toc action creates tocfile in the current directory, adding '-toc' to the rfile name.
(When --file is 'rsrcs.txt' (default), the name of tocfile is 'rsrcs-toc.txt'.)
So it is an error when rfile is not provided (that is, use --file or the implicit rsrcs.txt, not --input).
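The naming rule above, together with the mtime check that convert performs, might look roughly like this in Python. The function names toc_name and choose_rfile are hypothetical, and inserting '-toc' before the file extension is inferred only from the 'rsrcs-toc.txt' example.

```python
import os


def toc_name(rfile):
    """Build the tocfile name by adding '-toc' before the extension."""
    root, ext = os.path.splitext(rfile)
    return root + "-toc" + ext


def choose_rfile(rfile):
    """Return tocfile if it exists and is newer than rfile, otherwise rfile."""
    tocfile = toc_name(rfile)
    if (os.path.exists(tocfile)
            and os.path.getmtime(tocfile) > os.path.getmtime(rfile)):
        return tocfile
    return rfile


if __name__ == "__main__":
    print(toc_name("rsrcs.txt"))
```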
The toc action processes ordinary efiles, bundling some of them and creating new htmls.
Table of Contents adjustments are done simply by lowering heading levels (that is, increasing the heading numbers: h1 becomes h2, and so on).
It first reads rsrcs.txt.
If there is a line starting with '#', it is interpreted as a new chapter (a new html <title> and a new '<h1>' text).
The following lines become sections of the chapter, until the next '#' line begins.
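The grouping rule above can be sketched in Python as follows. The function name parse_chapters and the (title, rsrcs) tuple layout are assumptions for illustration only; rsrcs that come before the first '#' line are kept as top-level entries with a title of None.

```python
def parse_chapters(lines):
    """Group rfile lines into (chapter_title, [rsrcs]) pairs.

    Lines starting with ';' are comments, lines starting with '#'
    open a new chapter, and other lines are rsrcs.
    """
    groups = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        if line.startswith("#"):
            current = (line.lstrip("#").strip(), [])
            groups.append(current)
        elif current is None:
            groups.append((None, [line]))  # rsrc outside any chapter
        else:
            current[1].append(line)
    return groups


if __name__ == "__main__":
    sample = [
        "https://somesite.com/index.html",
        "# Alice's articles",
        "https://somesite.com/alice/article/aaa.html",
        "# Bob's articles",
        "https://somesite.com/bob/article/xxx.html",
    ]
    for title, rsrcs in parse_chapters(sample):
        print(title, rsrcs)
```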
To use the same example:
https://somesite.com/index.html (1)
# Alice's articles (2)
https://somesite.com/alice/article/aaa.html (3)
https://somesite.com/alice/article/bbb.html (4)
https://somesite.com/alice/article/ccc.html (5)
# Bob's articles (6)
https://somesite.com/bob/article/xxx.html (7)
https://somesite.com/bob/article/yyy.html (8)
toc tracks or creates these files:
_htmls/somesite.com/index.html (11)
_htmls/tosixinch.example.com/alices-articles (12)
_htmls/tosixinch.example.com/bobs-articles (13)
tosixinch.example.com is an imaginary placeholder host.
(11)
(1) is outside of the new chapter structure, so the program doesn’t create a file for it; it just keeps track of (1)’s efile.
(12)
The program creates this new html, whose <h1> is line (2), and whose <body> consists of the (previous) <body> of (3), (4) and (5), with their <h1> changed to <h2>, <h2> to <h3> etc. (<h6> is kept as is.)
So the three html files below would become the fourth file.
<html>
  <body>
    <h1>aaa</h1>
    <p>this is aaa.</p>
  </body>
</html>

<html>
  <body>
    <h1>bbb</h1>
    <p>this is bbb.</p>
  </body>
</html>

<html>
  <body>
    <h1>ccc</h1>
    <p>this is ccc.</p>
  </body>
</html>

<html>
  <body>
    <h1>Alice's articles</h1>
    <div class='tsi-body-merged'>
      <h2>aaa</h2>
      <p>this is aaa.</p>
    </div>
    <div class='tsi-body-merged'>
      <h2>bbb</h2>
      <p>this is bbb.</p>
    </div>
    <div class='tsi-body-merged'>
      <h2>ccc</h2>
      <p>this is ccc.</p>
    </div>
  </body>
</html>
(13)
The same as (12).
It also creates rsrcs-toc.txt, which contains:
https://somesite.com/index.html (21)
http://tosixinch.example.com/alices-articles (22)
http://tosixinch.example.com/bobs-articles (23)
(21)(22)(23) are the names of rsrcs, corresponding to (11)(12)(13) (efiles).
So convert doesn’t do anything special for rsrcs-toc.txt; it just processes the pre-built htmls.
Note
The new html collects all the css files referenced in children htmls in order. So, if you are using different css files for sites, you should control the effects carefully.
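The heading adjustment and merging described for (12) can be illustrated with a small Python sketch. It uses a regular expression as a stand-in for real html parsing, so it is only an approximation; the function names demote_headings and merge_bodies are hypothetical, and the title is inserted without escaping.

```python
import re

# Matches the start of an opening or closing h1..h6 tag.
_HEADING = re.compile(r"(</?h)([1-6])\b", re.IGNORECASE)


def demote_headings(body):
    """Change h1 to h2, h2 to h3 and so on; h6 is kept as is."""
    def repl(match):
        level = min(int(match.group(2)) + 1, 6)
        return match.group(1) + str(level)
    return _HEADING.sub(repl, body)


def merge_bodies(title, bodies):
    """Build one html from a chapter title and the children body contents."""
    parts = ["<h1>%s</h1>" % title]
    for body in bodies:
        parts.append(
            "<div class='tsi-body-merged'>%s</div>" % demote_headings(body))
    return "<html><body>%s</body></html>" % "".join(parts)


if __name__ == "__main__":
    print(merge_bodies("Alice's articles",
                       ["<h1>aaa</h1><p>this is aaa.</p>"]))
```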
Replace¶
If there is a file 'replace.txt' in userdir, it is used for regex preprocessing of rsrcs.
The rsrcs matching a pattern are internally changed to the replacement rsrcs, and processed accordingly.
If there are lines in the file:
https://www\.reddit\.com/
https://old.reddit.com/
the first line is a regex pattern, and the second line is a regex replacement (for Python re.sub()). So
$ tosixinch -i https://www.reddit.com/aaa.html -123
downloads, extracts and creates the pdf file from 'https://old.reddit.com/aaa.html'.
The format of the file is:
- The file consists of zero or more units.
- A unit consists of:
    - one regex pattern line
    - one regex replacement line
    - one or more blank lines or EOF
So if there are lines, they always come as two consecutive lines, and units are separated by blank lines. (Blank lines at the very beginning and the very end of the file are optional.)
The lines starting with '#' are ignored (comments).
You can put them on any line in a unit.
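A minimal Python sketch of this file format follows. It is an illustration under assumptions, not the program's parser: the function names parse_replace and apply_replace are hypothetical, and applying every pattern in file order is a guess about the behavior.

```python
import re


def parse_replace(text):
    """Return a list of (pattern, replacement) pairs from replace.txt text."""
    pairs = []
    unit = []
    # The extra empty string closes the last unit at EOF.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line.startswith("#"):
            continue  # comment lines may appear anywhere
        if line:
            unit.append(line)
            continue
        if unit:
            if len(unit) != 2:
                raise ValueError("a unit must be two lines: %r" % unit)
            pairs.append((unit[0], unit[1]))
            unit = []
    return pairs


def apply_replace(rsrc, pairs):
    """Apply each (pattern, replacement) pair to the rsrc with re.sub."""
    for pattern, replacement in pairs:
        rsrc = re.sub(pattern, replacement, rsrc)
    return rsrc


if __name__ == "__main__":
    sample = "# use old reddit\nhttps://www\\.reddit\\.com/\nhttps://old.reddit.com/\n"
    print(apply_replace("https://www.reddit.com/aaa.html", parse_replace(sample)))
```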
Hookcmds¶
Precmds and Postcmds¶
Before and after each main action ('-1', '-2' and '-3'), the program calls arbitrary commands, according to the precmds and postcmds options in tosixinch.ini.
One useful use case of postcmds is notification, since download and convert sometimes take time.
For example:
postcmd1= notify-send -t 3000 'Done -- tosixinch.download'
should show a notification balloon when download is complete.
Variables:
The script directory is inserted at the head of $PATH.
So you can call your custom scripts by filename alone (without the full path), if they are in there.
If a word in the statement begins with 'conf.', and the rest is a dot-separated identifier ([a-zA-Z_][a-zA-Z_0-9]+), it is evaluated as an attribute of the conf object. For example:
postcmd1= echo conf._configdir conf._userdir
will print the application config directory name and the user config directory name.
(For more advanced usage, you need to peek in the source code.
It uses eval, so be careful.)
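To make the word substitution concrete, here is a Python sketch of the idea. It deliberately uses getattr instead of eval, so it is a simplified stand-in and not how the program itself evaluates words; the function name expand_words and the sample values are hypothetical.

```python
import re
from types import SimpleNamespace

# 'conf.' followed by dot-separated identifiers, using the pattern quoted above.
_CONF_WORD = re.compile(r"^conf((?:\.[a-zA-Z_][a-zA-Z_0-9]+)+)$")


def expand_words(words, conf):
    """Replace words like 'conf._userdir' with the attribute value."""
    expanded = []
    for word in words:
        match = _CONF_WORD.match(word)
        if not match:
            expanded.append(word)
            continue
        obj = conf
        for name in match.group(1).lstrip(".").split("."):
            obj = getattr(obj, name)  # walk the attribute chain
        expanded.append(str(obj))
    return expanded


if __name__ == "__main__":
    # Stand-in conf object with made-up directory names.
    conf = SimpleNamespace(_configdir="/app", _userdir="/home/me/.config/tosixinch")
    print(expand_words(["echo", "conf._configdir", "conf._userdir"], conf))
```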
Running Module:
If a command consists of one word, without a dot, and a file with that name plus the '.py' extension actually exists in the script directory, the program runs the command as a Python module internally (as opposed to running it as an external system subprocess).
That is, if a cmd is ['foo'], for example:
precmd1= foo
and there is a file foo.py in the script directory, the program does roughly:
import script.foo
script.foo.run(conf, site)
So the module must have a run function with this signature.
(In this context, site should be None.
Whole-action hookcmds only have the application level configuration.
Each-action hookcmds (see below) are given site.)
userdir is inserted into sys.path (as sys.path[0]).
So if you want to import sibling modules in the program file, refer to them from the script package, e.g.
import script.bar
from script import baz
The difference from running a subprocess is that it should be a bit faster, and conf and site are writable.
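A hook module of this kind could be as small as the following sketch. The file name foo.py follows the example above; the message text is made up, and the only part taken from the description is the run(conf, site) signature.

```python
# foo.py: a hypothetical hook module placed in the script directory.


def run(conf, site):
    """Entry point called by the program as script.foo.run(conf, site)."""
    # site is None for whole-action hookcmds, and set for each-rsrc hookcmds.
    scope = "whole action" if site is None else "each rsrc"
    print("foo hook called (%s)" % scope)
    return 0  # continue with the next command, if any
```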
Note
If you want to run a python file as a subprocess, put in the actual filename:
precmd1= foo.py
Multiple Commands:
Their value function signature is [LINE][CMDS], that is, you can run multiple commands in a hookcmd, one command for each line.
If the return code of a command is 0, the program runs the next command, if any.
If the return code of a command is 100, the program skips the following commands, if any.
If the return code of a command is 101, and the command is one of precmds (not postcmds), the program skips the following commands and the following action altogether. The following postcmds are still executed.
If the return code of a command is 102, the program skips the following postcmds in addition.
precmd:     cmd, cmd, cmd, cmd, cmd...
                 | 100     | 101     | 102
                 |         |         |
              action       |         |
                           |         |
                        postcmd      |
                                     |
                          (to next action group)
When running a subprocess, other return codes (not 0, 100, 101, 102) abort the program.
When running a module, any other return codes and values (not 0, 100, 101, 102) are interpreted as 0.
(This is to permit the normal Python return value of None.
Python itself will abort the program if something goes wrong.)
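The return code rules can be summarized in a short Python sketch. It models the module case, where any other value counts as 0, and it is not the program's own control flow: the function name run_group and the (name, callable) pairs are hypothetical, and treating 100 in postcmds as "stop the remaining postcmds" follows the general rule above.

```python
STOP_CMDS = 100    # skip the remaining commands of this hookcmd
SKIP_ACTION = 101  # precmds only: also skip the action
SKIP_POST = 102    # additionally skip the postcmds


def run_group(precmds, action, postcmds):
    """Run one action group and return the names of what actually ran.

    precmds and postcmds are lists of (name, callable) pairs; each
    callable returns a return code, and any other value counts as 0.
    """
    ran = []
    code = 0
    for name, cmd in precmds:
        ran.append(name)
        code = cmd()
        if code in (STOP_CMDS, SKIP_ACTION, SKIP_POST):
            break
        code = 0
    if code not in (SKIP_ACTION, SKIP_POST):
        action()
        ran.append("action")
    if code != SKIP_POST:
        for name, cmd in postcmds:
            ran.append(name)
            if cmd() == STOP_CMDS:
                break
    return ran


if __name__ == "__main__":
    pre = [("a", lambda: 0), ("b", lambda: 101), ("c", lambda: 0)]
    print(run_group(pre, lambda: None, [("p", lambda: None)]))
```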
Viewcmd¶
A special case of hookcmds is viewcmd.
viewcmd triggers when the -4 or --view option is supplied.
But actually there is no action called 4 or view.
It is intended to open a pdf viewer after pdf generation is done (-3).
So, if you are using okular as the pdf viewer,
# in tosixinch.ini
viewcmd= okular conf.pdfname
$ tosixinch -4
will open the viewer with the generated pdf file.
Also, the program includes a sample file _viewer.py. (It does basically the same thing as above, but cancels duplicate openings).
Pre_Each_Cmds and Post_Each_Cmds¶
An action group consists of precmd, action and postcmd.
But for download or extract, the action itself is a collection of jobs, one job for each rsrc.
For these jobs, there are corresponding pre- and post- hookcmds.
precmd                     pre_each_cmd
action (rsrcs) ---+--- job (an rsrc)
postcmd           |        post_each_cmd
                  |
                  |        pre_each_cmd
                  +--- job (an rsrc)
                  |        post_each_cmd
                  |
                  :        ...
The specification (return codes etc.) is the same as precmds and postcmds.
In this context, there are rsrc-specific configurations, in addition to the application level configuration.
So you can use the site variable, in addition to conf:
If a word in the statement begins with 'site.', and the rest is a dot-separated identifier ([a-zA-Z_][a-zA-Z_0-9]+), it is evaluated as an attribute of the site object. For example:
post_each_cmd1= echo site.efile site.match
will print each efile and the match option value.
Also, the following environment variables are exposed (when running a subprocess).
TOSIXINCH_RSRC: rsrc
TOSIXINCH_DFILE: dfile
TOSIXINCH_EFILE: efile
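A subprocess hook could read these variables as in the sketch below. Only the three variable names come from the list above; the function name describe_job and the output format are made up for illustration.

```python
import os


def describe_job(environ=None):
    """Format the rsrc, dfile and efile of the current job from the environment."""
    if environ is None:
        environ = os.environ
    rsrc = environ.get("TOSIXINCH_RSRC", "")
    dfile = environ.get("TOSIXINCH_DFILE", "")
    efile = environ.get("TOSIXINCH_EFILE", "")
    return "rsrc=%s dfile=%s efile=%s" % (rsrc, dfile, efile)


if __name__ == "__main__":
    print(describe_job())
```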
Scripts¶
A few sample script files are included in the application.
They are in the tosixinch/script directory of the installation.
You can refer to them in user configurations.
_viewer¶
Intended to be used in the viewcmd option in tosixinch.ini.
It opens a pdf viewer. But if the same pdf application is already open with the same pdf file, it does nothing (cancels duplicate openings).
It uses the unix command ps.
It can be used without the full path:
viewcmd= _viewer.py --command okular --check --null conf.pdfname
--command accepts an arbitrary command with options, but you need to quote it (e.g. --command 'okular --page 5').
--check is the option flag to do the above duplicate check.
--null suppresses this command’s stdout and stderr.
And one way to see the help is:
$ tosixinch -4 --viewcmd '_viewer.py --help' -i aaa
(This doesn’t work if rsrc is not supplied, so you have to supply something, like the above -i aaa.)
_man¶
A sample hook extractor for man pages.
If you want to use it, add this command to pre_each_cmd2 in user configuration.
When extract runs, if the filename matches r'^.+\.[1-9]([a-z]+)?(\.gz)?$' (e.g. grep.1, grep.1.gz, grep.1p.gz), it runs the man program with 'man -Thtml', skipping the main extraction.
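The filename test can be tried out directly in Python. The pattern is the one quoted above; the function name is_man_file is hypothetical. Note that the pattern is loose, so a name such as 'v1.2' also matches.

```python
import re

# The filename pattern quoted above.
MAN_FILE = re.compile(r'^.+\.[1-9]([a-z]+)?(\.gz)?$')


def is_man_file(filename):
    """Return True if the filename looks like a man page source file."""
    return MAN_FILE.match(filename) is not None


if __name__ == "__main__":
    for name in ["grep.1", "grep.1.gz", "grep.1p.gz", "notes.txt"]:
        print(name, is_man_file(name))
```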
Note
pre_each_cmd2 is a LINE option, so multiple commands must be separated with a newline and an indent, e.g.:
pre_each_cmd2= echo foo
               _man
If you supply multiple rsrcs, it triggers the binary-extension filter, and the default includes gz. In this case, you have to subtract gz from the list (see add_binary_extensions).
# in rsrcs.txt
/usr/share/man/man1/cp.1.gz
/usr/share/man/man1/grep.1.gz
$ tosixinch -123 --add-binary-extensions -gz
_pcode¶
A sample hook extractor for source code (pcode: short for ‘Pygments code’).
It formats (html-wraps) some Pygments tokens.
The purpose is to make them pdf bookmark items, and create references to them.
If you want to use it, add this command to pre_each_cmd2 in user configuration.
Note that pre_each_cmd2 is a LINE option; see the above note for _man.
You need to install Pygments, and ctags (Universal Ctags or Exuberant Ctags).
It creates the working files tsi.tags and tsi.tags.checksum in the current directory.
(The script skips tag creation if the command, files and (max) mtime are the same as in the previous run.)
As stated in code, it uses h2 and h3 tags in a very unusual way.
You need special css rules (see sample.t.css for an example).
Language names:
Pygments is a code highlighter, which the script uses to find identifiers.
But Pygments doesn’t tell where an identifier is defined, so we also need Ctags to find the definitions.
Pygments and Ctags have slightly different language names, e.g. ‘reStructuredText’ (Pygments) and ‘ReStructuredText’ (Ctags).
So they must be mapped to a third set of common names that the script defines (ftype).
ftypes are all lower case, so names that happen to be the same case-insensitively are already mapped.
But the other names must be explicitly mapped to the same ftype in configuration, using the c2ftype and p2ftype sections.
To see the actual mappings, run tosixinch with some text input and with --verbose:
tosixinch -i <some-text-file> -12 --verbose
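The mapping idea can be sketched in Python like this. It is only an illustration of the rule described above: the function name to_ftype and the dictionary contents are hypothetical, and the 'Bash' and 'Sh' pair is just one example of names that differ between the two tools.

```python
def to_ftype(name, explicit=None):
    """Map a Pygments or Ctags language name to a common lower case ftype.

    Names listed in the explicit mapping win; every other name is
    simply lowercased.
    """
    explicit = explicit or {}
    return explicit.get(name, name.lower())


if __name__ == "__main__":
    # Made-up stand-ins for the p2ftype and c2ftype sections.
    p2ftype = {"Bash": "sh"}
    c2ftype = {}
    print(to_ftype("reStructuredText", p2ftype), to_ftype("ReStructuredText", c2ftype))
    print(to_ftype("Bash", p2ftype), to_ftype("Sh", c2ftype))
```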
Configuration:
You can specify some configuration if you create pcode.ini in userdir.
(See the default config file tosixinch/data/pcode.ini for an example.)
The arguments option in the default pcode.ini is currently like this.
It holds the commandline arguments to run Ctags, and most of them are required for the script to work as intended.
--options=NONE # reset
--format=2 # only support extended format
--sort=no # toc is normally in appearance order
--excmd=number # only support line number, not pattern
--file-scope=yes # only support in-file tags now
--fields=fkl # need them, filename, kind, and language
--kinds-python=cfm # define which kinds to use
# in each language
# ('--<lang>-kinds' in Exuberant Ctags)
The kindmap option in the default pcode.ini is like this:
kindmap= h2=cf, h3=m
Which means: wrap Ctags kinds cf (class and function) in an html <h2> tag, and wrap m (method) in an <h3> tag.
But since Ctags kinds differ greatly for each language, you have to customize them for each of your languages.
One * is allowed, to mean any other kinds, so you can write the same as above as:
kindmap= h2=*, h3=m
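How such a kindmap value might be interpreted can be shown with a short Python sketch. The function name parse_kindmap and its kinds argument (the kind letters in use, as in --kinds-python=cfm) are assumptions for illustration, not the script's own parser.

```python
def parse_kindmap(value, kinds):
    """Map each Ctags kind letter to the html tag that should wrap it."""
    mapping = {}
    star_tag = None
    for item in value.split(","):
        tag, _, letters = item.strip().partition("=")
        tag, letters = tag.strip(), letters.strip()
        if letters == "*":
            star_tag = tag  # applies to every kind not named elsewhere
        else:
            for kind in letters:
                mapping[kind] = tag
    if star_tag is not None:
        for kind in kinds:
            mapping.setdefault(kind, star_tag)
    return mapping


if __name__ == "__main__":
    print(parse_kindmap("h2=cf, h3=m", "cfm"))
    print(parse_kindmap("h2=*, h3=m", "cfm"))
```

Both calls yield the same mapping, which is the equivalence stated above.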
Customization:
You can write a custom module using the class tosixinch.script.pcode._pygments.PygmentsCode.
Example:
# ~/.config/tosixinch/script/pcode/perl.py
from tosixinch.script.pcode import _pygments
class CustomCode(_pygments.PygmentsCode):
    ...  # override check_def, check_ref, wrap_def or wrap_ref as needed
The entry point is format_entry, which has the sub entry points check_def, check_ref, wrap_def and wrap_ref.
See tosixinch/script/pcode/python.py for an example.
Relevant options are:
- (new section):
Create a new section with the same name as the ftype.
- module:
Module name for your custom module. You have to create this module in the pcode directory in the script directory (e.g. ‘~/.config/tosixinch/script/pcode/perl.py’ above). Example:
[perl]
module= perl
- class:
Class name for your custom class in the module above. The default is CustomCode.
- start_token:
Pygments token type name. All other tokens (and their subclasses) are not touched by formatting. Normally Token.Name suffices (default).
- kindmap:
Explained above.
Generic Ftypes:
If Pygments finds a language but the language is not mapped, the script returns without doing formatting, but registers the rsrc’s ftype as nonprose.
(It is a heuristic.
If Pygments finds a language, it is better to treat the text as code-like, not prose.)
If the Pygments name is mapped, but Ctags doesn’t find a language, or the language is not mapped, or not mapped to the same name, the script does formatting only with Pygments tokens.
As a special case, if the Pygments name is mapped to the name 'prose', the script does not do formatting, but registers the rsrc’s ftype as prose. (The default is: reStructuredText and markdown.)
_tosixinch.bash¶
A basic bash completion script.
If you are using bash, it should be useful.
Source it in your .bashrc. For example:
source [...]/site-packages/tosixinch/data/_tosixinch.bash
Vendored Libraries¶
The program uses a few vendored (included) libraries.
- templite.py¶
This is a module of Ned Batchelder’s Coverage.py, and described extensively in a chapter of ‘500 Lines or Less’.
It is a general template engine, used for css template rendering here.
- imagesize.py¶
This is a rewrite of Phuslu’s imgsz.
I wanted a simple image format metadata reader (Pillow or other graphic libraries are too big), and I found his was the best to copy.
- configfetch.py¶
Simplify parsing commandline and config options. (configfetch).
- zconfigparser.py¶
Implement section inheritance in site.ini. (zconfigparser).