API reference

A few modules are documented here.

  • process.sample

  • process.inspect_sample

  • templite

process.sample

Sample functions to use in the process option.

Divided into three parts:

  • Helper Functions

  • General Functions (applicable for many websites)

  • Site Specific Functions

tosixinch.process.sample.make_tag(tag='div', text='')[source]

Make element (HtmlElement) from tag and string.

>>> el = make_tag('p', 'aaa')
>>> tostring(el)
'<p>aaa</p>'
tosixinch.process.sample.wrap_tag(el, tag='div')[source]

Wrap element in a tag.

>>> el = fromstring('<p>aaa</p>')
>>> parent = el.getparent()
>>> wrap_tag(el, 'div')
>>> tostring(parent[0])
'<div><p>aaa</p></div>'
tosixinch.process.sample.remove_tag(el)[source]

Remove element (and subelements) from parent element.

>>> doc = fromstring('<div><p>aaa</p><p>bbb</p></div>')
>>> el = doc.xpath('//p')[1]
>>> remove_tag(el)
>>> tostring(doc)
'<div><p>aaa</p></div>'
tosixinch.process.sample.replace_tag(el, replace)[source]

Replace element with another element.

>>> doc = fromstring('<div><p>aaa</p></div>')
>>> el = doc.xpath('//p')[0]
>>> repl = make_tag('h3', 'bbb')
>>> replace_tag(el, repl)
>>> tostring(doc)
'<div><h3>bbb</h3></div>'
tosixinch.process.sample.insert_tag(el, add, before=True)[source]

Insert element (‘add’) before or after element (‘el’).

See add_hr for doctest example.
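
Since insert_tag has no doctest of its own, here is a rough sketch of the before/after insertion idea. It uses the standard library's xml.etree.ElementTree rather than the HtmlElement objects the module works with, so it is only an illustration, not the module's implementation. The name insert_sibling and its root argument are hypothetical: ElementTree elements have no getparent, so the sketch builds a parent map from the root. Text that follows an element (its tail) is not handled.

```python
import xml.etree.ElementTree as ET


def insert_sibling(root, el, add, before=True):
    """Insert 'add' as a sibling of 'el', before or after it (hypothetical helper)."""
    # ElementTree has no getparent(), so map every child to its parent first.
    parent_map = {child: parent for parent in root.iter() for child in parent}
    parent = parent_map[el]
    index = list(parent).index(el)
    # Inserting at 'index' puts 'add' before 'el'; 'index + 1' puts it after.
    parent.insert(index if before else index + 1, add)


root = ET.fromstring('<div><p>aaa</p><p>bbb</p></div>')
second = root.findall('p')[1]
insert_sibling(root, second, ET.Element('hr'), before=True)
tags = [child.tag for child in root]  # child order after the insertion
```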

tosixinch.process.sample.check_parent_tag(el, tag='div', generation=2)[source]

Check existence of tag in an element’s parent elements.

Return it if found.

>>> doc = fromstring('<table><tr><td>aaa</td></tr></table>')
>>> el = doc.xpath('//td')[0]
>>> el = check_parent_tag(el, 'table')
>>> el.tag
'table'
tosixinch.process.sample.get_element_text(el, path='.')[source]

Return all texts in an element or elements.

Parameters
  • el – main element to search

  • path – xpath string for the element(s) you want

>>> el = fromstring('<h2>aaa<div>bbb</div></h2>')
>>> get_element_text(el, '//h2')
'aaabbb'
>>> el = fromstring('<div>no<h2>aaa<div>bbb</div><div>ccc<p>ddd</p></div></h2><h2>xxx</h2></div>')  # noqa: E501
>>> get_element_text(el, '//h2')
'aaabbbcccdddxxx'
tosixinch.process.sample.get_metadata(el)[source]

Get basic metadata from <meta name=... content=...>.
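
The docstring does not say which names are read or what is returned, so the following is only a sketch of the general idea: collecting name/content pairs from <meta> tags. It uses the standard library's html.parser; the names MetaCollector and collect_meta, and the choice to return a plain dict, are assumptions for illustration.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta name=... content=...> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        attributes = dict(attrs)
        name = attributes.get('name')
        content = attributes.get('content')
        # Skip <meta> tags without both attributes, e.g. <meta charset=...>.
        if name and content is not None:
            self.metadata[name] = content


def collect_meta(html):
    parser = MetaCollector()
    parser.feed(html)
    return parser.metadata


meta = collect_meta(
    '<head><meta charset="utf-8">'
    '<meta name="author" content="Ann"></head>')
```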

tosixinch.process.sample.add_h1(doc, force=False)[source]

If there is no <h1>, make <h1> from <title> tag text.

>>> s = '<html><head><title>aaa</title></head><body></body></html>'
>>> doc = fromstring(s)
>>> add_h1(doc)
>>> tostring(doc)
'<html><head><title>aaa</title></head><body><h1>aaa</h1></body></html>'
tosixinch.process.sample.add_h1_force(doc)[source]

Add title even if there are <h1>s already.

tosixinch.process.sample.delete_duplicate_br(doc, maxnum=2)[source]

Reduce continuous <br> tags to at most maxnum, to save display space.

>>> el = fromstring('<div>aaa<br><br>  <br><br/><br>bbb<br><br></div>')
>>> delete_duplicate_br(el)
>>> tostring(el)
'<div>aaa<br><br>  bbb<br><br></div>'
tosixinch.process.sample.youtube_video_to_thumbnail(doc)[source]

Change embedded youtube video object to thumbnail image.

from: https://www.youtube.com/embed/(id)?feature=oembed
to: http://img.youtube.com/vi/(id)/hqdefault.jpg
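
The URL mapping above can be sketched with a regular expression. This is a string-level illustration only; the function name embed_to_thumbnail and the exact pattern are assumptions, and the real function operates on the document tree.

```python
import re

# Capture the video id that follows '/embed/' (up to '?', '/', or '#').
EMBED_PATTERN = re.compile(r'https?://www\.youtube\.com/embed/([^/?#]+)')


def embed_to_thumbnail(url):
    """Return the thumbnail URL for an embed URL, or None if it does not match."""
    m = EMBED_PATTERN.match(url)
    if m is None:
        return None
    return 'http://img.youtube.com/vi/%s/hqdefault.jpg' % m.group(1)


thumbnail = embed_to_thumbnail(
    'https://www.youtube.com/embed/abc123?feature=oembed')
```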
tosixinch.process.sample.show_href(doc)[source]

Make the URLs of <a href=...> links visible as text.

>>> el = fromstring('<div><a href="aaa">bbb</a></div>')
>>> show_href(el)
>>> tostring(el)
'<div><a href="aaa">bbb</a><span class="tsi-href-visible">\xa0 [[aaa]] \xa0</span></div>'
tosixinch.process.sample.lower_heading(doc, path=None)[source]

Lower heading levels, except for the specified element (by xpath).

That is, <h1> becomes <h2>, … <h5> becomes <h6> (<h6> is kept as is). This makes for a prettier table of contents (TOC), since the TOC is usually built from the heading structure. A basic use case is a document with multiple <h1> elements: you don’t want all of them cluttering the TOC tree, only one of them at the top.

>>> el = fromstring('<div><h1>aaa</h1><h1 class="b">bbb</h1><h2>ccc</h2></div>')  # noqa: E501
>>> lower_heading(el, './@class="b"')
>>> tostring(el)
'<div><h2>aaa</h2><h1 class="b">bbb</h1><h3>ccc</h3></div>'
tosixinch.process.sample.lower_heading_from_order(doc, tag=1, order=1)[source]

Lower heading levels, except for the specified element (by order).

The purpose is the same as lower_heading, except that you specify the element to keep by heading number and order. For example, the arguments 'tag=2, order=3' mean the third <h2> element in the document.

>>> el = fromstring('<div><h1>aaa</h1><h1>bbb</h1><h2>ccc</h2></div>')
>>> lower_heading_from_order(el, 1, 2)
>>> tostring(el)
'<div><h2>aaa</h2><h1>bbb</h1><h3>ccc</h3></div>'
tosixinch.process.sample.lower_heading_from_order_auto(doc)[source]

Lower headings, except first <h1>, if multiple h1 headings found.
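
As a rough illustration of the rule described here, the sketch below lowers every heading except the first <h1>, and only when more than one <h1> exists. It uses the standard library's xml.etree.ElementTree instead of the module's own element type, and the name lower_headings_keep_first_h1 is hypothetical; the real function may differ in detail.

```python
import re
import xml.etree.ElementTree as ET

# Only <h1> to <h5> are lowered; <h6> is kept as is.
HEADING = re.compile(r'^h([1-5])$')


def lower_headings_keep_first_h1(root):
    """Lower headings by one level, keeping the first <h1> (hypothetical helper)."""
    h1s = list(root.iter('h1'))
    if len(h1s) < 2:
        return  # zero or one <h1>: nothing to do
    keep = h1s[0]
    for el in list(root.iter()):
        if el is keep:
            continue
        m = HEADING.match(el.tag)
        if m:
            el.tag = 'h%d' % (int(m.group(1)) + 1)


root = ET.fromstring('<div><h1>aaa</h1><h1>bbb</h1><h2>ccc</h2></div>')
lower_headings_keep_first_h1(root)
tags = [child.tag for child in root]
```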

tosixinch.process.sample.split_h1(doc, seps=None, part='1')[source]

Remove unwanted parts from h1 string.

Headings or titles are often composed of multiple items, like ‘Murder! - Domestic News - The Local Paper’. You want just ‘Murder!’.

Selected items are whitespace stripped.

Parameters
  • seps – strings by which the heading is separated. If None, the defaults ' - ', ' : ', ' | ' are used.

  • part – which part to select. ‘1’ means the first item (index 0). The special number ‘-1’ selects the last item.

>>> el = fromstring('<h1>aaa ~ bbb</h1>')
>>> split_h1(el, '~', '2')
>>> tostring(el)
'<h1>bbb</h1>'
>>> el = fromstring('<h1>aaa ~ bbb</h1>')
>>> split_h1(el, '~', '-1')
>>> tostring(el)
'<h1>bbb</h1>'
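
The part selection described in the parameters can be sketched on plain strings. The function name select_part is hypothetical, and how the real function interprets a seps argument given as a single string is an assumption here (it is treated as one separator).

```python
import re

DEFAULT_SEPS = (' - ', ' : ', ' | ')


def select_part(text, seps=None, part='1'):
    """Split 'text' by separators and return one stripped item (hypothetical helper)."""
    if seps is None:
        seps = DEFAULT_SEPS
    elif isinstance(seps, str):
        seps = (seps,)  # assumption: a single string is one separator
    pattern = '|'.join(re.escape(sep) for sep in seps)
    items = [item.strip() for item in re.split(pattern, text)]
    index = int(part)
    if index > 0:
        index -= 1  # '1' means the first item, that is, index 0
    return items[index]  # negative values count from the end, as in Python


title = select_part('Murder! - Domestic News - The Local Paper')
```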
tosixinch.process.sample.replace_h1(el, pat, repl='')[source]

Change <h1> string by regular expression, replacing pat with repl.

>>> el = fromstring('<h1>A boring article</h1>')
>>> replace_h1(el, 'A boring', 'An exciting')
>>> tostring(el)
'<h1>An exciting article</h1>'
tosixinch.process.sample.code_to_pre_code(doc)[source]

Wrap <code> with <pre>, when text includes newlines.

The sample css adds a thin border style to <pre> but not to <code>, so that multiline code stands out a little and inline code does not look cluttered in small black and white ebooks. But some sites use <code> indiscriminately, also for multiline code. In these cases, adding <pre> rather unconditionally is one solution.

As an arbitrary precaution, if the parent or grandparent element is <pre>, adding another <pre> is skipped.

>>> el = fromstring('<code>aaabbb</code>')
>>> parent = el.getparent()
>>> code_to_pre_code(el)
>>> tostring(parent[0])
'<code>aaabbb</code>'
>>> el = fromstring(r'<code>aaa\nbbb</code>')
>>> parent = el.getparent()
>>> code_to_pre_code(el)
>>> tostring(parent[0])
'<pre><code>aaa\\nbbb</code></pre>'
tosixinch.process.sample.add_hr(doc, path)[source]

Add <hr> tag before some xpath element ('path') in the document.

>>> el = fromstring('<div><p>aaa</p><p>bbb</p></div>')
>>> path = '(//p)[2]'
>>> add_hr(el, path)
>>> tostring(el)
'<div><p>aaa</p><hr><p>bbb</p></div>'
tosixinch.process.sample.add_description(doc)[source]

Add description from <meta>.

tosixinch.process.sample._add_style(el, style)[source]

Add inline style strings (‘style’) to element (note: the first argument is an element, not doc).

>>> el = fromstring('<p>aaa</p>')
>>> _add_style(el, 'font-size: larger;')
>>> tostring(el)
'<p class="tsi-keep-style" style="font-size: larger;">aaa</p>'
tosixinch.process.sample.add_style(doc, path, style)[source]

Add inline style strings (‘style’) to each xpath element (‘path’).

>>> el = fromstring('<div><p>aaa</p></div>')
>>> add_style(el, '//p', 'font-size: larger;')
>>> tostring(el)
'<div><p class="tsi-keep-style" style="font-size: larger;">aaa</p></div>'
tosixinch.process.sample.replace_tags(doc, path, tag='div')[source]

Change just the tagname while keeping anything inside.

>>> doc = fromstring('<div><p>aaa</p>bbb</div>')
>>> replace_tags(doc, '//div', 'h3')
>>> tostring(doc)
'<h3><p>aaa</p>bbb</h3>'
tosixinch.process.sample.add_noscript_image(doc)[source]

Move element inside <noscript> to outside.

>>> doc = fromstring('<h3><noscript><div><img src="a.jpg"></div></noscript></h3>')  # noqa: E501
>>> add_noscript_image(doc)
>>> tostring(doc)
'<h3><noscript><div></div></noscript><img src="a.jpg"></h3>'

tosixinch.process.sample.convert_permalink_sign(doc, repl)[source]

Change permalink sign to some text ('repl').

Most Python documentation uses this sign (U+00B6, the pilcrow sign ‘¶’). In PDF, these marks are always visible and rather noisy.

cf. in sample css, ‘headerlink’ is already made invisible (‘display:none;’).

>>> el = fromstring(r'<div><h1>tosixinch<a class="headerlink">¶</a></h1></div>')  # noqa: E501
>>> convert_permalink_sign(el, '\u2026')
>>> tostring(el)
'<div><h1>tosixinch<a class="headerlink">…</a></h1></div>'
tosixinch.process.sample.hackernews_indent(doc)[source]

Narrow the default indent widths; they are too wide for e-readers.

tosixinch.process.sample.reddit_indent(doc)[source]

Narrow the default indent widths; they are too wide for e-readers.

tosixinch.process.sample.github_self_anchor(doc)[source]

Discard self anchors in <h3>.

We stripped the referents, and weasyprint warns about it.

tosixinch.process.sample.github_issues_comment_header(doc)[source]

Change comment header blocks from <h3> to <div>.

<h3> is too big here, and clutters the TOC.

Also discard self anchors in the date part of headers, e.g. 'href="#issuecomment-223857939"'. We stripped the referents, and weasyprint warns about it.

Also delete the repetitive sentence ‘This comment…’ (display: none).

process.inspect_sample

Sample functions to use in inspect action.

Print <a href> links, if the regex string (‘match’) matches.

usage example:

# print jpg files
inspect=    get_links?jpg$
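
The filtering in the usage example above (the regex 'jpg$') can be pictured with the standard library alone. This is a sketch, not the module's code: the names HrefCollector and matching_links are hypothetical, and matching with re.search against the href value is an assumption.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def matching_links(html, pattern):
    """Return hrefs for which the regex 'pattern' finds a match."""
    parser = HrefCollector()
    parser.feed(html)
    return [href for href in parser.hrefs if re.search(pattern, href)]


links = matching_links(
    '<a href="a.jpg">x</a><a href="b.png">y</a><a href="c/d.jpg">z</a>',
    'jpg$')
```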
tosixinch.process.inspect_sample.hackernews_topstories(doc)[source]

Print hackernews top stories and some data, all commented out.

Queries the API described at https://github.com/HackerNews/API (so the doc argument is not used).

2022/04/30: quite slow (the API server itself is that way, I guess)

usage example:

# only when input is exactly the site home, no glob ('*')
[hackernews_home]
match=      https://news.ycombinator.com
inspect=    hackernews_topstories

templite

Note

This module is copied from Ned Batchelder’s Coverage.py, including the docstrings here. See templite.py.

A simple Python template renderer, for a nano-subset of Django syntax.

For a detailed discussion of this code, see this chapter from 500 Lines: http://aosabook.org/en/500L/a-template-engine.html

exception tosixinch.templite.TempliteSyntaxError[source]

Raised when a template has a syntax error.

exception tosixinch.templite.TempliteValueError[source]

Raised when an expression won’t evaluate in a template.

class tosixinch.templite.CodeBuilder(indent=0)[source]

Build source code conveniently.

add_line(line)[source]

Add a line of source to the code.

Indentation and newline will be added for you, don’t provide them.

add_section()[source]

Add a section, a sub-CodeBuilder.

indent()[source]

Increase the current indent for following lines.

dedent()[source]

Decrease the current indent for following lines.

get_globals()[source]

Execute the code, and return a dict of globals it defines.
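
To show how these methods fit together, here is a simplified stand-in class written only from the method descriptions above. MiniCodeBuilder is not the tosixinch.templite.CodeBuilder source; its internals (a list of string pieces, a four-space indent step) are assumptions. The point of add_section is that a placeholder can be reserved early and filled in later.

```python
class MiniCodeBuilder:
    """Simplified stand-in modeled on the documented CodeBuilder interface."""

    INDENT_STEP = 4  # assumption: four spaces per level

    def __init__(self, indent=0):
        self.code = []
        self.indent_level = indent

    def __str__(self):
        return ''.join(str(piece) for piece in self.code)

    def add_line(self, line):
        # Indentation and newline are added here, not by the caller.
        self.code.extend([' ' * self.indent_level, line, '\n'])

    def add_section(self):
        section = MiniCodeBuilder(self.indent_level)
        self.code.append(section)
        return section

    def indent(self):
        self.indent_level += self.INDENT_STEP

    def dedent(self):
        self.indent_level -= self.INDENT_STEP

    def get_globals(self):
        namespace = {}
        exec(str(self), namespace)
        return namespace


builder = MiniCodeBuilder()
builder.add_line('def greet(name):')
builder.indent()
header = builder.add_section()           # reserved now, filled in below
builder.add_line('return prefix + name')
builder.dedent()
header.add_line("prefix = 'Hello, '")    # lands before the return line
greet = builder.get_globals()['greet']
```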

class tosixinch.templite.Templite(text, *contexts)[source]

A simple template renderer, for a nano-subset of Django syntax.

Supported constructs are extended variable access:

{{var.modifier.modifier|filter|filter}}

loops:

{% for var in list %}...{% endfor %}

and ifs:

{% if var %}...{% endif %}

Comments are within curly-hash markers:

{# This will be ignored #}

Lines between {% joined %} and {% endjoined %} will have lines stripped and joined. Be careful, this could join words together!

Any of these constructs can have a hyphen at the end (-}}, -%}, -#}), which will collapse the whitespace following the tag.

Construct a Templite with the template text, then use render against a dictionary context to create a finished string:

templite = Templite('''
    <h1>Hello {{name|upper}}!</h1>
    {% for topic in topics %}
        <p>You are interested in {{topic}}.</p>
    {% endfor %}
    ''',
    {'upper': str.upper},
)
text = templite.render({
    'name': "Ned",
    'topics': ['Python', 'Geometry', 'Juggling'],
})
render(context=None)[source]

Render this template by applying it to context.

context is a dictionary of values to use in this rendering.
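
The extended variable access ({{var.modifier.modifier|filter|filter}}) can be pictured with a small standalone sketch. It is written under stated assumptions rather than taken from the Templite source: each dotted name is tried as an attribute, then as a key, callables are called without arguments, and filters are looked up in the context by name. The names resolve_dots and render_expression are hypothetical.

```python
def resolve_dots(value, *dots):
    """Follow dotted names: attribute first, then key; call callables."""
    for dot in dots:
        try:
            value = getattr(value, dot)
        except AttributeError:
            value = value[dot]  # may raise KeyError or TypeError if missing
        if callable(value):
            value = value()
    return value


def render_expression(context, expr):
    """Evaluate 'var.a.b|f1|f2' against a context dict (hypothetical helper)."""
    pipes = expr.split('|')
    name, *dots = pipes[0].split('.')
    value = resolve_dots(context[name], *dots)
    for filter_name in pipes[1:]:
        value = context[filter_name](value)
    return value


context = {'user': {'name': 'ned'}, 'upper': str.upper}
shouted = render_expression(context, 'user.name|upper')
```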