API reference
A few modules are documented here.
process.sample
process.inspect_sample
templite
process.sample
Sample functions to use in the process option.
Divided into three parts:
Helper Functions
General Functions (applicable for many websites)
Site Specific Functions
- tosixinch.process.sample.make_tag(tag='div', text='')
Make an element (HtmlElement) from a tag name and a string.
>>> el = make_tag('p', 'aaa')
>>> tostring(el)
'<p>aaa</p>'
- tosixinch.process.sample.wrap_tag(el, tag='div')
Wrap an element in a tag.
>>> el = fromstring('<p>aaa</p>')
>>> parent = el.getparent()
>>> wrap_tag(el, 'div')
>>> tostring(parent[0])
'<div><p>aaa</p></div>'
- tosixinch.process.sample.remove_tag(el)
Remove an element (and its subelements) from its parent element.
>>> doc = fromstring('<div><p>aaa</p><p>bbb</p></div>')
>>> el = doc.xpath('//p')[1]
>>> remove_tag(el)
>>> tostring(doc)
'<div><p>aaa</p></div>'
- tosixinch.process.sample.replace_tag(el, replace)
Replace an element with another element.
>>> doc = fromstring('<div><p>aaa</p></div>')
>>> el = doc.xpath('//p')[0]
>>> repl = make_tag('h3', 'bbb')
>>> replace_tag(el, repl)
>>> tostring(doc)
'<div><h3>bbb</h3></div>'
- tosixinch.process.sample.insert_tag(el, add, before=True)
Insert an element (‘add’) before or after another element (‘el’).
See add_hr for a doctest example.
- tosixinch.process.sample.check_parent_tag(el, tag='div', generation=2)
Check whether a tag exists among an element’s parent elements, and return that parent if found.
>>> doc = fromstring('<table><tr><td>aaa</td></tr></table>')
>>> el = doc.xpath('//td')[0]
>>> el = check_parent_tag(el, 'table')
>>> el.tag
'table'
- tosixinch.process.sample.get_element_text(el, path='.')
Return all texts in an element or elements.
- Parameters
el – main element to search
path – xpath string for the element(s) you want
>>> el = fromstring('<h2>aaa<div>bbb</div></h2>')
>>> get_element_text(el, '//h2')
'aaabbb'
>>> el = fromstring('<div>no<h2>aaa<div>bbb</div><div>ccc<p>ddd</p></div></h2><h2>xxx</h2></div>')  # noqa: E501
>>> get_element_text(el, '//h2')
'aaabbbcccdddxxx'
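The text-gathering behavior in the doctest above can be approximated with the standard library alone. The following is a minimal sketch, not tosixinch's implementation: it uses xml.etree.ElementTree instead of lxml, and the helper name collect_text and its tag argument (a plain tag name rather than an xpath string) are hypothetical.

```python
import xml.etree.ElementTree as ET


def collect_text(root, tag):
    """Concatenate all text found inside every <tag> element under root.

    Hypothetical helper for illustration; it takes a tag name, not an xpath.
    """
    # itertext() walks an element and its descendants in document order,
    # yielding the text inside the element (but not the element's own tail).
    return ''.join(''.join(el.itertext()) for el in root.iter(tag))


root = ET.fromstring(
    '<div>no<h2>aaa<div>bbb</div><div>ccc<p>ddd</p></div></h2><h2>xxx</h2></div>')
print(collect_text(root, 'h2'))
```

Note that the text 'no', which sits outside any h2, is not collected.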
- tosixinch.process.sample.get_metadata(el)
Get basic metadata from <meta name=... content=...>.
- tosixinch.process.sample.add_h1(doc, force=False)
If there is no <h1>, make an <h1> from the <title> tag text.
>>> s = '<html><head><title>aaa</title></head><body></body></html>'
>>> doc = fromstring(s)
>>> add_h1(doc)
>>> tostring(doc)
'<html><head><title>aaa</title></head><body><h1>aaa</h1></body></html>'
- tosixinch.process.sample.delete_duplicate_br(doc, maxnum=2)
Reduce runs of consecutive <br> tags to at most maxnum <br> tags, to save display space.
>>> el = fromstring('<div>aaa<br><br> <br><br/><br>bbb<br><br></div>')
>>> delete_duplicate_br(el)
>>> tostring(el)
'<div>aaa<br><br> bbb<br><br></div>'
- tosixinch.process.sample.youtube_video_to_thumbnail(doc)
Change an embedded YouTube video object to a thumbnail image.
from: https://www.youtube.com/embed/(id)?feature=oembed
to: http://img.youtube.com/vi/(id)/hqdefault.jpg
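The URL mapping above can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the function's actual code: the helper name embed_to_thumbnail and the regular expression are hypothetical, and only the two URL shapes listed above are taken from the documentation.

```python
import re

# Hypothetical pattern: captures the video id from an embed URL.
EMBED_RE = re.compile(r'https?://www\.youtube\.com/embed/([^/?#]+)')


def embed_to_thumbnail(url):
    """Return the thumbnail URL for an embed URL, or None if it does not match."""
    m = EMBED_RE.match(url)
    if m is None:
        return None
    return 'http://img.youtube.com/vi/%s/hqdefault.jpg' % m.group(1)


print(embed_to_thumbnail('https://www.youtube.com/embed/abc123?feature=oembed'))
```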
- tosixinch.process.sample.show_href(doc)
Make the targets of <a href=...> links visible as text.
>>> el = fromstring('<div><a href="aaa">bbb</a></div>')
>>> show_href(el)
>>> tostring(el)
'<div><a href="aaa">bbb</a><span class="tsi-href-visible">\xa0 [[aaa]] \xa0</span></div>'
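To make the transformation in the doctest concrete, here is a rough standard-library sketch. It is not tosixinch's implementation: it uses xml.etree.ElementTree rather than lxml, the helper name append_visible_href is hypothetical, and only the span class and the text format are copied from the doctest output.

```python
import xml.etree.ElementTree as ET


def append_visible_href(root):
    """After each <a href=...>, insert a <span> that shows the href as text."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for a in list(root.iter('a')):
        href = a.get('href')
        parent = parents.get(a)
        if not href or parent is None:
            continue
        span = ET.Element('span', {'class': 'tsi-href-visible'})
        span.text = '\xa0 [[%s]] \xa0' % href
        # Text that followed the link must now follow the new span.
        span.tail, a.tail = a.tail, None
        parent.insert(list(parent).index(a) + 1, span)


root = ET.fromstring('<div><a href="aaa">bbb</a>ccc</div>')
append_visible_href(root)
print(ET.tostring(root, encoding='unicode'))
```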
- tosixinch.process.sample.lower_heading(doc, path=None)
Lower every heading by one level, except the element specified by xpath.
That is, <h1> becomes <h2>, …, <h5> becomes <h6> (<h6> is kept as is). This gives a prettier table of contents, since the TOC is usually built from the heading structure. A basic use case is a document with multiple <h1> elements: you don’t want all of them cluttering the TOC tree, only one of them on top.
>>> el = fromstring('<div><h1>aaa</h1><h1 class="b">bbb</h1><h2>ccc</h2></div>')  # noqa: E501
>>> lower_heading(el, './@class="b"')
>>> tostring(el)
'<div><h2>aaa</h2><h1 class="b">bbb</h1><h3>ccc</h3></div>'
- tosixinch.process.sample.lower_heading_from_order(doc, tag=1, order=1)
Lower every heading by one level, except the element specified by order.
The purpose is the same as lower_heading, except that you specify the element to keep by heading number and order. For example, the arguments 'tag=2, order=3' mean the third <h2> element in the document.
>>> el = fromstring('<div><h1>aaa</h1><h1>bbb</h1><h2>ccc</h2></div>')
>>> lower_heading_from_order(el, 1, 2)
>>> tostring(el)
'<div><h2>aaa</h2><h1>bbb</h1><h3>ccc</h3></div>'
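The keep-by-order rule can be sketched with the standard library. This is an illustration only, not the tosixinch code: it uses xml.etree.ElementTree in place of lxml, and the function name lower_headings_except is hypothetical. The tag and order arguments follow the meaning described above.

```python
import xml.etree.ElementTree as ET

LOWERABLE = ('h1', 'h2', 'h3', 'h4', 'h5')  # <h6> is kept as is


def lower_headings_except(root, tag=1, order=1):
    """Lower each heading one level, except the order-th <h{tag}> element."""
    keep_name = 'h%d' % tag
    seen = 0
    # Each element is visited once, so a lowered heading is never lowered twice.
    for el in root.iter():
        if el.tag not in LOWERABLE:
            continue
        if el.tag == keep_name:
            seen += 1
            if seen == order:
                continue  # this is the heading to keep on top
        el.tag = 'h%d' % (int(el.tag[1]) + 1)


root = ET.fromstring('<div><h1>aaa</h1><h1>bbb</h1><h2>ccc</h2></div>')
lower_headings_except(root, 1, 2)
print(ET.tostring(root, encoding='unicode'))
```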
- tosixinch.process.sample.lower_heading_from_order_auto(doc)
Lower headings, except the first <h1>, if multiple <h1> headings are found.
- tosixinch.process.sample.split_h1(doc, seps=None, part='1')
Remove unwanted parts from the h1 string.
Headings or titles are often composed of multiple items, like ‘Murder! - Domestic News - The Local Paper’. You want just ‘Murder!’.
Selected items are stripped of surrounding whitespace.
- Parameters
seps – strings by which the heading is separated. If None, the default ' - ', ' : ', ' | ' is used.
part – which part to select. ‘1’ means the first item (index 0). The special number ‘-1’ selects the last item.
>>> el = fromstring('<h1>aaa ~ bbb</h1>')
>>> split_h1(el, '~', '2')
>>> tostring(el)
'<h1>bbb</h1>'
>>> el = fromstring('<h1>aaa ~ bbb</h1>')
>>> split_h1(el, '~', '-1')
>>> tostring(el)
'<h1>bbb</h1>'
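The part selection described above can be shown on plain strings. The sketch below is an assumption-laden illustration, not the library's code: the helper name select_part is hypothetical, and accepting seps as either a single string or a sequence of strings is a guess made for the example. The default separators and the meaning of part come from the parameter description.

```python
import re

DEFAULT_SEPS = (' - ', ' : ', ' | ')


def select_part(text, seps=None, part='1'):
    """Split text by any separator and return the selected, stripped item."""
    if seps is None:
        seps = DEFAULT_SEPS
    elif isinstance(seps, str):
        seps = (seps,)  # assumption: a lone string means one separator
    pattern = '|'.join(re.escape(s) for s in seps)
    items = re.split(pattern, text)
    num = int(part)
    # '1' is the first item (index 0); negative numbers count from the end.
    index = num - 1 if num > 0 else num
    return items[index].strip()


print(select_part('Murder! - Domestic News - The Local Paper'))
print(select_part('aaa ~ bbb', '~', '-1'))
```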
- tosixinch.process.sample.replace_h1(el, pat, repl='')
Change the <h1> string by regular expression, replacing pat with repl.
>>> el = fromstring('<h1>A boring article</h1>')
>>> replace_h1(el, 'A boring', 'An exciting')
>>> tostring(el)
'<h1>An exciting article</h1>'
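Because pat is a regular expression, group references are presumably available in repl as well. The following sketch shows the idea on a plain string with re.sub; the helper name rewrite_heading is hypothetical, and whether replace_h1 passes repl to re.sub unchanged is an assumption.

```python
import re


def rewrite_heading(text, pat, repl=''):
    """Apply a regular expression substitution to a heading string."""
    return re.sub(pat, repl, text)


print(rewrite_heading('A boring article', 'A boring', 'An exciting'))
# A group reference keeps the chapter number and drops the word 'Chapter'.
print(rewrite_heading('Chapter 12: Templates', r'^Chapter (\d+): ', r'\1. '))
```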
- tosixinch.process.sample.code_to_pre_code(doc)
Wrap <code> with <pre>, when the text includes newlines.
The sample css adds a thin border style to <pre>, and not to <code>. This makes multiline code stand out a little while keeping inline code from looking cluttered, on small black and white ebook readers. But some sites use <code> indiscriminately, for multiline code as well. In these cases, adding <pre> rather unconditionally is one of the solutions.
As an arbitrary precaution, if the parent or grandparent element tag is <pre>, adding another <pre> is skipped.
>>> el = fromstring('<code>aaabbb</code>')
>>> parent = el.getparent()
>>> code_to_pre_code(el)
>>> tostring(parent[0])
'<code>aaabbb</code>'
>>> el = fromstring(r'<code>aaa\nbbb</code>')
>>> parent = el.getparent()
>>> code_to_pre_code(el)
>>> tostring(parent[0])
'<pre><code>aaa\\nbbb</code></pre>'
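The wrapping rule and its precaution can be sketched with the standard library. This is a minimal illustration and not the actual implementation: it uses xml.etree.ElementTree instead of lxml, and the function name wrap_multiline_code is hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_multiline_code(root):
    """Wrap <code> in <pre> when its text has a newline and no <pre> is near."""
    # ElementTree has no getparent(), so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    for code in list(root.iter('code')):
        if '\n' not in ''.join(code.itertext()):
            continue  # inline code is left alone
        parent = parents.get(code)
        if parent is None:
            continue  # code is the root; there is nothing to attach <pre> to
        grandparent = parents.get(parent)
        if parent.tag == 'pre' or (
                grandparent is not None and grandparent.tag == 'pre'):
            continue  # already inside <pre>, skip adding another one
        index = list(parent).index(code)
        pre = ET.Element('pre')
        # Text that followed <code> must now follow the new <pre>.
        pre.tail, code.tail = code.tail, None
        parent.remove(code)
        pre.append(code)
        parent.insert(index, pre)


root = ET.fromstring(
    '<div><code>aaa</code><code>bbb\nccc</code><pre><code>x\ny</code></pre></div>')
wrap_multiline_code(root)
print(ET.tostring(root, encoding='unicode'))
```

In this example only the second <code> gains a <pre> wrapper: the first has no newline, and the third already has a <pre> parent.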
- tosixinch.process.sample.add_hr(doc, path)
Add an <hr> tag before some xpath element ('path') in the document.
>>> el = fromstring('<div><p>aaa</p><p>bbb</p></div>')
>>> path = '(//p)[2]'
>>> add_hr(el, path)
>>> tostring(el)
'<div><p>aaa</p><hr><p>bbb</p></div>'
- tosixinch.process.sample._add_style(el, style)
Add inline style strings (‘style’) to an element (note: the argument is an element, not a doc).
>>> el = fromstring('<p>aaa</p>')
>>> _add_style(el, 'font-size: larger;')
>>> tostring(el)
'<p class="tsi-keep-style" style="font-size: larger;">aaa</p>'
- tosixinch.process.sample.add_style(doc, path, style)
Add inline style strings (‘style’) to each xpath element (‘path’).
>>> el = fromstring('<div><p>aaa</p></div>')
>>> add_style(el, '//p', 'font-size: larger;')
>>> tostring(el)
'<div><p class="tsi-keep-style" style="font-size: larger;">aaa</p></div>'
- tosixinch.process.sample.replace_tags(doc, path, tag='div')
Change just the tag name while keeping everything inside.
>>> doc = fromstring('<div><p>aaa</p>bbb</div>')
>>> replace_tags(doc, '//div', 'h3')
>>> tostring(doc)
'<h3><p>aaa</p>bbb</h3>'
- tosixinch.process.sample.add_noscript_image(doc)
Move element inside <noscript> to outside.
>>> doc = fromstring('<h3><noscript><div><img src="a.jpg"></div></noscript></h3>')  # noqa: E501
>>> add_noscript_image(doc)
>>> tostring(doc)
'<h3><noscript><div></div></noscript><img src="a.jpg"></h3>'
- tosixinch.process.sample.convert_permalink_sign(doc, repl='')
Change the permalink sign to some text ('repl').
Most Python documents use this sign (U+00B6, the pilcrow sign, ‘¶’). In PDF, these marks are always visible and rather noisy.
cf. In the sample css, ‘headerlink’ is already made invisible (‘display:none;’).
>>> el = fromstring(r'<div><h1>tosixinch<a class="headerlink">¶</a></h1></div>')  # noqa: E501
>>> convert_permalink_sign(el, '\u2026')
>>> tostring(el)
'<div><h1>tosixinch<a class="headerlink">…</a></h1></div>'
- tosixinch.process.sample.hackernews_indent(doc)
Narrow the default indent widths, which are too wide for e-readers.
- tosixinch.process.sample.reddit_indent(doc)
Narrow the default indent widths, which are too wide for e-readers.
- tosixinch.process.sample.github_self_anchor(doc)
Discard self anchors in <h3>.
We stripped the referents, and weasyprint warns about it.
- tosixinch.process.sample.github_issues_comment_header(doc)
Change comment header blocks from <h3> to <div>.
<h3> is too big here, and clutters the TOC.
Also discard self anchors in the date part of headers, e.g. ‘href=”#issuecomment-223857939”’. We stripped the referents, and weasyprint warns about it.
Also delete the repetitive sentence ‘This comment…’ (display: none).
process.inspect_sample
Sample functions to use in the inspect action.
- tosixinch.process.inspect_sample.get_links(doc, match='')
Print <a href> links, if the regex string match matches.
usage example:
# print jpg files
inspect= get_links?jpg$
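The regex filtering in the usage example can be tried on a plain list of link strings. This is a sketch only: the helper name filter_links is hypothetical, and using re.search (so that the pattern may match anywhere in the link) is an assumption about how match is applied.

```python
import re


def filter_links(hrefs, match=''):
    """Return the links in which the regex string match is found."""
    pattern = re.compile(match)
    return [href for href in hrefs if pattern.search(href)]


links = ['img/a.jpg', 'img/b.png', 'c.jpg?size=large']
# 'jpg$' keeps only links that end with 'jpg'.
print(filter_links(links, 'jpg$'))
```

With the default empty pattern, every link is kept.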
- tosixinch.process.inspect_sample.hackernews_topstories(doc)
Print hackernews top stories and some data, all commented out.
It queries https://github.com/HackerNews/API. (So it does not use the doc argument.)
2022/04/30: quite slow (the API server itself is that way, I guess)
usage example:
# only when input is exactly the site home, no glob ('*')
[hackernews_home]
match= https://news.ycombinator.com
inspect= hackernews_topstories
templite
Note
This module is copied from Ned Batchelder’s Coverage.py, including the docstrings here. See templite.py.
A simple Python template renderer, for a nano-subset of Django syntax.
For a detailed discussion of this code, see this chapter from 500 Lines: http://aosabook.org/en/500L/a-template-engine.html
- exception tosixinch.templite.TempliteSyntaxError
Raised when a template has a syntax error.
- exception tosixinch.templite.TempliteValueError
Raised when an expression won’t evaluate in a template.
- class tosixinch.templite.CodeBuilder(indent=0)
Build source code conveniently.
- class tosixinch.templite.Templite(text, *contexts)
A simple template renderer, for a nano-subset of Django syntax.
Supported constructs are extended variable access:
{{var.modifier.modifier|filter|filter}}
loops:
{% for var in list %}...{% endfor %}
and ifs:
{% if var %}...{% endif %}
Comments are within curly-hash markers:
{# This will be ignored #}
Lines between {% joined %} and {% endjoined %} will have lines stripped and joined. Be careful, this could join words together!
Any of these constructs can have a hyphen at the end (-}}, -%}, -#}), which will collapse the whitespace following the tag.
Construct a Templite with the template text, then use render against a dictionary context to create a finished string:
templite = Templite('''
    <h1>Hello {{name|upper}}!</h1>
    {% for topic in topics %}
        <p>You are interested in {{topic}}.</p>
    {% endfor %}
    ''',
    {'upper': str.upper},
)
text = templite.render({
    'name': "Ned",
    'topics': ['Python', 'Geometry', 'Juggling'],
})