Supported Formats¶
snapper classifies text into prose regions (reflowed at sentence boundaries) and structure regions (passed through unchanged).
The classification depends on the format.
Org-mode (--format org)¶
Structure regions (preserved)¶
#+BEGIN_* … #+END_* blocks (source, example, quote, etc.)
:PROPERTIES: … :END: drawers
#+KEYWORD: directives (TITLE, AUTHOR, DATE, OPTIONS, etc.)
Table rows (lines starting with |)
Comment lines (starting with # but not #+)
Headline stars and TODO keywords
List item markers (-, +, 1.)
Prose regions (reflowed)¶
Paragraph text
Headline text (after the stars and keyword)
List item text (after the marker)
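The Org-mode rules above can be sketched as a line classifier. This is a minimal illustration, not snapper's actual implementation: the function name `classify_org`, the regular expressions, and the "blank" label are all assumptions made for this example. Headlines and list items are labelled as prose here; separating their markers from their text is left out.

```python
import re

# Hypothetical patterns derived from the lists above; not snapper's real code.
BLOCK_BEGIN = re.compile(r"^\s*#\+BEGIN_", re.IGNORECASE)
BLOCK_END = re.compile(r"^\s*#\+END_", re.IGNORECASE)
DRAWER_BEGIN = re.compile(r"^\s*:PROPERTIES:\s*$")
DRAWER_END = re.compile(r"^\s*:END:\s*$")
KEYWORD = re.compile(r"^\s*#\+\w+:")
TABLE_ROW = re.compile(r"^\s*\|")
COMMENT = re.compile(r"^\s*#(?!\+)")


def classify_org(lines):
    """Label each line as 'structure', 'prose', or 'blank'."""
    labels = []
    closer = None  # pattern that ends the block or drawer we are inside
    for line in lines:
        if closer is not None:
            # Everything inside a block or drawer passes through unchanged.
            labels.append("structure")
            if closer.match(line):
                closer = None
        elif BLOCK_BEGIN.match(line):
            labels.append("structure")
            closer = BLOCK_END
        elif DRAWER_BEGIN.match(line):
            labels.append("structure")
            closer = DRAWER_END
        elif KEYWORD.match(line) or TABLE_ROW.match(line) or COMMENT.match(line):
            labels.append("structure")
        elif not line.strip():
            labels.append("blank")
        else:
            labels.append("prose")
    return labels


sample = [
    "#+TITLE: Demo",
    "* TODO Write docs",
    "Some text.",
    "#+BEGIN_SRC python",
    "x = 1",
    "#+END_SRC",
    "| a | b |",
    "# comment",
    "More text.",
]
for label, line in zip(classify_org(sample), sample):
    print(f"{label:9} {line}")
```

Tracking a single "closer" pattern keeps the sketch small, at the cost of not handling nested blocks.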
Inline tokens (kept atomic)¶
These tokens within prose are not split across lines:
Links: [[url][description]]
Inline code: ~code~, =verbatim=
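One way to keep such tokens atomic is to tokenize prose so that a link or inline-code span counts as a single unit even when it contains spaces. The sketch below is an assumption about how this could be done, not a description of snapper's internals: the `TOKEN` pattern, the `wrap` helper, and the width limit are hypothetical.

```python
import re

# Hypothetical atomic spans: an Org link, ~code~, or =verbatim=.
# The markers must touch non-space text, so "x = 1" is not read as verbatim.
ATOM = (
    r"\[\[[^\]\n]+\](?:\[[^\]\n]+\])?\]"
    r"|~(?=\S)[^~\n]+(?<=\S)~"
    r"|=(?=\S)[^=\n]+(?<=\S)="
)
# A token is a run of atomic spans and single non-space characters.
TOKEN = re.compile(rf"(?:{ATOM}|\S)+")


def wrap(text, width):
    """Greedy line filling that never breaks inside a token."""
    lines, current = [], ""
    for token in TOKEN.findall(text):
        candidate = f"{current} {token}" if current else token
        if current and len(candidate) > width:
            lines.append(current)
            current = token
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


text = "See [[https://example.org][the project page]] and run ~two words~ now."
for line in wrap(text, 20):
    print(line)
```

An over-long token is placed on a line of its own rather than split, which is the behaviour the list above describes.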
LaTeX (--format latex)¶
Structure regions (preserved)¶
Preamble (everything before \begin{document})
Non-prose environments: equation, align, figure, table, tabular, lstlisting, verbatim, minted, tikzpicture, and their starred variants
Display math: \[...\]
Comment lines (starting with %)
\end{document}
Prose regions (reflowed)¶
Body text between structural elements
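The LaTeX rules above amount to a small state machine: preamble first, then body text interrupted by non-prose environments, display math, and comments. The following sketch shows one possible shape for it. The name `classify_latex` and the "blank" label are assumptions for illustration, and nested environments of the same name are not handled.

```python
import re

NON_PROSE = {
    "equation", "align", "figure", "table", "tabular",
    "lstlisting", "verbatim", "minted", "tikzpicture",
}
# The optional \* covers starred variants such as align*.
BEGIN = re.compile(r"\\begin\{([A-Za-z]+)\*?\}")
END = re.compile(r"\\end\{([A-Za-z]+)\*?\}")


def classify_latex(lines):
    """Label each line as 'structure', 'prose', or 'blank'."""
    labels = []
    in_body = False     # False until \begin{document} has been seen
    open_env = None     # non-prose environment currently open, if any
    in_display = False  # True between \[ and \]
    for line in lines:
        text = line.strip()
        if not in_body:
            labels.append("structure")
            in_body = text.startswith(r"\begin{document}")
        elif open_env is not None:
            labels.append("structure")
            if open_env in END.findall(text):
                open_env = None
        elif in_display:
            labels.append("structure")
            in_display = r"\]" not in text
        elif not text:
            labels.append("blank")
        elif text.startswith("%") or text.startswith(r"\end{document}"):
            labels.append("structure")
        elif text.startswith(r"\["):
            labels.append("structure")
            in_display = r"\]" not in text
        else:
            begin = BEGIN.match(text)
            if begin and begin.group(1) in NON_PROSE:
                labels.append("structure")
                # Stay inside the environment unless it closes on this line.
                if begin.group(1) not in END.findall(text):
                    open_env = begin.group(1)
            else:
                labels.append("prose")
    return labels


sample = [
    r"\documentclass{article}",
    r"\begin{document}",
    "Some prose.",
    r"\begin{align*}",
    "a &= b",
    r"\end{align*}",
    "% note",
    "More prose.",
    r"\end{document}",
]
print(classify_latex(sample))
```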
Markdown (--format markdown)¶
Structure regions (preserved)¶
Fenced code blocks (``` or ~~~)
Front matter (--- or +++ delimited at file start)
Heading markers (#, ##, etc.)
List item markers (-, *, +, 1.)
Prose regions (reflowed)¶
Paragraph text
Heading text (after the marker)
List item text (after the marker)
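Because a heading or list line mixes a preserved marker with reflowable text, each line can be viewed as a (prefix, text) pair. The sketch below illustrates that split under stated assumptions: `split_markdown` and its patterns are hypothetical, front matter is not handled, and lines inside a fenced block are returned entirely as prefix.

```python
import re

# Hypothetical patterns: a heading or list marker plus the space after it,
# and an opening or closing code fence of three backticks or tildes.
MARKER = re.compile(r"^(\s*(?:#{1,6}|[-*+]|\d+\.)\s+)(.*)$")
FENCE = re.compile(r"^\s*(`{3}|~{3})")


def split_markdown(lines):
    """Return (preserved_prefix, reflowable_text) for each line."""
    pairs = []
    fence = None  # fence string that opened the current code block
    for line in lines:
        fence_match = FENCE.match(line)
        if fence is not None:
            pairs.append((line, ""))
            if fence_match and fence_match.group(1) == fence:
                fence = None
        elif fence_match:
            pairs.append((line, ""))
            fence = fence_match.group(1)
        else:
            marker = MARKER.match(line)
            if marker:
                pairs.append((marker.group(1), marker.group(2)))
            else:
                pairs.append(("", line))
    return pairs


sample = [
    "## Install guide",
    "- first item",
    "1. second item",
    "~~~",
    "# not a heading",
    "~~~",
    "Plain text.",
]
for prefix, text in split_markdown(sample):
    print(repr(prefix), repr(text))
```

Requiring whitespace after the marker keeps lines such as `**bold** text` or a `---` rule from being mistaken for list items.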
reStructuredText (--format rst)¶
Structure regions (preserved)¶
Directives (.. code-block::, .. math::, .. image::, etc.) and their indented bodies
Literal blocks (text after :: with indented content)
Section titles and underlines (===, -----, etc.)
Field lists (:Author:, :Date:, etc.)
Comments (.. without a directive)
Grid and simple tables (lines starting with | or +)
Prose regions (reflowed)¶
Paragraph text between structural elements
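In reStructuredText, indentation decides where a directive body or literal block ends. The following sketch covers only that part of the rules (directives, comments, and literal blocks); section titles, field lists, and tables are omitted. `classify_rst` is a hypothetical name, and blank lines inside or directly after an indented body are counted as part of it.

```python
import re

# Directives and comments both start with ".." followed by space or line end.
EXPLICIT = re.compile(r"^\.\.(?:\s|$)")


def classify_rst(lines):
    """Label each line as 'structure', 'prose', or 'blank'."""
    labels = []
    body_indent = None  # indent of the line that opened an indented body
    for line in lines:
        indent = len(line) - len(line.lstrip())
        blank = not line.strip()
        if body_indent is not None:
            if blank or indent > body_indent:
                labels.append("structure")
                continue
            body_indent = None  # dedent: the indented body has ended
        if blank:
            labels.append("blank")
        elif EXPLICIT.match(line.lstrip()):
            labels.append("structure")
            body_indent = indent
        elif line.rstrip().endswith("::"):
            # The paragraph itself is prose; what follows is a literal block.
            labels.append("prose")
            body_indent = indent
        else:
            labels.append("prose")
    return labels


sample = [
    "Intro paragraph::",
    "",
    "    literal text",
    "",
    "Back to prose.",
    ".. code-block:: python",
    "",
    "   x = 1",
    "",
    "Final prose.",
]
print(classify_rst(sample))
```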
Auto-detection¶
Extensions: .rst, .rest
Plaintext (--format plaintext)¶
Everything is prose. Blank lines are preserved as paragraph separators.
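As a rough illustration of the plaintext rule, the sketch below groups lines into paragraphs and keeps every blank line as a separator. The function name `paragraphs` is hypothetical, and joining lines with a single space is an assumption about what happens before sentences are re-split.

```python
def paragraphs(text):
    """Join the lines of each paragraph; keep one '' entry per blank line."""
    out, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        else:
            if current:
                out.append(" ".join(current))
                current = []
            out.append("")  # the blank line survives as a separator
    if current:
        out.append(" ".join(current))
    return out


print(paragraphs("one\ntwo\n\nthree\n"))
```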
Sentence Detection¶
snapper uses the sentence boundary rules of Unicode Standard Annex #29 (UAX #29) as a baseline, then merges false splits caused by known abbreviations:
Titles¶
Mr., Mrs., Ms., Dr., Prof., Sr., Jr., St., Rev., Gen., etc.
Academic¶
Fig., Figs., Eq., Eqs., Ref., Refs., Tab., Sec., Ch., Vol., No., Thm., Lem., Prop., Def., Cor., Rem., Ex.
Latin¶
e.g., i.e., et al., cf., etc., viz., ibid., ca., approx.
Single initials¶
A., B., C., … Z.
Date and time¶
Jan., Feb., …, Dec., Mon., Tue., …, Sun., a.m., p.m.
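The merge step can be sketched as follows. This example does not implement UAX #29: a naive regex splitter stands in for it so that the code stays self-contained, and the abbreviation set is a small subset of the lists above. The names `naive_split` and `merge_false_splits` are hypothetical.

```python
import re

# A small subset of the abbreviations listed above, lower-cased.
ABBREVIATIONS = {
    "mr.", "mrs.", "dr.", "prof.", "fig.", "eq.",
    "e.g.", "i.e.", "al.", "cf.", "vol.", "no.",
}
INITIAL = re.compile(r"^[A-Z]\.$")  # single initials such as "J."


def naive_split(text):
    """Stand-in for UAX #29: break after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def merge_false_splits(pieces):
    """Re-join pieces whose predecessor ends in a known abbreviation."""
    merged = []
    for piece in pieces:
        words = merged[-1].split() if merged else []
        if words:
            last = words[-1].lstrip("([\"'")
            if last.lower() in ABBREVIATIONS or INITIAL.match(last):
                merged[-1] = f"{merged[-1]} {piece}"
                continue
        merged.append(piece)
    return merged


text = "Dr. Smith cites Fig. 3 in J. Doe et al. for details. The proof follows."
for sentence in merge_false_splits(naive_split(text)):
    print(sentence)
```

A sketch this simple also merges a sentence that genuinely ends in an abbreviation with the one that follows it, so a real implementation would need extra context to tell the two cases apart.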