Supported Formats

snapper classifies text into prose regions (reflowed at sentence boundaries) and structure regions (passed through unchanged). The classification depends on the format.

Org-mode (--format org)

Structure regions (preserved)

  • #+BEGIN_* / #+END_* blocks (source, example, quote, etc.)

  • :PROPERTIES: / :END: drawers

  • #+KEYWORD: directives (TITLE, AUTHOR, DATE, OPTIONS, etc.)

  • Table rows (lines starting with |)

  • Comment lines (starting with # but not #+)

  • Headline stars and TODO keywords

  • List item markers (-, +, 1.)

Prose regions (reflowed)

  • Paragraph text

  • Headline text (after the stars and keyword)

  • List item text (after the marker)
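
The split between a preserved marker and reflowable text can be sketched as below. This is only an illustration, not snapper's actual implementation: the function name split_marker, both regular expressions, and the fixed TODO/DONE keyword set are assumptions (Org lets users define their own keywords).

```python
import re

# Hypothetical patterns. The keyword set is an assumption: Org allows custom keywords.
HEADLINE = re.compile(r"^(\*+ (?:(?:TODO|DONE) )?)(.*)$")
LIST_ITEM = re.compile(r"^(\s*(?:[-+]|\d+\.) )(.*)$")

def split_marker(line):
    """Return (marker, text): the marker is preserved, the text may be reflowed."""
    for pattern in (HEADLINE, LIST_ITEM):
        m = pattern.match(line)
        if m:
            return m.group(1), m.group(2)
    # No structural prefix: the whole line is prose.
    return "", line

print(split_marker("** TODO Write the intro"))  # ('** TODO ', 'Write the intro')
print(split_marker("- first item"))             # ('- ', 'first item')
```

After reflowing the text part, a formatter would re-attach the marker to the first output line.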

Inline tokens (kept atomic)

These tokens within prose are not split across lines:

  • Links: [[url][description]]

  • Inline code: ~code~, =verbatim=
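
One way to keep such tokens atomic is to tokenize prose so that each link or inline-code span becomes a single unit before any line breaking happens. The sketch below assumes a hypothetical ATOMIC pattern and tokenize function, and simplifies Org's real emphasis rules. Punctuation directly after a token becomes its own token here.

```python
import re

# Hypothetical pattern: Org links, ~code~ spans, and =verbatim= spans.
# Markers must touch non-whitespace, so "a = b = c" is not treated as verbatim.
ATOMIC = re.compile(
    r"\[\[[^\]]+\](?:\[[^\]]+\])?\]"   # [[url]] or [[url][description]]
    r"|~\S(?:[^~\n]*\S)?~"             # ~code~
    r"|=\S(?:[^=\n]*\S)?="             # =verbatim=
)

def tokenize(text):
    """Split prose on whitespace while keeping atomic spans whole."""
    tokens = []
    pos = 0
    for m in ATOMIC.finditer(text):
        tokens.extend(text[pos:m.start()].split())
        tokens.append(m.group(0))
        pos = m.end()
    tokens.extend(text[pos:].split())
    return tokens

print(tokenize("See [[https://example.org][the Org manual]] and ~git log~ now"))
# ['See', '[[https://example.org][the Org manual]]', 'and', '~git log~', 'now']
```

A line breaker that only breaks between tokens can then never split a link description or a code span that contains spaces.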

LaTeX (--format latex)

Structure regions (preserved)

  • Preamble (everything before \begin{document})

  • Non-prose environments: equation, align, figure, table, tabular, lstlisting, verbatim, minted, tikzpicture, and their starred variants

  • Display math: \[...\]

  • Comment lines (starting with %)

  • \end{document}

Prose regions (reflowed)

  • Body text between structural elements
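
The rules above can be approximated with a small line-based state machine. This is a sketch under stated assumptions, not snapper's code: the function name classify_latex and the regular expressions are hypothetical, multi-line \[...\] display math is left out for brevity, and nested environments of the same name are not handled.

```python
import re

# Environment names from the list above; the optional "*" covers starred variants.
NON_PROSE = {"equation", "align", "figure", "table", "tabular",
             "lstlisting", "verbatim", "minted", "tikzpicture"}
BEGIN = re.compile(r"\\begin\{(\w+)\*?\}")
END = re.compile(r"\\end\{(\w+)\*?\}")

def classify_latex(lines):
    """Label each line 'structure' or 'prose'."""
    labels = []
    in_body = False   # True once \begin{document} has been seen
    open_env = None   # name of the non-prose environment we are inside, if any
    for line in lines:
        stripped = line.strip()
        if not in_body:
            labels.append("structure")          # preamble
            if stripped.startswith(r"\begin{document}"):
                in_body = True
            continue
        if open_env is not None:
            labels.append("structure")
            m = END.search(line)
            if m and m.group(1) == open_env:
                open_env = None
            continue
        m = BEGIN.search(line)
        if m and m.group(1) in NON_PROSE:
            labels.append("structure")
            # Stay inside the environment unless it is closed on the same line.
            closed = any(e.group(1) == m.group(1)
                         for e in END.finditer(line, m.end()))
            if not closed:
                open_env = m.group(1)
            continue
        if stripped.startswith("%") or stripped.startswith(r"\end{document}"):
            labels.append("structure")
        else:
            labels.append("prose")
    return labels
```

Only the lines labeled "prose" would be handed to the sentence splitter. Everything else is copied through byte for byte.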

Markdown (--format markdown)

Structure regions (preserved)

  • Fenced code blocks (``` or ~~~)

  • Front matter (--- or +++ delimited at file start)

  • Heading markers (#, ##, etc.)

  • List item markers (-, *, +, 1.)

Prose regions (reflowed)

  • Paragraph text

  • Heading text (after the marker)

  • List item text (after the marker)
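
Fenced blocks and front matter are both delimiter-toggled regions, so a minimal sketch only needs to track whether a delimiter is currently open. The function name classify_markdown is hypothetical. The sketch treats every opening fence as three characters long (CommonMark allows longer fences), and it labels heading and list lines as prose without splitting off their markers.

```python
def classify_markdown(lines):
    """Label each line 'structure' or 'prose' (fences and front matter only)."""
    labels = []
    fence = None              # opening fence ("```" or "~~~") while inside a code block
    front_delimiter = None    # "---" or "+++" while inside front matter
    for i, line in enumerate(lines):
        stripped = line.strip()
        if i == 0 and stripped in ("---", "+++"):
            front_delimiter = stripped
            labels.append("structure")
            continue
        if front_delimiter is not None:
            labels.append("structure")
            if stripped == front_delimiter:
                front_delimiter = None
            continue
        if fence is not None:
            labels.append("structure")
            # A closing fence consists only of the fence character.
            if stripped.startswith(fence) and stripped.strip(fence[0]) == "":
                fence = None
            continue
        if stripped.startswith("```") or stripped.startswith("~~~"):
            fence = stripped[:3]
            labels.append("structure")
            continue
        labels.append("prose")
    return labels
```

Remembering which fence string opened the block matters: a ~~~ line inside a block opened with backticks is code, not a closing fence.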

reStructuredText (--format rst)

Structure regions (preserved)

  • Directives (.. code-block::, .. math::, .. image::, etc.) and their indented bodies

  • Literal blocks (text after :: with indented content)

  • Section titles and underlines (===, -----, etc.)

  • Field lists (:Author:, :Date:, etc.)

  • Comments (.. without a directive)

  • Grid and simple tables (lines starting with | or +)

Prose regions (reflowed)

  • Paragraph text between structural elements
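
Unlike the other formats, reStructuredText marks the extent of a directive or literal block by indentation, not by a closing delimiter. The sketch below illustrates just that rule. The function name classify_rst is hypothetical, and section titles, field lists, and tables are deliberately not covered.

```python
def classify_rst(lines):
    """Label each line 'structure' or 'prose' (directives and literal blocks only)."""
    labels = []
    body_indent = None   # indent of the line that opened an indented body
    for line in lines:
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if body_indent is not None:
            # Blank lines and deeper-indented lines belong to the body.
            if not stripped or indent > body_indent:
                labels.append("structure")
                continue
            body_indent = None   # a dedented line ends the body
        if stripped == ".." or stripped.startswith(".. "):
            labels.append("structure")   # directive or comment line
            body_indent = indent
        elif stripped.endswith("::"):
            labels.append("prose")       # paragraph that introduces a literal block
            body_indent = indent
        else:
            labels.append("prose")
    return labels
```

In this sketch, blank lines that trail an indented body are also labeled "structure", which is harmless because blank lines are passed through either way.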

Auto-detection

Extensions: .rst, .rest

Plaintext (--format plaintext)

Everything is prose. Blank lines are preserved as paragraph separators.
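
As a rough illustration of that rule, the hypothetical helper below joins the lines of each paragraph and keeps every blank line as its own separator entry. It is a sketch, not snapper's code.

```python
def split_paragraphs(text):
    """Group lines into paragraphs; each blank line is kept as an empty entry."""
    blocks = []
    current = []
    for line in text.split("\n"):
        if line.strip():
            current.append(line.strip())
        else:
            if current:
                blocks.append(" ".join(current))
                current = []
            blocks.append("")   # preserve the blank line as a separator
    if current:
        blocks.append(" ".join(current))
    return blocks

print(split_paragraphs("First line\nsecond line.\n\nNext paragraph."))
# ['First line second line.', '', 'Next paragraph.']
```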

Sentence Detection

snapper uses Unicode UAX #29 sentence boundary detection as a baseline, then merges false splits caused by known abbreviations:

Titles

Mr., Mrs., Ms., Dr., Prof., Sr., Jr., St., Rev., Gen., etc.

Academic

Fig., Figs., Eq., Eqs., Ref., Refs., Tab., Sec., Ch., Vol., No., Thm., Lem., Prop., Def., Cor., Rem., Ex.

Latin

e.g., i.e., et al., cf., etc., viz., ibid., ca., approx.

Single initials

A., B., C., … Z.

Date and time

Jan., Feb., …, Dec., Mon., Tue., …, Sun., a.m., p.m.
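
The merge step can be sketched as follows. Python's standard library has no UAX #29 segmenter, so naive_split below is only a stand-in for the baseline. All function names, and the small abbreviation subset, are assumptions made for illustration.

```python
import re

# A small subset of the abbreviations listed above.
ABBREVIATIONS = {"Dr.", "Prof.", "Fig.", "Eq.", "e.g.", "i.e.", "cf."}

def naive_split(text):
    """Stand-in for the UAX #29 baseline: split after '.', '!' or '?' plus whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)

def merge_false_splits(pieces):
    """Rejoin pieces whose last word is a known abbreviation or a single initial."""
    merged = []
    carry = ""
    for piece in pieces:
        if not piece.strip():
            continue
        candidate = f"{carry} {piece}" if carry else piece
        last_word = candidate.split()[-1].lstrip("([")
        is_initial = re.fullmatch(r"[A-Z]\.", last_word) is not None
        if last_word in ABBREVIATIONS or is_initial:
            carry = candidate          # false split: keep accumulating
        else:
            merged.append(candidate)
            carry = ""
    if carry:
        merged.append(carry)
    return merged

text = "See Fig. 3 for details. Dr. Smith agreed, e.g. in the appendix. It works."
print(merge_false_splits(naive_split(text)))
# ['See Fig. 3 for details.', 'Dr. Smith agreed, e.g. in the appendix.', 'It works.']
```

The trade-off of this approach is that a sentence that really does end in an abbreviation or an initial gets merged with the one after it.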