Neural Sentence Detection

Overview

Snapper includes a neural sentence boundary detector via nnsplit, a byte-level LSTM model that runs on tract, a pure-Rust ONNX inference engine.

The rule-based detector (default) handles English academic text well: abbreviations (Dr., Fig., Eq.), inline tokens (links, math), and standard punctuation. The neural detector handles languages where rule-based splitting falls short.
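To illustrate the rule-based approach, here is a minimal sketch: split on sentence-final punctuation, but suppress breaks after known abbreviations. The abbreviation list and function are hypothetical illustrations, not Snapper's actual 80+ rules.

```python
import re

# Hypothetical illustration of rule-based splitting. NOT Snapper's
# actual rule set -- just the core idea: punctuation followed by
# whitespace ends a sentence, unless the preceding token is a
# known abbreviation.
ABBREVIATIONS = {"Dr.", "Fig.", "Eq.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Token ending at the punctuation mark.
        prefix = text[start:match.start() + 1]
        words = prefix.split()
        if words and words[-1] in ABBREVIATIONS:
            continue  # abbreviation, not a sentence boundary
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences
```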

When to use neural

  • Non-English text (German, French, Chinese, Russian, Turkish, etc.)

  • Text with unusual punctuation patterns

  • Mixed-language documents where abbreviation lists do not cover all languages

For English academic papers, the rule-based detector produces better results (faster, no false breaks on abbreviations like “Fig.”).

Basic usage

snapper --neural paper.org

On first run, the English model (~4MB) downloads and caches to ~/.cache/nnsplit/en/. Subsequent runs load from cache.

Non-English languages

Use --lang to select a language model:

snapper --neural --lang de paper_german.tex
snapper --neural --lang fr article_french.md
snapper --neural --lang zh document_chinese.org

Available languages: en, de, fr, no, sv, zh, tr, ru, uk.

Each model downloads on first use (~4MB each) and caches to ~/.cache/nnsplit/<lang>/.
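A sketch of the cache layout described above, assuming an XDG-style cache directory. The helper name is illustrative, not Snapper's actual code.

```python
import os
from pathlib import Path

# Hypothetical sketch of the per-language cache layout:
# models land in ~/.cache/nnsplit/<lang>/ (respecting
# XDG_CACHE_HOME if set).
def model_cache_dir(lang: str) -> Path:
    base = Path(os.environ.get("XDG_CACHE_HOME", str(Path.home() / ".cache")))
    return base / "nnsplit" / lang
```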

Custom models

Load a custom ONNX model file:

snapper --neural --model-path /path/to/custom_model.onnx paper.org

Custom models must follow the nnsplit ONNX format (byte-level input, sigmoid output, split_sequence metadata).
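A sketch of what "byte-level input, sigmoid output" means in practice: the model emits one probability per input byte, and a probability above a threshold marks a sentence boundary after that byte. The function and the 0.5 threshold are assumptions for illustration, not the nnsplit format specification.

```python
# Hypothetical sketch of consuming a byte-level model's output.
# `probs` holds one sigmoid probability per byte of `data`; a
# value above the threshold ends a sentence after that byte.
def split_by_probs(data: bytes, probs, threshold=0.5):
    sentences, start = [], 0
    for i, p in enumerate(probs):
        if p > threshold:
            sentences.append(data[start:i + 1].decode("utf-8", "replace"))
            start = i + 1
    if start < len(data):
        sentences.append(data[start:].decode("utf-8", "replace"))
    return sentences
```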

Performance comparison

Mode                   Speed         Abbreviation handling    Best for
Rule-based (default)   ~5ms/file     Excellent (80+ rules)    English academic text
Neural (--neural)      ~200ms/file   Model-dependent          Non-English, mixed text

The rule-based detector starts instantly. The neural detector loads the model on first invocation (~100-500ms), then processes text at ~200ms per typical academic file.
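The one-time load cost amortizes because the model is loaded on first use and then reused. A minimal sketch of that load-on-first-use pattern (the function names are illustrative, not Snapper's API):

```python
import functools

# Sketch of load-on-first-use: the model is deserialized once per
# language, on the first call, and cached for subsequent calls.
@functools.lru_cache(maxsize=None)
def load_model(lang: str):
    # Stand-in for reading the ONNX model from the cache directory.
    return {"lang": lang, "loaded": True}

def detect(text: str, lang: str = "en"):
    model = load_model(lang)  # pays the load cost only once per lang
    return model, text
```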

Combining with format-aware parsing

Neural detection replaces only the sentence splitting step. Format-aware parsing (code blocks, math, drawers, tables) and inline token protection (links, math, code) still apply.

# Neural splitting + Org-mode format awareness
snapper --neural --format org paper.org

# Neural splitting + LaTeX format awareness
snapper --neural --format latex paper.tex
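To illustrate how inline token protection composes with either splitter, here is a hypothetical sketch: inline tokens (just backtick code spans in this example) are swapped for placeholders before splitting, so no splitter can break inside them, then restored afterwards. The placeholder scheme is invented for illustration.

```python
import re

# Hypothetical sketch: stash inline `code` spans behind opaque
# placeholders before sentence splitting, restore them after.
def protect_inline_code(text):
    tokens = []
    def stash(m):
        tokens.append(m.group(0))
        return f"\x00{len(tokens) - 1}\x00"
    return re.sub(r"`[^`]*`", stash, text), tokens

def restore(text, tokens):
    return re.sub(r"\x00(\d+)\x00", lambda m: tokens[int(m.group(1))], text)
```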