Neural Sentence Detection
Overview
Snapper includes a neural sentence boundary detector powered by nnsplit, a byte-level LSTM model that runs on tract, a pure-Rust ONNX inference engine.
The rule-based detector (default) handles English academic text well: abbreviations (Dr., Fig., Eq.), inline tokens (links, math), and standard punctuation. The neural detector handles languages where rule-based splitting falls short.
When to use neural
- Non-English text (German, French, Chinese, Russian, Turkish, etc.)
- Text with unusual punctuation patterns
- Mixed-language documents where abbreviation lists do not cover all languages
For English academic papers, the rule-based detector produces better results (faster, no false breaks on abbreviations like “Fig.”).
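To make the trade-off concrete, here is a toy sketch of the rule-based approach: protect a small abbreviation list, then split on sentence-final punctuation before a capital letter. This is purely illustrative (the function name and the tiny abbreviation set are invented here); Snapper's actual detector uses 80+ rules plus inline-token protection.

```python
import re

# Illustrative only -- NOT Snapper's implementation. A real detector
# needs a much larger abbreviation list and inline-token handling.
ABBREVIATIONS = {"Dr.", "Fig.", "Eq.", "et al.", "e.g.", "i.e."}

def split_sentences(text: str) -> list[str]:
    sentences, start = [], 0
    # Candidate boundary: . ! or ? followed by whitespace and a capital.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        end = match.start() + 1  # keep the punctuation with the sentence
        # Suppress the break if the dot belongs to a known abbreviation.
        if any(text[:end].endswith(abbr) for abbr in ABBREVIATIONS):
            continue
        sentences.append(text[start:end])
        start = match.end()
    sentences.append(text[start:])
    return sentences

print(split_sentences("See Fig. 3 for details. Dr. Smith agrees. Done."))
```

Note how "Fig." and "Dr." survive intact while the real boundaries are found; a neural model must learn the same distinction from data, which is why rule-based wins on English academic text.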
Basic usage
snapper --neural paper.org
On first run, the English model (~4MB) downloads and caches to ~/.cache/nnsplit/en/.
Subsequent runs load from cache.
Non-English languages
Use --lang to select a language model:
snapper --neural --lang de paper_german.tex
snapper --neural --lang fr article_french.md
snapper --neural --lang zh document_chinese.org
Available languages: en, de, fr, no, sv, zh, tr, ru, uk.
Each model downloads on first use (~4MB each) and caches to ~/.cache/nnsplit/<lang>/.
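If you need to check whether a model is already cached (for example, before running offline), the documented layout makes the path predictable. The helper below is a hypothetical convenience, not part of Snapper; it simply mirrors the ~/.cache/nnsplit/<lang>/ location stated above.

```python
from pathlib import Path

def model_cache_dir(lang: str = "en") -> Path:
    # Mirrors the documented cache layout: ~/.cache/nnsplit/<lang>/
    return Path.home() / ".cache" / "nnsplit" / lang

cache = model_cache_dir("de")
print(cache, "exists" if cache.exists() else "not downloaded yet")
```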
Custom models
Load a custom ONNX model file:
snapper --neural --model-path /path/to/custom_model.onnx paper.org
Custom models must follow the nnsplit ONNX format (byte-level input, sigmoid output, split_sequence metadata).
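The byte-level-input, sigmoid-output contract means the model emits one split probability per input byte. As a rough sketch of what decoding such output looks like (the function name and threshold are illustrative assumptions, not nnsplit's actual API), thresholded probabilities cut the byte stream into sentences:

```python
def decode_splits(text: str, probs: list[float], threshold: float = 0.5) -> list[str]:
    # probs[i] is the model's sigmoid output for byte i: the probability
    # that a sentence boundary falls immediately after that byte.
    # Illustrative decoding only -- not nnsplit's real split_sequence logic.
    data = text.encode("utf-8")
    sentences, start = [], 0
    for i, p in enumerate(probs):
        if p >= threshold:
            sentences.append(data[start : i + 1].decode("utf-8", errors="replace"))
            start = i + 1
    if start < len(data):
        sentences.append(data[start:].decode("utf-8", errors="replace"))
    return sentences

text = "Hallo Welt. Wie geht's?"
probs = [0.0] * len(text.encode())
probs[10] = 0.9  # model assigns a high split probability after the '.'
print(decode_splits(text, probs))
```

Working on raw bytes rather than a language-specific tokenizer is what lets one architecture cover scripts as different as Latin and Chinese.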
Performance comparison
| Mode | Speed | Abbreviation handling | Best for |
|---|---|---|---|
| Rule-based (default) | ~5ms/file | Excellent (80+ rules) | English academic text |
| Neural (--neural) | ~200ms/file | Model-dependent | Non-English, mixed text |
The rule-based detector starts instantly. The neural detector loads the model on first invocation (~100-500ms), then processes text at ~200ms per typical academic file.
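Because the model load is a one-time cost, it amortizes over a batch. A back-of-the-envelope estimate using the figures above (300ms is an assumed midpoint of the 100-500ms load range):

```python
def estimated_runtime_ms(n_files: int, load_ms: float = 300.0, per_file_ms: float = 200.0) -> float:
    # One-time model load plus per-file processing; figures taken from
    # the table above, load_ms is an assumed midpoint of 100-500ms.
    return load_ms + n_files * per_file_ms

print(estimated_runtime_ms(10))  # a 10-file batch pays the load cost once
```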
Combining with format-aware parsing
Neural detection replaces only the sentence splitting step. Format-aware parsing (code blocks, math, drawers, tables) and inline token protection (links, math, code) still apply.
# Neural splitting + Org-mode format awareness
snapper --neural --format org paper.org
# Neural splitting + LaTeX format awareness
snapper --neural --format latex paper.tex