A modern replacement for spell(1)
Abhinav Upadhyay <abhinav@NetBSD.org>
AsiaBSDCon 2017
Problems with old spell
Problems with old spell
Very ancient - dates back to Unix version 7
Problems with old spell
Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Problems with old spell
Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like
appled
,
coffeed
, Â
undoubt
,
repremanded, undoubtedlys
 are not caught
Problems with old spell
Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like
appled
,
coffeed
, Â
undoubt
,
repremanded, undoubtedlys
 are not caught
Rules only work for English language
Problems with old spell
Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like
appled
,
coffeed
, Â
undoubt
,
repremanded, undoubtedlys
 are not caught
Rules only work for English language
No support for spell corrections
Problems with old spell
Very ancient - dates back to Unix version 7
Uses inflection rules to do spell check
Rules not 100% accurate, for example mis-spellings like
appled
,
coffeed
, Â
undoubt
,
repremanded, undoubtedlys
 are not caught
Rules only work for English language
No support for spell corrections
No library interface for other applications to add spell check support - shells, pkgin, pkg_add, apropos could benefit
A modern spell(1)
A modern spell(1)
Uses a expanded dictionary instead of inflection rules (size of new dictionary 5.1M compared to 2.4 M of the old dictionary)
A modern spell(1)
Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections
A modern spell(1)
Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections
Also support for n-gram models to do context sensitive corrections and grammar checks (experimental WIP)
A modern spell(1)
Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections
Also support for n-gram models to do context sensitive corrections and grammar checks (experimental WIP)
A tool to parse any corpus and generate dictionary to do application specific spell check
A modern spell(1)
Uses a expanded dictionary instead of inflection rules
Levenshtein distance and soundex techniques used for finding possible corrections
Also support for n-gram models to do context sensitive corrections and grammar checks (experimental WIP)
A tool to parse any corpus and generate dictionary to do localized or application specific spell check
The core spell checking and correction functionality available as a reusable library
How spell correction works
Levenshtein distance - minimum number of edits required to convert one string into another
Generate all possible words at distance 1 or 2 and see which ones of them are in the dictionary
Lower weight to corrections involving a change in the 1st character or replacement of a character givenÂ
Higher weight to corrections having the same soundex code
On no match at distance 1, same process done at distance 2
If still no match, word having the same soundex code with minimum edit distance selected.
Support for other languages
Levenshtein distance is language agnostic
Dictionary for any language can be generated and used for spell checking
But before that some work needed to add support for wide chars
Comparison with GNU aspell
Total number of tests: 3945
Matches at first place: 91.33% (aspell 74%)
Matches at positions 1-5: 95.26% (aspell 96.6%)
Matches at positions 1-10: 95.59% (aspell 98.2%)
Matches at positions 1-25: 95.77% (aspell 99%)
Matches at positions 1-50: 95.84% (aspell 99.2%)
Matches at 1-100: 95.92% (aspell 99.2%)
Questions?
Thank you :-)