I’ve taken some time implementing Peter Norvig’s spelling corrector in an attempt to learn Groovy, a dynamic language that compiles to bytecode and is compatible with standard Java classes and libraries.
There are a couple differences (most likely deficiencies) with my implementation. First, I use a list instead of a set when constructing the candidate word list. Second, I created a separate occurrence function in order to provide the smoothing capability for our occurrence distribution. Third, I didn’t really care much for a low line count. It’s not the LOC that matter in the end, it’s how easily you can comprehend the code!
public class SpellingCorrector { def wordoccur = [:] def words(File file) { Scanner scanner = new Scanner(file) def words = scanner.findAll{ x -> x.toLowerCase() ==~ ~/[a-z]+/ } } def train(List words) { words.each { wordoccur[it] = wordoccur.containsKey(it) ? wordoccur[it] + 1 : 1 } } def edits1(String word) { def results = [] int n = word.length() //Deletion. Remove a character. for (i in 0..<n) results << word[0..<i] + word[i+1..<n] //Transposition. Swap adjacent characters. for (i in 0..<n-1) results << word[0..<i] + word[i+1] + word[i] + word[i+2..<n] //Alteration. Change one character for another letter. for (i in 0..<n) for (c in 'a'..'z') results << word[0..<i] + c + word[i+1..<n] //Insertion. Add a letter in between the others. for (i in 0..<n) for (c in 'a'..'z') results << word[0..<i] + c + word[i+1..<n] return results } def knownedits2(String word) { def candidates = [] edits1(word).each { candidates.addAll( edits1(it).findAll { wordoccur.containsKey(it) } ) } return candidates } /** * Smoothing distribution. If the word hasn't been encountered (novel words), * we give it an occurence value of 1. */ def int occurrence(String word) { return wordoccur[word] == null ? 1 : wordoccur[word]; } def List known(List words) { return words.findAll { wordoccur.containsKey(it.toLowerCase()) } } def correct(String word) { def candidates = [word] + known([word]) + known(edits1(word)) + knownedits2(word) return candidates.max { occurrence(it) } } }
First, we don’t attempt to split words into two sub-words. For example, a common typo may be “Ihave” rather than “I have”. Second, the training and known function can definitely be improved to with support for proper nouns, stemming, and more. I think it would be a fun exercise to try and to create a simple implementation of these features, much like the SpellingCorrector.
So, Groovy has great support for regular expressions, list construction and compositions and best of all, closures! I also had a chance to play with the Groovy NodeBuilder (on a separate program), which is a great way for constructing tree structures. All said and done, I would hate to implement this in Java.








