Yet Another Groovy Spelling Corrector

I’ve taken some time implementing Peter Norvig’s spelling corrector in an attempt to learn Groovy, a dynamic language that compiles to bytecode and is compatible with standard Java classes and libraries.

There are a couple differences (most likely deficiencies) with my implementation. First, I use a list instead of a set when constructing the candidate word list. Second, I created a separate occurrence function in order to provide the smoothing capability for our occurrence distribution. Third, I didn’t really care much for a low line count. It’s not the LOC that matter in the end, it’s how easily you can comprehend the code! :)

public class SpellingCorrector {
  def wordoccur = [:]
 
  def words(File file) {
    Scanner scanner = new Scanner(file)
    def words = scanner.findAll{ x -> x.toLowerCase() ==~ ~/[a-z]+/  }
  }
 
  def train(List words) {
    words.each {
      wordoccur[it] = wordoccur.containsKey(it) ? wordoccur[it] + 1 : 1
    }
  }
 
  def edits1(String word) {
    def results = []
    int n = word.length()
    //Deletion. Remove a character.
    for (i in 0..<n)
      results << word[0..<i] + word[i+1..<n]
 
    //Transposition. Swap adjacent characters.
    for (i in 0..<n-1)
      results << word[0..<i] + word[i+1] + word[i] + word[i+2..<n]
 
    //Alteration. Change one character for another letter.
    for (i in 0..<n)
      for (c in 'a'..'z')
        results << word[0..<i] + c + word[i+1..<n]
 
    //Insertion. Add a letter in between the others.
    for (i in 0..<n)
      for (c in 'a'..'z')
        results << word[0..<i] + c + word[i+1..<n]
 
    return results
  }
 
  def knownedits2(String word) {
    def candidates = []
    edits1(word).each {
      candidates.addAll( edits1(it).findAll { wordoccur.containsKey(it) } )
    }
    return candidates
  }
 
  /**
   * Smoothing distribution. If the word hasn't been encountered (novel words),
   *  we give it an occurence value of 1.
   */
  def int occurrence(String word) {
    return wordoccur[word] == null ? 1 : wordoccur[word];  
  }
 
  def List known(List words) {
    return words.findAll { wordoccur.containsKey(it.toLowerCase()) }
  }
 
  def correct(String word) {
    def candidates = [word] + known([word]) + known(edits1(word)) + knownedits2(word)
    return candidates.max {  occurrence(it) }
  }
}

First, we don’t attempt to split words into two sub-words. For example, a common typo may be “Ihave” rather than “I have”. Second, the training and known function can definitely be improved to with support for proper nouns, stemming, and more. I think it would be a fun exercise to try and to create a simple implementation of these features, much like the SpellingCorrector.

So, Groovy has great support for regular expressions, list construction and compositions and best of all, closures! I also had a chance to play with the Groovy NodeBuilder (on a separate program), which is a great way for constructing tree structures. All said and done, I would hate to implement this in Java.

Leave a Comment

(required)
(will not be published) (required)