Introduction

String Manipulations

[ word [ 0 : i ]+ c + word [ i :]

( string-append ( substring word 0 i ) ( string c ) ( substring word i ( string-length word )))

( concat word 0 i c word i _ )

The Spelling Corrector

Reading words from a file

> ( regexp-split #rx "[^a-zA-Z]" "foo bar baz" )

( "foo" "bar" "baz" )

( define ( words text )

( list-ec ( : word ( regexp-split #rx "[^a-zA-Z]" text ))

( string-downcase ( bytes->string/latin-1 word ))))



Counting words

( define ( train features )

( let ([ model ( make-hash-table ' equal )])

( do-ec ( : word features )

( hash-table-put! model word ( add1 ( hash-table-get model word 1 ))))

model ))



( define NWORDS ( train ( words ( open-input-file "small.txt" ))))



( define ( word-count word )

( hash-table-get NWORDS word 0 ))



( define ( known? word )

( hash-table-get NWORDS word #f ))



The set of strings with edit distance 1



( concat word 0 i word ( + i 1 ) _ )



( set-ec string-compare ( : i n ) ( concat word 0 i word ( + i 1 ) _ ))



( define ( edits1 word )

( define alphabet ( string-ec ( : c #\a #\z ) c ))

( define n ( string-length word ))

( union*

( set-ec string-compare ( : i n ) ( concat word 0 i word ( + i 1 ) _ ))

( set-ec string-compare ( : i ( - n 1 )) ( concat word 0 i word ( + i 1 ) word i word ( + i 2 ) _ ))

( set-ec string-compare ( : i n ) ( : c alphabet ) ( concat word 0 i c word ( + i 1 ) _ ))

( set-ec string-compare ( : i n ) ( : c alphabet ) ( concat word 0 i c word i _ ))))



( define ( union* . sets )

( foldl union ( empty string-compare ) sets ))



The set of strings with edit distance 2

( define ( edits2 word )

( set-ec string-compare

( : e1 ( elements ( edits1 word )))

( : e2 ( elements ( edits1 e1 )))

e2 ))



( define ( known words )

( set-ec string-compare

( :set w words ) ( if ( known? w )) w ))



( define ( known-edits2 word )

( set-ec string-compare

( :set e1 ( edits1 word ))

( :set e2 ( edits1 e1 ))

( if ( known? e2 ))

e2 ))



Maximum

( define ( maximum xs key )

( define max-key -inf.0 )

( define max-x #f )

( do-ec ( : x xs )

( :let k ( key x ))

( if ( > k max-key ))

( begin

( set! max-key k )

( set! max-x x )))

max-x )



Correcting a word

( define ( correct word )

( define ( falsify s ) ( if ( empty? s ) #f s ))

( let ([ candidates

( or ( falsify ( known ( set word )))

( falsify ( known ( edits1 word )))

( falsify ( known-edits2 word ))

( falsify ( list word )))])

( maximum ( elements candidates ) word-count )))

> ( correct "hose" )

"house"



Last Remarks

Remember to leave a comment

A Spelling Corrector in PLT Scheme

( require ( lib "42.ss" "srfi" )

( lib "67.ss" "srfi" )

( planet "set.scm" ( "soegaard" "galore.plt" 2 ( = 2 ))))



( define ( words text )

( list-ec ( : word ( regexp-split #rx "[^a-zA-Z]" text ))

( string-downcase ( bytes->string/latin-1 word ))))



( define ( train features )

( let ([ model ( make-hash-table ' equal )])

( do-ec ( : word features )

( hash-table-put! model word ( add1 ( hash-table-get model word 1 ))))

model ))



( define file "small.txt" )

( define NWORDS ( train ( words ( open-input-file file ))))



( define ( word-count word )

( hash-table-get NWORDS word 0 ))



( define ( known? word )

( hash-table-get NWORDS word #f ))



( define ( known words )

( set-ec string-compare

( :set w words ) ( if ( known? w )) w ))



( define ( concat-it spec )

( define ( underscore? o ) ( eq? o ' _ ))

( define ( subs spec )

( match spec

[() ' ()]

[(( ? string? s ) ( ? number? n1 ) ( ? number? n2 ) . spec )

( cons ( substring s n1 n2 ) ( subs spec ))]

[(( ? string? s ) ( ? number? n1 ) ( ? underscore? _ ) . spec )

( cons ( substring s n1 ( string-length s )) ( subs spec ))]

[(( ? string? s ) ( ? underscore? _ ) ( ? number? n2 ) . spec )

( cons ( substring s 0 n2 ) ( subs spec ))]

[(( ? string? s ) ( ? number? n ) . spec )

( cons ( string ( string-ref s n )) ( subs spec ))]

[(( ? symbol? s ) . spec )

( cons ( symbol->string s ) ( subs spec ))]

[(( ? char? c ) . spec )

( cons ( string c ) ( subs spec ))]

[(( ? string? s ) . spec )

( cons s ( subs spec ))]

[ else ( error )]))

( apply string-append ( subs spec )))



( define-syntax ( concat stx )

( syntax-case stx ( _ )

[( string-it spec ...)

# ` ( concat-it

( list # ,@ ( map ( lambda ( so )

( syntax-case so ( _ )

[ _ # ' ' _ ]

[ else so ]))

( syntax->list # ' ( spec ...)))))]))



( define ( union* . sets )

( foldl union ( empty string-compare ) sets ))



( define ( edits1 word )

( define alphabet ( string-ec ( : c #\a #\z ) c ))

( define n ( string-length word ))

( union*

( set-ec string-compare ( : i n ) ( concat word 0 i word ( + i 1 ) _ ))

( set-ec string-compare ( : i ( - n 1 )) ( concat word 0 i word ( + i 1 ) word i word ( + i 2 ) _ ))

( set-ec string-compare ( : i n ) ( : c alphabet ) ( concat word 0 i c word ( + i 1 ) _ ))

( set-ec string-compare ( : i n ) ( : c alphabet ) ( concat word 0 i c word i _ ))))



( define ( edits2 word )

( set-ec string-compare

( : e1 ( elements ( edits1 word )))

( : e2 ( elements ( edits1 e1 )))

e2 ))



( define ( known-edits2 word )

( set-ec string-compare

( :set e1 ( edits1 word ))

( :set e2 ( edits1 e1 ))

( if ( known? e2 ))

e2 ))



( define ( maximum xs key )

( define max-key -inf.0 )

( define max-x #f )

( do-ec ( : x xs )

( :let k ( key x ))

( if ( > k max-key ))

( begin

( set! max-key k )

( set! max-x x )))

max-x )



( define ( correct word )

( define ( falsify s ) ( if ( empty? s ) #f s ))

( let ([ candidates

( or ( falsify ( known ( set word )))

( falsify ( known ( edits1 word )))

( falsify ( known-edits2 word ))

( falsify (list->set ( list word ))))])

( maximum ( elements candidates ) word-count )))



[Remember to leave a comment: Was this post silly? Enlightning? Old news?]Peter Norvig recently wrote a great piece on How to Write a Spelling Corrector . Since Norvig used Python, Shiro decided to write a version in Gauche Scheme. In the following I'll present a solution in PLT Scheme. But first let's look at string manipulations.Scheme offers the usual operations on strings: string concatenation (string-append), referencing a character in a string (string-ref), extracting a substring (substring) and converting characters to strings (string). Compared to other languages code for string manipulations tend to become verbose, unless one "cheats" and uses some sort of utility such as format. [The advantage of this verboseness is that Scheme compilers can generate efficient code.]As an example consider this expression from Norvig's spelling corrector:Here word[0:i] is the substring from index 0 (inclusive) to index i (exclusive).The c is a character, and word[i:] is the substring from index i to the end.The concatenation operator + converts automatically the character c to a string before the strings are concatenated.A literal translation to Scheme reads as follows:Ouch!A utility is clearly needed. Below follows my shot at one such utility, namely, concat. It allows us to write the expression as:A string followed by two numbers is a substring, so word 0 i is short for (substring word 0 i). An underscore is allowed instead of a number either to indicate the beginning or the end. Thus word i _ is short for (substring word i (string-length word)). Finally a character is automatically converted to a string, so c is short for (string c). Not used in the example, but useful in general: A string followed by a single number, like word i, is short for (string-ref word i).The code for concat is found at the bottom of this post.There are two parts of the spelling corrector. The first part reads in correctly spelled words from a corpus, and counts the number of occurences of each word. The second part uses the training data to check whether a given word is known, or if it is incorrectly spelled to find the intended word.We can split a string in words with the help of regexp-split.Taking advantage of the fact that PLT Scheme's regular expression functions work both on strings as well as directly on input ports, we can call regexp-split directly on the port.Now we'll count how many times each word is seen. We'll use a hash table to hold words and counts.Let's read in the data from the small corpus, and at the same time define word-count that returns the count of a word, and known? that checks whether a word was seen in the training set.When a word is incorrectly spelled, we want to suggest a "similar" word known to be correctly spelled.Given a word w, the set of words with edit distance 1 consists of the words, we can generate from w by either deleting one character, or transposing to character, or altering one character, or inserting one character.Deleting the character with index i from a word is easy with the help of concat:If n is the length of w, then the set of strings generated by deleting a single character is given by:Here (: i n) makes i run through the numbers 0, 1, ... n-1. For each i the string (concat word 0 i word (+ i 1) _) is calculated, and all strings are collected in a set.All words with edit distance 1 is given by:Given a word w, the set of words with edit distance 2 is the set generated by 2 deletions, transponations, alterations, or insertions. Luckily we have done the hard work in edits1.There is just one catch. The set is awfully large, so instead, we'll concentrate on correctly spelled (known) words with a edit distance of two:If we have more than one suggestion for an incorrectly spelled word, we want to find the most common of them - that is the one with the largest word count in the training set. Python's max has a nice feature, where given a list of xs, it finds the x that maximizes f(x), for a function f. Here is a Scheme version:To correct a word, we first check whether it is known. We try to find a known word with edit distance 1. If unsuccessful we try to find one with edit distance 2.Let's try it:Were it not for the concat macro, the function edits1 would have been clumsy. I am not sure concat is the answer. Thanks to macros, one can at least experiment with language extensions in Scheme.Remember to read Norvig's piece, where he explains the math behind.[Hmm. Should I turn it into a math exercise for my class?]: Was this post silly? Enlightning? Old news?]