Today’s post discusses performing word stemming with F#. This will be an expansion on a previous post, Comparing Quora question intent. As a result, it will also address some feature engineering.

For those not familiar with word stems: in this context, a stem is essentially a word’s base, with suffixes removed. Stems are helpful when comparing text, especially content-heavy data, which aligns well with the Quora question comparisons. The Annytab.Stemmer library meets these needs well.

Before getting started, everything here will be an enhancement of existing code from the Kaggle Quora duplicate questions post.

First, add the Annytab.Stemmer package to the project by adding it to paket.dependencies. Then open the namespaces and create a stemmer object.


nuget Annytab.Stemmer




#r "../packages/Annytab.Stemmer/lib/netstandard1.4/Annytab.Stemmer.dll"



open Annytab

open Annytab.Stemmer



let stemmer = EnglishStemmer()



Now that the basic components are in place, I can provide a simple stem example.


let sentence1 = "When birds fly, they are soaring above the trees while people are watching and talking"

let sentence2 = "When birds are flying, they soar above the trees while people watch and talk"



let sentenceToWords (s:string) = s.Split([|' '|])

let sentence1Words = sentenceToWords sentence1

let sentence2Words = sentenceToWords sentence2



let matches = Set.intersect (set sentence1Words) (set sentence2Words)

printfn "Matches: %A" (Set.count matches)



let sentenceToStemWords (s:string) =
    sentenceToWords s
    |> stemmer.GetSteamWords



let sentence1StemWords = sentenceToStemWords sentence1

let sentence2StemWords = sentenceToStemWords sentence2



let stemMatches = Set.intersect (set sentence1StemWords) (set sentence2StemWords)

printfn "Stem Matches: %A" (Set.count stemMatches)



printfn "sentence1: %A" sentence1StemWords

printfn "sentence2: %A" sentence2StemWords

printfn "Matches : %d\nStem Matches: %d" (Set.count matches) (Set.count stemMatches)



Here are the results. Notice that the stemmed word lists contain only the bases: birds -> bird, watching -> watch, and so on. This allows concepts to be matched more effectively.


sentence1: [|"when"; "bird"; "fly,"; "they"; "are"; "soar"; "abov"; "the"; "tree"; "while";
  "peopl"; "are"; "watch"; "and"; "talk"|]
sentence2: [|"when"; "bird"; "are"; "flying,"; "they"; "soar"; "abov"; "the"; "tree";
  "while"; "peopl"; "watch"; "and"; "talk"|]



> printfn "Matches : %d\nStem Matches: %d" (Set.count matches) (Set.count stemMatches)

Matches : 10

Stem Matches: 13
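One detail visible in the output above: because sentenceToWords splits only on spaces, punctuation stays attached, so "fly," and "flying," never reduce to a common stem. A minimal sketch of a tokenizer that also trims punctuation before stemming (this helper is my own addition, not part of the original pipeline):

```fsharp
// Split on spaces, then trim common trailing/leading punctuation
// from each token so "fly," becomes "fly" before it reaches the stemmer.
let sentenceToCleanWords (s: string) =
    s.Split([|' '|])
    |> Array.map (fun w -> w.Trim([|','; '.'; '?'; '!'; ';'; ':'|]))

sentenceToCleanWords "When birds fly, they soar."
// [|"When"; "birds"; "fly"; "they"; "soar"|]
```

Feeding these cleaned tokens into stemmer.GetSteamWords would let the fly/flying pair match as well.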



Time to update the feature generation. A valuable reminder: feature generation is part art, part science, and it is often an iterative, experimental process. Don’t worry, intuition for what makes a good feature grows with time and experience. Using the now-defined sentenceToStemWords to extract words from the questions, a comparison can be done using Set.intersect.


let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
    let question1Words = sentenceToStemWords input.Question1
    let question2Words = sentenceToStemWords input.Question2
    let wordShareCount =
        Set.intersect (set question1Words) (set question2Words)
        |> Set.count

    let wordShareFeature = ((float32 wordShareCount) * 2.f) / (float32 question1Words.Length + float32 question2Words.Length)

    [|
        float32 input.Question1.Length;
        float32 input.Question2.Length;
        (wordCount >> float32) input.Question1;
        (wordCount >> float32) input.Question2;
        (abs >> float32) (wordCount input.Question1 - wordCount input.Question2);
        wordShareFeature
    |]
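The wordShareFeature above is, in effect, a Dice coefficient: twice the shared-word count divided by the combined word counts, giving a value between 0 (no overlap) and 1 (identical word sets). A standalone sketch of just that computation (the function name here is my own):

```fsharp
// Dice-style overlap: 2 * |A ∩ B| / (|A| + |B|), the same ratio
// the wordShareFeature uses, on plain word arrays.
let wordShare (ws1: string[]) (ws2: string[]) =
    let shared = Set.intersect (set ws1) (set ws2) |> Set.count
    (float shared * 2.0) / (float ws1.Length + float ws2.Length)

wordShare [|"a"; "b"; "c"|] [|"b"; "c"; "d"|]
// 2 * 2 / (3 + 3) ≈ 0.667
```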



Adding matching word stems between questions as a feature improved the accuracy by about 8%. That is a decent return on investment for adding a single feature.


> evaluatePredictionResults finalModel trainInput trainOutput

Accuracy: 0.755652

T\P T F

T 84299 43153

F 35877 160103



> evaluatePredictionResults finalModel validationInput validationOutput

Accuracy: 0.704828

T\P T F

T 18281 13061

F 10806 38710



There is one downside to this approach: common words like “a”, “and”, and “the” are included in the matching-word feature, which can result in a deceptively high percentage word match. To get a more representative match, these “stop words” can be excluded, so it is time for another feature change. I built a stop-words list; here is a sample. The full file is here.


i

a

about

after

all

also

an



Then extend sentenceToStemWords into sentenceToFilteredStemWords, which excludes stop words. This will get me to where I want to be.




let StopWordsFilename = "../data/stopwords.txt"



let stopWords =
    File.ReadAllLines StopWordsFilename
    |> Array.map (fun x -> (x, 1))
    |> Map.ofArray

let sentenceToFilteredStemWords s =
    sentenceToStemWords s
    |> Array.filter (fun w -> not (Map.containsKey w stopWords))
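As a quick sanity check of the filtering step, here is a self-contained version that swaps the file-backed map for a tiny in-memory stop list (the demo names and word list are mine, just for illustration):

```fsharp
// In-memory stand-in for the stop-word file, built the same way:
// each word becomes a map key so lookups are cheap.
let demoStopWords =
    [ "when"; "the"; "are"; "and" ]
    |> List.map (fun w -> (w, 1))
    |> Map.ofList

let demoFilter (words: string[]) =
    words |> Array.filter (fun w -> not (Map.containsKey w demoStopWords))

demoFilter [| "when"; "bird"; "are"; "soar"; "and"; "talk" |]
// [|"bird"; "soar"; "talk"|]
```

A Set<string> would express the same intent slightly more directly than a Map with dummy values, but the behavior is identical.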



let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
    let question1Words = sentenceToFilteredStemWords input.Question1
    let question2Words = sentenceToFilteredStemWords input.Question2
    let wordShareCount =
        Set.intersect (set question1Words) (set question2Words)
        |> Set.count

    let wordShareFeature =
        if question1Words.Length + question2Words.Length = 0
        then 0.f
        else ((float32 wordShareCount) * 2.f) / (float32 question1Words.Length + float32 question2Words.Length)

    [|
        float32 input.Question1.Length;
        float32 input.Question2.Length;
        (wordCount >> float32) input.Question1;
        (wordCount >> float32) input.Question2;
        (abs >> float32) (wordCount input.Question1 - wordCount input.Question2);
        wordShareFeature;
    |]



Filtering out stop words gained another 3%. Admittedly, I expected a bit more, but it is still movement in the right direction.


> evaluatePredictionResults finalModel trainInput trainOutput

Accuracy: 0.777598

T\P T F

T 88157 39913

F 32019 163343



> evaluatePredictionResults finalModel validationInput validationOutput

Accuracy: 0.730577

T\P T F

T 19373 12071

F 9714 39700



As you can see, using word stems and stop words to extend the features can be a useful tactic. This also serves as a good reminder that F# has the tools for interesting analysis. I hope you found this post useful. Until next time.