Today’s topic is tackling a Kaggle problem with XGBoost and F#. Comparing the intent of Quora question pairs offers a perfect opportunity to work with XGBoost, a tool commonly used in Kaggle competitions. Luckily, there is a .NET wrapper around the XGBoost library: XGBoost.Net.

Before going too far, let’s break down the data formats. First, Kaggle provides train.csv, which is used for training models. It contains question pairs and the ground truth regarding whether they are duplicates. Second, test.csv contains question pairs with no ground truth. It is used for generating the submission file for Kaggle. Third, submission.csv holds the results to submit to Kaggle for judging. Its is_duplicate column is the predicted probability (between 0 and 1) that the pair is a duplicate. Below are example rows from each dataset.

// train.csv
"id","qid1","qid2","question1","question2","is_duplicate"
"0","1","2","What is the step by step guide to invest in share market in india?","What is the step by step guide to invest in share market?","0"
"1","3","4","What is the story of Kohinoor (Koh-i-Noor) Diamond?","What would happen if the Indian government stole the Kohinoor (Koh-i-Noor) diamond back?","0"

// test.csv
"test_id","question1","question2"
0,"How does the Surface Pro himself 4 compare with iPad Pro?","Why did Microsoft choose core m3 and not core i3 home Surface Pro 4?"
1,"Should I have a hair transplant at age 24? How much would it cost?","How much cost does hair transplant require?"

// submission.csv
test_id,is_duplicate
0,0.425764
1,0.212075



Now that the data is out of the way, time to get started. Using Paket, here is a sample paket.dependencies file.

source https:

nuget FSharp.Data
nuget PicNet.XGBoost



Here is the boilerplate and the initial variables. Most of this is self-explanatory, although I want to call out a couple of things specifically. As expected, type providers are used to load the csv datasets. The model training section uses hyperparameters, which are managed with the ModelParameterType and ModelParameter types. Feature extraction will use dataset-level metadata. Since this is meant to be a simple example, the only metadata is the average number of words in a question. As shown above, the train and test files have slightly different formats. Whatever method I use, I want to be able to run the same code against train and test. StandardRow enables this by providing a standard row format for transformation.

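System.IO.Directory.SetCurrentDirectory(__SOURCE_DIRECTORY__)
#r "../packages/FSharp.Data/lib/net40/FSharp.Data.dll"
#r "../packages/PicNet.XGBoost/lib/net40/XGBoost.dll"

open System
open System.IO
open FSharp.Data
open XGBoost

// Fraction of train.csv used for training; the rest is held out for validation.
[<Literal>]
let TrainPct = 0.8

// Type provider arguments must be compile-time constants, hence [<Literal>].
[<Literal>]
let TrainFilename = "../data/train.csv"

[<Literal>]
let TestFilename = "../data/test.csv"

[<Literal>]
let SubmissionFilename = "../data/submission.csv"

// Hyperparameters are stored as floats and converted to the right type when applied.
type ModelParameterType = | Int | Float32

type ModelParameter = { Name: string; Type: ModelParameterType; Value: float }

// Dataset-level information available to every row during feature extraction.
type Metadata = { AverageWordCount: float32 }

// Common row format for both train and test data.
type StandardRow = { QuestionId: int; Label: float32; Features: float32[] }

type TrainData = CsvProvider<TrainFilename>

type TestData = CsvProvider<TestFilename>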



To ensure proper model training, the provided train.csv will be broken into a train set and a validation set. This method could be more advanced, but taking the first x% of rows for training and the remaining (100-x)% for validation works well enough in this case. Since the train and test files have different formats, a conversion function is also needed.



// Split the rows sequentially: the first trainPct of rows for training, the rest for validation.
let sample (input:CsvProvider<TrainFilename>) trainPct =
    let trainRows = int (float (input.Rows |> Seq.length) * trainPct)
    let trainData = input.Rows |> Seq.take trainRows |> Seq.toArray
    let validationData = input.Rows |> Seq.skip trainRows |> Seq.toArray
    (trainData, validationData)

// Convert test rows to the train row format; qid1, qid2, and is_duplicate get placeholder values.
let convertTestToTrainFormat (input:CsvProvider<TestFilename>.Row []) : CsvProvider<TrainFilename>.Row [] =
    input
    |> Array.map (fun x -> new CsvProvider<TrainFilename>.Row(x.Test_id, 0, 0, x.Question1, x.Question2, false))
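
The sequential split assumes the row order in train.csv carries no pattern. If that is in doubt, a shuffled split is one alternative. Below is a minimal sketch, not part of the original script: shuffledSplit and its seed parameter are hypothetical names, and the integer array stands in for real rows.

```fsharp
open System

// Hypothetical helper (not in the original script): shuffle a copy, then split.
let shuffledSplit (seed: int) (trainPct: float) (rows: 'a[]) =
    let rng = Random(seed)
    let shuffled = Array.copy rows
    // Fisher-Yates shuffle, performed in place on the copy
    for i in shuffled.Length - 1 .. -1 .. 1 do
        let j = rng.Next(i + 1)
        let tmp = shuffled.[i]
        shuffled.[i] <- shuffled.[j]
        shuffled.[j] <- tmp
    let trainCount = int (float shuffled.Length * trainPct)
    Array.splitAt trainCount shuffled

// Stand-in data: ten integers instead of csv rows.
let (trainPart, validationPart) = shuffledSplit 42 0.8 [| 1 .. 10 |]
```

With the script in this post, the equivalent call would take TrainPct and the loaded rows converted to an array. A fixed seed keeps the split repeatable between runs.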



Here are the feature-generating functions and their supporting functions. For pedagogical reasons the feature set is going to be overly simplistic. This won’t produce a great prediction result, but proper feature creation can be involved. More advanced feature extraction will be addressed in a later post. For now, this will be enough to get some results without losing the primary goal in a forest of feature extraction code.

Some features may need aggregate information about the dataset. This is commonly used for scaling or for comparison against averages. It will be stored in a dataset metadata object that all rows have access to during row transformation and feature extraction. The row-specific features are the length and word count of the two questions being compared. In addition, the difference in word count between the questions is included.



// Number of space-separated tokens in a string.
let wordCount (s:string) = Array.length (s.Split([| ' ' |]))

// Integer absolute value, usable in function composition.
let abs (x:int) = Math.Abs(x)

// Dataset-level metadata: the average word count across all questions.
let metadata (input:CsvProvider<TrainFilename>.Row []) =
    let averageWordCount =
        input
        |> Array.collect (fun row -> [|
            Array.length (row.Question1.Split([| ' ' |]));
            Array.length (row.Question2.Split([| ' ' |])) |])
        |> Array.sum
        |> (fun total -> float32 total / float32 (input.Length * 2))

    { Metadata.AverageWordCount = averageWordCount }

// Features for one question pair: lengths, word counts, and word count difference.
let rowFeatures (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row) =
    [|
        float32 input.Question1.Length;
        float32 input.Question2.Length;
        (wordCount >> float32) input.Question1;
        (wordCount >> float32) input.Question2;
        (abs >> float32) (wordCount input.Question1 - wordCount input.Question2);
    |]

// Convert csv rows into the standard row format (id, label, features).
let transform (metadata:Metadata) (input:CsvProvider<TrainFilename>.Row []) =
    input
    |> Array.map (fun row ->
        {
            StandardRow.QuestionId = row.Id;
            Label = if row.Is_duplicate then float32 1. else float32 0.;
            Features = rowFeatures metadata row
        }
    )
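
To make the feature vector concrete, here is a small standalone sketch that applies the same five features to the first question pair from the train.csv sample. It works on plain strings instead of type provider rows, and pairFeatures is a hypothetical name used only for this illustration.

```fsharp
// Same definition as the post: number of space-separated tokens.
let wordCount (s: string) = s.Split([| ' ' |]).Length

// Hypothetical standalone version of rowFeatures that takes two strings.
let pairFeatures (q1: string) (q2: string) : float32[] =
    [| float32 q1.Length
       float32 q2.Length
       float32 (wordCount q1)
       float32 (wordCount q2)
       float32 (abs (wordCount q1 - wordCount q2)) |]

let q1 = "What is the step by step guide to invest in share market in india?"
let q2 = "What is the step by step guide to invest in share market?"

let features = pairFeatures q1 q2
// Word counts are 14 and 12, so the last feature (the difference) is 2.
```

Note that splitting on a single space means punctuation stays attached to words, which is good enough for a word count.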



Now it is time to look at the XGBoost functionality. Generating a model is as simple as creating a classifier, applying a hyperparameter set, and then running .Fit using the training data (features and labels). One small note: as can be seen, the library uses float32[] for most of its numeric interactions.

Once the model is trained, it can be applied using PredictProba against an array of features (that match the structure of the training data). The result is an array of probabilities per class. Since this is a binary classification, [0.34, 0.66] means there is a 34% chance the result is false, and 66% chance the result is true. For the final submission, a percentage is desired, but for training, it is useful to know the binary true/false regarding duplicate question status.



let buildXgClassModel (trainInput:float32[][]) (trainOutput:float32[]) (parameters:ModelParameter list) =
    let model = XGBClassifier()

    // Apply each hyperparameter, converted to the numeric type it is declared as.
    parameters
    |> List.iter (fun parameter ->
        match parameter.Type with
        | Int -> model.SetParameter(parameter.Name, (int parameter.Value))
        | Float32 -> model.SetParameter(parameter.Name, (float32 parameter.Value)))

    model.Fit(trainInput, trainOutput)
    model

// Per-class probabilities for each input row.
let predictionProbabilities (model:XGBClassifier) (inputs:float32[][]) =
    model.PredictProba(inputs)

// Binary prediction (0 or 1) for each input row, taken from the more likely class.
let predictionValues (model:XGBClassifier) (inputs:float32[][]) =
    predictionProbabilities model inputs
    |> Array.map (fun x -> if x.[0] > x.[1] then 0 else 1)
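
The step from probabilities to a class can be checked without a trained model. This sketch uses hand-written probability pairs (the first is the example from the text, the second is made up), and toClass and duplicateProbability are hypothetical names for this illustration only.

```fsharp
// Each inner array is [P(not duplicate); P(duplicate)] for one question pair.
let sampleProbabilities = [| [| 0.34f; 0.66f |]; [| 0.81f; 0.19f |] |]

// Same rule as predictionValues: pick the more likely class.
let toClass (p: float32[]) = if p.[0] > p.[1] then 0 else 1

// For the submission file, the probability of the "duplicate" class is what matters.
let duplicateProbability (p: float32[]) = p.[1]

let classes = sampleProbabilities |> Array.map toClass
// classes = [| 1; 0 |]

let duplicateChances = sampleProbabilities |> Array.map duplicateProbability
// duplicateChances = [| 0.66f; 0.19f |]
```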



To facilitate debugging and improvement, a confusion matrix is very useful. This, along with overall accuracy reporting, will assist in future development iterations.



// Flag each prediction as matching (true) or not matching (false) its target label.
let comparePredictions (target:float32[]) predicted =
    (target, predicted)
    ||> Array.zip
    |> Array.map (fun (t, p) -> ((int t) - p) = 0)

// Build a 2x2 confusion matrix.
// Each row groups by predicted value (true first, then false);
// the columns count the actual true and actual false targets.
let createConfusionMatrix (target:int[]) (predict:int[]) =
    let combined = (target, predict) ||> Array.zip

    let aggregateRow combined filter =
        combined
        |> Array.filter (fun (_,p) -> p=filter)
        |> Array.map (fun (t,p) -> ((if t=1 then 1 else 0), (if t=0 then 1 else 0)))
        |> Array.fold (fun (a,b) (x,y) -> (a+x, b+y)) (0, 0)

    let pTrue = aggregateRow combined 1
    let pFalse = aggregateRow combined 0

    [|
        [| fst pTrue; snd pTrue |];
        [| fst pFalse; snd pFalse |]
    |]

// Print the confusion matrix as a small table.
let printConfusionMatrix targetValues predictedValues =
    createConfusionMatrix targetValues predictedValues
    |> (fun m ->
        printfn "T\P %6s %6s" "T" "F"
        printfn "T %6d %6d" (m.[0].[0]) (m.[0].[1])
        printfn "F %6d %6d" (m.[1].[0]) (m.[1].[1]))

// Report accuracy and the confusion matrix for a model against a labeled dataset.
let evaluatePredictionResults model input targetOutput =
    let predictedValidationValues = predictionValues model input
    let predictedValidationMatches = comparePredictions targetOutput predictedValidationValues
    let pctValidationMatches = float (predictedValidationMatches |> Array.filter id |> Array.length) / float (predictedValidationMatches |> Array.length)

    printfn "Accuracy: %f" pctValidationMatches
    printConfusionMatrix (targetOutput |> Array.map int) predictedValidationValues
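
The confusion matrix holds more than the accuracy figure. As a minimal sketch, assuming the layout createConfusionMatrix produces (rows grouped by predicted value, columns by actual value), precision and recall can be read from the same four numbers. The function metricsFromMatrix is a hypothetical addition and the matrix values are made up for illustration.

```fsharp
// Hypothetical helper: accuracy, precision, and recall from a 2x2 matrix laid out as
// [| [| predicted T & actual T; predicted T & actual F |];
//    [| predicted F & actual T; predicted F & actual F |] |]
let metricsFromMatrix (m: int[][]) =
    let truePos, falsePos = float m.[0].[0], float m.[0].[1]
    let falseNeg, trueNeg = float m.[1].[0], float m.[1].[1]
    let accuracy = (truePos + trueNeg) / (truePos + falsePos + falseNeg + trueNeg)
    let precision = truePos / (truePos + falsePos)   // share of predicted duplicates that are real
    let recall = truePos / (truePos + falseNeg)      // share of real duplicates that were found
    (accuracy, precision, recall)

// Made-up counts, not results from the model in this post.
let (accuracy, precision, recall) = metricsFromMatrix [| [| 30; 10 |]; [| 20; 40 |] |]
// accuracy = 0.7, precision = 0.75, recall = 0.6
```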



Since the submission file has specific criteria, there are some functions to create it. This is primarily formatting the probabilities as Kaggle expects and then writing the dataset to a file.





// Probability that the pair is a duplicate, taken from [P(false); P(true)].
let convertPredictionToProbability (probabilities: float32[]) =
    if probabilities.[0] > probabilities.[1]
    then 1.f - probabilities.[0]
    else probabilities.[1]

// Pair each question id with its duplicate probability.
let formatSubmissionData (rows:StandardRow[]) (predictions:float32[][]) =
    (rows, predictions)
    ||> Array.zip
    |> Array.map (fun (input, prediction) ->
        let questionId = input.QuestionId
        let probability = convertPredictionToProbability prediction
        (questionId, probability))

// Write the header and one "id,probability" line per row.
let writeSubmissionFile (submissionFilename:string) (submissionData: (int * float32)[]) =
    // `use` disposes (and therefore closes) the writer when the function exits.
    use fileStream = new StreamWriter(submissionFilename)
    fileStream.WriteLine("test_id,is_duplicate")
    submissionData
    |> Array.iter (fun (id, probability) ->
        let line = sprintf "%d,%f" id probability
        fileStream.WriteLine(line))
    fileStream.Flush()



Now that all the hard work is done, it is time to put it all together. The first step is data preparation. First, load the training data and split into train and validation sets. Second, build dataset level metadata. Third, run transformations (feature creation) against the datasets. Fourth, structure the data for model training by generating the appropriate label and features arrays.



// Load the training data and split it into train and validation sets.
let allData = TrainData.Load(TrainFilename)
let (trainData, validationData) = sample allData TrainPct

// Build metadata from the train set only, then transform both sets with it.
let trainMetadata = metadata trainData
let transformedTrainData = transform trainMetadata trainData
let transformedValidationData = transform trainMetadata validationData

// Separate features and labels for model training and evaluation.
let trainInput = transformedTrainData |> Array.map (fun row -> row.Features)
let trainOutput = transformedTrainData |> Array.map (fun row -> row.Label)

let validationInput = transformedValidationData |> Array.map (fun row -> row.Features)
let validationOutput = transformedValidationData |> Array.map (fun row -> row.Label)



Time to train the model. XGBoost supports the parameters below. The values shown are reasonable for the dataset in question. Hyperparameter optimization is out of scope for this post, but it should be used here to find the best model. In a later post I’ll discuss a simple method to approach this topic.

Once trained, report on prediction capability against the original training set as well as the validation set (which the model hasn’t seen).



let modelParameters = [
    { Name = "max_depth"; Type = ModelParameterType.Int; Value = 10. };
    { Name = "learning_rate"; Type = ModelParameterType.Float32; Value = 0.76 };
    { Name = "gamma"; Type = ModelParameterType.Float32; Value = 1.9 };
    { Name = "min_child_weight"; Type = ModelParameterType.Int; Value = 5. };
    { Name = "max_delta_step"; Type = ModelParameterType.Int; Value = 0. };
    { Name = "subsample"; Type = ModelParameterType.Float32; Value = 0.75 };
    { Name = "colsample"; Type = ModelParameterType.Float32; Value = 0.75 };
    { Name = "reg_lambda"; Type = ModelParameterType.Float32; Value = 4. };
    { Name = "reg_alpha"; Type = ModelParameterType.Float32; Value = 1. } ]

// Train the model.
let finalModel = buildXgClassModel trainInput trainOutput modelParameters

// Evaluate against the data the model was trained on.
evaluatePredictionResults finalModel trainInput trainOutput

// Evaluate against the held-out validation data.
evaluatePredictionResults finalModel validationInput validationOutput



Here are the prediction results for the training and validation sets. The prediction capability isn’t great, but the validation set holds up comparatively well (65.1% accuracy versus 68.0% on the training set). At least overfitting isn’t a concern (for now). This also shows that there is plenty of room for improvement through more and better features.

> evaluatePredictionResults finalModel trainInput trainOutput
Accuracy: 0.680396
T\P T F
T 53352 36546
F 66824 166710

> evaluatePredictionResults finalModel validationInput validationOutput
Accuracy: 0.651030
T\P T F
T 11625 10755
F 17462 41016



Now it is time to create the final predictions and the submission file for Kaggle. To do this, replicate the validation workflow, with a couple of caveats. First, the test dataset is formatted slightly differently. Since this is data with no known classifications, there is no class column in the file. So I need to load the test data, then run the conversion so the test data matches the format of the training data. Second, the submission file needs to be populated with the probability of the questions being duplicates (not with a straight classification). Lastly, write the id along with the result to the submission file.

let testData = TestData.Load(TestFilename).Rows |> Seq.toArray
let transformedTestData = transform trainMetadata (convertTestToTrainFormat testData)
let testInput = transformedTestData |> Array.map (fun row -> row.Features)
let testPredictions = predictionProbabilities finalModel testInput
let submissionData = formatSubmissionData transformedTestData testPredictions
writeSubmissionFile SubmissionFilename submissionData



All that is left to do is submit the file for judging. Spoiler alert: because this is an overly simplified model, it fared poorly. As I mentioned earlier, the current feature set isn’t good. In addition, the hyperparameters could benefit from some search of their own. These are both topics I plan on discussing in future posts. F# and .NET still have a couple more tricks up their sleeves to get these results even better. Hopefully this has provided a bit of inspiration to try F# in your own projects, perhaps even a Kaggle competition. Until next time.