The discovering ML.NET series continues. With the release of v0.3.0, it is time to look at performing K-means clustering using F# and Microsoft’s new ML.NET framework. The use case will be to use examination attributes to classify mammogram results.

NOTE: Due to ML.NET changes, this post is superseded by the post Clustering-V2.

For reference, previous ML.NET series posts are below:

As I mentioned in the previous posts, there is a disclaimer: ML.NET is in its early stages. I found a couple interface idiosyncrasies I suspect will change over time. Just keep that in mind. I am happy with what I have seen so far, and I’m excited to see it grow and mature.

Note: The post was written using Microsoft.ML v0.3.0.

Make sure you have .NET Core version 2.1 installed. If you don’t, head out to the .NET Core Downloads page and select the SDK for your platform. Tangentially, you can also get there by going to dot.net, then navigating to Downloads and .NET Core.

First, create a console F# project, then add the ML.NET package.

dotnet new console --language F# --name MLNet-Mammogram
cd MLNet-Mammogram
dotnet add package Microsoft.ML



Next, it is time to get the data. The source I used for this post is from UCI. The data file can be found [here](https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data).

mkdir data && cd data
curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/mammographic-masses/mammographic_masses.data



Here is a sample of what the data looks like. There is no header row. The columns represent 5 features and 1 classification column:

BI-RADS assessment (1-5)

Age (Patient’s age)

Shape (mass shape: round=1 oval=2 lobular=3 irregular=4 (nominal))

Margin (mass margin: circumscribed=1 microlobulated=2 obscured=3 ill-defined=4 spiculated=5 (nominal))

Density: (mass density high=1 iso=2 low=3 fat-containing=4 (ordinal))

Severity: (benign=0 or malignant=1)

# Data Rows
5,67,3,5,3,1
4,43,1,1,?,1
5,58,4,5,3,1
4,28,1,1,3,0
5,57,1,5,3,1
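Some rows, like the second one in the sample, use ? for a missing value. As a minimal sketch of how such rows could be detected before use, the hypothetical helper tryParseRow below (not part of ML.NET or of this post's program) parses one line and returns None when a field is missing or not numeric.

```fsharp
open System
open System.Globalization

// Hypothetical helper for illustration: parse one comma-delimited row into
// six float32 values, or return None if any field is missing ("?") or invalid.
let tryParseRow (line: string) : float32[] option =
    let fields = line.Split(',')
    let parsed =
        fields
        |> Array.map (fun field ->
            match Single.TryParse(field.Trim(), NumberStyles.Float, CultureInfo.InvariantCulture) with
            | true, value -> Some value
            | false, _ -> None)
    if fields.Length = 6 && Array.forall Option.isSome parsed
    then Some (Array.choose id parsed)
    else None

// Three of the sample rows shown above; the middle one has a missing Density.
let sample = [ "5,67,3,5,3,1"; "4,43,1,1,?,1"; "4,28,1,1,3,0" ]
let complete = sample |> List.choose tryParseRow
printfn "%d of %d rows are complete" complete.Length sample.Length
```

With these three sample rows, only the two without a ? are kept.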



Now that the project is set up and the data is local, we can get to the code. Time to open up the already created Program.fs. First, add the necessary namespaces.

open Microsoft.ML
open Microsoft.ML.Runtime.Api
open Microsoft.ML.Trainers
open Microsoft.ML.Transforms
open Microsoft.ML.Models
open Microsoft.ML.Data



The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; in F# we can use a type. Below are the required types: MammogramData is the input data, and MammogramPrediction is the output prediction. MammogramData is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data. Each member must be mutable public, and it requires both the [<Column("#")>] attribute, to specify its column position, and the [<DefaultValue>] attribute. MammogramPrediction requires PredictedLabel for the cluster id and Score for the calculated distances from all clusters.

type MammogramData() =
    [<Column("0")>]
    [<DefaultValue>]
    val mutable public BiRads:float32

    [<Column("1")>]
    [<DefaultValue>]
    val mutable public Age:float32

    [<Column("2")>]
    [<DefaultValue>]
    val mutable public Shape:float32

    [<Column("3")>]
    [<DefaultValue>]
    val mutable public Margin:float32

    [<Column("4")>]
    [<DefaultValue>]
    val mutable public Density:float32

    [<Column("5")>]
    [<DefaultValue>]
    val mutable public Label:float32

type MammogramPrediction() =
    [<ColumnName("PredictedLabel")>]
    [<DefaultValue>]
    val mutable public SelectedClusterId:uint32

    [<ColumnName("Score")>]
    [<DefaultValue>]
    val mutable public Distance:float32[]



As in the other examples, building the pipeline structure is intuitive. First, create a pipeline. Then, add components to the pipeline in the order they are to be executed. So first, load the data with a TextLoader. This data is comma delimited and has no header row.

let pipeline = new LearningPipeline()
let dataPath = "./data/mammographic_masses.data"
pipeline.Add((new TextLoader(dataPath)).CreateFrom<MammogramData>(separator = ',', useHeader = false))



After the data is loaded, feature columns need to be added to the pipeline. I’m going to use all feature columns from the file and exclude severity. The clustering model requires features to be numeric, which is fine here. As the other posts show, you can convert text to numeric mappings if necessary.

pipeline.Add(new ColumnConcatenator("Features", "BiRads", "Age", "Shape", "Margin", "Density"))



Now that the features are defined, it is time to define the training method. This will be KMeansPlusPlusClusterer. As with the other trainers, custom parameters can be defined; I have decided to use K = 4. It also has other options, such as MaxIterations, OptTol (convergence tolerance), and NormalizeFeatures.

pipeline.Add(new KMeansPlusPlusClusterer(K = 4))
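To make the idea behind the trainer a bit more concrete, here is a minimal sketch of the assignment step of k-means in plain F#: compute the distance from a point to every centroid, then pick the nearest one. The centroids, the two-feature space, and the function names are all made up for illustration, and squared Euclidean distance is an assumption; this is not a description of how ML.NET implements it internally.

```fsharp
// Squared Euclidean distance between two points of equal length.
let squaredDistance (a: float32[]) (b: float32[]) =
    Array.fold2 (fun acc x y -> acc + (x - y) * (x - y)) 0.0f a b

// Hypothetical centroids in a 2-feature space, for illustration only.
let centroids = [| [| 0.0f; 0.0f |]; [| 1.0f; 1.0f |]; [| 0.0f; 1.0f |] |]

// Assignment step: distance to every centroid, then the index of the nearest.
let assign (point: float32[]) =
    let distances = centroids |> Array.map (squaredDistance point)
    let nearest = distances |> Array.indexed |> Array.minBy snd |> fst
    nearest, distances

let clusterIndex, pointDistances = assign [| 0.9f; 0.8f |]
printfn "Nearest centroid index: %d, distances: %A" clusterIndex pointDistances
```

Here the point is closest to the second centroid, so the zero-based index is 1. The prediction type defined earlier has the same shape: one selected cluster plus a vector of distances.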



The last part is to train the model. Note the MammogramData and MammogramPrediction types as part of the Train call.

let model = pipeline.Train<MammogramData, MammogramPrediction>()



Validation of any model is important. For a real case, I would train on one dataset and validate against a previously unseen dataset. Since this is just an example, I validate against the training data. As a result, I expect the predictions to be really accurate. ML.NET offers multiple Evaluator classes, based on specific needs. For this, the obvious choice is ClusterEvaluator, which takes a trained model and a dataset and produces critical metrics.

// Evaluate against the training file (which has no header row)
let testData = (new TextLoader(dataPath)).CreateFrom<MammogramData>(separator = ',', useHeader = false)
let evaluator = new ClusterEvaluator()
let metrics = evaluator.Evaluate(model, testData)
printfn ""
printfn "Avg Min Score: %f" <| metrics.AvgMinScore
printfn "DBI : %A" <| metrics.Dbi
printfn "NMI : %A" <| metrics.Nmi
printfn ""



Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Initializing centroids
Centroids initialized, starting main trainer
Model trained successfully on 829 instances
Not training a calibrator because it is not needed.

Avg Min Score: 0.049841
DBI : 0.0
NMI : 0.3012495931



With the initial evaluation out of the way, it is time to move on to individual predictions. I want to create aggregate classification percentages for each cluster. To do this, I take the predictive model and apply it against the training file. Using the predicted cluster and the training label, I create a mapping for detailed predictions. Each cluster gets its own raw benign/malignant count, which can be converted into a percentage likelihood for each classification. I have the details annotated in comments to make it easier to follow. Honestly, this is the most labor-intensive part of the process. I’d love to be able to pass a cluster-aggregate-score function in as part of the trainer to eliminate this work of reprocessing the data. Once I have these results as a Map, I can query results easily enough.

// Map of clusterId -> [| benign count; malignant count |]
let clusterClassification =
    // Read the training file
    System.IO.File.ReadAllLines(dataPath)
    // Skip rows with missing values
    |> Array.filter (fun line -> not (line.Contains("?")))
    // Predict the cluster for each row and pair it with a benign/malignant count
    |> Array.map (fun line ->
        let row = line.Split(',') |> Array.map float32
        let predictedCluster =
            model.Predict(
                MammogramData(
                    BiRads = row.[0],
                    Age = row.[1],
                    Shape = row.[2],
                    Margin = row.[3],
                    Density = row.[4]))
        // Column 5 is severity: 0 = benign, 1 = malignant
        if int row.[5] = 0
        then (predictedCluster.SelectedClusterId, [| 1; 0 |])
        else (predictedCluster.SelectedClusterId, [| 0; 1 |]))
    // Group the rows by predicted cluster
    |> Array.groupBy (fun (clusterId, _) -> clusterId)
    // Sum the benign/malignant counts within each cluster
    |> Array.map (fun (clusterId, data) ->
        let countSums =
            data
            |> Array.map (fun (_, z) -> z)
            |> Array.fold (fun a (x:int[]) ->
                [| a.[0] + x.[0]; a.[1] + x.[1] |]) [| 0; 0 |]
        (clusterId, countSums))
    |> Map.ofArray

// Convert a cluster's raw counts into a benign/malignant percentage summary
let clusterIdToPrediction (clusterClassification:Map<uint32, int[]>) (clusterId:uint32) =
    let classifications = clusterClassification.Item clusterId

    let total = classifications |> Array.sum |> float
    let benignPct = float classifications.[0] / total
    let malignantPct = float classifications.[1] / total

    sprintf "Benign: %0.2f Malignant: %0.2f (%d, %d)"
        benignPct
        malignantPct
        classifications.[0]
        classifications.[1]
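As a self-contained sketch of how the per-cluster summary can be listed, the snippet below hard-codes the benign/malignant counts reported in this post's run, rather than recomputing them from the model, and prints one line per cluster. The names clusterCounts and describe are illustrative stand-ins for clusterClassification and the formatting inside clusterIdToPrediction.

```fsharp
// Counts per cluster as [| benign; malignant |], copied from the run shown in this post.
let clusterCounts : Map<uint32, int[]> =
    Map.ofList [
        (1u, [| 83; 236 |])
        (2u, [| 41; 29 |])
        (3u, [| 19; 99 |])
        (4u, [| 284; 39 |]) ]

// Same formatting as clusterIdToPrediction, applied directly to a counts array.
let describe (counts: int[]) =
    let total = counts |> Array.sum |> float
    sprintf "Benign: %0.2f Malignant: %0.2f (%d, %d)"
        (float counts.[0] / total)
        (float counts.[1] / total)
        counts.[0]
        counts.[1]

// Print one summary line per cluster, in key order.
clusterCounts
|> Map.iter (fun clusterId counts ->
    printfn "ClusterId %A => %s" clusterId (describe counts))
```

In the real program, the same Map.iter call could be applied to clusterClassification instead of the hard-coded map.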



Now that clusterIdToPrediction is defined, I can pair the ML.NET cluster prediction with the aggregated cluster classification percentages. First, create a MammogramData object and provide it to the Predict method. Second, use the predicted clusterId with the aggregated cluster classification percentages to get a classification result. For this example, I pull one of the rows from the training data.

let test1 = MammogramData()
test1.BiRads <- 5.f
test1.Age <- 67.f
test1.Shape <- 3.f
test1.Margin <- 5.f
test1.Density <- 3.f


let predictionTest1 = model.Predict(test1)
printfn "Predicted ClusterId: %d" predictionTest1.SelectedClusterId
printfn "Predicted Distances: %A" predictionTest1.Distance
printfn "Predicted Result: %s" (clusterIdToPrediction clusterClassification predictionTest1.SelectedClusterId)
printfn "Actual Result : 1 (Malignant)"
printfn ""



The results show the prediction falls into cluster 3, which has an 84% likelihood of being malignant. This matches the actual value.

# Prediction Result:
Predicted ClusterId: 3
Predicted Distances: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result : 1 (Malignant)
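The relationship between the distances and the cluster id can be checked by hand. In the output above, the smallest distance is the third entry, and the predicted cluster id is 3, which suggests the id is the 1-based position of the nearest centroid. That is an inference from this single output, not something confirmed against the ML.NET documentation. A minimal sketch:

```fsharp
// Distances copied from the prediction output above.
let distances = [| 0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f |]

// Zero-based index of the smallest distance.
let nearestIndex = distances |> Array.indexed |> Array.minBy snd |> fst

// Assumption: cluster ids are 1-based, so add one to the array index.
let clusterId = uint32 (nearestIndex + 1)

printfn "Smallest distance at index %d, cluster id %d" nearestIndex clusterId
```

For these values the smallest distance sits at index 2, which lines up with the predicted cluster id of 3.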



Like the other models before it, the cluster model can be saved to a file and reloaded later. This is supported by the WriteAsync and ReadAsync methods of a model.

// Save the model to a file, waiting for the write to finish
model.WriteAsync("test-model")
|> Async.AwaitTask
|> Async.RunSynchronously

// Load the model from the file and repeat the prediction
let modelReloaded =
    PredictionModel.ReadAsync<MammogramData, MammogramPrediction>("test-model")
    |> Async.AwaitTask
    |> Async.RunSynchronously
let predictionReloaded = modelReloaded.Predict(test1)
printfn "Predicted ClusterId RL: %d" predictionReloaded.SelectedClusterId
printfn "Predicted Distances RL: %A" predictionReloaded.Distance
printfn "Predicted Result RL: %s" (clusterIdToPrediction clusterClassification predictionReloaded.SelectedClusterId)
printfn "Actual Result RL : 1 (Malignant)"
printfn ""



As expected, the prediction results are the same with the reloaded model.

# Prediction Result: (model reloaded):
Predicted ClusterId RL: 3
Predicted Distances RL: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result RL: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result RL : 1 (Malignant)



Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.

Automatically adding a MinMax normalization transform, use 'norm=Warn' or 'norm=No' to turn this behavior off.
Initializing centroids
Centroids initialized, starting main trainer
Model trained successfully on 829 instances
Not training a calibrator because it is not needed.

Avg Min Score: 0.049841
DBI : 0.0
NMI : 0.3012495931

ClusterId 1u => Benign: 0.26 Malignant: 0.74 (83, 236)
ClusterId 2u => Benign: 0.59 Malignant: 0.41 (41, 29)
ClusterId 3u => Benign: 0.16 Malignant: 0.84 (19, 99)
ClusterId 4u => Benign: 0.88 Malignant: 0.12 (284, 39)

Predicted ClusterId: 3
Predicted Distances: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result : 1 (Malignant)

Predicted ClusterId RL: 3
Predicted Distances RL: [|0.128789425f; 0.166862488f; 0.0578770638f; 0.80590868f|]
Predicted Result RL: Benign: 0.16 Malignant: 0.84 (19, 99)
Actual Result RL : 1 (Malignant)



This has been a brief look into training and using an ML.NET k-means cluster model. As seen with the other models, ML.NET is providing a nice consistent interface and has some good components. It is a framework that continues to grow in a positive direction. Kudos and thanks to all the people making this a reality. That’s all for now. Until next time.