Recently Microsoft announced ML.NET, a machine learning framework for .NET. This is exciting news, and my mind immediately goes to: how does this look with F#? This post takes a look at using ML.NET’s regression module to predict concrete compressive strength based on the ingredients of the mixture.

Update: This post is here for posterity’s sake; a rework of this post using ML.NET version 1.3 is here.

Before jumping in too far, there is a disclaimer: ML.NET is in its early stages. I found a couple of implementation and interface idiosyncrasies that I suspect will change over time. Just keep that in mind moving forward. The short version is, I’ve been pleased with what I’ve seen so far. There is some room for improvement, especially in having more F#-centric support for calling methods. It will be an interesting journey as the framework matures.

Update: The post was written using Microsoft.ML v0.1.0, and v0.2.0 has since been released. I have noted interface changes below; for this example, it is just TextLoader.

With that out of the way, make sure you have .NET Core version 2.0 installed. If you don’t, head out to the .NET Core Downloads page and select the SDK for your platform. Tangentially, you can also get there by going to dot.net, then navigating to Downloads and .NET Core.

First, create the project and add the ML.NET package. This will be a console app in F# (obviously).

    dotnet new console --language F# --name MLNet-Concrete
    cd MLNet-Concrete
    dotnet add package Microsoft.ML



Next, it is time to get the data. The source I used for this post is from UCI. The dataset is an Excel file (xls), and I need it as a csv. I used ssconvert (from apt install gnumeric) to convert from Excel to CSV, but feel free to use whatever works for you.

    mkdir data && cd data
    curl -O https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls
    ssconvert Concrete_Data.xls Concrete_Data.csv



Here is a sample of what the data looks like. There is a header row, which I’ve transposed to a vertical list for readability. The first 8 columns are features; the last is the concrete compressive strength.

    # Header Row
    Cement (component 1)(kg in a m^3 mixture)
    Blast Furnace Slag (component 2)(kg in a m^3 mixture)
    Fly Ash (component 3)(kg in a m^3 mixture)
    Water (component 4)(kg in a m^3 mixture)
    Superplasticizer (component 5)(kg in a m^3 mixture)
    Coarse Aggregate (component 6)(kg in a m^3 mixture)
    Fine Aggregate (component 7)(kg in a m^3 mixture)
    Age (day)
    Concrete compressive strength(MPa, megapascals)

    # Data Rows
    540,0,0,162,2.5,1040,676,28,79.98611076
    540,0,0,162,2.5,1055,676,28,61.887365759999994
    332.5,142.5,0,228,0,932,594,270,40.269535256000005
    332.5,142.5,0,228,0,932,594,365,41.052779992



Now that the project is set up and the data is local, we can get to the code. Time to open up the already created Program.fs. First, add the necessary namespaces.

    open System
    open Microsoft.ML
    open Microsoft.ML.Runtime.Api
    open Microsoft.ML.Trainers
    open Microsoft.ML.Transforms
    open Microsoft.ML.Models

    // Added for the v0.2.0 update (TextLoader)
    open Microsoft.ML.Data



The ML.NET pipeline expects the data in a specific format. In the C# world this is a class; for F# we can use a type. Below are the required types: ConcreteData is the input data, and ConcretePrediction is the output prediction. ConcreteData is basically a map of columns to member variables. There are a couple of notable points to ensure the pipeline can properly consume the data. Each field must be declared mutable public, and it requires both a [<Column("#")>] attribute to specify its column position and a [<DefaultValue>] attribute. For ConcretePrediction, a single field is required: the prediction value. For the input data, the label variable must be named Label. For the prediction type, the variable must be named Score. There are supposed to be ways to define a ColumnName attribute, or to copy a label column within the pipeline, but frankly they didn’t work for me. I’m unclear whether I was doing something wrong or it is an early-stage problem. Over time I expect this will be resolved, but for now I don’t mind working within tighter constraints.

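    type ConcreteData () =
        [<Column("0")>]
        [<DefaultValue>]
        val mutable public Cement: float32

        [<Column("1")>]
        [<DefaultValue>]
        val mutable public Slag: float32

        [<Column("2")>]
        [<DefaultValue>]
        val mutable public Ash: float32

        [<Column("3")>]
        [<DefaultValue>]
        val mutable public Water: float32

        [<Column("4")>]
        [<DefaultValue>]
        val mutable public Superplasticizer: float32

        [<Column("5")>]
        [<DefaultValue>]
        val mutable public CoarseAggregate: float32

        [<Column("6")>]
        [<DefaultValue>]
        val mutable public FineAggregate: float32

        [<Column("7")>]
        [<DefaultValue>]
        val mutable public Age: float32

        // The label column must be named Label
        [<Column("8")>]
        [<DefaultValue>]
        val mutable public Label: float32

    type ConcretePrediction () =
        // The prediction field must be named Score
        [<DefaultValue>]
        val mutable public Score: float32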





Building a pipeline is pretty intuitive. First, create a pipeline. Then, add components to the pipeline in the order they are to be executed. So first, load the data with a TextLoader. This data is comma delimited and has a header row.

    let pipeline = new LearningPipeline()
    let dataPath = "./data/Concrete_Data.csv"

    // Load the comma-delimited file; the first row is a header
    pipeline.Add((new TextLoader(dataPath)).CreateFrom<ConcreteData>(separator = ',', useHeader = true))



After the data is loaded, feature columns need to be added to the pipeline. I’m going to use all feature columns from the file, but I don’t have to. The regressor model requires features to be numeric. In this example that is the case, so nothing special needs to be done. In cases where columns are strings, the CategoricalOneHotVectorizer() will convert string columns to numeric mappings. I’ve provided an example line below. Even though I don’t need it, it’s a handy reference to have. Note the order: since it is a pipeline, the string-to-numeric column conversion needs to happen prior to adding the feature columns.

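    // Example only: convert a string column to a numeric mapping first.
    // "SomeStringColumn" is a hypothetical column name; this dataset has no string columns.
    // pipeline.Add(new CategoricalOneHotVectorizer("SomeStringColumn"))
    pipeline.Add(new ColumnConcatenator("Features", "Cement", "Slag", "Ash", "Water", "Superplasticizer", "CoarseAggregate", "FineAggregate", "Age"))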
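To make the idea of a string-to-numeric mapping concrete, here is a minimal conceptual sketch of one-hot encoding in plain F#. It is not how ML.NET implements CategoricalOneHotVectorizer; the oneHot helper and the supplier names are hypothetical and exist only for illustration.

```fsharp
// Conceptual sketch only; not ML.NET's implementation.
// `oneHot` and the supplier names are hypothetical.
let oneHot (categories: string[]) (value: string) : float32[] =
    // One slot per known category: 1 where the value matches, 0 elsewhere
    categories |> Array.map (fun c -> if c = value then 1.0f else 0.0f)

let suppliers = [| "Acme"; "Bedrock"; "Granite" |]

// A three-element vector with a 1 in the "Bedrock" position (index 1)
let encoded = oneHot suppliers "Bedrock"
```

Each string value becomes a numeric vector, which is the kind of input a regressor can consume alongside the other numeric features.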



Now that the features are defined, it is time to determine what training method to use. For this post FastTreeRegressor is used. This is a boosted decision tree and generally offers pretty good results. Custom hyperparameters can also be defined. I found the defaults to be fine, but it’s good to see the option to tweak those values.

    pipeline.Add(new FastTreeRegressor())









For the dataset in question, the FastTreeRegressor worked the best, but there are alternatives. Most had worse performance, with the FastTreeTweedieRegressor being similar. As with anything, it is good to investigate options.

The last part, train the model. Note the ConcreteData and ConcretePrediction types as part of the Train call.

    let model = pipeline.Train<ConcreteData, ConcretePrediction>()



Validation of any model is important. For a real case, I would train on one dataset and validate against a previously unseen dataset. Since this is just an example, I validate against the training data. As a result, I expect the results to be very good, and they are. ML.NET offers an Evaluator class, which makes getting some of those crucial high-level numbers pretty easy. It takes a trained model and a dataset, and produces critical metrics. Again, this is one of those components that is crucial to an ML framework and I’m glad to see it here.

    // Evaluate the trained model (here, against the training data itself)
    let testData = (new TextLoader(dataPath)).CreateFrom<ConcreteData>(separator = ',', useHeader = true)
    let evaluator = new RegressionEvaluator()
    let metrics = evaluator.Evaluate(model, testData)
    printfn ""
    printfn "R-Squared: %f" <| metrics.RSquared
    printfn "RMS : %f" <| metrics.Rms
    printfn "L1 : %f" <| metrics.L1
    printfn "L2 : %f" <| metrics.L2
    printfn ""



    # Evaluator Results:
    R-Squared: 0.988533
    RMS : 1.788017
    L1 : 1.139818
    L2 : 3.197006
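For reference, these four numbers can be reproduced by hand from actual and predicted values. The sketch below assumes L1 is the mean absolute error, L2 is the mean squared error, RMS is the square root of L2 (which is consistent with the results above, since the square root of 3.197006 is roughly 1.788017), and R-Squared is one minus the ratio of the residual to the total sum of squares. The regressionMetrics helper is hypothetical and not part of ML.NET.

```fsharp
// Hypothetical helper, not an ML.NET function.
// Returns (rSquared, rms, l1, l2) for paired actual and predicted values.
let regressionMetrics (actual: float[]) (predicted: float[]) =
    let n = float actual.Length
    let errors = Array.map2 (fun a p -> a - p) actual predicted
    let l1 = (errors |> Array.sumBy abs) / n                // mean absolute error
    let l2 = (errors |> Array.sumBy (fun e -> e * e)) / n    // mean squared error
    let rms = sqrt l2                                        // root mean squared error
    let mean = Array.average actual
    let ssTotal = actual |> Array.sumBy (fun a -> (a - mean) * (a - mean))
    let ssResidual = l2 * n
    let rSquared = 1.0 - ssResidual / ssTotal
    rSquared, rms, l1, l2

// Tiny made-up example: only the last prediction is off, by 1.0
let rSquared, rms, l1, l2 = regressionMetrics [| 1.0; 2.0; 3.0 |] [| 1.0; 2.0; 4.0 |]
printfn "R-Squared: %f  RMS: %f  L1: %f  L2: %f" rSquared rms l1 l2
```

Because L2 squares each error, it penalizes a few large misses more heavily than L1 does, which is why it is worth looking at both.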



Backtracking to the hyperparameter example, here are those results. As you can tell, my randomly picked hyperparameter choices were not better. It certainly seems like a fun opportunity to pair some optimization searches with the pipeline to see how methods can be improved. Of course, this would be more meaningful if it were not validating against the training data; there is already a risk of overfitting that we’re not seeing.

    # Evaluator Results (with hyperparameters):
    R-Squared: 0.947057
    RMS : 3.841995
    L1 : 2.846822
    L2 : 14.760922
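To get numbers that are not flattered by validating against the training data, one option is to hold out part of the file before training. The following is a minimal sketch using only System.IO. The splitCsv helper, the 20% test fraction, and the output file names are all assumptions for illustration, and the split is line-based, so it assumes no field contains an embedded line break.

```fsharp
open System
open System.IO

// Hypothetical helper (not part of ML.NET): shuffles the data rows of a CSV
// file and writes them to separate train and test files, repeating the
// header row in both. Returns (trainCount, testCount).
let splitCsv (sourcePath: string) (trainPath: string) (testPath: string) (testFraction: float) (seed: int) =
    let lines = File.ReadAllLines sourcePath
    let header = lines.[0]
    let rows = Array.skip 1 lines
    // Simple seeded shuffle: sort rows by a random key
    let rng = Random(seed)
    let shuffled = rows |> Array.sortBy (fun _ -> rng.Next())
    let testCount = int (float shuffled.Length * testFraction)
    let testRows, trainRows = Array.splitAt testCount shuffled
    File.WriteAllLines(testPath, Array.append [| header |] testRows)
    File.WriteAllLines(trainPath, Array.append [| header |] trainRows)
    trainRows.Length, testRows.Length

// Example usage (hypothetical file names):
// let trainCount, testCount =
//     splitCsv "./data/Concrete_Data.csv" "./data/train.csv" "./data/test.csv" 0.2 42
```

The training file would then go to the TextLoader added to the pipeline, and the test file to the TextLoader handed to the evaluator.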



Here is an example of how individual predictions can be made. Create a ConcreteData object and provide it to the Predict method. For this example, I pull one of those rows from the training data.

    let test1 = ConcreteData()
    test1.Cement <- 198.6f
    test1.Slag <- 132.4f
    test1.Ash <- 0.f
    test1.Water <- 192.f
    test1.Superplasticizer <- 0.f
    test1.CoarseAggregate <- 978.4f
    test1.FineAggregate <- 825.5f
    test1.Age <- 90.f

    let predictionTest1 = model.Predict(test1)
    printfn "Predicted Strength: %f" predictionTest1.Score
    printfn "Actual Strength : 38.074243671999994"
    printfn ""



    # Prediction Result:
    Predicted Strength: 38.882920
    Actual Strength : 38.074243671999994



On a lark, let’s see what happens if slag is increased and the water content is reduced. It looks like the predicted compressive strength goes up.

    let test2 = ConcreteData()
    test2.Cement <- 198.6f
    test2.Slag <- 150.0f
    test2.Ash <- 0.f
    test2.Water <- 172.f
    test2.Superplasticizer <- 0.f
    test2.CoarseAggregate <- 978.4f
    test2.FineAggregate <- 825.5f
    test2.Age <- 90.f

    let predictionTest2 = model.Predict(test2)
    printfn "Predicted Strength: %f" predictionTest2.Score
    printfn ""



    # Prediction Result:
    Predicted Strength: 45.623180



Once a model is trained, it can also be saved to a file and reloaded at a later time. This is supported by the model’s WriteAsync method and by PredictionModel.ReadAsync.

    // Save the model to a file. WriteAsync returns a Task, so wait for it
    // to finish before reading the file back.
    model.WriteAsync("test-model")
    |> Async.AwaitTask
    |> Async.RunSynchronously

    // Load the model from the file and predict with it
    let modelReloaded =
        PredictionModel.ReadAsync<ConcreteData, ConcretePrediction>("test-model")
        |> Async.AwaitTask
        |> Async.RunSynchronously
    let predictionReloaded = modelReloaded.Predict(test1)
    printfn "Predicted Strength RL: %f" predictionReloaded.Score
    printfn "Actual Strength : 38.074243671999994"
    printfn ""



    # Prediction Result (model reloaded):
    Predicted Strength RL: 38.882920
    Actual Strength : 38.074243671999994
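One F# detail matters when calling Task-returning methods like these: Async.AwaitTask only wraps the task in an Async, and something still has to run that Async for the caller to actually wait. The sketch below illustrates the difference using Task.Delay as a stand-in for a slow save; slowOperation is a hypothetical function, not an ML.NET call.

```fsharp
open System.Threading.Tasks

// Stand-in for any Task-returning call (hypothetical; not an ML.NET method)
let slowOperation () : Task = Task.Delay(200)

// Wrapping the task and discarding the Async does not wait for completion,
// so the task is most likely still running when the next line executes.
let notAwaited = slowOperation ()
notAwaited |> Async.AwaitTask |> ignore
printfn "Completed right after ignore: %b" notAwaited.IsCompleted

// Running the Async blocks until the task has finished.
let awaited = slowOperation ()
awaited |> Async.AwaitTask |> Async.RunSynchronously
printfn "Completed after RunSynchronously: %b" awaited.IsCompleted
```

When a file is written and then read back immediately, waiting on the write avoids reading a file that is not finished yet.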



Throughout the post, portions of the output have been provided out of band. Here is how the whole thing looks when run with dotnet run.

    Not adding a normalizer.
    Making per-feature arrays
    Changing data from row-wise to column-wise
    Processed 1030 instances
    Binning and forming Feature objects
    Reserved memory for tree learner: 234780 bytes
    Starting to train ...
    Not training a calibrator because it is not needed.

    R-Squared: 0.988533
    RMS : 1.788017
    L1 : 1.139818
    L2 : 3.197006

    Predicted Strength: 38.882920
    Actual Strength : 38.074243671999994

    Predicted Strength: 45.623180

    Predicted Strength RL: 38.882920
    Actual Strength : 38.074243671999994



There you have it. A brief look into training and using an ML.NET regressor model. Although there are a couple quirks, I’m excited to see this released. This will only get better over time and if F# can be a part of that, even better.