Kotlin Island outside of Saint Petersburg, Russia (SOURCE: Wikimedia)

Introduction to Kotlin-Statistics

Fluent Data Science Operators with Kotlin

Over the past few years, I have been an avid user of Kotlin. But my proclivity for Kotlin is not simply due to language boredom or zeal for JetBrains products (including PyCharm, the great Python IDE). Kotlin is a more pragmatic Scala, or “Scala for Dummies” as I heard someone once describe it. It is unique in the fact it tries not to be, focusing on practicality and industry rather than academic experimentation. It takes many of the most useful features from programming languages to date (including Java, Groovy, Scala, C#, and Python), and integrates them into a single language.

My usage of Kotlin rose out of need, and this is something I talked about in a KotlinConf talk back in 2017:

You might say “well Python is pragmatic” and sure, I still use Python especially when I need certain libraries. But it can be difficult to manage 10,000 lines of Python code in a rapidly evolving production application. While some people are able to do this successfully and run entire companies on Python, some companies like Khan Academy are discovering the benefits of Kotlin and its modern approach to static typing. Khan Academy wrote about their experiences switching from a Python ecosystem to a Python/Kotlin one:

Khan has also written a document for Python developers wishing to learn Kotlin:

But I digress. What I want to introduce in this article is a library I have worked on for some time called Kotlin-Statistics. It started out as an experiment to express meaningful statistical and data analysis with functional and object-oriented programming, while keeping the code legible and intuitive. In other words, I wanted to prove it was possible to analyze data with ordinary OOP/functional code, without resorting to data frames and other data science-y structures.

Take for example this Kotlin code below where I declare a Patient type, and I include the first name, last name, birthday, and white blood cell count. I also have an enum called Gender reflecting a MALE/FEMALE category. Of course, I could import this data from a text file, a database, or another source, but for now I am going to declare them in literal Kotlin code:

import java.time.LocalDate
import java.time.temporal.ChronoUnit

data class Patient(
    val firstName: String,
    val lastName: String,
    val gender: Gender,
    val birthday: LocalDate,
    val whiteBloodCellCount: Int
) {
    // Age in whole years as of today, converted to Int for the Int-based operators used later
    val age get() =
        ChronoUnit.YEARS.between(birthday, LocalDate.now()).toInt()
}

val patients = listOf(
    Patient("John", "Simone", Gender.MALE, LocalDate.of(1989, 1, 7), 4500),
    Patient("Sarah", "Marley", Gender.FEMALE, LocalDate.of(1970, 2, 5), 6700),
    Patient("Jessica", "Arnold", Gender.FEMALE, LocalDate.of(1980, 3, 9), 3400),
    Patient("Sam", "Beasley", Gender.MALE, LocalDate.of(1981, 4, 17), 8800),
    Patient("Dan", "Forney", Gender.MALE, LocalDate.of(1985, 9, 13), 5400),
    Patient("Lauren", "Michaels", Gender.FEMALE, LocalDate.of(1975, 8, 21), 5000),
    Patient("Michael", "Erlich", Gender.MALE, LocalDate.of(1985, 12, 17), 4100),
    Patient("Jason", "Miles", Gender.MALE, LocalDate.of(1991, 11, 1), 3900),
    Patient("Rebekah", "Earley", Gender.FEMALE, LocalDate.of(1985, 2, 18), 4600),
    Patient("James", "Larson", Gender.MALE, LocalDate.of(1974, 4, 10), 5100),
    Patient("Dan", "Ulrech", Gender.MALE, LocalDate.of(1991, 7, 11), 6000),
    Patient("Heather", "Eisner", Gender.FEMALE, LocalDate.of(1994, 3, 6), 6000),
    Patient("Jasper", "Martin", Gender.MALE, LocalDate.of(1971, 7, 1), 6000)
)

enum class Gender {
    MALE,
    FEMALE
}

Let’s start with some basic analysis: what are the average and standard deviation of whiteBloodCellCount across all the patients? We can leverage some extension functions in Kotlin-Statistics to find this quickly:

// Brings in the Kotlin-Statistics extension functions used throughout this article
import org.nield.kotlinstatistics.*

fun main() {

    val averageWbcc =
        patients.map { it.whiteBloodCellCount }.average()

    val standardDevWbcc = patients.map { it.whiteBloodCellCount }
        .standardDeviation()

    println("Average WBCC: $averageWbcc, Std Dev WBCC: $standardDevWbcc")

    // PRINTS:
    // Average WBCC: 5346.153846153846, Std Dev WBCC: 1412.2177503341948
}
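To make the standard deviation figure less of a black box, here is a minimal sketch using only the Kotlin standard library. The helper name sampleStandardDeviation is hypothetical and this is not the library's code. I am assuming the sample form (dividing by n - 1), because that is the form that reproduces the value printed above for these thirteen counts.

```kotlin
import kotlin.math.sqrt

// Hypothetical helper, not part of Kotlin-Statistics: sample standard
// deviation, dividing the summed squared deviations by n - 1.
fun Iterable<Int>.sampleStandardDeviation(): Double {
    val values = map { it.toDouble() }
    require(values.size > 1) { "need at least two values" }
    val mean = values.average()
    val sumOfSquares = values.map { (it - mean) * (it - mean) }.sum()
    return sqrt(sumOfSquares / (values.size - 1))
}

fun main() {
    // White blood cell counts copied from the patients list above
    val counts = listOf(
        4500, 6700, 3400, 8800, 5400, 5000, 4100,
        3900, 4600, 5100, 6000, 6000, 6000
    )
    println("Average WBCC: ${counts.average()}")
    println("Std Dev WBCC: ${counts.sampleStandardDeviation()}")
}
```

Dividing by n instead of n - 1 would give the population standard deviation, which comes out slightly smaller for the same data.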

We can also create a DescriptiveStatistics object off a collection of items:

fun main() {

    val descriptives = patients
        .map { it.whiteBloodCellCount }
        .descriptiveStatistics

    println("Average: ${descriptives.mean} STD DEV: ${descriptives.standardDeviation}")

    /* PRINTS
    Average: 5346.153846153846 STD DEV: 1412.2177503341948
    */
}

However, we sometimes need to slice our data not only for more detailed insight but also to judge our sample. For example, did we get a representative sample of both male and female patients? We can use the countBy() operator in Kotlin-Statistics to count a Collection or Sequence of items by a keySelector as shown here:

fun main() {

    val genderCounts = patients.countBy { it.gender }

    println(genderCounts)

    // PRINTS
    // {MALE=8, FEMALE=5}
}

This returns a Map<Gender,Int>, reflecting the patient count by gender, which shows {MALE=8, FEMALE=5} when printed.
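For comparison, the same count can be sketched with nothing but the Kotlin standard library, using groupingBy() and eachCount(). The name countBySketch is hypothetical. It only illustrates the idea and is not how the library is written.

```kotlin
enum class Gender { MALE, FEMALE }

// Hypothetical stand-in for countBy(), built only on the standard library
fun <T, K> Iterable<T>.countBySketch(keySelector: (T) -> K): Map<K, Int> =
    groupingBy(keySelector).eachCount()

fun main() {
    // Genders in the same order as the patients list above
    val genders = listOf(
        Gender.MALE, Gender.FEMALE, Gender.FEMALE, Gender.MALE, Gender.MALE,
        Gender.FEMALE, Gender.MALE, Gender.MALE, Gender.FEMALE, Gender.MALE,
        Gender.MALE, Gender.FEMALE, Gender.MALE
    )

    println(genders.countBySketch { it })

    // PRINTS
    // {MALE=8, FEMALE=5}
}
```

The keys appear in first-seen order, which is why MALE comes before FEMALE here.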

Okay, so our sample is a bit MALE-heavy, but let’s move on. We can also find the average white blood cell count by gender using averageBy(). This accepts not only a keySelector lambda but also an intSelector to select an integer off each Patient (we could also use doubleSelector, bigDecimalSelector, etc.). In this case, we select the whiteBloodCellCount off each Patient and average it by Gender. There are two ways we can do this:

APPROACH 1:

fun main() {

    val averageWbccByGender = patients
        .groupBy { it.gender }
        .averageByInt { it.whiteBloodCellCount }

    println(averageWbccByGender)

    // PRINTS
    // {MALE=5475.0, FEMALE=5140.0}
}

APPROACH 2:

fun main() {

    val averageWbccByGender = patients.averageBy(
        keySelector = { it.gender },
        intSelector = { it.whiteBloodCellCount }
    )

    println(averageWbccByGender)

    // PRINTS
    // {MALE=5475.0, FEMALE=5140.0}
}

So the average WBCC for MALE is 5475, and FEMALE is 5140.
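If you want to see what a grouped average boils down to, here is a rough standard-library equivalent: group the selected integers by key, then average each group. The function averageBySketch is a hypothetical name, and I use plain (gender, count) pairs in place of Patient objects to keep the sketch self-contained.

```kotlin
// Hypothetical stand-in for averageBy(): group the selected Int values by key,
// then average each group.
fun <T, K> Iterable<T>.averageBySketch(
    keySelector: (T) -> K,
    intSelector: (T) -> Int
): Map<K, Double> =
    groupBy(keySelector, intSelector)
        .mapValues { (_, values) -> values.average() }

fun main() {
    // (gender, whiteBloodCellCount) pairs copied from the patients list above
    val readings = listOf(
        "MALE" to 4500, "FEMALE" to 6700, "FEMALE" to 3400, "MALE" to 8800,
        "MALE" to 5400, "FEMALE" to 5000, "MALE" to 4100, "MALE" to 3900,
        "FEMALE" to 4600, "MALE" to 5100, "MALE" to 6000, "FEMALE" to 6000,
        "MALE" to 6000
    )

    val averages = readings.averageBySketch(
        keySelector = { it.first },
        intSelector = { it.second }
    )

    println(averages)

    // PRINTS
    // {MALE=5475.0, FEMALE=5140.0}
}
```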

What about age? Did we get a good sampling of younger and older patients? If you look at our Patient class, we only have a birthday to work with, which is a Java 8 LocalDate. But using Java 8's date and time utilities, the age property on Patient derives the age in years, and we can use it in the keySelector like this:

fun main() {

    val patientCountByAge = patients.countBy(
        keySelector = { it.age }
    )

    patientCountByAge.forEach { age, count ->
        println("AGE: $age COUNT: $count")
    }

    /* PRINTS:
    AGE: 30 COUNT: 1
    AGE: 48 COUNT: 1
    AGE: 38 COUNT: 1
    AGE: 37 COUNT: 1
    AGE: 33 COUNT: 3
    AGE: 43 COUNT: 1
    AGE: 27 COUNT: 2
    AGE: 44 COUNT: 1
    AGE: 24 COUNT: 1
    AGE: 47 COUNT: 1
    */
}

If you look at the output, a count by individual age is not very meaningful. It would be better to count by age ranges, like 20–29, 30–39, and 40–49. We can do this using the binByXXX() operators. To bin by an Int value such as age, we can build a BinModel whose first bin starts at 20 (rangeStart) and whose bins each span 10 years (binSize). We also provide the value we are binning with valueSelector, which is the patient's age, as shown below:

fun main() {

    val binnedPatients = patients.binByInt(
        valueSelector = { it.age },
        binSize = 10,
        rangeStart = 20
    )

    binnedPatients.forEach { bin ->
        println(bin.range)
        bin.value.forEach { patient ->
            println(" $patient")
        }
    }
}

/* PRINTS:
[20..29]
 Patient(firstName=Jason, lastName=Miles, gender=MALE...
 Patient(firstName=Dan, lastName=Ulrech, gender=MALE...
 Patient(firstName=Heather, lastName=Eisner, gender=FEMALE...
[30..39]
 Patient(firstName=John, lastName=Simone, gender=MALE...
 Patient(firstName=Jessica, lastName=Arnold, gender=FEMALE...
 Patient(firstName=Sam, lastName=Beasley, gender=MALE...
 Patient(firstName=Dan, lastName=Forney, gender=MALE...
 Patient(firstName=Michael, lastName=Erlich, gender=MALE...
 Patient(firstName=Rebekah, lastName=Earley, gender=FEMALE...
[40..49]
 Patient(firstName=Sarah, lastName=Marley, gender=FEMALE...
 Patient(firstName=Lauren, lastName=Michaels, gender=FEMALE...
 Patient(firstName=James, lastName=Larson, gender=MALE...
 Patient(firstName=Jasper, lastName=Martin, gender=MALE...
*/
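The binning arithmetic itself is simple, and a standard-library sketch may help show what rangeStart and binSize do. The helper binByIntSketch below is hypothetical and returns a plain Map<IntRange, List<T>> rather than the library's BinModel. The ages are the ones from the count-by-age output earlier.

```kotlin
// Hypothetical helper, not the library's implementation: buckets items into
// fixed-width, closed Int ranges whose lower bounds are rangeStart,
// rangeStart + binSize, rangeStart + 2 * binSize, and so on.
fun <T> Iterable<T>.binByIntSketch(
    binSize: Int,
    rangeStart: Int,
    valueSelector: (T) -> Int
): Map<IntRange, List<T>> =
    groupBy { item ->
        // floorDiv keeps values below rangeStart in correctly aligned bins
        val offset = Math.floorDiv(valueSelector(item) - rangeStart, binSize)
        val lower = rangeStart + offset * binSize
        lower..(lower + binSize - 1)
    }.toSortedMap(compareBy<IntRange> { it.first })

fun main() {
    // Ages taken from the count-by-age output earlier in the article
    val ages = listOf(30, 48, 38, 37, 33, 33, 33, 43, 27, 27, 44, 24, 47)

    val bins = ages.binByIntSketch(binSize = 10, rangeStart = 20) { it }

    bins.forEach { (range, members) -> println("$range -> $members") }

    // Looking up the bin that contains a given value, such as 25
    val binFor25 = bins.entries.first { 25 in it.key }
    println(binFor25.key)
}
```

With these ages, the three bins hold 3, 6, and 4 values, matching the patient groupings printed above.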

We can look up the bin for a given age using a getter syntax. For example, we can retrieve the Bin for the age 25 like this, and it will return the 20-29 bin:

fun main() {

    val binnedPatients = patients.binByInt(
        valueSelector = { it.age },
        binSize = 10,
        rangeStart = 20
    )

    println(binnedPatients[25])
}

If we do not want to collect the items into bins but rather perform an aggregation on each bin, we can do that by also providing a groupOp argument. This lambda specifies how to reduce the List<Patient> for each Bin. Below is the average white blood cell count by age range:

val avgWbccByAgeRange = patients.binByInt(
    valueSelector = { it.age },
    binSize = 10,
    rangeStart = 20,
    groupOp = { it.map { it.whiteBloodCellCount }.average() }
)

println(avgWbccByAgeRange)

/* PRINTS:
BinModel(bins=[Bin(range=[20..29], value=5300.0),
    Bin(range=[30..39], value=5133.333333333333),
    Bin(range=[40..49], value=5700.0)]
)
*/

There may be times you want to perform multiple aggregations to create reports of various metrics. This is usually achievable using Kotlin’s let() scope function. Say you wanted to find the 1st, 25th, 50th, 75th, and 100th percentiles by gender. We can declare a local extension function called wbccPercentileByGender(), which takes a collection of patients and calculates a given percentile for each gender. Then we can invoke it for the five desired percentiles and package the results in a Map<Double,Map<Gender,Double>>, as shown below:

fun main() {

    fun Collection<Patient>.wbccPercentileByGender(
        percentile: Double) =
        percentileBy(
            percentile = percentile,
            keySelector = { it.gender },
            valueSelector = {
                it.whiteBloodCellCount.toDouble()
            }
        )

    val percentileQuadrantsByGender = patients.let {
        mapOf(1.0 to it.wbccPercentileByGender(1.0),
            25.0 to it.wbccPercentileByGender(25.0),
            50.0 to it.wbccPercentileByGender(50.0),
            75.0 to it.wbccPercentileByGender(75.0),
            100.0 to it.wbccPercentileByGender(100.0)
        )
    }

    percentileQuadrantsByGender.forEach(::println)
}

/* PRINTS:
1.0={MALE=3900.0, FEMALE=3400.0}
25.0={MALE=4200.0, FEMALE=4000.0}
50.0={MALE=5250.0, FEMALE=5000.0}
75.0={MALE=6000.0, FEMALE=6350.0}
100.0={MALE=8800.0, FEMALE=6700.0}
*/
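Percentiles can be computed in several slightly different ways, so it may be useful to see one rule that reproduces the numbers above. The sketch below sorts the values, places the percentile at position p * (n + 1) / 100, and interpolates linearly between the two neighbouring values. The helper percentileSketch is hypothetical, and I am only claiming that this rule is consistent with the printed output, not that it is the library's exact code.

```kotlin
import kotlin.math.floor

// Hypothetical helper: percentile by linear interpolation at the 1-based
// position p * (n + 1) / 100 in the sorted values.
fun List<Double>.percentileSketch(percentile: Double): Double {
    require(isNotEmpty()) { "need at least one value" }
    require(percentile > 0.0 && percentile <= 100.0) { "percentile must be in (0, 100]" }
    val sorted = sorted()
    val n = sorted.size
    val position = percentile * (n + 1) / 100.0
    if (position < 1.0) return sorted.first()   // below the first rank: smallest value
    if (position >= n) return sorted.last()     // at or past the last rank: largest value
    val lowerRank = floor(position).toInt()     // 1-based rank of the lower neighbour
    val fraction = position - lowerRank
    val lower = sorted[lowerRank - 1]
    val upper = sorted[lowerRank]
    return lower + fraction * (upper - lower)
}

fun main() {
    // White blood cell counts copied from the patients list above, split by gender
    val male = listOf(4500.0, 8800.0, 5400.0, 4100.0, 3900.0, 5100.0, 6000.0, 6000.0)
    val female = listOf(6700.0, 3400.0, 5000.0, 4600.0, 6000.0)

    for (p in listOf(1.0, 25.0, 50.0, 75.0, 100.0)) {
        println("$p={MALE=${male.percentileSketch(p)}, FEMALE=${female.percentileSketch(p)}}")
    }
}
```

For example, the eight MALE values put the 25th percentile at position 2.25, a quarter of the way from 4100 to 4500, which gives 4200.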

This was a somewhat simple introduction to Kotlin-Statistics. Be sure to read the project’s README to see a more comprehensive set of operators available in the library (it also has some disparate tools like a Naive Bayes Classifier and stochastic operators).

I hope this demonstrates Kotlin’s efficacy in being tactical but robust. Kotlin is capable of rapid turnaround for quick ad hoc analysis, but you can take that statically-typed code and evolve it with many compile-time checks. While you may think Kotlin does not have the ecosystem that Python or R has, it actually has a lot of libraries and capabilities already on the JVM. As Kotlin/Native gains traction, it will be interesting to see what numerical libraries might rise from the Kotlin ecosystem.

To get some resources on using Kotlin for data science purposes, I have curated a list here:

Here are some other articles I have made demonstrating Kotlin for mathematical modeling: