Swift Solution to Dave Thomas’s Data Munging Kata Part 1

I’m solving Dave Thomas’s code katas in Swift and will be sharing my thought process and solutions. Below is my solution to Part 1 of the Data Munging kata [project repo]. The challenge is to analyze a messy, real-world data set.

In weather.dat you’ll find daily weather data for Morristown, NJ for June 2002. Download this text file, then write a program to output the day number (column one) with the smallest temperature spread (the maximum temperature is the second column, the minimum the third column).

Weather.dat file

My Approach

List all the steps

Before writing any code, I broke the problem down into baby steps.

1. load data file

2. transform data into string

3. split strings into arrays broken up by new line characters and trim whitespace

4. ignore any row not starting with an integer

5. split each row into columns, creating a table

6. remove unneeded columns

7. clean messed up data items

8. convert strings to Ints

9. create a WeatherData data structure with three values: dayIndex, maxTemperature, minTemperature, and a computed property temperatureDelta

10. create array of WeatherData items

11. find WeatherData item with smallest temperatureDelta

12. return integer of that day

Data Structures

When choosing a data type for WeatherRecord, I went with a struct because it’s preferable to use a value type over a reference type when you can. With a value type you avoid the side effects that come with shared references. You could use either an enum or a struct in this case; I chose a struct because we’re only dealing with one ‘kind’ of record, which is semantically closer to what a struct is.
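To make the value-type point concrete, here is a minimal sketch of how a struct copy behaves. The properties are declared with var purely for this demonstration, and the name dayIndex and the sample numbers are assumptions, not taken from the project.

```swift
// Hypothetical, mutable variant of the record, used only to demonstrate copy behavior.
struct WeatherRecord {
    var dayIndex: Int
    var maximumDailyTemperature: Double
    var minimumDailyTemperature: Double
}

let original = WeatherRecord(dayIndex: 1, maximumDailyTemperature: 88, minimumDailyTemperature: 59)
var copy = original          // structs are copied on assignment
copy.dayIndex = 2            // mutating the copy...
print(original.dayIndex)     // ...leaves the original untouched: prints 1
print(copy.dayIndex)         // prints 2
```

With a class, both names would point at the same instance and the change would show up in both places.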

Next, I grouped each of the twelve tasks in my list into a bucket:

DataLoader

DataTransformer

WeatherRecordEvaluator

Executor

Because none of these types would benefit from inheritance or have a reference type as a property, I chose to use a struct for each. I then stubbed out the method signatures based on the task list.

While reviewing the goal of the kata, what struck me was that nowhere in this program do we need to maintain state. We just need to apply functions to arrays (and arrays of arrays) of data. Swift’s higher order functions are perfect for this kind of work.

TDD

I don’t think writing unit tests adds much value in apps where you’re primarily writing front-end code, using Core Data and making network calls. But for projects like this, where you’re manipulating data in a series of steps, I love the discipline and security you get from having every method under test. I used test-driven development for this entire application.

Data Loader

Don’t be put off by the .dat file extension; there’s nothing fancy or proprietary about it. Just think of it as any text file. To load the data as a string, get the file’s path and pass it to String’s contentsOfFile initializer.
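A minimal sketch of the loading step in current Swift might look like this. The function name loadData(atPath:) and the choice to return nil on failure are assumptions for illustration, and UTF-8 is assumed as the file encoding.

```swift
import Foundation

// Hypothetical helper: returns the file's contents as one string,
// or nil if the file is missing or can't be decoded as UTF-8.
func loadData(atPath path: String) -> String? {
    return try? String(contentsOfFile: path, encoding: .utf8)
}
```

Returning an optional keeps the caller honest about the missing-file case, which is also easy to cover with a unit test.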

Data Transformer

The bulk of the work in this kata is transforming the data (‘munge’ is a data science term for cleaning and transforming data). The code below is specific to this assignment; I intentionally did not attempt to create a generic tool.

We only want to use rows starting with a numeric value, because days of the month are represented numerically. removeUnneededRows() grabs the first character in each row and attempts to convert it to a Double. If the result is nil (meaning the conversion failed), we filter the row out.
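Here is a rough sketch of that filter in current Swift. The method name comes from the text above, but the signature and the sample rows are assumptions, and the rows are assumed to be already trimmed.

```swift
// Keeps only rows whose first character converts to a Double (i.e. a digit).
func removeUnneededRows(_ rows: [String]) -> [String] {
    return rows.filter { row in
        guard let first = row.first else { return false }   // drop empty rows
        return Double(String(first)) != nil
    }
}

let rows = ["Dy MxT MnT", "1 88 59", "", "mo 82.9 60.5"]
print(removeUnneededRows(rows))   // ["1 88 59"]
```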

We turn the array of strings into a table with convertRowsIntoATable() by splitting each row on one or more blank spaces using .split(). This leaves us with an array of string arrays, [[String]], that forms a table.

.split() definition
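The splitting step can be sketched like this in current Swift. The signature is an assumption. split(separator:) omits empty subsequences by default, which is what collapses runs of spaces into a single column break.

```swift
// Turns each row into an array of cells, splitting on runs of spaces.
func convertRowsIntoATable(_ rows: [String]) -> [[String]] {
    return rows.map { row in
        // split returns Substrings, so convert each piece back to a String
        row.split(separator: " ").map { String($0) }
    }
}

print(convertRowsIntoATable(["1  88    59    74"]))   // [["1", "88", "59", "74"]]
```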

To get rid of unneeded columns, removeUnneededColumns() loops over each row and only keeps the first three cells. This method works properly, but is ugly and needs to be refined (any suggestions?).
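One possible tidier version, offered as a suggestion rather than as the project’s code: prefix(3) keeps at most the first three cells of each row and doesn’t crash on rows that are shorter. The signature is an assumption.

```swift
// Keeps only the first three columns (day, max, min) of every row.
func removeUnneededColumns(_ table: [[String]]) -> [[String]] {
    return table.map { Array($0.prefix(3)) }
}

print(removeUnneededColumns([["1", "88", "59", "74"]]))   // [["1", "88", "59"]]
```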

Notice that not all of our data points are ‘clean’; some contain junk characters such as a ‘*’. To fix this, cleanData() builds a set of invalid characters and strips them out.

let invalidCharacters = NSCharacterSet(charactersInString: "0123456789.").invertedSet

Another thing to note (which I was fuzzy on before this kata) is that when you nest higher order functions in Swift, the shorthand arguments (e.g. array.map { $0 * 2 }) always refer to the innermost closure, so an inner closure can’t reach the outer closure’s $0. When it needs to, list the parameters by name and use the ‘in’ keyword.
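A minimal sketch of the cleaning step with named parameters in both closures follows. It is written against current Swift, where the character set type is CharacterSet and the inverse is spelled inverted. The signature and the sample row are assumptions.

```swift
import Foundation

// Everything that is not a digit or a decimal point counts as junk.
let invalidCharacters = CharacterSet(charactersIn: "0123456789.").inverted

// Strips junk characters from every cell of the table.
func cleanData(_ table: [[String]]) -> [[String]] {
    return table.map { row in
        row.map { cell in
            // Splitting on the junk characters and rejoining removes them.
            cell.components(separatedBy: invalidCharacters).joined()
        }
    }
}

print(cleanData([["26", "97*", "64"]]))   // [["26", "97", "64"]]
```

Note that this set would also strip a minus sign, so it only suits data without negative values.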

Weather Record Evaluator

Now that we’ve loaded our data, organized it, cleaned it, and transformed it into an array of WeatherRecords, we need to find the day with the smallest temperature variation. Since the WeatherRecord struct has the computed property temperatureDelta:

var temperatureDelta: Double { return maximumDailyTemperature - minimumDailyTemperature }

all we need to do is loop over the records and find the one with the smallest delta. The minElement() method will do just this (note that we’re back to using the more concise $0, $1 syntax now that we’re not using nested closures).
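In Swift 3 and later, minElement() is spelled min(by:). Here is a small sketch of the evaluation step using that name. The function dayWithSmallestSpread, the dayIndex property and the sample records are hypothetical.

```swift
struct WeatherRecord {
    let dayIndex: Int
    let maximumDailyTemperature: Double
    let minimumDailyTemperature: Double
    var temperatureDelta: Double { return maximumDailyTemperature - minimumDailyTemperature }
}

// Returns the day with the smallest spread, or nil for an empty array.
func dayWithSmallestSpread(_ records: [WeatherRecord]) -> Int? {
    return records.min(by: { $0.temperatureDelta < $1.temperatureDelta })?.dayIndex
}

// Made-up sample values, not rows from weather.dat.
let sampleRecords = [
    WeatherRecord(dayIndex: 1, maximumDailyTemperature: 88, minimumDailyTemperature: 59),
    WeatherRecord(dayIndex: 2, maximumDailyTemperature: 79, minimumDailyTemperature: 63),
    WeatherRecord(dayIndex: 3, maximumDailyTemperature: 77, minimumDailyTemperature: 55),
]
if let day = dayWithSmallestSpread(sampleRecords) {
    print(day)   // prints 2 (spread of 16)
}
```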

Executor

We’ve now got all the pieces needed to solve this kata; we just need to put them together and get the correct answer. We’ll do this with an Executor struct that calls each method in turn, feeding it the output of the previous one, and returns the answer. The unit test for execute() also acts as an integration test that validates that everything is working as expected.
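To show how the pieces chain together, here is a compact end-to-end sketch in current Swift. It is not the project’s Executor: to stay self-contained it inlines the steps instead of calling the separate types, and it takes the file contents as a string rather than a path. The sample input is made up.

```swift
import Foundation

struct WeatherRecord {
    let dayIndex: Int
    let maximumDailyTemperature: Double
    let minimumDailyTemperature: Double
    var temperatureDelta: Double { return maximumDailyTemperature - minimumDailyTemperature }
}

struct Executor {
    // Hypothetical signature: file contents in, day number out (nil if no valid rows).
    func execute(_ contents: String) -> Int? {
        let invalid = CharacterSet(charactersIn: "0123456789.").inverted

        // Split into rows and trim surrounding whitespace.
        let rows = contents
            .components(separatedBy: .newlines)
            .map { $0.trimmingCharacters(in: .whitespaces) }

        let records = rows.compactMap { row -> WeatherRecord? in
            // Ignore rows that don't start with a digit.
            guard let first = row.first, Double(String(first)) != nil else { return nil }
            // Split into cells, keep three columns, strip junk characters.
            let cells = row.split(separator: " ").prefix(3).map { cell in
                String(cell).components(separatedBy: invalid).joined()
            }
            guard cells.count == 3,
                  let day = Int(cells[0]),
                  let high = Double(cells[1]),
                  let low = Double(cells[2]) else { return nil }
            return WeatherRecord(dayIndex: day,
                                 maximumDailyTemperature: high,
                                 minimumDailyTemperature: low)
        }

        return records.min(by: { $0.temperatureDelta < $1.temperatureDelta })?.dayIndex
    }
}

let sample = "Dy MxT MnT\n 1 88 59\n 2 79 63\n 3 77* 55\nmo 82.9 60.5\n"
if let day = Executor().execute(sample) {
    print(day)   // prints 2
}
```

Because every step is a pure function of its input, the whole pipeline can be checked with a single string fixture like the one above.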

I’ll be writing up my solutions and thought process behind other katas in this series. Please let me know what was useful and what wasn’t. Also, I’d love to hear your ideas on how to improve any section of the code.