CSV files

CSV files might not be a go-to format for big data, but as a data scientist or developer working in machine learning, you are sure to encounter this format. You might need a mapping of zip codes to latitude/longitude and find this as a CSV file on the internet, or you may be given sales figures from your sales team in a CSV format. In any event, we need to understand how to parse these files.

The main package that we will utilize in parsing CSV files is encoding/csv from Go's standard library. However, we will also discuss a couple of packages that allow us to quickly manipulate or transform CSV data: github.com/kniren/gota/dataframe and go-hep.org/x/hep/csvutil.

Reading in CSV data from a file

Let's consider a simple CSV file, which we will return to later, named iris.csv (available here: https://archive.ics.uci.edu/ml/datasets/iris). This CSV file includes four float columns of flower measurements and a string column with the corresponding flower species:

    $ head iris.csv
    5.1,3.5,1.4,0.2,Iris-setosa
    4.9,3.0,1.4,0.2,Iris-setosa
    4.7,3.2,1.3,0.2,Iris-setosa
    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,3.6,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    4.6,3.4,1.4,0.3,Iris-setosa
    5.0,3.4,1.5,0.2,Iris-setosa
    4.4,2.9,1.4,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa

With encoding/csv imported, we first open the CSV file and create a CSV reader value:

    // Open the iris dataset file.
    f, err := os.Open("../data/iris.csv")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Create a new CSV reader reading from the opened file.
    reader := csv.NewReader(f)

Then we can read in all of the records (corresponding to rows) of the CSV file. These records are imported as [][]string:

    // Assume we don't know the number of fields per line. By setting
    // FieldsPerRecord negative, each row may have a variable
    // number of fields.
    reader.FieldsPerRecord = -1

    // Read in all of the CSV records.
    rawCSVData, err := reader.ReadAll()
    if err != nil {
        log.Fatal(err)
    }

We can also read in records one at a time in an infinite loop. Just make sure that you check for the end of the file (io.EOF) so that the loop ends after reading in all of your data, and that you handle any other error before using the record:

    // Create a new CSV reader reading from the opened file.
    reader := csv.NewReader(f)
    reader.FieldsPerRecord = -1

    // rawCSVData will hold our successfully parsed rows.
    var rawCSVData [][]string

    // Read in the records one by one.
    for {

        // Read in a row. Check if we are at the end of the file.
        record, err := reader.Read()
        if err == io.EOF {
            break
        }

        // Stop on any other error rather than appending a nil record.
        if err != nil {
            log.Fatal(err)
        }

        // Append the record to our dataset.
        rawCSVData = append(rawCSVData, record)
    }

Note: If your CSV file is not delimited by commas and/or contains commented rows, you can use the csv.Reader.Comma and csv.Reader.Comment fields to handle these formats. If the fields in your CSV file are single-quoted, you may need to add a helper function that trims the single quotes before parsing the values.
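As a minimal sketch of the note above, the following program sets csv.Reader.Comma and csv.Reader.Comment and then trims single quotes from each field. The helper name readDelimited and the semicolon-delimited sample input are assumptions made for illustration; they are not part of the iris dataset or of encoding/csv.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"log"
	"strings"
)

// readDelimited is a hypothetical helper. It parses CSV-like text that uses
// a custom delimiter and comment character, then strips leading and trailing
// single quotes from every field.
func readDelimited(input string, comma, comment rune) ([][]string, error) {
	reader := csv.NewReader(strings.NewReader(input))
	reader.Comma = comma       // field delimiter, for example ';'
	reader.Comment = comment   // lines starting with this rune are skipped
	reader.FieldsPerRecord = -1 // allow a variable number of fields

	records, err := reader.ReadAll()
	if err != nil {
		return nil, err
	}

	// encoding/csv only treats double quotes specially, so single quotes
	// arrive as part of the field and have to be trimmed by hand.
	for _, record := range records {
		for i, field := range record {
			record[i] = strings.Trim(field, "'")
		}
	}
	return records, nil
}

func main() {
	// Made-up sample data: semicolon-delimited, one comment row.
	input := "# sepal length;sepal width;species\n" +
		"5.1;3.5;'Iris-setosa'\n" +
		"4.9;3.0;'Iris-setosa'\n"

	records, err := readDelimited(input, ';', '#')
	if err != nil {
		log.Fatal(err)
	}
	for _, record := range records {
		fmt.Println(record)
	}
}
```

When reading from a file, the same reader fields can be set on the value returned by csv.NewReader(f).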

Handling unexpected fields

The preceding methods work fine with clean CSV data, but, in general, we don't encounter clean data. We have to parse messy data. For example, you might find unexpected fields or numbers of fields in your CSV records. This is why reader.FieldsPerRecord exists. This field of the reader value lets us easily handle messy data such as the following:

    4.3,3.0,1.1,0.1,Iris-setosa
    5.8,4.0,1.2,0.2,Iris-setosa
    5.7,4.4,1.5,0.4,Iris-setosa
    5.4,3.9,1.3,0.4,blah,Iris-setosa
    5.1,3.5,1.4,0.3,Iris-setosa
    5.7,3.8,1.7,0.3,Iris-setosa
    5.1,3.8,1.5,0.3,Iris-setosa

This version of the iris.csv file has an extra field in one of the rows. We know that each record should have five fields, so let's set our reader.FieldsPerRecord value to 5:

    // We should have 5 fields per line. By setting
    // FieldsPerRecord to 5, we can validate that each of the
    // rows in our CSV has the correct number of fields.
    reader.FieldsPerRecord = 5

Then, as we are reading in records from the CSV file, we can check for unexpected fields and maintain the integrity of our data:

    // rawCSVData will hold our successfully parsed rows.
    var rawCSVData [][]string

    // Read in the records looking for unexpected numbers of fields.
    for {

        // Read in a row. Check if we are at the end of the file.
        record, err := reader.Read()
        if err == io.EOF {
            break
        }

        // If we had a parsing error, log the error and move on.
        if err != nil {
            log.Println(err)
            continue
        }

        // Append the record to our dataset, if it has the expected
        // number of fields.
        rawCSVData = append(rawCSVData, record)
    }

Here, we have chosen to handle the error by logging it, and we only collect successfully parsed records into rawCSVData. This error could be handled in many different ways. The important thing is that we are forcing ourselves to check for an expected property of the data and increasing the integrity of our application.
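One of the other ways to handle this error is to skip only rows with the wrong number of fields and stop on anything else. The sketch below does that by testing for csv.ErrFieldCount with errors.Is. The function name readValid and its string input are assumptions chosen to keep the example self-contained; a reader created from an opened file behaves the same way.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"log"
	"strings"
)

// readValid is a hypothetical helper. It keeps rows that have exactly
// `fields` fields, counts rows with a different number of fields, and
// returns early on any other error.
func readValid(input string, fields int) ([][]string, int, error) {
	reader := csv.NewReader(strings.NewReader(input))
	reader.FieldsPerRecord = fields

	var good [][]string
	skipped := 0

	for {
		record, err := reader.Read()
		if errors.Is(err, io.EOF) {
			return good, skipped, nil
		}

		// A wrong field count is reported as a *csv.ParseError that
		// wraps csv.ErrFieldCount, so errors.Is can detect it.
		if errors.Is(err, csv.ErrFieldCount) {
			skipped++
			continue
		}

		// Any other error (malformed quoting, I/O failure) stops the read.
		if err != nil {
			return good, skipped, err
		}

		good = append(good, record)
	}
}

func main() {
	// Three rows taken from the messy sample above; the second has six fields.
	input := "5.7,4.4,1.5,0.4,Iris-setosa\n" +
		"5.4,3.9,1.3,0.4,blah,Iris-setosa\n" +
		"5.1,3.5,1.4,0.3,Iris-setosa\n"

	good, skipped, err := readValid(input, 5)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("kept %d rows, skipped %d rows\n", len(good), skipped)
}
```

Separating field-count errors from other errors avoids retrying forever on a failure that will not go away, such as a read error on the underlying file.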

Handling unexpected types

We just saw that CSV data is read into Go as [][]string. However, Go is statically typed, which allows us to enforce strict checks for each of the CSV fields. We can do this as we parse each field for further processing. Consider some messy data that has random fields that don't match the type of the other values in a column:

    4.6,3.1,1.5,0.2,Iris-setosa
    5.0,string,1.4,0.2,Iris-setosa
    5.4,3.9,1.7,0.4,Iris-setosa
    5.3,3.7,1.5,0.2,Iris-setosa
    5.0,3.3,1.4,0.2,Iris-setosa
    7.0,3.2,4.7,1.4,Iris-versicolor
    6.4,3.2,4.5,1.5,
    6.9,3.1,4.9,1.5,Iris-versicolor
    5.5,2.3,4.0,1.3,Iris-versicolor
    4.9,3.1,1.5,0.1,Iris-setosa
    5.0,3.2,1.2,string,Iris-setosa
    5.5,3.5,1.3,0.2,Iris-setosa
    4.9,3.1,1.5,0.1,Iris-setosa
    4.4,3.0,1.3,0.2,Iris-setosa

To check the types of the fields in our CSV records, let's create a struct type to hold successfully parsed values:

    // CSVRecord contains a successfully parsed row of the CSV file.
    type CSVRecord struct {
        SepalLength float64
        SepalWidth  float64
        PetalLength float64
        PetalWidth  float64
        Species     string
        ParseError  error
    }

Then, before we loop over the records, let's initialize a slice of these values:

    // Create a slice value that will hold all of the successfully parsed
    // records from the CSV.
    var csvData []CSVRecord

Now, as we loop over the records, we can parse each value into the type expected for its column, catch any errors, and log as needed:

    // Read in the records looking for unexpected types.
    for {

        // Read in a row. Check if we are at the end of the file.
        record, err := reader.Read()
        if err == io.EOF {
            break
        }

        // Skip rows that the CSV reader itself could not parse, so that
        // an empty CSVRecord is never appended.
        if err != nil {
            log.Println(err)
            continue
        }

        // Create a CSVRecord value for the row.
        var csvRecord CSVRecord

        // Parse each of the values in the record based on an expected type.
        for idx, value := range record {

            // Parse the value in the record as a string for the string column.
            if idx == 4 {

                // Validate that the value is not an empty string. If the
                // value is an empty string break the parsing loop.
                if value == "" {
                    log.Printf("Unexpected type in column %d\n", idx)
                    csvRecord.ParseError = fmt.Errorf("Empty string value")
                    break
                }

                // Add the string value to the CSVRecord.
                csvRecord.Species = value
                continue
            }

            // Otherwise, parse the value in the record as a float64.
            var floatValue float64

            // If the value can not be parsed as a float, log and break the
            // parsing loop.
            if floatValue, err = strconv.ParseFloat(value, 64); err != nil {
                log.Printf("Unexpected type in column %d\n", idx)
                csvRecord.ParseError = fmt.Errorf("Could not parse float")
                break
            }

            // Add the float value to the respective field in the CSVRecord.
            switch idx {
            case 0:
                csvRecord.SepalLength = floatValue
            case 1:
                csvRecord.SepalWidth = floatValue
            case 2:
                csvRecord.PetalLength = floatValue
            case 3:
                csvRecord.PetalWidth = floatValue
            }
        }

        // Append successfully parsed records to the slice defined above.
        if csvRecord.ParseError == nil {
            csvData = append(csvData, csvRecord)
        }
    }