There are many scenarios where you need to quickly analyze, modify and process large files, both in number and size. However, files are often text-based Comma Separated Values (CSV) files. Opening them with standard spreadsheet applications, such as Excel, LibreOffice or OpenOffice, overloads a machine's memory. Also, processing a large number of files in one batch often fails after a few hours because of some unexpected file anomaly.

You inevitably spend a lot of time in frozen screens, restarts and long waits.

However, most of these tasks could be carried out with a few lines of code. What's more, familiarity with a few simple shell command lines can go a long way in saving time and reducing frustration.

This post will give an overview of some shell commands that I use nearly every day. I hope you find them useful.

Quick Notes

Note that there are different types of shell (bash, zsh, ...), with bash shell the most common as it is the default shell on OS X and major linux distributions. Zsh (and Oh My Zsh) is a popular and powerful alternative. The shell commands that are included in this blog post have been tested on bash on OS X (macOS) and should work with other shells and environments.

All the following examples are based on the adult dataset from the UCI Machine Learning repository also known as "Census Income" dataset and available here. This data set is commonly used to predict whether income exceeds \$50K/yr based on census data. With 48842 rows and 14 attributes, it is not a large dataset by far but will be sufficient to illustrate the examples. Although the data is stored with the .data extension, it is a well-formatted CSV file. Feel free to download the dataset to follow along!

Count with wc

Given a new text based file, you want to know how many lines it contains. This can be done with the word count wc -l command:

$ wc -l adult.data 32562 adult.data

Which tells you that adult.data file contains 32562 rows. The -l flag tells wc to count the lines. But you can also use wc to count words instead using the -w flag.

$ wc -w adult.data 488415 adult.data

The adult.data file contains nearly 500k words.

The wc command can also count the number of files in a directory by using the output of a simple ls -l command as input to wc . Using the output of a command as input to another command using the pipe symbol | is a useful shell pattern called pipelining. You'll see it in several examples throughout this post.

Assume now you have a folder with many files. The following command will tell you exactly how many file it contains:

$ ls -l <folder> | wc -l 508

There are many other applications for wc if you start adding wildcards and subfolders.

Let's go back to your adult.data file and look at the first lines with the head command. By default, head outputs 10 lines, you limit the output to the first 2 lines using the -n flag.

$ head -n 2 adult.data 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K 50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, <=50K

And you notice that the file does not have a header line. The names of columns are not within the file. You can add that header line by concatenating two files using the cat command.

Concatenate files with cat

The cat command prints a file's content to the standard output (aka your terminal). It can also be used to concatenate a series of files. The command cat file_1.csv file_2.csv > target_file.csv will merge the content of both file_1.csv and file_2.csv into target_file.csv , adding file_2.csv at the end of file_1.csv .

The header file is not in the original dataset and you need to create it. To do so, you echo the comma separated list of column names into a header.csv file.

$ echo "age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,class" > header.csv

In this example, you used the shell redirection > character to dump the output of cat to the adult.csv file. The > will either create a file or replace its content entirely if the file already exists. Doubling the symbol >> will append the new content to an already existing file without erasing its content.

Let's now add the header.csv file at the beginning of the adult.data file. At the same time, you rename adult.data to adult.csv since it is, after all, a CSV formatted file.

$ cat header.csv adult.data > adult.csv

Check that the first row of adult.csv contains the column names with:

$ head -n 1 adult.csv age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,native-country,class 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K

Modify a file with sed

Another frequent data mining scenario happens when a file is corrupted or badly formatted, such as with non UTF-8 characters or a misplaced comma. You can correct that file without actually opening it using the sed command.

The generic sed pattern is

$ sed "s/<string to replace>/<string to replace it with>/g" <source_file> > <target_file>.

The adult.csv dataset uses the ? character to denote a missing value. You can replace that character with a better suited default value using sed . The empty string is preferable as an indication of a missing value as it will be interpreted as a NaN value when loading the data in a Pandas DataFrame.

$ grep ", ?," adult.csv | wc -l 2399

This gives you a count of 2399 lines with at least one column with a missing value denoted by ? . The following command will replace all the columns with ? by an empty string. All the cell that only contain the ? , followed by a space will now be truly empty.

$ sed "s/, ?,/,,/g" adult.csv > adult.csv

Note that you use the column delimiter , in the source and target strings to avoid replacing legit question marks that could be present elsewhere in the dataset.

Subset a large file

Imagine now that you're dealing with a file with over 30 Million rows. As you craft your Python or R script, you will want to test your workflow on just a sample of that original large file and not have to load the whole dataset into memory. You can use a combination of head and tail to create a subsample of the original dataset.

head outputs the first part of files;

outputs the first part of files; tail outputs the last part of files

As you've seen before, these commands take a -n flag, which stands for --lines , that restrict the number of lines in the output. tail also takes a -f flag where f stands for --follow which outputs appended data as lines are added to the end of the file. This is particularly convenient to monitor log files as your scripts are running using tail -f <logfile> .

Mixing in the pipe symbol | with head and tail , you can extract a certain number of lines from within a source file and export the content to a subset file. For example, to extract 20 lines starting at line 100:

$ head -n 120 adult.csv | tail -n 20 > adult_sample.csv

Note that you first take the first_line + number_of_lines with head and then tail the number_of_lines. The generic subsetting command is:

$ head -n <total_lines> <source_file> | tail -n <number_of_lines> > <target_file>

where total_lines = first_line + number_of_lines . However, your new sampled file no longer contains the header row. We already know how to add the header back in the subset file using cat to concatenate the subset file and the header.csv file created before:

$ cat adult_sample.csv header.csv > adult_sample_with_header.csv

Note that you did not use the source filename adult\_sample.csv as the target filename, but instead created a new file titled adult\_sample\_with\_header.csv . Having the same name as one of the sources and as the target while using cat will result in unexpected file content. It is better to create a new file as the output of the cat command.

Finding duplicates with uniq

With the uniq command you can find adjacent repeated lines in a file. uniq takes several flags, the more useful ones being:

uniq -c : which adds the repetition count to each line;

: which adds the repetition count to each line; uniq -d : which only outputs duplicate lines; And

: which only outputs duplicate lines; And uniq -u : which only outputs unique lines.

However, uniq is not a smart command. Repeated lines will not be detected if they are not adjacent. Which means that you first need the sort the file. This command counts the number of duplicated lines in adult.csv .

$ sort adult.csv | uniq -d | wc -l 23

and shows that there are 23 duplicates. The next command takes the output of all lines with added repetition counts, sorts in reverse and outputs the first 3 duplicates:

$ sort adult.csv | uniq -c | sort -r | head -n 3 3 25, Private, 195994, 1st-4th, 2, Never-married, ... 2 90, Private, 52386, Some-college, 10, Never-married, ... 2 49, Self-emp-not-inc, 43479, Some-college, 10, Married-civ-spouse, ...

There are many powerful options that can be obtained by combining sort and uniq with different flags. Use the man sort and man uniq to further explore these commands.

Selecting columns with cut

The great thing about CSV files and shell commands is that you can also work at the column level by using cut to select a particular column. cut takes two main flags: -d to specify the column delimiter and -f to specify the columns you want to work on. In the following example, you use cut to find the number of unique values taken by the categorical variable workclass (column 2).

First select the column workclass and pipe to head to verify that you have the right column:

$ cut -d "," -f 2 adult.csv | head -3 workclass State-gov Self-emp-not-inc

Now, to count uniques, you sort the output of cut and pipe the result to uniq -c , as such:

$ cut -d "," -f 2 adult.csv | sort | uniq -c 1837 960 Federal-gov 2093 Local-gov 7 Never-worked 22696 Private 1116 Self-emp-inc 2541 Self-emp-not-inc 1298 State-gov 14 Without-pay 1 workclass

Which tells you, for instance, that you have 1837 null values, and that the main class is by far the Private class with 22969 occurrences.

The command line above is similar to the value_counts() method applied to a DataFrame containing the adult.csv data.

Looping

So far, you've mostly worked on one file. To rename, process or transfer large number of files, you should add loops to our shell toolkit. The generic format of a loop in bash shell is:

while true; do _do something_ ; done

Let's put that to work. Imagine you have 1000 files with spaces in their filename and you want to replace each space by an undescore "_".

replace_source=' ' replace_target='_' for filename in ./*.csv; do new_filename=${filename//$replace_source/$replace_target} mv "$filename" "$new_filename" done

In this example,

You first declare two variables: replace_source and replace_target as placeholders for the space and the underscore characters

and as placeholders for the space and the underscore characters You loop over all the *.csv files in the current folder

files in the current folder for each filename , you create a new_filename by replacing each space by an underscore

, you create a by replacing each space by an underscore and you rename the file from its current filename to the new filename with the mv command

In fact, not only have you used shell to loop through files in a directory, but you also have created variables.

Variables

You will end this intro to shell command with variables. Variable creation in shell is simply done by

$ varname='<a string>' $ varname=a number

For instance:

$ varname='Hello world' $ varname=123.4

Note that there is no space around the '=' sign. var = '<a string>' would not work. To return the variable value, simply echo it:

$ echo $varname 123.4

And to use it in a script, encapsulates it within quotes as you've seen before:

$ mv "$filename" "$new_filename"

Writing loops and using variables opens the door for more complex file manipulations and is a gateway to shell scripting which is beyond this introductory article. However, shell scripting is simple enough if you start small and improve as your confidence grows.

Conclusion

Shell command lines are extremely useful in your daily work as data scientists. There are many more examples and use cases that could be explored. And as you will see there are often always different ways to achieve a particular result using a wide range of available shell commands and related flags. In that case, keeping it simple is always the winning strategy.

Hopefully the examples presented here will help you integrate shell commands in your data processing toolkit. Feel free to reach out and share your shell tricks with me on twitter: @alexip.