Getting comfortable with the Unix pipe

Files ain’t going away anytime soon, son.

If you’re a developer, engineer, technical person who’s expected to know something about coding, you’re gonna be working with files. You might as well learn how to get glean some information from it without busting open excel, lest you want to be mistaken for a business person.

Pipe it like it’s hot

The Unix pipe is the weapon I use to get a quick picture of what’s going on in a file, whether it’s a large data file [if you’re into Big Data, you can leave] that I need to read in, or it’s some output that my system has created. The weapon is really plumbing between different unix commands that read from stdin and output to stdout. Unfortunately you can’t hurt anyone with this pipe.

Examples are good, I hear

Download this impatient_data.csv that I got from data.gov if you want to follow along. We’re gonna do some cool stuff with it. This is basically data that describes each patient’s diagnosis, along with the hospital discharge data – most importantly, the location and the money.

Here are the columns of the csv file, brought to you by head :

bash-3.2$ head -n 1 impatient_data.csv DRG Definition,Provider Id,Provider Name,Provider Street Address,Provider City,Provider State,Provider Zip Code,Hospital Referral Region Description, Total Discharges , Average Covered Charges , Average Total Payments ,Average Medicare Payments

Let’s see how many lines are in the file.

bash-3.2$ wc -l impatient_data.csv 163066 impatient_data.csv

Ha! 163,000 lines! Good luck doing anything in Excel.

Let’s say I want to know how many of these charges there are in Jersey, my home state, the best state. And then I want to know what are the unique diagnoses DRG Definition) in Jersey.

bash-3.2$ grep ',NJ,' impatient_data.csv | wc -l 4826 bash-3.2$ grep ',NJ,' impatient_data.csv | cut -d ',' -f 1 | sort | uniq "280 - ACUTE MYOCARDIAL INFARCTION " 281 - ACUTE MYOCARDIAL INFARCTION "282 - ACUTE MYOCARDIAL INFARCTION " 286 - CIRCULATORY DISORDERS EXCEPT AMI "287 - CIRCULATORY DISORDERS EXCEPT AMI " 391 - ESOPHAGITIS "392 - ESOPHAGITIS " 563 - FX .... 97 - ALCOHOL/DRUG ABUSE OR DEPENDENCE W/O REHABILITATION THERAPY W/O MCC 917 - POISONING & TOXIC EFFECTS OF DRUGS W MCC 918 - POISONING & TOXIC EFFECTS OF DRUGS W/O MCC 948 - SIGNS & SYMPTOMS W/O MCC bash-3.2 $

Whoa ho ho there! Lemme break that down

grep ',NJ,' impatient_data.csv

I used grep to filter for the charges in New Jersey. In fact, I padded ‘NJ’ with commas in my command to filter out any extraneous ‘NJ’s that showed up in other columns. There’s about 5000 data points from the Dirty Jersey.

cut -d',' -f 1

I used cut to cut out a single column or field from the output of the grep command. Specifically, the first field of the output, which is the diagnosis description.

sort | uniq

Well, I wanted a list of all the unique charges that happened so I’m gonna use uniq obviously. But first you have to sort.

So together that becomes:

grep ',NJ,' impatient_data.csv | cut -d',' -f 1 | sort | uniq

This gives you all the unique Diagnosis-related groups (DRG).

Pro tip: use less to verify the output of your piped commands

What if we want to figure out the top 10 most frequently occuring Diagnoses in Joisey? Well, that’s easy.

$ grep ',NJ,' impatient_data.csv | cut -d ',' -f 1 | cut -d '-' -f 2 | sort | uniq -c | sort -nr | head -n 10 166 ACUTE MYOCARDIAL INFARCTION 119 MISC DISORDERS OF NUTRITION 116 ESOPHAGITIS 93 CIRCULATORY DISORDERS EXCEPT AMI 64 SIMPLE PNEUMONIA & PLEURISY W CC 64 SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC 64 HEART FAILURE & SHOCK W MCC 63 SYNCOPE & COLLAPSE 63 KIDNEY & URINARY TRACT INFECTIONS W/O MCC 63 HEART FAILURE & SHOCK W/O CC/MCC

Interesting… Acute myocardial infarctions happen a LOT. I’ve never heard of them, but a quick Google search shows that they’re in fact Heart Attacks. What’s more surprising are all these nutrition disorders. Hmph. I should really eat my vegetables. Let’s move on to a extremely personal question of mine..

Should I retire in Florida or not?

As a 25 year old man, this question begs to be answered. Let’s find the average Total costs of hospital charges in New York and Florida.

+Rohans-MacBook-Pro:public rohan $ grep ',NY,' impatient_data.csv | cut -d ',' -f 11 | cut -c 2-| awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' +13567.2 +Rohans-MacBook-Pro:public rohan $ grep ',FL,' impatient_data.csv | cut -d ',' -f 11 | cut -c 2-| awk '{ sum += $1; n++ } END { if (n > 0) print sum / n; }' +11716.5

Well, I save an average of $1800 on Heart Failure related conditions, but then again, I’ll be in Florida. That’s a tough one.

I hope you learned as much about Unix piping as I did from this Health data set. Thanks Obama.