by @jehiah on 2010-10-20 19:00. Filed under: All

Data Hacks is a new library we have developed at bit.ly: a set of command line tools to assist in data analysis.

We love the beauty of command line tools that read from stdin and write to stdout. These utilities work that way, and they help explore large data sets.

Included are a tool to calculate 95th percentile values, a histogram display, a tool to sample a percentage of stdin, and a tool to pass stdin to stdout for a set time period.
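To give a feel for what filters like these do, here is a minimal Python sketch of a sampling step and a 95th percentile calculation. It is not taken from the data_hacks source: the names `sample_lines` and `percentile_95` are hypothetical, and the nearest-rank percentile method is an assumption made for illustration.

```python
import random


def sample_lines(lines, percent, rng=None):
    """Yield each line with probability percent/100 (hypothetical helper)."""
    rng = rng or random.Random()
    for line in lines:
        # random() is in [0, 1), so percent=100 keeps everything
        # and percent=0 keeps nothing.
        if rng.random() * 100 < percent:
            yield line


def percentile_95(values):
    """Nearest-rank 95th percentile (assumed method, not the library's)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values given")
    # Integer form of ceil(0.95 * n), giving a 1-based rank.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

In a real filter, `lines` would be `sys.stdin` and the results would be written to stdout so the step can sit in the middle of a pipeline.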

For example, you can now run the following on the fly to get a histogram of request response times over a 30-second period. (In my case, awk '{print $NF}' extracts the last column of an access log, which holds the response time.)

$ tail -f access.log | awk '{print $NF}' | run_for.py 30s | sample.py 10% | histogram.py --min=0 --max=1.0 --buckets=20
# NumSamples = 6809; Min = 0.00; Max = 0.05
# 313 values outside of min/max
# Mean = 0.014075; Variance = 0.001441; SD = 0.037954
# each * represents a count of 34
0.0000 - 0.0025 [  404]: ***********
0.0025 - 0.0050 [ 2595]: ****************************************************************************
0.0050 - 0.0075 [ 1099]: ********************************
0.0075 - 0.0100 [ 1056]: *******************************
0.0100 - 0.0125 [  476]: **************
0.0125 - 0.0150 [  403]: ***********
0.0150 - 0.0175 [  122]: ***
0.0175 - 0.0200 [   81]: **
0.0200 - 0.0225 [   37]: *
0.0225 - 0.0250 [   32]:
0.0250 - 0.0275 [   25]:
0.0275 - 0.0300 [   26]:
0.0300 - 0.0325 [    6]:
0.0325 - 0.0350 [   29]:
0.0350 - 0.0375 [   12]:
0.0375 - 0.0400 [   25]:
0.0400 - 0.0425 [   10]:
0.0425 - 0.0450 [   28]:
0.0450 - 0.0475 [   13]:
0.0475 - 0.0500 [   17]:
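To make the histogram output easier to read, here is a rough Python sketch of equal-width bucketing and star scaling. It is an illustration only, not the histogram.py implementation: the function names `bucket_counts` and `render` are hypothetical, and the `bar_width=75` default is an assumption (it happens to yield the "count of 34" scale for a largest bucket of 2595).

```python
def bucket_counts(values, minimum, maximum, buckets):
    """Count values into equal-width buckets and tally out-of-range values."""
    width = (maximum - minimum) / buckets
    counts = [0] * buckets
    skipped = 0
    for value in values:
        if value < minimum or value > maximum:
            skipped += 1
            continue
        # A value exactly equal to maximum lands in the last bucket.
        index = min(int((value - minimum) / width), buckets - 1)
        counts[index] += 1
    return counts, skipped


def render(counts, minimum, maximum, bar_width=75):
    """Return text lines, one per bucket, with a bar of '*' characters."""
    width = (maximum - minimum) / len(counts)
    # Scale so the largest bucket fits in roughly bar_width stars (assumed rule).
    per_star = max(1, max(counts, default=0) // bar_width)
    lines = ["# each * represents a count of %d" % per_star]
    for i, count in enumerate(counts):
        low = minimum + i * width
        bar = "*" * (count // per_star)
        lines.append("%.4f - %.4f [%6d]: %s" % (low, low + width, count, bar))
    return lines
```

Buckets whose count is smaller than the per-star scale render with an empty bar, which is why the sparse tail in a histogram like this can show counts but no stars.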

For more information and examples see http://github.com/bitly/data_hacks

Update 2010/10/20: I've also added a utility to generate an ASCII bar chart.