One thing that has really helped me over the years as an experimentalist is using regular expressions to sort and parse datafiles. If you aren’t using them in the lab to process data, you really should be!

As an experimentalist, you’re going to be generating tons and tons of data. Most of it will probably be directly created by a computer: images from cameras, traces from oscilloscopes, etc. But how do you label it? I’ve seen a lot of different methods, and I’ve used a fair number myself.

I used to just label data with a key, like “r4”, and have a table in the lab notebook to look up details about the datafile. Over the years, I’ve moved to trying to have every useful piece of information actually encoded into the file’s name. This is because, with modern computing and data analysis, you have no idea where that file is going to end up, so you want it as self-contained as possible (also, nothing is worse than trying to find out more details about some enigmatically labeled datafile three years after it was taken). It is also what gives regular expressions something to work with.

Because of the way I do things, my file names are Frankensteinian monstrosities, like the following, where I’ve encoded the temperature, frequency, peak-to-peak voltage, and two different gain values:

regex?

Regular expressions are a way of identifying and capturing patterns in strings. They are useful, because if I have a list of files like the one above, I can programmatically extract the temperature (or other useful piece of information) and filter on it.
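
Here is a minimal sketch of that idea: pull a temperature token out of each name and keep only the files taken at one temperature. The file names are invented for illustration (they are not my real files), the helper `temperature_token` is hypothetical, and the pattern assumes the temperature is written as a capital T followed by digits, with a ‘d’ standing in for the decimal point.

```python
import re

# Hypothetical file names, invented for this example.
file_names = [
    "T4d2_f100_run1.dat",
    "T4d2_f200_run2.dat",
    "T10d5_f100_run3.dat",
    "notes.dat",  # no temperature in the name
]

# Capital T, at least one digit, an optional 'd', then any further digits.
temp_pattern = re.compile(r"T(\d+d?\d*)")


def temperature_token(name):
    """Return the raw temperature token (e.g. '4d2'), or None if absent."""
    match = temp_pattern.search(name)
    return match[1] if match else None


# Keep only the files whose name encodes the temperature token '4d2'.
at_4d2 = [f for f in file_names if temperature_token(f) == "4d2"]
print(at_4d2)
```

The comparison here is on the raw string token, which is enough for picking out one set of runs.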

I’m going to go over three workflows that I use regularly: two in Python (re and pandas) and one in bash.

re (python)

I use this workflow when I’m just doing something quickly, like plotting every one of these pieces of data, labeled by temperature.

import numpy as np
import glob
import re
import matplotlib.pyplot as plt

# Find every .dat file in the current directory and load each one into an array
fileNames = glob.glob('*.dat')
data = [np.loadtxt(f, unpack=True) for f in fileNames]

Now, I have a list of numpy arrays. To extract the temperature, we use the python re (regular expressions) module.

tempSearch = re.compile(r".*T(\d*d?\d*)")
temps = [tempSearch.search(t)[1] for t in fileNames]

There are better resources for learning regular expressions, but I’ll translate what that weird string is. It is saying, “look for the capital letter T at any part of the string, then, to the right of ‘T’, there should be a series of digits, maybe with a letter ‘d’, followed by more digits.”

The ‘.*’ is two special characters. The ‘.’ stands for any character, and the ‘*’ stands for ‘repeat whatever came just before it zero or more times’. So, by prepending this to our search, we allow anything at all to come before the ‘T’, and the pattern can sit anywhere in the string.
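
To see what the ‘.*’ prefix actually changes, here is a small sketch comparing the pattern with and without it, using both .match() (which only tries at the start of the string) and .search() (which scans the whole string). The file name is made up for the example.

```python
import re

name = "run7_T4d2_f100.dat"  # hypothetical file name

bare = re.compile(r"T(\d*d?\d*)")
prefixed = re.compile(r".*T(\d*d?\d*)")

# match() anchors at the start of the string, so the bare pattern fails
# because the name does not begin with 'T'.
bare_match = bare.match(name)
print(bare_match)  # None

# With '.*' in front, match() can skip over the leading characters.
prefixed_match = prefixed.match(name)[1]
print(prefixed_match)  # 4d2

# search() scans the whole string, so it works with or without the prefix.
bare_search = bare.search(name)[1]
prefixed_search = prefixed.search(name)[1]
print(bare_search, prefixed_search)  # 4d2 4d2
```

One thing to watch: ‘.*’ is greedy, so if a name contains more than one capital T, the prefixed pattern latches onto the last one, while the bare pattern with .search() finds the first.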

I’ll get back to the parentheses, but the ‘\d’ stands for any digit, and the ‘*’ after it allows any number of them. It is followed by a ‘d?’, because if you look at my file list, I use a d instead of a decimal point. The question mark says, ‘the character to my left may not be there, but check’.
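
Since the captured text uses a ‘d’ for the decimal point, it needs one more step before you can treat it as a number. The sketch below is one way to do that, not necessarily the only one. The function name `parse_temperature` and the file names are hypothetical, and I have swapped in a slightly stricter pattern that insists on at least one digit after the T, so a stray capital T elsewhere in a name is not mistaken for a temperature.

```python
import re

# Stricter variant of the pattern (an assumption for this sketch):
# capital T, one or more digits, then optionally 'd' and more digits.
TEMP_PATTERN = re.compile(r"T(\d+(?:d\d+)?)")


def parse_temperature(name):
    """Return the temperature encoded in a file name as a float, or None."""
    match = TEMP_PATTERN.search(name)
    if match is None:
        return None
    # Swap the 'd' placeholder back to a decimal point before converting.
    return float(match[1].replace("d", "."))


print(parse_temperature("T4d2_f100.dat"))   # 4.2
print(parse_temperature("T300_f100.dat"))   # 300.0
print(parse_temperature("background.dat"))  # None
```

With real numbers in hand, sorting or filtering by temperature is an ordinary comparison instead of string matching.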

The parentheses denote a capturing group. We are interested in extracting the temperature from that big filename string, so by surrounding that part of the pattern with parentheses, if we get a hit, the match object returned by .search() holds the captured text at index 1 (index 0 is the entire match).
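
Here is a short sketch of what comes back from .search(), using the same pattern as above on made-up file names. It also shows the case where nothing matches: .search() returns None, so indexing it directly raises a TypeError, and a guard is worth adding if your folder might contain files that don’t follow the naming scheme.

```python
import re

temp_search = re.compile(r".*T(\d*d?\d*)")

# Hypothetical file names for illustration.
names = ["run7_T4d2_f100.dat", "readme.txt"]

match = temp_search.search(names[0])
whole = match[0]           # everything the pattern matched
captured = match[1]        # just the part inside the parentheses
same = match.group(1)      # equivalent, older spelling of match[1]
print(whole, captured, same)  # run7_T4d2 4d2 4d2

# No capital T in this name, so there is no match object at all.
missing = temp_search.search(names[1])
print(missing)  # None

# Guarded version of the list comprehension: skip names that don't match.
temps = [m[1] for m in map(temp_search.search, names) if m]
print(temps)  # ['4d2']
```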

I’ve screencasted what it looks like, so you can follow along (bonus points if you notice where I screw up).

This results in the following graph (I did clean it up a bit after)…

I ran out of time today, so I’ll cover doing more heavy lifting in pandas next time!