I’m a big believer in open source tools. It wasn’t always this way. I did my undergraduate work using Mathematica or Maple, with a smattering of Excel (not to mention pen and paper). My graduate work was primarily in Matlab. My exposure to open source came from one class where the professor insisted we use Octave, an unfortunate foray into Fortran for a particular project, and some dabbling in Python, which at the time was quite immature as a scientific computing platform and very difficult to get up and running.



I was pretty happy with my toolchain: Matlab is pretty fast and seemed to do a lot of things out of the box. I even figured out ways to massively parallelize it, running Map/Reduce-type algorithms for some of the more intensive tasks. It wasn’t until I moved into the commercial world that I started to see the drawbacks. I was working for a startup at the time, and they bought me a Matlab license, since that’s what I knew how to use. I nearly had a coronary when I saw the price tag: over 10x what my lab had been spending on the exact same product. On top of that, it was missing most of the core libraries I needed to do my actual work. When I saw the price tag of those, I resolved to find another way. I was working with some statisticians at the time, and they pointed me in the direction of R. It took me a bit to get used to R’s peculiar syntax, which offended my C-trained sensibilities, but once I did, the experience was magical. It had everything I needed and, most importantly, was FREE. As in, it did not cost my company or me a penny.



And this was important not just because I’m a bit cheap. I couldn’t share my Matlab code with anyone who didn’t have a Matlab license. I couldn’t run my Matlab code on any machine I hadn’t specifically set up for it. I couldn’t use it for hobby projects, since my company owned the license, which meant I couldn’t use it to learn. R, and later Python (which I migrated to for separate reasons), allowed me all of these things and more. I could sample and test multiple implementations of the same algorithm. I could get help easily when I had a problem, just through Google (I found the Matlab product forums to be worse than useless). And the open source platforms made incredibly fast progress, thanks to the thriving communities around them. It turned out that, because R and Python were free, TONS OF PEOPLE WERE USING THEM. I summarize it as: open source is a democratized toolchain. I took full advantage and I’ve never looked back.



By now it’s clear that this is the way the world is going: R and Python have both overtaken SAS in job listings. If you’re a data scientist still on proprietary software, you should have switched yesterday, but there is no time like the present.

Resources

(You can find these resources, and more, by searching worldlybayes.com.)



Install scientific Python on OS X from scratch

http://jeetworks.org/setting-up-a-python-scientific-environment-numpy-scipy-pandas-statsmodels-etc-in-os-x-10-9-mavericks/

The Jupyter Notebook: a great tool for Python (and R)

http://jupyter.readthedocs.org/en/latest/

Lots of people use Anaconda. I prefer not to, because I like to have tighter control over my machine setup, but that shouldn’t stop you.

https://www.continuum.io/downloads

Installing R on OS X

http://www.r-bloggers.com/installing-r-on-os-x/