Share Pin 0 Shares

In which an experienced analytics guy advises the younger generation to leave the walled garden of enterprise analytics tools and learn how to write code using a real programming language. Specifically advocating the use of R and Python for data analysis and related programming. But hey, I’m flexible on that point…

The use of COBOL cripples the mind; its teaching should, therefore, be regarded as a criminal offense. – Dijkstra

I was taught a long time ago in some Management 101 course to sandwich constructive criticism between two compliments. So I’ll open with this statement:

SAS and the other BI vendors have done a nice job of bringing statistical computing techniques within the reach of the typical college graduate.

Now pull up a chair and grab yourself some popcorn, since I’m going to bite the hand that fed me for the first half of my career. I spent the first seven years of my career in roles involving significant usage of SAS and a variety of drag & drop query tools. The COBOL of the analytics world.

I spent the early part of my career hacking on credit card direct mail campaign data. SAS was the defacto standard in the banking industry for marketing and risk analytics. Many of us learned it in school, the balance of us learned it on the job. We were also given a drag & drop query/reporting tool (Business Objects) which allowed us to streamline and automate some of the smaller ad-hoc requests.

From a business perspective, this was pretty awesome. Most non-coding analysts could generally go from clueless to relatively useful in a month or two. I was able to free up about 80% of my time as a junior analyst by building some automated reports using SAS and Business Objects – which allowed me to pursue more ambitious projects.

But there’s a gotcha here… what happens when you leave?

You Can’t Take It With You…

Employment for life is a nice little concept but effectively gone in America. If you’re in your twenties, you should expect to work for at least half-a-dozen employers during your career even if you are a relatively loyal sort. Contractors and adventurous types should expect to rack up much bigger numbers. At some point you will be sitting down figuring out how to build your next batch of analytics infrastructure. For many of you, this will likely involve hanging out your own shingle in analytics or a related discipline..

So here’s the first problem – you can expect to throw away a big chunk of your language specific experience when you change jobs. Enterprise BI packages aren’t cheap or very easy to implement. Your next employer will likely have a different standard and you’re probably going to have to relearn the implementation details of your statistical package and data slinging utilities. The novelty of this wears off rather quickly…

The issue isn’t just money by the way. A colleague of mine built the analytics program for the direct marketing arm of a big retailer. His first year was spent implementing Microstrategy. He got a head start on things because the money was already approved. Securing funding could easily add a year or two to your expected timeline…

Want to go out own your own? I did some checking a couple of years ago when I had to set up my own analytics group at my current employer. A one-person base SAS license cost about ten grand. Plus a couple thousand per year of ongoing fees. Even the simple tools (Minitab, Crystal Ball) had commercial use license costs of about $1000 – $2000. So leveraging your enterprise analytics experience on your own adds an immediate $5K to $15K fixed expense to whatever number you consider “ramen profitable”. Ouch!

No problem, you say. Programming is Programming. I’ll just go learn a new language…

“So You Think You Can Code..”

My standard job interview for new analysts includes a couple of simple coding tests. Nothing too complicated – since most of our work involves using relational databases, the questions are SQL oriented. The core question is the SQL equivalent of fizz-buzz, generally followed by a discussion of data architecture or query performance drivers. Most decent developers with SQL experience can rip through this in about 3 minutes.

About 60% of our decent looking candidates without non-SQL coding experience fail. After being warned on the phone screen and in the job description that a whiteboard SQL coding test is coming. And I don’t mean syntactical nit-picks and other forms of nerd sadism. I mean facepalm-level failure to generate anything close to executable SQL or identify meaningful abstractions which allow you to scale up your activities.

These are smart people. The root cause is simple. They’ve never had to do it before…

The most insidious thing about SAS and the drag and drop query tools is that you never really get to learn what is happening under the hood. SQL is generated auto-magically. There is no real path to the lower layers of the application (in the unlikely event your vendor includes an extension option, your local IT administrator probably shut it off). Welcome to the walled garden – you can do *this list of things* really well and absolutely nothing else is tolerated or supported by the system.

From a personal development perspective, it’s like over-feeding and declawing a house cat. They slowly lose the ability to hunt on their own and survive outside…

I discovered this when I returned to writing code a few years ago. Despite having about ten years of experience using the common BI / Analytics packages, I had to relearn a big chunk of what I thought I knew.

Life on The Other Side of The Window…

As the title implies, I’m recommending that analysts should learn how to use R and Python for data analysis. There’s actually some value in learning both.

R has a huge assortment of pre-packaged statistics programs which replicate many of the useful parts of the base SAS proc library.

I’ve found that Python does a better job of handling large datasets and building code to handle ugly parsing and unpacking tasks.

Learning Python also gives you an easy path to turn your analytics projects into real sharable software applications. There are a huge number of web frameworks (my personal favorite is bottle.py, many others exist). There are also many options for packaging your work up into a GUI application.

Another nice win? Both of them are free and can be implemented in under 15 minutes, increasing your ability to use them throughout your career. Within a corporate role, you can usually find a specialized application or project type where using Python for data analysis will produce better results than your existing BI suite. It’s hard for a sane manager to object to: BETTER + FREE. In the event you change jobs, you can be immediately productive using either tool (vs. spending several weeks/months learning a new BI package, assuming they actually bothered to do a decent job of collecting requirements from their analysts). And for your personal projects – lets just say a $10,000 license is a show stopper in my household…

Furthermore, the more time you spend coding in the same package, the better you get. Python has an awesome standard library with tested modules to handle common tasks. Numpy / SciPy and Rcran offer an equally diverse array of pre-built code. Better yet, unlike the enterprise analytics world, you’ve actually got the ability to see and tweak the code inside the box. Using, reading, and discussing Python makes you a smarter coder by exposing you to a more advanced set of programming concepts. Second career development tip: find your local Python meetup and go to a meeting or two…

Leaving Eden: Using R and Python for Data Analysis

You can take the first steps out of the walled garden at no cost but a few hours of your own time. There are many free online tutorials and books about using R and Python. Check out Project Euler (collection of math problems you can solve with short programs) or the Python Challenge. Check out Python’s list comprehensions (link to a presentation I did; list comprehensions are one of my favorite Python features). Spend some time looking at the itertools module – this has several tools you can use to slice and dice data.

Once you get a grasp of how to use R and Python for data analysis, you should spend a little time exploring how databases work. Start by learning SQL, then go check out MongoDB, Hadoop, and neo4j. Learning basic web page development (HTML, CSS, Javascript) is also helpful. Then spend time building something – anything – and sharing it with some end users…

Like a house cat that has finally snuck outside, there’s a whole new world out there…

If you like this article, please share it!

We wrote a Word Game Solver site for those who like to unscramble words and solve hangman puzzles. We recently added a cryptograms game to the site.