Over time you have probably developed a set of python scripts that you use on a frequent basis to make your daily work more effective. However, as you start to collect a bunch of python files, the time you take to manage them can increase greatly. Your once-simple development environment can become an unmanageable mess, especially if you do not try to have some consistency and common patterns for your development process. This article will discuss some best practices for managing your python code base so that you can sustain and maintain it over the years without pulling your hair out in the process.

I am targeting this article toward a certain problem domain. Many of the points apply universally, but I am generally going to be talking about situations where:

Because python is so expressive, you can do some very complex activities in a very small number of lines of code. In my particular case, I have been using pandas for a while now and have developed a nice library of scripts I can use to manipulate the data I work with on a daily basis. As you start to develop your own repository, you will find that you end up with dozens of scripts that work great. However, if you use them infrequently, maintenance starts to consume more and more of your time.

Before I go into specifics, let me give one example that happened just a week ago. I think it illustrates my point well.

I received a request to produce a summary report of some data. It was an exploratory data request for some sales information over a period of time, and I had a good idea of how to pull it together (including other scripts that did many of the actions I needed). I figured it would take me 10-20 minutes of Excel manipulation to get the report. I also knew that I could put in about 1 hour and have a python script to pull the data and output it into an Excel file. What to do?

Fast forward a couple of days to when I was discussing the report. The group had some good ideas for how to modify it. For instance, we wanted to look at the prior 6 months of sales instead of 12 months. We also wanted to look at units instead of revenue.

I looked at my script and, in a total of less than 5 minutes, I made all those changes and re-ran it. The new output was finished in a fraction of the time it would have taken me to manipulate it in Excel. I also know there will be more changes and that it is super easy to re-run the script if I need to. The extra time and attention I spent in the beginning will save me far more time in the long run.

This is one small example, but once you start rolling, I am confident you will have many of your own. I am hopeful that these guidelines will be applicable to your situations as well.

I have tried to apply these ideas to my internal projects and have had good success. However, nothing is perfect, so I'm interested to see what others say.

General Guidelines

One of the biggest pieces of advice I have is to treat your code like an open source project. I do not mean that you release all your code to the world but that you should use the best practices and conventions in the open source world to make your repository more manageable. Always think about how you would hand this code off to someone else in your organization.

Use Version Control

Whatever you do, make sure you check the code into a version control system. Even if you think you'll be the only person using it and that the code won't grow and change that much - do it. There is no excuse.

If you have an internal system, use it. If not, then look for alternatives. I know that most people use github, but bitbucket is actually a little more attractive if you want to use it for private repos: you can have unlimited private repos and up to 5 users in the free plan. The one caveat I would have is to make sure you understand your company's policy on this type of hosting. Some companies may have restrictions on using external version control systems. I will leave it to you to navigate that.

One word of caution I would give you is to keep any confidential data stored locally and not in an external repo. Your code is going to be mostly useless without the data files, so I would feel most comfortable with that approach. Please make sure you understand your organization's policies.

Even if you find yourself unable to host code externally, you can still set up a local mercurial repo. The other really cool thing is that you can use hg serve to run a local webserver that allows you to browse your repo and view changesets. This is a really useful feature for an individual developer.

Once you have a repo set up, you should start to manage the code in the repo like you would an open source project. What does that mean?

Document the code with a README file. If you create a README, this has the advantage of giving you a nice summary of what is in the repo. In my README, I include certain items like:

Overview of the required python versions and how to get them.

Description of major packages (Pandas, SQL Alchemy, etc)

Description of each file, including working files, log files, configs.

Notes on upgrading the environment or configuring certain items.

What the directories are used for. As mentioned above, I don’t keep the external files in a remote repo but do want to keep a record of the various directories I use and how I get the files.

Notes on when specific files need to be run (daily, monthly, etc).

Reminders to yourself about how to update packages or any dependencies.

Have good commit notes. It's so easy to put in commit notes like "Minor formatting changes" or "Fixes for Joe". However, those notes will not help you when you're trying to figure out why you made a change many months ago. This post is a good summary of what your commits should look like.

Consider using the ticket functionality. If your code is in bitbucket or github, then you get ticket functionality for free. Go ahead and use it. Anything you can do to help pull your thoughts and history together in one place is a good idea. If you use tickets, make sure to reference them in your commit notes.

Document Your Code

Even when you have only a few lines of code in a file, you should still make sure to follow good coding practices. One of the most important is good documentation. For the particular class of problems we are solving, I want to cover a couple of specific methods that have worked well for me.

I like to include a couple of items in the docstring header of my file that look something like this:

```python
# -*- coding: utf-8 -*-
"""
Created on Tue Jun 30 11:12:50 2015

Generate 12 month sales trends of Product X, Y and Z
Report was requested by Jane Smith in marketing
Expect this to be an ad-hoc report to support new product launch in Q3 2015

Source data is from SAP ad-hoc report generated by Fred Jones
Output is summarized Excel report
"""
```

In the example above, I include a creation date as well as a summary of what the script is for. I also find it incredibly useful to include who is asking for it, and some idea of whether this is a one-off request or something I intend to run frequently. Finally, I include descriptions of any input and output files. If I am working on a file that someone gave to me, I need to make sure I understand how to get it again.

In addition to summary info, I want to give a couple of specific examples of inline code comments. For instance, if you have any code that you are using based on a stack overflow answer or blog post, I recommend you provide a link back to the original answer or post. In one particular case, I wanted to merge two dictionaries together. Since I was not sure of the best approach, I searched the web and found a detailed stack overflow answer. Here is what my comment looked like:

```python
# Create one dict of all promo codes
# We know keys are unique so this is ok approach
# http://stackoverflow.com/questions/38987/how-can-i-merge-two-python-dictionaries-in-a-single-expression
all_promos = sold_to.copy()
all_promos.update(regional)
```

Another important item to comment is the business rationale behind certain assumptions.
For instance, the following code is straightforward pandas and would not warrant a comment, except for understanding why we are choosing the number 3:

```python
# Also filter out any accounts with less than 3 units.
# These accounts are just noise for this analysis.
# These are typically smaller accounts with no consistent business
all_data = all_data[all_data["Volume"] >= 3]
```

Code Style

Fortunately, python has many tools to help you enforce the style of your code. If you want to read lots of opinions, there is a reddit discussion on the options. I personally think that pyflakes is useful for the style of coding we're discussing, but the actual choice matters less than the fact that you do make a choice. I encourage you to use an editor that has some sort of integration with one of these tools. I find that it helps me keep my spacing consistent and catch imported but unused modules. It won't guarantee bug-free code, but the consistency really helps when you look at code that is several months or years old.

I also encourage you to read and follow the Python Code Style Guide. It contains a bunch of useful examples of best practices in python coding. You should refer to it often and try to incorporate these guidelines in your code, no matter how small the script.
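To make that concrete, here is a tiny, purely hypothetical function written with the conventions a style checker and the style guide encourage - a descriptive lowercase_with_underscores name, consistent spacing, and a docstring:

```python
def summarize_volume(volumes, threshold=3):
    """Return the total of the volumes at or above threshold.

    A descriptive name, spaces after commas, and a docstring make
    the intent clear even when you revisit the code a year later.
    """
    return sum(v for v in volumes if v >= threshold)
```

For example, `summarize_volume([1, 3, 5])` returns 8, since the 1 falls below the threshold.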

Managing Inputs and Outputs

Many of these scripts will have multiple input and output files. I try to keep all of the files in one input directory and one output directory. I also include a date (and sometimes time) stamp in my file names so that I can run the scripts multiple times and have some record of the old outputs. If you need to run them multiple times per day, you need to include the time as well as the date.

Here is a code snippet I frequently use in my scripts:

```python
import os
from datetime import date

import pandas as pd

# Data files are stored relative to the current script
INCOMING = os.path.join(os.getcwd(), "in")
OUTGOING = os.path.join(os.getcwd(), "out")

default_file_name = "Marketing-Launch-Report-{:%m-%d-%Y}.xlsx".format(date.today())
save_file = os.path.join(OUTGOING, default_file_name)
input_file = os.path.join(INCOMING, "inputfile.xlsx")

df = pd.read_excel(input_file)
# Do more stuff with pandas here

# Save the data to excel by creating a writer so that we can easily add
# multiple sheets
writer = pd.ExcelWriter(save_file)
df.to_excel(writer)
writer.save()
```

In addition to this code, here are a couple of recommendations on managing your input files:

Try not to make any alterations to the file by hand. Keep it as close to the original as possible.

Don’t delete old input files; move them to an archive directory so you don’t lose them.
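The recommendations above can be sketched in a couple of small helpers. This is a minimal sketch, not code from the report script itself; the directory names and report prefix are hypothetical, following the pattern of the earlier snippet:

```python
import os
import shutil
from datetime import datetime

def timestamped_name(prefix="Marketing-Launch-Report"):
    # Include the time as well as the date so the script can be
    # re-run several times in one day without overwriting output
    return "{}-{:%m-%d-%Y_%H%M%S}.xlsx".format(prefix, datetime.now())

def archive_inputs(incoming, archive):
    # Move processed input files into an archive directory instead
    # of deleting them, so the originals are never lost
    os.makedirs(archive, exist_ok=True)
    for name in os.listdir(incoming):
        shutil.move(os.path.join(incoming, name), os.path.join(archive, name))
```

Calling `archive_inputs("in", "archive")` after a successful run keeps the input directory clean while preserving every original file.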