
This guest post is the second article about using python tools to automate a report generation pipeline. If you have not read the first post, you should probably review it before continuing with this article.

All of the code for this article is available on GitHub.

Note: This code was written in python 3.7. You might have to adapt the code for older versions of python.

Welcome to part 2 of this two-part series about automating report generation using python, jupyter, papermill, and a couple of other tools.

In the first part, we covered the four main processes that are part of the automation pipeline. In this second and final part, we will bring everything together and build our report automation system.

A workflow to automatically generate reports in a shared cloud folder

Let’s imagine you want to generate automatic reports for every excel sales file that shares the same structure, and you also want to share them with your colleagues. Your colleagues are interested in the reports, but not in learning how to program in python. How would you proceed?

There are a lot of options, and hardly any incorrect ones, but one I found particularly interesting was using what a lot of people and companies already use: a cloud folder (Google Drive, OneDrive, Dropbox).

Cloud folders (particularly shared ones) are very popular in companies. So a good idea would be to create a shared folder where everyone can upload excel sales reports, and automatically generate HTML reports from them, so everyone can read them!

Here is the basic architecture of the solution:

Let’s describe each one of the steps:

A user uploads a new excel sales report to a shared cloud folder.

We sync the cloud folder with a local folder and detect a new excel sales report.

We use papermill to generate a new notebook file from that new excel sales report.

We use nbconvert to generate an HTML file from that new notebook file.

We upload the HTML file to the cloud folder, so the user can read it.

Let’s start building this step by step:

1. Sync a cloud folder with a local folder and detect new files

To sync cloud directories with local directories, we will use a tool called Rclone. Of course, we will integrate it with python.

Start by installing rclone on the same machine as your local folder (your personal computer or a virtual private server, for example). To do so on a Mac or Linux machine, run:

$ curl https://rclone.org/install.sh | sudo bash

On Windows, download the executable from the Rclone downloads page.

Once rclone is installed, we must configure it. Depending on your cloud provider (Dropbox, Google Drive, OneDrive), the instructions will vary, so make sure to follow the configuration instructions.

Once configured, let us do a first sync from the command line:

$ rclone sync remote:REMOTE_FOLDER_NAME LOCAL_FOLDER_NAME

This will sync your local folder with your remote folder.

We can also launch this command from a python script using the core subprocess library, which allows us to run command-line programs from python:

import subprocess

# define our variables
REMOTE_FOLDER_NAME = "shared folder"
LOCAL_FOLDER = "local folder"

# run the rclone sync command from python
subprocess.run(
    ["rclone", "sync", f"remote:{REMOTE_FOLDER_NAME}", LOCAL_FOLDER]
)

Now that we know how to sync a local and a cloud directory, how do we detect whether a user has uploaded a new file to our cloud directory? An option would be to navigate to our local directory and use the ls command to see what pops out. Rclone also allows us to list the files in our cloud directory. Having this, we can create a python function that detects new files that have been uploaded to the cloud folder:

def get_new_files(remote_folder, local_folder):
    """
    A function that returns files that were uploaded to the
    cloud folder and do not exist in our local folder.
    """
    # list the files in our cloud folder
    list_cloud = subprocess.run(
        ["rclone", "lsf", f"remote:{remote_folder}"],
        capture_output=True,
        text=True,
    )

    # transform the command output into a list
    cloud_directories = list_cloud.stdout.split("\n")[0:-1]
    print(f"In the cloud we have: \n{cloud_directories}")

    # list the files in our local folder
    list_local = subprocess.run(
        ["ls", local_folder],
        capture_output=True,
        text=True,
    )

    # transform the command output into a list
    local_directories = list_local.stdout.split("\n")[0:-1]
    print(f"In the local copy we have: \n{local_directories}")

    # create a list with the differences between the two lists above
    new_files = list(set(cloud_directories) - set(local_directories))

    return new_files

A couple of notes about the script above:

The capture_output flag in the subprocess.run function allows us to capture the output of the command.

The text flag allows us to treat everything as text, avoiding problems with spaces, for example.

After running subprocess.run, we apply the .split function to parse the output, which is a string of different files separated by line breaks (\n). This split function allows us to put all the elements into a nicely formatted python list.

The new_files list will contain only files that are in the cloud directory but not in the local directory, or in other words: the excel files that users have uploaded to the cloud drive. In case there are no differences, the function will return an empty list.
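Before wiring this into the pipeline, we can give the function a quick try. Here is a minimal usage sketch, reusing the placeholder folder names from the sync snippet above (they are examples, not required names):

# quick check: which reports are in the cloud but not local yet?
new_files = get_new_files("shared folder", "local folder")

if new_files:
    print(f"New reports to process: {new_files}")
else:
    print("No new reports found")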

2. Using Papermill and Nbconvert to generate new reports

Once we have a reliable way of detecting whether new files were uploaded to the cloud, we need to process each new file and generate an HTML report from it. We will use two of the tools mentioned in the first article: papermill and nbconvert.

We start by creating a function that will produce a new notebook file based on an excel report, using, of course, a notebook template (for example template.ipynb) as previously described in part 1.

import papermill as pm

def run_notebook(excel_report, notebook_template):
    # take only the name of the file, and ignore the .xlsx ending
    no_extension_name = excel_report.split(".")[0]

    # run with papermill
    pm.execute_notebook(
        notebook_template,
        f"{no_extension_name}.ipynb",
        parameters=dict(filename=excel_report),
    )

    return no_extension_name

Then, we must convert the notebook to an HTML file. To do this, we create another function that calls the nbconvert command from the python interpreter.

import subprocess

def generate_html_report(notebook_file):
    # render the executed notebook as an HTML file
    generate = subprocess.run(
        [
            "jupyter",
            "nbconvert",
            notebook_file,
            "--to=html",
        ]
    )
    print("HTML Report was generated")
    return True

This function runs the nbconvert command, previously described at the beginning of the article, from a python script.
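Putting the two functions together, processing a single uploaded report could look like the following sketch (the excel filename here is a hypothetical example):

# execute the template notebook against one uploaded report ...
notebook_name = run_notebook("sales_march.xlsx", "template.ipynb")

# ... and render the resulting notebook as an HTML report
generate_html_report(f"{notebook_name}.ipynb")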

3. Uploading an HTML file back to the cloud folder

There is another Rclone command that is pretty handy: if you want to push a file from a local folder to a cloud folder, you can use the following from the command line:

$ rclone copy FILENAME remote:REMOTE_FOLDER_NAME

We could do it from the command line, but why not do it from python? With the subprocess library, it’s pretty straightforward:

import subprocess

def push_to_cloud(remote_folder, html_report):
    # copy the generated HTML report into the shared cloud folder
    push = subprocess.run(
        ["rclone", "copy", html_report, f"remote:{remote_folder}"]
    )
    print("Report Published!!!")
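Finally, the pieces can be tied together into a single script. What follows is a minimal sketch, not a definitive implementation: the 30-second polling interval, the folder and template names, the .xlsx filter, and the use of os.path.join to reach files inside the local folder are all assumptions of this example.

import os
import subprocess
import time

REMOTE_FOLDER_NAME = "shared folder"
LOCAL_FOLDER = "local folder"
TEMPLATE_NOTEBOOK = "template.ipynb"

while True:
    # 1. detect files that are in the cloud but not yet local
    new_files = get_new_files(REMOTE_FOLDER_NAME, LOCAL_FOLDER)

    # 2. pull the new files down into the local folder
    subprocess.run(
        ["rclone", "sync", f"remote:{REMOTE_FOLDER_NAME}", LOCAL_FOLDER]
    )

    for excel_report in new_files:
        # skip anything that is not an excel report (e.g. published HTML)
        if not excel_report.endswith(".xlsx"):
            continue

        # path of the synced excel file (assumes no dots in folder names,
        # since run_notebook keeps everything before the first ".")
        local_path = os.path.join(LOCAL_FOLDER, excel_report)

        # 3. generate an executed notebook and convert it to HTML
        name = run_notebook(local_path, TEMPLATE_NOTEBOOK)
        generate_html_report(f"{name}.ipynb")

        # 4. publish the HTML report back to the shared folder
        push_to_cloud(REMOTE_FOLDER_NAME, f"{name}.html")

    # wait before checking the cloud folder again
    time.sleep(30)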