Doing Ad-hoc Data Analysis the Right Way — If You Use Python or R

In this article I’ll give an overview of what ad-hoc data analysis is, how it’s done, and how its results can be organized with dstack.ai for better collaboration if you use Python or R.

On ad-hoc data analysis

A few decades ago, data analysis was the prerogative of universities and private research groups; today, even non-tech commercial companies are starting to use it at scale to build better products and improve processes. With the rise of technology, data analysis has become both affordable and critical for companies that want to stay competitive.

There are tons of tools on the market for different kinds of analysis, including tools tailored for specific needs as well as tools offered out of the box for a variety of use cases and industries.

Ad-hoc analysis is normally aimed at answering specific situational questions or finding new patterns in the data that might lead to important insights. As the questions are not typical, or the pattern in the data is unknown, the analysis requires human attention and normally involves a data analyst or a data scientist.

In this article, I’ll be talking about the ad-hoc data analysis that is done with programming languages such as Python or R.

Thanks to the availability of tools and libraries for data analysis and manipulation, Python and R are especially popular today across the community of data scientists. Both Python and R offer interactive environments, such as notebooks, for data manipulation, as well as packages for data wrangling, visualization, and training machine learning models. Most of these tools are open source and do not require any special background to use (except perhaps a statistical or mathematical one). All of this makes them very helpful for ad-hoc data analysis.

Typical workflow of ad-hoc data analysis

A typical workflow for ad-hoc data analysis with Python or R consists of the following simple steps:

1. Getting data

An ad-hoc analysis starts with getting the data that has to be analysed. The data can be provided by someone or acquired from external data sources.

2. Exploring data

Once the data is obtained, it usually needs exploration. This step is critical for understanding the nature of the data: its format, its completeness, and even its correctness. Methods of exploring data may include data wrangling, data aggregation, and data visualization.
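In Python, a quick first look at a dataset is often done with pandas. Here is a minimal sketch; the dataset and column names are made up for illustration:

```python
import pandas as pd

# A small made-up dataset standing in for real data
df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "FR", None],
    "sales": [120.0, 95.0, 80.0, None, 60.0],
})

# Shape, column types, and missing values give a first picture of the data
print(df.shape)
print(df.dtypes)
print(df.isna().sum())

# Summary statistics for the numeric columns
print(df["sales"].describe())
```

In a notebook, each of these outputs renders inline, which is what makes interactive exploration so convenient.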

3. Processing data

Often, in order to answer specific questions, the data must first be processed. Once the data is processed (e.g. cleaned, filtered, enriched, or aggregated), it’s a lot easier and faster to work with.
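Continuing the pandas sketch above, a typical processing pass might clean, filter, and aggregate the data. The thresholds and column names are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "FR", "FR"],
    "sales": [120.0, 95.0, 80.0, None, 60.0],
})

# Clean: drop rows with missing values
clean = df.dropna()

# Filter: keep only rows at or above a threshold
filtered = clean[clean["sales"] >= 80]

# Aggregate: total sales per country
totals = filtered.groupby("country")["sales"].sum()
print(totals)
```

The resulting `totals` series is itself a small, prepared dataset that could be shared with colleagues instead of the raw data.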

In general, the steps of exploring and processing data can repeat until the analysis is over.

Most often, both steps are done using interactive tools: in Python, for example, Jupyter notebooks; in R, RStudio.

Sometimes, the processing step can also be run in batch on large datasets, e.g. using scripts.

4. Sharing results

Regardless of the type of data, the tools used to analyse it, and the purpose of the analysis, the end outcome of any data analysis is to get insights and share them with other people. Sometimes the end result is a simple answer, such as a yes, a no, or just a number. Often it is another dataset, a visualization, or even a machine learning model.

Why collaboration matters

When it comes to collaboration with other data scientists, sharing prepared datasets is very important, as it helps colleagues save time on acquiring and processing data and get to insights faster.

Because of how our brains work, visualization is an effective way to present the answer to a complex question to others. In addition to answering specific questions, visualizations can tell stories, e.g. trigger new ideas about the researched topic.

Needless to say, sharing the end result is a critical part of the whole process of ad-hoc data analysis. Ad-hoc data analysis is expensive: it requires data, data scientists, and time. If the results of the analysis are not properly shared with the team, all the time and money spent on the analysis is wasted. Even worse than the wasted time and money is the missed chance to apply the results to improve the product or the processes and stay competitive.

It is also easy to mistake sharing the end results for the end goal. In fact, it is only the first step: sharing ensures an exchange of feedback between the data science team and the clients of the research.

How dstack.ai fits in here

While today there are many tools that help with doing data analysis itself, as well as with tracking data analysis tasks, there are very few tools that help organize collaboration between data scientists and the end clients of the research. In tech companies, data scientists use tools such as notebooks with code and outputs, or custom web applications that they build and host themselves. Non-tech companies, on the other hand, don’t have specific tools and instead rely on email exchanges, file storage, or issue trackers for collaboration.

Because of the complexity or imperfection of these tools, it is often difficult to collaborate effectively, or to find a needed result after some time has passed.

This is where dstack.ai steps in, offering the missing tool to better organize the data analysis process and let teams collaborate more easily and perform meaningful work faster. Most importantly, instead of substituting for an existing tool, dstack.ai offers one that can be used together with other tools.

The dstack.ai tool consists of two parts:

- The dstack package for Python (PyPI) and R (CRAN)
- A web application (https://dstack.ai)

The dstack packages for Python and R offer functions for publishing any data analysis results, which can include both datasets and visualizations.

You can install the dstack package via conda:
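For example (the channel name here is an assumption; check the dstack documentation for the exact command for your setup):

```shell
conda install -c dstack.ai dstack
```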

Or using pip:
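The package is published on PyPI under the name dstack:

```shell
pip install dstack
```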

Once the package is installed, you have to configure it by invoking the dstack command-line tool (installed together with the package):
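The command has roughly the following shape; the exact subcommand and flags may differ between versions of the package, and `<user>` and `<token>` are placeholders for your own credentials:

```shell
dstack config --user <user> --token <token>
```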

The user and token are obtained by signing up at https://dstack.ai. The token is used for authorization and ensures secure access to your published data.

In order to publish a dataset or a visualization, you only need to choose a name (a stack name) and pass a pandas dataframe or a matplotlib, Bokeh, or Plotly figure.

Here’s an example of publishing a pandas dataframe:
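A minimal sketch follows. The dataset is made up, and the `push_frame(stack_name, obj, description)` call is assumed from the dstack 0.x Python API, so check the package documentation for your version; the call is guarded so the snippet still runs where dstack isn’t installed or configured:

```python
import pandas as pd

# Made-up dataset to publish
df = pd.DataFrame({
    "country": ["DE", "FR", "US"],
    "sales": [215.0, 140.0, 320.0],
})

# Publishing requires the dstack package installed and a configured profile.
# push_frame(stack_name, obj, description) is an assumption based on the
# dstack 0.x API and may differ in other versions.
try:
    import dstack as ds
    ds.push_frame("pandas_example", df, "Total sales by country")
except Exception as e:
    print(f"dstack not available or not configured: {e}")
```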

The published dataframe will be available at https://dstack.ai/<user>/pandas_example.