Anyone working in cloud environments such as Amazon Web Services, Google Cloud Platform, or Microsoft Azure has enormous computing power and tooling at their disposal, along with the ability to share and analyze data across diverse organizations. You may have heard this referred to as a “data lake”. Having worked in analytics, reporting, and big data (Hadoop) throughout my career, I know that in the past it could take weeks or months to spin up a data lake and put it to use. With major advances in serverless technology and off-the-shelf ETL (Extract, Transform, Load) tools, I wanted to see how quickly I could build my analytics reporting service while writing zero lines of code.

This post provides a look into the specific tools and techniques I used to build this solution. If you want to see some of the specific reports we’ve built with it, just jump to the bottom.

Note: While this solution uses some specific vendors, it is not an endorsement of any particular tool; there are many ways to arrive at the same result. I do, however, focus specifically on Jira data, which has its own quirks due to Jira-specific schemas.

Skills & Tools (The Recipe)

If you were inclined to build a similar solution, here are some of the skills and tools you will need:

SQL fundamentals (comfortable with SELECT, JOIN, and GROUP BY)

A Jira API key that can read all your projects

A Stitch Data account, or your favorite ETL tool. (Stitch publishes a list of its available connections.)

Tableau (or your favorite BI visualization tool)

A Google Cloud account with BigQuery

Architecture

Other than writing some cleaning scripts in SQL, the entire system is built with zero lines of code. All reporting is done via BigQuery views, and we run de-duping scripts as a daily batch job to keep the data sets from exploding in size. You can accomplish all of this without pulling in any big data processing tools or complicated infrastructure. You can literally get the basic infrastructure set up in a matter of hours. Building views and cleaning the data takes more time, but the end-to-end connectivity is really quite simple. The next sections cover some of the more nuanced things I found out while making this reporting system reliable.

The Nitty Gritty: Putting it All Together

These are the little helpful nuggets I learned along the way that, had I known them up front, would have made the whole build-out much faster.

Working with Views

I want all my data to be readable and to adhere to my own naming convention. Because Jira custom fields are identified only by a numeric id, the raw data is nearly impossible to make sense of. Hence my reporting policy: all reporting is done from views. Views also help abstract away the underlying data source in case you later need to point a report at some other data set.

CREATE OR REPLACE VIEW `my_project.my_dataset_jira.issues_view` AS
SELECT
  key AS issue_id,
  fields.summary AS summary,
  -- . . . other fields . . .
  fields.customfield_10230.value AS incident_environment,
  _sdc_received_at AS etl_received_at,
  _sdc_batched_at AS etl_batched_at,
  _sdc_extracted_at AS etl_extracted_at
FROM `my_project.my_dataset_jira.issues`

We keep the _sdc_* fields from Stitch to indicate when a particular row of data was processed during the ETL run. We’ll use these to de-dupe our data, since Stitch never performs destructive operations.

Once you have your query set, just save the view and start using it. In the image below, the issues table is the raw extraction from Jira and issues_view is the cleaned version we build reports on. We use a few other specialized tables as well for looking at pull requests, labels, and issue links.

Working with Nested Data

If you try to work with Jira data out of the box in BigQuery, you will likely see an error like:

Duplicate column names in the result are not supported. Found duplicate(s): id

An important distinction to understand in BigQuery is the difference between nested and repeated fields. It’s easiest to report against a flattened data structure (one issue per row), but by default the Jira issues table is populated with tons of nested and repeated fields. The easiest way to deal with this is to alias every nested record you select with a unique name:

SELECT
  fields.customfield_10230.id AS column_name_1,
  fields.customfield_10018.data.id AS column_name_2
FROM `my_project.my_dataset_jira.issues`

De-Duping Your Data

Every time a record is updated in Jira, Stitch appends the full record again. Since Stitch never modifies or removes your data, it’s your responsibility to ensure uniqueness. That’s fairly simple: replace the table, keeping only the most recently batched row for each Jira key. I run something like this as a nightly batch job:

CREATE OR REPLACE TABLE `my_project.my_dataset_jira.issues` AS
SELECT * EXCEPT(max_batched_timestamp)
FROM (
  SELECT *,
    MAX(_sdc_batched_at) OVER (PARTITION BY key) AS max_batched_timestamp
  FROM `my_project.my_dataset_jira.issues`
)
WHERE _sdc_batched_at = max_batched_timestamp

Working with Arrays

To keep the data structure flat, and reporting fast and simple, I handle array data as separate use cases. For example, when working with labels or issue links in Jira, I create separate tables that convert the arrays into dimensions. I won’t go into too much detail here, but if you want to extract array data, the UNNEST function is your friend for flattening it.
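As a minimal sketch of what such a labels table might look like: this assumes the raw issues table exposes the labels as a repeated field at fields.labels (the exact path and the issue_labels table name are assumptions; check your own Stitch-generated schema):

```sql
-- Hypothetical example: flatten the repeated labels field into one row per
-- (issue, label) pair, so labels can be used as a dimension in reports.
-- The field path fields.labels is an assumption; verify it in your schema.
CREATE OR REPLACE TABLE `my_project.my_dataset_jira.issue_labels` AS
SELECT
  key AS issue_id,
  label
FROM `my_project.my_dataset_jira.issues`,
  UNNEST(fields.labels) AS label
```

Joining this table back to issues_view on issue_id then lets you filter or group any report by label without nesting.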

Working with JSON Data

If you are using GitHub and linking pull requests to Jira, the developer data can be invaluable for reporting. Unfortunately it’s not straightforward to extract, so I’ve provided a script you can use to create your own table of pull requests linked to Jira issues from the development_json field:

SELECT
  issue_id, project_name, status_name,
  assignee_name, resolution_dt, resolution_name,
  html_url, issuetype_name,
  JSON_EXTRACT(pr, '$.cachedValue.summary.pullrequest.overall.count') AS pr_count,
  JSON_EXTRACT(pr, '$.cachedValue.summary.pullrequest.overall.state') AS pr_state,
  JSON_EXTRACT(pr, '$.cachedValue.summary.pullrequest.overall.open') AS pr_open_flag,
  JSON_EXTRACT(pr, '$.cachedValue.summary.pullrequest.overall.lastUpdated') AS pr_last_updated,
  JSON_EXTRACT(pr, '$.cachedValue.summary.pullrequest.overall.dataType') AS pr_data_type
FROM (
  SELECT issue_id, project_name, status_name, assignee_name,
    resolution_dt, resolution_name, html_url,
    issuetype_name, development_json,
    SPLIT(development_json, 'json=')[OFFSET(1)] AS pr
  FROM `my_project.my_dataset_jira.issues`
  WHERE development_json != '{}'
)
WHERE JSON_EXTRACT(pr, '$.cachedValue.summary.pullrequest.overall') IS NOT NULL

The result is a table that looks like this:

Change Logs : Identifying Moved & Deleted Issues

Jira has no built-in way to identify issues that have been deleted, so if you allow your teams to delete issues, you are going to have some potentially messy reporting. However, for issues that need cleanup because of a project move, you can use the change logs to identify where an issue was transitioned to another project. For this, we read the changelogs table from Jira.

You can identify issues that transitioned to another project like so:

SELECT i.field, i.tostring, i.from, i.fromstring, i.to, i.fieldtype, i.fieldid
FROM `mydb.changelogs`, UNNEST(items) AS i
WHERE i.field = 'Key'
  AND (i.tostring IS NOT NULL AND i.fromstring IS NOT NULL)
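If you then want to prune the stale copies from the raw table, one possible sketch is below. It assumes the old issue key lives in i.fromstring and that the issues table still carries rows under that old key; both are assumptions about your data, so verify with a SELECT before running anything destructive:

```sql
-- Hypothetical cleanup: remove raw issue rows whose key appears as the
-- "from" side of a Key change in the changelog, i.e. the issue was moved
-- to another project and the old-key row is now stale.
-- Assumption: i.fromstring holds the pre-move issue key.
DELETE FROM `my_project.my_dataset_jira.issues`
WHERE key IN (
  SELECT i.fromstring
  FROM `mydb.changelogs`, UNNEST(items) AS i
  WHERE i.field = 'Key'
    AND i.fromstring IS NOT NULL
)
```

Running this after the nightly de-dupe job keeps moved issues from being double-counted across projects.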

Summary

If you have ever wanted to take control of all your software delivery data and metrics, I’ve shown a simple zero-code solution that gives you great flexibility in reporting and in answering questions that need deeper analysis. I’ve been able to create several custom reports quite easily and provide answers and visualizations I otherwise would not have been able to produce with the standard Jira tools and plugins.

Happy Reporting! Below are a few examples of reports we have created using this solution.