I think I have your attention now


dbt handles the materialisation

A dbt project is a directory of `.sql` and `.yml` files, which dbt uses to transform your data. At a minimum, a dbt project must contain:

A project file: a `dbt_project.yml` file tells dbt that a particular directory is a dbt project, and also contains configurations for your project.

Models: dbt treats select statements as models. dbt handles turning these select statements into tables and views.
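As a sketch, the required project file can be as small as this (the `name` and `profile` values here are hypothetical):

```yaml
# minimal dbt_project.yml sketch (name and profile are hypothetical)
name: my_project
version: "1.0.0"
profile: my_profile
```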

Materialisations are strategies for persisting dbt models in a warehouse. As a data engineer, all you need to do is concentrate on writing your query; dbt will handle the materialisation for you!

Materialisation can be defined either at the project file level:

# snippet from the dbt_project.yml file
# materialization defined at the project file level
# applies to all files under models/example/
models:
  example:
    materialized: view

Alternatively, they can be defined at the model level:

-- materialization defined at a model level
-- define your materialized strategy here
{{ config(materialized='table') }}

SELECT
    *
FROM
    `project.dataset.tableA` A
LEFT JOIN
    `project.dataset.tableB` B
ON
    A._key = B._key

Materialisation can be chosen from table, view, incremental and ephemeral. You can read more about them in the dbt documentation.

dbt automatically builds the dependency graph

The most important function in dbt is ref(); it’s impossible to build even moderately complex models without it. ref() is how you reference one model within another. dbt uses references between models to automatically build the dependency graph. This will enable dbt to deploy models in the correct order.

Previously, I have had to write custom code in order to decide the correct order to deploy views and tables in BigQuery; the ref() feature makes this really simple without having to write a single line of custom code!
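To illustrate the kind of custom ordering code ref() makes unnecessary, here is a minimal topological sort over a hypothetical set of model dependencies (the model names are made up):

```python
# Minimal sketch of the ordering logic that dbt's ref() replaces:
# topologically sort models so dependencies deploy first.
from graphlib import TopologicalSorter  # Python 3.9+

# model -> set of models it references (hypothetical example)
deps = {"C": {"A", "B"}, "B": {"A"}, "A": set()}

order = list(TopologicalSorter(deps).static_order())
print(order)  # A first, then B, then C
```

dbt builds exactly this kind of graph for you by parsing the ref() calls in your models.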

Let’s take an example where query C requires tables (A,B). Normally, you would write it like this:

SELECT
    *
FROM
    `project.dataset.tableA` A
LEFT JOIN
    `project.dataset.tableB` B
ON
    A._key = B._key

But using dbt refs together with the materialisation config it’ll look something like this:

-- materialization defined at a model level
{{ config(materialized="table") }}

SELECT
    *
FROM
    {{ ref("tableA") }} A
LEFT JOIN
    {{ ref("tableB") }} B
ON
    A._key = B._key

Indeed, you can see that the most important function in dbt is ref()!

We are not done here: dbt docs also allows us to visualise the dependency graph with just a few commands (`dbt docs generate` followed by `dbt docs serve`).

Pretty neat, eh!?

dbt allows you to re-use SQL between models

What!? Macros in dbt are snippets of SQL that can be invoked like functions from models. This makes it possible to re-use SQL between models, in keeping with the engineering principle of DRY (Don't Repeat Yourself). A macro file can contain one or more macros, and to use them, simply add a `macro-paths` config entry to your `dbt_project.yml` file. That's it!
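For example, the config entry might look like this (the `macros` directory name is a common convention, not a requirement):

```yaml
# snippet from dbt_project.yml
# tell dbt which directories contain macro files
macro-paths: ["macros"]
```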

Let’s take a look at this in action. We have all seen code similar to this:

select
    field_1,
    field_2,
    field_3,
    field_4,
    field_5,
    count(*)
from
    my_table
group by
    1, 2, 3, 4, 5

While this might not look so bad, what if in another query we had to group by n columns where n > 50? It would be really tedious to write 1,2,3…50, and to top that off, what if this needed to be done across multiple SQL queries?

Let’s take a look at the previous example using macros. First, we define our macro in a macro file:

{% macro group_by(n) %}
    GROUP BY
    {% for i in range(1, n + 1) %}
        {{ i }}{% if not loop.last %},{% endif %}
    {% endfor %}
{% endmacro %}

Above, we have defined a macro called group_by which takes a single argument, n. Integrating our macro into the previous example would look like this:

select
    field_1,
    field_2,
    field_3,
    field_4,
    field_5,
    count(*)
from
    my_table
{{ group_by(5) }}

(Note that the macro emits the GROUP BY keyword itself, so we don’t write it in the query.)

This group_by macro can now be used across multiple queries and for n number of fields, pretty cool ;)
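As a quick sanity check, here is what the group_by macro renders to, reproduced in plain Python (a simulation of the Jinja output with whitespace collapsed; dbt performs the real rendering):

```python
# Reproduce the group_by macro's output in plain Python
# (whitespace collapsed; Jinja would render equivalent SQL).
def group_by(n):
    return "GROUP BY " + ", ".join(str(i) for i in range(1, n + 1))

print(group_by(5))  # GROUP BY 1, 2, 3, 4, 5
```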

dbt can combine SQL with a fully-featured templating language

dbt allows you to combine SQL with Jinja, a templating language. Jinja turns your dbt project into a programming environment for SQL, giving you the ability to do things that aren’t normally possible in SQL, such as control structures and the use of environment variables.

For example, let’s say you have a BigQuery table with two fields and an array of structs called `event`. Each struct in this array has two fields, `event_name` and `event_value`. One way of extracting values from this array could be:

SELECT
    field_1,
    field_2,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X1") AS column_X1,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X2") AS column_X2,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X3") AS column_X3,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X4") AS column_X4,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X5") AS column_X5,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X6") AS column_X6,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X7") AS column_X7,
    (SELECT event_value FROM UNNEST(event) WHERE event_name = "X8") AS column_X8
FROM
    `project.dataset.table`

Looks kinda ugly, right? Let’s write the same query this time using Jinja:

{% set event_names = ["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"] %}
{% set my_struct = "event" %}

SELECT
    field_1,
    field_2,
    {% for name in event_names %}
    (SELECT event_value FROM UNNEST({{ my_struct }}) WHERE event_name = '{{ name }}') AS column_{{ name }}{% if not loop.last %},{% endif %}
    {% endfor %}
FROM
    `project.dataset.table`

How cool is that? If you want to get adventurous you could create a macro and use it in your query as we saw previously :O
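To see what such a loop actually expands into, here is a quick sketch in plain Python of the compiled SQL (a simulation only; dbt and Jinja do the real compilation, and only three event names are used here for brevity):

```python
# Simulate how the Jinja loop above expands into SQL.
# (Plain Python stand-in; dbt/Jinja performs the real rendering.)
event_names = ["X1", "X2", "X3"]
my_struct = "event"

columns = ",\n".join(
    f"    (SELECT event_value FROM UNNEST({my_struct})"
    f" WHERE event_name = '{name}') AS column_{name}"
    for name in event_names
)
sql = (
    "SELECT\n    field_1,\n    field_2,\n"
    + columns
    + "\nFROM\n    `project.dataset.table`"
)
print(sql)
```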