ClickHouse

ClickHouse is an open source, columnar-oriented database that’s been developed primarily by engineers at Yandex. Yandex is one of Europe’s largest Internet-focused businesses. There search engine alone was receiving 150 million searches a day in 2012. ClickHouse has been deployed among a number of their businesses including their Metrica offering which is the world’s second largest web analytics platform. Outside of Yandex, ClickHouse has also been deployed at CERN where it was used to analyse events from the Large Hadron Collider.

ClickHouse is very feature-rich. It supports 16 different table engines, its CLI includes animated progress indicators and syntax highlighting and its performance puts it into its own class among open source offerings.

I have given some basic overall introduction about clickhouse and its features in my previous blog post here

Although clickhouse is not drop in replacement for for something like PostgreSQL. It lacks transactions, full-fledged UPDATE and DELETE statements and calls like DROP TABLE won’t work on tables above 54 GB without first changing the server’s configuration. But it is worth note that it will be a very respectable data store.

Why TPC-DS benchmark ?

There are several benchmarks available for clickhouse , In this post i try to push to the boundaries of clickhouse with the available hardware resources using TPC-DS data and queries.

What is TPC-DS?

The Transaction Processing Performance Council (TPC) is a benchmark model decision support system. Several major firms are member of TPC. You can get more information about TPC and members in the official website.

I have personally used TPC-DS to build one of machine learning model and compare results across different environments.

In this post I will explore how to generate test data and test queries using dsdgen and dsqgen utilities on a Centos machine against the product supplier snowflake-type schema as well as how to load test data into the created database in order to run some queries TPC-DS defines. I don’t want to go too much into details as TPC already provides a very comprehensive overview of the database and schema it provides, along with detail description of constraints and assumptions. This post is more focused on how to generate the required components used to evaluate the target platform(Clickhouse) for TPC-DS data sets and data loading.

Steps to Generate and Load TPC-DS Data into Clickhouse Server

Below are the steps to generate and load TPC-DS data into Clickhouse server:

I used this tool kit

Install git and other tools you need with the following command

sudo yum install gcc make flex bison byacc git

Now clone the tools needed for generating dataset

git clone https://github.com/gregrahn/tpcds-kit.git

after downloading , compile it by CD in to it and

cd tpcds-kit/tools and make OS=LINUX

and hit enter

After the above steps I created some scripts in order to generate data based on my requirement for clickhouse

The hardware used in this test contain 8 vCPUs, 32 GB of RAM, 2TB of SSD storage and support 100 Mbps networking. I’ll be using CentOS Linux 7 (Core)Server for the operating system.

Data Generation

Created a seperate directory to store the generated data

mkdir tpcds-data

vi datagen-ch.sh #! /bin/bash cd /root/tpcds/tpcds-kit/tools/ ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 1 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 2 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 3 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 4 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 5 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 6 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 7 & ./dsdgen -scale 150 -dir /root/tpcds-data -parallel 8 -child 8 &

I am going to run <dsdgen> executable tool kit that we got from github which was complied. I generate 150 GB of data in to tpcds-data dir that I created.

The tpc-ds consist of 24 tables.Before we do anything with the script we have to give execute permission

chmod +x datagen-ch.sh

Now we all set to generate data , Since the data set i generate was large enough to take some time i ran it in background mode

nohup ./datagen-ch.sh &

To make sure it was running

ps-ef | grep datagen-ch

It took almost 3 hrs to generate data, below will show the generated data.

With this we come to end of this blog post …will discuss loading and running queries in next post….here

0