For more than a year I've been trying to gain access to what I would consider a supercomputing cluster that I could run a GPU-based database benchmark on. I've come across several challenges during this time: some software providers aren't certain their code will scale up to the hardware on offer, and the owners of the hardware are often busy running workloads that could very well be solving some of the world's biggest challenges.

BrytlytDB recently added support for IBM's POWER8 CPU architecture and I've been kindly granted access to a five-node IBM Power System S822LC for HPC cluster to run my 1.1 Billion Taxi Rides benchmark on. The systems are more commonly known by their code-name "Minsky" and, in aggregate, the cluster has 20 Nvidia P100 GPUs with a combined total of 71,680 CUDA cores and 320 GB of HBM2 memory that the CUDA cores can communicate with at 14,640 GB/s.
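For what it's worth, those aggregate figures are simply the per-card numbers (covered in the hardware section below) multiplied out across the 20 cards:

$ echo $((20 * 3584))   # CUDA cores across the cluster
71680
$ echo $((20 * 16))     # GB of HBM2 across the cluster
320
$ echo $((20 * 732))    # GB/s of aggregate HBM2 bandwidth
14640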

For those unfamiliar with BrytlytDB, I'd describe it as PostgreSQL for GPUs. All the regular features of PostgreSQL, such as joins, stored procedures and ODBC connectivity (important for Tableau users), work as expected, but once PostgreSQL puts together an execution plan for a query, BrytlytDB takes over and executes it on one or more of Nvidia's CUDA-powered GPUs. The software has the added benefit of being able to run single queries across multiple GPUs on multiple machines.
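Because the wire protocol and the SQL surface are PostgreSQL's, any stock PostgreSQL client can talk to a BrytlytDB coordinator. As a minimal sketch of what that looks like (the host name here is a placeholder; the brytlyt database and user names are the ones used later in this post):

$ psql -h coordinator.example.com -p 5432 -U brytlyt brytlyt \
    -c "SELECT version();"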

In this benchmark I'll see how well version 2.1, the latest release of BrytlytDB, runs on five IBM Minsky servers when querying 1.1 billion taxi trips. The dataset is made up of 1.1 billion taxi trips conducted in New York City between 2009 and 2015 and comes to about 500 GB in CSV format. This is the same dataset I've used to benchmark Amazon Athena, BigQuery, ClickHouse, Elasticsearch, EMR, kdb+/q, MapD, PostgreSQL, Redshift and Vertica. I have a single-page summary of all these benchmarks for comparison.

The Hardware

In this benchmark I'll be using five IBM Power System S822LC for HPC servers. Each 2U server comes with two 8-core, 4 GHz POWER8 CPUs running 8 threads per core.

$ lscpu

Architecture:          ppc64le
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    8
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Model:                 1.0 (pvr 004c 0100)
Model name:            POWER8NVL (raw), altivec supported
CPU max MHz:           4023.0000
CPU min MHz:           2061.0000
L1d cache:             64K
L1i cache:             32K
L2 cache:              512K
L3 cache:              8192K
NUMA node0 CPU(s):     0-63
NUMA node1 CPU(s):     64-127

The CPUs' frequency governor was set to performance for this benchmark.

$ sudo cpupower frequency-set -g performance

There is half a terabyte of DDR4 RAM on each machine.

$ cat /proc/meminfo | head -n1

MemTotal:       535690368 kB

Each server has four Nvidia Tesla P100-SXM2-16GB cards. Each of these cards has 3,584 CUDA cores, 16 GB of HBM2 memory that the CUDA cores can communicate with at 732 GB/s and a total power draw of 300 Watts.

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 361.93.02              Driver Version: 361.93.02                 |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 0002:01:00.0     Off |                    0 |
| N/A   33C    P0    30W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 0003:01:00.0     Off |                    0 |
| N/A   34C    P0    29W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 000A:01:00.0     Off |                    0 |
| N/A   33C    P0    28W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 000B:01:00.0     Off |                    0 |
| N/A   29C    P0    28W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

The compute cores on the GPUs are clocked at 1.48 GHz, the memory clocks are set to 715 MHz and persistence mode is switched on.

$ sudo nvidia-smi -ac 715,1480
$ sudo nvidia-smi -pm ENABLED

The Minsky boxes are equipped with NVLink, which provides much faster interconnectivity between the CPUs and GPUs than you'd find in a PCIe-based system. The CPUs can speak to the system's RAM at 115 GB/s, the CPUs and GPUs can all communicate with one another at 80 GB/s thanks in part to NVLink, and the GPUs can speak to their own memory at 732 GB/s. The ratio of two P100 cards for each CPU seems to create a good balance for BrytlytDB.

Each server has a Seagate 1 TB 2.5" 7200 RPM hard disk drive and a 1.6 TB NVMe flash drive. The output below shows how the partitions are set up.

$ lsblk

NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    1 931.5G  0 disk
|-sda1        8:1    1     7M  0 part
|-sda2        8:2    1 929.6G  0 part /
`-sda3        8:3    1   1.9G  0 part [SWAP]
sdb           8:16   1 931.5G  0 disk
sr0          11:0    1  1024M  0 rom
sr1          11:1    1  1024M  0 rom
sr2          11:2    1  1024M  0 rom
sr3          11:3    1  1024M  0 rom
nvme0n1     259:0    0   1.5T  0 disk
`-nvme0n1p1 259:1    0   1.5T  0 part /nvme3T
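Going back to the GPUs for a moment: the NVLink topology and the applied clocks can be inspected with nvidia-smi itself. A minimal sketch (which subcommands are available, and their output format, depends on the driver version):

$ nvidia-smi topo -m      # interconnect matrix between the GPUs and CPUs
$ nvidia-smi nvlink -s    # per-link NVLink status
$ nvidia-smi -q -d CLOCK  # confirm the memory and SM clocks set above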
I ran dd on one of the NVMe drives to get an idea of what sort of single-process, single-thread speeds could be seen with the drive. I was able to max out writes at 3.8 GB/s and reads at 1 GB/s.

$ sudo dd if=/dev/zero of=/nvme3T/test bs=500K count=1024

1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.137118 s, 3.8 GB/s

$ sync
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
$ dd if=/nvme3T/test of=/dev/null bs=500K count=1024

1024+0 records in
1024+0 records out
524288000 bytes (524 MB, 500 MiB) copied, 0.520857 s, 1.0 GB/s

Though the Minsky machines do come with InfiniBand network ports, BrytlytDB keeps network chatter to a minimum, so for this benchmark a 1 Gbps network was used for communication between the five machines.
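One caveat of my own on those figures: the write test above goes through the page cache, so the 3.8 GB/s number is likely flattered; the read test at least had the caches dropped first. A sketch of what a direct-I/O version of the same test could look like, using dd's direct flags to bypass the page cache (test_direct is just an illustrative file name):

$ sudo dd if=/dev/zero of=/nvme3T/test_direct bs=1M count=1024 oflag=direct
$ sudo dd if=/nvme3T/test_direct of=/dev/null bs=1M count=1024 iflag=direct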

BrytlytDB Up and Running

Each of the servers is running Ubuntu 16.04.1 LTS with a 4.4.0-57-generic Kernel compiled for ppc64le. Docker version 17.06.0-ce, along with the Docker Engine Utility for NVIDIA GPUs (nvidia-docker), was already installed on the machines when I gained access to them.

For this benchmark I'll be running the new version 2.1 of BrytlytDB which supports POWER8. In order to minimise the installation steps I'll be using a Docker image with the software already set up.

$ sudo nvidia-docker login
$ sudo nvidia-docker pull brytlytdb/ppc64-4n

I then launched the containers on each of the five servers. PostgreSQL's client port 5432 and BrytlytDB's data node communication ports (30000-30003) will be exposed from the Docker container.

$ sudo nvidia-docker run \
    -dti \
    --name=cluster \
    -p 5432:5432 \
    -p 30000-30003:30000-30003 \
    brytlytdb/ppc64-4n

$ sudo nvidia-docker exec \
    cluster /root/brytlyt start

I do have to highlight that there was an oversight: these containers were launched on the mechanical disks rather than the NVMe drives on each of the machines. This will have an impact on the performance of any I/O operation BrytlytDB performs. I'm disappointed this happened and I would have rerun the benchmark, but I only noticed it after I'd surrendered the servers back to their owner.

On the coordinator I'll register the other data nodes. The network is set up such that the other machines are addressed via non-reserved IPv4 addresses, so for the sake of the owner's privacy I'll be obscuring the first three octets of each address.

$ sudo docker exec -ti cluster \
    /usr/local/brytlyt/bin/psql \
    brytlyt brytlyt

CREATE NODE d04 WITH (HOST = 'x.x.x.213', TYPE = 'datanode', PORT = 15432);
CREATE NODE d05 WITH (HOST = 'x.x.x.213', TYPE = 'datanode', PORT = 15433);
CREATE NODE d06 WITH (HOST = 'x.x.x.213', TYPE = 'datanode', PORT = 15434);
CREATE NODE d07 WITH (HOST = 'x.x.x.213', TYPE = 'datanode', PORT = 15435);
CREATE NODE d08 WITH (HOST = 'x.x.x.103', TYPE = 'datanode', PORT = 15432);
CREATE NODE d09 WITH (HOST = 'x.x.x.103', TYPE = 'datanode', PORT = 15433);
CREATE NODE d10 WITH (HOST = 'x.x.x.103', TYPE = 'datanode', PORT = 15434);
CREATE NODE d11 WITH (HOST = 'x.x.x.103', TYPE = 'datanode', PORT = 15435);
CREATE NODE d12 WITH (HOST = 'x.x.x.104', TYPE = 'datanode', PORT = 15432);
CREATE NODE d13 WITH (HOST = 'x.x.x.104', TYPE = 'datanode', PORT = 15433);
CREATE NODE d14 WITH (HOST = 'x.x.x.104', TYPE = 'datanode', PORT = 15434);
CREATE NODE d15 WITH (HOST = 'x.x.x.104', TYPE = 'datanode', PORT = 15435);
CREATE NODE d16 WITH (HOST = 'x.x.x.105', TYPE = 'datanode', PORT = 15432);
CREATE NODE d17 WITH (HOST = 'x.x.x.105', TYPE = 'datanode', PORT = 15433);
CREATE NODE d18 WITH (HOST = 'x.x.x.105', TYPE = 'datanode', PORT = 15434);
CREATE NODE d19 WITH (HOST = 'x.x.x.105', TYPE = 'datanode', PORT = 15435);

SELECT pgxc_pool_reload();

With those in place there is now a data node for each of the 20 GPUs in the cluster and they've all been federated on the coordinator.
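The pgxc_pool_reload() call above hints that BrytlytDB's clustering layer follows the Postgres-XC/XL model, so, assuming the usual pgxc_node catalog is present, the registered data nodes can be sanity-checked from the same psql session. A minimal sketch:

SELECT node_name, node_type, node_host, node_port
FROM pgxc_node
ORDER BY node_name;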

Loading 1.1 Billion Trips into BrytlytDB

I'll first define the table schema for the 1.1 billion taxi trip records.

$ sudo docker exec -ti cluster \
    /usr/local/brytlyt/bin/psql \
    brytlyt brytlyt

CREATE FOREIGN TABLE trips (
    trip_id                 INTEGER,
    vendor_id               VARCHAR(3),
    pickup_datetime         TIMESTAMP,
    dropoff_datetime        TIMESTAMP,
    store_and_fwd_flag      VARCHAR(1),
    rate_code_id            SMALLINT,
    pickup_longitude        DOUBLE PRECISION,
    pickup_latitude         DOUBLE PRECISION,
    dropoff_longitude       DOUBLE PRECISION,
    dropoff_latitude        DOUBLE PRECISION,
    passenger_count         SMALLINT,
    trip_distance           DOUBLE PRECISION,
    fare_amount             DOUBLE PRECISION,
    extra                   DOUBLE PRECISION,
    mta_tax                 DOUBLE PRECISION,
    tip_amount              DOUBLE PRECISION,
    tolls_amount            DOUBLE PRECISION,
    ehail_fee               DOUBLE PRECISION,
    improvement_surcharge   DOUBLE PRECISION,
    total_amount            DOUBLE PRECISION,
    payment_type            VARCHAR(10),
    trip_type               SMALLINT,
    pickup                  VARCHAR(50),
    dropoff                 VARCHAR(50),
    cab_type                VARCHAR(10),
    precipitation           SMALLINT,
    snow_depth              SMALLINT,
    snowfall                SMALLINT,
    max_temperature         SMALLINT,
    min_temperature         SMALLINT,
    average_wind_speed      SMALLINT,
    pickup_nyct2010_gid     SMALLINT,
    pickup_ctlabel          VARCHAR(10),
    pickup_borocode         SMALLINT,
    pickup_boroname         VARCHAR(13),
    pickup_ct2010           VARCHAR(6),
    pickup_boroct2010       VARCHAR(7),
    pickup_cdeligibil       VARCHAR(1),
    pickup_ntacode          VARCHAR(4),
    pickup_ntaname          VARCHAR(56),
    pickup_puma             VARCHAR(4),
    dropoff_nyct2010_gid    SMALLINT,
    dropoff_ctlabel         VARCHAR(10),
    dropoff_borocode        SMALLINT,
    dropoff_boroname        VARCHAR(13),
    dropoff_ct2010          VARCHAR(6),
    dropoff_boroct2010      VARCHAR(7),
    dropoff_cdeligibil      VARCHAR(1),
    dropoff_ntacode         VARCHAR(4),
    dropoff_ntaname         VARCHAR(56),
    dropoff_puma            VARCHAR(4)
) SERVER gm_fdw_server
  OPTIONS (max_size '55682651',
           index '10:24')
  DISTRIBUTE BY ROUNDROBIN;

In the above table declaration there are indices on both the passenger count and cab type columns. The gm_fdw_server identifier tells PostgreSQL to connect to BrytlytDB's GPU foreign data wrapper. The max_size value of 55,682,651 is the maximum number of rows I intend to import onto each data node; there are 20 data nodes so this should cover the 1.1 billion records in the dataset. I'm using the Round Robin data distribution policy here. This table won't be joined with any other data, but if it were, HASH and/or REPLICATE would be better options.

Note that since my last BrytlytDB benchmark version 2.1 of the software has been released and there is now support for both TIMESTAMP and SMALLINT fields.

For this benchmark I've downloaded and decompressed the 500 GB of CSV data I created in my Billion Taxi Rides in Redshift blog post. I'll dedicate 1/5th of the dataset to each of the five servers. On each server its dataset is broken up again into four parts so that each CSV file targets one data node. There is a 1-to-1 ratio of data nodes to Tesla P100 cards.
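As a quick sanity check on that max_size figure (my own arithmetic, not anything from BrytlytDB): 20 data nodes at 55,682,651 rows apiece gives a ceiling of just over 1.11 billion rows, which comfortably covers the dataset.

$ echo $((20 * 55682651))
1113653020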
The following was all run on the coordinator.

$ vi urls.txt

https://<host>/trips_xaa.csv.gz
https://<host>/trips_xab.csv.gz
https://<host>/trips_xac.csv.gz
https://<host>/trips_xad.csv.gz
https://<host>/trips_xae.csv.gz
https://<host>/trips_xaf.csv.gz
https://<host>/trips_xag.csv.gz
https://<host>/trips_xah.csv.gz
https://<host>/trips_xai.csv.gz
https://<host>/trips_xaj.csv.gz
https://<host>/trips_xak.csv.gz
https://<host>/trips_xal.csv.gz
https://<host>/trips_xam.csv.gz
https://<host>/trips_xan.csv.gz
https://<host>/trips_xao.csv.gz
https://<host>/trips_xap.csv.gz
https://<host>/trips_xaq.csv.gz
https://<host>/trips_xar.csv.gz
https://<host>/trips_xas.csv.gz
https://<host>/trips_xat.csv.gz
https://<host>/trips_xau.csv.gz
https://<host>/trips_xav.csv.gz
https://<host>/trips_xaw.csv.gz
https://<host>/trips_xax.csv.gz
https://<host>/trips_xay.csv.gz
https://<host>/trips_xaz.csv.gz
https://<host>/trips_xba.csv.gz
https://<host>/trips_xbb.csv.gz
https://<host>/trips_xbc.csv.gz
https://<host>/trips_xbd.csv.gz
https://<host>/trips_xbe.csv.gz
https://<host>/trips_xbf.csv.gz
https://<host>/trips_xbg.csv.gz
https://<host>/trips_xbh.csv.gz
https://<host>/trips_xbi.csv.gz
https://<host>/trips_xbj.csv.gz
https://<host>/trips_xbk.csv.gz
https://<host>/trips_xbl.csv.gz
https://<host>/trips_xbm.csv.gz
https://<host>/trips_xbn.csv.gz
https://<host>/trips_xbo.csv.gz
https://<host>/trips_xbp.csv.gz
https://<host>/trips_xbq.csv.gz
https://<host>/trips_xbr.csv.gz
https://<host>/trips_xbs.csv.gz
https://<host>/trips_xbt.csv.gz
https://<host>/trips_xbu.csv.gz
https://<host>/trips_xbv.csv.gz
https://<host>/trips_xbw.csv.gz
https://<host>/trips_xbx.csv.gz
https://<host>/trips_xby.csv.gz
https://<host>/trips_xbz.csv.gz
https://<host>/trips_xca.csv.gz
https://<host>/trips_xcb.csv.gz
https://<host>/trips_xcc.csv.gz
https://<host>/trips_xcd.csv.gz

$ cat urls.txt | \
    xargs -n 1 -P 6 \
        wget

$ find trips_x*.csv.gz | \
    xargs -n 1 -P 16 \
        gunzip

$ cat trips_x*.csv \
    | split --lines=55682651 \
            --additional-suffix=.csv

count=0

for filename in xa*.csv; do
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON (data_datanode$count) 'SELECT load_data(''trips'', ''$filename'', '','', 2000000, ''0'', '''', ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON (data_datanode$(($count + 4))) 'SELECT load_data(''trips'', ''$filename'', '','', 2000000, ''0'', '''', ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON (data_datanode$(($count + 8))) 'SELECT load_data(''trips'', ''$filename'', '','', 2000000, ''0'', '''', ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON (data_datanode$(($count + 12))) 'SELECT load_data(''trips'', ''$filename'', '','', 2000000, ''0'', '''', ''append'')'" 2>&1 &
    /usr/local/brytlyt/bin/psql brytlyt brytlyt \
        -c "EXECUTE DIRECT ON (data_datanode$(($count + 16))) 'SELECT load_data(''trips'', ''$filename'', '','', 2000000, ''0'', '''', ''append'')'" 2>&1 &
    ((count++))
done

The above took around 2.5 hours to complete.
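With the loads finished, the row counts can be checked both through the coordinator and, using the same EXECUTE DIRECT mechanism as the load script, on an individual data node. This is a sketch rather than captured output; the data_datanode0 name is the one assumed by the load script above.

$ sudo docker exec -ti cluster \
    /usr/local/brytlyt/bin/psql \
    brytlyt brytlyt \
    -c "SELECT count(*) FROM trips;"

$ sudo docker exec -ti cluster \
    /usr/local/brytlyt/bin/psql \
    brytlyt brytlyt \
    -c "EXECUTE DIRECT ON (data_datanode0) 'SELECT count(*) FROM trips'"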