As a Big Data Engineer, I often need to improve Spark jobs or Hive queries by analyzing the input and output data distribution, in order to see whether the data are well balanced across the different nodes of the cluster or skewed. The purpose of this article is to present a simple Python 3 script that displays the data distribution for any HDFS file or directory.

As you know, files in HDFS are split into HDFS blocks, which are distributed and replicated across the cluster nodes. To be clear, when I say data distribution, I mean the distribution of HDFS blocks. Here is a simple schema visualizing how an HDFS file is split into blocks and distributed across a cluster. In this example the replication factor is 2.

In the following sections, I first introduce the HDFS File System Checking Utility command and how to use it. Then I present the Python script I wrote, which uses this command to display the data distribution of an HDFS directory.

HDFS File System Checking Utility command

As mentioned in the documentation, the hdfs fsck command is designed to report problems with files, for example missing or under-replicated blocks. But we can also use it to check data distribution.

The hdfs fsck command prints the following information for a given HDFS path:

Status

Total size

Number of files in the directory

List of the HDFS blocks for each file

Replication factor of each file

Size of each HDFS block

Location of each HDFS block

In my script, I use the hdfs fsck command like this:

$ hdfs fsck hdfs://my/hdfs/path -files -blocks -locations

Some explanations of the arguments:

-files: print out files being checked

-blocks: print out block report

-locations: print out locations for every block

The -racks argument is also available; you can use it to see how the data are balanced across the different racks of your cluster. I chose not to include this argument in my script because my whole cluster runs on a single rack.

Here is the output of the command:

$ hdfs fsck hdfs://my/hdfs/path -files -blocks -locations

my/hdfs/path/file_1 3233752175 bytes, 3 block(s): OK

1. BP-132798933-10.231.164.51-1495778573138:blk_1073760544_19759 len=268435456 repl=2 [DatanodeInfoWithStorage[10.231.164.54:50010,DS-aefa4fdf-b2b7-4143-8833-5022f2a86e31,DISK], DatanodeInfoWithStorage[10.231.164.55:50010,DS-cc75eb07-a744-4e30-80a0-140f10e1e7a2,DISK]]

0. BP-132798933-10.231.164.51-1495778573138:blk_1073760522_19737 len=268435456 repl=2 [DatanodeInfoWithStorage[10.231.164.52:50010,DS-ae123305-08ca-4e5f-86dc-3e2e9b7ebe2a,DISK], DatanodeInfoWithStorage[10.231.164.55:50010,DS-cc75eb07-a744-4e30-80a0-140f10e1e7a2,DISK]]

2. BP-132798933-10.231.164.51-1495778573138:blk_1073760573_19788 len=268435456 repl=2 [DatanodeInfoWithStorage[10.231.164.53:50010,DS-0c13c48c-0f3e-471f-932c-2131f0ae055c,DISK], DatanodeInfoWithStorage[10.231.164.56:50010,DS-20931ad0-b272-4e1b-bac3-f651a2d38e41,DISK]]



my/hdfs/path/file_2 3493050817 bytes, 2 block(s): OK

0. BP-132798933-10.231.164.51-1495778573138:blk_1073760524_19739 len=268435456 repl=2 [DatanodeInfoWithStorage[10.231.164.56:50010,DS-4a35cd92-ff2d-4a0a-8679-401cab2bd127,DISK], DatanodeInfoWithStorage[10.231.164.52:50010,DS-ae123305-08ca-4e5f-86dc-3e2e9b7ebe2a,DISK]]

1. BP-132798933-10.231.164.51-1495778573138:blk_1073760552_19767 len=268435456 repl=2 [DatanodeInfoWithStorage[10.231.164.56:50010,DS-4a35cd92-ff2d-4a0a-8679-401cab2bd127,DISK], DatanodeInfoWithStorage[10.231.164.55:50010,DS-6fce9f9a-f16e-44f4-a87b-431e3d1a8fdd,DISK]]



Status: HEALTHY

Total size: 1342177280B

Total dirs: 1

Total files: 2

Total symlinks: 0

Total blocks (validated): 5 (avg. block size 268435456B)

Minimally replicated blocks: 5 (100.0 %)

Over-replicated blocks: 0 (0.0 %)

Under-replicated blocks: 0 (0.0 %)

Mis-replicated blocks: 0 (0.0 %)

Default replication factor: 2

Average block replication: 2.0

Corrupt blocks: 0

Missing replicas: 0 (0.0 %)

Number of data-nodes: 5

Number of racks: 1

FSCK ended at Thu Jan 24 16:03:48 NZDT 2019 in 15 milliseconds
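Each block line in this output follows a regular pattern, so it can be read programmatically. As a quick illustration (a minimal sketch, not part of the original article), here is how a regular expression can pull the block length and the datanode IPs out of one of the lines above:

```python
import re

# One block line taken verbatim from the fsck output above.
line = (
    "1. BP-132798933-10.231.164.51-1495778573138:blk_1073760544_19759 "
    "len=268435456 repl=2 "
    "[DatanodeInfoWithStorage[10.231.164.54:50010,DS-aefa4fdf-b2b7-4143-8833-5022f2a86e31,DISK], "
    "DatanodeInfoWithStorage[10.231.164.55:50010,DS-cc75eb07-a744-4e30-80a0-140f10e1e7a2,DISK]]"
)

# The block length in bytes follows "len=".
length = int(re.search(r"len=(\d+)", line).group(1))

# Each replica location appears as DatanodeInfoWithStorage[ip:port,...].
nodes = re.findall(r"DatanodeInfoWithStorage\[([\d.]+):", line)

print(length)  # 268435456
print(nodes)   # ['10.231.164.54', '10.231.164.55']
```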

By looking at the output, we can do a very simple analysis. The HDFS directory used in this example contains two files:

file_1: split into 3 HDFS blocks, each with a replication factor of 2

- Block1 on 10.231.164.54 and 10.231.164.55 (size = 268435456B)

- Block2 on 10.231.164.52 and 10.231.164.55 (size = 268435456B)

- Block3 on 10.231.164.53 and 10.231.164.56 (size = 268435456B)

file_2: split into 2 HDFS blocks, each with a replication factor of 2

- Block1 on 10.231.164.56 and 10.231.164.52 (size = 268435456B)

- Block2 on 10.231.164.56 and 10.231.164.55 (size = 268435456B)

We can notice that the total size displayed equals the sum of the HDFS block sizes (268435456 B * 5 = 1342177280 B), whereas the actual size used to store the data on the cluster is twice as large, because the replication factor is 2 in this example.
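This arithmetic is easy to check with the figures from the example (a short sketch; the numbers come from the fsck summary above):

```python
block_size = 268435456   # size of each HDFS block, in bytes
num_blocks = 5           # 3 blocks for file_1 + 2 blocks for file_2
replication = 2          # replication factor in this example

# Logical size, as reported by "Total size" in the fsck summary.
logical_size = block_size * num_blocks
print(logical_size)      # 1342177280

# Physical size actually stored on the cluster, counting every replica.
physical_size = logical_size * replication
print(physical_size / 1024 ** 3, "GB")  # 2.5 GB
```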

Python Script to display the data distribution

Using hdfs fsck can help you better understand the HDFS data distribution, but its output is still pretty messy with all the files and blocks listed. That is why I wrote a Python script that parses the output of the hdfs fsck command and computes the data size stored on each node of the cluster.

I think the script is simple enough not to need a detailed explanation. If that is not the case, feel free to leave a comment.
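The original script is not reproduced here, but a minimal sketch of what it can look like is shown below. It assumes the fsck output format from the example above; the function names and the exact output formatting are my own illustrative choices, not necessarily those of the original script.

```python
#!/usr/bin/env python3
"""Sketch of hdfs_data_distribution.py: display the HDFS block
distribution of a path by parsing `hdfs fsck` output.
Replicas are counted, so the total is logical size * replication."""
import re
import subprocess
import sys
from collections import defaultdict


def run_fsck(hdfs_path):
    """Run hdfs fsck and return its raw text output."""
    cmd = ["hdfs", "fsck", hdfs_path, "-files", "-blocks", "-locations"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def size_by_node(fsck_output):
    """Sum block sizes per datanode IP, counting every replica."""
    sizes = defaultdict(int)
    for line in fsck_output.splitlines():
        match = re.search(r"len=(\d+)", line)
        if not match:
            continue  # not a block line
        length = int(match.group(1))
        # One DatanodeInfoWithStorage[ip:port,...] entry per replica.
        for ip in re.findall(r"DatanodeInfoWithStorage\[([\d.]+):", line):
            sizes[ip] += length
    return dict(sizes)


def pretty_size(num_bytes):
    """Format a byte count with a human-readable unit."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if num_bytes < 1024:
            return "{:.1f} {}".format(num_bytes, unit)
        num_bytes /= 1024
    return "{:.1f} PB".format(num_bytes)


def main(hdfs_path):
    sizes = size_by_node(run_fsck(hdfs_path))
    total = sum(sizes.values())
    print("Total size = {}".format(pretty_size(total)))
    print("Data by node")
    for node, size in sorted(sizes.items(), key=lambda kv: kv[1]):
        print("{} : {} ({:6.2f} %)".format(node, pretty_size(size), 100 * size / total))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```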

Here is the output we get by running this script on the same HDFS path:

$ python3 hdfs_data_distribution.py hdfs://my/hdfs/path

Total size = 2.5 GB

Data by node

10.231.164.53 : 256.0 MB ( 10.00 %)

10.231.164.54 : 256.0 MB ( 10.00 %)

10.231.164.52 : 512.0 MB ( 20.00 %)

10.231.164.55 : 768.0 MB ( 30.00 %)

10.231.164.56 : 768.0 MB ( 30.00 %)

Using this script, we can see at a glance that the data are not evenly distributed across the nodes.

We can also notice that the total size printed is twice as large as the total size displayed by hdfs fsck. This is because the Python script sums the size of every HDFS block, so the replicated blocks are counted as well. In this example the replication factor is 2, which is why the total size reported by the script is doubled.

Conclusion

This is how I use an existing HDFS command to better understand the distribution of the data I deal with every day. I hope this article and the Python script will be useful to you; let me know if you have any feedback to improve it!