The Hadoop Distributed File System (HDFS) allows you to both federate storage across many computers as well as distribute files in a redundant manor across a cluster. HDFS is a key component to many storage clusters that possess more than a petabyte of capacity.

Each computer acting as a storage node in a cluster can contain one or more storage devices. This can allow several mechanical storage drives to both store data more reliably than SSDs, keep the cost per gigabyte down as well as go some way to exhausting the SATA bus capacity of a given system.

Hadoop ships with a feature-rich and robust JVM-based HDFS client. For many that interact with HDFS directly it is the go-to tool for any given task. That said, there is a growing population of alternative HDFS clients. Some optimise for responsiveness while others make it easier to utilise HDFS in Python applications. In this post I'll walk through a few of these offerings.

If you'd like to setup an HDFS environment locally please see my Hadoop 3 Single-Node Install Guide (skip the steps for Presto and Spark). I also have posts that cover working with HDFS on AWS EMR and Google Dataproc.

The Apache Hadoop HDFS Client The Apache Hadoop HDFS client is the most well-rounded HDFS CLI implementation. Virtually any API endpoint that has been built into HDFS can be interacted with using this tool. For the release of Hadoop 3, considerable effort was put into reorganising the arguments of this tool. This is what they look like as of this writing. $ hdfs Admin Commands: cacheadmin configure the HDFS cache crypto configure HDFS encryption zones debug run a Debug Admin to execute HDFS debug commands dfsadmin run a DFS admin client dfsrouteradmin manage Router-based federation ec run a HDFS ErasureCoding CLI fsck run a DFS filesystem checking utility haadmin run a DFS HA admin client jmxget get JMX exported values from NameNode or DataNode. oev apply the offline edits viewer to an edits file oiv apply the offline fsimage viewer to an fsimage oiv_legacy apply the offline fsimage viewer to a legacy fsimage storagepolicies list/get/set block storage policies Client Commands: classpath prints the class path needed to get the hadoop jar and the required libraries dfs run a filesystem command on the file system envvars display computed Hadoop environment variables fetchdt fetch a delegation token from the NameNode getconf get config values from configuration groups get the groups which users belong to lsSnapshottableDir list all snapshottable dirs owned by the current user snapshotDiff diff two snapshots of a directory or diff the current directory contents with a snapshot version print the version Daemon Commands: balancer run a cluster balancing utility datanode run a DFS datanode dfsrouter run the DFS router diskbalancer Distributes data evenly among disks on a given node httpfs run HttpFS server, the HDFS HTTP Gateway journalnode run the DFS journalnode mover run a utility to move block replicas across storage types namenode run the DFS namenode nfs3 run an NFS version 3 gateway portmap run a portmap service secondarynamenode run the DFS secondary namenode zkfc run the ZK Failover Controller daemon The bulk of the disk access verbs most people familiar with Linux will recognise are kept under the dfs argument. $ hdfs dfs Usage: hadoop fs [generic options] [-appendToFile <localsrc> ... <dst>] [-cat [-ignoreCrc] <src> ...] [-checksum <src> ...] [-chgrp [-R] GROUP PATH...] [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...] [-chown [-R] [OWNER][:[GROUP]] PATH...] [-copyFromLocal [-f] [-p] [-l] [-d] [-t <thread count>] <localsrc> ... <dst>] [-copyToLocal [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>] [-count [-q] [-h] [-v] [-t [<storage type>]] [-u] [-x] [-e] <path> ...] [-cp [-f] [-p | -p[topax]] [-d] <src> ... <dst>] [-createSnapshot <snapshotDir> [<snapshotName>]] [-deleteSnapshot <snapshotDir> <snapshotName>] [-df [-h] [<path> ...]] [-du [-s] [-h] [-v] [-x] <path> ...] [-expunge] [-find <path> ... <expression> ...] [-get [-f] [-p] [-ignoreCrc] [-crc] <src> ... <localdst>] [-getfacl [-R] <path>] [-getfattr [-R] {-n name | -d} [-e en] <path>] [-getmerge [-nl] [-skip-empty-file] <src> <localdst>] [-head <file>] [-help [cmd ...]] [-ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] [<path> ...]] [-mkdir [-p] <path> ...] [-moveFromLocal <localsrc> ... <dst>] [-moveToLocal <src> <localdst>] [-mv <src> ... <dst>] [-put [-f] [-p] [-l] [-d] <localsrc> ... <dst>] [-renameSnapshot <snapshotDir> <oldName> <newName>] [-rm [-f] [-r|-R] [-skipTrash] [-safely] <src> ...] [-rmdir [--ignore-fail-on-non-empty] <dir> ...] [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]] [-setfattr {-n name [-v value] | -x name} <path>] [-setrep [-R] [-w] <rep> <path> ...] [-stat [format] <path> ...] [-tail [-f] <file>] [-test -[defsz] <path>] [-text [-ignoreCrc] <src> ...] [-touchz <path> ...] [-truncate [-w] <length> <path> ...] [-usage [cmd ...]] Notice how the top usage line doesn't mention hdfs dfs but instead hadoop fs . You'll find that either prefix will provide the same functionality if you're working with HDFS as an endpoint. $ hdfs dfs -ls / $ hadoop fs -ls /

A Golang-based HDFS Client In 2014, Colin Marc started work on a Golang-based HDFS client. This tool has two major features that stand out to me. The first is that there is no JVM overhead so execution begins much quicker than the JVM-based client. Second is the arguments align more closely with the GNU Core Utilities commands like ls and cat . This isn't a drop-in replacement for the JVM-based client but it should be a lot more intuitive for those already familiar with GNU Core Utilities file system commands. The following will install the client on a fresh Ubuntu 16.04.2 LTS system. $ wget -c -O gohdfs.tar.gz \ https://github.com/colinmarc/hdfs/releases/download/v2.0.0/gohdfs-v2.0.0-linux-amd64.tar.gz $ tar xvf gohdfs.tar.gz $ gohdfs-v2.0.0-linux-amd64/hdfs The release also includes a bash completion script. This is handy for being able to hit tab and get a list of commands or to complete a partially typed-out list of arguments. I'll include the extracted folder name below to help differentiate this tool from the Apache HDFS CLI. $ gohdfs-v2.0.0-linux-amd64/hdfs Valid commands: ls [-lah] [FILE]... rm [-rf] FILE... mv [-nT] SOURCE... DEST mkdir [-p] FILE... touch [-amc] FILE... chmod [-R] OCTAL-MODE FILE... chown [-R] OWNER[:GROUP] FILE... cat SOURCE... head [-n LINES | -c BYTES] SOURCE... tail [-n LINES | -c BYTES] SOURCE... du [-sh] FILE... checksum FILE... get SOURCE [DEST] getmerge SOURCE DEST put SOURCE DEST df [-h] As you can see, prefixing many GNU Core Utilities file system commands with the HDFS client will produce the expected functionality on HDFS. $ gohdfs-v2.0.0-linux-amd64/hdfs df -h Filesystem Size Used Available Use% 11.7G 24.0K 7.3G 0% The GitHub homepage for this project shows how listing files can be two orders of magnitude quicker using this tool versus the JVM-based CLI. This start up speed improvement can be handy if HDFS commands are being invoked a lot. The ideal file size of an ORC or Parquet file for most purposes is somewhere between 256 MB and 2 GB and it's not uncommon to see these being micro-batched into HDFS as they're being generated. Below I'll generate a file containing a gigabyte of random data. $ cat /dev/urandom \ | head -c 1073741824 \ > one_gig Uploading this file via the JVM-based CLI took 18.6 seconds on my test rig. $ hadoop fs -put one_gig /one_gig Uploading via the Golang-based CLI took 13.2 seconds. $ gohdfs-v2.0.0-linux-amd64/hdfs put one_gig /one_gig_2

Spotify's Python-based HDFS Client In 2014 work began at Spotify on a Python-based HDFS CLI and library called Snakebite. The bulk of commits on this project we put together by Wouter de Bie and Rafal Wojdyla. If you don't require Kerberos support then the only requirements for this client are the Protocol Buffers Python library from Google and Python 2.7. As of this writing Python 3 isn't supported. The following will install the client on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment. $ sudo apt install \ python \ python-pip \ virtualenv $ virtualenv .snakebite $ source .snakebite/bin/activate $ pip install snakebite This client is not a drop-in replacement for the JVM-based CLI but shouldn't have a steep learning curve if you're already familiar with GNU Core Utilities file system commands. $ snakebite snakebite [general options] cmd [arguments] general options: -D --debug Show debug information -V --version Hadoop protocol version (default:9) -h --help show help -j --json JSON output -n --namenode namenode host -p --port namenode RPC port (default: 8020) -v --ver Display snakebite version commands: cat [paths] copy source paths to stdout chgrp <grp> [paths] change group chmod <mode> [paths] change file mode (octal) chown <owner:grp> [paths] change owner copyToLocal [paths] dst copy paths to local file system destination count [paths] display stats for paths df display fs stats du [paths] display disk usage statistics get file dst copy files to local file system destination getmerge dir dst concatenates files in source dir into destination local file ls [paths] list a path mkdir [paths] create directories mkdirp [paths] create directories and their parents mv [paths] dst move paths to destination rm [paths] remove paths rmdir [dirs] delete a directory serverdefaults show server information setrep <rep> [paths] set replication factor stat [paths] stat information tail path display last kilobyte of the file to stdout test path test a path text path [paths] output file in text format touchz [paths] creates a file of zero length usage <cmd> show cmd usage The client is missing certain verbs that can be found in the JVM-based client as well as the Golang-based client described above. One of which is the ability to copy files and streams to HDFS. That being said I do appreciate how easy it is to pull statistics for a given file. $ snakebite stat /one_gig access_time 1539530885694 block_replication 1 blocksize 134217728 file_type f group supergroup length 1073741824 modification_time 1539530962824 owner mark path /one_gig permission 0644 To collect the same information with the JVM client would involve several commands. Their output would also be harder to parse than the key-value pairs above. As well as being a CLI tool, Snakebite is also a Python library. $ python from snakebite.client import Client client = Client ( "localhost" , 9000 , use_trash = False ) [ x for x in client . ls ([ '/' ])][: 2 ] [{ 'access_time' : 1539530885694 L , 'block_replication' : 1 , 'blocksize' : 134217728 L , 'file_type' : 'f' , 'group' : u 'supergroup' , 'length' : 1073741824 L , 'modification_time' : 1539530962824 L , 'owner' : u 'mark' , 'path' : '/one_gig' , 'permission' : 420 }, { 'access_time' : 1539531288719 L , 'block_replication' : 1 , 'blocksize' : 134217728 L , 'file_type' : 'f' , 'group' : u 'supergroup' , 'length' : 1073741824 L , 'modification_time' : 1539531307264 L , 'owner' : u 'mark' , 'path' : '/one_gig_2' , 'permission' : 420 }] Note I've asked to connect to localhost on TCP port 9000. Out of the box Hadoop uses TCP port 8020 for the NameNode RPC endpoint. I've often changed this to TCP port 9000 in many of my Hadoop guides. You can find the hostname and port number configured for this end point on the master HDFS node. Also note that for various reasons HDFS, and Hadoop in general, need to use hostnames rather than IP addresses. $ sudo vi /opt/hadoop/etc/hadoop/core-site.xml ... <property> <name> fs.default.name </name> <value> hdfs://localhost:9000 </value> </property> ...

Python-based HdfsCLI In 2014, Matthieu Monsch also began work on a Python-based HDFS client called HdfsCLI. Two features this client has over the Spotify Python client is that it supports uploading to HDFS and Python 3 (in addition to 2.7). Matthieu has previously worked at LinkedIn and now works for Google. The coding style of this project will feel very familiar to anyone that's looked at a Python project that has originated from Google. This library includes support for a progress tracker, a fast AVRO library, Kerberos and Pandas DataFrames. The following will install the client on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment. $ sudo apt install \ python \ python-pip \ virtualenv $ virtualenv .hdfscli $ source .hdfscli/bin/activate $ pip install 'hdfs[dataframe,avro]' In order for this library to communicate with HDFS, WebHDFS needs to be enabled on the master HDFS node. $ sudo vi /opt/hadoop/etc/hadoop/hdfs-site.xml <?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name> dfs.datanode.data.dir </name> <value> /opt/hdfs/datanode </value> <final> true </final> </property> <property> <name> dfs.namenode.name.dir </name> <value> /opt/hdfs/namenode </value> <final> true </final> </property> <property> <name> dfs.replication </name> <value> 1 </value> </property> <property> <name> dfs.webhdfs.enabled </name> <value> true </value> </property> <property> <name> dfs.namenode.http-address </name> <value> localhost:50070 </value> </property> </configuration> With the configuration in place the DFS service needs to be restarted. $ sudo su $ /opt/hadoop/sbin/stop-dfs.sh $ /opt/hadoop/sbin/start-dfs.sh $ exit A configuration file is needed to store connection settings for this client. $ vi ~/.hdfscli.cfg [global] default.alias = dev [dev.alias] url = http://localhost:50070 user = mark A big downside of the client is that only a very limited subset of HDFS functionality is supported. That being said the client verb arguments are pretty self explanatory. $ hdfscli --help HdfsCLI: a command line interface for HDFS. Usage: hdfscli [interactive] [-a ALIAS] [-v...] hdfscli download [-fsa ALIAS] [-v...] [-t THREADS] HDFS_PATH LOCAL_PATH hdfscli upload [-sa ALIAS] [-v...] [-A | -f] [-t THREADS] LOCAL_PATH HDFS_PATH hdfscli -L | -V | -h Commands: download Download a file or folder from HDFS. If a single file is downloaded, - can be specified as LOCAL_PATH to stream it to standard out. interactive Start the client and expose it via the python interpreter (using iPython if available). upload Upload a file or folder to HDFS. - can be specified as LOCAL_PATH to read from standard in. Arguments: HDFS_PATH Remote HDFS path. LOCAL_PATH Path to local file or directory. Options: -A --append Append data to an existing file. Only supported if uploading a single file or from standard in. -L --log Show path to current log file and exit. -V --version Show version and exit. -a ALIAS --alias=ALIAS Alias of namenode to connect to. -f --force Allow overwriting any existing files. -s --silent Don't display progress status. -t THREADS --threads=THREADS Number of threads to use for parallelization. 0 allocates a thread per file. [default: 0] -v --verbose Enable log output. Can be specified up to three times (increasing verbosity each time). Examples: hdfscli -a prod /user/foo hdfscli download features.avro dat/ hdfscli download logs/1987-03-23 - >>logs hdfscli upload -f - data/weights.tsv <weights.tsv HdfsCLI exits with return status 1 if an error occurred and 0 otherwise. The following finished in 68 seconds. This is an order of magnitude slower than some other clients I'll explore in this post. $ hdfscli upload \ -t 4 \ -f - \ one_gig_3 < one_gig That said, the client is very easy to work with in Python. $ python from hashlib import sha256 from hdfs import Config client = Config () . get_client ( 'dev' ) [ client . status ( uri ) for uri in client . list ( '' )][: 2 ] [{ u 'accessTime' : 1539532953515 , u 'blockSize' : 134217728 , u 'childrenNum' : 0 , u 'fileId' : 16392 , u 'group' : u 'supergroup' , u 'length' : 1073741824 , u 'modificationTime' : 1539533029897 , u 'owner' : u 'mark' , u 'pathSuffix' : u '' , u 'permission' : u '644' , u 'replication' : 1 , u 'storagePolicy' : 0 , u 'type' : u 'FILE' }, { u 'accessTime' : 1539533046246 , u 'blockSize' : 134217728 , u 'childrenNum' : 0 , u 'fileId' : 16393 , u 'group' : u 'supergroup' , u 'length' : 1073741824 , u 'modificationTime' : 1539533114772 , u 'owner' : u 'mark' , u 'pathSuffix' : u '' , u 'permission' : u '644' , u 'replication' : 1 , u 'storagePolicy' : 0 , u 'type' : u 'FILE' }] Below I'll generate a SHA-256 hash of a file located on HDFS. with client . read ( '/user/mark/one_gig' ) as reader : print sha256 ( reader . read ()) . hexdigest ()[: 6 ] ab5f97

Apache Arrow's HDFS Client Apache Arrow is a cross-language platform for in-memory data headed by Wes McKinney. It's Python bindings "PyArrow" allows Python applications to interface with a C++-based HDFS client. Wes stands out in the data world. He has worked for Cloudera in the past, created the Pandas Python package and has been a contributor to the Apache Parquet project. The following will install PyArrow on a fresh Ubuntu 16.04.2 LTS system using a Python virtual environment. $ sudo apt install \ python \ python-pip \ virtualenv $ virtualenv .pyarrow $ source .pyarrow/bin/activate $ pip install pyarrow The Python API behaves in a clean and intuitive manor. $ python from hashlib import sha256 import pyarrow as pa hdfs = pa . hdfs . connect ( host = 'localhost' , port = 9000 ) with hdfs . open ( '/user/mark/one_gig' , 'rb' ) as f : print sha256 ( f . read ()) . hexdigest ()[: 6 ] ab5f97 The interface is very performant as well. The following completed in 6 seconds. This is the fastest any client transfer this file on my test rig. with hdfs . open ( '/user/mark/one_gig_4' , 'wb' ) as f : f . write ( open ( '/home/mark/one_gig' ) . read ())