Spash is a command line tool for Big Data platforms that simulates a real Unix environment, providing most of the commands of a typical Bash shell on top of YARN, HDFS and Apache Spark.

Spash uses the HDFS APIs to execute simple file operations and Apache Spark to perform parallel computations on big datasets. With Spash, managing a Big Data cluster becomes as natural as writing bash commands.

Spash is an open source project based on an idea by Nicola Barbieri. The code is available at https://github.com/nerdammer/spash.

Architecture

The Spash daemon runs on an edge node of a Big Data cluster and listens for incoming SSH connections on port 2222. Clients can connect using their OS's native terminal application, or PuTTY on Windows.

The Spash daemon emulates a Unix OS and leverages the power of Spark to perform efficient computations on distributed data.

The World Before Spash

For those who don’t remember the classic way of doing simple operations on HDFS, here’s a reminder:

```
bash $ hdfs dfs -ls /
# ## ### [wait for the JVM to load and execute] ...
##########################################
bash $ hdfs dfs -copyFromLocal myFile /
# ## ### [wait for the JVM to load and execute] ...
##########################################
bash $
```

Spash Makes It Easier

Just run the Spash daemon on a node of your Big Data cluster. You can then connect to it using ssh [email protected] -p 2222 (the password is user) and run all your favourite bash commands to manipulate data.
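For example, a session might look like the following. The file and directory names here are purely illustrative, and the exact set of supported commands depends on the Spash version; the key point is that every path refers to HDFS, not to the local filesystem of the edge node:

```
$ ssh [email protected] -p 2222
user@spash:/$ ls
tmp  user
user@spash:/$ mkdir data
user@spash:/$ cat user/sample.txt
```

Each command returns immediately, without the per-command JVM startup cost of hdfs dfs.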

And this is just the tip of the iceberg. From your host, you can use scp ( scp -P 2222 myfile [email protected]:/ ) to copy a file into HDFS. You can even use FileZilla or WinSCP to browse HDFS.
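Assuming the daemon exposes the standard SFTP subsystem over the same port (an assumption based on the FileZilla/WinSCP support mentioned above), a transfer session from your host might look like this, with hostnames and filenames purely illustrative:

```
$ scp -P 2222 myfile [email protected]:/
$ sftp -P 2222 [email protected]
sftp> ls /
sftp> get /myfile
```

Note that both scp and sftp in OpenSSH take the port with an uppercase -P, placed before the file operands.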

Spash works on Cloudera CDH 5 and many other platforms.

Spash is a proof of concept and needs contributions. See the project home page to learn how to contribute and get involved: https://github.com/nerdammer/spash.