
The Power of Ruby

High-level languages, like Ruby, are amazing. In just under 500 lines of code, I was able to cobble together a system that catalogs and distributes files across a network to other computers. Each instance of this code can act as a server and a client, depending on configuration. And, because it’s written in Ruby, it’s cross-platform and easily enhanced.

I’m not saying this in a braggadocious manner. I am not the best Ruby programmer by a long shot. I’m sure more competent Ruby developers will look at the code presented in this article and scratch their heads more than once. And yet with very little code I can emulate some of the behavior of much more complicated systems.

The true hero in this story is this beautifully expressive language.

Let’s Get Synchronized

Distributed file sync systems are incredibly useful for a wide variety of applications. Before these file sharing schemes, users mounted network drives and hoped for the best. But in today’s mobile world, that paradigm isn’t practical. We’re not always connected, and even if we are, we’re frequently changing networks.

Dropbox filled the gap quite nicely, allowing users to back up and share files without having to depend on a static file share. Multiple users could open files at the same time, avoiding the dreaded “sharing violation” error.

RDFS aims to provide a simple, open source solution to synchronize a folder on your drive with other computers. When you first run RDFS, it creates a folder named “rdfs” in your home folder. This folder is used for syncing. Anything you place here will be copied to other computers running this software.

Installing RDFS

To install RDFS, open a terminal and navigate to the folder one level above where you want to install it. Then, run:

git clone https://github.com/sourcerer-io/rdfs

It will be cloned into a folder named “rdfs” inside the folder where you run the command.

You need Ruby installed on your system. If you use Mac or Linux, your system may already have it. To see if this is the case, run:

ruby -v

in your terminal. You’ll need version 1.9.3 or higher, but 2.x is strongly recommended. If you get a “command not found” error, you need to install Ruby.

You can download it here:

https://www.ruby-lang.org/en/

Once you have installed Ruby, you need to install a few gems that RDFS requires. Instructions on doing this vary, but on Debian-based Linux distributions (Debian, Ubuntu, Linux Mint, etc.), you’d run:

apt install ruby-sqlite3 ruby-daemons

On Windows and Mac you should be able to run:

gem install sqlite3 daemons

If the gem command doesn’t work, you may not have Ruby installed correctly, or you may need to reboot (especially on Windows).

Since the code with comments is quite long, only select snippets are shown in this article. To see the full source code, please check out this GitHub repo:

https://github.com/sourcerer-io/rdfs

Using RDFS

To start RDFS, run:

ruby rdfsctl.rb start

On Linux and macOS, you can run:

chmod +x rdfsctl.rb

so that you may simply use:

./rdfsctl.rb start

On Windows you must use the full command.

RDFS Goals

When planning this project, I originally thought of a way to duplicate the behavior of Dropbox. That’s a lofty goal, seeing as how Dropbox is not only a wonderfully efficient cross-platform client, but also a cloud-based server infrastructure that provides versioning and syncing with a wide range of apps and interfaces. Emulating Dropbox was out of the question, but borrowing some of its ideas was certainly feasible.

A short list of features included:

· Ability to sync together a folder on multiple machines, regardless of operating system and underlying file system.

· Ability to add nodes (i.e. clients) on the fly without restarting.

· Dealing gracefully with network issues.

· To protect data, focus on additions not deletions. Deletions would propagate to nodes but if they failed, ignore the situation rather than try to continuously delete files.

· Provide de-duplication at the file level and, eventually, the block level.

The current code meets many of these goals. The most difficult one to accomplish is de-duplication. Right now, it’s working at a file level. More on that later.

File Synchronization

The principal goal of the project is to provide the ability to sync files from one machine to another in an automatic fashion. To accomplish this, I periodically perform a file and directory traversal and compare it to the database to see if any files have been added, modified, or deleted. If so, these changes are noted and pushed to client nodes.

Adding Nodes on the Fly

The ability to add other nodes without restarting is an important feature. Not only does this provide convenience to the user, but the grand scheme down the road is to allow nodes to discover one another automatically. For this scheme to work, adding a node on the fly is necessary. Currently, this is done by making an API call over HTTP.

Dealing with Network Issues

Nodes can’t be guaranteed to always be available, so gracefully dealing with network outages or down links is essential. This is accomplished via Ruby’s excellent exception handling capabilities. If an API call between nodes fails, the call is simply retried indefinitely until it succeeds.
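The retry idea can be sketched in a few lines. This is a minimal illustration, not the code from the repository: the helper name try_node_call and the simulated outage are hypothetical. The helper rescues the error and reports failure, so the calling loop can simply try again on its next pass.

```ruby
# Hypothetical helper: run a network call and report whether it worked.
# Any StandardError (timeouts, refused connections, etc.) is rescued
# so that the caller can retry on its next pass.
def try_node_call
  yield
  true
rescue StandardError => e
  warn "node call failed: #{e.message}"
  false
end

# Simulate a node that is unreachable for the first two attempts.
attempts = 0
succeeded = false
until succeeded
  attempts += 1
  succeeded = try_node_call do
    raise Errno::ECONNREFUSED, "node offline" if attempts < 3
  end
  # A real loop would sleep between passes instead of spinning.
end

puts "Succeeded after #{attempts} attempts"
```

Rescuing StandardError rather than every exception leaves signals such as CTRL+C free to stop the program.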

Data Protection

The most dangerous operation RDFS performs is the deleting of files or directories. To be safe, directories with files are never deleted. If a directory isn’t empty due to an issue with syncing, the directory removal call will fail. It’s better to have extra files on a node than to have inadvertently deleted data.

Data Deduplication

If two files are the same but have different names, only one will be transmitted. Breaking a file into blocks and then de-duplicating the blocks is an extremely efficient method, but also more difficult and CPU intensive as you must get an SHA256 hash of each block of a file, not just the entire file itself. However, the savings in bandwidth can be substantial, especially on large files. While block-level data deduplication was planned, as of this writing, I haven’t implemented it.
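Since block-level de-duplication isn’t implemented in RDFS, the following is only a sketch of the idea under stated assumptions: the method name block_hashes and the 4096-byte block size are hypothetical choices. Each fixed-size block is hashed on its own, and only blocks with a hash not seen before would need to be sent.

```ruby
require 'digest'
require 'stringio'

# Split an IO stream into fixed-size blocks and return the SHA256
# hex digest of each block, in order. The default block size is an
# arbitrary value chosen for illustration.
def block_hashes(io, block_size = 4096)
  hashes = []
  while (chunk = io.read(block_size))
    hashes << Digest::SHA256.hexdigest(chunk)
  end
  hashes
end

# Three blocks, two of which are identical.
data = ("A" * 4096) + ("B" * 4096) + ("A" * 4096)
hashes = block_hashes(StringIO.new(data))

puts "Blocks in file: #{hashes.size}"
puts "Unique blocks:  #{hashes.uniq.size}"
```

The extra CPU cost mentioned above is visible here: every block is hashed, whereas file-level de-duplication hashes each file once.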

Current Status

The code presented both in this article and in the GitHub repository is of alpha to beta quality. It is functional, but some planned features are missing, and there will undoubtedly be some bugs. Production use is not recommended at this time for this reason, but as a technology demonstration and proof of concept it works well. In time, the bugs will be ironed out and features added, and I hope to follow up on this development in future articles.

Code Walkthrough

The primary reason RDFS was developed was to showcase Ruby’s amazingly expressive syntax and to demystify the basic concept behind some distributed file systems. Up to now we’ve focused on features, so let’s dive into the specifics.

What’s in a Namespace?

The project is split into multiple files for readability. The primary program, rdfs.rb, includes the necessary files and dependencies with require and require_relative calls. The require_relative call works just like require except that it resolves paths relative to the directory of the file that contains the call.

Next, we use the module statement to define that all code exists within the RDFS module. This provides a way to define a namespace for our work. While not essential, if RDFS is included in any other project, it prevents class, method, and variable name clashes. The included classes in the lib folder also follow this scheme so that all classes stay within the RDFS module.
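A small sketch shows why the namespace helps. The class bodies here are placeholders invented for illustration, and OtherProject is a hypothetical second library; the point is only that two classes with the same name can coexist when each lives in its own module.

```ruby
# Placeholder class inside the RDFS namespace (illustrative only).
module RDFS
  class Server
    def name
      "RDFS server"
    end
  end
end

# A hypothetical second project that also defines a Server class.
module OtherProject
  class Server
    def name
      "some other server"
    end
  end
end

# The module prefix selects which Server is meant.
puts RDFS::Server.new.name
puts OtherProject::Server.new.name
```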

The CTRL+C Handler

After the RDFS namespace is defined, we replace the CTRL+C handler. I originally added this because I had an embedded instance of IRB (Interactive Ruby, Ruby’s command-line shell) while developing and debugging the application. Technically, I could get by without it, but it does serve to properly return code 130 upon CTRL+C and provides compatibility with a future use of an embedded command interpreter such as IRB.

trap("SIGINT") do
  puts "\nRDFS Shutdown via CTRL+C."
  exit 130
end

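With “trap”, we can intercept operating system signals such as SIGINT, which is what CTRL+C sends. We don’t have to exit, but since the user expects CTRL+C to do so, we exit with code 130, the conventional value for a process ended by SIGINT (128 plus signal number 2).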
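The number 130 isn’t arbitrary. As a quick sketch (the method name exit_code_for is made up for this example), shells conventionally report 128 plus the signal number for a process ended by a signal, and Ruby’s Signal.list tells us the number for SIGINT:

```ruby
# Signal.list maps signal names to their numbers on this platform.
sigint = Signal.list["INT"]

# Shells conventionally report 128 + signal number for a process
# that was ended by a signal.
def exit_code_for(signal_number)
  128 + signal_number
end

puts "SIGINT is signal #{sigint}"
puts "Conventional exit code: #{exit_code_for(sigint)}"
```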

Constantly Constant

Now it’s time for constants. In Ruby, a constant is any name that begins with an uppercase letter; by convention, constants like RDFS_DEBUG are written entirely in uppercase. Technically I could have gotten rid of the RDFS_ prefix since we’re using namespaces, but it’s an old habit from other languages and it doesn’t harm anything.

The path for the RDFS folder (i.e. the folder to watch and sync) should be moved to either the command line or a config file with a default value of $HOME/rdfs. However, during development the constant value worked best and for now that’s what we have. I will be adding command line switches to improve this process.

This section also contains SQL commands to create the necessary tables in the SQLite3 database. This database stores a list of all files and the nodes that instance of RDFS is going to send its changes to when files are updated or deleted.

I chose 10 and 5 seconds for the updater and transmitter thread timeouts respectively and chose port 47656 for the server thread to listen on to receive API commands. The port number is a thinly veiled Star Trek Voyager reference. Let me know if you spot it!

I’m a Logger and That’s OK

Logging is incredibly useful during development, especially with programs that perform tasks in the background. Ruby comes with an excellent logging system, so it’s initialized early (after the constants are defined) so that debug messages can be logged.

# Set up logging
require 'logger'

logger = Logger.new(STDOUT)
if RDFS_DEBUG
  # Note the single =: this assigns the level rather than comparing it
  logger.level = Logger::DEBUG
else
  logger.level = Logger::WARN
end

In the above example, the logger level is set to WARN or DEBUG depending on the constant RDFS_DEBUG. Now we can log with ease.

# Log an INFO message
logger.info("This is an info message")

# Log a WARN message
logger.warn("This is a warning!")

# Spew DEBUG info
logger.debug("Blah blah ya whatever")

Bootstrapping

If you’re not used to Ruby, you might find the keyword “unless” quite interesting. Rather than use “if”, I chose “unless” when checking for the RDFS folder and the database, because the code inside should run only when the condition is false (i.e. those items don’t exist). Yes, you can negate a condition with an exclamation point, but the unless keyword is easier to follow.

# Does file storage area exist? If not, create it.
# (Dir.exist? replaces the deprecated Dir.exists?)
unless Dir.exist?(RDFS_PATH)
  Dir.mkdir(RDFS_PATH)
  logger.info("RDFS directory " + RDFS_PATH + " not found, so it was created.")
end
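To try the “unless” pattern in isolation, here is a small self-contained sketch. The method name ensure_directory is hypothetical, and a temporary folder stands in for the real RDFS path.

```ruby
require 'tmpdir'

# Create the directory only if it is missing.
# Returns true when the directory had to be created.
def ensure_directory(path)
  created = false
  # Reads as: "unless the directory exists, create it".
  # The equivalent with if would be: if !Dir.exist?(path)
  unless Dir.exist?(path)
    Dir.mkdir(path)
    created = true
  end
  created
end

Dir.mktmpdir do |tmp|
  target = File.join(tmp, "rdfs")
  puts ensure_directory(target)  # first call has to create the folder
  puts ensure_directory(target)  # second call finds it already there
end
```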

SQLite

I knew I wanted some sort of local storage of data that would persist between executions of the RDFS daemon. Storing it in memory would have been faster, but then that information would have to be rebuilt upon each start of the process. To avoid that, I chose to use an SQL database.

I could have used a more robust system like MariaDB, but SQLite made a far more portable and lightweight alternative. I didn’t want the user to have to install a large database system just to run this program, so a single-file system was ideal.

In this line of code:

RDFS_DB = SQLite3::Database.open RDFS_DB_FILE

The RDFS_DB constant becomes the database handle.

Threading the Needle

Ruby has beautiful, built-in support for threading. This threading system has a checkered past, though. In the early days, Ruby used “green” threads, meaning the thread execution was scheduled by the Ruby interpreter and not the underlying operating system. As such, programs couldn’t take advantage of multiple CPUs simply by using the built-in threading system.

In Ruby 1.9, this changed. Native threads were supported, and performance in this area continues to improve. (The standard interpreter still uses a global lock that lets only one thread execute Ruby code at a time, so threads help most with I/O-bound work such as ours.) With a simple “Thread.new”, we can spin up a new thread inline without even having to define a function. In RDFS, we use this to spin up new threads, defined by classes, in the background. Internally, those classes have a while loop that keeps them running with a periodic sleep schedule.

In those classes, I use Thread.pass as well as sleep. I probably don’t need both, but I added Thread.pass to debug some concurrency issues and decided to leave it in. More idle time slots passed to other threads can’t be a bad thing.

Ruby threading is blissfully simple:

my_thread = Thread.new do
  100.times do
    puts "Hello, World!"
  end
end

# Wait for thread to finish
my_thread.join
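The looping background classes described earlier follow roughly this shape. This is a simplified sketch rather than the RDFS classes themselves: the name PeriodicWorker, the stop method, and the short interval are all assumptions made for the example.

```ruby
# Hypothetical worker: runs a block on its own thread, pausing
# between passes, until asked to stop.
class PeriodicWorker
  def initialize(interval, &task)
    @interval = interval
    @task = task
    @running = true
    @thread = Thread.new { run }
  end

  # Ask the loop to finish and wait for the thread to exit.
  def stop
    @running = false
    @thread.join
  end

  private

  def run
    while @running
      @task.call
      Thread.pass      # give other threads a chance to run
      sleep @interval  # then wait for the next pass
    end
  end
end

passes = 0
worker = PeriodicWorker.new(0.01) { passes += 1 }
sleep 0.1
worker.stop
puts "Worker completed #{passes} passes"
```

Checking a flag on each pass lets the thread finish its current pass cleanly instead of being killed mid-operation.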

The Updater

The updater is both a class and a separate thread. A new instance of this class is instantiated and spun off on its own thread. The kernel of this class then enters a loop where it is delayed by the RDFS_UPDATE_FREQ (in seconds). On each pass of the internal block, it does three key things: check for deleted files, check for files that were added, and check for updated files. A separate and private logger instance is also set up for debugging purposes.

Deleted Files

First, the updater checks for deleted files. It does this by iterating through the files in the database that aren’t already marked as updated or deleted and checking to see if those files exist on disk. If they don’t, they must have been deleted, and the deleted flag is set for that file. The transmitter will subsequently request other node servers to delete that file or directory.



def check_for_deleted_files
  # Check for deleted files
  sql = "SELECT name FROM files WHERE updated = 0 AND deleted = 0"
  @logger.debug("updater: " + sql)
  all_files = RDFS_DB.execute(sql)
  if all_files.count > 0
    all_files.each do |f|
      filename = f[0]
      full_filename = RDFS_PATH + "/" + filename
      unless File.exist?(full_filename)
        # File doesn't exist, so mark it deleted.
        # The ? placeholder keeps quotes in a filename from breaking the SQL.
        sql = "UPDATE files SET deleted = 1 WHERE name = ?"
        @logger.debug("updater: " + sql + " [" + filename + "]")
        RDFS_DB.execute(sql, [filename])
      end
    end
  end
end

Added Files

Next, the updater checks for files that were added since the last scan. This is done by walking the directory tree and looking for files that aren’t in the database. Any such file is added to the database, and its update flag is set so that the transmitter will send it to other nodes.

Updated Files

If a file is already in the database, it is still hashed via the SHA256 algorithm. This is done internally by Ruby without the aid of outside tools (such as sha256sum on Linux and Mac), helping to make this program cross-platform. If the hash differs from the one stored in the database, the file has changed, so the updated flag is set and the file is queued for the transmitter to send to other nodes.

This is accomplished with this line:

# sha256 is filled with the SHA256 hash of filename

sha256 = Digest::SHA256.file(filename).hexdigest
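Putting the added and updated checks together, a scan might look like the following. This is a sketch under assumptions, not the updater’s actual code: the method name scan_for_changes is hypothetical, and a plain Hash of relative path to SHA256 digest stands in for the files table.

```ruby
require 'digest'
require 'find'
require 'tmpdir'

# Walk root and compare each file against catalog, a Hash of
# relative path => SHA256 hex digest (a stand-in for the files table).
# Returns the relative paths that are new and those whose hash changed.
def scan_for_changes(root, catalog)
  added = []
  changed = []
  Find.find(root) do |path|
    next unless File.file?(path)
    relative = path.sub(root + "/", "")
    sha256 = Digest::SHA256.file(path).hexdigest
    if !catalog.key?(relative)
      added << relative
    elsif catalog[relative] != sha256
      changed << relative
    end
  end
  [added.sort, changed.sort]
end

Dir.mktmpdir do |root|
  File.write(File.join(root, "old.txt"), "version 2")
  File.write(File.join(root, "new.txt"), "hello")
  catalog = { "old.txt" => Digest::SHA256.hexdigest("version 1") }

  added, changed = scan_for_changes(root, catalog)
  puts "Added:   #{added.inspect}"
  puts "Changed: #{changed.inspect}"
end
```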

The Transmitter

The transmitter class is structured in much the same way as the updater class. A private logger instance is set up, and an internal kernel loop sleeps for RDFS_TRANSMIT_FREQ seconds, then iterates through the database and checks for updated and deleted flags. If these flags are set on a file, each node is iterated and contacted with the appropriate API call.

If the actions are taken successfully (i.e. the nodes were reached), the update flag is cleared. If not, the attempt is retried on the next pass of the transmitter kernel loop.

Adding and Updating Files

In the case of an addition (an updated or new file), the nodes are called first with the “add_query” call, which asks the node to check and see if any files exist with the same SHA256 hash. If so, rather than send the file, the file is locally copied on the node to avoid transmitting a file across the network whose contents already exist there under another name. This is done via the “add_dup” call. This provides basic data deduplication in transit and saves bandwidth.

If the SHA256 hash isn’t present on the node, then the file is transmitted via the API call “add”.
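The decision between “add_dup” and “add” can be sketched without any networking. In this illustration the method bodies and the call_for helper are hypothetical, and a Hash of SHA256 digest to file name stands in for the node’s database.

```ruby
require 'digest'
require 'fileutils'
require 'tmpdir'

# Node side: answer "add_query". index is a Hash of SHA256 digest =>
# relative path of a file the node already holds.
def add_query(index, sha256)
  index.key?(sha256)
end

# Node side: handle "add_dup" by copying the matching local file
# to the new name instead of receiving it over the network.
def add_dup(root, index, sha256, new_name)
  FileUtils.cp(File.join(root, index[sha256]), File.join(root, new_name))
end

# Sender side: pick the API call to make for one file.
def call_for(index, sha256)
  add_query(index, sha256) ? "add_dup" : "add"
end

Dir.mktmpdir do |root|
  File.write(File.join(root, "report.txt"), "same contents")
  sha256 = Digest::SHA256.hexdigest("same contents")
  index = { sha256 => "report.txt" }

  puts call_for(index, sha256)                           # content already on the node
  puts call_for(index, Digest::SHA256.hexdigest("new"))  # content must be sent

  add_dup(root, index, sha256, "copy-of-report.txt")
  puts File.read(File.join(root, "copy-of-report.txt"))
end
```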

Deleting Files

If the file is marked for deletion, the API call “delete” is sent.



if (deleted != 0)
  # DELETED
  begin
    response = Net::HTTP.post_form(uri,
      'api_call' => 'delete',
      'filename' => filename)
    if response.body.include?("OK")
      clear_update_flag(filename)
    end
  rescue
    @logger.debug("transmitter: Unable to connect to node at IP " + ip + ".")
  end
end

The Server

I considered several different frameworks for the server, but in the end I settled on WEBrick, the web server from Ruby’s standard library (on Ruby 3.0 and later it is a separate gem, installed with gem install webrick). With just a few lines of code, one can not only spin up a fully capable web server but also “mount” classes as URL paths, allowing those classes to handle HTTP calls such as GET and POST. This eliminates the burden of devising a custom network communication scheme. While WEBrick isn’t as fast as Apache or Nginx, it does well for what it is, and certainly works for our purposes.

The server class differs somewhat from the other classes in that rather than being served by an internal kernel loop, the WEBrick server is instantiated and the /nodes and /files paths are hooked by classes derived from WEBrick::HTTPServlet::AbstractServlet. This provides a base of methods that allow for overriding various web verbs as previously described.



@webrick = WEBrick::HTTPServer.new :Port => RDFS_PORT
@webrick.mount "/nodes", Nodes
@webrick.mount "/files", Files
@webrick.start

class Files < WEBrick::HTTPServlet::AbstractServlet
  # Process a POST request
  def do_POST(request, response)
    status, content_type, body = api_handler(request)
    response.status = status
    response['Content-Type'] = content_type
    response.body = body
  end
  ...

In any case, currently there is no security or SSL to these calls. They are blindly accepted without much verification. Eventually I’d like to add both SSL and authentication, but for now, since the main use case for this is in a local network, we can get by without it.

Nodes Handler

The nodes handler is quite simple: it defines “add_node” and “delete_node”. For now, nothing in RDFS calls these API functions; they must be called with cURL or something similar. The purpose of these functions is to provide an interface for automatic (or perhaps centralized) discovery and management of nodes.

Add a Node

If “add_node” is called, the IP of the node is added to the “nodes” table. Rather than allow an IP to be specified, the IP of the requester is used. This prevents another machine from adding an incorrect node. The source IP can be spoofed, of course, but since security isn’t a focus in this iteration, we do not deal with this possibility. It could be addressed, however, by adding a challenge/response step to this call to ensure the requester node is who it says it is.

Delete a Node

If “delete_node” is called, the node with the IP provided is deleted from the list. Unlike “add_node”, the IP can be specified. This follows the theory that it is better for a node to not receive updates than it is for that node to receive incorrect or unwanted updates, thereby preserving data integrity as much as possible.

Files Handler

The files class provides for interaction with the transmitter. This is where the client node receives its calls and performs actions based on network requests. A case statement is used to divide logic flow between API calls.
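As a rough sketch of that dispatch, the method below branches on the api_call parameter and returns the status, content type, and body triplet that do_POST consumes. The name handle_api_call, the sha256 parameter name, and the response strings are assumptions for illustration; the real handler works on the WEBrick request object and does the actual file work.

```ruby
# Hypothetical dispatcher: takes a Hash of POST parameters and returns
# a [status, content_type, body] triplet.
def handle_api_call(params)
  case params['api_call']
  when 'add_query'
    [200, 'text/plain', "OK: would look up #{params['sha256']}"]
  when 'add', 'add_dup'
    [200, 'text/plain', "OK: would store #{params['filename']}"]
  when 'delete'
    [200, 'text/plain', "OK: would delete #{params['filename']}"]
  else
    [400, 'text/plain', 'ERROR: unknown api_call']
  end
end

status, content_type, body =
  handle_api_call({ 'api_call' => 'delete', 'filename' => 'notes.txt' })
puts "#{status} #{content_type} #{body}"
```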

Adding a File

The transmitter first calls the “add_query” function to determine if a matching SHA256 hash is on the node. If so, the “add_dup” function is called. If not, the “add” function is called, where the entire contents of the file are posted.

If the request to add a file includes a directory path (in other words, it’s not in the RDFS root directory), that directory path is created with the FileUtils.mkdir_p method. This works just like “mkdir -p” in Linux/macOS or “mkdir” in Windows. This way, directories aren’t created until they are absolutely needed.
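Here is a minimal sketch of that step, assuming a hypothetical helper named store_file and a temporary folder in place of the RDFS root:

```ruby
require 'fileutils'
require 'tmpdir'

# Write contents to root/relative_path, creating any missing parent
# directories first (the equivalent of mkdir -p).
def store_file(root, relative_path, contents)
  full_path = File.join(root, relative_path)
  FileUtils.mkdir_p(File.dirname(full_path))
  File.binwrite(full_path, contents)
  full_path
end

Dir.mktmpdir do |root|
  path = store_file(root, "photos/2018/cat.txt", "meow")
  puts File.exist?(path)
  puts File.read(path)
end
```

FileUtils.mkdir_p does nothing when the directory already exists, so it is safe to call before every write.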

If a file is added, it’s hashed and added to the database with flags cleared.

Deleting a File

If the transmitter calls for the server to delete a file, it does so via the FileUtils.rm_f method. This is like calling “rm -f” in Linux/macOS or “Remove-Item -Force” in Windows PowerShell. Note that if a directory is to be deleted, it is not forced. This is a protection mechanism in case there are files in that directory that weren’t previously cleared. This shouldn’t happen, but the goal is to preserve data in case of ambiguity.
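The difference between the forced file removal and the unforced directory removal can be seen in a short sketch. The helper name remove_directory_if_empty is hypothetical; it relies on Dir.rmdir refusing to remove a directory that still contains files.

```ruby
require 'fileutils'
require 'tmpdir'

# Remove a directory only when it is empty. Returns true if it was
# removed, false if it still contains files.
def remove_directory_if_empty(path)
  Dir.rmdir(path)
  true
rescue Errno::ENOTEMPTY, Errno::EEXIST
  # Some platforms report a non-empty directory as EEXIST
  false
end

Dir.mktmpdir do |root|
  dir = File.join(root, "photos")
  Dir.mkdir(dir)
  file = File.join(dir, "cat.jpg")
  File.write(file, "data")

  puts remove_directory_if_empty(dir)  # directory still holds a file
  FileUtils.rm_f(file)                 # forced file removal
  FileUtils.rm_f(file)                 # a second call is harmless
  puts remove_directory_if_empty(dir)  # now the directory is empty
end
```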

The Daemon Controller

The rdfsctl.rb file is the start script that contains the functionality necessary to run RDFS in the background as a daemon. Via the daemons Ruby gem, you can run rdfsctl.rb with any number of options like start, stop, status, and restart. This functionality is provided in one beautiful line in the rdfsctl.rb file, its only line of actual execution.

Performance

Performance was not a core design goal of RDFS. I knew there was no way that a Ruby script constructed in 500 lines over a few days could possibly compete with other similar systems. However, given the native threading, daemonization, and rudimentary transmitting de-duplication, performance isn’t terrible.

The primary bottleneck that I’d like to fix is the WEBrick server. The server itself is fine, but POSTing data isn’t necessarily efficient for extremely large files. My tests were confined to smaller files (not exceeding several megabytes), and thus storing large, gigabyte sized files in RDFS isn’t advised.

Ruby can chew up a lot of memory, but in my tests, at idle, RDFS consumed between 7 and 13 MB of private memory. I used Valgrind to check for memory leaks and found a few instances where memory could increase, but Ruby’s garbage collector would eventually defragment and release unneeded heap allocations. This isn’t to say RDFS doesn’t leak RAM (I’m sure it does), but in my testing, it wasn’t horribly noticeable.

CPU usage is relatively light, especially at idle. If you’re using Ruby 2.x (and you should be; versions below that are quite old and unsupported), you’ll benefit from native threads (introduced in 1.9) and better garbage collection.

Security

Security was not a goal, and thus I didn’t design it with that in mind. That said, I do take some precautions. Any call to the SQLite3 database that is initiated from outside RDFS uses prepared statements, thereby reducing the risk of SQL injection. Additionally, a node can only be added by that node, barring any kind of IP spoofing.

Given Ruby’s extendable nature, it would be easy to add some basic authentication (or at least challenges) to API calls to ensure they come from where they say. Creating a system of key exchanges upon node registration would help, too.

WEBrick can use SSL, encrypting the communications between nodes. This would help tremendously but setting this up was outside the scope of the original release. This can be easily added, however, by modifying the server class. That said, key/cert pairs must be created, and in production use this may involve a certificate authority, so diving into that rabbit hole was something I wanted to avoid at this early stage.

Bugs and Design Flaws

I’m certain there are plenty. At the time of writing, there are several known issues. For one, I’m certain a race condition will happen if you add more than two nodes. My testing and initial development scope was limited to two nodes. The transmitter iterates through active nodes, but the update flag is only cleared after a successful send to all nodes, so data may be resent to a node that already has the file. By the time of publication, this bug may be fixed. If not, expect an update very soon to address this issue.

Adding Nodes

Additionally, there is no internal way to add nodes. I plan to add this, though. For now, you can either do this via cURL on the command line or by manually adding an IP to the nodes table in SQLite3:

sqlite3 ~/.rdfs.sqlite3

INSERT INTO nodes (ip) VALUES ('1.2.3.4');

Where 1.2.3.4 is the IP to add.

Compression

I had originally planned to compress the data before sending, then decompress it on the server side. This is easily done with Ruby, but I ran into bugs with Zlib headers and had to temporarily disable it. I’ll add it back as soon as I can sort that out.
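For reference, the basic round trip with Ruby’s bundled Zlib looks like this. It is a minimal sketch, not the disabled RDFS code: the method names compress and decompress are hypothetical, and it assumes both ends use the same zlib format (Zlib::Deflate on one side, Zlib::Inflate on the other).

```ruby
require 'zlib'

# Compress a string using the zlib (deflate) format.
def compress(data)
  Zlib::Deflate.deflate(data)
end

# Reverse of compress; expects data produced by Zlib::Deflate.
def decompress(data)
  Zlib::Inflate.inflate(data)
end

original = "RDFS " * 1000
packed = compress(original)
puts "Original: #{original.bytesize} bytes, compressed: #{packed.bytesize} bytes"
puts "Round trip OK: #{decompress(packed) == original}"
```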

Out of Sync

If the client and server nodes aren’t kept in sync, mishaps may occur. They won’t involve data loss, but some files may not be added. I plan to fix this by adding a method to send a full list of files between nodes and let them compare which they need, then mark those files as updated so the transmitter will catch a node up with the other(s).
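The planned comparison could work along these lines. Since this feature doesn’t exist yet, everything here is an assumption: the method name files_to_resend is invented, and each node’s file list is modeled as a Hash of relative path to SHA256 digest.

```ruby
# Each catalog is a Hash of relative path => SHA256 hex digest.
# Returns the paths the remote node is missing or holds with
# different contents, i.e. the ones to mark as updated.
def files_to_resend(local_catalog, remote_catalog)
  stale = local_catalog.reject { |path, sha256| remote_catalog[path] == sha256 }
  stale.keys.sort
end

local  = { "a.txt" => "111", "b.txt" => "222", "c.txt" => "333" }
remote = { "a.txt" => "111", "b.txt" => "999" }

# b.txt differs and c.txt is missing on the remote node
puts files_to_resend(local, remote).inspect
```

Files that exist only on the remote node are ignored here, which matches the goal of favoring additions over deletions.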

The Future of RDFS

This article marks the birth of RDFS. It was a fun project, and I expect to continuously enhance it in the future. It may never graduate from beta status, and may never be a robust, production-ready system, but my goal is that it will serve as a teaching and demonstration tool and allow for easy expansion or integration into other open source projects.