Swift is the software behind the OpenStack Object Storage service.

This service provides a simple storage service for applications using RESTful interfaces, providing maximum data availability and storage capacity.

I explain here how some parts of the storage and replication in Swift works, and show some of its current limitations.

If you don't know Swift and want to read a more "shallow" overview first, you can read John Dickinson's Swift Tech Overview.

How Swift storage works

If we refer to the CAP theorem, Swift chose availability and partition tolerance and dropped consistency. That means that you'll always get your data, they will be dispersed on many places, but you could get an old version of them (or no data at all) in some odd cases (like some server overload or failure). This compromise is made to allow maximum availability and scalability of the storage platform.

But there are mechanisms built into Swift to minimize the potential data inconsistency window: they are responsible for data replication and consistency.

The official Swift documentation explains the internal storage in a certain way, but I'm going to write my own explanation here about this.

Consistent hashing

Swift uses the principle of consistent hashing. It builds what it calls a ring. A ring represents the space of all possible computed hash values divided in equivalent parts. Each part of this space is called a partition.

The following schema (stolen from the Riak project) shows the principle nicely:

In a simple world, if you wanted to store some objects and distribute them on 4 nodes, you would split your hash space in 4. You would have 4 partitions, and computing hash(object) modulo 4 would tell you where to store your object: on node 0, 1, 2 or 3.

But since you want to be able to extend your storage cluster to more nodes without breaking the whole hash mapping and moving everything around, you need to build a lot more partitions. Let's say we're going to build 210 partitions. Since we have 4 nodes, each node will have 210 ÷ 4 = 256 partitions. If we ever want to add a 5th node, it's easy: we just have to re-balance the partitions and move 1⁄4 of the partitions from each node to this 5th node. That means all our nodes will end up with 210 ÷ 5 ≈ 204 partitions. We can also define a weight for each node, in order for some nodes to get more partitions than others.

With 210 partitions, we can have up to 210 nodes in our cluster. Yeepee.

For reference, Gregory Holt, one of the Swift authors, also wrote an explanation post about the ring.

Concretely, when building one Swift ring, you'll have to say how much partitions you want, and this is what this value is really about.

Data duplication

Now, to assure availability and partitioning (as seen in the CAP theorem) we also want to store replicas of our objects. By default, Swift stores 3 copies of every objects, but that's configurable.

In that case, we need to store each partition defined above not only on 1 node, but on 2 others. So Swift adds another concept: zones. A zone is an isolated space that does not depends on other zone, so in case of an outage on a zone, the other zones are still available. Concretely, a zone is likely to be a disk, a server, or a whole cabinet, depending on the size of your cluster. It's up to you to choose anyway.

Consequently, each partitions has not to be mapped to 1 host only anymore, but to N hosts. Each node will therefore store this number of partitions:

number of partition stored on one node = number of replicas × total number of partitions ÷ number of node

Examples:

We split the ring in 210 = 1024 partitions. We have 3 nodes. We want 3 replicas of data.

→ Each node will store a copy of the full partition space: 3 × 210 ÷ 3 = 210 = 1024 partitions .

We split the ring in 211 = 2048 partitions. We have 5 nodes. We want 3 replicas of data.

→ Each node will store 211 × 3 ÷ 5 ≈ 1129 partitions .

We split the ring in 211 = 2048 partitions. We have 6 nodes. We want 3 replicas of data.

→ Each node will store 211 × 3 ÷ 6 = 1024 partitions .

Three rings to rule them all

In Swift, there is 3 categories of thing to store: account, container and objects.

An account is what you'd expect it to be, a user account. An account contains containers (the equivalent of Amazon S3's buckets). Each container can contains user-defined key and values (just like a hash table or a dictionary): values are what Swift call objects.

Swift wants you to build 3 different and independent rings to store its 3 kind of things (accounts, containers and objects).

Internally, the two first categories are stored as SQLite databases, whereas the last one is stored using regular files.

Note that this 3 rings can be stored and managed on 3 completely different set of servers.

Data replication

Now that we have our storage theory in place (accounts, containers and objects distributed into partitions, themselves stored into multiple zones), let's go the replication practice.

When you put something in one of the 3 rings (being an account, a container or an object) it is uploaded into all the zones responsible for the ring partition the object belongs to. This upload into the different zones is the responsibility of the swift-proxy daemon.

But if one of the zone is failing, you can't upload all your copies in all zones at the upload time. So you need a mechanism to be sure the failing zone will catch up to a correct state at some point.

That's the role of the swift-{container,account,object}-replicator processes. This processes are running on each node part of a zone and replicates their contents to nodes of the other zones.

When they run, they walk through all the contents from all the partitions on the whole file system and for each partition, issue a special REPLICATE HTTP request to all the other zones responsible for that same partition. The other zone responds with information about the local state of the partition. That allows the replicator process to decide if the remote zone has an up-to-date version of the partition.

In case of account and containers, it doesn't check at the partition level, but check each account/container contained inside each partition.

If something is not up-to-date, it will be pushed using rsync by the replicator process. This is why you'll read that the replication updates are "push based" in Swift documentation.

# Pseudo code describing replication process for accounts # The principle is exactly the same for containers for account in accounts: # Determine the partition used to store this account partition = hash(account) % number_of_partitions # The number of zone is the number of replicas configured for zone in partition.get_zones_storing_this_partition(): # Send a HTTP REPLICATE command to the remote swift-account-server process version_of_account = zone.send_HTTP_REPLICATE_for(account): if version_of_account < account.version() account.sync_to(zone)

This replication process is O(number of account × number of replicas). The more your number of account will increase and the more you will want replicas for your data, the more the replication time for your accounts will grow. The same rule applies for containers.

# Pseudo code describing replication process for objects for partition in partitions_storing_objects: # The number of zone is the number of replicas configured for zone in partition.get_zones_storing_this_partition(): # Send a HTTP REPLICATE command to the remote swift-object-server process verion_of_partition = zone.send_HTTP_REPLICATE_for(partition): if version_of_partition < partition.version() # Use rsync to synchronize the whole partition # and all its objects partition.rsync_to(zone)

This replication process is O(number of objects partitions × number of replicas). The more your number of objects partitions will increase, and the more you will want replicas for your data, the more the replication time for your objects will grow.

I think this is something important to know when deciding how to build your Swift architecture. Choose the right number the number of replicas, partitions and nodes.

Replication process bottlenecks

File accesses

The problem, as you might have guessed, is that to replicate, it walks through every damn things, things being accounts, containers, or object's partition hash files. This means it need to open and read (part of) a every file your node stores to check that data need or not to be replicated!

For accounts & containers replication, this is done every 30 seconds by default, but it will likely take more than 30 seconds as soon as you hit around 12 000 containers on a node (see measurements below). Therefore you'll end up checking consistency of accounts & containers on each all node all the time, using obviously a lot of CPU time.

For reference, Alex Yang also did an analysis of that same problem.

TCP connections

Worst, the HTTP connections used to send the REPLICATE commands are not pooled: a new TCP connection is established each time something has to be checked against the same thing stored on a remote zone.

This is why you'll see in the Swift's Deployment Guide this lines listed

under "general system tuning":

# disable TIME_WAIT.. wait.. net.ipv4.tcp_tw_recycle=1 net.ipv4.tcp_tw_reuse=1 # double amount of allowed conntrack net.ipv4.netfilter.ip_conntrack_max = 262144

In my humble opinion, this is more an ugly hack than a tuning. If you don't activate this and if you have a lot of containers on your node, you'll end up soon with thousands of connections in TIME_WAIT state, and you indeed risk to overload the IP conntrack module.

Container deletion

We also should talk about container deletion. When a user deletes a container from its account, the container is marked as deleted. And that's it. It's not deleted. Therefore the SQLite database file representing the container will continue to be checked for synchronization, over and over.

The only way to have a container permanently deleted is to mark an account as deleted. This way the swift-account-reaper will delete all its containers and, finally, the account.

Measurement

On a pretty big server, I measured the replications to be done at a speed of around 350 {account,container,object-partitions}/second, which can be a real problem if you choose to build a lots of partition and you have a low number_of_node ⁄ number_of_replicas ratio.

For example, the default parameters runs the container replication every 30 seconds. To check replication status of 12 000 containers stored on one node at the speed of 350 containers/seconds, you'll need around 34 seconds to do so. In the end, you'll never stop checking replication of your containers, and the more you'll have containers, the more your inconsistency window will increase.

Conclusion

Until some of the code is fixed (the HTTP connection pooling probably being the "easiest" one), I warmly recommend to choose correctly the different Swift parameters for your setup. The replication process optimization consists in having the minimum amount of partitions per node, which can be done by:

decreasing the number of partitions

decreasing the number of replicas

increasing the number of node

For very large setups, some code to speed up accounts and containers synchronization, and remove deleted containers will be required, but this does not exist yet, as far as I know.