Scaleway Elements C14 Cold Storage is a storage class of Scaleway S3 Object Storage, built for archiving data at a low cost. It is not a product in itself, but rather an extension of our Object Storage: it can only be accessed through the S3 API.

In this article, we will go over how this project was born and how it was developed, along with some technical insights.

Where It All Started

In 2016, Scaleway Dedibox launched an archiving product: the C14 Classic, which is still available. The product was very hardware-centric, mainly built around two major aspects:

Competitive SMR disks,

Motherboards, manufactured internally, that can power on/off disks on demand with a SATA bus tree matrix.

On the API side, C14 Classic was built like a vault: one opens a vault, puts data inside it, then closes the vault (archiving it). The main shortcoming with that design was that you needed to unarchive an entire archive to access a single file in it. In other words, to access a 1 GB file inside a 40 TB archive, you first needed to unarchive 40 TB.

The main concern with this design was the “data front ends”, put another way, the big servers that were keeping the unarchived data. As you can imagine, multiple clients unarchiving multiple multi-terabyte vaults can fill up a server relatively quickly, thus blocking other clients from using the service.

In 2019, an internal Proof of Concept (PoC) was developed to demonstrate that we were able to use the C14 Classic hardware with the new Scaleway Object Storage. There were limitations, of course, but the PoC was very conclusive. Indeed, the C14 Classic hardware was rock solid, and the product turned out to be reliable, with a strong SLA. In addition, it allowed us to reuse 40 PB of C14 hardware that was already in production.

As a result, huge efforts were deployed to transform the PoC into a production-ready project. A year later, the C14 Cold Storage beta was born at Scaleway.

How C14 Cold Storage Was Born

Integration Within S3

First of all, the project was to be used through the S3 API. As a result, we needed a standard way to expose it. Lots of patches were deployed on our S3 gateways to ensure compliance with Amazon’s S3 Glacier storage class.

We also had to start working on a complete lifecycle engine, since that feature is required to use Scaleway Elements C14 Cold Storage.

Learning From Our C14 Classic Mistakes

Backed by three years of operating C14 Classic, we learned a thing or two about archiving data: what works and what does not. The main objective was to build C14 Cold Storage around read optimisation, rather than write optimisation. We also decided to use file systems on the SMR disks, since in those three years, lots of patches were merged into the Linux kernel to optimise filesystem interaction with SMR disks.

Architecture Overview

Internal Insights



Hardware



MrFreeze

MrFreeze is the name of the board that hosts the archiving solution. The board was made in-house, and we did not modify it for the Scaleway Elements C14 Cold Storage. It comes with the following:

A very basic 4-core ARM CPU

2 GB of RAM

2 SATA buses

56 disk slots

4× 1 Gbit/s network ports

As you can see, the main feature of the board is that it hosts a lot of disks, but has only two SATA buses to access them. Therefore, only two disks can be powered on simultaneously. The board itself exposes a SATA-line and power-line API through GPIO ports, and on the software side we keep a “what-to-power” map in order to switch on disk X or Y.
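As an illustration, the “what-to-power” bookkeeping can be sketched as follows. This is a hypothetical sketch, not the actual MrFreeze firmware: the wiring (even slots on bus 0, odd slots on bus 1), the structure names, and the one-GPIO-per-slot assumption are all invented for the example.

```c
#include <assert.h>

#define NB_SLOTS 56
#define NB_BUSES 2

struct power_line {
    int bus;   /* SATA bus (0 or 1) the slot is wired to */
    int gpio;  /* GPIO line that switches the slot's power */
};

/* Even slots on bus 0, odd slots on bus 1 -- an assumed wiring. */
static struct power_line slot_map(int slot)
{
    struct power_line pl;
    pl.bus  = slot % NB_BUSES;
    pl.gpio = slot;  /* one power line per slot in this sketch */
    return pl;
}

/* Which disk is currently powered on each bus: only one per bus. */
static int powered[NB_BUSES] = { -1, -1 };

/* Power on `slot`, powering off whatever occupied its bus first.
 * Returns the slot that had to be powered off, or -1 if none.
 * Real code would drive the GPIOs here and wait ~15s for spin-up. */
static int power_on(int slot)
{
    struct power_line pl = slot_map(slot);
    int old = powered[pl.bus];
    powered[pl.bus] = slot;
    return old;
}
```

The interesting property is visible in the sketch: powering on a third disk necessarily evicts one of the two that are currently spinning.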

One funny quirk of this design is that we do not need to cool down the boards that much: since 54 of the 56 disks are powered down at any given time, heat dissipation works well. In addition, a whole rack with 9.8 PB of storage (22 chassis × 56 disks × 8 TB/disk) consumes less than 600 W.

The main caveat, of course, is that a disk needs some time to be powered on: around 15 seconds from power-on to being mountable from userspace.

SMR Disks

The disks that we use to fill those boards are 8 TB Seagate SMR disks. These disks are built for storage density, and are thus perfect for data archiving. They can be quite slow, especially for writes, but that is a downside shared by all SMR disks.

Location Constraints

The C14 racks are located in our Parisian datacenters, DC2 and DC4, also known as The Bunker. Since one rack weighs around one metric ton, we cannot place them wherever we want; for example, a MrFreeze rack in our AMS datacenter is totally out of the question. So, we need to be able to transfer data from the AMS datacenter (or WAW datacenter) to the ones located in Paris.

Software



Freezer

The Freezer is the software that operates the MrFreeze board. It is responsible for powering disks on and off, and for actually writing data on them. It is a fairly ‘simple’ piece of software, however, since all the database and intelligence live in the worker, which runs on far more powerful machines.

The Freezer communicates with the worker over a TCP connection, and exposes a basic binary API:

typedef enum {
    ICE_PROBE_DISK = 0, /*!< Probe a disk for used and total size */
    ICE_DISK_POWER_ON,  /*!< Power on a disk */
    ICE_DISK_POWER_OFF, /*!< Power off a disk */
    ICE_SEND_DATA,      /*!< Send data */
    ICE_GET_DATA,       /*!< Get data */
    ICE_DEL_DATA,       /*!< Delete data */
    ICE_SATA_CLEAN,     /*!< Clean all the SATA buses, power off everything */
    ICE_FSCK_DISK,      /*!< Request for a filesystem integrity check on a disk */
    ICE_LINK_DATA,      /*!< Link data (zero copy) */
    ICE_ERROR,          /*!< Error reporting */
} packet_type_t;

For simplicity, we logically split the freezer in two parts, one per SATA bus. So in reality, two workers are speaking to one freezer, each one with a dedicated bus. We will not go into details about the freezer, since it mainly dumps data from a socket to an inode (or the opposite for a read), and performs some integrity checks.
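To give an idea of what such a binary API can look like on the wire, here is a hedged sketch of framing a request with the packet_type_t enum above. The header layout (type byte, disk byte, little-endian 32-bit payload length) is an assumption for illustration; the real freezer protocol is not documented here.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef enum {
    ICE_PROBE_DISK = 0,
    ICE_DISK_POWER_ON,
    ICE_DISK_POWER_OFF,
    ICE_SEND_DATA,
    ICE_GET_DATA,
    ICE_DEL_DATA,
    ICE_SATA_CLEAN,
    ICE_FSCK_DISK,
    ICE_LINK_DATA,
    ICE_ERROR,
} packet_type_t;

/* Frame one request: a 6-byte header followed by the payload.
 * Returns the total number of bytes written into `buf`. */
static size_t frame(uint8_t *buf, packet_type_t type, uint8_t disk,
                    const void *payload, uint32_t len)
{
    buf[0] = (uint8_t)type;        /* command, one of packet_type_t */
    buf[1] = disk;                 /* target disk slot on the bus */
    buf[2] = len & 0xff;           /* payload length, little-endian */
    buf[3] = (len >> 8) & 0xff;
    buf[4] = (len >> 16) & 0xff;
    buf[5] = (len >> 24) & 0xff;
    if (len)
        memcpy(buf + 6, payload, len);
    return 6 + len;
}
```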

Integration Within the Object Storage Stack

Before we explain how the actual worker works, we need to explain our Object Storage stack: once an object is created, it is split into erasure-coded chunks (6+3), and a rawx is elected to store each chunk. The rawx is a simple binary, much like the freezer but without the disk-power bit: it simply dumps a socket into an inode, or the opposite. So for each object, a minimum of 9 rawxs are used to store the actual data.

For hot storage, we do not need to go further than that. We have a rawx per disk (mechanical or SSD), which means the chunk is accessible at all times. For C14 Cold Storage, though, we need to go one step further and actually send the data to the archiving disks.

In order to do that, we used the golang rawx code, and added the following logic:

WRITE (PUT): notifies a program through an IPC socket,

READ (GET): checks that the data is actually there, and if it is not, requests it through the IPC,

DELETE: same operation as READ.

The other end of the IPC socket is the actual C14 Cold Storage worker, which talks with all the rawx of a rack; there are 12 mechanical disks per rack, so one worker speaks with 12 different rawxs.

On a WRITE, the worker transfers the data to the freezers with a low priority, actually emptying the inode on the rawx. On a READ, it gets the data back from the freezer to the rawx with high priority, in order for it to be accessible through the usual API.

Worker

The worker is the piece of software that links the object storage with the archiving servers. It receives commands on an IPC socket, and executes them asynchronously.

There are three main types of command that the worker can receive:

CREATE: A chunk has been created, we need to transfer it to the archiving servers when possible,

GET: A chunk is requested, we need to get it from the freezers as soon as possible,

DELETE: A chunk has been deleted, we need to unlink the inode on the freezers as well.

In the case of a CREATE command, the job is executed in a best-effort manner. This is not a blocker, as the data remains available on hot storage if the client needs it, and we have plenty of space on the hot storage hard drives. When the job is executed, we elect a disk to store the chunk. Election is a straightforward process: the goal is to use a minimal number of disks in total, in order not to lose too much time powering disks on and off.

Once we find an appropriate disk, the data is sent to the freezer, and we empty the inode on the hot storage. The inode itself is not deleted, only the content is. We then store the disk number along with other information in a database, in order to be able to retrieve the content later on.
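The election described above could be sketched like this. The fill-the-fullest-disk-first policy, the structures, and the constants are assumptions for illustration, not the production algorithm.

```c
#include <stdint.h>

#define NB_DISKS 22  /* disks handled by one worker, per the text */

struct disk {
    uint64_t used;      /* bytes already archived on this disk */
    uint64_t capacity;  /* total capacity in bytes */
};

/* Elect the fullest disk that can still hold `size` bytes, so that
 * writes keep landing on the same few disks instead of spreading
 * across the rack. Returns the disk index, or -1 if none fits. */
static int elect_disk(const struct disk *d, uint64_t size)
{
    int best = -1;
    for (int i = 0; i < NB_DISKS; i++) {
        if (d[i].capacity - d[i].used < size)
            continue;  /* not enough free space */
        if (best < 0 || d[i].used > d[best].used)
            best = i;  /* prefer the most-filled candidate */
    }
    return best;
}
```

Concentrating writes this way keeps the number of power-on/power-off cycles low, which matters when each spin-up costs around 15 seconds.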

In the case of a GET, the job is executed as soon as we can. It has top priority for scheduling, so the only thing keeping us from executing it is prior GET jobs. Once the job is executed, we simply get the data from the archiving server and fill the inode on the hot-storage hard drive. We then set a timer on that inode, usually around a day, in order to garbage collect it and not fill up space on the hot storage.
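The scheduling rule above (a GET is only ever delayed by earlier GETs) can be sketched with two FIFO queues, one per priority class. The structures and names are hypothetical, for illustration only.

```c
#include <stddef.h>

enum job_type { JOB_CREATE, JOB_GET, JOB_DELETE };

struct job {
    enum job_type type;
    int chunk;          /* identifier of the chunk to act on */
    struct job *next;
};

/* One FIFO per priority class: GETs always run before anything else. */
static struct job *get_q, *other_q;

/* Append a job to the tail of a queue. */
static void push(struct job **q, struct job *j)
{
    j->next = NULL;
    while (*q)
        q = &(*q)->next;
    *q = j;
}

static void submit(struct job *j)
{
    push(j->type == JOB_GET ? &get_q : &other_q, j);
}

/* Pop the next job to execute: drain GETs first, then the rest. */
static struct job *next_job(void)
{
    struct job **q = get_q ? &get_q : &other_q;
    struct job *j = *q;
    if (j)
        *q = j->next;
    return j;
}
```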

It is important to note that these mechanics have nothing to do with the restore route’s “Restore time”, as the restore API call does move the object from GLACIER to STANDARD (i.e. from archiving to hot storage).

As mentioned previously, each worker is linked with one SATA bus of a freezer, so a worker only handles 22 disks in total.

Lifecycle

The lifecycle engine places itself between the API, the rawx and the S3 gateway, creating and deleting bucket rules and scheduling the transition or expiration jobs on the objects matching the rules. The architecture is a distributed system, with a lifecycle master that communicates with the gateway, and schedules and dispatches jobs to lifecycle workers, which are present on all storage pods.

The master keeps track of every job to execute, while a lifecycle worker only executes the tasks it receives, using the storage pod’s capacity as temporary storage while transferring objects to C14 Cold Storage and back. One of the main assets of the engine is that it can wait for workers to put the data back from C14 Cold Storage to the rawx; that is why it is also used to execute the S3 RestoreObject call, which does not need any lifecycle rule on the bucket.
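For reference, this is the kind of standard S3 lifecycle rule the engine processes. The bucket prefix and day counts are made-up values, but the element names follow the S3 lifecycle configuration format:

```xml
<LifecycleConfiguration>
  <Rule>
    <ID>archive-logs</ID>
    <Filter>
      <Prefix>logs/</Prefix>
    </Filter>
    <Status>Enabled</Status>
    <!-- Transition matching objects to cold storage after 30 days -->
    <Transition>
      <Days>30</Days>
      <StorageClass>GLACIER</StorageClass>
    </Transition>
    <!-- Delete them entirely after a year -->
    <Expiration>
      <Days>365</Days>
    </Expiration>
  </Rule>
</LifecycleConfiguration>
```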

Enjoy the new C14 Cold Storage on the Scaleway Console, and check out our documentation to learn more about how to use Scaleway Elements C14 Cold Storage!