It was a day much awaited. It was the day the big meeting happened, during which each team discussed its storage needs and everyone hoped their prayers would be answered. Each team had an opportunity to voice its storage requirements - capacity, type, performance, availability, price, scale - and there was promise that everyone’s requirements would be met. As the meeting progressed, and each team voiced its opinions, the room got hotter with conflict and talk of how each team’s requirements would not satisfy another team’s. Just when the room was about to reach explosive levels, a knight in shining armor rode into the room bearing a flag with the word "Ceph". There was silence before the commotion, but the commotion was different in nature.

My imagination of this scenario: too far fetched? Maybe not.

We’re going through an age of data explosion. Data is generated constantly. According to Statista, as of last year, there were more than 2.51 billion active mobile social accounts globally. Not only is new data constantly generated, there are also new types of data in the landscape - object, file and block. Given that there is ever increasing data and different types of data, this blog is about what Ceph is and how it can help alleviate common enterprise storage concerns. Ceph is open source. It can help avoid vendor lock-in. It is designed from scratch for high availability, with no single point of failure. By design, when any hardware component fails, the storage cluster is still accessible and functional. It is designed so that performance scales with capacity. It is one of the few storage technologies to offer "unified storage", i.e. block, file and object storage.

Let’s delve a little into the three kinds of storage:

Block storage: emulates a physical drive. Your /dev/sda is the classic example of block storage. This data is split into evenly sized blocks, each with its own address.

File storage: Ceph provides a traditional file system interface with POSIX semantics enabling users to use a hierarchy in organizing files and folders. It’s used as a backend for the OpenStack Manila project, offering a shared file system. The traditional equivalents are NFS and CIFS.

Object storage: An object consists of the data itself, its metadata and a globally unique identifier. Objects are stored in a flat namespace, which makes it easy to expand the store as the number of objects grows.

If you have images on Facebook or files in Dropbox, you’ve used object storage.

Let’s go over Ceph’s main architecture components.

Starting from the bottom:

Reliable Autonomic Distributed Object Store (RADOS): This is the backbone of the cluster.

Librados: This library enables applications to access the object store. This library is even available in several languages to facilitate custom application integration.

Application libraries

RADOS Gateway (RGW) - This is the Amazon Simple Storage Service (S3) / OpenStack Object Storage (Swift) compatible interface, with object versioning, multi-site federation and replication.

RADOS Block Device (RBD) - This allows block device access to RADOS. It supports snapshots, copy-on-write and multi-site replication for disaster recovery.

CephFS - This is the POSIX-compliant distributed file system.

Other - A custom application can be written that can interface directly with the Librados API layer to avoid software overhead.

RADOS stands for "Reliable Autonomic Distributed Object Store". This is a self-managing, self-healing layer composed mainly of two types of entities, OSDs and MONs.

OSDs (or Object Storage Daemons) are the data storage elements in the RADOS layer.

An OSD is the combination of a disk, a file system and the object storage software daemon that manages them. Ceph is designed to scale to a very large number of OSDs, and reference architectures document what has been done in production. OSDs serve stored data to clients. They peer intelligently for replication and recovery without the need for a central conductor.

You can easily add or remove OSDs, and the changes will ripple through the cluster until it reaches a healthy state again through peering and replication. A best practice for storage administrators is to estimate the impact on the cluster before adding or removing OSDs.

A monitor or MON node is responsible for helping reach a consensus in distributed decision making using the Paxos protocol. In Ceph, consistency is favored over availability. A majority of the configured monitors need to be available for the cluster to be functional. For example, if there are two monitors and one fails, only 50% of the monitors are available so the cluster would not function. But if there are three monitors, the cluster would survive one node’s failure and still be fully functional. Red Hat supports a minimum of three monitor nodes. A typical cluster would have a small odd number of monitors.
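The majority rule above can be checked with a few lines of arithmetic. The following is a minimal sketch of that rule only; the function names are made up for illustration and are not part of any Ceph API.

```python
def has_quorum(total_monitors: int, available_monitors: int) -> bool:
    """Return True when a strict majority of the configured monitors is available."""
    if total_monitors <= 0:
        raise ValueError("a cluster needs at least one monitor")
    return available_monitors > total_monitors // 2


def tolerated_failures(total_monitors: int) -> int:
    """Largest number of monitors that can fail while a majority remains."""
    return (total_monitors - 1) // 2


if __name__ == "__main__":
    for total in (2, 3, 4, 5):
        print(total, "monitors tolerate", tolerated_failures(total), "failure(s)")
```

Under this rule a fourth monitor tolerates no more failures than three do, which is one way to see why a small odd number of monitors is typical.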

If you’ve stayed awake this far into this blog, you’re probably wondering, "Where do objects actually live?" In Ceph, everything is natively stored as an object in the RADOS cluster. Everything is chopped up into little chunks. This chunk size can be set but it has a default of four megabytes. After being chopped up, the resulting objects are saved in the RADOS cluster. Retrieval is done in parallel and assembled together at the client. The cluster itself is sliced up into smaller units called "placement groups" or "PGs". Maintenance in the cluster is done at the placement group level and not at the object level.
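To make the chopping and reassembly concrete, here is a small sketch. It assumes that "four megabytes" means 4 MiB, and the object naming scheme is hypothetical, chosen only so the chunks can be put back in order. Real clients fetch chunks in parallel; this sketch reassembles them sequentially for simplicity.

```python
from __future__ import annotations

OBJECT_SIZE = 4 * 1024 * 1024  # assumed default chunk size: 4 MiB


def split_into_objects(name: str, data: bytes,
                       object_size: int = OBJECT_SIZE) -> dict[str, bytes]:
    """Chop data into fixed-size chunks keyed by a hypothetical object name."""
    objects = {}
    for index, start in enumerate(range(0, len(data), object_size)):
        objects[f"{name}.{index:016x}"] = data[start:start + object_size]
    return objects


def reassemble(name: str, objects: dict[str, bytes]) -> bytes:
    """Client-side reassembly: concatenate the chunks in index order."""
    return b"".join(objects[f"{name}.{index:016x}"]
                    for index in range(len(objects)))


if __name__ == "__main__":
    payload = bytes(10 * 1024 * 1024)  # 10 MiB of zeros
    chunks = split_into_objects("image", payload)
    print(len(chunks), "objects")
    print(reassemble("image", chunks) == payload)
```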

A "pool" is a logical grouping of placement groups. The degree of replication is set at the pool level, so it can differ from pool to pool.

So an object lives in a pool and is associated with one placement group. Depending on the properties of the pool, the placement group is associated with as many OSDs as the replication count. For example, with a replication count of three, each placement group will be associated with three OSDs: a primary OSD and two secondary OSDs. The primary OSD will serve data and peer with the secondary OSDs for data redundancy. In case the primary OSD goes down, a secondary OSD can be promoted to become the primary to serve data, allowing for high availability.
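The promotion of a secondary OSD can be pictured with a toy model. This is only an illustration of the idea in the paragraph above: the OSD names are invented, and a real cluster would also re-replicate data to restore the full replica count.

```python
from __future__ import annotations


def surviving_set(osds: list[str], failed: set[str]) -> list[str]:
    """Drop failed OSDs; the first survivor acts as the new primary."""
    survivors = [osd for osd in osds if osd not in failed]
    if not survivors:
        raise RuntimeError("no OSD left to serve this placement group")
    return survivors


if __name__ == "__main__":
    # Hypothetical placement group with a replication count of three.
    pg_osds = ["osd.1", "osd.4", "osd.7"]  # first entry is the primary
    after_failure = surviving_set(pg_osds, failed={"osd.1"})
    print("new primary:", after_failure[0])
```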

When using multiple data pools for storing objects, both the number of PGs per pool and the number of PGs per OSD need to be balanced out. The chosen numbers should give a reasonably low variance in PGs per OSD for optimum performance. Keeping the PG calculator (https://access.redhat.com/labs/cephpgc/) handy is a great idea when deciding on these numbers.
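For a rough first estimate, a commonly cited rule of thumb (an assumption here, not something this post or the calculator above prescribes) is to aim for about 100 PGs per OSD, divide by the replication count, and round up to a power of two. The sketch below encodes only that rule; the function name and the default target are illustrative, and the calculator remains the better tool for real sizing across multiple pools.

```python
def suggested_pg_count(osd_count: int, replica_count: int,
                       target_pgs_per_osd: int = 100) -> int:
    """Rule-of-thumb PG total, rounded up to a power of two (assumed heuristic)."""
    if osd_count < 1 or replica_count < 1:
        raise ValueError("osd_count and replica_count must be positive")
    raw = osd_count * target_pgs_per_osd / replica_count
    power = 1
    while power < raw:
        power *= 2
    return power


if __name__ == "__main__":
    print(suggested_pg_count(osd_count=10, replica_count=3))
```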

Ceph’s fundamental data placement algorithm is called CRUSH.

CRUSH stands for "Controlled Replication Under Scalable Hashing". Its salient features include:

The ability to do data distribution in a reasonable time.

It is pseudo random in nature.

It’s deterministic in nature (i.e. functions called with the exact same arguments yield the same results on any component of the cluster).

Both the client and the OSD are capable of calculating the exact location of any object.

CRUSH is implemented with the help of a crush map. The main map contains a list of all available physical storage devices, information about the hierarchy of the hardware (OSD, host, chassis, etc.) and rules that map PGs to OSDs.
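The properties listed above (pseudo-random, deterministic, computable by anyone holding the map) can be illustrated with a much simpler stand-in. The sketch below is not the CRUSH algorithm: it uses plain hashing and a highest-score selection over a made-up two-level map, and every name in it is hypothetical. It only shows how a client and an OSD holding the same map can independently compute the same location without asking a central lookup service.

```python
from __future__ import annotations

import hashlib

# Hypothetical map: host -> OSDs on that host (a tiny stand-in for a crush map).
CLUSTER_MAP = {
    "host-a": ["osd.0", "osd.1"],
    "host-b": ["osd.2", "osd.3"],
    "host-c": ["osd.4", "osd.5"],
    "host-d": ["osd.6", "osd.7"],
}


def stable_hash(*parts: str) -> int:
    """Deterministic hash: same inputs give the same value on every machine."""
    digest = hashlib.sha256("/".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def object_to_pg(pool: str, object_name: str, pg_num: int) -> int:
    """Map an object name to one placement group in the pool."""
    return stable_hash(pool, object_name) % pg_num


def pg_to_osds(pool: str, pg: int, replicas: int,
               cluster_map: dict[str, list[str]] = CLUSTER_MAP) -> list[str]:
    """Pick one OSD on each of `replicas` distinct hosts; the first is the primary."""
    if replicas > len(cluster_map):
        raise ValueError("not enough hosts for the requested replica count")
    key = f"{pool}.{pg}"
    hosts = sorted(cluster_map, key=lambda h: stable_hash(key, h), reverse=True)
    return [max(cluster_map[h], key=lambda o: stable_hash(key, o))
            for h in hosts[:replicas]]


def locate(pool: str, object_name: str, pg_num: int = 128,
           replicas: int = 3) -> tuple[int, list[str]]:
    """Compute (placement group, OSD list) for an object from the map alone."""
    pg = object_to_pg(pool, object_name, pg_num)
    return pg, pg_to_osds(pool, pg, replicas)


if __name__ == "__main__":
    print(locate("images", "vm-disk-42"))
```

Placing each replica on a different host mirrors the role of the hierarchy in the crush map: a rule can keep copies apart so that one failed host does not take every copy with it.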

The native interface to the storage cluster is via the Librados layer. The library has wrappers in several languages, e.g. Ruby, Erlang, PHP, C, C++ and Python, to ease interfacing with any application written in those languages.

The three main application offerings are "radosgw", "rbd" and the "cephfs client". Radosgw is a RESTful web services gateway that offers an AWS S3-compatible interface and a Swift interface.

If you have an application that communicates with AWS via an S3 interface, switching it to Ceph can be as simple as redirecting it to the radosgw.

RADOS Block Device (RBD) is perhaps Ceph’s most popular use case. It enables block-level access to a Ceph object store. It supports snapshots and clones, making it a wonderful replacement for expensive SANs. The librbd library is tasked with translating block commands (SCSI commands), expressed as sectors and data lengths, into object requests. RBD finds heavy usage in OpenStack as a back end for the OpenStack Image Service (Glance) and OpenStack Block Storage (Cinder).
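The translation from a block request to object requests is essentially offset arithmetic. The following sketch assumes 512-byte sectors and 4 MiB objects laid out back to back; the function name and the tuple layout are illustrative and do not reflect librbd's actual interface.

```python
from __future__ import annotations

SECTOR_SIZE = 512              # assumed sector size in bytes
OBJECT_SIZE = 4 * 1024 * 1024  # assumed object size: 4 MiB


def block_request_to_extents(first_sector: int, sector_count: int,
                             object_size: int = OBJECT_SIZE
                             ) -> list[tuple[int, int, int]]:
    """Return (object_index, offset_in_object, length) tuples covering the request."""
    if first_sector < 0 or sector_count < 0:
        raise ValueError("sector values must be non-negative")
    offset = first_sector * SECTOR_SIZE
    remaining = sector_count * SECTOR_SIZE
    extents = []
    while remaining > 0:
        index, within = divmod(offset, object_size)
        length = min(remaining, object_size - within)
        extents.append((index, within, length))
        offset += length
        remaining -= length
    return extents


if __name__ == "__main__":
    # A two-sector read that straddles the boundary between objects 0 and 1.
    print(block_request_to_extents(first_sector=8191, sector_count=2))
```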

The third type of storage Ceph offers is a POSIX-compliant shared file system called CephFS.

Ceph has been designed with performance in mind from the get-go. It offers a feature known as journaling. Fast media, preferably solid state drives, could be dedicated to a journal. All writes are temporarily stored in the journal until the writes are flushed from memory to the backing storage. Then, the journal is marked clean and is ready to be overwritten. This can absorb burst write traffic, accelerate client ACKs, and create longer sequential write IO, which is more efficient for the backing storage.

It's important to note that the journal contents are not read unless there's an unclean shutdown of the OSD process, in which case the journal data is read back into memory and processed to backing storage. A common misconception is that data flows from the journal to backing storage; in normal operation it flows from memory to backing storage. Journals significantly increase the write throughput seen by the client. Another optional feature is client-side caching for librbd users. The latest feature the industry has its eyes set on for an upcoming release is called BlueStore ("Bl" as in Block and "ue" as pronounced in "New"). This is a new architecture to help optimize and further reduce overhead in the existing Ceph architecture.
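The journaling write path described above can be modeled in a few lines. This is a toy sketch with invented names, in which Python lists and a dictionary stand in for the fast journal device, RAM and the backing storage; it is not how the OSD is implemented, but it shows that the journal is only read during recovery.

```python
class JournaledStore:
    """Toy model: writes go to the journal and memory, then memory is flushed."""

    def __init__(self):
        self.journal = []  # stands in for fast media such as an SSD
        self.memory = []   # pending writes held in RAM
        self.backing = {}  # stands in for the slower backing storage

    def write(self, key, value):
        self.journal.append((key, value))  # persisted first
        self.memory.append((key, value))
        return "ack"  # the client can be acknowledged at this point

    def flush(self):
        # Normal operation: data flows from memory to backing storage.
        for key, value in self.memory:
            self.backing[key] = value
        self.memory.clear()
        self.journal.clear()  # journal marked clean, ready to be overwritten

    def crash(self):
        self.memory.clear()  # unclean shutdown: RAM contents are lost

    def recover(self):
        # The only time the journal is read: replay it into memory, then flush.
        self.memory = list(self.journal)
        self.flush()


if __name__ == "__main__":
    store = JournaledStore()
    store.write("obj-1", b"hello")
    store.crash()
    store.recover()
    print(store.backing)
```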

Enterprise storage has come a really long way. The storage landscape offers choices ranging from traditional storage subsystems to open source solutions like Ceph and Gluster. Personal preference for "open source first" aside, a key factor in making a choice should be the characterization of the required workload, among others. With careful analysis of requirements and systematic provisioning for functionality and performance, it’s hard to see how one could go wrong with Ceph as a choice.

Ruchika Kharwar is a cloud success architect at Red Hat. She spends her time working with customers helping them take their proof of concept to production by enabling integration of various features and components with the ultimate goal of getting them the infrastructure they want.

Red Hat Cloud Success is designed to help simplify your IT transformation and accelerate your adoption of cloud technologies with deep product expertise, guidance, and support. From the proof of concept stage to production, a highly skilled cloud technical specialist will partner with you to provide continuity and help ensure successful implementation of your cloud solution. Through this limited-time engagement, Red Hat Cloud Success can help you effectively plan and deploy cloud solutions and strategically plan for the future.