If you happen to need a quick and easy way to start a single-node S3/NFS cluster with geo-transparent data access, or simply a single-node EdgeFS installation, this article can be of help!

What is EdgeFS? It is a new storage provider addition to the CNCF Rook project, and you can read more about it at https://rook.io/docs/rook/master/edgefs-storage.html. But you can also run it in a so-called “solo” mode, a single-node Docker container with the ability to scale out your deployment as it grows by simply connecting more nodes and/or geographically distributed cluster segments to it.

EdgeFS's stronghold is its ability to virtualize the underlying infrastructure as a scalable, highly available and distributed storage system. It works similarly to “git”, where all modifications are globally immutable, fully versioned, self-validated, distributed and therefore fault-tolerant. As a result, it enables cross-cloud and geographically transparent high-performance distributed access to commonly used storage protocols for Object, File and Block.

Use Case — Geo-distributed S3/NFS archive

A geographically distributed and transparent S3/NFS-accessible archive: many millions of small objects of roughly 1MB in size, accessible at 2+ segmented locations for unstructured archive-pattern read-write, with performance characteristics constrained by each segment's memory, CPU, and disk types.

Distributed bi-directional EdgeFS Namespace Segmentation

I want to keep these step-by-step instructions as simple as possible while still demonstrating a high-performance use case for random 1MB I/O access. As such, I'm going to use a hybrid-style EdgeFS Target deployment with 10 HDDs and 2 SSDs to achieve a better price/performance ratio. The disk configuration of my first segment looks like this:

# lsblk -o name,model,vendor,rota
NAME MODEL            VENDOR   ROTA
sda  MICRON_M510DC_MT ATA         0
sdb  MG04SCA40EE      TOSHIBA     1
sdc  MG04SCA40EE      TOSHIBA     1
sdd  MG04SCA40EE      TOSHIBA     1
sde  MG04SCA40EE      TOSHIBA     1
sdf  MICRON_M510DC_MT ATA         0
sdh  MG04SCA40EE      TOSHIBA     1
sdi  MG04SCA40EE      TOSHIBA     1
sdj  MG04SCA40EE      TOSHIBA     1
sdk  MG04SCA40EE      TOSHIBA     1
sdl  MG04SCA40EE      TOSHIBA     1

The other segments can be located in the same or remote locations. In this article, we will configure an ISGW link connecting 2+ EdgeFS segments.

Instead of going into tedious explanations of docker command lines, I thought we could automate this a little bit with the excellent docker-compose tool. You will need a version that supports compose file format 2.4.
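You can check your installed version first (this is standard docker-compose functionality):

docker-compose version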

Step 0. Download docker-compose.yml file

The docker-compose file lists all the services we need and is designed to work within the /edgefs directory. Download it from this gist and install it into an empty /edgefs directory. Verify that “docker-compose ps” returns success.
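A minimal sketch of this step ($GIST_RAW_URL is a placeholder for the raw URL of the gist linked above):

mkdir -p /edgefs && cd /edgefs
curl -L -o docker-compose.yml "$GIST_RAW_URL"   ### placeholder: raw URL of the gist
docker-compose ps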

Step 1. Prepare raw disks

EdgeFS uses raw disks (no filesystem created!) as its storage media. It converts classic block devices into a virtualized key-value database. The easiest way to provision a single-node target is to manually execute the “wipefs -a /dev/DEV” command for each raw disk on the node and let the EdgeFS configuration wizard build its optimal configuration.
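For example, a minimal sketch matching the disk layout shown earlier (destructive! double-check the device names on your node first):

### WARNING: wipes all filesystem signatures on the listed devices
for dev in sda sdb sdc sdd sde sdf sdh sdi sdj sdk sdl; do
    wipefs -a /dev/$dev
done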

Step 2. Configure target

Before we can start the docker-compose services, we need to configure the EdgeFS target service. For the sake of the experiment, I prepared a few options we can play with.

Option 1. All HDD config

In this option, we will ignore the installed SSDs and instruct the configuration wizard to pick up only HDDs. This option is great for cold, cheap & deep archives:

docker-compose run --rm -e CCOW_LOG_STDOUT=1 target config node \
    -i enp5s0f0 -d rtrd -p rtrdAllHDD -o \
    '{"LmdbPageSize":32768,"HDDReadAhead":512,"DisableVerifyChid":true}'

Option 2. Hybrid HDD/SSD config with Metadata and WAL on SSD

This is an excellent option if you want to use rotational media only for data chunks. EdgeFS RT-RD does a great job of coalescing writes for streaming I/O. I would recommend this option for any kind of active archive:

docker-compose run --rm -e CCOW_LOG_STDOUT=1 target config node \
    -i enp5s0f0 -d rtrd -p rtrdMDOffload -o \
    '{"LmdbPageSize":32768,"HDDReadAhead":512,"DisableVerifyChid":true}'

Important parts of the above configs that you likely need to substitute: “-i enp5s0f0” needs to point to the interface name where the services will be available, and the “-o …” option enables the optimal hybrid HDD/SSD configuration and 512KB of HDD read-ahead.
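If you are unsure of the interface name on your node, standard iproute2 tooling will list the candidates (generic Linux, nothing EdgeFS-specific):

ip -br link show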

Option 3. The easy option, all data on /edgefs/data directory

This option is for non-production usage. When configured, it will utilize the locally mounted location at /edgefs/data and create 4 emulated devices. You will notice that the summary output of “efscli” is multiplied by 4, but ignore that for now. To enable this mode, just run this command:

docker-compose run --rm -e CCOW_LOG_STDOUT=1 target config node \
    -l localhost -i eth0 -d rtlfs

If you want to adjust the suggested configuration, pass the “--help” flag.

Once completed, you can view the created configuration in the /edgefs/etc directory. You can run the config node command multiple times until you are satisfied with the configuration.
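For example, to review what the wizard generated:

ls -R /edgefs/etc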

Step 3. Launching target and initializing services

Start the target and the preconfigured services. This includes the REST management service, GUI, NFS, and S3.

# docker-compose up -d
Creating edgefs_target_1
Creating edgefs_nfs01_1
Creating edgefs_mgmt_1
Creating edgefs_s301_1
Creating edgefs_ui_1
Creating edgefs_isgw01_1

# docker-compose logs -f

The first start may take a few minutes, as EdgeFS needs to prepare the disks. You can monitor its progress and also log in to the toolbox:

# docker-compose exec mgmt toolbox
Welcome to EdgeFS Toolbox.
Hint: type efscli to begin

# efscli system status
ServerID B86037FF3C399D636A270A2BB4E6780E node3075ub16 ONLINE

Initialize cluster segment “myspace”

Once the status is ONLINE, initialize the cluster and create the system objects:

### initialize local cluster segment
efscli system init

### initialize myspace/work/shared1
efscli cluster create myspace
efscli tenant create myspace/work
efscli bucket create myspace/work/shared1 \
    -s 512K -r 2 -R 1 -t 1 -c 3:1:xor -C 2h

EdgeFS uses a globally unique system path in the format NAMESPACE/TENANT/BUCKET/OBJECT. Above, we've created the “myspace/work/shared1” bucket with a 3:1 XOR erasure-coding schema triggered once an object version is older than 2 hours, a replication count of 2 with 1 replica delayed, and the default chunk size of 512KB.

We could also connect to the GUI and create/monitor/manage services via a nice UI by pointing a web browser to http://IPADDR:3000; the default user is admin, and the default password is edgefs.

Erasure Coding design in EdgeFS

EC in EdgeFS has a number of benefits. Specifically in our example, being designed as a post-process, it delays coding until an object is considered cold (configured as 2 hours passed since a version's creation). It preserves the immutable object structure by deleting the 4 no-longer-needed replicas of a group of 3 chunks and adding 1 parity chunk in the background; as a result, the write I/O path isn't affected compared to normal replication. During the coding process, the original 3 chunks are kept unencoded so that subsequent read I/O does not require reconstruction, hence zero impact on cold-data read performance.

Important: for each encoding schema to trigger, the cluster needs to have ≥ n_data+n_parity failure domains, and the bucket replication count has to be ≥ n_parity+1. For instance, for the 3:1:xor schema and a device-level failure domain (single node), the cluster segment needs to provide at least 4 devices.
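As a hypothetical illustration reusing the earlier bucket create syntax (the bucket name shared2 and the flag values are illustrative), a bucket using the wider 4:2:rs schema would need at least 6 failure domains and a replication count of at least 3:

### hypothetical: 4:2:rs requires >= 6 failure domains and replication count >= n_parity+1 = 3
efscli bucket create myspace/work/shared2 \
    -s 512K -r 3 -t 1 -c 4:2:rs -C 2h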

To summarize:

- Encoding is done as post-processing when data “goes cold” (tunable): no impact on write performance.
- Encoding is performed “across chunks”: no impact on read performance, since a full copy of each data chunk is kept in the cluster.
- Fully distributed rebuild of lost parity/chunks.
- Supports cold->hot data modifications (overwrites).
- Supports flexible schemas: 2:1:xor, 3:1:xor, 4:2:rs, 6:2:rs, 9:3:rs.

Step 4. Creating S3, NFS and ISGW services

Now that the cluster namespace segment, tenant and bucket are created, we can set up the S3, NFS and ISGW service definitions:

### NFS service with myspace/work/shared1 export
efscli service create nfs nfs01
efscli service config nfs01 X-MH-ImmDir 1
efscli service serve nfs01 myspace/work/shared1

### S3 service serving myspace/work tenant
efscli service create s3 s301
efscli service serve s301 myspace/work

### ISGW endpoint link
efscli service create isgw isgw01
efscli service serve isgw01 myspace/work/shared1
efscli service config isgw01 X-ISGW-Remote ccow://REMOTE_IP:14000
efscli service config isgw01 X-Status enabled

At this point, the docker-compose script will pick up the changes on the next policy restart, and the services should become available. If for some reason a change isn't picked up, speed up the service restart with the following commands:

docker-compose logs -f nfs01    # monitor nfs service logs
docker-compose logs -f s301     # monitor s3 service logs
docker-compose logs -f isgw01   # monitor isgw01 service logs

docker-compose restart nfs01
docker-compose restart s301
docker-compose restart isgw01

Step 5. Connecting EdgeFS cluster segments

The EdgeFS Inter-Segment Gateway (ISGW) link is a building block for EdgeFS's cross-segment, cross-cloud global namespace synchronization functionality.

It distributes modified chunks of data asynchronously and enables seamless, geographically transparent access to files, objects and block devices. It is important to note that a file or a block device consists of one or more objects; within the EdgeFS scope, ultimately everything is an object, globally immutable and self-validated.

To draw an analogy, the EdgeFS concept of globally immutable modifications is very similar to how “git” operates on repository commits and branches. As such, this technique empowers EdgeFS users to construct and operate comprehensive, wide-spread global namespaces with greatly simplified management overhead. A file or object modified at a source site where an ISGW link is set up will be immediately noticed by the ISGW endpoint links, which then spread the change. Eventually, all the connected sites receive the file modification, with only the modified blocks getting transferred.

Not only does the ISGW link reduce the amount of data that needs to be transferred, it also deduplicates the transfers: chunks of a file change whose globally unique cryptographic signatures match data already present at the destination will not be transferred, thus enabling global namespace deduplication.

An ISGW link can be bi-directional, i.e. it allows the same file/object to be modified across the namespaces. This works well for many use cases where application logic can ensure serialization of changes. A single bi-directional link can connect two sites, but it is possible to create as many non-overlapping links as needed, as sketched below.
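As an illustrative sketch (the service name isgw02, the bucket shared2 and THIRD_SEGMENT_IP are hypothetical), a second, non-overlapping link could serve a different bucket toward another segment:

### hypothetical: a second link serving a different bucket
efscli service create isgw isgw02
efscli service serve isgw02 myspace/work/shared2
efscli service config isgw02 X-ISGW-Remote ccow://THIRD_SEGMENT_IP:14000
efscli service config isgw02 X-Status enabled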

An ISGW link can also transparently synchronize file, object, directory, bucket or tenant-level snapshots, grouped into a so-called SnapView construct. Thus, a modification to a block device, for instance, can be consistently viewed across the entire global namespace.

Because EdgeFS metadata is also globally immutable and unique, it is possible to enable a mode that transfers only metadata changes. With this mode enabled, users can construct efficient access endpoints where modifications are fetched on demand, creating a globally and geographically distributed cache fog aggregation with a built-in E-LRU eviction policy.
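The X-ISGW-MDOnly field in the service listing later in this article suggests this mode is toggled per link; as a hypothetical sketch (the exact value format is an assumption on my part):

### hypothetical: enable metadata-only transfers on the link (value format assumed)
efscli service config isgw01 X-ISGW-MDOnly 1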

Local I/O at each site executes at the speed of the physically or virtually connected media devices. Globally immutable modifications (versions) are transferred eventually, without slowing down the application workflow running at the local site.

The assumption is that both segments are provisioned per steps 1–4 at this point, and we can now create the bi-directional ISGW link.

The commands below explain how to create an inter-segment link:
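A minimal sketch, mirroring the isgw01 setup from Step 4 and run from the second segment's toolbox (SEGMENT1_IP is a placeholder for the first segment's address):

### on the second segment: link back to the first segment
efscli service create isgw isgw01
efscli service serve isgw01 myspace/work/shared1
efscli service config isgw01 X-ISGW-Remote ccow://SEGMENT1_IP:14000
efscli service config isgw01 X-Status enabled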

Verification and Performance

Now for the fun part: let's use the services, validate functionality, and get a sense of the top performance we can achieve for our use case.

Let's start with the primary scenario: simulating the creation of 100,000 x 1MB S3 objects. For this, we need to install CosBench:

COS_HOST=10.3.30.75
docker run --rm -d -p 19088:19088 -p 18088:18088 -e ip=$COS_HOST \
    -e t=both -e n=1 -e u=true nexenta/cosbench

Log in to http://$COS_HOST:19088/controller/ and submit the workload from this gist.

For the sake of the experiment, let's now run a similar test but over NFS, on the same bucket. The following fio command will do the job:

### mount NFS share
mkdir /mnt/shared1
showmount -e $COS_HOST
mount -t nfs $COS_HOST:/work/shared1 /mnt/shared1 -o tcp
mkdir /mnt/shared1/dir1

### execute FIO test
fio -name=myjob -directory=/mnt/shared1/dir1 -ioengine=psync \
    -norandommap -randrepeat=0 -allrandrepeat=0 -refill_buffers \
    -blocksize=1M -nrfiles=1000 -filesize=1M -rw=write -buffered=0 \
    -buffer_compress_percentage=0 -dedupe_percentage=0 \
    -iodepth=2 -numjobs=10 -fallocate=none -group_reporting

In general, NFS is a more metadata-intensive protocol, and as such I do not envision people wanting to use the all-HDD configuration to work with a large number of small files. Therefore, we limit the experiment to the hybrid configurations only.

NFS performance is similar to S3, with some insignificant metadata overhead. Let's now explore whether we can transparently access files and objects:

### Place an object via S3 and access via NFS
ls /mnt/shared1/.objects/path/to/obj1
curl http://$COS_HOST:9982/shared1/path/to/obj1 -X PUT
ls /mnt/shared1/.objects/path/to/obj1

### Place a file via NFS and access via S3
mkdir -p /mnt/shared1/dir/to; touch /mnt/shared1/dir/to/file1
curl http://$COS_HOST:9982/shared1/.nfs/dir/to/file1

Paths are fully preserved, and files/objects can be accessed transparently within the same cluster segment. Now, let's switch to the second cluster segment and try to access the same object and file:

### Second segment has our modifications and preserves transparency
ls /mnt/shared1/.objects/path/to/obj1
curl http://$COS_HOST2:9982/shared1/.nfs/dir/to/file1

### We can also see all the objects and files that we created during
### our performance tests
ls /mnt/shared1/dir1 | wc -l
10002

### We can display syncing statistics
efscli service show isgw01 -s
X-Service-Name: isgw01
X-Service-Type: isgw
X-Description: Inter Segment Gateway
X-Auth-Type: disabled
X-Servers: -
X-Container-Network: -
X-ISGW-Basic-Auth: -
X-ISGW-Direction: -
X-ISGW-MDOnly: -
X-ISGW-Encrypted-Tunnel: -
X-ISGW-DFLocal: 0.0.0.0:49678
X-ISGW-Replication: 3
X-ISGW-Local: 0.0.0.0:14000
X-ISGW-Remote: ccow://10.3.30.75:14000
X-Status: enabled
[
    myspace/work/shared1
]
Stats: [
    myspace/work/shared1: {
        "timestamp": 1551664249664,
        "status": "active",
        "state": "continuous",
        "delay": 29664,
        "version_manifests": 1,
        "requests": 2,
        "chunk_manifests": 0,
        "data_chunks": 0,
        "snapshots": 0,
        "bytes": 693,
        "latency": 81
    }
]

The assumption here is that all the linked cluster segments have the myspace/work namespace and tenant pre-created. If desired, we can easily override source object attributes at the tenant level, thus providing a way to leverage the physical characteristics of cluster segments and utilize the infrastructure in more cost-efficient ways.

Teardown

To clean up and start over, these commands can be useful:

docker-compose down
docker-compose run --rm -e CCOW_LOG_STDOUT=1 target toolbox \
    nezap --do-as-i-say
partprobe                   # refresh partition tables
rm -f ./var/run/flexhash*   # reset previously discovered FlexHash

This keeps the current configuration in /edgefs/etc while zapping the disks and caches, so that the configuration can be re-created or re-initialized.

Summary

EdgeFS exploits locally available resources and presents them as a highly available cluster segment. Its outstanding performance characteristics are achieved thanks to its immutable data structure design, dynamic data placement via a low-latency protocol, and a highly scalable shared-nothing architecture. A locally created cluster segment can be expanded as performance and capacity requirements grow by simply adding more servers to it.

EdgeFS has no limitations on how many cluster segments can be created. A full mesh can work just as well as a classic primary-secondary or a bi-directional link. The reason it works is that metadata is globally unique and immutable, allowing the cross-segment transport protocol to avoid distributing modifications that already exist at the destination. Hence, the deduplication that occurs at the inter-segment gateway saves tremendously on egress costs, as it operates on compressed data chunks and never needs to transfer duplicates.

Give it a try today, and let me know what you think!

Find out more by joining our growing community at http://edgefs.io and http://rook.io