
The ungleich ceph handbook¶

This document is IN PRODUCTION.

This article describes the ungleich storage architecture, which is based on ceph, as well as the maintenance commands required to operate it.

Communication guide¶

Usually when a disk fails, no customer communication is necessary, as the failure is automatically compensated/rebalanced by ceph. However, if multiple disk failures happen at the same time, I/O speed might be reduced and thus customer experience impacted.

For this reason, communicate with customers whenever I/O recovery settings are temporarily tuned.

ceph osd df tree¶

Using ceph osd df tree you can see not only the disk usage per OSD, but also the number of PGs on an OSD. This is especially useful to see how the OSDs are balanced.

Find out the device of an OSD¶

Use mount | grep /var/lib/ceph/osd/ceph-OSDID on the server on which the OSD is located:

[16:01:23] server2.place6:~# mount | grep /var/lib/ceph/osd/ceph-31
/dev/sdk1 on /var/lib/ceph/osd/ceph-31 type xfs (rw,relatime,attr2,inode64,noquota)
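This lookup can be wrapped in a small helper. The following is a hypothetical convenience function (the name and the awk field positions are assumptions based on the mount output format shown above), to be run on the server hosting the OSD:

```shell
# osd_device: print the device backing a given OSD id, based on the
# "mount" output format shown above ("<device> on <mountpoint> type ...").
# Hypothetical helper, not part of ungleich-tools.
osd_device() {
    osdid="$1"
    mount | awk -v path="/var/lib/ceph/osd/ceph-$osdid" \
        '$2 == "on" && $3 == path { print $1 }'
}

# Usage: osd_device 31
```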

Adding a new disk/ssd to the ceph cluster¶

Write on the disk, with a permanent marker, the order / date on which we bought it.

For Dell servers¶

First find the disk, then add it to the operating system.

megacli -PDList -aALL | grep -B16 -i unconfigur

# Sample output:
[19:46:50] server7.place6:~# megacli -PDList -aALL | grep -B16 -i unconfigur
Enclosure Device ID: N/A
Slot Number: 0
Enclosure position: N/A
Device Id: 0
WWN: 0000000000000000
Sequence Number: 1
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 894.252 GB [0x6fc81ab0 Sectors]
Non Coerced Size: 893.752 GB [0x6fb81ab0 Sectors]
Coerced Size: 893.75 GB [0x6fb80000 Sectors]
Sector Size: 0
Firmware state: Unconfigured(good), Spun Up

Then add the disk to the OS:

megacli -CfgLdAdd -r0 [Enclosure Device ID:slot] -aX   # X: host is 0, md-array is 1

# Sample call, if enclosure and slot are KNOWN (aka not N/A)
megacli -CfgLdAdd -r0 [32:0] -a0

# Sample call, if enclosure is N/A
megacli -CfgLdAdd -r0 [:0] -a0

Then check the disk:

fdisk -l

[11:26:23] server2.place6:~# fdisk -l
......
Disk /dev/sdh: 7.3 TiB, 8000987201536 bytes, 15626928128 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
[11:27:24] server2.place6:~#

Then create a GPT partition table:

/opt/ungleich-tools/disk-create-fresh-gpt /dev/XXX

[11:31:10] server2.place6:~# /opt/ungleich-tools/disk-create-fresh-gpt /dev/sdh
......
Created a new DOS disklabel with disk identifier 0x9c4a0355.
Command (m for help): Created a new GPT disklabel (GUID: 374E31AD-7B96-4837-B5ED-7B22C452899E).
......

Then create the osd for ssd/hdd-big:

/opt/ungleich-tools/ceph-osd-create-start /dev/XXX XXX   # (ssd or hdd-big)

[11:33:58] server2.place6:~# /opt/ungleich-tools/ceph-osd-create-start /dev/sdh hdd-big
+ set -e
+ [ 2 -lt 2 ]
......
+ /opt/ungleich-tools/monit-ceph-create-start osd.14 osd.14
[ ok ] Restarting daemon monitor: monit.
[11:36:14] server2.place6:~#

Then check the rebalancing (if you want to add another disk, you should do so only after rebalancing has finished):

ceph -s

[12:37:57] server2.place6:~# ceph -s
  cluster:
    id:     1ccd84f6-e362-4c50-9ffe-59436745e445
    health: HEALTH_WARN
            2248811/49628409 objects misplaced (4.531%)
......
  io:
    client:   170KiB/s rd, 35.0MiB/s wr, 463op/s rd, 728op/s wr
    recovery: 27.1MiB/s, 6objects/s
[12:49:41] server2.place6:~#
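Since the next disk should only be added once rebalancing has finished, a small polling loop can help. This is a hedged sketch (the helper name and default interval are made up); run it on a node with the admin keyring:

```shell
# wait_for_health_ok: poll a health command until it reports HEALTH_OK,
# then return. Hypothetical helper; on a real cluster pass "ceph health".
wait_for_health_ok() {
    health_cmd="$1"
    interval="${2:-60}"   # seconds between polls
    while true; do
        status=$($health_cmd)
        case "$status" in
            HEALTH_OK*) echo "rebalancing finished"; return 0 ;;
            *) echo "still rebalancing: $status"; sleep "$interval" ;;
        esac
    done
}

# Usage on a monitor node:
# wait_for_health_ok "ceph health" 60
```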

Moving a disk/ssd to another server¶

(needs to be described better)

Generally speaking:

// needs to be tested: disable recovery so data won't start to move while you have the osd down

/opt/ungleich-tools/ceph-osd-stop-disable does the following:
- Stop the osd, remove monit on the server you want to take it out of
- umount the disk

Take disk out

Discard the preserved cache on the server you took it out of, using megacli:

megacli -DiscardPreservedCache -Lall -a0

Insert into new server

Clear the foreign configuration using megacli:

megacli -CfgForeign -Clear -a0

The disk will now appear in the OS, and ceph/udev will automatically start the OSD (!) No creation of the osd is required!

Verify that the disk exists and that the osd is started, using ps aux and ceph osd tree.

/opt/ungleich-tools/monit-ceph-create-start osd.XX   # where osd.XX is the osd + number

This:
- Creates the monit configuration file so that monit watches the OSD
- Reloads monit

Verify monit using monit status.

Removing a disk/ssd¶

To permanently remove a failed disk from a cluster, use ceph-osd-stop-remove-permanently from the ungleich-tools repo. Warning: if the disk is still active, the OSD will be shut down AND removed from the cluster -> all data of that disk will need to be rebalanced.

Handling DOWN osds with filesystem errors¶

If an email arrives with the subject "monit alert -- Does not exist osd.XX-whoami", the filesystem of an OSD cannot be read anymore. It is highly likely that the disk / ssd is broken. Steps that need to be done:

Login to any ceph monitor (cephX.placeY.ungleich.ch)

Check ceph -s, find the host using ceph osd tree

Login to the affected host

Run the following commands:
- ls /var/lib/ceph/osd/ceph-XX
- dmesg

For example, after checking dmesg output like the following, you can proceed with the next step:

[204696.406756] XFS (sdl1): metadata I/O error: block 0x19100 ("xlog_iodone") error 5 numblks 64
[204696.408094] XFS (sdl1): xfs_do_force_shutdown(0x2) called from line 1233 of file /build/linux-BsFdsw/linux-4.9.65/fs/xfs/xfs_log.c. Return address = 0xffffffffc08eb612
[204696.410702] XFS (sdl1): Log I/O Error Detected. Shutting down filesystem
[204696.411977] XFS (sdl1): Please umount the filesystem and rectify the problem(s)



Create a new ticket in the datacenter light project:
- Subject: "Replace broken OSD.XX on serverX.placeY.ungleich.ch"
- Add (partial) output of the above commands
- Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently XX, where XX is the osd id, to remove the disk from the cluster
- Remove the physical disk from the host, check if there is warranty on it, and if yes:
  - Create a short letter to the vendor, including the technical details from above
  - Record when you sent it
  - Put the ticket into status waiting
- If there is no warranty, dispose of it



Change ceph speed for i/o recovery¶

By default we want to keep I/O recovery traffic low so as not to impact customer experience. However, when multiple disks fail at the same time, we might want to prioritise recovery for data safety over performance.

The default configuration on our servers contains:

[osd]
osd max backfills = 1
osd recovery max active = 1
osd recovery op priority = 2

The important settings are osd max backfills and osd recovery max active, the priority is always kept low so that regular I/O has priority.

To adjust the number of backfills per osd and to change the number of threads used for recovery, we can use on any node with the admin keyring:

ceph tell osd.* injectargs '--osd-max-backfills Y'
ceph tell osd.* injectargs '--osd-recovery-max-active X'

where Y and X are the values that we want to use. Experience shows that Y=5 and X=5 doubles to triples the recovery performance, whereas X=10 and Y=10 increases recovery performance 5 times.
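The two injectargs calls can be combined into a tiny helper. The function name is made up; the commands are the ones shown above, to be run on a node with the admin keyring:

```shell
# set_recovery_speed: set osd max backfills and osd recovery max active
# to the same value N on all OSDs. Hypothetical wrapper around the
# injectargs calls shown above.
set_recovery_speed() {
    n="$1"
    ceph tell 'osd.*' injectargs "--osd-max-backfills $n"
    ceph tell 'osd.*' injectargs "--osd-recovery-max-active $n"
}

# Speed up recovery during an incident, then restore our defaults:
# set_recovery_speed 5
# set_recovery_speed 1
```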

Debug scrub errors / inconsistent pg message¶

From time to time disks don't save what they are told to save. Ceph scrubbing detects these errors and switches to HEALTH_ERR. Use ceph health detail to find out which placement groups (pgs) are affected. Usually a ceph pg repair <number> fixes the problem.

If this does not help, consult https://ceph.com/geen-categorie/ceph-manually-repair-object/.
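When several pgs are inconsistent at once, the repair step can be looped. This is a sketch; the function name and the assumption that ceph health detail prints lines starting with "pg <pgid> ... inconsistent" are mine:

```shell
# repair_inconsistent_pgs: find pgs flagged inconsistent in
# "ceph health detail" and issue "ceph pg repair" for each.
# Hypothetical helper; the parsed line format is an assumption.
repair_inconsistent_pgs() {
    ceph health detail \
        | awk '$1 == "pg" && /inconsistent/ { print $2 }' \
        | while read -r pgid; do
              ceph pg repair "$pgid"
          done
}
```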

Move servers into the osd tree¶

New servers have their buckets placed outside the default root and thus need to be moved inside.

Output might look as follows:

[11:19:27] server5.place6:~# ceph osd tree
ID CLASS   WEIGHT    TYPE NAME       STATUS REWEIGHT PRI-AFF
-3          0.87270  host server5
41   ssd    0.87270      osd.41          up  1.00000 1.00000
-1        251.85580  root default
-7         81.56271      host server2
 0 hdd-big   9.09511          osd.0       up  1.00000 1.00000
 5 hdd-big   9.09511          osd.5       up  1.00000 1.00000
...

Use ceph osd crush move serverX root=default (where serverX is the new server), which will move the bucket into the right place:

[11:21:17] server5.place6:~# ceph osd crush move server5 root=default
moved item id -3 name 'server5' to location {root=default} in crush map
[11:32:12] server5.place6:~# ceph osd tree
ID CLASS WEIGHT    TYPE NAME       STATUS REWEIGHT PRI-AFF
-1       252.72850 root default
...
-3         0.87270     host server5
41   ssd   0.87270         osd.41      up  1.00000 1.00000

How to fix existing osds with wrong partition layout¶

In the first version of DCL we used a filestore-based layout with 3 partitions.

In the second version of DCL, which includes OSD autodetection, we use a bluestore-based layout with 2 partitions.

To convert, we delete the old OSD, clean the partitions and create a new osd:

Inactive OSD¶

If the OSD is not active, we can do the following:

Find the OSD number: mount the partition and find the whoami file

root@server2:/opt/ungleich-tools# mount /dev/sda2 /mnt/
root@server2:/opt/ungleich-tools# cat /mnt/whoami
0
root@server2:/opt/ungleich-tools# umount /mnt/

Verify in the ceph osd tree output that the OSD is on that server

Delete the OSD:
- ceph osd crush remove $osd_name
- ceph osd rm $osd_name



Then continue below as described in "Recreating the OSD".

Remove Active OSD¶

Use /opt/ungleich-tools/ceph-osd-stop-remove-permanently OSDID to stop and remove the OSD

Then continue below as described in "Recreating the OSD".

Recreating the OSD¶

Create an empty partition table:

fdisk /dev/sdX
# within fdisk: g (create a new GPT table), then w (write and quit)

Create a new OSD:

/opt/ungleich-tools/ceph-osd-create-start /dev/sdX CLASS   # use hdd, ssd, ... for the CLASS



How to fix unfound pg¶

refer to https://redmine.ungleich.ch/issues/6388

Check the health state: ceph health detail

Check which server has that osd: ceph osd tree

Check which VMs are running on that server: virsh list

Check the pg map: ceph osd map [osd pool] [VMID]

Revert the pg: ceph pg [PGID] mark_unfound_lost revert
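The final step can be looped over every affected pg. This is a hedged sketch (the function name and the parsed line format are assumptions); note that revert discards the unfound writes, so read the issue linked above before using it:

```shell
# revert_unfound_pgs: revert every pg that "ceph health detail" reports
# as having unfound objects. Hypothetical helper; the parsed line format
# is an assumption. WARNING: revert discards the unfound writes.
revert_unfound_pgs() {
    ceph health detail \
        | awk '$1 == "pg" && /unfound/ { print $2 }' \
        | while read -r pgid; do
              ceph pg "$pgid" mark_unfound_lost revert
          done
}
```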



Enabling per image RBD statistics for prometheus¶

[20:26:57] red2.place5:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "one,hdd"
[20:27:57] black2.place6:~# ceph config set mgr mgr/prometheus/rbd_stats_pools "hdd,ssd"

S3 Object Storage¶

This section is * UNDER CONSTRUCTION *

S3 requests are handled by a publicly accessible gateway, which also has access to the ceph cluster.

s3 buckets are usually

Authentication / Users¶

Ceph can make use of LDAP as a backend; however, it uses the clear text username+password as a token. See https://docs.ceph.com/docs/mimic/radosgw/ldap-auth/

We do not want users to store their regular account credentials on machines.

For this reason we use independent users / tokens, but with the same username as in LDAP.

Creating a user:

radosgw-admin user create --uid=USERNAME --display-name="Name of user"

Listing users:

radosgw-admin user list

Deleting users and their storage:

radosgw-admin user rm --uid=USERNAME --purge-data
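The user create command above prints the new user as a JSON document, including the generated S3 credentials. A hedged sketch for pulling them out (the function name is made up, and jq is assumed to be installed):

```shell
# extract_s3_keys: read a radosgw-admin user JSON document on stdin and
# print "access_key secret_key" of the first key. Hypothetical helper;
# requires jq.
extract_s3_keys() {
    jq -r '.keys[0] | "\(.access_key) \(.secret_key)"'
}

# Usage:
# radosgw-admin user create --uid=USERNAME --display-name="Name of user" \
#     | extract_s3_keys
```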

Setting up S3 object storage on Ceph¶

Setup a gateway node with Alpine Linux:
- Change to edge
- Enable testing

Update the firewall to allow access from this node to the ceph monitors

Set up the wildcard DNS certificate

apk add ceph-radosgw

Wildcard DNS certificate from letsencrypt¶

Acquiring and renewing this certificate is currently a manual process, as it requires changing DNS settings.

run certbot

update DNS with the first token

update DNS with the second token

Sample session:

s3:/etc/ceph# certbot certonly --manual --preferred-challenges=dns --email sre@ungleich.ch --server https://acme-v02.api.letsencrypt.org/directory --agree-tos -d *.s3.ungleich.ch -d s3.ungleich.ch
Saving debug log to /var/log/letsencrypt/letsencrypt.log
Plugins selected: Authenticator manual, Installer None
Cert is due for renewal, auto-renewing...
Renewing an existing certificate
Performing the following challenges:
dns-01 challenge for s3.ungleich.ch
dns-01 challenge for s3.ungleich.ch
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
NOTE: The IP of this machine will be publicly logged as having requested this
certificate. If you're running certbot in manual mode on a machine that is not
your server, please ensure you're okay with that.

Are you OK with your IP being logged?
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(Y)es/(N)o: y
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

KxGLZNiVjFwz1ifNheoR_KQoPVpkvRUV1oT2pOvJlU0

Before continuing, verify the record is deployed.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Please deploy a DNS TXT record under the name
_acme-challenge.s3.ungleich.ch with the following value:

bkrhtxWZUipCAL5cBfvrjDuftqsZdQ2JjisiKmXBbaI

Before continuing, verify the record is deployed.
(This must be set up in addition to the previous challenges; do not remove,
replace, or undo the previous challenge tasks yet. Note that you might be
asked to create multiple distinct TXT records with the same name. This is
permitted by DNS standards.)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Press Enter to Continue
Waiting for verification...
Cleaning up challenges

IMPORTANT NOTES:
 - Congratulations! Your certificate and chain have been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/fullchain.pem
   Your key file has been saved at:
   /etc/letsencrypt/live/s3.ungleich.ch/privkey.pem
   Your cert will expire on 2020-12-09. To obtain a new or tweaked
   version of this certificate in the future, simply run certbot
   again. To non-interactively renew *all* of your certificates, run
   "certbot renew"
 - If you like Certbot, please consider supporting our work by:
   Donating to ISRG / Let's Encrypt:   https://letsencrypt.org/donate
   Donating to EFF:                    https://eff.org/donate-le

Debugging ceph¶

ceph status
ceph osd status
ceph osd df
ceph osd utilization
ceph osd pool stats
ceph osd tree
ceph pg stat