There comes a time in a network or storage administrator's career

when a large collection of storage volumes needs to be pooled together

and distributed within a clustered or multiple-client network, while

maintaining high performance with few or no bottlenecks when accessing

the same files. That is where Lustre comes into the picture. The Lustre

filesystem is a high-performance distributed filesystem intended for

larger network and high-availability environments.

The Storage Area Network and Linux

Traditionally, Lustre is configured to manage remote data storage disk

devices within a Storage Area Network (SAN), which is two or more

remotely attached disk devices communicating via a Small Computer System

Interface (SCSI) protocol. This includes Fibre Channel, Fibre Channel

over Ethernet (FCoE), Serial Attached SCSI (SAS) and even iSCSI. To better

explain what a SAN is, it may be more beneficial to begin with what it

isn't. For instance, a SAN shouldn't be confused with a Local Area Network

(LAN), even if that LAN carries storage traffic (that is, via networked filesystem shares and so on). Only if the LAN carries storage traffic using

the iSCSI or FCoE protocols can it then be considered a SAN. Another

thing that a SAN isn't is Network Attached Storage (NAS). Again, the

SAN relies heavily on a SCSI protocol, while the NAS uses the NFS and

SMB/CIFS file-sharing protocols.

An external storage target device will represent storage volumes as

Logical Units within the SAN. Typically, a set of Logical Units will be

mapped across a SAN to an initiator node—in our case, it would be the

server(s) managing the Lustre filesystem. In turn, the server(s) will

identify one or more SCSI disk devices within its SCSI subsystem and treat

them as if they were local drives. The number of SCSI disks identified

is determined by the number of Logical Units mapped to the initiator. If

you want to follow along with the examples here, it is relatively simple

to configure a couple of virtual machines: one as the server node with

one or more additional disk devices to export and the second to act as

a client node and mount the Lustre-enabled volume. Although it is bad

practice, for testing purposes, it also is possible to have a single

virtual machine configured as both server and client.
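To confirm which SCSI disk devices the server node actually sees, whether SAN-mapped Logical Units or virtual disks attached to a VM, you can query the SCSI subsystem. A couple of generic commands (output omitted here) are:

$ cat /proc/scsi/scsi
$ sudo fdisk -l

Each mapped Logical Unit should appear as an additional /dev/sd* device on the initiator.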

SCSI

SCSI is an ANSI-standardized hardware and software computing

interface adopted by all early storage manufacturers. Revised editions

of the standard continue to be used today.

The Distributed Filesystem

A distributed filesystem allows access to files from multiple hosts

sharing a computer network. This makes it possible for multiple

users on multiple client nodes to share files and storage resources. The

client nodes do not have direct access to the underlying block storage

but interact over the network using a protocol and, thus, make it possible

to restrict access to the filesystem depending on access lists or

capabilities on both the servers and the clients, unlike a clustered filesystem, where all nodes have equal access to the block storage where the

filesystem is located. On these systems, the access control must reside on

the client. Other advantages of distributed filesystems include

facilities for transparent replication and fault tolerance, so when a

limited number of nodes goes off-line, the system continues to work

without any data loss.

Lustre (or Linux Cluster) is one such distributed filesystem,

usually deployed for large-scale cluster computing. Licensed under

the GNU General Public License (or GPL), Lustre provides a solution in

which high performance and scalability to tens of thousands of nodes

and petabytes of storage become a reality, while remaining relatively simple to

deploy and configure. Despite the fact that Lustre 2.0 has been released,

for this article, I work with the generally available 1.8.5.

Lustre has a distinctive architecture, with three major

functional units. One is a single metadata server or MDS that contains

a single metadata target or MDT for each Lustre filesystem. This

stores namespace metadata, which includes filenames, directories, access

permissions and file layout. The MDT data is stored in a single disk filesystem mapped locally to the serving node and is a dedicated filesystem

that controls file access and informs the client node(s) which object(s)

make up a file. Second are one or more object storage servers (OSSes) that

store file data on one or more object storage targets (OSTs). An OST is a

dedicated object-based filesystem exported for read/write operations. The

capacity of a Lustre filesystem is determined by the sum of the total

capacities of the OSTs. Finally, there are the clients that access and

use the file data.

Lustre presents all clients with a unified namespace

for all of the files and data in the filesystem, allowing concurrent

and coherent read and write access to the files in the filesystem. When

a client accesses a file, it completes a filename lookup on the MDS,

and either a new file is created or the layout of an existing file is

returned to the client. Locking the file on the OST, the client will

then run one or more read or write operations to the file but will not

directly modify the objects on the OST. Instead, it will delegate tasks to

the OSS. This approach will ensure scalability and improved security and

reliability, as clients never access the underlying storage directly,

which reduces the risk of filesystem corruption from misbehaving or defective clients. Although all three components (MDT, OST and client)

can run on the same node, they typically are configured on separate

nodes communicating over a network (see the details on LNET later in this

article). In

this example, I'm running the MDT and OST on a single server node

while the client will be accessing the OST from a separate node.

Installing Lustre

To obtain Lustre 1.8.5, download the prebuilt binaries packaged in RPMs,

or download the source and build the modules and utilities for your

respective Linux distribution. Oracle provides server RPM packages

for both Oracle Enterprise Linux (OEL) 5 and Red Hat Enterprise Linux

(RHEL) 5, while also providing client RPM packages for OEL 5, RHEL 5

and SUSE Linux Enterprise Server (SLES) 10 and 11. If you will be building

Lustre from source, ensure that you are using a Linux

kernel 2.6.16 or greater. Note that in all deployments of Lustre, the

node that runs as an MDS, MGS (discussed below) or OSS must use a

patched kernel. Running a patched kernel on a Lustre client is optional

and required only if the client will be used for multiple purposes,

such as running as both a client and an OST.

If you already have a supported operating system,

make sure that the patched kernel, lustre-modules, lustre-ldiskfs (a

Lustre-patched backing filesystem kernel module package for the ext3 filesystem), lustre (which includes userspace utilities to configure and run

Lustre) and e2fsprogs packages are installed on the host system while

also resolving their dependencies from a local or remote repository. Use

the rpm command to install all necessary packages:

$ sudo rpm -ivh kernel-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-modules-1.8.4-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-ldiskfs-3.1.3-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-1.8.4-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh e2fsprogs-1.41.10.sun2-0redhat.oel5.i386.rpm

After these packages have been installed, list the boot directory to

reveal the newly installed patched Linux kernel:

[petros@lustre-host ~]$ ls /boot/
config-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
grub
initrd-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.img
lost+found
symvers-2.6.18-194.3.1.0.1.el5_lustre.1.8.4.gz
System.map-2.6.18-194.3.1.0.1.el5_lustre.1.8.4
vmlinuz-2.6.18-194.3.1.0.1.el5_lustre.1.8.4

View the /boot/grub/grub.conf file to validate that the newly installed

kernel has been set as the default kernel. Now that all packages have

been installed, a reboot is required to load the new kernel image. Once

the system has been rebooted, an invocation of the uname command will

reveal the currently booted kernel image:

[petros@lustre-host ~]$ uname -a
Linux lustre-host 2.6.18-194.3.1.0.1.el5_lustre.1.8.4 #1 SMP Mon Jul 26 22:12:56 MDT 2010 i686 i686 i386 GNU/Linux

Meanwhile, on the client side, the packages for the lustre client

(utilities for patchless clients) and lustre client modules (modules for

patchless clients) need to be installed on all desired client machines:

$ sudo rpm -ivh lustre-client-1.8.4-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm
$ sudo rpm -ivh lustre-client-modules-1.8.4-2.6.18_194.3.1.0.1.el5_lustre.1.8.4.i686.rpm

Note that these client machines need to be within the same network

as the host machine serving the Lustre filesystem. After the packages

are installed, reboot all affected client machines.
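Before moving on, it may be worth verifying that the client modules actually are available after the reboot. A quick sanity check, assuming the packages installed cleanly, is to load the client module and list it:

[petros@client1 ~]$ sudo modprobe lustre
[petros@client1 ~]$ lsmod | grep lustre

If the module loads without error, the client is ready to be pointed at the server.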

Configuring Lustre Server

In order to configure the Lustre filesystem, you need to configure

Lustre Networking, or LNET, which provides the communication

infrastructure required by the Lustre filesystem. LNET supports

many commonly used network types, which include InfiniBand and IP

networks. It allows simultaneous availability across multiple network

types with routing between them. In this example, let's use

tcp, so use your favorite editor to append the following line to the

/etc/modprobe.conf file:

options lnet networks=tcp

This step restricts LNET to the specified network type (here, TCP)

rather than letting it use every available network interface.
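If the server has multiple network interfaces, the Lustre manual notes that the same option can bind a named LNET network to a specific interface. A hypothetical variant, assuming the desired interface is eth0, would be:

options lnet networks=tcp0(eth0)

This keeps Lustre traffic off any other interfaces the host may have.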

Before moving on, it is important to discuss the role of the Management

Server or MGS. The MGS stores configuration information for all Lustre

filesystems in a clustered setup. An OST contacts the MGS to

provide information, while the client(s) contact the MGS to retrieve

information. The MGS requires its own disk for storage, although there is

a provision that allows the MGS to share a disk with a single MDT. Type

the following command to create a combined MGS/MDT node:

[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre --fsname=lustre --mgs --mdt /dev/sda1

   Permanent disk data:
Target:     lustre-MDTffff
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x75
            (MDT MGS needs_index first_time update )
Persistent mount opts: iopen_nopriv,user_xattr,errors=remount-ro
Parameters: mdt.group_upcall=/usr/sbin/l_getgroups

device size = 1019MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sda1
        target name  lustre-MDTffff
        4k blocks    261048
        options      -i 4096 -I 512 -q -O dir_index,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-MDTffff -i 4096 -I 512 -q -O dir_index,uninit_groups -F /dev/sda1 261048
Writing CONFIGS/mountdata

If no name is provided, the fsname defaults to lustre. If more than one

of these filesystems is created, each must be given a unique name. These

names become very important when you access the target from the client

system.
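For instance, a second Lustre filesystem could be formatted with its own name. A hypothetical MDT for a filesystem called scratch, registering with the MGS assumed reachable at 10.0.2.15@tcp0 (the same address used in the OST example below), might be created like this; the device name is illustrative:

$ sudo /usr/sbin/mkfs.lustre --fsname=scratch --mdt --mgsnode=10.0.2.15@tcp0 /dev/sdb1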

Create the OST by executing the following command:

[petros@lustre-host ~]$ sudo /usr/sbin/mkfs.lustre --ost --fsname=lustre --mgsnode=10.0.2.15@tcp0 /dev/sda2

   Permanent disk data:
Target:     lustre-OSTffff
Index:      unassigned
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x72
            (OST needs_index first_time update )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.0.2.15@tcp

checking for existing Lustre data: not found
device size = 1027MB
2 6 18
formatting backing filesystem ldiskfs on /dev/sda2
        target name  lustre-OSTffff
        4k blocks    263064
        options      -J size=40 -i 16384 -I 256 -q -O dir_index,extents,uninit_groups -F
mkfs_cmd = mke2fs -j -b 4096 -L lustre-OSTffff -J size=40 -i 16384 -I 256 -q -O dir_index,extents,uninit_groups -F /dev/sda2 263064
Writing CONFIGS/mountdata

Both the target (when it registers information with the MGS) and the

client (when it looks that information up) need to know where the MGS

is, which you have defined for this target as the server's IP address

followed by the network type used for communication. This is the same

tcp network defined earlier in the /etc/modprobe.conf file.
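If you want to confirm the NID (the address@network identifier) this server will be known by on LNET, lctl can report it once LNET is up; the output shown is what you would expect given this example's address:

[petros@lustre-host ~]$ sudo lctl network up
[petros@lustre-host ~]$ sudo lctl list_nids
10.0.2.15@tcp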

Now you easily can mount these newly formatted devices local to the host

machine. Mount the MDT:

[petros@lustre-host ~]$ sudo mkdir -p /mnt/mdt
[petros@lustre-host ~]$ sudo mount -t lustre /dev/sda1 /mnt/mdt/

Verify that it is mounted with the df command:

[petros@lustre-host ~]$ df -t lustre
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1               913560     17752    843600   3% /mnt/mdt

The /var/log/messages file will log the following messages:

Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000: new disk, initializing
Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000: Now serving lustre-MDT0000 on /dev/sda1 with recovery enabled
Jun 25 10:26:54 lustre-host kernel: Lustre: 3285:0:(lproc_mds.c:271:lprocfs_wr_group_upcall()) lustre-MDT0000: group upcall set to /usr/sbin/l_getgroups
Jun 25 10:26:54 lustre-host kernel: Lustre: lustre-MDT0000.mdt: set parameter group_upcall=/usr/sbin/l_getgroups

Mount the OST:

[petros@lustre-host ~]$ sudo mkdir -p /mnt/ost
[petros@lustre-host ~]$ sudo mount -t lustre /dev/sda2 /mnt/ost/

Verify that it is mounted with the df command:

[petros@lustre-host ~]$ df -t lustre
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1               913560     17768    843584   3% /mnt/mdt
/dev/sda2              1035692     42460    940620   5% /mnt/ost

The /var/log/messages file will log the following messages:

Jun 25 10:41:33 lustre-host kernel: Lustre: lustre-OST0000: new disk, initializing
Jun 25 10:41:33 lustre-host kernel: Lustre: lustre-OST0000: Now serving lustre-OST0000 on /dev/sda2 with recovery enabled
Jun 25 10:41:39 lustre-host kernel: Lustre: 3417:0:(mds_lov.c:1155:mds_notify()) MDS lustre-MDT0000: add target lustre-OST0000_UUID
Jun 25 10:41:39 lustre-host kernel: Lustre: 3188:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/1 OSTs are active, abort quota recovery
Jun 25 10:41:39 lustre-host kernel: Lustre: lustre-OST0000: received MDS connection from 0@lo
Jun 25 10:41:39 lustre-host kernel: Lustre: MDS lustre-MDT0000: lustre-OST0000_UUID now active, resetting orphans

Even though they are mounted like typical filesystems, you will notice

that the MDT- and OST-labeled volumes do not behave like typical

filesystems. That is where the client comes into play:

[petros@lustre-host ~]$ ls /mnt/ost/
ls: /mnt/ost/: Not a directory
[petros@lustre-host ~]$ sudo ls /mnt/ost/
ls: /mnt/ost/: Not a directory
[petros@lustre-host ~]$ sudo ls -l /mnt/
total 4
d--------- 1 root root 0 Jun 25 10:22 ost
d--------- 1 root root 0 Jun 25 10:22 mdt

Configuring Lustre Client(s)

Remember that you named the Lustre-enabled volume lustre. When you mount

the volume over the network on the client node, you must specify this

name. In the command below, it follows the network type (tcp0) used

to mount the remote volume. Again, TCP was defined in the

/etc/modprobe.conf file as the supported LNET network:

[petros@client1 ~]$ sudo mkdir -p /lustre
[petros@client1 ~]$ sudo mount -t lustre 10.0.2.15@tcp0:/lustre /lustre/

After successfully mounting the remote volume, you will see the

/var/log/messages file append:

Jun 25 10:44:17 client1 kernel: Lustre: Client lustre-client has started

Use df to list the mounted Lustre-enabled volumes:

[petros@client1 ~]$ df -t lustre
Filesystem             1K-blocks      Used Available Use% Mounted on
10.0.2.15@tcp0:/lustre   1035692     42460    940556   5% /lustre
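If the client should mount the filesystem automatically at boot, an /etc/fstab entry along these lines should work; the _netdev option defers the mount until networking is available (adjust the address and mountpoint to your environment):

10.0.2.15@tcp0:/lustre  /lustre  lustre  defaults,_netdev  0 0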

Once mounted, the filesystem can be accessed by the client node. For

instance, you now can read from and write to files located on the mounted

OST. In the following example, let's write approximately 40MB of

data to a new file on /lustre:

[petros@client1 lustre]$ sudo dd if=/dev/zero of=/lustre/test.dat bs=4M count=10
10+0 records in
10+0 records out
41943040 bytes (42 MB) copied, 1.30103 seconds, 32.2 MB/s

A df listing will reflect the changes of available and used capacities

to the /lustre mountpoint:

[petros@client1 lustre]$ df -t lustre
Filesystem             1K-blocks      Used Available Use% Mounted on
10.0.2.15@tcp0:/lustre   1035692     83420    899596   9% /lustre
[petros@client1 lustre]$ ls -l
total 40960
-rw-r--r-- 1 root root 41943040 Jun 25 10:47 test.dat
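The client-side lfs utility also provides filesystem-aware views; for example, lfs df breaks usage down per MDT and OST, and lfs getstripe shows how a file is laid out across OSTs (the exact output will depend on your configuration):

[petros@client1 lustre]$ lfs df -h
[petros@client1 lustre]$ lfs getstripe /lustre/test.dat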

Summary

Although I covered a simple introduction and tutorial of the Lustre

distributed filesystem, there still is so much uncharted territory

and methods by which the filesystem can be configured for a truly

rich computing environment. For instance, Lustre can be configured

for high availability to ensure that in the situation of failure,

the system's services continue without interruption.

That is, the accessibility between the client(s), server(s) and external

target storage is always made available through a process called failover.

Additional HA is provided by implementing disk drive redundancy in

some form of RAID (hardware, or software via mdadm) to protect against

drive failures. These high-availability techniques normally apply to the server nodes hosting

the MDS and OSS. It is suggested to place the OST storage on a RAID 5 or,

preferably, RAID 6, while the MDT storage should be RAID 1 or RAID 0+1.
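As a rough sketch of the software route, a RAID 6 array for OST backing storage could be assembled with mdadm before formatting it with mkfs.lustre; the device names here are placeholders:

$ sudo mdadm --create /dev/md0 --level=6 --raid-devices=4 /dev/sdc /dev/sdd /dev/sde /dev/sdf

The resulting /dev/md0 then could be passed to mkfs.lustre --ost in place of a raw disk partition.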

It isn't a difficult technology to work with, and there exists an

entire community with excellent administrator and developer resources,

ranging from articles to mailing lists, to aid the

novice in installing and maintaining a Lustre-enabled system. Commercial

support for Lustre is made available by a non-exhaustive list of vendors

selling bundled computing and Lustre storage systems. Many of these same

vendors also are contributing to the Open Source community surrounding

the Lustre Project.

Failover

Failover is a process with the capability to switch over automatically

to a redundant or standby computer server, system or network upon the

failure or abnormal termination of the previously active application,

server, system or network. Many failover solutions exist

on the Linux platform to cover both SAN and LAN environments.

Resources

Lustre Project Page: http://wiki.lustre.org/index.php

Wikipedia: Lustre: http://en.wikipedia.org/wiki/Lustre_(file_system)

Wikipedia: Distributed File Systems: http://en.wikipedia.org/wiki/Distributed_file_system