|Initial release||December 16, 2003|
2.11.50 / April 3, 2018
|Operating system||Linux kernel|
|Type||Distributed file system|
|Founder||Peter J. Braam|
|Phil Schwan, Eric Barton (HPC), Andreas Dilger|
|Products||Lustre file system|
|Introduced||December, 2003 with Linux|
|Directory contents||Hash, Interleaved Hash with DNE in 2.7+|
|Min. volume size||32 MB|
|Max. volume size||100 PB (production), over 16 EB (theoretical)|
|Max. file size||3.2 PB (ext4), 16 EB (ZFS)|
|File size granularity||4 KB|
|Max. number of files||Per Metadata Target (MDT): 4 billion files (ldiskfs backend), 256 trillion files (ZFS backend), up to 128 MDTs per filesystem|
|Max. filename length||255 bytes|
|Max. dirname length||255 bytes|
|Max. directory depth||4096 bytes|
|Allowed characters in filenames||All bytes except NUL ('\0') and '/' and the special file names "." and ".."|
|Dates recorded||modification (mtime), attribute modification (ctime), access (atime), delete (dtime), create (crtime)|
|Date range||2^34 bits (ext4), 2^64 bits (ZFS)|
|Date resolution||1 s|
|Attributes||32bitapi, acl, checksum, flock, lazystatfs, localflock, lruresize, noacl, nochecksum, noflock, nolazystatfs, nolruresize, nouser_fid2path, nouser_xattr, user_fid2path, user_xattr|
|File system permissions||POSIX, POSIX.1e ACL, SELinux|
|Transparent compression||y (ZFS only)|
|Transparent encryption||y (network only)|
|Data deduplication||y (ZFS only)|
|Copy-on-write||y (ZFS only)|
|Supported operating systems||Linux kernel|
Lustre is a type of parallel distributed file system, generally used for large-scale cluster computing. The name Lustre is a portmanteau word derived from Linux and cluster. Lustre file system software is available under the GNU General Public License (version 2 only) and provides high performance file systems for computer clusters ranging in size from small workgroup clusters to large-scale, multi-site clusters.
Because Lustre file systems have high performance capabilities and open licensing, it is often used in supercomputers. Since June 2005, it has consistently been used by at least half of the top ten, and more than 60 of the top 100 fastest supercomputers in the world, including the world's No. 2 and No. 3 ranked TOP500 supercomputers in 2014, Titan and Sequoia.
Lustre file systems are scalable and can be part of multiple computer clusters with tens of thousands of client nodes, tens of petabytes (PB) of storage on hundreds of servers, and more than a terabyte per second (TB/s) of aggregate I/O throughput. This makes Lustre file systems a popular choice for businesses with large data centers, including those in industries such as meteorology, simulation, oil and gas, life science, rich media, and finance.
The Lustre file system architecture was started as a research project in 1999 by Peter J. Braam, who was on the staff of Carnegie Mellon University (CMU) at the time. Braam went on to found his own company Cluster File Systems in 2001, starting from work on the InterMezzo file system in the Coda project at CMU. Lustre was developed under the Accelerated Strategic Computing Initiative Path Forward project funded by the United States Department of Energy, which included Hewlett-Packard and Intel. In September 2007, Sun Microsystems acquired the assets of Cluster File Systems Inc. including its intellectual property. Sun included Lustre with its high-performance computing hardware offerings, with the intent to bring Lustre technologies to Sun's ZFS file system and the Solaris operating system. In November 2008, Braam left Sun Microsystems, and Eric Barton and Andreas Dilger took control of the project. In 2010 Oracle Corporation, by way of its acquisition of Sun, began to manage and release Lustre.
In December 2010, Oracle announced they would cease Lustre 2.x development and place Lustre 1.8 into maintenance-only support creating uncertainty around the future development of the file system. Following this announcement, several new organizations sprang up to provide support and development in an open community development model, including Whamcloud,Open Scalable File Systems, Inc. (OpenSFS), EUROPEAN Open File Systems (EOFS) and others. By the end of 2010, most Lustre developers had left Oracle. Braam and several associates joined the hardware-oriented Xyratex when it acquired the assets of ClusterStor, while Barton, Dilger, and others formed software startup Whamcloud, where they continued to work on Lustre.
In August 2011, OpenSFS awarded a contract for Lustre feature development to Whamcloud. This contract covered the completion of features, including improved Single Server Metadata Performance scaling, which allows Lustre to better take advantage of many-core metadata server; online Lustre distributed filesystem checking (LFSCK), which allows verification of the distributed filesystem state between data and metadata servers while the filesystem is mounted and in use; and Distributed Namespace (DNE), formerly Clustered Metadata (CMD), which allows the Lustre metadata to be distributed across multiple servers. Development also continued on ZFS-based back-end object storage at Lawrence Livermore National Laboratory. These features were in the Lustre 2.2 through 2.4 community release roadmap. In November 2011, a separate contract was awarded to Whamcloud for the maintenance of the Lustre 2.x source code to ensure that the Lustre code would receive sufficient testing and bug fixing while new features were being developed.
In July 2012 Whamcloud was acquired by Intel, after Whamcloud won the FastForward DOE contract to extend Lustre for exascale computing systems in the 2018 timeframe.OpenSFS then transitioned contracts for Lustre development to Intel.
In February 2013, Xyratex Ltd., announced it acquired the original Lustre trademark, logo, website and associated intellectual property from Oracle. In June 2013, Intel began expanding Lustre usage beyond traditional HPC, such as within Hadoop. For 2013 as a whole, OpenSFS announced request for proposals (RFP) to cover Lustre feature development, parallel file system tools, addressing Lustre technical debt, and parallel file system incubators.OpenSFS also established the Lustre Community Portal, a technical site that provides a collection of information and documentation in one area for reference and guidance to support the Lustre open source community. On April 8, 2014, Ken Claffey announced that Xyratex/Seagate is donating the lustre.org domain back to the user community, and was completed in March, 2015.
Lustre 1.0.0 was released in December 2003, and provided basic Lustre filesystem functionality, including server failover and recovery.
Lustre 1.2.0, released in March 2004, worked on Linux kernel 2.6, and had a "size glimpse" feature to avoid lock revocation on files undergoing write, and client side data write-back cache accounting (grant).
Lustre 1.6.0, released in April 2007, allowed mount configuration ("mountconf") allowing servers to be configured with "mkfs" and "mount", allowed dynamic addition of object storage targets (OSTs), enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers, and provided free space management for object allocations.
Lustre 1.8.0, released in May 2009, provided OSS Read Cache, improved recovery in the face of multiple failures, added basic heterogeneous storage management via OST Pools, adaptive network timeouts, and version-based recovery. It was a transition release, being interoperable with both Lustre 1.6 and Lustre 2.0.
Lustre 2.0, released in August 2010, was based on significant internally restructured code to prepare for major architectural advancements. Lustre 2.x clients cannot interoperate with 1.8 or earlier servers. However, Lustre 1.8.6 and later clients can interoperate with Lustre 2.0 and later servers. The Metadata Target (MDT) and OST on-disk format from 1.8 can be upgraded to 2.0 and later without the need to reformat the filesystem.
Lustre 2.1, released in September 2011, was a community-wide initiative in response to Oracle suspending development on Lustre 2.x releases. It added the ability to run servers on Red Hat Linux 6 and increased the maximum ext4-based OST size from 24 TB to 128 TB, as well as a number of performance and stability improvements. Lustre 2.1 servers remained inter-operable with 1.8.6 and later clients.
Lustre 2.2, released in March 2012, focused on providing metadata performance improvements and new features. It added parallel directory operations allowing multiple clients to traverse and modify a single large directory concurrently, faster recovery from server failures, increased stripe counts for a single file (across up to 2000 OSTs), and improved single-client directory traversal performance.
Lustre 2.3, released in October 2012, continued to improve the metadata server code to remove internal locking bottlenecks on nodes with many CPU cores (over 16). The object store added a preliminary ability to use ZFS as the backing file system. The Lustre File System ChecK (LFSCK) feature can verify and repair the MDS Object Index (OI) while the file system is in use, after a file-level backup/restore or in case of MDS corruption. The server-side IO statistics were enhanced to allow integration with batch job schedulers such as SLURM to track per-job statistics. Client-side software was updated to work with Linux kernels up to version 3.0.
Lustre 2.4, released in May 2013, added a considerable number of major features, many funded directly through OpenSFS. Distributed Namespace (DNE) allows horizontal metadata capacity and performance scaling for 2.4 clients, by allowing subdirectory trees of a single namespace to be located on separate MDTs. ZFS can now be used as the backing filesystem for both MDT and OST storage. The LFSCK feature added the ability to scan and verify the internal consistency of the MDT FID and LinkEA attributes. The Network Request Scheduler (NRS) adds policies to optimize client request processing for disk ordering or fairness. Clients can optionally send bulk RPCs up to 4 MB in size. Client-side software was updated to work with Linux kernels up to version 3.6, and is still interoperable with 1.8 clients.
Lustre 2.5, released in October 2013, added the highly anticipated feature, Hierarchical Storage Management (HSM). A core requirement in enterprise environments, HSM allows customers to easily implement tiered storage solutions in their operational environment. This release is the current OpenSFS-designated Maintenance Release branch of Lustre. The most recent maintenance version is 2.5.3 and was released in September 2014.
Lustre 2.6, released in July 2014, was a more modest release feature wise, adding LFSCK functionality to do local consistency checks on the OST as well as consistency checks between MDT and OST objects. Single-client IO performance was improved over the previous releases. This release also added a preview of DNE striped directories, allowing single large directories to be stored on multiple MDTs to improve performance and scalability.
Lustre 2.7, released in March 2015, added LFSCK functionality to verify DNE consistency of remote and striped directories between multiple MDTs. Dynamic LNet Config adds the ability to configure and modify LNet network interfaces, routes, and routers at runtime. A new evaluation feature was added for UID/GID mapping for clients with different administrative domains, along with improvements to the DNE striped directory functionality.
Lustre 2.8, released in March 2016, finished the DNE striped directory feature, including support for migrating directories between MDTs, and cross-MDT hard link and rename. As well, it included improved support for Security-Enhanced Linux (SELinux) on the client, Kerberos authentication and RPC encryption over the network, and performance improvements for LFSCK.
Lustre 2.9 was released in December 2016 and included a number of features related to security and performance. The Shared Secret Key security flavour uses the same GSSAPI mechanism as Kerberos to provide client and server node authentication, and RPC message integrity and security (encryption). The Nodemap feature allows categorizing client nodes into groups and then mapping the UID/GID for those clients, allowing remotely-administered clients to transparently use a shared filesystem without having a single set of UID/GIDs for all client nodes. The subdirectory mount feature allows clients to mount a subset of the filesystem namespace from the MDS. This release also added support for up to 16MiB RPCs for more efficient I/O submission to disk, and added the
ladvise interface to allow clients to provide I/O hints to the servers to prefetch file data into server cache or flush file data from server cache. There was improved support for specifying filesystem-wide default OST pools, and improved inheritance of OST pools in conjunction with other file layout parameters.
Lustre 2.10 was released in July 2017 and has a number of significant improvements. The LNet Multi-Rail feature allows bonding multiple network interfaces (InfiniBand, OPA, and/or Ethernet) on a client and server to increase aggregate I/O bandwidth. File layouts can now be constructed of multiple components, based on the file offset, which allow different parameters such as stripe count, OST pool, etc. to be determined based on the file size. The NRS Token Bucket Filter (TBF) server-side scheduler has implemented new rule types, including RPC-type scheduling and the ability to specify multiple parameters such as JobID and NID for rule matching. Tools for managing ZFS snapshots of Lustre filesystems have been added, to simplify the creation, mounting, and management of MDT and OST ZFS snapshots as separate Lustre mountpoints.
Lustre 2.11 was released in April 2018 and contains two significant new features, and several smaller features. The File Level Redundancy (FLR) feature expands on the 2.10 PFL implementation, adding the ability to specify mirrored file layouts for improved availability in case of storage or server failure. The Data-on-MDT feature allows small (few MiB) files to be stored on the MDT to leverage typical flash-based RAID-10 storage for lower latency and reduced IO contention, instead of the typical HDD RAID-6 storage used on OSTs. As well, the LNet Dynamic Discovery feature allows auto-configuration of Multi-Rail between peers that share an LNet network. The LDLM Lock Ahead feature allows appropriately-modified applications and libraries to pre-fetch DLM extent locks from the OSTs for files, if the application knows (or predicts) that this file extent will be modified in the near future, which can reduce lock contention for multiple clients writing to the same file.
A Lustre file system has three major functional units:
The MDT, OST, and client may be on the same node (usually for testing purposes), but in typical production installations these devices are on separate nodes communicating over a network. Each MDT and OST may be part of only a single filesystem, though it is possible to have multiple MDTs or OSTs on a single node that are part of different filesystems. The Lustre Network (LNet) layer can use several types of network interconnects, including native InfiniBand verbs, TCP/IP on Ethernet, Omni-Path, and other proprietary network technologies such as the Cray Gemini interconnect. In Lustre 2.3 and earlier, Myrinet, Quadrics, Cray SeaStar and RapidArray networks were also supported, but these network drivers were deprecated when these networks were no longer commercially available, and support was removed completely in Lustre 2.8. Lustre will take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.
The storage used for the MDT and OST backing filesystems is normally provided by hardware RAID devices, though will work with any block devices. Since Lustre 2.4, the MDT and OST can also use ZFS for the backing filesystem, allowing them to effectively use JBOD storage instead of hardware RAID devices. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by the backing filesystem and return this data to the clients. This allows Lustre to take advantage of improvements and features in the underlying filesystem, such as compression and data checksums in ZFS. Clients do not have any direct access to the underlying storage.
An OST is a dedicated filesystem that exports an interface to byte ranges of file objects for read/write operations. An MDT is a dedicated filesystem that stores inodes, attributes, and directories; controls file access, and tells clients the layout of the object(s) that make up each regular file. MDTs and OSTs currently use either an enhanced version of ext4 called ldiskfs, or ZFS/DMU for back-end data storage to store files/objects using the open source ZFS-on-Linux port.
When a client accesses a file, it performs a filename lookup on the MDS. When the MDS filename lookup is complete, either the layout of an existing file is returned to the client or a new file is created on behalf of the client, if requested. For read or write operations, the client then interprets the file layout in the logical object volume (LOV) layer, which maps the file offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSS nodes. With this approach, bottlenecks for client-to-OSS communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
After the initial lookup of the file layout, the MDS is not involved in file IO operations since all block allocation is managed internally by the OST. Clients do not directly modify the objects on the OST filesystems, but instead delegate this task to OSS nodes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to all of the underlying storage by all of the clients in the filesystem, which requires a large back-end SAN attached to all clients, and increases the risk of filesystem corruption from misbehaving/defective clients.
In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach is used in the Blue Gene installation at Lawrence Livermore National Laboratory.
Another approach used in the early years of Lustre is the liblustre library on the Cray XT3 using the Catamount operating system on systems such as Sandia Red Storm, which provided userspace applications with direct filesystem access. Liblustre was a user-level library that allows computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors could access a Lustre file system even if the service node on which the job was launched is not a Linux client. Liblustre allowed data movement directly between application space and the Lustre OSSs without requiring an intervening data copy through the kernel, thus providing access from computational processors to the Lustre file system directly in a constrained operating environment. The liblustre functionality was deleted from Lustre 2.7.0 after having been disabled since Lustre 2.6.0, and was untested since Lustre 2.3.0.
In a traditional Unix disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDTs point to one or more OST objects associated with the file rather than to data blocks. These objects are implemented as files on the OSTs. When a client opens a file, the file open operation transfers a set of object identifiers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored. This allows the client to perform I/O in parallel across all of the OST objects in the file without further communication with the MDS.
If only one OST object is associated with an MDT inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is "striped" in chunks in a round-robin manner across the OST objects similar to RAID 0. Striping a file over multiple OST objects provides significant performance benefits if there is a need for high bandwidth access to a single large file. When striping is used, the maximum file size is not limited by the size of a single target. Capacity and aggregate I/O bandwidth scale with the number of OSTs a file is striped over. Also, since the locking of each object is managed independently for each OST, adding more stripes (one per OST) scales the file I/O locking capacity of the file proportionately. Each file created in the filesystem may specify different layout parameters, such as the stripe count (number of OST objects making up that file), stripe size (unit of data stored on each OST before moving to the next), and OST selection, so that performance and capacity can be tuned optimally for each file. When many application threads are reading or writing to separate files in parallel, it is optimal to have a single stripe per file, since the application is providing its own parallelism. When there are many threads reading or writing a single large file concurrently, then it is optimal to have one stripe on each OST to maximize the performance and capacity of that file.
In the Lustre 2.10 release, the ability to specify composite layouts was added to allow files to have different layout parameters for different regions of the file. The Progressive File Layout (PFL) feature uses composite layouts to improve file IO performance over a wider range of workloads, as well as simplify usage and administration. For example, a small PFL file can have a single stripe for low access overhead, while larger files can have many stripes for high aggregate bandwidth and better OST load balancing. The composite layouts are further enhanced in the 2.11 release with the File Level Redundancy (FLR) feature, which allows a file to have multiple overlapping layouts for a file, providing RAID 0+1 redundancy for these files as well as improved read performance.
When multiple MDTs are configured for a filesystem, the client determines on which MDT a given filename resides on by looking up the 128-bit Lustre File Identifier (FID, composed of the Sequence number and Object ID) of the parent directory in the FID Location Database (FLDB), which maps the Sequence to a specific MDT. Once the MDT of the parent directory is determined, further directory operations (for non-striped directories) take place exclusively on that MDT, avoiding contention between MDTs. For striped directories, the per-directory layout stored on the parent directory provides a hash function and list of MDT objects across which the directory is distributed. The filename is hashed and mapped to a specific MDT shard, which will handle further operations on that file in an identical manner to a non-striped directory. For readdir operations, the entries from each directory shard are returned interleaved in hash order so that a single 64-bit cookie can be used to determine the current offset within the directory.
The Lustre distributed lock manager (LDLM), implemented in the OpenVMS style, protects the integrity of each file's data and metadata. Access and modification of a Lustre file is completely cache coherent among all of the clients. Metadata locks are managed by the MDT that stores the inode for the file, using FID as the resource name. The metadata locks are split into separate bits that protect the lookup of the file (file owner and group, permission and mode, and access control list (ACL)), the state of the inode (directory size, directory contents, link count, timestamps), layout (file striping, since Lustre 2.4), and extended attributes (xattrs, since Lustre 2.5). A client can fetch multiple metadata lock bits for a single inode with a single RPC request, but currently they are only ever granted a read lock for the inode. The MDS manages all modifications to the inode in order to avoid lock resource contention and is currently the only node that gets write locks on inodes.
File data locks are managed by the OST on which each object of the file is striped, using byte-range extent locks. Clients can be granted overlapping read extent locks for part or all of the file, allowing multiple concurrent readers of the same file, and/or non-overlapping write extent locks for independent regions of the file. This allows many Lustre clients to access a single file concurrently for both read and write, avoiding bottlenecks during file I/O. In practice, because Linux clients manage their data cache in units of pages, the clients will request locks that are always an integer multiple of the page size (4096 bytes on most clients). When a client is requesting an extent lock the OST may grant a lock for a larger extent than originally requested, in order to reduce the number of lock requests that the client makes. The actual size of the granted lock depends on several factors, including the number of currently-granted locks on that object, whether there are conflicting write locks for the requested lock extent, and the number of pending lock requests on that object. The granted lock is never smaller than the originally-requested extent. OST extent locks use the Lustre FID of the object as the resource name for the lock. Since the number of extent lock servers scales with the number of OSTs in the filesystem, this also scales the aggregate locking performance of the filesystem, and of a single file if it is striped over multiple OSTs.
In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNet), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre MDS and OSS server nodes using direct attached storage (SAS, FC, iSCSI) or traditional storage area network (SAN) technologies.
LNet can use many commonly-used network types, such as InfiniBand and TCP (commonly Ethernet) networks, and allows simultaneous availability across multiple network types with routing between them. Remote Direct Memory Access (RDMA) is permitted when available on the underlying networks such as InfiniBand, RoCE, iWARP, and Omni-Path. High availability and recovery features enable transparent recovery in conjunction with failover servers.
LNet provides end-to-end throughput over Gigabit Ethernet networks in excess of 100 MB/s, throughput up to 3 GB/s using InfiniBand quad data rate (QDR) links, and throughput over 1 GB/s across 10 Gigabit Ethernet interfaces.
Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, experiencing a delay while the backup server takes over the storage.
Lustre MDSes are configured as an active/passive pair, or one or more active/active MDS pairs with DNE, while OSSes are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS for one filesystem is the MGS and/or monitoring node, or the active MDS for another file system, so no nodes are idle in the cluster.
Lustre provides the capability to have a primary tier and an archive tier of storage. The ability to migrate things from the, typically more expensive, primary tier to the, typically less performant, archive tier and leave a stub so that files can be restored to primary is an important part of balancing price and performance.
HSM includes some additional Lustre components:
HSM also defines new states including: 
Also, there are new flags associated with HSM files: 
<this section needs expansion>
Lustre is used by many of the TOP500 supercomputers and large multi-cluster sites. Six of the top 10 and more than 60 of the top 100 supercomputers use Lustre file systems. These include: K computer at the RIKEN Advanced Institute for Computational Science, the Tianhe-1A at the National Supercomputing Center in Tianjin, China, the Jaguar and Titan at Oak Ridge National Laboratory (ORNL), Blue Waters at the University of Illinois, and Sequoia and Blue Gene/L at Lawrence Livermore National Laboratory (LLNL).
There are also large Lustre filesystems at the National Energy Research Scientific Computing Center, Pacific Northwest National Laboratory, Texas Advanced Computing Center, Brazilian National Laboratory of Scientific Computing, and NASA in North America, in Asia at Tokyo Institute of Technology, in Europe at CEA, and others.
Commercial technical support for Lustre is often bundled along with the computing system or storage hardware sold by the vendor. Some vendors include Cray,Dell,Hewlett-Packard (as the HP StorageWorks Scalable File Share, circa 2004 through 2008),Groupe Bull, Silicon Graphics International,Fujitsu. Vendors selling storage hardware with bundled Lustre support include Hitachi Data Systems,DataDirect Networks (DDN),NetApp, Seagate Technology, and others. It is also possible to get software-only support for Lustre file systems from some vendors, including Intel's High Performance Data Division.