Lustre (file system)


Infobox software
name = Lustre

developer = Sun Microsystems
latest release version = 1.6.5.1
latest release date = July 10, 2008
operating system = Linux
genre = Shared disk file system
license = GPL
website = http://www.lustre.org, http://www.sun.com/software/products/lustre/

Lustre is a shared disk file system, generally used for large-scale cluster computing. The name Lustre is a blend of the words Linux and cluster. The project aims to provide a file system for clusters of tens of thousands of nodes with petabytes of storage capacity, without compromising speed or security. Lustre is available under the GNU GPL.

Lustre is designed, developed and maintained by Sun Microsystems, Inc. with input from many other individuals and companies. Sun completed its acquisition of Cluster File Systems, Inc., including the Lustre file system, on October 2, 2007, with the intention of bringing the benefits of Lustre technologies to Sun's ZFS file system and the Solaris operating system.

Lustre file systems are used in computer clusters ranging from small workgroup clusters to large-scale, multi-site clusters. Fifteen of the top 30 supercomputers in the world use Lustre file systems, including the world's second-fastest supercomputer, the Blue Gene/L at Lawrence Livermore National Laboratory (LLNL) [cite web | url=http://top500.org/list/2008/06/100 | title=TOP500 List - June 2008 | publisher=TOP500.Org]. Other supercomputers that use the Lustre file system include systems at Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and NASA [cite web | url=http://www.nas.nasa.gov/Resources/Systems/pleiades.html | title=Pleiades Supercomputer | date=2008-08-18 | publisher=www.nas.nasa.gov] in North America, the largest system in Asia at the Tokyo Institute of Technology [cite web | url=http://www.top500.org/system/8216 | title=TOP500 List - November 2006 | publisher=TOP500.Org], and one of the largest systems in Europe at CEA [cite web | url=http://www.top500.org/system/8237 | title=TOP500 List - June 2006 | publisher=TOP500.Org].

Lustre file systems can support up to tens of thousands of client systems, petabytes (PB) of storage, and hundreds of gigabytes per second (GB/s) of I/O throughput. Businesses ranging from Internet service providers to large financial institutions deploy Lustre file systems in their data centers. Due to its high scalability, Lustre is popular in the oil and gas, manufacturing, rich media, and finance sectors. [cite web | url=http://video.google.ca/videoplay?docid=965817030102279091 | title=Lustre File System presentation | accessdate=2008-01-28 | publisher=Google Video]

History

The Lustre file system architecture was developed as a research project in 1999 by Peter Braam, who was a Senior Systems Scientist at Carnegie Mellon University at the time. Braam went on to found his own company Cluster File Systems, which released Lustre 1.0 in 2003. Cluster File Systems was acquired by Sun Microsystems in October 2007.

The Lustre file system was first installed for production use in March 2003 on the MCR Linux Cluster at LLNL [http://goliath.ecnext.com/coms2/gi_0199-2958660/LINUX-NETWORX-CLUSTER-BECOMES-3RD.html], one of the largest supercomputers at the time [http://www.top500.org/system/6085].

Lustre 1.2.0, released in March 2004, provided Linux kernel 2.6 support, a "size glimpse" feature to avoid lock revocation on files undergoing write, and client side write-back cache accounting.

Lustre 1.4.0, released in November 2004, provided protocol compatibility between versions, Infiniband network support, and support for extents/mballoc in the ldiskfs on-disk filesystem.

Lustre 1.6.0, released in April 2007, supported mount configuration (“mountconf”) allowing servers to be configured with "mkfs" and "mount", supported dynamic addition of "object storage targets" (OSTs), enabled Lustre distributed lock manager (LDLM) scalability on symmetric multiprocessing (SMP) servers, and supported free space management for object allocation.

The Lustre file system and associated open source software have been adopted by many partners. Both Red Hat and SUSE (Novell) offer Linux kernels on which the Lustre client works without patches, easing deployment.

Architecture

A Lustre file system has three major functional units:

* A single "metadata target" (MDT) per filesystem that stores metadata, such as filenames, directories, permissions, and file layout, on the "metadata server" (MDS)

* One or more "object storage targets" (OSTs) that store file data on one or more "object storage servers" (OSSes). Depending on the server's hardware, an OSS typically serves between two and eight targets, with each target being a local disk filesystem of up to 8 terabytes (TB) in size. The capacity of a Lustre file system is the sum of the capacities provided by its targets

* "Client(s)" that access and use the data. Lustre presents all clients with standard POSIX semantics and concurrent read and write access to the files in the filesystem.

The MDT, OST, and client can be on the same node or on different nodes, but in typical installations, these functions are on separate nodes with two to four OSTs per OSS node communicating over a network. Lustre supports several network types, including Infiniband, TCP/IP on Ethernet, Myrinet, Quadrics, and other proprietary technologies. Lustre can take advantage of remote direct memory access (RDMA) transfers, when available, to improve throughput and reduce CPU usage.

The storage attached to the servers is partitioned, optionally organized with logical volume management (LVM) and/or RAID, and formatted as file systems. The Lustre OSS and MDS servers read, write, and modify data in the format imposed by these file systems.

An OST is a dedicated filesystem that exports an interface to byte ranges of objects for read/write operations. An MDT is a dedicated filesystem that controls file access and tells clients which object(s) make up a file. MDTs and OSTs currently use a modified version of ext3 to store data. In the future, Sun's ZFS/DMU will also be used to store data.

When a client accesses a file, it completes a filename lookup on the MDS. As a result, a file is created on behalf of the client or the layout of an existing file is returned to the client. For read or write operations, the client then passes the layout to a "logical object volume" (LOV), which maps the offset and size to one or more objects, each residing on a separate OST. The client then locks the file range being operated on and executes one or more parallel read or write operations directly to the OSTs. With this approach, bottlenecks for client-to-OST communications are eliminated, so the total bandwidth available for the clients to read and write data scales almost linearly with the number of OSTs in the filesystem.
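
The mapping step performed by the LOV is essentially the same arithmetic as RAID 0 striping. The short C sketch below is illustrative only (it is not Lustre code); it assumes example values of a 1 MiB stripe size and a stripe count of four, and prints the per-object extents into which a given file byte range would be split before being sent in parallel to the OSTs.

    /* Illustrative sketch of the RAID-0-style mapping a LOV performs:
     * a file byte range is split into per-object extents, one per stripe,
     * which the client can then read or write in parallel on the
     * corresponding OSTs.  Stripe size and count are example values. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned long long stripe_size  = 1 << 20;   /* 1 MiB stripes    */
        const unsigned int       stripe_count = 4;         /* objects per file */

        unsigned long long offset    = 3u << 20;            /* I/O starts at 3 MiB */
        unsigned long long remaining = 6u << 20;            /* ... and spans 6 MiB */

        while (remaining > 0) {
            unsigned long long stripe_no  = offset / stripe_size;       /* global stripe index  */
            unsigned int       obj_index  = stripe_no % stripe_count;   /* which object/OST     */
            unsigned long long obj_offset = (stripe_no / stripe_count) * stripe_size
                                            + offset % stripe_size;     /* offset inside object */
            unsigned long long chunk = stripe_size - offset % stripe_size;
            if (chunk > remaining)
                chunk = remaining;

            printf("object %u: offset %llu, length %llu\n",
                   obj_index, obj_offset, chunk);

            offset    += chunk;
            remaining -= chunk;
        }
        return 0;
    }

With these example values, an I/O that starts at 3 MiB and spans 6 MiB touches all four objects, which is why aggregate bandwidth grows with the number of OSTs backing a file.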

Clients do not directly modify the objects on the OST filesystems, but, instead, delegate this task to OSSes. This approach ensures scalability for large-scale clusters and supercomputers, as well as improved security and reliability. In contrast, shared block-based filesystems such as Global File System and OCFS must allow direct access to the underlying storage by all of the clients in the filesystem and risk filesystem corruption from misbehaving/defective clients.

Implementation

In a typical Lustre installation on a Linux client, a Lustre filesystem driver module is loaded into the kernel and the filesystem is mounted like any other local or network filesystem. Client applications see a single, unified filesystem even though it may be composed of tens to thousands of individual servers and MDT/OST filesystems.
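
At the system-call level, "mounted like any other filesystem" amounts to an ordinary mount(2) call with filesystem type "lustre". The sketch below is a hedged illustration: the MGS node name, network name, and filesystem name in the device string are invented example values, and in practice an administrator would normally use the mount(8) command or an /etc/fstab entry rather than calling mount(2) directly. It must be run as root on a node with the Lustre client module loaded.

    /* Hedged sketch: mounting a Lustre filesystem with the mount(2) system
     * call.  The device string "mgsnode@tcp0:/lustre" and the mount point
     * are assumed example values. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* device string: <MGS node>@<network>:/<filesystem name> (example) */
        const char *device      = "mgsnode@tcp0:/lustre";
        const char *mount_point = "/mnt/lustre";

        if (mount(device, mount_point, "lustre", 0, NULL) != 0) {
            perror("mount");
            return 1;
        }
        printf("mounted %s on %s\n", device, mount_point);
        return 0;
    }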

On some massively parallel processor (MPP) installations, computational processors can access a Lustre file system by redirecting their I/O requests to a dedicated I/O node configured as a Lustre client. This approach was used in the LLNL Blue Gene installation [cite web | url=http://www.hpcwire.com/hpcwire/hpcwireWWW/04/1015/108577.html | title=DataDirect Selected as Storage Tech Powering BlueGene/L | publisher="HPC Wire", October 15, 2004: Vol. 13, No. 41].

Another approach uses the "liblustre" library to provide userspace applications with direct filesystem access. Liblustre is a user-level library that allows computational processors to mount and use the Lustre file system as a client. Using liblustre, the computational processors can access a Lustre file system even if the service node on which the job was launched is not a Lustre client. Liblustre allows data movement directly between application space and the Lustre OSSs without an intervening data copy through the kernel, providing low-latency, high-bandwidth access from computational processors to the Lustre file system. Good performance characteristics and scalability make this approach the most suitable for using Lustre with MPP systems. Liblustre is the most significant design difference between Lustre implementations on MPPs such as the Cray XT3 and Lustre implementations on conventional clustered computational systems.

Lustre 1.8, which is expected to be released in Q4 2008, will focus on improving recovery in the face of multiple failures and on storage management via OST Pools [cite web | url=http://blogs.sun.com/HPC/resource/ISC08-Lustre_Update.pdf | title=Lustre Roadmap and Future Plans | accessdate=2008-08-21 | publisher=Sun Microsystems].

Lustre technology has been integrated into the HP StorageWorks Scalable File Share product [cite web | url=http://www.hp.com/techservers/hpccn/linux_gfs/ | title=Global File systems (SFS) collaboration | publisher=Hewlett Packard].

Data objects and file striping

In a traditional UNIX disk file system, an inode data structure contains basic information about each file, such as where the data contained in the file is stored. The Lustre file system also uses inodes, but inodes on MDSs point to one or more objects associated with the file rather than to data blocks. These objects are implemented as files on the OSSs. When a client opens a file, the file open operation transfers a set of object pointers and their layout from the MDS to the client, so that the client can directly interact with the OSS node where the object is stored, allowing the client to perform I/O on the file without further communication with the MDS.

If only one object is associated with an MDS inode, that object contains all the data in the Lustre file. When more than one object is associated with a file, data in the file is “striped” across the objects in a manner similar to RAID 0. Striping provides significant performance benefits: the maximum file size is not limited by the size of a single target, and capacity and aggregate I/O bandwidth scale with the number of OSTs.

Networking

In a cluster with a Lustre file system, the system network connecting the servers and the clients is implemented using Lustre Networking (LNET), which provides the communication infrastructure required by the Lustre file system. Disk storage is connected to the Lustre file system MDSs and OSSs using traditional storage area network (SAN) technologies.

LNET supports many commonly used network types, such as InfiniBand and IP networks, and allows simultaneous availability across multiple network types, with routing between them. Remote Direct Memory Access (RDMA) is used when supported by underlying networks such as Quadrics Elan, Myrinet, and InfiniBand. High availability and recovery features enable transparent recovery in conjunction with failover servers.

LNET provides end-to-end throughput over Gigabit Ethernet (GigE) networks in excess of 100 MB/s [cite web | author=Lafoucrière, Jacques-Charles | url=http://hepix.caspur.it/afs/hepix.org/project/strack/hep_pdf/2007/Spring/Lustre-CEA-hepix2007.pdf | title=Lustre Experience at CEA/DIF | publisher=HEPiX Forum, April 2007], throughput up to 1.5 GB/s over InfiniBand double data rate (DDR) links, and throughput over 1 GB/s across 10GigE interfaces.

High availability

Lustre file system high availability features include a robust failover and recovery mechanism, making server failures and reboots transparent. Version interoperability between successive minor versions of the Lustre software enables a server to be upgraded by taking it offline (or failing it over to a standby server), performing the upgrade, and restarting it, while all active jobs continue to run, merely experiencing a delay.

Lustre MDSs are configured as an active/passive pair, while OSSs are typically deployed in an active/active configuration that provides redundancy without extra overhead. Often the standby MDS is the active MDS for another Lustre file system, so no nodes are idle in the cluster.

ZFS integration

Lustre 3.0 will allow users to choose between ZFS and ldiskfs as back-end storage. [cite web | url=http://arch.lustre.org/index.php?title=Architecture_ZFS_for_Lustre | title=Architecture ZFS for Lustre | accessdate=2008-02-18 | publisher=Sun Microsystems]

References

See also

*GlusterFS
*Gfarm file system
*Google file system
*BigTable
*Parallel Virtual File System

External links

* [http://www.sun.com/software/products/lustre/ Lustre File System product page at sun.com]
* [http://www.lustre.org/ Website]
* [http://wiki.lustre.org/ Lustre Users Home]

