Improving Performance of a Distributed File System Using OSDs and Cooperative Cache

PROJECT REPORT SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF B.Sc(H) Computer Science

Hans Raj College, University of Delhi, Delhi – 110 007, India

Submitted by: Parvez Gupta Roll No. - 6010027

Varenya Agrawal Roll No. - 6010044

Certificate
This is to certify that the project work entitled Improving Performance of a Distributed File System Using OSDs and Cooperative Cache being submitted by Parvez Gupta and Varenya Agrawal, in partial fulfillment of the requirement for the award of the degree of B.Sc (Hons) Computer Science, University of Delhi, is a record of work carried out under the supervision of Ms. Baljeet Kaur at Hans Raj College, University of Delhi, Delhi.

It is further certified that we have not submitted this report to any other organization for any other degree.

Parvez Gupta Roll No: - 6010027 Varenya Agrawal Roll No: - 6010044

Project Supervisor Ms. Baljeet Kaur

Principal Dr. S.R. Arora Hans Raj College University of Delhi

Dept. of Computer Science Hans Raj College University of Delhi

Acknowledgment
We would sincerely like to thank Ms. Baljeet Kaur for her invaluable support and guidance in carrying this project to successful completion. We would also like to thank the Head of the Computer Science Department, Ms. Harmeet Kaur, whose knowledge and experience helped us greatly during the research work. We extend our gratitude and special thanks to Mr. I.P.S. Negi, Mr. Sanjay Narang and Ms. Anita Mittal for their help in the computer laboratory. Lastly, we would like to thank all our friends and well-wishers who directly or indirectly contributed to the successful completion of the project.

Table of Contents

List of Figures

Chapter 1 Introduction
  1.1 Background
  1.2 About the Work

Chapter 2 z-Series File System
  2.1 Prominent Features
  2.2 Architecture
    2.2.1 Object Store
    2.2.2 Front End
    2.2.3 Lease Manager
    2.2.4 File Manager
    2.2.5 Cooperative Cache
    2.2.6 Transaction Server

Chapter 3 Cooperative Cache
  3.1 Working of Cooperative Cache
  3.2 Cooperative Cache Algorithm
    3.2.1 Node Failure
    3.2.2 Network Delays
  3.3 Choosing the Proper Third Party Node
  3.4 Pre-fetching Data in zFS

Chapter 4 Testing
  4.1 Test Environment
  4.2 Comparing zFS and NFS

Conclusion

Bibliography

List of Figures

Figure 1: zFS Architecture
Figure 2: Delayed Move Notification Messages
Figure 3: System configuration for testing zFS performance
Figure 4: System configuration for testing NFS performance
Figure 5: Performance results for large server cache
Figure 6: Performance results for small server cache

Chapter 1 Introduction

1.1 Background

As computer networks started to evolve in the 1980s, it became evident that the old file systems had many limitations that made them unsuitable for multiuser environments. In the beginning, many users started to use FTP to share files. Although this method avoided the time-consuming physical movement of removable media, files still needed to be copied twice: once from the source computer onto a server, and a second time from the server onto the destination computer. Additionally, users had to know the physical addresses of every computer involved in the file-sharing process.

As computer companies tried to solve these shortcomings, distributed file systems were developed and new features such as file locking were added to existing file systems. The new systems were not replacements for the old file systems, but an additional layer between the disk file system and the user processes. In a Distributed File System (DFS), a single file system can be distributed across several physical computer nodes, with separate nodes having direct access to only a part of the entire file system. With a DFS, system administrators can make files distributed across multiple servers appear to users as if they reside in one place on the network.

zFS (z-Series File System), a distributed file system developed by IBM, is used in the z/OS operating system. zFS evolved from the DSF (Data Sharing Facility) project, which aimed at building a server-less file system that distributes all aspects of file and storage management over cooperating machines interconnected by a fast switched network. zFS was designed to achieve a scalable file system that operates equally well on a few machines or on thousands of machines, and in which the addition of new machines leads to a linear increase in performance.

Several other related works have researched cooperative caching in network file systems. One such file system, xFS, is more scalable than AFS and NFS due to four different caching techniques that contribute significantly to load reduction. However, xFS uses a central server to coordinate between the various clients, so its scalability is limited by the strength of that server, and the load on the server increases as the number of clients increases.

1.2 About the Work

This work describes a cooperative cache algorithm used in zFS which can withstand network delays and node failures, and explores the effectiveness of this algorithm and of zFS as a file system. The researchers also investigated whether using a cooperative cache results in better performance despite the fact that the object store devices (OSDs) have their own caches. This is done by comparing the system's performance to NFS using the IOZONE benchmark. Their results show that zFS performs better than NFS when the cooperative cache is activated, and that zFS provides better performance even though the OSDs have their own caches. They have also demonstrated that using pre-fetching in zFS increases performance significantly, and that zFS performance scales well as the number of participating clients increases.

There are three major differences between the zFS architecture and the xFS architecture:

1. zFS does not have a central server; the management of files is distributed among several file managers, and there is no hierarchy of cluster servers. All control information is exchanged between clients and file managers, and the set of file managers dynamically adapts itself to the load on the cluster. Thus, if two clients work on the same file, they interact with the same file manager. zFS is therefore more scalable: it has no central server that can become a bottleneck, and file managers can dynamically be added or removed to respond to load changes in the cluster.

2. In zFS, caching is done on a per-page basis rather than using whole files. This increases sharing, since different clients can work on different parts of the same file. Thus, performance is better due to zFS's stronger sharing capability.

3. In zFS, no caching is done on the local disk. Clients in zFS only pass data among themselves (in cooperative cache mode).

Chapter 2 z-Series File System

2.1 Prominent Features

zFS is a scalable file system which uses Object Store Devices (OSDs) and a set of cooperative machines for distributed file management; these are its two most prominent features. The design and implementation of zFS is aimed at achieving a scalable file system beyond those that exist today. More specifically, the objectives of zFS are:

- a file system that operates equally well on only a few or on thousands of machines;
- built from off-the-shelf components with object disks (ObSs);
- makes use of the memory of all participating machines as a global cache to increase performance;
- the addition of machines leads to an almost linear increase in performance.

zFS achieves its high performance and scalability by avoiding group-communication mechanisms and clustering software, and by using distributed transactions and leases instead. zFS integrates the memory of all participating machines into one coherent cache. Thus, instead of going to the disk for a block of data that is already in one of the machines' memories, zFS retrieves the data block from the remote machine. To maintain file system consistency, zFS uses distributed transactions and leases to implement metadata operations and coordinate shared access to data.

zFS will achieve scalability by separating storage management from file management and by dynamically distributing file management. Having ObSs handle storage management implies that functions usually handled by file systems are done in the ObS itself and are transparent to other components of zFS. Using object disks allows zFS to focus on management and scalability issues, while letting the ObS handle the physical disk chores of block allocation and mapping.

2.2 Architecture

zFS has six components: a Front End (FE), a Cooperative Cache (Cache), a File Manager (FMGR), a Lease Manager (LMGR), a Transaction Server (TSVR), and an Object Store (ObS). These components work together to provide applications or users with a distributed file system. We now describe the functionality of each component and how it interacts with the other components.

2.2.1 Object Store

The object disk (ObS) is the storage device on which files and directories are created, and from where they are retrieved. The ObS API enables creation and deletion of objects (files), and writing and reading byte-ranges from an object. Object disks provide file abstractions, security, safe writes and other capabilities. The Object Store recognizes only objects that are sparse streams of bytes; it does not distinguish between files and directories. It is the responsibility of the file system management to handle them correctly.
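To make the byte-range object abstraction concrete, the following is a minimal Python sketch of the kind of ObS interface described above (create and delete objects, read and write byte ranges). The class and method names are illustrative assumptions, not the actual ObS API.

```python
class ObjectStoreSketch:
    """Illustrative in-memory stand-in for an object disk (ObS).

    Objects are sparse byte streams identified by an object id; the store
    does not distinguish files from directories -- that is left to the
    file-system layer, as described above.
    """

    def __init__(self):
        self._objects = {}          # obj_id -> bytearray

    def create(self, obj_id):
        self._objects[obj_id] = bytearray()

    def delete(self, obj_id):
        del self._objects[obj_id]

    def write(self, obj_id, offset, data):
        obj = self._objects[obj_id]
        if len(obj) < offset + len(data):        # grow sparsely (zero-filled)
            obj.extend(b"\0" * (offset + len(data) - len(obj)))
        obj[offset:offset + len(data)] = data

    def read(self, obj_id, offset, length):
        return bytes(self._objects[obj_id][offset:offset + length])


# Example: the file-system layer stores a directory entry as ordinary bytes.
store = ObjectStoreSketch()
store.create("dir-7")
store.write("dir-7", 0, b"report.txt -> obj-42\n")
print(store.read("dir-7", 0, 21))
```

The point of the sketch is that the store only sees flat byte streams; interpreting one object as a directory is entirely the job of the file system management layer.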

2.2.2 Front End

The zFS front-end (FE) runs on every workstation on which a client wants to use zFS. It presents the client with the standard POSIX file system API and provides access to zFS files and directories.

2.2.3 Lease Manager

The need for a Lease Manager (LMGR) stems from the following facts. File systems use one form or another of locking mechanism to control access to the disks in order to maintain data integrity when several users work on the same files. To work in SAN file systems, where clients can write directly to object disks, the ObSs themselves have to support some form of locking; otherwise, two clients could damage each other's data.

In distributed environments, where network connections and even machines themselves can fail, it is preferable to use leases rather than locks. Leases are locks with an expiration period that is set up in advance. Thus, when a machine holding a lease on a resource fails, we are able to acquire a new lease after the lease of the failed machine expires. Obviously, the use of leases incurs the overhead of lease renewal on the client that acquired the lease and still needs the resource.

To reduce the overhead on the ObS, the following mechanism is used: each ObS maintains one major lease for the whole disk, and each ObS has one lease manager (LMGR) which acquires and renews the major lease. Leases for specific objects (files or directories) on the ObS are managed by the ObS's LMGR. Thus, the majority of lease management overhead is offloaded from the ObS, while still maintaining the ability to protect data. The ObS stores in memory the network address of the current holder of the major lease, so to find out which machine is currently managing a particular ObS O, a client simply asks O for the network address of its current LMGR.
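The lease mechanism above can be illustrated with a small Python sketch. The lease period, class names and method names are assumptions made for the example; they are not taken from the zFS code.

```python
import time

LEASE_PERIOD = 30.0   # seconds; an assumed value, set in advance as described above


class LeaseSketch:
    """A lock with an expiration period (a lease)."""

    def __init__(self, holder, period=LEASE_PERIOD):
        self.holder = holder
        self.expires = time.time() + period

    def valid(self):
        return time.time() < self.expires

    def renew(self, period=LEASE_PERIOD):
        # The holder pays the renewal overhead for as long as it needs the resource.
        self.expires = time.time() + period


class LeaseManagerSketch:
    """Per-ObS lease manager: holds the major lease and grants object leases."""

    def __init__(self, node, obs_id):
        self.node = node
        self.obs_id = obs_id
        self.major_lease = LeaseSketch(holder=node)     # one major lease per ObS
        self.object_leases = {}                         # obj_id -> LeaseSketch

    def acquire_object_lease(self, obj_id, requester):
        lease = self.object_leases.get(obj_id)
        if lease is None or not lease.valid():
            # Either unheld, or the previous holder failed and its lease expired.
            self.object_leases[obj_id] = LeaseSketch(holder=requester)
            return True
        return lease.holder == requester    # held exclusively by someone else


lm = LeaseManagerSketch(node="node-A", obs_id="obs-1")
print(lm.acquire_object_lease("obj-42", requester="node-B"))   # True: lease granted
print(lm.acquire_object_lease("obj-42", requester="node-C"))   # False: held, not expired
```

Because a lease expires on its own, a crashed holder never has to be contacted: a later acquire simply succeeds once the old lease times out, which is exactly the property that makes leases preferable to locks in a distributed setting.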

Any machine that needs to access an object obj on ObS O first figures out who its LMGR is. If an LMGR exists, the object-lease for obj is requested from that LMGR; if one does not exist, the requesting machine creates a local instance of an LMGR to manage O for it. The lease manager, after acquiring the major lease, grants exclusive leases on objects residing on the ObS. It also maintains in memory the current network address of each object-lease owner, which allows looking up file managers.

2.2.4 File Manager

Each opened file in zFS is managed by a single file manager assigned to the file when the file is opened. Initially, no file has an associated file manager (FMGR). The first machine to open a file will create a local instance of the file manager for that file, and until that file manager is shut down, each lease request for any part of the file will be mediated by that FMGR. The set of all currently active file managers manages all opened zFS files.

When an open() request arrives at the file manager, it checks whether the file has already been opened by another client (on another machine). If not, the FMGR acquires the proper exclusive lease from the lease manager and directs the request to the object disk. In case the requested data resides in the cache of another machine, the FMGR directs the cache on that machine to forward the data to the requesting cache. For better performance, the FMGR keeps track of each accomplished open() and read() request, and maintains the information regarding where each file's blocks reside in internal data structures.
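The following Python sketch illustrates the kind of bookkeeping described above for a single file manager: which clients have the file open, and which machines cache which blocks. All names and return values are hypothetical; the real FMGR works with leases and messages rather than direct returns.

```python
class FileManagerSketch:
    """Per-file bookkeeping a file manager might keep, as described above."""

    def __init__(self, file_id):
        self.file_id = file_id
        self.open_clients = set()        # machines that have the file open
        self.block_locations = {}        # block number -> set of caching machines

    def handle_open(self, client):
        if not self.open_clients:
            # First open anywhere in the cluster: this is where the real FMGR
            # would acquire the proper exclusive lease from the lease manager.
            pass
        self.open_clients.add(client)

    def handle_read(self, client, block):
        """Return where the requesting client should get the block from."""
        holders = self.block_locations.get(block)
        source = next(iter(holders)) if holders else "OSD"
        self.block_locations.setdefault(block, set()).add(client)
        return source


fm = FileManagerSketch("obj-42")
fm.handle_open("node-A")
print(fm.handle_read("node-A", block=0))   # "OSD": nobody caches it yet
fm.handle_open("node-B")
print(fm.handle_read("node-B", block=0))   # "node-A": forwarded from a peer cache
```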

The file manager interacts with the lease manager of the ObS where the file resides to obtain an exclusive lease on the file. It also creates and keeps track of all range-leases it distributes. These leases are kept in internal FMGR tables and are used to control and provide proper access to files by various clients.

2.2.5 Cooperative Cache

The cooperative cache (Cache) of zFS is a key component in achieving high scalability. Due to the fast increase in network speed nowadays, it takes less time to retrieve data from another machine's memory than from a local disk, and this is where a cooperative cache is useful. When a client on machine A requests a block of data via FEa, and the file manager (FMGRb, on machine B) realizes that the requested block resides in the Cache of machine M, it sends a message to Cachem to send the block to Cachea and updates the information on the location of that block in FMGRb. The Cache on A then receives the block, updates its internal tables (for future accesses to the block) and passes the data to FEa, which passes it to the client.

2.2.6 Transaction Server

In zFS, directory operations are implemented as distributed transactions. For example, a create-file operation includes, at the very least, (a) creating a new entry in the parent directory, and (b) creating a new file object. Each of these operations can fail independently, and the initiating host can fail as well. Such occurrences can corrupt the file system. Hence, each directory operation should be protected inside a transaction, such that in the event of failure, the consistency of the file system can be restored.

This means either rolling the transaction forward or backward. Since such transactions are complex, zFS uses a special component to manage them: a transaction server (TSVR). The most complicated directory operation is rename(), which requires, at the very least, (a) locking the source directory, the target directory, and the file to be moved, (b) creating a new directory entry at the target, (c) erasing the old entry, and (d) releasing the locks. The TSVR works on a per-operation basis: it acquires all required leases and performs the transaction. The TSVR attempts to hold on to acquired leases for as long as possible and releases them only for the benefit of other hosts.
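As an illustration of a directory operation protected by a transaction, the sketch below walks through the four rename() steps listed above and rolls the work backward if a step fails. The dictionaries, the `_Leases` stub and the function name are stand-ins invented for the example; the real TSVR logs its work and may also roll a transaction forward.

```python
def rename_transaction(src_dir, dst_dir, name, new_name, leases):
    """Sketch of rename() as a distributed transaction, following the four
    steps described above: lock, create new entry, erase old entry, unlock.

    src_dir and dst_dir are plain dicts standing in for directory objects;
    leases is a lock-like object with acquire/release -- all hypothetical.
    On failure the partial work is rolled backward so consistency is restored.
    """
    # (a) lock source directory, target directory and the file being moved
    leases.acquire(src_dir["id"], dst_dir["id"], src_dir["entries"][name])
    created = False
    try:
        # (b) create a new directory entry at the target
        dst_dir["entries"][new_name] = src_dir["entries"][name]
        created = True
        # (c) erase the old entry
        del src_dir["entries"][name]
    except Exception:
        # Roll the transaction backward: undo step (b) if it happened.
        if created:
            dst_dir["entries"].pop(new_name, None)
        raise
    finally:
        # (d) release the locks
        leases.release(src_dir["id"], dst_dir["id"])


class _Leases:                       # trivial stand-in so the sketch runs
    def acquire(self, *ids): pass
    def release(self, *ids): pass


src = {"id": "dir-1", "entries": {"a.txt": "obj-42"}}
dst = {"id": "dir-2", "entries": {}}
rename_transaction(src, dst, "a.txt", "b.txt", _Leases())
print(src["entries"], dst["entries"])    # {} {'b.txt': 'obj-42'}
```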

Figure 1: zFS Architecture. Each participating node runs the front-end and the cooperative cache. Several file managers and transaction servers run on various nodes in the cluster, and each OSD has only one lease manager associated with it. Every file opened in zFS is managed by a single file manager that is assigned to the file when it is first opened; the file manager manages ranges of leases, which it grants to the clients (FEs).

Chapter 3 Cooperative Cache

3.1 Working of Cooperative Cache

In zFS, the cooperative cache is integrated with the Linux kernel page cache for two main reasons. First, it provides comparable local performance between zFS and the other local file systems supported by Linux. Second, the operating system does not have to maintain two separate caches with different cache policies that may interfere with each other. All the supported file systems use the kernel page cache. Caching is not done on whole files but on a per-page basis, and when a file is closed, its pages remain in the cache until memory pressure causes the kernel to discard them.

As a result, the researchers achieved the following behaviour. The kernel invokes page eviction according to its internal algorithm, when free available memory is low, regardless of the file system type, leading to fairness between the file systems: the pages of zFS and other file systems are treated equally by the kernel algorithm, and there is no need for a special zFS mechanism to detect memory pressure. When eviction is invoked and a zFS page is the candidate page for eviction, the decision is passed to a specific zFS routine, which decides whether to forward the page to the cache of another node or to discard it.

The implementation of the zFS page cache supports the following optimizations. An application using a zFS file to write a whole page acquires only the write lease when no read is done from the OSD. If one application or user on a machine has a write lease, all other applications/users on that machine can try to read and write to the page using the same lease, without requesting another lease from the file manager. This increases the probability of a cache hit by a client requesting the same page, thus increasing performance. When a client has a write lease and another client requests a read lease for the same page, a write to the object store device is done if the page has been modified, and the lease on the first client is downgraded from a write lease to a read lease without discarding the page.
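The machine-wide lease reuse and the downgrade-on-remote-read optimization can be sketched as follows in Python. The class, method names and return strings are illustrative assumptions only.

```python
class NodeLeaseCacheSketch:
    """Sketch of the per-machine lease reuse described above.

    Once one application on the machine holds a write lease for a page, other
    applications on the same machine read and write that page under the same
    lease; the file manager is not asked again.  When a remote client later
    asks to read the page, the lease is downgraded to a read lease (after
    writing the page back to the OSD if it was modified) without discarding
    the cached page.  All names are illustrative.
    """

    def __init__(self):
        self.leases = {}           # page id -> "read" or "write"
        self.dirty = set()         # modified pages not yet written to the OSD

    def local_access(self, page_id, want_write):
        mode = self.leases.get(page_id)
        if mode == "write" or (mode == "read" and not want_write):
            return "reuse existing lease"          # no message to the file manager
        return "ask file manager"                  # first access, or upgrade needed

    def remote_read_requested(self, page_id, write_back):
        if self.leases.get(page_id) == "write":
            if page_id in self.dirty:
                write_back(page_id)                # flush the modified page to the OSD
                self.dirty.discard(page_id)
            self.leases[page_id] = "read"          # downgrade, keep the page cached


cache = NodeLeaseCacheSketch()
cache.leases["p1"] = "write"
cache.dirty.add("p1")
print(cache.local_access("p1", want_write=True))                 # reuse existing lease
cache.remote_read_requested("p1", write_back=lambda p: print("flush", p))
print(cache.leases["p1"])                                        # read
```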

3.2 Cooperative Cache Algorithm

In this paper, a data block is considered to be a page. Each page that exists in the cooperative cache is said to be either a singlet or replicated: a singlet page is present in the memory of only one of the nodes in the connected network, while a replicated page is present in the memory of several nodes.

When a client wants to open a file for reading, zFS allows it, based on the permissions specified in the mode parameter (read, write, or both) when the file is opened. The kernel then checks the permission to read/write. If the mode bits allow the operation, the local cache is checked for the page. In case of a cache miss, zFS requests the page and its read lease from the zFS file manager. The file manager checks whether a range of pages starting with the requested page has already been read into the memory of another machine in the network. If not, zFS grants the leases to the client A, which enables the client to read the range of pages from the OSD directly. The client A then reads the range of pages from the OSD, marking each page as a singlet (as A is the only node having this range of pages in its cache). If the file manager finds that the requested range of pages resides in the memory of some other node, say B, it sends a message to B requesting that B send the range of pages and leases to A. In that case, zFS records internally the fact that A also has this particular range, and both A and B mark the pages as replicated. Node B is called a third-party node, since A gets the requested data not from the OSD but from a third party. The effects of node failure and network delays are also considered in this algorithm.
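A minimal sketch of the file manager's side of this read path is shown below: it tracks which nodes cache which pages, serves a miss either from the OSD or from a third-party node, and flips pages from singlet to replicated when a second node obtains them. Names and return values are hypothetical simplifications of the message exchange described above.

```python
class CoopCacheDirectorySketch:
    """File-manager view of where page ranges live, as described above.

    page_owners maps (file, page) -> set of nodes caching it; a page cached by
    exactly one node is a singlet, otherwise it is replicated.
    """

    def __init__(self):
        self.page_owners = {}

    def is_singlet(self, file_id, page):
        return len(self.page_owners.get((file_id, page), ())) == 1

    def handle_read_miss(self, client, file_id, first_page, count):
        """Decide how a client's cache miss for a range of pages is served."""
        owners = self.page_owners.get((file_id, first_page), set())
        third_party = next((n for n in owners if n != client), None)
        for p in range(first_page, first_page + count):
            self.page_owners.setdefault((file_id, p), set()).add(client)
        if third_party is None:
            # No other node holds the range: the client reads it from the OSD
            # and marks each page as a singlet.
            return ("read_from_osd", count)
        # Some node already caches the range: it is told to forward the pages
        # and leases to the client, and both sides mark them replicated.
        return ("forward_from", third_party)


fm = CoopCacheDirectorySketch()
print(fm.handle_read_miss("A", "obj-42", first_page=0, count=4))   # ('read_from_osd', 4)
print(fm.handle_read_miss("B", "obj-42", first_page=0, count=4))   # ('forward_from', 'A')
print(fm.is_singlet("obj-42", 0))                                  # False: now replicated
```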

When memory becomes scarce for a client A, the Linux kernel invokes the kswapd() daemon, which scans and discards inactive pages from the memory of the client. In the modified kernel, if the candidate page is a replicated zFS page, it is discarded and a message is sent to the zFS file manager indicating that machine A no longer holds the page. If the zFS page is a singlet, the page is forwarded to another node using the following steps:

1. A message is sent to the zFS file manager indicating that the page is being sent to another machine B, the node with the largest free memory known to A.
2. The page is forwarded to B.
3. The page is discarded from the page cache of A.

When a file manager is notified about a discarded page, it updates the lease and page location and checks whether the page has become a singlet. If only one other node N now holds the page, the file manager sends a singlet message to N to that effect. To prevent a singlet page from being passed around the cluster indefinitely, zFS uses a recirculation counter: if a singlet page has not been accessed after two hops, it is discarded. Once the page has been accessed, the recirculation counter is reset.

3.2.1 Node Failure

To take care of node failure, the order of the steps for forwarding a singlet page to another node is important and is to be followed as described above. The researchers take the approach that it is acceptable for the file manager to assume the existence of pages on nodes even if this is not true, but it is unacceptable to have pages on nodes where the file manager is unaware of their existence. If the file manager is wrong in its assumption that a page exists on a node, its request will be rejected and it will eventually update its records; this is acceptable since it can be corrected without data corruption. If, on the other hand, there are pages on nodes that the file manager is not aware of, data may be corrupted, and thus this is not allowed.

1. Node fails before Step 1: if the node fails to execute Step 1 and notify the file manager, it does not forward the page and only discards it. We end up with a situation where the file manager assumes the page exists on node A, although in reality that is not true. The file manager will eventually detect this and update its data to reflect that the respective node does not hold the pages and leases.
2. Node fails after Step 1: in this case, the file manager is informed that the page is on B, but node A may have crashed before it was able to forward the page to B. Again, we have a situation where the file manager assumes the page is on B, although it is not. This is acceptable, since it can be corrected without data corruption.
3. Failure after Step 2 does not pose any problem.
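The ordered forwarding steps and the two-hop recirculation limit can be summarized in the following Python sketch. The callables stand in for the actual messaging and eviction primitives, and `hops_since_access` is an assumed attribute name; the key point mirrored from the text is that the file manager is notified before the page leaves the node, so a crash can only leave the file manager believing a page exists somewhere it does not, never the reverse.

```python
import types

MAX_HOPS_WITHOUT_ACCESS = 2          # the two-hop recirculation limit described above


def forward_singlet_page(page, target, notify_file_manager, send_page, drop_page):
    """Sketch of the ordered steps for forwarding a singlet page on eviction."""
    if page.hops_since_access >= MAX_HOPS_WITHOUT_ACCESS:
        # Recirculation limit reached: stop pushing the page around the cluster.
        notify_file_manager("discarded", page)
        drop_page(page)
        return

    notify_file_manager("moved", page, target)   # step 1: tell the file manager first
    send_page(target, page)                      # step 2: forward to the target node
    drop_page(page)                              # step 3: discard the local copy


page = types.SimpleNamespace(hops_since_access=0, data=b"...")
forward_singlet_page(page, "node-B",
                     notify_file_manager=lambda *a: print("to FM:", a[0]),
                     send_page=lambda node, p: print("page ->", node),
                     drop_page=lambda p: print("local copy dropped"))
```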

3.2.2 Network Delays

In this paper, the following cases of network delays are considered.

1. The first case the authors consider is a replicated page residing on two nodes M and N that is discarded from the memory of M. When the zFS file manager sees that the page has become a singlet and now resides only in the memory of N, it sends a message to N with this information. However, due to network delays, this message may arrive after memory pressure developed on N; since the page was still marked on N as replicated, it was simply discarded, while in reality it had become a singlet and should have been forwarded to another node. Because the file manager still believes that the page resides on N, it may ask N to forward the page to a requesting client B. In this case, N will send back a reject message to the file manager. Upon receiving a reject message, the file manager updates its internal tables and retries to respond to the request from B, either by finding another client that in the meantime read the page from the OSD, or by telling B to read the page from the OSD. In such cases, network delays cause performance degradation, but not inconsistency. The researchers handle the delayed singlet message as follows: if a singlet message arrives at N and the page is not in the cache of N, the cooperative cache algorithm on N simply ignores the singlet message.

2. Another possible scenario is that no memory pressure occurred on N, and a singlet message arrived and was ignored because the page had not arrived yet. The file manager asked N to forward the page to B, and N sent a reject message back to the file manager.

However, if the page arrives at N after the reject message was sent, a consistency problem may occur if a write lease exists: because the file manager is no longer aware of the page on N, another node may get the write lease and the page from the OSD. This would leave two clients having the same page with write leases on two different nodes.

To avoid this situation, a reject list is kept on the node N, which records the pages (and their corresponding leases) that were requested but rejected. When a forwarded page arrives at N and the page is on the reject list, the page and its entry on the reject list are discarded, thus keeping the information in the file manager accurate. The reject list is scanned periodically (by the FE), and each entry whose time on the list exceeds T is deleted, where T is the maximum time it can take a page to reach its destination node; T is determined experimentally, depending on the network topology. If the page never arrives at N, due to sender failure or network failure, there is no problem: the entry simply ages out of the list.

An alternative method for handling these network delay issues would be to use a complicated synchronization mechanism to keep track of the state of each page in the cluster. This is unacceptable for two reasons: first, it incurs overhead from extra messages, and second, such synchronization delays the kernel when it needs to evict pages quickly.
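A small Python sketch of the reject list described above is given below. The timeout value and all names are assumptions made for illustration; in zFS the list lives in the FE and T is tuned experimentally to the network.

```python
import time

T_MAX_TRANSIT = 5.0     # seconds; assumed stand-in for the experimentally tuned T


class RejectListSketch:
    """Sketch of the per-node reject list described above.

    Pages that were requested by the file manager but rejected (because they
    had not arrived yet) are remembered; if the page shows up later it is
    discarded rather than kept under a stale lease, and stale entries are
    aged out after T.
    """

    def __init__(self):
        self.entries = {}                      # page id -> time of rejection

    def record_reject(self, page_id):
        self.entries[page_id] = time.time()

    def on_page_arrival(self, page_id):
        """Return True if the arriving forwarded page must be discarded."""
        if page_id in self.entries:
            del self.entries[page_id]
            return True                        # page was rejected earlier: drop it
        return False                           # normal arrival: keep the page

    def scan(self):
        # Periodic scan by the FE: entries older than T are deleted, covering
        # the case where the page never arrives (sender or network failure).
        now = time.time()
        self.entries = {p: t for p, t in self.entries.items()
                        if now - t <= T_MAX_TRANSIT}
```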

Another problem caused by network delays arises with move notification messages. Suppose node N notifies the zFS file manager upon forwarding a page to M, and M does the same, forwarding the page to O; suppose also that the link from N to the file manager is slow compared to the other links. The file manager may then receive the message that the page was moved from M to O before receiving the message that the singlet page was moved from N to M. Because the file manager is not yet aware that the page left N, it does not have in its records that this specific page and lease reside on M. The problem is further complicated by the fact that the last node may decide to discard the page, and this notification may arrive at the file manager before the move notifications.

To solve this problem, the researchers used the following data structures. Each lease on a node has a hop_count, which counts the number of times the lease and its corresponding page were moved from one node to another. Initially, when the page is read from the OSD, the hop_count in the corresponding lease is set to zero, and it is incremented whenever the lease and page are transferred to another node. When a node initiates a move, the move notification passed to the file manager includes the hop_count and the target_node. Two fields are reserved in each lease record in the file manager's internal tables for handling move notification messages: last_hop_count, initially set to -1, and target_node, initially set to NULL.

Figure 2: Delayed Move Notification Messages

Consider the scenario of Figure 2: the page was moved from N to M and then to O, where it was discarded due to memory pressure on O and because its recirculation count exceeded its limit; O then sends a release_lease message. The move notification from N (message (1)), the move notification from M (message (3)) and the release_lease message from O (message (5)) may arrive at the file manager out of order. The delayed messages are resolved by updating the information stored in the internal tables of the file manager:

- A move notification whose hop count is larger than last_hop_count updates the lease record: its hop count and target node are saved, and the lease is considered moved to the node stored in the target_node field. A move notification with a smaller hop count is a late, irrelevant message and is ignored. For example, if message (3) arrives first, its hop count and target node are saved in the lease record, and when message (1) arrives later it is ignored due to its smaller hop count.
- The release_lease message (5) may arrive before the move notifications. Since O is not registered as holding the page and lease (N is still the registered node), the release_lease message is placed on a pending queue and a flag is raised in the lease record; the information from message (5) is stored and used when the move notifications arrive. When the move operation is resolved and this flag is set, the release_lease message is moved to the input queue and executed.

In other words, using the hop count enables the file manager to ignore late messages that are irrelevant.
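The hop-count rule for out-of-order move notifications can be sketched as follows. The sketch keeps only the last_hop_count/target_node logic; the pending queue for early release_lease messages is omitted, and the class and method names are invented for the example.

```python
class LeaseRecordSketch:
    """File-manager lease record with the two extra fields described above."""

    def __init__(self):
        self.holder = None
        self.last_hop_count = -1      # highest move hop_count seen so far
        self.target_node = None       # where the newest known move went

    def on_move_notification(self, hop_count, source, target):
        """Apply a 'page and lease moved from source to target' message.

        Late or duplicate notifications carry a smaller hop count and are
        ignored, so the record converges on the newest location even when the
        messages arrive out of order.
        """
        if hop_count <= self.last_hop_count:
            return False                       # stale message: ignore it
        self.last_hop_count = hop_count
        self.target_node = target
        self.holder = target
        return True


rec = LeaseRecordSketch()
rec.on_move_notification(hop_count=2, source="M", target="O")   # arrives first
rec.on_move_notification(hop_count=1, source="N", target="M")   # late: ignored
print(rec.holder)                                               # O
```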

3.3 Choosing the Proper Third Party Node

The zFS file manager uses an enhanced round-robin method to choose the third-party node. For each range of pages granted to a node N, the file manager records the time at which it was granted, t(N). When a request arrives, the file manager scans the list of all nodes N0…Nk holding a potential range, that is, a range of pages starting with the requested page. For each selected node Ni, the file manager checks whether currentTime - t(Ni) > C, i.e., whether enough time has passed for the range of pages granted to Ni to actually reach that node. If this is true, Ni is marked as a potential provider for the requested range and the next node, Ni+1, is checked; otherwise, the next node is simply checked. Once all nodes are checked, the marked node with the largest range, Nmax, is chosen. The next time the file manager is asked for a page and lease, it starts the scan from node Nmax+1.

Two goals are achieved using this algorithm. First, the pages reside at the chosen node for sure, thus reducing the probability of reject messages. Second, no single node is overloaded with requests and becomes a bottleneck.
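The following Python sketch captures the enhanced round-robin selection described above: only grants older than C are considered, the largest qualifying range wins, and the next scan resumes after the chosen node. The constant C and all names are assumptions made for the example.

```python
import time

C_GRANT_SETTLE_TIME = 1.0    # seconds; assumed stand-in for the constant C above


class ThirdPartySelectorSketch:
    """Sketch of the enhanced round-robin third-party choice described above."""

    def __init__(self):
        self.grants = []          # list of (node, range_length, grant time t(N))
        self.next_start = 0       # round robin: resume scanning after N_max

    def record_grant(self, node, range_length):
        self.grants.append((node, range_length, time.time()))

    def choose(self):
        now = time.time()
        candidates = []
        n = len(self.grants)
        for i in range(n):
            idx = (self.next_start + i) % n
            node, length, granted_at = self.grants[idx]
            # Only nodes whose granted range has had time to actually arrive.
            if now - granted_at > C_GRANT_SETTLE_TIME:
                candidates.append((idx, node, length))
        if not candidates:
            return None                       # fall back to reading from the OSD
        idx, node, _ = max(candidates, key=lambda c: c[2])   # largest range wins
        self.next_start = (idx + 1) % n       # next scan starts after N_max
        return node


sel = ThirdPartySelectorSketch()
sel.record_grant("node-B", 8)
sel.record_grant("node-C", 16)
time.sleep(C_GRANT_SETTLE_TIME + 0.1)         # let the granted ranges "arrive"
print(sel.choose())                           # node-C: largest qualifying range
```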

3.4 Pre-fetching Data in zFS

The Linux kernel uses a read-ahead mechanism to improve file reading performance. Based on the read pattern of each file, the kernel dynamically calculates how many pages to read ahead, n, and invokes the readpage() routine n times. This method of operation is not efficient when the pages are transmitted over the network. The overhead for transmitting a data block is composed of two parts: the network setup overhead and the transmission time of the data block itself. For comparatively small blocks, the setup time is significant. Intuitively, it seems more efficient to transmit k pages in one message rather than transmitting them in a separate message for each page, since the setup overhead is amortized over the k pages.

To confirm this, the researchers wrote client and server programs that test the time it takes to read a file residing entirely in memory from one node to another. Using a file size of N pages, they tested reading it in chunks of 1…k pages in each TCP message, that is, reading the file in N…N/k messages. They found that the best results are achieved for k=4 and k=8. When k is smaller, the setup overhead is a significant part of the total overhead, and when k is larger (16 and above), the size of the L2 cache starts to affect the performance: TCP performance decreases when the transmitted block size exceeds the size of the L2 cache. Similar performance gains were achieved by the zFS pre-fetching mechanism.

When the file manager is instantiated, it is passed a pre-fetching parameter, R, indicating the maximum range of pages to grant.

When a client A requests a page (and lease), the file manager searches for a client B having the largest contiguous range of pages, r, starting with the requested page p, where r <= R. If such a client B is found, the file manager sends B a message to send the r pages (and their leases) to A. The selected range r can be smaller than R if the file manager finds a page with a conflicting lease before reaching the full range R. If no range is found in any client, the file manager grants R leases to client A and instructs A to read R pages from the OSD. However, if the file manager finds that client A itself already has a range of k pages, the granted range will be only the requested page from client A.

The requested page may reside on client A, while the next one resides on client B and the next on client C. In this case, the next request initiated by the kernel read-ahead mechanism will be granted from client B, and the next from client C; thus, there is no interference with the kernel read-ahead mechanism. Once a client has been granted a range of pages, it will ignore the subsequent requests that are initiated by the kernel read-ahead mechanism and are covered by the granted range.
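A simplified Python sketch of this grant decision is shown below. It ignores the special case where the requesting client itself already caches the following pages, and the data structures passed in are hypothetical stand-ins for file-manager state.

```python
PREFETCH_LIMIT_R = 16      # maximum range of pages to grant (the parameter R)


def grant_read_range(requested_page, cached_ranges, conflicting):
    """Sketch of the pre-fetching grant decision described above.

    cached_ranges maps client -> set of page numbers it caches, and
    conflicting(page) says whether a conflicting lease exists for a page.
    Returns the source to fetch from ("OSD" if no client holds the range) and
    the granted range length r <= R.
    """
    best_client, best_len = None, 0
    for client, pages in cached_ranges.items():
        # Length of the contiguous run starting at the requested page.
        length = 0
        while (requested_page + length) in pages and length < PREFETCH_LIMIT_R:
            if conflicting(requested_page + length):
                break                      # stop at the first conflicting lease
            length += 1
        if length > best_len:
            best_client, best_len = client, length
    if best_client is None:
        # No client holds the range: grant R leases and read R pages from the OSD.
        return "OSD", PREFETCH_LIMIT_R
    return best_client, best_len


ranges = {"B": {10, 11, 12, 13}, "C": {10, 11}}
print(grant_read_range(10, ranges, conflicting=lambda p: False))   # ('B', 4)
print(grant_read_range(50, ranges, conflicting=lambda p: False))   # ('OSD', 16)
```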

Chapter 4 Testing

4.1 Test Environment

The zFS performance test environment consisted of a cluster of client PCs and one server PC. Each of the PCs in the cluster had an 800 MHz Pentium III processor with 256 MB memory, a 256 KB L2 cache, and a 15 GB IDE (Integrated Drive Electronics) disk. The server PC had a 2 GHz Pentium 4 processor with 512 MB memory and a 30 GB IDE disk, running a vanilla Linux kernel. The PCs in the cluster and the server PC were connected via a 1 Gbit LAN. All of the PCs in the cluster ran the Linux operating system; the kernel was a modified 2.4.19 kernel with VFS (Virtual File System) support implementing zFS and some patches to enable the integration of zFS with the kernel's page cache. The server PC ran a simulator of the Antara OSD when the researchers tested zFS performance, and an NFS (Network File System) server when they compared the results to NFS.

To begin testing zFS, the researchers configured the system much like a SAN (Storage Area Network) file system: a separate PC ran the lease manager, file manager and transaction manager processes (thus acting as a meta data server), four PCs ran the zFS front-end, and the server PC ran the OSD simulator. The client running the zFS front end was implemented as a kernel mode process, while all other components were implemented as user mode processes. The file manager and lease manager were fully implemented; the transaction manager implemented all operations in memory, without writing the log to the OSD. However, this fact does not influence the results, because only the results of read operations using the cooperative cache are recorded, and not the meta data operations.

When testing NFS performance, they configured the system differently: the server PC ran an NFS server with eight NFS daemons (nfsd), and the four PCs ran NFS clients.

To evaluate zFS performance relative to an existing file system, the researchers compared it to the widely used NFS, using the IOZONE benchmark. IOZONE is a file system benchmark tool: it generates and measures a variety of file operations, testing file I/O performance for operations such as read, write, etc., and is useful for performing a broad file system analysis of a computer platform.

The comparison to NFS was difficult because NFS does not carry out pre-fetching. To make up for this, IOZONE was configured to read the NFS-mounted file using record sizes of n = 1, 4, 8, 16 pages, and its performance was compared with reading zFS-mounted files with record sizes of one page but with pre-fetching parameter R = 1, 4, 8, 16 pages. The caches of the machines were cleared before each run, and the final results are an average over several runs.

Figure 3: System configuration for testing zFS performance

Figure 4: System configuration for testing NFS performance

4.2 Comparing zFS and NFS

The primary aim of this research was to test whether, and by how much, performance is gained when the total amount of free memory in the cluster exceeds the server's cache size. To this end, two scenarios were investigated. In the first, the file size was smaller than the cache of the server, so that all the data resided in the server's cache. In the second, the file size was much larger than the size of the server's cache. The results appear in Figure 5 and Figure 6 below; the graphs show the performance of zFS relative to NFS, with and without the cooperative cache.

Figure 5: Performance results for large server cache (256 MB file, 512 MB server memory). This figure shows the performance results when the data fits entirely in the server's memory.

When the file fits entirely into the server's memory (Figure 5), the performance of zFS for R=1 is lower than that of NFS. However, for larger ranges there are fewer messages to the file manager (since the pages are pre-fetched), and the performance of zFS was slightly better than that of NFS. In both scenarios, it was observed that the performance of NFS was almost the same for the different block sizes. It was also seen that, when using the cooperative cache, the performance for a range of 16 was lower than for ranges of 4 and 8. This is due to the extra messages passed between the file manager and the client for larger ranges of pages, and to the size of the L2 cache: for four clients with 16 pages each, we get 256 KB (4 clients × 16 pages × 4 KB per page), which is exactly the size of the L2 cache. Because IOZONE starts the requests of each client with a fixed time delay relative to the other clients, each new request was for a different 256 KB, so the L2 cache was cleared and reloaded for each new granted request, resulting in reduced performance.

Figure 6: Performance results for small server cache (1 GB file, 512 MB server memory). This figure shows the performance results when the data size is greater than the server cache size and the server's local disk has to be used.

When the cache of the server was smaller than the requested data, it was expected that memory pressure would occur in the server (for both NFS and the OSD) and that the server's local disk would be used. In such a case, the anticipation that the cooperative cache would exhibit improved performance proved to be correct; the results are shown in Figure 6. We can see that zFS performance with the cooperative cache deactivated is lower than that of NFS, but it gets better for larger ranges. When the cooperative cache is active, zFS performance is significantly better than NFS and increases with increasing range.

The performance with the cooperative cache enabled is lower in this case than in the case when the file fits into memory. This is due to the fact that the file was larger than the available memory; hence the clients themselves suffered memory pressure, discarded pages, and responded to the file manager with reject messages. Thus, sending data blocks to clients was interleaved with reject messages to the file manager, and the probability that the requested data was in memory was also smaller than when the file was almost entirely in memory.

Comparing the two scenarios, when the file fits entirely into memory the performance of NFS is almost four times better than when the file size is larger than the available memory. Hence, the performance is greatly influenced by the data size relative to the available memory.

Conclusion

The results show that using the cache of all the clients as one cooperative cache gives better performance as compared to NFS, as well as to the case when the cooperative cache is not used. This is evident when using pre-fetching with a range of one page. It is also noted from the results that using pre-fetching with ranges of four and eight pages results in much better performance.

In zFS, the selection of the target node for forwarding pages during page eviction is done by the file manager, which chooses the node with the largest free memory as the target node. However, the file manager chooses target nodes only from the ones interacting with it. It may be the case that there is an idle machine with a large amount of free memory that is not connected to this file manager and thus will not be used.

Bibliography

A. Teperman and A. Weit. "Improving Performance of a Distributed File System Using OSDs and Cooperative Cache." IBM Haifa Labs, Haifa University Campus, Mount Carmel, Haifa 31905, Israel.

O. Rodeh and A. Teperman. "zFS – A Scalable Distributed File System Using Object Disks." In Proceedings of the IEEE Mass Storage Systems and Technologies Conference, San Diego, CA, USA, 2003.

M. Dahlin, Randolph Y. Wang, Thomas E. Anderson and David A. Patterson. "Cooperative Caching: Using Remote Client Memory to Improve File System Performance." In Proceedings of the First Symposium on Operating Systems Design and Implementation, 1994.

Z. Dubitzky, I. Gold, E. Henis, J. Satran and D. Sheinwald. "DSF – Data Sharing Facility." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2000. http://www.haifa.ibm.com/projects/systems/dsf.html

V. Drezin, N. Rinetzky, A. Tavory and E. Yerushalmi. "The Antara Object-disk Design." Technical report, IBM Research Labs, Haifa University Campus, Mount Carmel, Haifa, Israel, 2001.

P. Sarkar and J. Hartman. "Efficient Cooperative Caching Using Hints." Department of Computer Science, University of Arizona, Tucson, USA.

T. Cortes, S. Girona and J. Labarta. "Avoiding the Cache Coherence Problem in a Parallel/Distributed File System." Departament d'Arquitectura de Computadors, Universitat Politecnica de Catalunya, Barcelona, Spain.

Lustre file system whitepaper. http://www.lustre.org/docs/whitepaper.pdf

Iozone filesystem benchmark. http://iozone.org/
