BlueArc’s Titan Architecture

Architecting A Better Network Storage Solution

Abstract
BlueArc's Titan Architecture represents a revolutionary step in file servers: a hardware-based file system that can scale throughput, IOPS and capacity well beyond conventional software-based file servers. With its ability to virtualize a massive storage pool of up to 2 petabytes of multi-tiered storage, Titan can scale with growing storage requirements and become a competitive advantage for business processes by consolidating applications while still ensuring optimal performance. This document details the architecture to help technical readers understand its unique hardware-accelerated design and object-based file system.

Introduction

With the massive increase in the number of desktop users, high-end workstations, application servers, High Performance Computing (HPC) and nodes in compute clusters over the last decade, conventional network attached storage (NAS) solutions have been challenged to meet the resulting acceleration in customer requirements. While file server vendors have offered systems with faster off-the-shelf components and CPUs as they become available, storage demands have far outpaced the ability of these CPU-based appliances to keep up. To meet the increasing performance and capacity requirements, companies have been forced to deploy multiple NAS appliances concurrently - reducing the benefit of NAS, decentralizing data and complicating storage management.

Many customers looking for a solution to this performance deficit turned to SAN implementations, but SAN brings challenges they did not experience with NAS. The first is high infrastructure cost. Adding one or two expensive Fibre Channel HBAs to each high-end workstation, each application and database server, and each cluster node is an expensive proposition compared to using existing Ethernet NICs. Expensive file system license fees, maintenance costs and complexity add to the burden. But by far the biggest challenge is that a SAN alone does not provide the standards-based shared file access needed for simple data management.

Another solution that some customers are beginning to look at is the concept of storage clusters or grids; we shall refer to both as storage-grids for the remainder of this discussion. For conventional NAS appliance vendors that cannot scale performance or capacity, this strategy is not an option but a necessity. Although storage-grids are interesting, they are far from ready for prime time. Consider the rise of compute-clusters as an allegory. In the summer of 1994, Thomas Sterling and Don Becker, working as contractors to NASA, built a clustered computer consisting of 16 DX4 processors connected by channel-bonded Ethernet. They called their machine Beowulf. Now, years later, compute-clusters are commonly used in research and are gaining wider acceptance in commercial enterprises. The key to this acceptance is that the complex software that ties compute-clusters together and distributes tasks to the nodes has finally matured to a point where companies can rely on it for stable services. Some aspects of compute-clusters will translate directly to storage-grids; however, enormous complexities are introduced as well. Locking, cache coherency, client-side caching and many other aspects of sharing a file system make it a daunting task. This will be solved over time but, as with compute-clusters, will take a significant amount of time to mature. The Internet Engineering Task Force (IETF) is proposing a new standard called pNFS. pNFS is an extension to NFSv4 and will help focus the industry towards a standards-based solution. This approach perfectly matches BlueArc's commitment to standards-based protocols and methods while taking advantage of its unique hardware architecture to accelerate data flow. Moreover, BlueArc will be able to deliver a significantly less complex solution by keeping the node count low while still achieving the desired performance.
Further discussion of storage-grids will be covered in a separate paper; however, BlueArc is committed to providing the fastest nodes for storage-grids, ensuring reduced complexity and cost while delivering best-in-class performance and scalability. Having assessed the limitations of scaling an individual CPU-based NAS server, BlueArc chose, in 1998, to take a fresh approach to the problem with the fundamental belief that file services could be accelerated using hardware-based state machines and massive parallelization, in the same way Ethernet and TCP/IP had been accelerated by switch and router vendors. The network vendors moved from software-based solutions to hardware-accelerated solutions to accelerate packet flow processing. It seemed logical that file services should follow this same evolution, and that a new design would be required to attain the same benefits experienced in networking.

BlueArc's Founding Design Principles

BlueArc had the unique technical acumen to build a new platform for data access acceleration, but this is only one aspect of creating a world-class network storage offering. By adhering to key principles - keeping it simple, maintaining a standards-based approach, and ensuring data availability while enabling significant increases in performance and capacity - BlueArc has evolved the Titan Architecture into the pinnacle of enterprise-class file sharing. The Titan 2000 series, BlueArc's fourth-generation hardware-based file server, delivers significant performance, throughput, and capacity gains for customers, as well as an excellent return on investment. This paper discusses the Titan Architecture in detail: its hardware architecture, unique

object-based file system structure, methods of acceleration, and how this architecture allows BlueArc to design products that achieve performance and capacity levels dramatically higher than competitors in the network storage market - in fact allowing Titan to be a hybrid product that delivers the benefits of both SAN and NAS. BlueArc has delivered on the goal of the fastest filer node to date, and will continue to enhance its offerings at the node level as well as ensuring a path to storage-grids for customers whose needs exceed the throughput of a single high-performance Titan.

It Starts with the Hardware Design

Looking at the network sector's technology advancements, networking progressed from routing functionality on standard UNIX servers, to specialized CPU-based appliances, to hardware-based router and switch solutions, where all processing is handled in hardware, using custom firmware and operating systems to drive the chipsets. BlueArc saw that a similarly radical approach would be needed for storage performance to keep pace with the advancements in computer and networking technologies, and applied these proven market fundamentals to the NAS market in developing the Titan Architecture. The goals were to solve the many pain points experienced by NAS and SAN customers in the areas of performance, scalability, backup, management, and ease of use. The requirements were to build the best possible single or clustered NAS server that could track and even exceed the requirements of most customers, and that could also be leveraged to create a simpler storage-grid to scale even further as the data crunch continued to grow. To achieve the goal of a next-generation NAS server, the architecture required:

• Significantly higher throughput and IOPS than CPU-based appliances and servers
• Highly scalable storage capacity without reducing system performance
• The ability to virtualize storage in order to extract maximum performance
• Adherence to traditional protocol standards
• Flexibility to add new innovative features and new protocol standards

Unlike conventional CPU-based NAS architectures, where you never really know how fast they are going to go until you try them, Titan is designed from the start to achieve a specific level of performance. BlueArc engineers decide up front how fast they want it to go based on what they believe is achievable at acceptable cost with the appropriate technologies. This is the simple beauty of a silicon-based design: the on-paper goals translate directly into the final product's capabilities. Now selling its fourth generation of systems, BlueArc has consistently achieved the performance anticipated in the design process and has met or exceeded customer expectations for a network storage solution.

The Modular Chassis

Titan's chassis design was the first critical design consideration, as it would need to scale through newer generations of Titan modules supporting increased throughput, IOPS and scalability. Titan's modular chassis design therefore needed to scale to 40 Gbps total throughput. BlueArc chose a passive backplane design to support these requirements. The backplane has no active components and creates the foundation for a high-availability design, which includes dual redundant hot-pluggable power supplies and fans, as well as dual battery backup for NVRAM.

Titan's passive backplane incorporates pathways upon which Low Voltage Differential Signaling (LVDS) guarantees low noise and very high throughput. The ANSI EIA/TIA-644 standard for LVDS is well suited to a variety of applications, including clock distribution and point-to-point and point-to-multipoint signal distribution. Further discussion of LVDS is beyond the scope of this paper; a simple Internet search will return significant information if you wish to understand the technology. BlueArc's hardware-based logic, implemented in Field Programmable Gate Arrays (FPGAs), connects directly to these high-speed LVDS pipelines (also known as the FastPath Pipeline), meeting the throughput requirements of current and future Titan designs.

A key advantage of this design is the point-to-point relationship between the FPGAs along the pipelines. While traditional computers are filled with shared buses requiring arbitration between processes, this pipeline architecture allows data to transfer between logical blocks point-to-point, with no conflicts or bottlenecks. For example, data being processed and transferred from a network process to a file system process is completely independent of all other data transfers; it has no impact on data moving to the storage interface. This is vastly different from

conventional file servers, where all I/O must navigate through shared buses and memory, which can cause significant performance reductions and fluctuations. Titan's backplane provides separate pipelines for transmit and receive data, meeting only on the storage module, in order to guarantee full-duplex performance. The convergence at the storage module allows a read directly from cache after a write, and is covered later in this paper.

The Modules

Titan employs four physical modules that are inserted into the rear of the Titan chassis: the Network Interface Module (NIM), two File System Modules (FSA and FSB), and the Storage Interface Module (SIM). The design goal of the first Titan module set was to deliver 5 Gbps of throughput, more than twice that of BlueArc's already high-performing second-generation system, while maintaining a competitive price point. Each module has clear responsibilities and typically operates completely independently of the others, although the FSA and FSB modules have a cooperative relationship. Next-generation modules will continue the advancement of performance, port count, memory and FPGA speeds.
Network Interface Module (NIM)

Responsible for:
• High-Performance Gigabit Ethernet or 10GigE Connectivity
• Hardware Processing of Protocols
• OSI Layers 1-4
• Out-of-Band Management Access

Titan's Network Interface Module (NIM) is responsible for handling all Ethernet-facing I/O functions corresponding to OSI Layers 1-4. The functions implemented on the NIM include handling Ethernet and jumbo Ethernet frames up to 9000 bytes, ARP, the IP protocol and routing, and of course the TCP and UDP protocols. The NIM works as an independent unit within the architecture, with its own parallel state machines and memory banks. As in the overall architecture, the TCP/IP stack is serviced in hardware on the NIM module. This design allows it to handle 64,000 sessions concurrently. Multiple hardware state machines, programmed into FPGAs and running in a massively parallel architecture, ensure that there are no wait-states. The result is nearly instantaneous network response, the highest performance, and the lowest latency.

In fact, the predecessor to the NIM (used in BlueArc's first- and second-generation systems) was one of the world's first TCP Offload Engines (TOE), similar to those used in some PC-based appliances today. The purpose-built NIM provides an ideal network interface to the high-performance Titan. A key difference between the NIM and an off-the-shelf TOE card is the substantial amount of resources available. While most TOE cards have no more than 64 megabytes of buffer memory, the NIM has more than 2.75 gigabytes of buffer memory supporting the parallel state machines in the FPGAs. This allows the NIM to handle significantly higher throughput and more simultaneous connections. TOE cards used in PC-based architectures are also limited by PCI and memory bus contention in the server, whereas Titan's pipelines are contention free. In addition, TOE cards used on NAS filers usually handle only certain protocols, putting the burden of the remaining protocols on the central CPU and affecting overall performance and functionality; in the Titan, FPGAs handle virtually all protocols.

The Titan 2000 Series offers several different NIM modules, depending on the model, to match performance and connectivity requirements. There are two GigE modules, with 4 or 6 GigE ports using SFP (Small Form-factor Pluggable) media to allow for either optical or copper physical interconnects. There is also a 10GigE module for 10GigE infrastructures; it is standard on high-end models and an upgrade option on other models. The NIM supports link aggregation (IEEE 802.3ad), including the Link Aggregation Control Protocol (LACP), supporting dynamic changes to the aggregation and enabling higher availability and higher throughput to the data - critical for high-performance shared data environments. The NIM also has Fast Ethernet ports for out-of-band management, which allow direct access and/or connection to BlueArc's System Management Unit (SMU) and the other devices that make up BlueArc's total solution.


File System Modules (FSA, FSX & FSB)

Responsible for:
• Advanced Features
• OSI Layer 5, 6 & 7 Protocols
  o NFS, CIFS, iSCSI, NDMP
• Security and Authentication
• SiliconFS (Hardware File System)
• Object Store Layer
• File System Attribute Caching
• Metadata Cache Management
• NVRAM Logging

The two File System Modules work collaboratively to deliver the advanced features of the Titan. The FSB board handles data movement and the FSA handles data management. The FSA is not inline with the pipeline. Rather, this module controls the advanced management and exception-processing functions of the file system, much as the supervisor module of a high-end network switch controls the higher-order features of the switch. Snapshot, quotas, and file and directory locking are a few examples of processes managed by the FSA module. It accomplishes these tasks by sending instructions to the FSB module, which actually handles the data control and movement associated with them. The FSA module has dedicated resources in support of its supervisory role, including 4 GB of memory.

As mentioned, the FSB module handles all data movement and sits directly on the FastPath pipeline, transforming, sending and receiving data to and from the NIM and the Storage Interface Module (SIM). The FSB module contains the highest population of FPGAs in the system and also contains 19.5 GB of memory distributed across different functions. It is the FSB module that moves and organizes data via BlueArc's object-based SiliconFS file system, discussed in detail later in this paper. When the FSB module receives a request from the NIM module, it inspects the request to determine what is required to fulfill it, notifies the FSA module of the arrival of the new request, and takes any action that the FSA may deem necessary. The protocol request is decoded and transformed into BlueArc's Object Store API for further processing. This critical point is an example of where Titan's parallel state-machine architecture really shows its benefit. Several functions execute simultaneously:

• The data is pushed into NVRAM to guarantee the data is captured
• The data is pushed across the High Speed Cluster Interconnect to update the cluster partner's NVRAM, if one exists
• The data is sent over the FastPath pipeline to the SIM for further processing
• A response packet is formed

Upon successful completion of all of these elements, the response packet can be transmitted by the FSB back to the NIM, which in turn sends the response back to the client. Thus what would be four serial steps in a traditional file server are collapsed into a single atomic parallel step. This kind of parallelization occurs throughout the entire system whenever possible.

The FSA module handles non-performance-centric functionality as well as protocol overhead for protocols like CIFS, a stateful protocol requiring additional processing; however, the FSA module is not involved in data movement. For large-scale CIFS or other high-end environments, a faster version of the FSA module, the FSX module, is also available. It is standard on the highest-end Titan systems and available as an upgrade on other models, so administrators can scale up their Titan as needed.
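To make the parallel step described above concrete, the following is a minimal software sketch (illustrative only - in Titan this fan-out happens in FPGA state machines, not threads) of how the four actions could be issued concurrently and the response released only after all of them complete. The function names are assumptions, not BlueArc APIs.

    from concurrent.futures import ThreadPoolExecutor, wait

    # Illustrative stand-ins for the hardware actions listed above.
    def push_to_nvram(request): ...              # capture the write in battery-backed NVRAM
    def mirror_to_cluster_partner(request): ...  # copy across the High Speed Cluster Interconnect
    def forward_to_sim(request): ...             # send the data down the FastPath to the SIM
    def build_response(request): ...             # pre-form the protocol response packet

    def handle_write(request, clustered=True):
        """Collapse what would be four serial steps into one parallel step."""
        with ThreadPoolExecutor(max_workers=4) as pool:
            futures = [
                pool.submit(push_to_nvram, request),
                pool.submit(forward_to_sim, request),
                pool.submit(build_response, request),
            ]
            if clustered:
                futures.append(pool.submit(mirror_to_cluster_partner, request))
            wait(futures)  # every branch must finish before the client is answered
        return "response released to the NIM"

In the actual hardware these branches run in dedicated state machines rather than threads, so the parallelism carries no scheduling overhead.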

Storage Interface Module (SIM)

Responsible for:
• Fibre Channel Processing
• SCSI Command Processing
• Sector Cache Management
• Parallel RAID Striping
• Cluster Interconnect
• NVRAM Mirroring


The SIM module has two distinct responsibilities. The first is the handling and management of raw data on the SAN storage back-end. The second is the high-availability features of the SAN and the cluster interconnect (when configured in a cluster). The SIM provides the redundant back-end SAN connection to the storage pool using four 4-Gigabit Fibre Channel ports. The SIM logically organizes RAID components on the Titan SAN into a virtualized pool of storage so that data is striped across an adequate number of drives to provide the high-speed throughput required by the Titan server. This parallel RAID striping is a key advantage, as it allows more drive spindles to be involved in all data transfers, ensuring the best storage performance. The virtualized storage is covered in detail later in this paper.

The SIM also provides the high-availability failover capabilities for clustered Titan systems. It has two HSCI (High Speed Cluster Interconnect) ports used both for cluster communications and as the avenue for mirroring NVRAM between Titans for the highest degree of data protection. The SIM uses two 10GigE ports for clustering, which provide an extremely fast HSCI connection. These connections are required for N-way cluster performance and can handle the increased inter-node communication required to support a high-performance storage-grid.

Memory (Buffers and Caches)

In order to achieve the performance design goal, a number of considerations must be taken into account. In particular, the minimum memory bandwidth requirements throughout the system are critical. Titan has a robust set of memory pools, each dedicated to certain tasks, and these pools must operate within certain tolerances in order to achieve the desired performance. The amount of memory in each module is summarized below; this memory is distributed across various tasks on each module. By segregating memory pools (there are several dozen in the entire system) and ensuring that each has adequate bandwidth, BlueArc ensures that memory access will never be a bottleneck - critical to sustaining Titan's high throughput. The Titan 2000 series has up to 36 gigabytes (GB) of memory, cache and NVRAM distributed across the various modules. The high-end Titan model has:

• NIM module: 2.75 GB - network processing
• FSA: 4 GB - protocol handshaking and file system management
• FSB: 19.5 GB - metadata, NVRAM and control memory
• SIM: 9.75 GB - sector cache and control memory
• Total memory: 36 GB

In designing memory requirements for a high-speed system, two key things must be taken into consideration. First, peak transfer rates on an SDRAM interface cannot be sustained due to various per-transfer overheads. Second, for the various memories contained in each of the modules, the memory bandwidth must be doubled to support simultaneous high-performance reads and writes, as data is written into memory and pulled out again. Thus memory bandwidth throughout the architecture is designed to be approximately 2.5x the bandwidth required to sustain the target throughput.

There are also areas of the architecture where memory bandwidth is even greater. The SIM, for example, has 8 gigabytes of raw block sector cache. On this module, the transmit and receive FastPath pipelines intersect, as data that is written into the sector cache must be immediately available for reading by other users even though the data may not yet have made it to disk. In this scenario, four simultaneous types of access into the memory must be considered:

• Writes coming from the FSB
• Reads being returned to the FSB
• Updates of the cache from data coming from the SAN
• Reads from the cache in order to flush data to the SAN

Thus the SIM's sector cache must deliver - and does deliver - 5x the bandwidth of the desired throughput of the overall system.
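As a rough illustration of this sizing rule (a sketch, not BlueArc's actual design numbers), the snippet below applies the 2.5x and 5x multipliers to the 5 Gbps design goal quoted earlier for the first Titan module set. The 1.25x SDRAM overhead factor is an assumption chosen to match the paper's stated multipliers.

    def general_pool_bandwidth_gbps(target_gbps, duplex_factor=2.0, overhead_factor=1.25):
        """Double for simultaneous reads and writes, then allow for SDRAM per-transfer
        overhead: 2.0 x 1.25 = 2.5x the target throughput."""
        return target_gbps * duplex_factor * overhead_factor

    def sector_cache_bandwidth_gbps(target_gbps, access_streams=4, overhead_factor=1.25):
        """The SIM sector cache serves four concurrent access types, so roughly 5x."""
        return target_gbps * access_streams * overhead_factor

    target = 5  # Gbps design goal of the first Titan module set
    print(general_pool_bandwidth_gbps(target))   # 12.5 Gbps for a general memory pool
    print(sector_cache_bandwidth_gbps(target))   # 25.0 Gbps for the SIM sector cache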

Field Programmable Gate Arrays (FPGA)

At the heart of BlueArc's Titan architecture is a unique implementation of parallel state machines in FPGAs. An FPGA is an integrated circuit that can be reprogrammed in the field, giving it the flexibility to perform new or updated tasks, support new protocols or resolve issues. Upgrades are performed via a simple update, as on switches or routers today, and can change the FPGA configuration to perform new functions or protocols.


Today's FPGAs are high-performance hardware components with their own memory, input/output buffers, and clock distribution, all embedded within the chip. FPGAs are similar to the ASICs (Application Specific Integrated Circuits) used in high-speed switches and routers, but ASICs are not reprogrammable in the field and are generally used in high-volume, non-changing products. Hardware developers sometimes do their initial designs and releases on an FPGA, as it allows for quick ad hoc changes during the design phase and short production runs. Once the logic is locked down and product volumes ramp, they move the logic to an ASIC to reach a fixed, lower-cost design. In the Titan architecture, however, the FPGA is the final design implementation, in order to provide the flexibility to add new features and support new protocols in hardware as they are introduced to the market. High-performance switches and routers use FPGAs and ASICs to pump network data for obvious reasons; now, with BlueArc, the same capability exists for network storage.

For an analogy of how the FPGAs work, think of them as little factories. There are a number of loading docks called input/output blocks, workers called logic blocks, and connecting everything up are the assembly lines called programmable interconnects. Data enters through an input block, much like a receiving dock. The data is examined by a logic block and routed along the programmable interconnect to another logic block. Each logic block is capable of doing its task unfettered by whatever else is happening inside the FPGA. These are individual tasks, such as looking for a particular pattern in a data stream or performing a math function. The logic blocks perform their tasks within strict time constraints so that all finish at virtually the same time. This period of activity is gated by the clock cycle of the FPGA. Titan's FPGAs operate at 50 million cycles per second. Given the 750,000+ logic blocks inside the Titan modules, this yields a peak processing capability of approximately 50 trillion tasks per second - over 10,000 times more tasks than the fastest general-purpose CPU. (Note: as of this writing, Intel's fastest microprocessor was rated at 3.8 billion tasks per second.) This massive parallel processing capability is what drives the BlueArc design and allows throughput to improve by nearly 100% per product generation. This contrasts sharply with conventional network storage servers, which rely on general-purpose CPUs and have only been able to scale at approximately 30% per product generation.

The fundamental failing of the CPU is that with each atomic step, a software delay is introduced as tasks, which are serially queued up for processing, demand system resources in order to execute. When a client machine makes a request to a software appliance, every attempt is made to fulfill that request as far through the compute process as possible.
The steps include everything from the network card's device driver initially receiving the request, through error checking, to translation of the request into the file system interface. However, this is a best-effort strategy. In fact, it can only be a best-effort strategy, because the CPU at the heart of a software appliance is limited to performing one task at a time and must, by definition, time-share.



This issue is exacerbated when advanced features or processes such as snapshot, mirroring, clustering, NDMP backup, and in some cases even RAID protection must be handled by the CPU. Each of these processes causes variations and slowdowns that adversely impact the throughput of traditional architectures, as the CPU's processing capability is diluted by having to time-share between these various tasks.

BlueArc's Virtualized Storage, Object Store & File System

Now that we have covered the exclusive hardware advantage of the Titan architecture, it is equally important to understand the external storage system and the unique file system that take advantage of Titan's high-performance, highly scalable architecture. As data passes through layers of FPGAs, the data itself is organized and managed via several layers within the Silicon File System. The best way to understand these layers is to work from the physical storage up to the actual file system seen by end-users.

Cluster Name Space (CNS)
• Global name space across all file systems
• Clients access a single mount point
• CNS across single or clustered Titans
• Supports multiple name spaces for multi-tenant environments

Titan or Titan Cluster
• Up to 126 file systems
• Up to 2 petabytes of usable storage

Object Store File System
• Up to 256 terabytes each
• Auto-grows as capacity is needed
• Supports thin provisioning
• Dynamically move file systems between Titans

Shared Storage Pool
• Up to 2 petabytes of usable storage
• Parallel RAID striping eliminates hot-spots
• Dynamically expand the storage pool

Physical RAID Volumes
• Up to 256 RAID volumes
• Up to 64 terabytes each

SAN Storage Arrays
• Dedicated hardware RAID controllers
• Modular storage expansion
• Redundant SAN switch connectivity
• Scale capacity, bandwidth or both
• Tiered storage with Fibre Channel & SATA drives
• High-density SATA archive option
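The containment relationships and currently supported limits listed above can be summarized in a short sketch (class names and layout are illustrative, not BlueArc terminology):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class RaidVolume:
        size_tb: int                  # up to 64 TB each, up to 256 volumes

    @dataclass
    class StoragePool:
        volumes: List[RaidVolume] = field(default_factory=list)   # up to 2 PB usable

        def usable_tb(self) -> int:
            return sum(v.size_tb for v in self.volumes)

    @dataclass
    class ObjectStoreFileSystem:
        name: str
        size_tb: int                  # up to 256 TB; can auto-grow and thin provision

    @dataclass
    class TitanCluster:
        pool: StoragePool
        file_systems: List[ObjectStoreFileSystem] = field(default_factory=list)  # up to 126

        def check_supported_limits(self):
            assert len(self.pool.volumes) <= 256
            assert all(v.size_tb <= 64 for v in self.pool.volumes)
            assert self.pool.usable_tb() <= 2 * 1024          # 2 PB expressed in TB
            assert len(self.file_systems) <= 126
            assert all(fs.size_tb <= 256 for fs in self.file_systems)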

Parallel RAID Striping & Multi-Tiered Storage (MTS)

For the physical structure of the storage, BlueArc has two requirements: first, protect the data; second, provide the high throughput and performance needed to "feed" the Titan fast enough to keep up with its throughput potential and customers' requirements. To accomplish this, BlueArc integrates multiple sets of dual redundant RAID controllers on a back-end SAN. The SAN is usually two or more redundant Fibre Channel switches, which allows for a scalable back-end and high availability. The switches are cross-connected to the RAID controllers and the Titan SIM module, providing high-availability failover paths. The redundant RAID controllers are configured active/active to enable both high performance and availability. They serve up shelves of protected RAID volumes to the Titan or Titan cluster. Rather than having each Titan own its own set of physical disks, a Titan cluster actually shares this storage in a true cluster configuration. This allows for much better storage efficiency and performance compared to other systems that implement RAID in software in the filer, with each filer only acting as a failover for the other but not sharing storage.
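As a simplified illustration of parallel RAID striping (a sketch under assumed geometry, not the SIM's actual mapping), the following spreads logical blocks of a storage pool round-robin across the RAID volumes of a stripe, so that any large transfer engages every volume - and therefore many spindles - at once:

    CHUNK_BLOCKS = 64  # assumed chunk size; the real stripe geometry is not published

    def map_block(stripe_volumes, logical_block):
        """Map a pool-relative logical block to (RAID volume, block offset in that volume)."""
        chunk = logical_block // CHUNK_BLOCKS
        volume = stripe_volumes[chunk % len(stripe_volumes)]          # round-robin placement
        offset = (chunk // len(stripe_volumes)) * CHUNK_BLOCKS + logical_block % CHUNK_BLOCKS
        return volume, offset

    # A large sequential transfer touches every RAID volume in the stripe.
    volumes = ["raid_volume_%d" % i for i in range(8)]
    touched = {map_block(volumes, b)[0] for b in range(8 * CHUNK_BLOCKS)}
    assert touched == set(volumes)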


The RAID volumes are usually configured for RAID 5 protection, providing failed-disk protection and reconstruction through rotating parity, again ensuring both high availability and good read/write performance. RAID 1 and RAID 6 are also supported, depending on the type of storage systems used. The SIM module then stripes up to 128 RAID volumes into a larger logical unit called a Stripe. Stripes are organized into a higher-level entity known as a storage pool. New Stripes may be added to a storage pool at any time without requiring any downtime, allowing dynamically scalable storage pools. This design allows Titan to scale in both capacity and back-end performance. Customers can scale performance by adding more RAID arrays, or scale capacity by simply adding more disks to existing RAID controllers, achieving the optimal performance-to-capacity ratio without the slow-down often seen in software-based filer and RAID configurations. I/O issued to the storage pool is in turn sent to an underlying Stripe, and all of the disk drives within that Stripe are brought to bear to achieve the performance required to sustain Titan's throughput. This feature, called SiliconStack, allows Titan to scale storage without a reduction in performance. In fact, adding more storage actually increases the performance of Titan, as it provides more spindles and controllers to "feed" Titan's high throughput capability. Combined with SiliconFS, this allows scalability up to 2 petabytes.

This capacity is carved up by NAS administrators into dynamically scalable file systems supporting varying block sizes, capacities, and drive technologies. Because the file systems are not tied to any specific physical drives and are virtualized across the storage pool, they can be scaled dynamically as required, through automatic triggers or manually. To increase storage efficiency, thin provisioning can be used to present more storage than is physically available, allowing administrators to promise more storage to their end-users or projects and then buy as they grow into the projected capacity.

Titan's back-end SAN also has the intrinsic property of allowing customers to control their storage expenditure more granularly. Titan's unique Multi-Tiered Storage (MTS) feature allows any type of storage technology to reside behind a single Titan. This means that administrators can choose the right disk for specific application and customer requirements. High-performance Fibre Channel drives or even solid state disks can be used for the highest throughput and I/O requirements, while lower-cost, higher-capacity Serial ATA drives can be used for lower-throughput applications or near-line storage. As storage technology continues to get faster and higher in capacity, Titan will continue to accommodate and enhance the value of these mixed media types, as well as reduce the cost of storage management via its ability to migrate data between storage tiers.

BlueArc's Unique Object Store

The Object Store is a layer between the normal presentation of a file system view to the user and the raw blocks of storage managed by the SIM. An object is an organization of one or more raw blocks into a tree structure. Each element of the object is called an Onode. Objects are manipulated by logic residing in the FPGAs located on the FSB module. The primary element at the base of an object is called the Root Onode.
Each Root Onode contains a unique 64-bit identifier called the Object Identifier (OID) as well as the metadata information relevant to the object. Root Onodes point either directly to Data Onodes, to Direct Pointer Onodes, or to Indirect Pointer Onodes, depending on the amount of content to be stored. These pointer Onodes are simply the connectors that ultimately lead to the Data Onodes. Via this extensibility, Titan can support a single object as large as the entire file system, or billions of smaller files, in a very efficient manner.

[Figure: an object tree - the Left and Right Root Onodes reference Data Onodes 0-9 through Direct and Indirect Pointer Onodes.]
For each object, two versions of the Root Onode are maintained, referred to as the Left and Right Root Onodes. At any given moment, one of these Root Onodes is atomically correct while its partner is subject to updates and changes. In combination with Titan's NVRAM implementation, this ensures that data integrity is preserved even in the case of a system failure. NVRAM recovery is discussed later in this paper. Finally, Root Onodes are "versioned" when snapshots are taken, so that previous incarnations of the object can be accessed.
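The structure described above can be sketched as follows (a deliberately simplified model, not the SiliconFS on-disk format): a Root Onode carries the 64-bit OID and metadata, pointer Onodes fan out as needed, Data Onodes hold the raw blocks, and each object keeps Left and Right Root Onode versions.

    from dataclasses import dataclass, field
    from typing import List, Union

    @dataclass
    class DataOnode:
        blocks: bytes                       # raw block contents

    @dataclass
    class PointerOnode:                     # stands in for both Direct and Indirect Pointer Onodes
        children: List[Union["PointerOnode", DataOnode]] = field(default_factory=list)

    @dataclass
    class RootOnode:
        oid: int                            # unique 64-bit Object Identifier
        metadata: dict                      # size, timestamps, ownership, and so on
        children: List[Union[PointerOnode, DataOnode]] = field(default_factory=list)

    @dataclass
    class StoredObject:
        left: RootOnode                     # one side is frozen and atomically correct
        right: RootOnode                    # the other accepts updates until the next checkpoint
        committed: str = "left"             # which Root Onode is currently authoritative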


Different kinds of objects serve different purposes. User data is contained in a file_object. A directory_name_table_object contains file and directory names in various formats (DOS short names, POSIX, etc.), file handles, a CRC32 hash value and the associated OID that points to the location of another object, such as a subdirectory (another directory_name_table_object) or a file (file_object). Directory and file manipulation, Snapshot and other features benefit from this object implementation versus a more traditional file-level structure.

One key example is delivered via a unique object called the directory_tree_object. For each directory_name_table_object there exists a peer called the directory_tree_object. This is a sorted binary search tree (BST) of Onodes containing numeric values (hashes). These hashes are derived by first converting the directory or file name to lower case and then applying a CRC32 algorithm to it. The payoff comes when it is time to find a directory or a file. When a user request asks for a particular file or directory by name, that name is again converted to lower case, the CRC32 algorithm is applied, and an FPGA on the FSB module executes a binary search of numeric values (as opposed to doing string compares of names) to locate the position within the directory_name_table_object at which to begin the search for the required name. The result is a quantum improvement in lookup speed. Where all other network storage servers break down, Titan maintains its performance even with very densely populated directories. Titan can support over 4 million files in a single directory while keeping directory search times to a minimum and sustaining overall system performance. This is one of the reasons that Titan is often used by Internet services companies: they have millions and millions of files, and fewer directories allows for a simplified data structure.

In addition to supporting many files within a directory, Titan's Object Store allows the file system itself to be significantly larger - currently supported up to 256 TB. Compare this to other file systems that theoretically support 16 to 32 TB but are often limited to less than half that size due to performance penalties. Combined with BlueArc's global name space feature, Cluster Name Space, this allows Titan to support up to 2 petabytes in a single virtual file system. These capabilities can be increased as customer requirements grow, as they are not architectural limits but rather currently supported configurations.

Client machines have no concept of objects; they see only the standards-based representation of files. Via the NFS or CIFS protocols they expect to work with string names and file handles, so Titan presents what these clients expect and handles all the conversion to objects transparently to ensure perfect compatibility. This "view" of what the client expects is the job of yet another FPGA, also located on the FSB module, which presents the Virtual File System layer to the clients.

For clients that require or prefer block-level access, Titan supports iSCSI. iSCSI requires a "view" of raw blocks of storage; the client formats this view and lays down its own file system structure upon it. To make this happen, Titan simply creates a single large object of up to 2 terabytes in size (an iSCSI limitation) within the Object Store, which is presented as a run of logical blocks to the client.
Since the iSCSI volume is just another object, features like Snapshot or dynamic growth of the object are possible.
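Returning to the hashed directory lookup described above, here is a minimal sketch of the idea in software (in Titan the search is executed by an FPGA on the FSB module, and the tree layout here is an assumption): names are lowercased and hashed with CRC32, and a sorted structure of numeric hashes is searched instead of comparing strings.

    import zlib
    from bisect import bisect_left

    def name_hash(name: str) -> int:
        """Lowercase the name, then apply CRC32 - the numeric key held in the tree."""
        return zlib.crc32(name.lower().encode("utf-8"))

    class DirectoryTree:
        """Stand-in for a directory_tree_object: sorted hashes pointing into the name table."""
        def __init__(self, entries):
            # entries: (name, oid) pairs taken from the peer directory_name_table_object
            self.index = sorted((name_hash(n), n, oid) for n, oid in entries)

        def lookup(self, wanted: str):
            h = name_hash(wanted)
            i = bisect_left(self.index, (h,))            # binary search on the numeric hash
            while i < len(self.index) and self.index[i][0] == h:
                if self.index[i][1].lower() == wanted.lower():   # resolve rare hash collisions
                    return self.index[i][2]              # OID of the file or subdirectory
                i += 1
            return None

    tree = DirectoryTree([("Report.DOC", 0x1001), ("data.csv", 0x1002)])
    assert tree.lookup("report.doc") == 0x1001

Because the comparison is numeric rather than string-based, the cost of a lookup stays essentially flat even in very densely populated directories.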

By implementing an Object Store file system, BlueArc delivers many outstanding file system characteristics beyond performance:

• Maximum supported volume size: currently 256 TB, architected for 2 PB
• Maximum supported object size: currently 256 TB, architected for 2 PB
• Maximum supported capacity: currently 2 PB, architected for over 16 PB
• Maximum objects per directory: 4 million with full 32-character names
  o Dependent on the number of attributes the objects contain and the file name lengths themselves
  o Note that Titan can perform at its maximum capability even with this kind of directory population, as long as the back-end physical storage can sustain Titan's throughput
• Maximum number of snapshots per file system: 1024

Virtual Volumes

In addition to presenting an entire file system, Titan delivers flexible partitioning, called Virtual Volumes. Administrators may not wish to expose the entirety of the file system to everyone; through the use of Virtual Volumes, they can present a subset of the file system space to a specific group or user. Virtual Volumes are logical containers, which can be

grown and contracted with a simple size control. Client machines see changes to the size of Virtual Volumes instantly. When shared as an NFS export or CIFS share, the user or application sees only the available space assigned to the Virtual Volume. Administrators can use Virtual Volumes to granularly control directory, project or user space. The sum of the space controlled by the Virtual Volumes may be greater than the size of the entire file system. This over-subscription approach, sometimes referred to as thin provisioning, provides additional flexibility when project growth rates are indeterminate. It allows administrators to present the appearance of a larger volume to their users, and to purchase the additional storage as needed, while showing a much larger storage pool than is actually available. Further granular control can be realized by assigning user and group quotas to a Virtual Volume. Each Virtual Volume may have its own set of user and group quotas, and default quota values can be assigned to undefined users and groups. Of course, both hard and soft quotas are supported by the system, as are quotas by file count.

NVRAM Protection

The current FSB module contains 2 gigabytes of NVRAM for storing writes and returning fast acknowledgements to clients. The NVRAM is partitioned in half so that one half is receiving data while the other is flushed to disk (check-pointed). The NVRAM halves are organized into smaller pages, which are dynamically assigned to the various file systems based on how heavily they are being accessed. Checkpointing is the process of flushing writes to disk. At checkpoint time, either the Left or Right Root Onode is written to while the other is frozen, becoming the atomically correct version. This process cycles back and forth every few seconds. In the event a file system recovery is needed later, the frozen version of the Root Onode is used to restore the file system quickly to a consistent check-pointed state. For example, in the case of a power outage, that atomically correct version of the Root Onode becomes critical. First, the alternate Root Onode is made to look exactly the same as the atomically correct version; this process is called a rollback. Then, the contents of NVRAM are replayed against the objects. In this way, customer data is guaranteed to be complete and intact at the end of the recovery. In a Titan cluster, the NVRAM is further partitioned: half for storing local write data, and half for storing the cluster partner's write data. In this way, even if one of the Titans fails completely, the remaining partner Titan can complete the recovery process.
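A minimal sketch of the checkpoint and recovery cycle described above (illustrative only; in Titan this logic lives in the FSB FPGAs and NVRAM hardware): writes are logged to NVRAM and applied to the active Root Onode image, checkpoints swap the Left/Right roles, and recovery rolls back to the frozen side before replaying NVRAM.

    class FileSystemState:
        def __init__(self):
            self.roots = {"left": {}, "right": {}}  # two Root Onode images of the file system
            self.frozen = "left"                    # atomically correct, check-pointed side
            self.active = "right"                   # side currently receiving updates
            self.nvram_log = []                     # writes not yet captured by a checkpoint

        def write(self, path, data):
            self.nvram_log.append((path, data))     # acknowledged once safely in NVRAM
            self.roots[self.active][path] = data

        def checkpoint(self):
            """Flush to disk, then swap roles: the freshly written side becomes the frozen one."""
            self.frozen, self.active = self.active, self.frozen
            self.roots[self.active] = dict(self.roots[self.frozen])
            self.nvram_log.clear()

        def recover(self):
            """Rollback to the frozen Root Onode, then replay NVRAM against the objects."""
            self.roots[self.active] = dict(self.roots[self.frozen])
            for path, data in self.nvram_log:
                self.roots[self.active][path] = data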


Life of a Packet

To tie the hardware and software architecture together, it is a good exercise to understand how a typical read or write operation is handled through the system. The following diagram and the steps detailed after it walk through both a write and a read operation. The diagram is a simplification of the design, but highlights the major blocks of the architecture. For simplicity the inbound and outbound FPGAs are shown as a single block, but they are actually separate.
[Diagram: the Titan chassis - NIM, FSA/FSB and SIM modules joined by the FastPath pipelines across the passive backplane, with management-plane connections and numbered callouts (1-14) matching the write and read steps below.]
Write Example

1. A network packet is received on one of the GE interfaces on the NIM.
2. The incoming packet is saved into memory on the NIM by the FPGA.
3. If the incoming packet is a network-only request, such as a TCP session setup, it is processed to completion and sent back out to the requesting client.
4. Otherwise, the FPGA gathers additional related incoming packets.
5. The complete request is passed over the LVDS FastPath to the FSB module.
6. The first FPGA on the FSB module stores the message in its own memory and then decodes the incoming message, simultaneously notifying the FSA of the arrival in case exception processing is required. Most requests are handled directly by this FPGA; the FSA module's processor handles exception cases, and only the header information required for decoding the request is sent to the FSA for processing.
7. Once the request is decoded, the Object Store takes over: this FPGA sends the data in parallel to the NVRAM, updates the metadata cache, sends an additional copy of the write request over the cluster interconnect pipeline if there is a cluster partner, begins forming a response packet AND passes the request to the SIM module via the FastPath pipeline.
8. Once the NVRAM acknowledges that the data is safely stored, the response packet is shipped back to the NIM, letting it know that the data has been received and is protected (see steps 12 and 13 below). This allows the client to go on processing without having to wait for the data to actually be put on disk.
9. In parallel with the above operations, an FPGA on the SIM receives the write request and updates the sector cache with the data. At a specified interval of just a few seconds, or when half of the NVRAM becomes full, the SIM is told by the FSB module to flush any outstanding data to the disks. This is done in such a way as to maximize large I/Os whenever possible in order to achieve the highest throughput to the storage.
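Steps 1 through 5 amount to a classification decision on the NIM. The sketch below (illustrative pseudologic with assumed packet fields, not NIM firmware) shows that branching: network-only traffic is answered directly by the NIM, while file-level requests are assembled and handed to the FSB over the FastPath.

    NETWORK_ONLY = {"tcp_syn", "tcp_fin", "arp", "icmp_echo"}   # assumed request labels

    def reply_locally(packet):
        """Step 3: network-only requests (e.g. TCP session setup) never leave the NIM."""
        return {"to": packet["client"], "kind": packet["kind"] + "_ack"}

    def assemble(fragments):
        """Steps 4-5: related packets are combined into one complete file-level request."""
        return {"client": fragments[0]["client"],
                "payload": b"".join(f["payload"] for f in fragments)}

    def nim_receive(packet, reassembly_buffer, send_to_fsb):
        """Per-packet decision on the NIM (steps 1-5 above)."""
        if packet["kind"] in NETWORK_ONLY:
            return reply_locally(packet)
        key = (packet["client"], packet["session"])
        reassembly_buffer.setdefault(key, []).append(packet)
        if packet["last_fragment"]:
            send_to_fsb(assemble(reassembly_buffer.pop(key)))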

Read Example

Steps 1 through 6 are virtually the same as in the previous example.

7. Since this is a read request, NVRAM is not involved.
8. Certain kinds of read requests involve metadata lookups. In this case the Object Store on the FSB module attempts to find the relevant data in its metadata cache and, if successful, responds rapidly without having to retrieve the metadata from disk. Otherwise the lookup mechanism kicks in as described earlier, taking advantage of the directory_tree_object BST method. At various points in this process, requests are passed on to the SIM module for any data necessary to the lookup. Once the OID of the target object is found, processing moves to the SIM.
9. The SIM module has an ample sector cache. The FPGA on the SIM checks whether it can satisfy the read request from this cache; otherwise, it formulates a Fibre Channel request to retrieve the data from disk and store it in the sector cache. Both metadata and data requests are handled this way.
10. Once the relevant data has been retrieved, the SIM passes the data back to the FSB module.
11. The Object Store updates the metadata cache as necessary and re-applies the RPC layers in order to create a well-formed response.
12. The response packet is passed to the NIM module.
13. The FPGA on the NIM organizes the response into segments that comply with TCP/IP or UDP/IP and, of course, Ethernet formatting.
14. Finally, the NIM transmits the response out the Gigabit Ethernet interface.

This packet walk-through should help tie together the hardware and software architecture and show how data flows through Titan.
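As a companion to the read steps above, this sketch (again illustrative, with assumed cache interfaces) shows the two-level check performed in steps 8 and 9: the FSB's metadata cache first, then the SIM's sector cache, and only then a Fibre Channel fetch from disk.

    def read_object(oid, offset, metadata_cache, sector_cache, fetch_from_disk):
        """Simplified read path: FSB metadata cache -> SIM sector cache -> Fibre Channel."""
        meta = metadata_cache.get(oid)              # step 8: a metadata cache hit avoids disk I/O
        if meta is None:
            meta = fetch_from_disk(("metadata", oid))
            metadata_cache[oid] = meta

        block_id = (oid, offset)
        data = sector_cache.get(block_id)           # step 9: sector cache on the SIM
        if data is None:
            data = fetch_from_disk(("data", block_id))
            sector_cache[block_id] = data           # cached for subsequent readers
        return meta, data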

Benefits of BlueArc's Titan Architecture

Industry-Leading Performance, Throughput & Capacity

First and foremost, the Titan Architecture delivers the combined benefits of high transaction performance, high throughput and high capacity. The fourth-generation BlueArc Titan delivers the highest performance of any single filer on the market, and will continue to increase this advantage through its modular design. Titan has achieved SPECsfs97_R1.v3® benchmark results exceeding those of any other network storage solution, using a single Titan presenting a single file system. Titan's published results are 98,131 ops/sec (SPECsfs97_R1.v3) with an overall response time (ORT) of 2.34 ms - more than 272% higher throughput than high-end system results from other leading NAS vendors. Dual clustered Titans presenting a single file system using the global name space achieved 195,502 ops/sec and an ORT of 2.37 ms, 286% higher than other dual clustered systems. Titan's highly efficient clustering provided unheard-of, near-linear scaling over the single-Titan results, with less than 1% loss in efficiency. These results are available at the SPEC.org website, http://www.spec.org/sfs97r1. The tests clearly demonstrate that Titan can sustain its responsiveness in both highly concurrent user and cluster computing environments, speeding applications such as life sciences research, 3D computer-generated imagery, visualization, Internet services and other compute-intensive environments.

While the test results are a proof point for transactional performance, raw throughput is also critical for many of today's large digital content applications. Here, too, Titan excels, providing a full 10 Gbps of aggregate throughput. The combination of high transaction rates and high throughput allows Titan to excel in mixed environments where users and batch processes demand both aspects of performance with minimal wait times. Titan is the pinnacle of network storage performance in both dimensions.

Titan currently supports up to 2 PB under a global namespace with file systems of up to 256 TB; however, it is architected to support up to 16 PB. The actual amount of storage supported will continue to scale as testing, memory, and customer requirements grow. Although the unique Object Store-based file system is the enabler for this capability, the hardware allows these large file systems to continue to be accessed at the highest performance levels even as they fill with large numbers of files. The combined hardware and software also enable support for over a million files per directory, critical for larger file systems. Since these inquiries are converted to a binary search, the system delivers exceptional responsiveness to users regardless of the directory structures and file counts. This benefit allows storage administrators to consolidate their filers to reduce hardware, software license and support costs. It also allows customers to purchase an initial system and grow it without buying additional filers and without having to manage the migration or separation of data between multiple filers or tiers of storage.


Scalability

Titan can scale to meet customers' future requirements in any of three key storage dimensions. First is the need to support higher performance in terms of IOPS, to "feed" high-speed applications and compute clusters. Second is the need to support higher bandwidth in terms of Gbps, as file sizes and the number of users of these large data sets continue to grow. Third is the need to store significantly greater amounts of data in terms of terabytes, driven by growing file sizes, increased data sets, and changing regulatory requirements to retain data for longer periods of time.

To cope with all three dimensions of scalability, most customers have been forced to do forklift upgrades or deploy many individual or HA-clustered filers. These approaches only lead to filer proliferation, increased complexity and higher support costs. Filer proliferation causes many of the same challenges as DAS, such as unused space on some volumes, not enough on others, excessive headroom across all the volumes, and management issues. Data sets that once had to be separated between different filers to support greater aggregate capacity, performance and throughput demands no longer need to be split up with Titan, and because of this, clients no longer need to determine which filer to access. Clustered storage is another solution; however, the lower performance of other systems often requires as many as 8-10 storage cluster nodes or more. With clustered storage software in its infancy, and with higher node counts increasing complexity and losing efficiency, fewer nodes are clearly an advantage.

Titan was designed to address these three dimensions of scalability today as well as into the future. First, Titan was designed with the highest performance, capacity and throughput, which meets or exceeds most customers' current requirements. BlueArc understands that these requirements will continue to grow as data sets grow and compute clusters become more prevalent, putting more demands on the storage subsystem. BlueArc foresees the potential of customer requirements nearly doubling year over year in the very near future, and the Titan architecture was designed with this in mind. From a design perspective, the long-term benefit of the architecture is that it allows the BlueArc engineering team to increase the performance and throughput of each product generation by approximately 100%. This compares with the approximately 30% achieved by conventional CPU-based appliance vendors unless they increase their system costs. This gives BlueArc an ever-increasing advantage at the atomic (single filer node) level - doubling the advantage each product cycle at current rates. This is critical both for customers who wish to stay on single and HA-clustered systems and for customers who want to eventually deploy a storage-grid. Both the current Titan and future module upgrades will continue to provide the fastest file serving nodes, and are designed to be the most capable and simple building blocks for a storage-grid.

For customers who require scalability beyond the capabilities of a single Titan, Titan is the ideal foundation for larger storage-grids, in that fewer nodes are required to achieve the desired performance. As explained in the previous sections, Titan is able to cluster in up to an 8-way configuration with nearly no loss of efficiency, providing the highest SPECsfs IOPS of any single-name-space solution.
Other configurations take 16, 24, or even more clustered file servers to achieve similar results - and this is often theoretical, as many clusters do not support these higher node counts and lose efficiency as they scale. Building a storage-grid from larger, faster nodes reduces hardware, software and support costs, and significantly reduces the complexity required to achieve the desired capacity and performance. Fewer storage cluster nodes also reduce the back-end inter-cluster communications and data transfers. Smaller storage clusters or grids with faster nodes provide a more efficient and cost-effective scalable storage solution. BlueArc engineering is continuing to drive towards higher-node-count cluster storage solutions; however, with Titan's significant advantage in per-node performance, and the doubling of performance with module upgrades, their task is significantly less daunting than that of competitors, whose engineering efforts must cluster more nodes based on standard PC architectures - a significantly more complex and less efficient proposition.

Features

The Titan architecture delivers advanced features, layered onto the hardware file system, without significantly impacting performance, as is often the case with a CPU-based, shared-memory appliance. Features such as snapshot, policy-based migration, mirroring, and other data-mover features are executed in hardware, operating on objects within the object store, allowing them to be performed at a binary level within the hardware. This significantly reduces the overhead of these features. Combined with the virtualized storage and back-end SAN implementation of the architecture, this enables many storage functions to be handled without affecting the performance and throughput of the system. This is especially true in scenarios such as a drive failure, where a CPU-based filer would have to get involved in rebuilding the hot-spare, while Titan off-loads this function to the hardware RAID controllers.

The Titan architecture allows any type of back-end storage to be used, including Fibre Channel, SATA and even solid state disk. This feature, called Multi-Tiered Storage (MTS), departs from other vendors' approaches, which require a separate filer to handle different tiers of storage and disk, causing a proliferation of filers. Titan further delivers the capability to migrate data between these storage tiers, simplifying storage management and providing a more cost-controlled environment, as administrators can provide the right type of disk, performance and throughput based on application requirements. To preserve the user or application experience, migration between tiers can also be accomplished transparently using Data Migrator, providing simplified ILM functionality that does not affect end-users or applications.

Titan delivers multi-protocol access into its MTS, including NFS, CIFS, and even block-level iSCSI. Block-level access is enabled through the object design of the file system, allowing even a block-level partition to be viewed as an object. For management, Titan supports SSL, HTTP, SSH, SMTP, and SNMP, as well as remote scripting capability via utilities provided at no additional license fee to the customer.

Going forward, BlueArc will continue to innovate in both hardware and software. Titan's modular design gives customers the purchase protection of an upgradeable system, increasing throughput, IOPS, and capacity with simple blade changes. In terms of software, the foundation of SiliconFS and the virtualized storage enables advanced features such as Virtual Servers, Data Migrator, WORM, and remote block-level replication.

Conclusion

The fastest, most reliable and longest-lived technology in the data center is typically the network switch, whether Ethernet or Fibre Channel. A switch purchased five years ago is still fast and useful because it was built with scalability in mind. The speed, reliability and scalability of the network switch are directly attributable to the parallelism inherent in its hardware-accelerated implementation, its high-speed backplane and its replaceable blade design. BlueArc, having learned these lessons well, has delivered on the promise of hardware-accelerated file services and will capitalize on this unique capability to enable its customers to continually scale their storage infrastructure as their requirements grow.

Contact BlueArc today at 1-866-864-1040 to learn more, or visit www.bluearc.com

BlueArc Contact Information
BlueArc Corporation - Corporate Headquarters
50 Rio Robles, San Jose, CA 95134, USA
info@bluearc.com
T 408 576 6600  F 408 576 6601

BlueArc UK Ltd.
Queensgate House, Cookham Road, Bracknell RG12 1RB, United Kingdom
uk_info@bluearc.com
T +44 (0) 1344 408 200  F +44 (0) 1344 408 202

Or visit our Web site at: http://www.bluearc.com
