EMC Navisphere Analyzer: A Case Study

May 2001


Copyright © 2001 EMC Corporation. All rights reserved.

No part of this publication may be reproduced or distributed in any form or by any means, or stored in a database or retrieval system, without the prior written consent of EMC Corporation. The information contained in this document is subject to change without notice. EMC Corporation assumes no responsibility for any errors that may appear.

All computer software programs, including but not limited to microcode, described in this document are furnished under a license, and may be used or copied only in accordance with the terms of such license. EMC either owns or has the right to license the computer software programs described in this document. EMC Corporation retains all rights, title and interest in the computer software programs.

EMC Corporation makes no warranties, expressed or implied, by operation of law or otherwise, relating to this document, the products, or the computer software programs described herein. EMC CORPORATION DISCLAIMS ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. In no event shall EMC Corporation be liable for (a) incidental, indirect, special, or consequential damages or (b) any damages whatsoever resulting from the loss of use, data or profits, arising out of this document, even if advised of the possibility of such damages.

EMC, EMC2, CLARiiON, and Navisphere are registered trademarks and where information lives is a trademark of EMC Corporation. All other brands or products may be trademarks or registered trademarks of their respective holders.

C844

Table of Contents

Executive Summary
Introduction
Customer's Problem
Conclusions of the Analysis
Summary
Appendix 1

Executive Summary

This white paper describes the functionality of EMC Navisphere Analyzer through the discussion of a customer case study. As more and more companies rely on CLARiiON products, the need increases to be able to quickly determine the following:

• The array is being used efficiently
• The array is working properly
• Sufficient resources exist on the array for normal day-to-day operations, as well as potential growth

Navisphere Analyzer software is a host-based performance analysis tool that is intended to be used as a microscope, examining specific data in as much detail as necessary to determine the cause behind a bottleneck and/or a performance issue. Analyzer can be used to continuously monitor and analyze performance. Alternately, it can be used to analyze data collected earlier. Data can be collected automatically from selected arrays; the user can specify when to record data, from which hosts to gather data, and where the data should be stored. Collecting historical data of this type is helpful in determining the cause of lingering performance problems. The user can also compare real-time data to data recorded previously to help analyze performance issues, which is most helpful in determining how to fine-tune array performance for maximum utilization. Once the cause has been isolated, Analyzer is of further assistance in helping to assess whether fine-tuning parameters of the array will solve the problem or whether hardware components, such as cache memory or disks, need to be added.

The case study presents the functionality available with Navisphere Analyzer, the problem that the customer experienced, the types of data that were collected, and the methodology that was used to resolve the problem using Analyzer.

Introduction

A typical problem for a system administrator to encounter is a complaint by one or more departments that the performance of an array drastically changes from time to time, and that it seems unrelated to what that department is doing at the time; that is, the department's usage of the array has not changed significantly. The administrator will usually try to gather information about when the problem occurred to see if he or she can determine what else was going on at the time. This is usually difficult to do because it's rare that every department remembers precisely what they were doing, and when.

Navisphere Analyzer permits the administrator to collect data over different blocks of time and then analyze that data to see if there is any hint about the underlying causes of the problem. For many problems, looking at the utilization of different components of the array is usually sufficient to quickly narrow down the basis of the problem; the "Basic" data types (see Appendix 1) are typically quite sufficient for the analysis of most performance problems. When using Navisphere Analyzer for such an investigation, utilization of the LUNs, storage processors, and/or disks can help give a specific direction to pursue in researching a performance issue. That is, the administrator first looks in general at the utilization of each LUN, then, if a particular LUN's utilization is high, looks in more and more detail at the performance characteristics of that particular LUN.

To illustrate this point, a case study is presented here in which Navisphere Analyzer was used to determine the underlying cause of a performance problem.
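The drill-down just described — survey the utilization of every LUN, then zoom in on the busiest one — can be sketched in a few lines of Python. This is a minimal illustration only: the sample values and LUN names are hypothetical, and Analyzer itself presents this data graphically rather than through a scripting interface.

```python
# Hypothetical per-interval utilization samples (fraction of time busy)
# for each LUN; Analyzer displays the same idea as charts.
samples = {
    "LUN 0x00": [0.22, 0.31, 0.27],
    "LUN 0x01": [0.40, 0.35, 0.38],
    "LUN 0x02": [0.88, 0.97, 0.99],  # the over-utilized LUN
}

def busiest_lun(samples):
    """Rank LUNs by average utilization and return the hottest one."""
    averages = {lun: sum(vals) / len(vals) for lun, vals in samples.items()}
    return max(averages, key=averages.get)

print(busiest_lun(samples))  # LUN 0x02
```

Once the busiest LUN is known, the same narrowing step repeats at finer granularity: the LUN's own time series, then its storage processor, then its disks.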

Customer's Problem

The customer is a major CLARiiON account who was experiencing a severe performance problem each month when a large sales report was run. The configuration consisted of two FC5700 CLARiiON arrays with a combined storage of two terabytes, configured as RAID 5. While the arrays performed well most of the time, when a particular large sales report was executed, the performance of the arrays was severely affected.

The system administrator used Navisphere Analyzer to first look at the utilization of all of the LUNs on the array. Data was recorded from 07:59 to 10:06 to overlap with the running of the sales report. Figure 1 is a printout of the utilization of the LUNs. It is clear that LUN 0x02 is close to 100 percent utilization: its average utilization is at 90 percent, and the latest utilization was close to 100 percent. This would obviously affect the performance of the array in general.

Figure 1. LUN Utilization Report

The next step was to look in detail at the utilization for that LUN. Figure 2 shows that shortly after the sales report started to run at 08:00, the utilization for the LUN reached 100 percent and more or less stayed there for the duration of the report. This makes it very clear that this LUN is over-utilized.

Figure 2. Report showing, in detail, utilization of the LUN
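A pattern like the one in Figure 2 — utilization pinned at 100 percent for the duration of the report — can also be detected programmatically. The sketch below is illustrative only; the sample values and the 98 percent threshold are assumptions, not Analyzer behavior.

```python
def saturated_spans(samples, threshold=0.98):
    """Return (start, end) index pairs of runs where utilization stays
    at or above the threshold -- e.g. a LUN pinned near 100% busy."""
    spans, start = [], None
    for i, u in enumerate(samples):
        if u >= threshold and start is None:
            start = i
        elif u < threshold and start is not None:
            spans.append((start, i - 1))
            start = None
    if start is not None:
        spans.append((start, len(samples) - 1))
    return spans

# Hypothetical samples from 07:59 on; the report starts around sample 1:
util = [0.42, 0.99, 1.0, 1.0, 0.99, 0.45]
print(saturated_spans(util))  # [(1, 4)]
```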

The next step that the administrator took was to look at the utilization of the storage processor. This data is shown in the lower portion of Figure 3. It is clear that the storage processor is not being over-utilized; that is, the storage processor is not the limiting factor in this performance issue, because its utilization is only around 40 percent.

Figure 3. Report showing LUN and storage processor utilization combined

The next step was to look at the queue length of the storage processor. This is shown in Figure 4. It is clear that, again, the storage processor is not the issue, because its queue length is around 5, which is the number of disks that constitute the LUN.

Figure 4. LUN utilization and storage processor queue length

The queue length for the LUN itself was then examined. It is shown in Figure 5. This queue length demonstrates the location of the problem, because it is close to 20, which exceeds the number of disks that constitute the LUN. It is clear that this is the location of the bottleneck.

Figure 5. LUN utilization and LUN queue length
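The rule of thumb applied in this step — a LUN is saturated when its average queue length exceeds the number of disks behind it — can be expressed as a small check. This is a sketch of the reasoning, not an Analyzer feature:

```python
def diagnose_lun(queue_length, n_disks):
    """Rough bottleneck check: a striped LUN can keep roughly one request
    in service per member disk, so a sustained queue much longer than the
    disk count means requests are waiting on spindles, not on the SP."""
    load_factor = queue_length / n_disks
    if load_factor <= 1.0:
        return "disks keeping up"
    return f"over-utilized: load is {load_factor:.0f}x what the disks can service"

# The case-study numbers: LUN queue length ~20 on a 5-disk RAID 5 LUN.
print(diagnose_lun(20, 5))  # over-utilized: load is 4x what the disks can service
```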

Conclusions of the Analysis

First, looking at the overall utilization report, it was clear that the LUN was being over-utilized. The next step looked specifically at the LUN. It was obvious from that report (Figure 2) that while the sales report was being run, the LUN was over-utilized. What wasn't clear, however, was whether the issue was the load on the storage processor or whether it was due to a lack of disks in the LUN.

Next, the utilization of the storage processor was examined. When the data was examined, it was clear that the storage processor was not being bogged down, because its utilization was less than 50 percent. The queue lengths for both the storage processor and the LUN were then examined. The queue length for the storage processor was less than or equal to the number of disks that constitute the LUN, and therefore the storage processor was not the performance bottleneck. The queue length for the LUN, however, was close to 20, which is four times the number of disks in the LUN. That is, the load on the LUN is four times higher than it can handle.

The conclusion that was drawn from this data is that the number of disks on the system should be increased; that is, the LUN in question needed more disks in it. Additional disks were added, and once this was done, the problem was resolved.

Summary

Navisphere Analyzer was used to determine the cause of a performance problem. Starting at the level of the LUN, a few very clear reports were used to move closer and closer to the problem. Additional disks were added and the problem was resolved.

Appendix 1

Navisphere Analyzer collects and analyzes data on the following performance properties:

Basic (for disk, storage processor, and LUN)

• Utilization – The fraction of a certain observation period that the system component is busy serving incoming requests. An SP or disk that shows 100 percent (or close to 100 percent) utilization is a system bottleneck; the component has reached its saturation point. Since a LUN is considered busy if any of its disks are busy, LUN utilization usually represents a pessimistic view; a high LUN utilization value does not necessarily indicate that the LUN is approaching its maximum capacity.
• Total throughput (IO/s) – The average number of requests that pass through a system component per second. Total throughput includes both read and write requests. Since smaller requests need a shorter time for this, they usually result in a higher total throughput than larger requests do.
• Total bandwidth (MB/s) – The average amount of data in Mbytes that is passed through a system component per second. Total bandwidth includes both read and write requests. Larger requests usually result in a higher total bandwidth than smaller requests.
• Response time (ms) – The average time, in milliseconds, required for one request to pass through a system component, including its waiting time. For a given workload, queue length and response time are directly proportional, since an increase in the overall workload will not affect the component throughput.
• Queue length – The average number of requests within a certain time interval waiting to be served by the component, including the one in service. An (average) queue length of zero indicates an idle system. If three requests arrive at an empty service center at the same time, only one of them can be served immediately; the other two must wait in the queue, resulting in a queue length of three. The higher the queue length for a component, the more requests are waiting in its queue, thus increasing the average response time of a single request.
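The relationships among these basic metrics can be sketched numerically. The per-interval counters below are hypothetical, not Analyzer's actual data format; the point is how utilization, throughput, bandwidth, queue length, and response time relate (response time follows from queue length and throughput via Little's law).

```python
def basic_metrics(busy_ms, interval_ms, requests, mbytes, queue_samples):
    """Derive the 'Basic' metrics from per-interval counters."""
    utilization = busy_ms / interval_ms                 # fraction of period busy
    throughput = requests / (interval_ms / 1000.0)      # IO/s
    bandwidth = mbytes / (interval_ms / 1000.0)         # MB/s
    queue_length = sum(queue_samples) / len(queue_samples)
    # Little's law: requests in system = arrival rate * response time,
    # so response time (ms) = queue length / throughput * 1000.
    response_ms = queue_length / throughput * 1000.0
    return utilization, throughput, bandwidth, response_ms

# A 10-second interval in which the component was busy for 9 seconds:
u, tp, bw, rt = basic_metrics(9000, 10000, 2000, 16, [4, 5, 6, 5])
print(f"util={u:.0%} throughput={tp:.0f} IO/s bw={bw:.1f} MB/s rt={rt:.1f} ms")
```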

Workload

• Maximum outstanding requests (storage processor) – The largest number of commands on the storage processor at one time since statistics logging was enabled. This value measures the biggest burst of requests sent to this storage processor at a time.
• Maximum request count (LUN) – The largest number of requests queued to this LUN at one time since statistics logging was enabled. This value also indicates the worst instantaneous response time due to the maximum number of waiting requests.
• Maximum requests in queue (disk) – The maximum number of requests waiting to be serviced by this specific disk since statistics logging was enabled.
• Read/write throughput (I/Os – disk, storage processor, LUN) – The average number of reads or writes respectively passed through a component per second. Since smaller requests need less processing time, they usually result in a higher read or write throughput than larger requests.
• Read/write bandwidth (MB/s – disk, storage processor, LUN) – The average number of Mbytes read or written respectively that was passed through a component per second. Large requests usually result in a higher bandwidth than smaller ones.
• Read/write size (KB – disk, storage processor, LUN) – The average read or write size respectively in Kbytes. This number indicates whether the read workload is oriented more toward throughput (I/Os per second) or bandwidth (MB per second).

Read cache

• Used prefetches (percent, LUN) – This measure is an indication of prefetching efficiency. Two consecutive requests trigger prefetching, thereby filling the read cache with data before it is requested. Thus sequential requests will receive the data from the read cache instead of from the disks. As the percentage of sequential requests rises, so does the percentage of used prefetches.
• Hit rate (LUN) – The number of read requests that was satisfied by either the write or read cache. A read cache hit occurs when recently accessed data is referenced while it is still in the cache, which results in a lower response time and higher throughput.
• Hit ratio (LUN) – The fraction of read requests served from both read and write caches vs. the number of read requests to this LUN. The higher the ratio, the better the read performance.
• Miss rate (LUN) – The rate of read requests that could not be satisfied by the storage processor cache and therefore required a disk access, since the data was not currently in the cache from a previous disk access.

Write cache

• Hit rate (LUN) – The number of write requests per second that was satisfied by the write cache. Write cache hits occur when recently accessed data is referenced again while it is still in the write cache, since it has been referenced before and not yet flushed to the disks.
• Hit ratio (LUN) – The fraction of write requests served from the write cache vs. the total number of write requests to this LUN. The higher the ratio, the better the write performance.
• Miss rate (LUN) – The number of write requests per second that could not be satisfied by the write cache.
• Dirty page percentages (percent, storage processor) – The percentage of cache pages owned by this storage processor that was modified since it was last read from, or written to. In an optimal environment, the dirty pages percentages will not exceed the high watermark for a long period.
• Flush ratio (storage processor) – The fraction of the number of flush operations performed vs. the number of write requests. A flush operation is a write of a portion of the cache to make room for incoming write data. Since the ratio is a measure for the back-end activity vs. front-end activity, a lower number indicates better performance.
• Forced flush rate (LUN) – Number of times per second the cache had to flush pages to disk to free space for incoming write requests. Forced flushes indicate that the incoming workload is higher than the back-end workload. A relatively high number over a long period of time suggests that you spread the load over more disks.
• Block flush rate
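As an illustration of how the hit-ratio figures above are defined (the counter names are hypothetical; Analyzer reports these ratios directly):

```python
def cache_ratios(read_hits, reads, write_hits, writes):
    """Hit ratios as defined above: requests served from cache vs. the
    total number of requests. Misses (1 - ratio) required disk access."""
    read_hit_ratio = read_hits / reads if reads else 0.0
    write_hit_ratio = write_hits / writes if writes else 0.0
    return read_hit_ratio, write_hit_ratio

r, w = cache_ratios(read_hits=850, reads=1000, write_hits=600, writes=800)
print(f"read hit ratio {r:.0%}, write hit ratio {w:.0%}")
```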

• High watermark flush on rate (storage processor) – Number of times, since the last sample, that the number of modified pages in the write cache reached the high watermark. The higher the number, the greater the write workload coming from the host.
• Low watermark flush off rate (storage processor) – Number of times, since the last sample, that the number of modified pages in the write cache reached the low watermark, at which point the storage processor stops flushing the cache. This number should be close to the high watermark flush on number.
• Idle flush on rate (storage processor) – Number of times, since the last sample, that the write cache started flushing dirty pages to disk due to a given idle period. Idle flushes indicate a low workload.
• Flush rate (storage processor) – Number of times per second that the write cache performed a flush operation. A flush operation here is a write of a portion of a cache for any reason; that is, it includes forced flushes, flushes resulting from high watermark, and flushes from an idle state. The higher the number, the greater the write workload coming from the host.

Miscellaneous

• Average seek distance (disk) – Average seek distance in gigabytes. Longer seek distances result in longer seek times and therefore higher response times. Defragmentation might help to reduce seek distances.
• Service time (disk, storage processor, LUN) – Time, in milliseconds, a request spent being serviced by a component. It does not include time waiting in a queue; service time is mainly a property of the system component. Generally, a low value is needed for good performance.
• Average busy queue length (disk, storage processor) – Average number of requests waiting for a busy system component to be serviced, including the request that is currently in service. Since the queue length is counted only when the component is busy, the value indicates the frequency variation (burst frequency) of incoming requests. The higher the value, the bigger the burst, and the longer the average response time at this component.
• Disk crossing rate (LUN) – Indicates how many back-end requests per second used an average of at least two disks. A disk crossing may involve more than two disks, that is, more than two stripe element crossings. Disk crossings are counted for read and write requests, and relate to the LUN stripe element size. This value indicates back-end workload; generally, a low value is needed for good performance.
• Disk crossing percentage (LUN) – Percentage of requests that requires I/O to at least two disks vs. the total number of server requests. Generally, larger I/Os take longer and therefore usually result in lower throughput (I/Os) but better bandwidth (MB/s).

SnapView

• Reads from snapshot cache – The number of reads during this session that have resulted in a read from the snapshot cache rather than reading from the source LUN.
• Reads from snapshot LUN – The number of read requests on SnapView during this snapshot session.
• Reads from snapshot source LUN – The number of reads during this snapshot session from the source LUN. It is calculated by the difference between the total reads in session and reads from cache.
• Writes to snapshot source LUN – The number of writes during this snapshot session to the source LUN (on the pertinent storage processor).
• Writes to snapshot cache – The number of writes to the source LUN this session that triggered a copy-on-write operation (the first write to each snapshot cache chunk region).
• Writes larger than cache chunk size – The number of writes to the source LUN during this session which were larger than the chunk size (they have resulted in multiple writes to the cache).
• Cache chunks used in snapshot session – The number of chunks that this session has used.
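The watermark mechanism described above can be modeled with a small state machine. This is a simplified conceptual sketch only: the real flushing logic is internal to the storage processor, and the 80/60 percent watermarks are example values.

```python
def flush_events(dirty_pct_samples, high=0.8, low=0.6):
    """Count high-watermark flush-on and low-watermark flush-off events
    from a series of dirty-page percentages (simplified model: flushing
    turns on when dirty pages cross the high watermark and turns off
    when they fall back to the low watermark)."""
    flushing = False
    on_events = off_events = 0
    for pct in dirty_pct_samples:
        if not flushing and pct >= high:
            flushing = True
            on_events += 1
        elif flushing and pct <= low:
            flushing = False
            off_events += 1
    return on_events, off_events

# Dirty pages climb past 80%, drain below 60%, then climb again:
print(flush_events([0.5, 0.7, 0.85, 0.75, 0.55, 0.82]))  # (2, 1)
```

In a healthy configuration the two counts track each other closely, matching the note above that the low watermark flush off number should be close to the high watermark flush on number.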
