
Cloud Storage Engineer

Case Study Questions


Please note that for each of the questions below, it is acceptable to propose various solutions
and address the pros and cons of each. We are interested not only in your answers, but in your
thought process. You are free to use the Internet as a reference; however, you are not allowed
to directly solicit help from another person.

1. Describe how you would troubleshoot issues reported from DBAs that storage is causing
poor database performance? What are the key performance metrics you would track
that show how the storage is performing?
The fundamental items to consider for database performance on a SAN are:
● I/O
● RAID type
● Disk count
● Systems
● LUNs
● Cache
● Stripes

For cache utilization on the SAN I will check the disk parity group, the DB instance, the
tablespace and the indexes.

On the SAN switches I will monitor performance counters: bytes transmitted/received, frames
transmitted/received, CRC errors, link errors, buffer errors, RX traffic, RX throughput, TX traffic,
TX throughput and total throughput.

On the servers I will generally monitor server capacity, disk utilization, I/O performance,
total MB/s, queue lengths, read/write IOPS, I/O wait time, file system space, device file
performance, CPU busy/wait, paging, swapping, semaphores, locks, threads, NFS client, NFS
server and HBA performance (bytes RX/TX).

For the application/DB I will monitor Oracle tablespace performance and capacity, buffer pool
cache and data blocks, reads/writes, tablespace used/free and logs.

On MS SQL Server I will monitor the server cache usage, current cache hit %, trends, page
writes/sec, lazy writes/sec, redo log I/O per second and network packets sent and received.
On DB2, I will monitor tablespace performance and capacity, buffer pool, cached data, block
reads/writes, tablespace used/free and logs.
On MS Exchange I will monitor shared memory, queue information, store mailbox, store
public, store exchange and the server processes.

For best performance I would recommend that the write hit rate should always be 100%, and
the hard disk busy rate should be less than 75%; otherwise I/O will queue in the backend. To fix
that I would change RAID levels, spread load across RAID groups, use faster disks and add
more disks to RAID groups. Always tune your storage, using the process of elimination until
bottlenecks are located and a solution is found; tuning is a compromise between cost and
performance. I would perform a performance analysis and ensure maximum write throughput to
LUNs by setting up RAID groups with HDDs roaming across disk trays, and map LUNs so that
the controllers share the write workload. For databases I would suggest fewer, larger LUNs per
disk group; this helps to minimize LUN management overhead, and I/O performance is best
when the LUNs are spread across more disks. For optimum SAN stripe size for a database,
I would recommend that the RAID stripe size at the SAN layer match the database stripe size.
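The disk count and RAID type considerations above can be turned into a back-of-envelope sizing check. This is a sketch using commonly cited RAID write penalties and an assumed per-disk IOPS figure, not vendor-specific numbers:

```python
# Back-of-envelope: host-visible IOPS a RAID group can sustain, given the
# per-disk IOPS and the RAID write penalty. All figures are illustrative
# assumptions, not vendor specifications.

RAID_WRITE_PENALTY = {"RAID-0": 1, "RAID-1": 2, "RAID-10": 2, "RAID-5": 4, "RAID-6": 6}

def effective_iops(disks, iops_per_disk, raid_level, read_pct):
    """Host-visible IOPS for a RAID group at a given read/write mix."""
    raw = disks * iops_per_disk
    penalty = RAID_WRITE_PENALTY[raid_level]
    write_pct = 1.0 - read_pct
    # Each host write costs `penalty` backend I/Os; each read costs 1.
    return raw / (read_pct + write_pct * penalty)

# 8 x 15k-rpm disks (~180 IOPS each, assumed) at a 70/30 read/write mix:
print(round(effective_iops(8, 180, "RAID-5", 0.70)))
print(round(effective_iops(8, 180, "RAID-10", 0.70)))
```

This is why, for the same spindle count, changing RAID levels or adding disks to the group (as recommended above) directly changes how much write-heavy database load the backend can absorb.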

2. Explain the differences between VMware standard switch and distributed switch.
Describe the benefits of one over the other?
Distributed switch: individual host-level virtual switches are abstracted into a single large
distributed switch. It contains advanced features such as network vMotion, PVLANs and
bidirectional traffic shaping. Distributed switches are independent of the number of hosts: a
single virtual switch connects multiple hosts in a cluster, allowing centralized management of
network configurations in a vSphere environment.
Benefits: the VMware vSphere Distributed Switch helps businesses realize software-defined
networking by delivering centralized provisioning, administration and monitoring. It provides
administrators with data-center-level network aggregation and tight traffic management and
control to streamline configuration. It extends its ports and management across all the servers
in the cluster, supporting up to 500 hosts per distributed switch.

Standard switch: managed and configured individually on each host, and serves only that
host's VMs.

3. An NFS export has been created for a Linux server. The server cannot mount it.
Describe how you would troubleshoot this issue?
The NFS client and server communicate using remote procedure call (RPC) messages over the
network. Both the server-to-client and client-to-server communication paths must be functional.
You can use common tools such as ping, traceroute or tracepath to verify that the client and
server machines can reach each other; if not, examine the network interface card (NIC) settings
using either ifconfig or ethtool to verify the IP settings. A "no route to host" error can be caused
by the RPC messages being filtered by the host firewall, the client firewall, or a network switch.
Verify whether the firewall is active and whether NFS traffic is allowed; normally NFS uses
port 2049. As a quick test one can switch the firewall off with "# service iptables stop" on both
the client and the server, then try mounting the NFS directory again. Don't forget to switch it
back on and configure it correctly to allow NFS traffic. The Linux NFS implementation requires
that both the NFS service and the portmapper (RPC) service be running on both the client and
the server; check with:
# rpcinfo -p
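The rpcinfo check above can be automated. This is a sketch that parses `rpcinfo -p` style output (a hypothetical sample is embedded; in practice you would feed it the real command's output) to confirm the required services are registered:

```python
# Sketch: confirm the NFS and portmapper services are registered on a host
# by parsing `rpcinfo -p` style output. SAMPLE_RPCINFO is illustrative
# output, not captured from a real system.

SAMPLE_RPCINFO = """\
   program vers proto   port  service
    100000    4   tcp    111  portmapper
    100003    3   tcp   2049  nfs
    100005    3   tcp  20048  mountd
"""

def registered_services(rpcinfo_output):
    """Return {service: port} parsed from `rpcinfo -p` style output."""
    services = {}
    for line in rpcinfo_output.splitlines()[1:]:  # skip header row
        fields = line.split()
        if len(fields) == 5:
            services[fields[4]] = int(fields[3])
    return services

svcs = registered_services(SAMPLE_RPCINFO)
# Both must be present on client and server for a Linux NFS mount to work.
print("nfs" in svcs and "portmapper" in svcs)
print(svcs.get("nfs"))  # NFS normally listens on 2049
```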

4. Explain the difference between NFS, CIFS, iSCSI, and FCP


Why would you choose one over the other?
CIFS is an open standard protocol for network file service. It is a file access protocol designed
for the Internet and is based on the Server Message Block (SMB) protocol that Microsoft
Windows operating systems use for distributed file sharing; it lets remote users access a file
system over the network.

NFS (Network File System) is a network file system protocol originally developed by Sun
Microsystems in 1984, allowing a user on a client computer to access files over the network as
easily as if they were on a local disk. NFS builds on the Open Network Computing Remote
Procedure Call (ONC RPC) system. The NFS (v2 and v3) protocol itself is stateless from the
server's point of view, meaning that the NFS server does not track what the NFS client is doing
with the served file system.

iSCSI stands for Internet SCSI, or Internet Small Computer Systems Interface. iSCSI is the
transmission of SCSI commands and data over IP networks. A host cannot be connected to
both iSCSI and Fibre Channel storage systems; furthermore, while an iSCSI array may be
capable of accepting connections from both HBAs and NICs, a host cannot have both NIC and
HBA connections to iSCSI storage systems. A host requires iSCSI software, and an iSCSI
service must be running; for example, QLogic 400 series 1 GbE iSCSI adapters provide SAN
connectivity over Ethernet and TCP/IP network infrastructures. An iSCSI client is called an
initiator (which can be software or hardware) and an iSCSI server is called a target. iSCSI
allows organizations to utilize their existing TCP/IP network infrastructure without investing in
expensive Fibre Channel switches, and provides block-level access to storage devices.
Advantages: iSCSI uses TCP/IP as an alternative to FCP to overcome FCP's limitations, such
as high expense and distance limits. It facilitates data transfers over intranets, manages storage
over long distances, is inexpensive to implement, provides high availability, and is robust and
reliable. iSCSI is best suited for web server, email and departmental applications.

FCP (Fibre Channel Protocol) defines a multilayered architecture for moving data; FCP
packages SCSI commands into Fibre Channel frames ready for transmission. FCP allows data
transmission over twisted pairs and over fibre optic cables. It is mainly used in larger data
centers for high-availability applications such as transaction processing and databases. In
short, it is a SCSI interface protocol utilizing an underlying Fibre Channel connection.

CIFS and NFS are distributed file system protocols allowing a user on a client computer to
access files over a computer network; CIFS is used on Windows, while NFS is used on Linux
and UNIX. iSCSI and FCP are block transport protocols: iSCSI uses TCP/IP and is inexpensive,
while FCP uses Fibre Channel and is expensive.

5. What are your thoughts on when to contact technical support?


Once you have exhausted your own troubleshooting steps within a reasonable amount of time.

6. Discuss how you keep your technical skills sharp and up-to-date.
Taking professional development courses, utilizing online resources, attending professional
events, networking online and continuing education.

7. While designing storage for Databases, metrics such as throughput and/or IOPS are
used. What are your thoughts on these metrics? Describe how you would benchmark
candidate solutions.
Throughput is the amount of data transferred from one place to another, or processed, in a
specified amount of time; data transfer rates for disk drives and networks are measured as
throughput, typically in Kbps, Mbps or Gbps.
IOPS is the number of input/output operations a storage system performs per second, from
start to finish. IOPS measures the number of read and write operations per second, while
throughput measures the number of bits read or written per second. Higher values mean a
device can handle more operations per second; for example, a high sequential write IOPS
value is helpful when copying a large number of files from another drive. SSDs have
significantly higher IOPS values than HDDs.
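The two metrics are linked by I/O size, which is why neither one alone characterizes a workload. A sketch of the relation, with illustrative numbers:

```python
# Throughput and IOPS are linked by block size:
#   throughput (MB/s) = IOPS x block size
# A workload doing many small I/Os can have high IOPS but modest throughput.

def throughput_mb_s(iops, block_size_kb):
    """MB/s delivered at a given IOPS rate and block size (1 MB = 1024 KB)."""
    return iops * block_size_kb / 1024

# The same 100 MB/s of data moved, with very different IOPS demands:
print(throughput_mb_s(12800, 8))    # OLTP-style 8 KB blocks -> 100.0 MB/s
print(throughput_mb_s(400, 256))    # backup-style 256 KB blocks -> 100.0 MB/s
```

This is why an OLTP database is benchmarked on IOPS and latency while a backup stream is benchmarked on MB/s, even when both move the same volume of data.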
Typical I/O tendencies: about 10% of activity is sequential, typically moving large blocks. Airline
reservations are a special case because the small block size (less than 1 KB) creates distinct
requirements; one has to watch the read response rate and IOPS.
I/O is the most important metric for an email system: the higher the I/O rate, the more
messages per second the system can deliver. Maintaining very high I/O rates requires paying
attention to the response time metric; achieving the best email system I/O response time
depends on good backend distribution of mailboxes across the storage array groups. Since the
workload contains a higher percentage of writes, destage activity at the backend may also be
very high; larger cache sizes can help to maintain I/O response time.
I/O patterns affect performance. Logs are always sequential and synchronous because data is
written on commit or on log buffer overrun. I have noticed that application performance is
dominated by read performance. Sequential access will likely result in fewer I/Os as requests
are coalesced; random access may result in more I/Os.
The hard disk busy rate should be less than 75%; otherwise I/O will wait in the backend. To fix
this I would spread load across RAID groups, change RAID levels, use faster disks and add
more disks to RAID groups.
Read and write response time for database disks should be below 20 ms, with spikes no higher
than 50 ms.
Higher IOPS with smaller blocks will stress the array microprocessors (MPs), while large blocks
will saturate the links. As for the maximum recommended load on a storage array port, the port
busy rate should be less than 40%.
For response-time-centric applications such as OLTP, design for low utilization to ensure CPU
cycle availability and a low queue depth to ensure no wait time.
For throughput-centric applications such as backups, design for high utilization to ensure
maximum throughput and a high queue depth to let the CPU manage queues more efficiently.
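The queue-depth targets above can be reasoned about with Little's Law, which ties outstanding I/Os to IOPS and latency. A sketch with illustrative numbers:

```python
# Little's Law connects the queue-depth targets above to IOPS and latency:
#   outstanding I/Os = IOPS x service time
# The numbers below are illustrative, not measurements.

def outstanding_ios(iops, latency_ms):
    """Average I/Os in flight needed to sustain `iops` at `latency_ms`."""
    return iops * (latency_ms / 1000.0)

# Latency-sensitive OLTP: modest IOPS at low latency needs a shallow queue.
print(outstanding_ios(2000, 5))    # 10.0 outstanding I/Os
# Throughput-centric backup: sustaining high rates tolerates deep queues.
print(outstanding_ios(20000, 20))  # 400.0 outstanding I/Os
```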
For random I/O environments like databases and general-purpose file servers, all disks should
spend equal amounts of time servicing I/O requests. For maximum random I/O performance I
will keep disk usage at 35% or lower as reported by the iostat command; disk usage in excess
of 90% is a critical problem, and I would create a new striped (RAID 0) volume with more disks
(spindles).
Once cache write pending (CWP) reaches 70%, or whatever threshold is set on the array or
recommended by the vendor, the array goes into priority destage mode and rejects I/Os coming
in from the front end until space is cleared in the cache. I will keep CWP under 70% (and cache
read at 30%). When the storage array starts to reject I/O, applications will feel the pain:
database servers suddenly get slow and I/O becomes sluggish, since this is essentially
write-through mode. I have noticed that write-through is usually faster for heavy loads and
write-back is usually faster for light loads; 30% write pending is normal for a busy OLTP system
on a SAN storage array that is adequately managing the destaging of writes.
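The cache write-pending behavior described above can be sketched as a simple threshold check. The 70% figure comes from the text; real arrays differ, so treat it as an assumption to confirm with the vendor:

```python
# Sketch of the cache write-pending (CWP) behavior described above: once CWP
# crosses the priority-destage threshold (70% per the text; arrays vary),
# writes are effectively handled write-through until the cache drains.

DESTAGE_THRESHOLD = 0.70  # assumed per the text; confirm with your vendor

def cache_mode(write_pending_pct):
    """Return the effective write mode for a given cache write-pending level."""
    if write_pending_pct >= DESTAGE_THRESHOLD:
        return "write-through (priority destage; hosts see latency spikes)"
    return "write-back (writes acknowledged from cache)"

print(cache_mode(0.30))  # ~30% write pending is normal for busy OLTP
print(cache_mode(0.75))
```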

Scenario 1:
The University’s on-campus VoIP environment consists of a number of Cisco virtual appliances
running on VMware. Your group is responsible for managing and administering the VMware
and NetApp storage infrastructure. Over the course of a month, a number of ServiceNow
incidents are created related to issues with voicemail: status lights on phones are not working
correctly, and voicemails are delayed and sometimes not recorded.

The telephony engineer contacts you and asks for assistance in troubleshooting. He informs
you that a number of his virtual machines seem to be running slow and at times some are
crashing. He also informs you that the voicemail virtual appliance performs a large amount of
reads and writes that appear to be performing poorly.

Describe the steps you would take to troubleshoot this issue.

For the VMs I will check the latency the VM is seeing, using top or iostat; to compare or verify
IOPS and performance I would use vdbench (Iometer can also be used). I will also check the
appliance's IOPS requirement and whether the backing storage can fully satisfy it.
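The iostat check above can be made concrete by parsing the extended statistics (`iostat -x`) for per-device wait time and utilization. The sample output below is illustrative, not captured from the actual environment:

```python
# Sketch: pull per-device latency (await, ms) and utilization out of
# `iostat -x` style output to spot a saturated device. SAMPLE_IOSTAT is
# illustrative output, not real data.

SAMPLE_IOSTAT = """\
Device:  r/s   w/s   rkB/s   wkB/s  await  %util
sda     10.0  50.0   400.0  2000.0   4.20  35.0
sdb      5.0 120.0   200.0  9600.0  48.70  97.5
"""

def slow_devices(iostat_output, await_ms=20.0):
    """Devices whose average I/O wait exceeds `await_ms` (assumed threshold)."""
    slow = []
    for line in iostat_output.splitlines()[1:]:  # skip header row
        fields = line.split()
        if fields and float(fields[5]) > await_ms:
            slow.append((fields[0], float(fields[5])))
    return slow

# sdb's ~49 ms await and ~98% utilization point at a saturated backend,
# consistent with the read/write-heavy voicemail appliance performing poorly.
print(slow_devices(SAMPLE_IOSTAT))
```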

What areas could be the root cause? Bandwidth can be the root of many of the most common
VoIP problems. Old routers may need to be replaced because they can cause transmission
problems, and router settings might not be set to prioritize VoIP traffic over other
high-bandwidth applications.

Scenario 2:
You are a member of a team to evaluate storage solutions for Oracle databases. It consists of
DBAs, System Administrators, Information Security staff and Storage administrators. There are
a number of solutions being discussed:

1. Give the raw disks to the database system and let Oracle ASM manage the
storage.
2. On the storage device, create a volume and export it to the database server and
use NFS. The versions of NFS to use are up for debate.
3. Use a managed cloud solution where substantial administration is performed by
the cloud vendor.
4. The databases frequently need to be cloned for QA or Dev environments.

How would you evaluate different options? What information do you need? What benchmarks,
if any would you use? What metrics do you consider important? How could the different
solutions affect cloning, backups, and snapshots?
I would gather: how many read and write IOPS are needed, I/O rates and data transfer rates,
cache read hit ratios, and the read/write percentage. I would compare NFS vs. raw device
mappings (RDM) and gather information from the DBAs about read and write latency.
If cloning is needed, then physical storage capacity must be considered.
With the cloud you will need to compare operating costs, because frequent cloning in the cloud
can be too costly.
Metrics to consider: manageability, availability, performance and cost savings.
It is important to choose appropriate RAID levels in an ASM environment. Choose RAID levels
based on your I/O performance and data availability requirements. When servicing a database
workload, the difference between RAID-1+0 and RAID-5 for random writes is not pronounced
when I/O is large or sequential. For example, for OLTP workloads with 30 percent random
writes of 4K-32K, consider using RAID-1+0. However, if your workload consists of large writes
(greater than 32K) or your database access is sequential in nature, for example, in a data
warehouse, consider using RAID-5.
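The RAID-level guidance above can be expressed as a small decision sketch. The thresholds (32 KB write size, 30% random-write mix) come straight from the text; treat them as starting points, not hard rules:

```python
# The RAID-1+0 vs RAID-5 guidance above as a decision sketch. Thresholds
# (32 KB writes, 30% random-write mix) are taken from the text.

def suggest_raid(avg_write_kb, random_write_pct, sequential):
    """Suggest RAID-1+0 vs RAID-5 for an Oracle workload, per the rule above."""
    if sequential or avg_write_kb > 32:
        return "RAID-5"       # large or sequential writes amortize parity cost
    if random_write_pct >= 0.30:
        return "RAID-1+0"     # small random writes avoid the RAID-5 penalty
    return "either (difference is not pronounced)"

print(suggest_raid(8, 0.30, sequential=False))   # OLTP, 4K-32K random writes
print(suggest_raid(64, 0.10, sequential=True))   # data warehouse
```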
When servicing large streaming sequential I/Os, an entire stripe set can be rewritten in one
operation. However, in online transaction processing (OLTP) database applications, this
happens comparatively infrequently due to the relatively large stripe width.
An OLTP application might only require the storage capacity of one RAID group, but might have
a peak IOPS load that requires four RAID groups in a RAID-1+0 (2D+2D) configuration. Best
practice: base your storage configuration decisions on performance requirements first, then
capacity requirements. For multiple, large or performance-intensive databases, your
performance requirements might justify creating a separate pool or pools for each database.
Manage extent size and block size to avoid wasting storage capacity. To achieve additional
flexibility and performance, place data into different disk groups based on the Oracle Database
file type.
ASM distributes data on all the disks in the disk group. Striping options are COARSE and FINE.
COARSE striping is laid out in allocation units (AU) of 1MB and FINE striping is laid out in finer
units of 128KB. ASM mirrors at the file level, files are partitioned in allocation units (AU) of 1MB,
and are laid out on different disks to implement mirroring. ASM provides two levels of mirroring
through NORMAL and HIGH redundancy options. In NORMAL REDUNDANCY, data is
duplicated. In HIGH REDUNDANCY, data is in triplicate. ASM also provides the ability to create
disk groups without mirroring, using EXTERNAL REDUNDANCY. This feature allows you to
leverage the storage's RAID implementation.
ACFS Read-Only Snapshots
• Dynamic, fast, space-efficient, "point in time" copies of ASM file system files
• Captures ASM FS file block/extent updates
• An enabler for:
  • On-line backups
  • On-line, disk-based file backup model using snapshots and individual file recoveries
• Up to 64 snapshot images per ASM file system
• Policy-based snapshots:
  • Schedule snapshots on an interval basis: every 5 seconds, every 30 minutes, daily, … with recycling (using EM)
• ACFS CLIs support creation and removal of snapshots
• ACFS snapshot functions integrated with EM
