1
Who?
• NCSA
– a unit of the University of Illinois
at Urbana-Champaign
– a federal, state, university, and
industry funded center
• Academic Users
– NSF peer review
• Large number of
applications/user needs
– 3rd party codes, user written…
– All running on same
environment
• Many research areas
2
NCSA’s 1st Dell Cluster
• Tungsten: 1750 server cluster
– 3.2 GHz Xeon
• 2,560 processors (compute only)
• 16.4 TF; 3.8 TB RAM; 122 TB disk
• Dell OpenManage
– Myrinet
• Full bi-section
– Lustre over Gig-E
• 13 DataDirect 8500
• 104 OSTs, 2 MDS w/separate disk
• 11.1 GB/sec sustained
– Power/Cooling
• 593 KW / 193 tons
– Production date: April 2004
– User Environment
• Platform Computing LSF
• Softenv
• Intel Compilers
• ChaMPIon Pro, MPICH, VMI-2
(“The first large-scale Dell cluster!!!”)
3
NCSA’s 3rd Dell Cluster
• T2 – retired into:
• Tungsten-3: 1955 blade cluster
– 2.6 GHz Woodcrest Dual-Core
• 1,040 processors/2,080 cores
• 22 TF; 4.1 TB RAM; 20 TB disk
• Warewulf
– Cisco InfiniBand
• 3 to 1 over-subscribed
• OFED-1.1 w/ HPSM subnet manager
– Lustre over IB
• 4 FAStT controllers direct FC
• 1.2 GB/s sustained
• 8 OSTs and 2 MDS w/complete auto failover
– Power/Cooling
• 148 KW / 42 tons
– Production date: March 2007
– User Environment
• Torque/Moab
• Softenv
• Intel Compilers
• VMI-2
4
NCSA’s 4th Dell Cluster
• Abe: 1955 blade cluster
– 2.33 GHz Clovertown Quad-Core
• 1,200 blades/9,600 cores
• 89.5 TF; 9.6 TB RAM; 120 TB disk
• Perceus management; diskless boot
– Cisco InfiniBand
• 2 to 1 oversubscribed
• OFED-1.1 w/ HPSM subnet manager
– Lustre over IB
• 22 OSTs (anticipated)
• 2 DDN 9500 controllers direct FC
• 10 FAStT controllers on SAN fabric
• 8.4 GB/s sustained
• 22 OSTs and 2 MDS w/complete auto failover
– Power/Cooling
• 500 KW / 140 tons
– Production date: May 2007
– User Environment
• Torque/Moab
• Softenv
• Intel Compilers
• MPI: evaluating Intel MPI, MPICH, MVAPICH, VMI-2, etc.
(“The largest Dell cluster!!!”)
5
NCSA Facility - ACB
• Advanced Computation Building
– Three rooms, totals:
• 16,400 sqft raised floor
• 4.5 MW power capacity
• 250 kW UPS
• 1,500 tons cooling capacity
– Room 200:
• 7,000 sqft – no columns
• 70” raised floor
• 2.3 MW power capacity
• 750 tons cooling capacity
6
NCSA’s Other Systems
• Distributed Memory Clusters
– Mercury (IBM, 1.3/1.5 GHz Itanium2):
• 1,846 processors
• 10 TF; 4.6 TB RAM; 90 TB disk
7
NCSA Storage Systems
• Archival: SGI/Unitree (5 PB total capacity)
– 72TB disk cache; 50 tape drives
– currently 2.8PB of data in MSS
• >1PB ingested in last 6 months
• projected ~3.2PB by end of CY2006
• licensed to support 5PB resident data
– ~30 data collections hosted
• Databases:
– 8 processor 12GB memory SGI Altix
• 30TB of SAN storage
• Oracle 10G, mysql, Postgres
– Oracle RAC cluster
– Single-system Oracle deployments for focused projects
LCI Conference 2007 National Center for Supercomputing Applications
8
Visualization Resources
• 30M-pixel Tiled Display Wall
– 8192 x 3840 pixels composite
display
– 40 NEC VT540 projectors, arranged
in a 5H x 8W matrix
– driven by 40-node Linux cluster
• dual-processor 2.4GHz Intel Xeons
with NVIDIA FX 5800 Ultra graphics
accelerator cards
• Myrinet interconnect
• to be upgraded by early CY2007
– funded by State of Illinois
• SGI Prisms
– 8 x 8 processor (1.6 GHz Itanium2)
– 4 graphics pipes each; 1 GB RAM each
– InfiniBand connection to Altix machines
9
SAN at NCSA
• 1.3PB spinning disk
– 895TB SAN attached
• 1392 Brocade switch ports
• 7 SAN fabrics
• 2 data centers
10
Persistent Binding
• Device naming problems
• Udev solution
• Examples
• Interactive Demo
11
Device Naming Problem
Before After
• Add hardware
• SAN zoning
• New SAN luns
• Modify config
Devices assigned random names (based on next available major/minor pair for device type)
CLUSTER
- Multiple hosts that see the same disk will assign the disk to different device nodes
- may be /dev/sda on system1 but /dev/sdc on system2
- Can change with hardware changes; what used to be /dev/sda is now /dev/sdc
12
What needs to happen
• Storage target always maps to same
local device (i.e. /dev/…)
• Local device name should be meaningful
– /dev/sda conveys no information about the
storage device
13
udev - Persistent Device Naming
• “Udev is … a userspace solution for a
dynamic /dev directory, with persistent
device naming” *
– Userspace: not required to remain in memory
– Dynamic: /dev not filled with unused files
– Persistent: devices always accessible using the
same device node
• Provides for custom device names
* Daniel Drake (http://www.reactivated.net/writing_udev_rules.html)
14
Setting up udev device mapper
Overview
15
1. Uniquely identify each lun
/sbin/scsi_id
Flow: device name → scsi_id (issues SCSI INQUIRY) → unique id
Sample usage:
root# scsi_id -g -u -s /block/sda
SSEAGATE_ST318406LC_____3FE27FZP000073302G5W
/sbin/scsi_id
- INPUT: existing local device name
- OUTPUT: string that uniquely identifies the specific device (guaranteed unique among all scsi devices)
SAMPLE:
- sda: locally installed drive
- sdb: SAN attached disk
16
2. Associate a meaningful name
New udev rules file: /etc/udev/rules.d/20-local.rules
BUS="scsi", SYSFS{vendor}="DDN", SYSFS{model}="S2A 8000",
PROGRAM="/sbin/scsi_id -g -u -s /block/%k",
RESULT="360001ff020021101092fadc32a450100", NAME="disk/fc/sdd4c1l0"
• BUS=scsi
– /sys/bus/scsi
• SYSFS
– <BUS>/devices/H:B:T:L/<filename>
• PROGRAM & RESULT
– Program to invoke and result to look for
• NAME
– Device name to create (relative to /dev)
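Rule lines of the form above can be generated from scsi_id output. A minimal sketch — the helper name mk_udev_rule and the shown invocation are illustrative, not part of the original setup; only the field layout matches the rule file above:

```shell
#!/bin/sh
# Illustrative helper (not from the deck): format a udev rule line in the
# style shown above, given vendor, model, scsi_id result, and the desired
# persistent name. All argument values below are the slide's sample values.
mk_udev_rule() {
    vendor=$1; model=$2; wwid=$3; name=$4
    printf 'BUS="scsi", SYSFS{vendor}="%s", SYSFS{model}="%s", PROGRAM="/sbin/scsi_id -g -u -s /block/%%k", RESULT="%s", NAME="%s"\n' \
        "$vendor" "$model" "$wwid" "$name"
}

# Example with the values from the slide:
mk_udev_rule DDN "S2A 8000" 360001ff020021101092fadc32a450100 disk/fc/sdd4c1l0
```

In practice one would loop over /sys/block, run /sbin/scsi_id on each entry, and feed the results through a helper like this into the 20-local.rules file.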
17
Example: Customizing for multiple paths
Problem
Multiple paths to a single lun result in multiple device nodes.
Need to know which path each device uses.
18
Example: Customizing for multiple paths
19
Demo: udev persistent device naming
• Single HBA
• Single disk unit
– 4 luns
– Each lun presented
through both controllers
• Host sees 8 logical
luns
• Use mpio_scsi_id
to identify the ctlr-lun
20
Demo: udev persistent device naming
Original Configuration Custom device names
• udev config file • Custom rules file
– /etc/udev/udev.conf – 20-local.rules
• scsi_id config file • Restart udev
– /etc/scsi_id.config – udevstart
• Scan fc luns • Custom device
– {sysfs}/hostX/scan
names created
– /dev/disk/by-id
– /dev/disk/fc
BEGIN
- tail -f /var/log/messages
1. Enable udev logging
2. Enable scsi_id for all devices (options -g)
3. /proc/partitions
4. Scan fc luns (echo “- - -” > /sys/class/scsi_host/hostX/scan)
5. See udev log lines in messages file ; See fc disks in /dev/disk/by-id
6. Enable 20-local rules file
7. Udevstart
8. See udev log lines in messages file ; See fc disks in /dev/disk/fc
DEFAULT CONFIGURATION
Local rules file already exists. Disable it.
Default behavior for scsi_id is to blacklist everything unknown (-b option). Enable whitelisting of everything (-g option) so scsi_ids will be returned.
Even before custom rules are in place, see default udev rule selection activity in /var/log/messages
CUSTOM CONFIGURATION
Udev custom rules are selected (see /var/log/messages)
Examples
• udevinfo -a -p $(udevinfo -q path -n /dev/sdb)
• udevtest /block/sdb
22
Custom script: ls_fc_luns
Get HBA list from sysfs: /sys/class/fc_host
23
Custom script: lip_fc_hosts
24
Custom script: scan_fc_luns
25
Custom script: delete_fc_luns
26
udev - Additional Resources
• man udev
• http://www.emulex.com/white/hba/wp_linux26udev.pdf
– Excellent white paper
• http://www.reactivated.net/udevrules.php
– How to write udev rules
• http://www.us.kernel.org/pub/linux/utils/kernel/hotplug/udev.html
– Information and links
• http://dims.ncsa.uiuc.edu/set/san
– FC tools : custom tools used in demo
27
Linux Multipath I/O
• Overview
• History
• Setup
• Demos
– Active / Passive Controller Pair
– Active / Active Controller Pair
28
Linux Multipath - History
Providers
• Storage Vendor
• HBA Vendor
• Filesystem
• OS
STORAGE VENDOR
- End to end solution (they provide disk, HBA, driver, add’l software, sometimes even FC switch)
- HBA’s (and other parts) come at a markup
- One location for support tickets, but no alternate recourse if they can’t fix the problem
- Proprietary requirements (typically require 2 HBA’s, only works with their systems)
HBA VENDOR
- QLA
> Linux support spotty
+ 2.4 kernel ok, but strict requirements (2 HBA’s, exactly 2 paths per lun, active/active controllers)
+ 2.6 kernel inconsistent behavior
> Solaris support spotty (2 months to get 1 machine working, next month stops working, machine was
untouched)
> Dropped Windows support prematurely (Windows MPIO layer not complete yet, only an API for
vendors)
> Proprietary solution, only works with their HBA’s and configuration software
- Emulex (unix philosophy, do one thing and do it well; MPIO doesn’t belong in the driver)
FILESYSTEM
- 3rd party - Veritas, others?
- Parallel Filesystems - Ibrix, Lustre, GPFS, CXFS (enable MPIO via failover hosts)
OS
- *NEW* Solaris 10 (XPATH, but requires Solaris branded QLA cards)
- *NEW* Linux (device mapper multipath) (RedHat4, Suse, others…)
29
Device Mapper Multipath
• Identify luns by scsi_id
• Create “path groups”
– Round-robin I/O on all paths
in groups
• Monitor paths for failure
– When no paths left in current
group, use next group
• Monitor failed paths for
recovery
– Upon path recovery, re-
check group priorities
– Assign new active group if
necessary
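The group-selection behavior in the bullets above can be modeled in a few lines of shell. This is a toy model, not dm-multipath code; the function name pick_active_group is invented for illustration:

```shell
#!/bin/sh
# Toy model of dm-multipath's active-group choice (NOT the real code).
# stdin: one "priority live_path_count" pair per path group.
# Output: priority of the group that would carry I/O — the
# highest-priority group that still has at least one live path.
pick_active_group() {
    # sort groups by priority, descending; take the first with live paths
    sort -rn | awk '$2 > 0 { print $1; exit }'
}

# Example: the prio-50 group has lost all its paths, so I/O
# moves to the prio-2 group.
printf '50 0\n2 4\n' | pick_active_group
```

When a failed path in the prio-50 group recovers, re-running the selection with '50 1' restored would hand the active role back — which is exactly the failback re-check described above.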
30
Linux Device Mapper Multipath
Overview
31
1. Identify unique luns
Storage Device
• vendor
• product
• getuid_callout
device {
vendor "DDN"
product "S2A 8000"
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
}
32
1. Identify unique luns
Multipath Device
• wwid
• alias
multipath {
wwid 360001ff020021101092fb1152a450900
alias sdd4l0
}
33
2. Monitor Healthy Paths for Failure
• Priority group • path_grouping_policy
– Collection of paths to – multibus
the same physical lun – failover
– I/O is split across all – group_by_prio
paths in round-robin – group_by_serial
fashion
– group_by_node
34
2. Monitor Healthy Paths for Failure
Path Grouping Policy = group_by_prio
prio_callout
35
2. Monitor Healthy Paths for Failure
• path_checker • no_path_retry
– tur – queue
– readsector0 – (N > 0)
– directio – fail
– (Custom)
• emc_clariion
• hp_sw
TUR
- SCSI Test Unit Ready
- Preferred if lun supports it (OK on DDN, IBM fastt)
- Does not cause AVT on IBM fastt
- Does not fill up /var/log/messages on failures
READSECTOR0
- reads sector 0 through the block device node (e.g. /dev/sdX)
DIRECTIO
- reads the first sector with direct I/O (O_DIRECT) on the block device
Both readsector0 and directio cause AVT on IBM fastt, resulting in lun thrashing
Both readsector0 and directio log “fail” messages in /var/log/messages (could be useful if you want to
monitor logs for these events)
NO_PATH_RETRY
- # of retries before failing path
- queue: queue I/O forever
- (N > 0): queue I/O for N retries, then fail
- fail: fail immediately
36
3. Monitor failed paths for recovery
• Failback
– Immediate (same as n=0)
– (n > 0)
– manual
FAILBACK
- When a path recovers, wait # seconds before enabling the path
- Recovered path is added back into multipath enabled path list
- multipath re-evaluates priority groups, changes active priority group if needed
MANUAL RECOVERY
- User runs ‘/sbin/multipath’ to update enabled paths and priority groups
37
Putting it all together
multipaths {
multipath {
wwid 3600a0b8000122c6d00000000453174fc
alias fastt21l0
}
multipath {
wwid 3600a0b80000fd6320000000045317563
alias fastt21l1
}
}
devices {
device {
vendor "IBM"
product "1742-900"
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
path_grouping_policy group_by_prio
prio_callout "/usr/local/sbin/path_prio.sh %n"
path_checker tur
no_path_retry fail
failback immediate
}
}
38
Putting it all together
Flow: multipath calls path_prio.sh for a device (e.g. sdb); path_prio.sh finds the matching line in the primary-paths file and returns its priority (e.g. 50).
/usr/local/etc/primary-paths
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563 2
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:2 sdd 3600a0b8000122c6d0000000345317524 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:3 sde 3600a0b80000fd6320000000245317593 2
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563 51
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:2 sdk 3600a0b8000122c6d0000000345317524 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:3 sdl 3600a0b80000fd6320000000245317593 51
PATH_PRIO.SH
- grep device from primary-paths file
- return value from last column
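The two speaker-note steps (find the device's line, return the last column) suggest a path_prio.sh along these lines — a sketch reconstructed from the notes, since the deck does not include the actual script:

```shell
#!/bin/sh
# Sketch of path_prio.sh reconstructed from the notes above (the real
# script was not shown). Argument: a device name such as sdb.
# Looks the device up in the primary-paths file and prints the priority
# found in the last column of its line.
PRIMARY_PATHS=${PRIMARY_PATHS:-/usr/local/etc/primary-paths}

path_prio() {
    dev=$1
    # field 4 of each primary-paths line is the sd name; the final
    # field is the priority for that path
    awk -v d="$dev" '$4 == d { print $NF; exit }' "$PRIMARY_PATHS"
}
```

Against the sample file above, `path_prio sdb` would print 50 (a preferred path) and `path_prio sdc` would print 2, which is what prio_callout feeds back to group_by_prio.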
39
Demo: Active/Passive Disk
• Host
– One Emulex LP11000
• Disk
– IBM DS4500
– Luns presented through
both controllers
– Luns accessible via 1
controller only at a time
– AVT enabled
AVT
- Lun will migrate to alternate controller if requested there
- Tolerance of cable/switch failure
- AVT penalty: lun inaccessible for 5-10 secs while controller ownership changes
SCREENS: /var/log/messages , multi-port-mon , command , script host
1. No luns (ls_fc_luns)
2. /etc/multipath.conf
1. Multipaths (fastt)
2. Devices (fastt)
3. /usr/local/sbin/path_prio.sh
1. Identify controller A, controller B
4. /usr/local/etc/primary-paths
5. Add luns (scan_fc_luns)
1. See multipath bindings & path_prio.sh output in /var/log/messages
6. View current multipath configuration
1. Multipath -v2 -l
7. Failover test
1. Script-host: disable disk port A
2. See multipathd reconfig in /var/log/messages
3. See I/O path change in multi-port-mon
8. Recover test
1. Script-host: enable disk port A
40
Demo: Active/Active Disk
• Host
– One Emulex LP11000
• Disk
– DDN 8500
– Luns accessible via
both controllers (no
penalty)
41
Path Grouping Policy Matrix
                        1 HBA               2 HBAs
Active/Active           multibus (demo1)    multibus
Active/Passive w/ AVT   path_prio (demo2)   path_prio
ACTIVE/ACTIVE 2 HBAs
- trivial, same as demo1
- Each HBA sees 1 ctlr
- Can let both HBAs see both ctlrs (4 paths to each lun)
+ Use path_prio if need to control path usage
ACTIVE/PASSIVE (AVT) 2 HBAs
- trivial, similar to demo2
ACTIVE/PASSIVE (no AVT) 1 HBA
- Tolerant of ctlr failure only.
- If anything else fails, luns will not AVT to alternate ctlr, host will lose access
ACTIVE/PASSIVE (no AVT) 2 HBAs
- Non-preferred paths will be failed
- Each HBA must have full access to both controllers
42
Linux Multipath Errata
• Making changes to multipath.conf
– Stop multipathd service
– Clear multipath bindings
• /sbin/multipath -F
– Create new multipath bindings
• /sbin/multipath -v2 -l
– Start multipathd service
• Cannot multipath root or boot device
• user_friendly_names
– Not really, just random names dm-1, dm-2 …
43
Linux Multipath Resources
• multipath.conf.annotated
• man multipath
• http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home
– Multipath tools official home
• http://www.redaht.com/docs/manuals/csgfs/browse/rh-cs-en/ap-rhcs-dm-multipath-usagetxt.html
– Description of output (multipath -v2 -l)
• http://kbase.redhat.com/faq/FAQ_85_7170.shtm
– Setup device-mapper multipathing in Red Hat Enterprise Linux 4?
• http://dims.ncsa.uiuc.edu/set/san
– Multi-port-mon
– Set switchport state : (en/dis)able switch port via SNMP
MULTIPATH.CONF.ANNOTATED (RedHat)
- /usr/share/doc/device-mapper-multipath-0.4.5/multipath.conf.annotated
44