1
Who?
• NCSA
– a unit of the University of Illinois
at Urbana-Champaign
– a federal, state, university, and
industry funded center
• Academic Users
– NSF peer review
• Large number of
applications/user needs
– 3rd party codes, user written…
– All running on same
environment
• Many research areas
2
NCSA’s 1st Dell Cluster
• Tungsten: 1750 server cluster
– 3.2 GHz Xeon
• 2,560 processors (compute only)
• 16.4 TF; 3.8 TB RAM; 122 TB disk
• Dell OpenManage
– Myrinet
• Full bi-section
– Lustre over Gig-E
• 13 DataDirect 8500
• 104 OSTs, 2 MDS w/separate disk
• 11.1 GB/sec sustained
– Power/Cooling
• 593 KW / 193 tons
– Production date: April 2004
– User Environment
• Platform Computing LSF
• Softenv
• Intel Compilers
• ChaMPIon Pro, MPICH, VMI-2
(“The first large-scale Dell cluster!!!”)
3
NCSA’s 3rd Dell Cluster
• T2 – retired into:
• Tungsten-3: 1955 blade cluster
– 2.6 GHz Woodcrest Dual-Core
• 1,040 processors/2,080 cores
• 22 TF; 4.1 TB RAM; 20 TB disk
• Warewulf
– Cisco InfiniBand
• 3 to 1 over-subscribed
• OFED-1.1 w/ HPSM subnet manager
– Lustre over IB
• 4 FAStT controllers direct FC
• 1.2 GB/s sustained
• 8 OSTs and 2 MDS w/complete auto failover
– Power/Cooling
• 148 KW / 42 tons
– Production date: March 2007
– User Environment
• Torque/Moab
• Softenv
• Intel Compilers
• VMI-2
4
NCSA’s 4th Dell Cluster
• Abe: 1955 blade cluster
– 2.33 GHz Clovertown Quad-Core
• 1,200 blades/9,600 cores
• 89.5 TF; 9.6 TB RAM; 120 TB disk
• Perceus management; diskless boot
– Cisco InfiniBand
• 2 to 1 oversubscribed
• OFED-1.1 w/ HPSM subnet manager
– Lustre over IB
• 22 OSTs (anticipated)
• 2 DDN 9500 controllers direct FC
• 10 FAStT controllers on SAN fabric
• 8.4 GB/s sustained
• 22 OSTs and 2 MDS w/complete auto failover
– Power/Cooling
• 500 KW / 140 tons
– Production date: May 2007
– User Environment
• Torque/Moab
• Softenv
• Intel Compilers
• MPI: evaluating Intel MPI, MPICH, MVAPICH, VMI-2, etc.
(“The largest Dell cluster!!!”)
5
NCSA Facility - ACB
• Advanced Computation Building
– Three rooms, totals:
• 16,400 sqft raised floor
• 4.5 MW power capacity
• 250 kW UPS
• 1,500 tons cooling capacity
– Room 200:
• 7,000 sqft – no columns
• 70” raised floor
• 2.3 MW power capacity
• 750 tons cooling capacity
6
NCSA’s Other Systems
• Distributed Memory Clusters
– Mercury (IBM, 1.3/1.5 GHz Itanium2):
• 1,846 processors
• 10 TF; 4.6 TB RAM; 90 TB disk
7
NCSA Storage Systems
• Archival: SGI/Unitree (5 PB total capacity)
– 72TB disk cache; 50 tape drives
– currently 2.8PB of data in MSS
• >1PB ingested in last 6 months
• projected ~3.2PB by end of CY2006
• licensed to support 5PB resident data
– ~30 data collections hosted
• Databases:
– 8 processor 12GB memory SGI Altix
• 30TB of SAN storage
• Oracle 10G, mysql, Postgres
– Oracle RAC cluster
– Single-system Oracle deployments for focused projects
LCI Conference 2007 National Center for Supercomputing Applications
8
Visualization Resources
• 30M-pixel Tiled Display Wall
– 8192 x 3840 pixels composite
display
– 40 NEC VT540 projectors, arranged
in a 5H x 8W matrix
– driven by 40-node Linux cluster
• dual-processor 2.4GHz Intel Xeons
with NVIDIA FX 5800 Ultra graphics
accelerator cards
• Myrinet interconnect
• to be upgraded by early CY2007
– funded by State of Illinois
• SGI Prisms
– 8 x 8 processor (1.6 GHz Itanium2)
– 4 graphics pipes each; 1 GB RAM each
– InfiniBand connection to Altix machines
9
SAN at NCSA
• 1.3PB spinning disk
– 895TB SAN attached
• 1392 Brocade switch ports
• 7 SAN fabrics
• 2 data centers
10
Persistent Binding
• Device naming problems
• Udev solution
• Examples
• Interactive Demo
11
Device Naming Problem
Before After
• Add hardware
• SAN zoning
• New SAN luns
• Modify config
Devices assigned random names (based on next available major/minor pair for device type)
CLUSTER
- Multiple hosts that see the same disk will assign the disk to different device nodes
- may be /dev/sda on system1 but /dev/sdc on system2
- Can change with hardware changes; what used to be /dev/sda is now /dev/sdc
12
What needs to happen
• Storage target always maps to same
local device (i.e. /dev/…)
• Local device name should be meaningful
– /dev/sda conveys no information about the
storage device
13
udev - Persistent Device Naming
• “Udev is … a userspace solution for a
dynamic /dev directory, with persistent
device naming” *
– Userspace: not required to remain in memory
– Dynamic: /dev not filled with unused files
– Persistent: devices always accessible using the
same device node
• Provides for custom device names
* Daniel Drake (http://www.reactivated.net/writing_udev_rules.html)
14
Setting up udev device mapper
Overview
15
1. Uniquely identify each lun
/sbin/scsi_id
Flow: device name → scsi_id (issues SCSI INQUIRY) → unique id
Sample usage:
root# scsi_id -g -u -s /block/sda
SSEAGATE_ST318406LC_____3FE27FZP000073302G5W
/sbin/scsi_id
- INPUT: existing local device name
- OUTPUT: string that uniquely identifies the specific device (guaranteed unique among all scsi devices)
SAMPLE:
- sda: locally installed drive
- sdb: SAN attached disk
16
2. Associate a meaningful name
New udev rules file: /etc/udev/rules.d/20-local.rules
BUS="scsi", SYSFS{vendor}="DDN", SYSFS{model}="S2A 8000",
PROGRAM="/sbin/scsi_id -g -u -s /block/%k",
RESULT="360001ff020021101092fadc32a450100", NAME="disk/fc/sdd4c1l0"
• BUS=scsi
– /sys/bus/scsi
• SYSFS
– <BUS>/devices/H:B:T:L/<filename>
• PROGRAM & RESULT
– Program to invoke and result to look for
• NAME
– Device name to create (relative to /dev)
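Rule lines of the form above can be generated from scsi_id output. A minimal sketch — the helper name mk_udev_rule and the shown invocation are illustrative, not part of the original setup; only the field layout matches the rule file above:

```shell
#!/bin/sh
# Illustrative helper (not from the deck): format a udev rule line in the
# style shown above, given vendor, model, scsi_id result, and the desired
# persistent name. All argument values below are the slide's sample values.
mk_udev_rule() {
    vendor=$1; model=$2; wwid=$3; name=$4
    printf 'BUS="scsi", SYSFS{vendor}="%s", SYSFS{model}="%s", PROGRAM="/sbin/scsi_id -g -u -s /block/%%k", RESULT="%s", NAME="%s"\n' \
        "$vendor" "$model" "$wwid" "$name"
}

# Example with the values from the slide:
mk_udev_rule DDN "S2A 8000" 360001ff020021101092fadc32a450100 disk/fc/sdd4c1l0
```

In practice one would loop over /sys/block, run /sbin/scsi_id on each entry, and feed the results through a helper like this into the 20-local.rules file.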
17
Example: Customizing for multiple paths
Problem
Multiple paths to a single lun result in multiple device nodes.
Need to know which path each device uses.
18
Example: Customizing for multiple paths
19
Demo: udev persistent device naming
• Single HBA
• Single disk unit
– 4 luns
– Each lun presented
through both controllers
• Host sees 8 logical
luns
• Use mpio_scsi_id
to identify the ctlr-lun
20
Demo: udev persistent device naming
Original Configuration Custom device names
• udev config file • Custom rules file
– /etc/udev/udev.conf – 20-local.rules
• scsi_id config file • Restart udev
– /etc/scsi_id.config – udevstart
• Scan fc luns • Custom device
– {sysfs}/hostX/scan
names created
– /dev/disk/by-id
– /dev/disk/fc
BEGIN
- tail -f /var/log/messages
1. Enable udev logging
2. Enable scsi_id for all devices (options -g)
3. /proc/partitions
4. Scan fc luns (echo “- - -” > /sys/class/scsi_host/hostX/scan)
5. See udev log lines in messages file ; See fc disks in /dev/disk/by-id
6. Enable 20-local rules file
7. Udevstart
8. See udev log lines in messages file ; See fc disks in /dev/disk/fc
DEFAULT CONFIGURATION
Local rules file already exists. Disable it.
Default behavior for scsi_id is to blacklist everything unknown (-b option). Enable whitelisting of everything (-g option) so scsi_ids will be returned.
Even before custom rules are in place, see default udev rule selection activity in /var/log/messages
CUSTOM CONFIGURATION
Udev custom rules are selected (see /var/log/messages)
Examples
• udevinfo -a -p $(udevinfo -q path -n /dev/sdb)
• udevtest /block/sdb
22
Custom script: ls_fc_luns
Get HBA list from sysfs: /sys/class/fc_host
23
Custom script: lip_fc_hosts
24
Custom script: scan_fc_luns
25
Custom script: delete_fc_luns
26
udev - Additional Resources
• man udev
• http://www.emulex.com/white/hba/wp_linux26udev.pdf
– Excellent white paper
• http://www.reactivated.net/udevrules.php
– How to write udev rules
• http://www.us.kernel.org/pub/linux/utils/kernel/hotplug/udev.html
– Information and links
• http://dims.ncsa.uiuc.edu/set/san
– FC tools : custom tools used in demo
27
Linux Multipath I/O
• Overview
• History
• Setup
• Demos
– Active / Passive Controller Pair
– Active / Active Controller Pair
28
Linux Multipath - History
Providers
• Storage Vendor
• HBA Vendor
• Filesystem
• OS
STORAGE VENDOR
- End to end solution (they provide disk, HBA, driver, add’l software, sometimes even FC switch)
- HBA’s (and other parts) come at a markup
- One location for support tickets, but no alternate recourse if they can’t fix the problem
- Proprietary requirements (typically require 2 HBA’s, only works with their systems)
HBA VENDOR
- QLA
> Linux support spotty
+ 2.4 kernel ok, but strict requirements (2 HBA’s, exactly 2 paths per lun, active/active controllers)
+ 2.6 kernel inconsistent behavior
> Solaris support spotty (2 months to get 1 machine working, next month stops working, machine was
untouched)
> Dropped Windows support prematurely (Windows MPIO layer not complete yet, only an API for
vendors)
> Proprietary solution, only works with their HBA’s and configuration software
- Emulex (unix philosophy, do one thing and do it well; MPIO doesn’t belong in the driver)
FILESYSTEM
- 3rd party - Veritas, others?
- Parallel Filesystems - Ibrix, Lustre, GPFS, CXFS (enable MPIO via failover hosts)
OS
- *NEW* Solaris 10 (XPATH, but requires Solaris branded QLA cards)
- *NEW* Linux (device mapper multipath) (RedHat4, Suse, others…)
29
Device Mapper Multipath
• Identify luns by scsi_id
• Create “path groups”
– Round-robin I/O on all paths
in groups
• Monitor paths for failure
– When no paths left in current
group, use next group
• Monitor failed paths for
recovery
– Upon path recovery, re-
check group priorities
– Assign new active group if
necessary
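The group-selection behavior in the bullets above can be modeled in a few lines of shell. This is a toy model, not dm-multipath code; the function name pick_active_group is invented for illustration:

```shell
#!/bin/sh
# Toy model of dm-multipath's active-group choice (NOT the real code).
# stdin: one "priority live_path_count" pair per path group.
# Output: priority of the group that would carry I/O — the
# highest-priority group that still has at least one live path.
pick_active_group() {
    # sort groups by priority, descending; take the first with live paths
    sort -rn | awk '$2 > 0 { print $1; exit }'
}

# Example: the prio-50 group has lost all its paths, so I/O
# moves to the prio-2 group.
printf '50 0\n2 4\n' | pick_active_group
```

When a failed path in the prio-50 group recovers, re-running the selection with '50 1' restored would hand the active role back — which is exactly the failback re-check described above.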
30
Linux Device Mapper Multipath
Overview
31
1. Identify unique luns
Storage Device
• vendor
• product
• getuid_callout
device {
vendor "DDN"
product "S2A 8000"
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
}
32
1. Identify unique luns
Multipath Device
• wwid
• alias
multipath {
wwid 360001ff020021101092fb1152a450900
alias sdd4l0
}
33
2. Monitor Healthy Paths for Failure
• Priority group • path_grouping_policy
– Collection of paths to – multibus
the same physical lun – failover
– I/O is split across all – group_by_prio
paths in round-robin – group_by_serial
fashion
– group_by_node
34
2. Monitor Healthy Paths for Failure
Path Grouping Policy = group_by_prio
prio_callout
35
2. Monitor Healthy Paths for Failure
• path_checker • no_path_retry
– tur – queue
– readsector0 – (N > 0)
– directio – fail
– (Custom)
• emc_clariion
• hp_sw
TUR
- SCSI Test Unit Ready
- Preferred if lun supports it (OK on DDN, IBM fastt)
- Does not cause AVT on IBM fastt
- Does not fill up /var/log/messages on failures
READSECTOR0
- reads sector 0 through the block device node (e.g. /dev/sdX)
DIRECTIO
- reads the first sector with direct I/O (O_DIRECT) on the block device
Both readsector0 and directio cause AVT on IBM fastt, resulting in lun thrashing
Both readsector0 and directio log “fail” messages in /var/log/messages (could be useful if you want to
monitor logs for these events)
NO_PATH_RETRY
- # of retries before failing path
- queue: queue I/O forever
- (N > 0): queue I/O for N retries, then fail
- fail: fail immediately
36
3. Monitor failed paths for recovery
• Failback
– Immediate (same as n=0)
– (n > 0)
– manual
FAILBACK
- When a path recovers, wait # seconds before enabling the path
- Recovered path is added back into multipath enabled path list
- multipath re-evaluates priority groups, changes active priority group if needed
MANUAL RECOVERY
- User runs ‘/sbin/multipath’ to update enabled paths and priority groups
37
Putting it all together
multipaths {
multipath {
wwid 3600a0b8000122c6d00000000453174fc
alias fastt21l0
}
multipath {
wwid 3600a0b80000fd6320000000045317563
alias fastt21l1
}
}
devices {
device {
vendor "IBM"
product "1742-900"
getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
path_grouping_policy group_by_prio
prio_callout "/usr/local/sbin/path_prio.sh %n"
path_checker tur
no_path_retry fail
failback immediate
}
}
38
Putting it all together
Flow: multipath calls path_prio.sh for a device (e.g. sdb); path_prio.sh finds the matching line in the primary-paths file and returns its priority (e.g. 50).
/usr/local/etc/primary-paths
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:0 sdb 3600a0b8000122c6d00000000453174fc 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:1 sdc 3600a0b80000fd6320000000045317563 2
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:2 sdd 3600a0b8000122c6d0000000345317524 50
0x10000000c95ebeb4 0x200200a0b8122c6e 2:0:0:3 sde 3600a0b80000fd6320000000245317593 2
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:0 sdi 3600a0b8000122c6d00000000453174fc 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:1 sdj 3600a0b80000fd6320000000045317563 51
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:2 sdk 3600a0b8000122c6d0000000345317524 5
0x10000000c95ebeb4 0x200300a0b8122c6e 2:0:1:3 sdl 3600a0b80000fd6320000000245317593 51
PATH_PRIO.SH
- grep device from primary-paths file
- return value from last column
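The two speaker-note steps (find the device's line, return the last column) suggest a path_prio.sh along these lines — a sketch reconstructed from the notes, since the deck does not include the actual script:

```shell
#!/bin/sh
# Sketch of path_prio.sh reconstructed from the notes above (the real
# script was not shown). Argument: a device name such as sdb.
# Looks the device up in the primary-paths file and prints the priority
# found in the last column of its line.
PRIMARY_PATHS=${PRIMARY_PATHS:-/usr/local/etc/primary-paths}

path_prio() {
    dev=$1
    # field 4 of each primary-paths line is the sd name; the final
    # field is the priority for that path
    awk -v d="$dev" '$4 == d { print $NF; exit }' "$PRIMARY_PATHS"
}
```

Against the sample file above, `path_prio sdb` would print 50 (a preferred path) and `path_prio sdc` would print 2, which is what prio_callout feeds back to group_by_prio.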
39
Demo: Active/Passive Disk
• Host
– One Emulex LP11000
• Disk
– IBM DS4500
– Luns presented through
both controllers
– Luns accessible via 1
controller only at a time
– AVT enabled
AVT
- Lun will migrate to alternate controller if requested there
- Tolerance of cable/switch failure
- AVT penalty: lun inaccessible for 5-10 secs while controller ownership changes
SCREENS: /var/log/messages , multi-port-mon , command , script host
1. No luns (ls_fc_luns)
2. /etc/multipath.conf
1. Multipaths (fastt)
2. Devices (fastt)
3. /usr/local/sbin/path_prio.sh
1. Identify controller A, controller B
4. /usr/local/etc/primary-paths
5. Add luns (scan_fc_luns)
1. See multipath bindings & path_prio.sh output in /var/log/messages
6. View current multipath configuration
1. Multipath -v2 -l
7. Failover test
1. Script-host: disable disk port A
2. See multipathd reconfig in /var/log/messages
3. See I/O path change in multi-port-mon
8. Recover test
1. Script-host: enable disk port A
40
Demo: Active/Active Disk
• Host
– One Emulex LP11000
• Disk
– DDN 8500
– Luns accessible via
both controllers (no
penalty)
41
Path Grouping Policy Matrix
                        1 HBA               2 HBAs
Active/Active           multibus (demo1)    multibus
Active/Passive w/ AVT   path_prio (demo2)   path_prio
ACTIVE/ACTIVE 2 HBAs
- trivial, same as demo1
- Each HBA sees 1 ctlr
- Can let both HBAs see both ctlrs (4 paths to each lun)
+ Use path_prio if need to control path usage
ACTIVE/PASSIVE (AVT) 2 HBAs
- trivial, similar to demo2
ACTIVE/PASSIVE (no AVT) 1 HBA
- Tolerant of ctlr failure only.
- If anything else fails, luns will not AVT to alternate ctlr, host will lose access
ACTIVE/PASSIVE (no AVT) 2 HBAs
- Non-preferred paths will be failed
- Each HBA must have full access to both controllers
42
Linux Multipath Errata
• Making changes to multipath.conf
– Stop multipathd service
– Clear multipath bindings
• /sbin/multipath -F
– Create new multipath bindings
• /sbin/multipath -v2 -l
– Start multipathd service
• Cannot multipath root or boot device
• user_friendly_names
– Not really, just random names dm-1, dm-2 …
43
Linux Multipath Resources
• multipath.conf.annotated
• man multipath
• http://christophe.varoqui.free.fr/wiki/wakka.php?wiki=Home
– Multipath tools official home
• http://www.redaht.com/docs/manuals/csgfs/browse/rh-cs-en/ap-rhcs-dm-multipath-usagetxt.html
– Description of output (multipath -v2 -l)
• http://kbase.redhat.com/faq/FAQ_85_7170.shtm
– Setup device-mapper multipathing in Red Hat Enterprise Linux 4?
• http://dims.ncsa.uiuc.edu/set/san
– Multi-port-mon
– Set switchport state : (en/dis)able switch port via SNMP
MULTIPATH.CONF.ANNOTATED (RedHat)
- /usr/share/doc/device-mapper-multipath-0.4.5/multipath.conf.annotated
44