‡ I/O storage modeling and performance
± David Kaeli

‡ Soft error modeling and mitigation
± Mehdi B. Tahoori

‡ Motivation to study file-based I/O
‡ Profile-driven partitioning for parallel file
I/O
‡ I/O Qualification Laboratory @ NU
‡ Areas for future work

‡ Many subsurface sensing and imaging workloads
involve file-based I/O
± Cellular biology ± in-vitro fertilization with NU biologists
± Medical imaging ± cancer therapy with MGH
± Underwater mapping ± multi-sensor fusion with Woods Hole
Oceanographic Institution
± Ground-penetrating radar ± toxic waste tracking with Idaho
National Labs

[Figure: ground-penetrating radar scenario, with the labels Air, Mine, and Soil]

‡ Reduced the runtime of a single-body Steepest Descent Fast Multipole
Method (SDFMM) application by 74% on a 32-node Beowulf cluster
‡ Hot-path parallelization
‡ Data restructuring

‡ Reduced the runtime of a Monte Carlo scattered light simulation by 98%
on a 16-node Silicon Graphics Origin 2000
‡ Matlab-to-C compilation
‡ Hot-path parallelization

‡ Obtained superlinear speedup of the Ellipsoid Algorithm run on a
16-node IBM SP2
‡ Matlab-to-C compilation
‡ Hot-path parallelization

‡ For compute-bound workloads, Beowulf clusters can
be used effectively to overcome computational
barriers
‡ Middleware (e.g., MPI and MPI-IO) can significantly
reduce the programming effort on parallel systems
‡ Multiple clusters can be combined using Grid
middleware (the Globus Toolkit)
‡ For file-based I/O-bound workloads, Beowulf clusters
and Grid systems are presently ill-suited to exploiting
the available I/O parallelism

‡ Motivation to study file-based I/O
‡ Profile-driven partitioning for parallel file
I/O
‡ I/O Qualification Laboratory @ NU
‡ Areas for future work

‡ The I/O bottleneck
± The growing gap between the speed of processors,
networks, and underlying I/O devices
± Many imaging and scientific applications access
disks very frequently
‡ I/O-intensive applications
± Out-of-core applications: work on large datasets
that cannot fit in main memory
± File-intensive applications: access file-based datasets
frequently, with a large number of file operations

‡ Storage architectures
± Direct Attached Storage (DAS): the storage device is
directly attached to the computer
± Network Attached Storage (NAS): the storage subsystem is
attached to a network of servers, and file requests are passed
through a parallel filesystem to the centralized storage device
± Storage Area Network (SAN): a dedicated network providing an
any-to-any connection between processors and disks


[Figure: an I/O-intensive application (e.g., MPI-IO) running as multiple
processes over multiple disks (e.g., RAID), contrasting data striping
with data partitioning]

‡ I/O is parallelized at both the application level
(using MPI and MPI-IO) and the disk level
(using file partitioning); a minimal MPI-IO sketch follows below
‡ Ideally, each process will access only files on its
local disk (though this is typically not possible
due to data sharing)
‡ How do we recognize the access patterns?
‡ Use a profile-guided approach
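
Below is a minimal sketch (not taken from the talk) of application-level
parallel I/O with MPI-IO: each process writes its own contiguous block of
a shared file at a rank-dependent offset. The file name and block size are
illustrative placeholders.

/* Each MPI process writes one contiguous block of a shared file. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_DOUBLES 1024            /* assumed per-process chunk size */

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    MPI_Offset offset;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BLOCK_DOUBLES * sizeof(double));
    for (int i = 0; i < BLOCK_DOUBLES; i++)
        buf[i] = (double)rank;        /* dummy payload */

    /* All processes open the same file; each writes at its own offset,
     * so the I/O is parallel at the application level. */
    MPI_File_open(MPI_COMM_WORLD, "shared.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
    MPI_File_write_at(fh, offset, buf, BLOCK_DOUBLES, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    free(buf);
    MPI_Finalize();
    return 0;
}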



‡ For every process, for every contiguous file access,
we capture a record of I/O profile information describing
that access (an illustrative record layout follows this list)
‡ Generate a partition for every process
‡ Optimal partitioning is NP-complete, so we develop a
greedy algorithm
‡ We have found we can use partial profiles to guide
partitioning
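
One possible layout for such a per-access profile record is sketched below.
The slide does not spell out the exact fields, so these are assumptions,
chosen as the minimum the greedy partitioner on the next slide needs.

/* Illustrative per-access profile record (field names are assumptions). */
typedef struct {
    int       process_id;   /* MPI rank that issued the access         */
    int       file_id;      /* which file was touched                  */
    long long offset;       /* starting byte of the contiguous chunk   */
    long long length;       /* chunk size in bytes                     */
    char      is_write;     /* 1 = write, 0 = read                     */
    double    timestamp;    /* when the access was issued              */
} io_profile_record;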

for each I/O process, create a partition;
for each contiguous data chunk {
    total up the number of read/write accesses on a process-ID basis;
    if the chunk is accessed by only one process
        assign the chunk to the associated partition;
    else if the chunk is read (but never written) by multiple processes
        duplicate the chunk in all partitions where it is read;
    else if the chunk is written by one process, but later read by multiple processes
        assign the chunk to all partitions where it is read and
        broadcast the updates on writes;
    else
        assign the chunk to a shared partition;
}
for each partition
    sort chunks based on the earliest timestamp for each chunk;
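
The per-chunk classification step can be made concrete as below. This is an
illustrative sketch, not the authors' implementation; the chunk_stats fields
are assumed to have been accumulated from the profile records above.

/* Placement decision for one chunk, mirroring the four cases above. */
enum placement { PRIVATE, REPLICATED, WRITE_THEN_READ, SHARED };

typedef struct {
    int num_accessors;   /* distinct processes touching the chunk  */
    int num_writers;     /* distinct processes that ever write it  */
} chunk_stats;

enum placement classify_chunk(const chunk_stats *c)
{
    if (c->num_accessors == 1)
        return PRIVATE;           /* goes to the sole accessor's partition */
    if (c->num_writers == 0)
        return REPLICATED;        /* read-only: copy into every reader     */
    if (c->num_writers == 1)
        return WRITE_THEN_READ;   /* copy to readers, broadcast the writes */
    return SHARED;                /* fall back to a shared partition       */
}
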
‡ NAS Parallel Benchmarks (NPB 2.4)/BT
± Computational fluid dynamics
± Generates a file (~1.6 GB) dynamically and then reads it back
± Writes/reads sequentially in chunk sizes of 2040 Bytes
‡ SPEChpc96/seismic
± Seismic processing
± Generates a file (~1.5 GB) dynamically and then reads it back
± Writes sequential chunks of 96 KB and reads sequential chunks of 2 KB
‡ Tile-IO
± Parallel Benchmarking Consortium
± Tile access to a two-dimensional matrix (~1 GB) with overlap
± Writes/reads sequential chunks of 32 KB, with 2 KB of overlap
‡ Perf
± Parallel I/O test program within MPICH
± Writes a 1 MB chunk at a location determined by rank, no overlap
‡ Mandelbrot
± An image processing application that includes visualization
± Chunk size is dependent on the number of processes
[Figure: the experimental Beowulf cluster of Pentium II 350 MHz nodes with
local PCI-IDE disks, connected through a 10/100 Mb Ethernet switch to a
RAID node]

‡ DAS configuration
± Linux box, Western Digital WD800BB (IDE), 80 GB,
7200 RPM
‡ Beowulf cluster (base configuration)
± Fast Ethernet, 100 Mbit/s
± Network-attached RAID: Morstor TF200 with six 9 GB Seagate
SCSI disks, 7200 RPM, RAID-5
± Locally attached IDE disks: IBM UltraATA-350840, 5400 RPM
‡ Fibre Channel disks
± Seagate Cheetah X15 ST-336752FC, 15000 RPM

‡ We have found that I/O access patterns are
independent of file-based data values
‡ When we increase the problem size or
reduce the number of processes, either:
± the number of I/Os increases, but the access patterns
and chunk size remain the same (SPEChpc96,
Mandelbrot), or
± the number of I/Os and the access patterns remain
the same, but the chunk size increases (NPB BT,
Tile-IO, Perf)
‡ Re-profiling can be avoided

‡ Growing need to process large, complex
datasets in high performance parallel
computing applications
‡ Efficient implementation of storage
architectures can significantly improve system
performance
‡ An accurate simulation environment for users
to test and evaluate different storage
architectures and applications

‡ Target applications: parallel scientific programs
(MPI)
‡ Target machine/Host machine: Beowulf clusters
‡ Use DiskSim as the underlying disk drive
simulator
‡ Direct execution to model CPU and network
communication
‡ We execute the real parallel I/O accesses and, at the
same time, calculate the simulated I/O response times
(an interception sketch follows below)
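
A hedged sketch of this direct-execution idea is shown below: the real
MPI-IO call is forwarded unchanged through the standard PMPI profiling
interface (prototypes as in MPI-3), while the request parameters are handed
to a disk-simulator stub so a simulated response time can be accumulated
alongside the real run. disksim_request() is a hypothetical wrapper, not
DiskSim's actual API.

#include <mpi.h>

/* Hypothetical hook into the disk simulator; returns simulated latency. */
extern double disksim_request(long long offset, int nbytes, int is_write);

static double simulated_io_time = 0.0;

/* Intercept MPI_File_write_at: account for it in the simulator, then
 * perform the real access so the application still makes progress. */
int MPI_File_write_at(MPI_File fh, MPI_Offset offset, const void *buf,
                      int count, MPI_Datatype type, MPI_Status *status)
{
    int type_size;
    MPI_Type_size(type, &type_size);

    simulated_io_time += disksim_request((long long)offset,
                                         count * type_size, 1);

    return PMPI_File_write_at(fh, offset, buf, count, type, status);
}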

‡ A variant of SAN in which disks are distributed across the network and
each server is directly connected to a single device
‡ File partitioning
‡ Utilize I/O profiling and data-partitioning heuristics to distribute
portions of files to disks close to the processing nodes

‡ Many imaging applications are dominated by file-based
I/O
‡ Parallel systems can only be effectively utilized if I/O is
also parallelized
‡ Developed a profile-guided approach to I/O data
partitioning
‡ Impacting clinical trials at MGH
‡ Reduced overall execution time by 27-82% over MPI-IO
‡ Execution-driven I/O model is highly accurate and
provides significant modeling flexibility

‡ Motivation to study file-based I/O
‡ Profile-driven partitioning for parallel file
I/O
‡ I/O Qualification Laboratory @ NU
‡ Areas for future work

‡ Working with Enterprise Strategy Group
‡ Develop a state-of-the-art facility to provide
performance qualification of Enterprise Storage
(ES) systems
‡ Provide a quarterly report to ES customer
base on the status of current ES offerings
‡ Work with leading ES vendors to provide
them with custom early performance
evaluation of their beta products

‡ Contacted by IOIntegrity and SANGATE
for product qualification
‡ Developed potential partners that are
leaders in the ES field
‡ Initial proposals already reviewed by
IBM, Hitachi and other ES vendors
‡ Looking for initial endorsement from
industry

‡ Why @ NU
± Track record with industry (EMC, IBM,
Sun)
± Experience with benchmarking and IO
characterization
± Interesting set of applications (medical,
environmental, etc.)
± Great opportunity to work within the
cooperative education model

‡ Motivation to study file-based I/O
‡ Profile-driven partitioning for parallel file
I/O
‡ I/O Qualification Laboratory @ NU
‡ Areas for future work

‡ Designing a Peer-to-Peer storage system on a Grid system
by partitioning datasets across geographically distributed
storage devices
[Figure: a Grid testbed of two clusters connected over the Internet, with
head nodes and RAID storage: one cluster of 31 sub-nodes on a 100 Mbit/s
network and one of 8 sub-nodes on a 1 Gbit/s network]

‡ Reduce simulation time by identifying
characteristic "phases" in I/O workloads
‡ Apply machine learning algorithms to identify
clusters of representative I/O behavior
‡ Utilize K-Means and multinomial clustering to
obtain high fidelity in simulation runs that use
sampled I/O behavior (a toy k-means sketch follows below)
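
As a concrete illustration of the clustering step, here is a minimal 1-D
k-means sketch, assuming each profiling interval has already been reduced
to a single feature (for example, bytes of I/O issued in that interval).
This is a toy version; the real work would cluster richer feature vectors
and is not limited to k-means.

#include <math.h>

/* Cluster n scalar observations x[] into k groups.  centroid[] and
 * label[] are output arrays of length k and n respectively. */
void kmeans_1d(const double *x, int n, int k, double *centroid, int *label)
{
    /* Crude initialisation: pick k samples spread across the input. */
    for (int c = 0; c < k; c++)
        centroid[c] = x[(long)c * (n - 1) / (k > 1 ? k - 1 : 1)];

    for (int iter = 0; iter < 100; iter++) {
        /* Assignment step: attach each interval to its nearest centroid. */
        for (int i = 0; i < n; i++) {
            int best = 0;
            for (int c = 1; c < k; c++)
                if (fabs(x[i] - centroid[c]) < fabs(x[i] - centroid[best]))
                    best = c;
            label[i] = best;
        }
        /* Update step: move each centroid to the mean of its members. */
        for (int c = 0; c < k; c++) {
            double sum = 0.0;
            int count = 0;
            for (int i = 0; i < n; i++)
                if (label[i] == c) { sum += x[i]; count++; }
            if (count > 0)
                centroid[c] = sum / count;
        }
    }
}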
