

Jahyun J. Koo‡§, Alan C. Evans§ and Warren J. Gross‡

‡Department of Electrical and Computer Engineering, McGill University, Montreal, Quebec, Canada
§McConnell Brain Imaging Center, Montreal Neurological Institute, Montreal, Quebec, Canada
email : {jkoo10,wjgross},

Many automatic algorithms have been proposed for analyzing magnetic resonance imaging (MRI) data sets. These
algorithms allow clinical researchers to generate quantitative data analyses with consistently accurate results. With
the increasingly large data sets being used in brain mapping,
there has been a significant rise in the need for methods to
accelerate these algorithms, as their computation time can
consume many hours. This paper presents the results from a
recent study on implementing such quantitative analysis algorithms on High-Performance Reconfigurable Computers
(HPRCs). A brain tissue classification algorithm for MRI,
the Partial Volume Estimation (PVE), is implemented on
an SGI RASC RC100 system using the Mitrion-C High-Level Language (HLL). The CPU-based PVE algorithm is
profiled and computationally intensive floating-point functions are implemented on FPGA-accelerators. The images
resulting from the FPGA-based algorithm are compared to
those generated by the CPU-based algorithm for verification. The Similarity Indexes (SI) for pure tissues are calculated to measure the accuracy of the images resulting from
the FPGA-based implementation. The portion of the PVE
algorithm that was implemented on hardware achieved an 11×
performance improvement over the CPU-based implementation. The overall performance improvement of the FPGA-accelerated PVE algorithm was 3.5× with four FPGAs.

1. INTRODUCTION
There has been a significant increase in the need for quantitative analysis of 3D magnetic resonance imaging (MRI)
data [1]. Such analysis provides an enormous amount of
information about the anatomy of the human brain to researchers. Traditionally, MRI data sets for the human brain
were manually analyzed by researchers. However, over the
past decade, automatic analysis methods have been proposed
and developed that reduce computation time and improve
data consistency. Even though the algorithms are automatic,
some can take many hours to process a single subject due
to the complexity of the algorithms. As a result, accelerating
the algorithms is important given the increasing use of
very large numbers of data sets in brain mapping.
MRI data analysis algorithms can be accelerated using
High-Performance Reconfigurable Computers (HPRCs), an architecture that combines high-performance CPUs with
reprogrammable accelerators such as Field-Programmable
Gate Arrays (FPGAs). Using HPRC, the computationally
intensive portion of the algorithm can be implemented on
FPGAs to exploit fine-grained parallelism, while other portions are implemented on the conventional CPU. Implementing the MRI data analysis algorithm on HPRC would be a
challenging task for researchers with only a software background, as basic low-level hardware description languages
(HDLs) and hardware knowledge are required to program
FPGAs. A number of commercial high-level languages have
been proposed to help such individuals program FPGAs by
abstracting away the hardware-dependent details.
In this study, the 3D MRI tissue classification algorithm
Partial Volume Estimation (PVE) is implemented on an SGI
Altix 350 with two RASC RC100 systems using the Mitrion-C HLL. An overview of the HPRC system and the HLL used in
this study is provided in Section 2. A general overview of
the PVE algorithm is described in Section 3. In Section 4,
the design strategies used to implement the PVE algorithm
on the FPGA-accelerator are proposed. The performance
comparison and differences of output images between the
FPGA-based and CPU-based implementations are also presented in Section 4. The differences between the two implementations are discussed and future work is described in
Section 5. Conclusions are provided in Section 6.
2.1. RASC RC100 System
In this study, the SGI RASC RC100 FPGA system is used to
accelerate the selected algorithm. Two RASC RC100 systems are connected to the Altix 350 multiprocessor via a
NUMAlink interconnect network. The Altix 350 has eight
1.5GHz Intel Itanium 2 CPUs with 16GB of shared physical memory. Each RC100 system is composed of two Xilinx Virtex-4 LX200 FPGAs, two TIO interface ASICs and a bitstream loader FPGA [2]. Each computational FPGA has five 8MB QDR SRAM DIMM memory banks, although the core services currently provide access to 32MB per FPGA. These banks can be accessed by the FPGA at a rate of 128 bits per clock cycle. The architecture of a single RASC RC100 system is illustrated in Fig. 1.

Fig. 1. RASC RC100 hardware. (Diagram labels: Algorithm FPGA 0 and Algorithm FPGA 1, each with 8MB QDR SRAM banks 0/1, 2/3 and 4; two TIO interface ASICs on NUMAlink 4 at 3.2 GB/s in each direction; loader FPGA.)

The SGI RASC RC100 system provides two special features, Streaming and Wide-scaling, which help to efficiently execute algorithms on large data sets. The streaming feature allows the algorithm to reduce the overhead of data transfer by overlapping data loading and unloading while the algorithm is executing. The wide-scaling feature allows the algorithm to be automatically scaled over multiple FPGAs, by equally distributing large data sets to the available FPGAs and processing them simultaneously, and therefore provides an easy way to exploit coarse-grained parallelism.

2.2. Mitrion-C
Dataflow-oriented Mitrion-C is a parallel HLL with C-like syntax. Parallelism is expressed using explicit language constructs and can be exploited in the form of vectorization and/or pipelining. Vectorization offers explicit parallelism by replicating processing elements (PEs); however, this approach rapidly consumes FPGA resources. On the other hand, pipelining generates a single PE and uses it to process the input data sequentially. An algorithm can use a combination of both architectures to exploit available hardware resources and parallelism, based on the selection of data types and the parallel constructs provided by Mitrion-C [3].

Mitrion machine code, generated by the Mitrion compiler, can be used to simulate the implementation or to generate VHDL code from the Mitrion processor configurator with the information of the target FPGA architecture. During simulation, design errors and performance bottlenecks can be identified from a graphical representation of the hardware data flow graph created by the Mitrion debugger/simulator. The generated VHDL code represents the Mitrion Virtual Processor (MVP), which runs at a fixed frequency of 100 MHz. Therefore, the programmer does not need to consider hardware-based concerns such as timing and can focus on expressing the dataflow of the algorithm. The MVP is wrapped in the RASC RC100 core services, which provide memory and IO interfaces between the FPGA external SRAMs and shared memory on the Altix 350 system. Finally, this processor is synthesized and placed-and-routed to generate a bitstream.

3. PARTIAL VOLUME ESTIMATION (PVE)
Brain clinical studies are typically interested in white matter (WM), gray matter (GM) and cerebrospinal fluid (CSF). The process of classifying these tissue types in 3D MR images of the human brain is made difficult due to the partial volume effect. Voxels (3D pixels) that demonstrate this effect are composed of a mixture of several different tissue types and are called mixels [4]. This effect blurs the intensity distinction of tissues at the boundary areas as shown in Fig. 2. To address the partial volume effect, an algorithm known as Partial Volume Estimation (PVE) has been developed. This technique estimates the proportion of each tissue type present in a voxel more accurately.

Several different PVE algorithms have been proposed, distinguished based on how they model the partial volume effect. In this study, the PVE algorithm developed in [5] is implemented on the FPGA-accelerators. It uses a statistical model based on Markov Random Field (MRF) theory. This model, first proposed in [4], assumes that the image intensity value of each mixel can be described as a sum of weighted random variables (RVs). Let the acquired 3D MR image be denoted by $X = \{x_i : i = 1, \ldots, N\}$, where $N$ is the total number of voxels. Let $j \in \{1, \ldots, M\}$ represent the set of possible pure tissues present in the image and $l_j$ represent the RV describing the tissue type $j$. The content of a mixel, $x_i$, can therefore be expressed as:

$$x_i = \sum_{j=1}^{M} w_{ij} \, l_j \quad (1)$$

where the weighting terms $w_{ij}$, known as partial volume coefficients (PVCs), represent the fractional amount of tissue type $j$ present in voxel $i$ ($w_{ij} \in [0, 1]$ and $\sum_{j=1}^{M} w_{ij} = 1$).

If a partial volume context image is denoted as $W = \{w_{ij} : i = 1, \ldots, N \text{ and } j = 1, \ldots, M\}$, an estimate $W^*$ for the true partial volume context image can be determined from the given MR image $X$ using the maximum a posteriori (MAP) criterion [5]. The PVE algorithm is therefore divided into two different stages. The first stage estimates the partial volume context image $W^*$.
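The mixel model of Equation 1 can be illustrated with a short Python sketch. It is not part of the implementation described in this paper: the function name `mixel_intensity`, the Gaussian form chosen for each tissue RV, and the example means and standard deviations are all assumptions made for illustration only.

```python
import random


def mixel_intensity(pvcs, tissue_means, tissue_stds, rng):
    """Draw one intensity x_i = sum_j w_ij * l_j for a single voxel (Equation 1)."""
    if len(pvcs) != len(tissue_means) or len(pvcs) != len(tissue_stds):
        raise ValueError("need one PVC, mean and std per pure tissue")
    # The PVCs w_ij must lie in [0, 1] and sum to 1.
    if any(w < 0.0 or w > 1.0 for w in pvcs) or abs(sum(pvcs) - 1.0) > 1e-9:
        raise ValueError("PVCs must lie in [0, 1] and sum to 1")
    # Assumption: each tissue RV l_j is Gaussian with the given mean and std.
    draws = [rng.gauss(m, s) for m, s in zip(tissue_means, tissue_stds)]
    return sum(w * l for w, l in zip(pvcs, draws))


# Hypothetical two-tissue voxel: 30% of one tissue, 70% of the other.
rng = random.Random(0)
x = mixel_intensity([0.3, 0.7], [60.0, 110.0], [4.0, 5.0], rng)
print(f"simulated mixel intensity: {x:.1f}")
```

With zero standard deviations the result reduces to the PVC-weighted mean of the tissue intensities, which is a convenient sanity check of the weights.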

The second stage estimates the fractional amount (PVCs) of each pure tissue type present in each voxel.

3.1. Partial Volume (PV) classification
In the PV classification stage, each voxel is classified and labeled as one of four pure tissue types (BG, CSF, WM and GM) or three mixed tissue types (WMGM, GMCSF and CSFBG). If $C$ is the context image to be estimated and $X$ is the observed image, then using the MAP criterion, the estimated classified image can be expressed as:

$$C^* = \arg\max_{C} P(C \mid X) = \arg\max_{C} P(C) \prod_{i=1}^{N} p(x_i \mid c_i) \quad (2)$$

where $\prod_{i=1}^{N} p(x_i \mid c_i)$ and $P(C)$ are the probability densities and the prior information for all possible tissue classes, respectively. $i$ is the index of a voxel in the image $X$. $N$ is the total number of voxels in image $X$. The probability densities for mixed tissue classes can be expressed as Equation 3, where $\{j, k\}$ indicates a mixel that is composed of tissue types $j$ and $k$:

$$p(x_i \mid c_i = \{j, k\}) = \int_{0}^{1} g\big(x_i; \mu(w), \Sigma(w)\big) \, dw \quad (3)$$

while $\mu(w)$ and $\Sigma(w)$ are the mean and the covariance of the mixel, respectively.

Since adjacent voxels can be assumed to have similar tissue types in data sets generated by a 1 mm resolution MRI scanner, an MRF model with 26-neighbour voxels is used to represent the prior information due to its flexibility in defining the neighborhood system [4]-[6]. The prior information can be modeled using an MRF as:

$$P(C) \propto \exp\left(-\beta \sum_{i=1}^{N} \sum_{k \in N_i} \frac{a_{ik}}{d(i, k)}\right) \quad (4)$$

where $\beta$ is the MRF weighting parameter, $N_i$ is the set of 26-neighbour voxels around voxel $i$, $k$ is the index of a voxel from $N_i$ and $d(i, k)$ is the distance between the voxel $i$ and its neighbour voxel $k$. The value of $a_{ik}$ is determined based on the tissue relationship between voxels $i$ and $k$. Reference [6] proposes to use $a_{ik}$ terms with a curvature image to overcome the oversmoothing artifacts on the boundaries between GM and CSF tissues in the deep sulcal regions (Fig. 2). This is demonstrated in Equation 5.

$$a_{ik} = \begin{cases} -2 & c_i = c_k \\ -1 & c_i \text{ and } c_k \text{ share a component, and } c_i \notin \{\mathrm{GMCSF}, \mathrm{CSF}\} \text{ or } C(i) \ge 0 \\ f(C(i)) & c_i \in \{\mathrm{GMCSF}, \mathrm{CSF}\} \text{ and } C(i) < 0 \\ 1 & \text{otherwise} \end{cases} \quad (5)$$

where $C(i)$ is the value of the curvature at voxel $i$ and $f(x)$ is a function, provided by [6], to calculate an appropriate MRF weight term for the voxel in the sulcal region.

In order to estimate the context image $C^*$ from Equation 2, the iterative conditional modes (ICM) algorithm is used. During each iteration, the ICM technique re-estimates the tissue labels based on the labels calculated in the previous iteration until convergence is reached.

Fig. 2. Partial volume effect on Sulcal region. (Diagram labels: WM, GM, CSF, sulcal region, voxels with partial volume effect.)

3.2. PVC Estimation
The fraction of each pure tissue present within a particular voxel is calculated in the PVC estimation stage. If voxel $i$ is a mixel containing tissues $j$ and $k$, the fractional amount of each pure tissue within the voxel $i$ can be calculated by employing the maximum-likelihood principle as shown in [5]. A simple grid search algorithm is used to solve the maximum-likelihood PVC estimation [5].

3.3. Overview of the PVE Algorithm
An overview of the described PVE algorithm is provided in Fig. 3. Three input images are required to perform the PVE algorithm on a single 3D MR brain image. The T1 image is a normal T1-weighted MR image from an MR scanner. The classified image is used to calculate the parameters of the pure tissue classes and to reduce the computation complexity by eliminating the voxels in the background. The curvature image contains the information of CSF in the sulcal region. The PVE algorithm generates four output images: the PVE classified, PVE CSF, PVE WM and PVE GM images. Each voxel in the PVE classified image is labeled with the dominant tissue type among all possible tissue types. The PVE CSF, WM and GM images provide the proportion of CSF, WM and GM present in each voxel, respectively. The algorithm is terminated when the total number of changed voxels during the previous iteration is lower than the total number of changed voxels during the current iteration or the maximum number of iterations (20) is reached.
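The termination rule stated above can be sketched as a small Python driver loop. This is only an illustration under assumptions: `run_icm` and the `sweep` callback (one full relabelling pass that returns how many voxels changed) are hypothetical names, and the sketch does not reproduce the classification step itself.

```python
def run_icm(sweep, max_iterations=20):
    """Repeat ICM sweeps until the stated termination rule fires.

    `sweep()` is assumed to relabel every voxel once and return the
    number of voxels whose label changed during that pass.
    Returns the number of iterations that were executed.
    """
    previous_changed = None
    for iteration in range(1, max_iterations + 1):
        changed = sweep()
        # Stop when the previous iteration changed fewer voxels than this one.
        if previous_changed is not None and previous_changed < changed:
            return iteration
        previous_changed = changed
    # Otherwise stop at the iteration cap (20 in the text above).
    return max_iterations


# Hypothetical change counts per iteration, used in place of a real sweep.
counts = iter([5000, 1200, 300, 310, 290])
print(run_icm(lambda: next(counts)))
```

In this toy run the loop stops at the fourth sweep, the first one that changes more voxels than its predecessor.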

Fig. 3. PVE algorithm. (Flowchart labels: input images: T1 image, classified image, curvature; PV classification stage: estimate probability densities for pure and mixed tissue types, compute the prior information P(C), termination criteria (no/yes); PVC estimation stage: estimate the fractional amount of each tissue within each voxel; output images: PVE classified, PVE CSF, PVE WM, PVE GM.)

4. HARDWARE IMPLEMENTATION OF PVE
The CPU-based PVE algorithm was first profiled using GNU gprof to identify the computationally intensive portions and where data-parallelism could be exploited on the FPGA. The report generated by the tool is shown in Table 1 and indicates that estimation of the prior information for all tissue types in the PV classification stage described in Section 3.1 is the primary performance bottleneck.

Table 1. PVE profiling report.

Functions                                 CPU-based       FPGA-based
† Estimating the prior information        301s (81%)      28.4s (28%)
† Estimating the probability densities    49.8s (13%)     49.8s (50%)
PVC Estimation                            3.8s (1%)       3.8s (4%)
IO functions                              18.2s (5%)      18.2s (18%)
Total                                     372.8s          100.2s

† Two subsections within the PV classification stage

4.1. Implementing PV Classification
The PV classification, with the exception of the portion that computes the probability densities for all tissue types, was the most ideal candidate to be implemented on RASC RC100 FPGA-accelerators over other functions, as it is performed iteratively and consumes approximately 80% of the total PVE algorithm computation time. The PV classification of the C-kernel is reprogrammed using the Mitrion-C HLL to generate the bitstreams for the FPGAs. Since the prior information for each voxel can be computed independently, the algorithm execution is distributed to the four available FPGA-accelerators to exploit coarse-grained parallelism, such that each FPGA can work on a horizontally partitioned segment of a single slice from the 3D input image. Unfortunately, the RASC Wide-scaling feature in Section 2.1 could not be used in this algorithm because the bitstreams uploaded onto the FPGAs are not identical. Therefore, new bitstreams are generated from the design, which is manually partitioned into four different parts.

To run the FPGA-based PV classification function, three images are required: the PVE classified image, the curvature image and an image that contains the floating-point probability densities of different tissue classes. Due to the limited size of the FPGA external SRAMs, all three images could not be sent to the FPGA at once. The RASC Streaming feature has therefore been utilized to reduce the overhead of data transfer between the memory in the Altix 350 and the FPGA external SRAMs.

First, the hardware-implemented algorithm reads all of the data required to process one row from the input SRAM and stores it in the internal Block RAMs, as shown in Fig. 4. Nine rows (three each from the front, current and back slices) of the PVE classified image are used to determine the weighting term $a_{ik}$ in Equation 5. In the CPU-based PVE algorithm, the PVE classified image is stored in an array of unsigned char (8-bit) data type. However, each row element is stored in 4-bit internal Block RAMs for the FPGA-accelerated implementation, because the voxels in the classified image are only used to represent the possible tissue classes, to reduce hardware resources and achieve better performance. Both a single row of the curvature image and the floating-point probability densities for all possible tissue types are also read and stored in the internal Block RAMs.

After all of the required data is stored in the internal Block RAMs, the $P(C)$ values are calculated for all tissue types. Since the $P(C)$ values of each tissue type are independent, they are computed in a pipelined fashion. A PE which generates a single $a_{ik}/d(i, k)$ term in Equation 4 is designed, and nine of them are implemented in parallel to compute 26 distinct terms from 26-neighbour voxels due to the limited hardware resources. Each PE first verifies the relationship between the reference tissue class and one of its 26-neighbour voxels and then determines the appropriate $a_{ik}$ term as in Equation 5. The weighting term is multiplied by the inverse of the distance between the voxel $i$ and its neighbour voxel $k$, instead of dividing, to avoid costly floating-point division operations. The output of each PE is accumulated using a four-level single-precision reduction tree instead of a data-dependent for-loop. A sequential loop-dependent accumulator is added at the end of the reduction tree.
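The accumulation scheme described above (nine parallel terms, a four-level reduction tree, then a sequential accumulator) can be mimicked in software. The Python sketch below is an illustration under stated assumptions, not the Mitrion-C design: the names `a_weight`, `tree_reduce` and `prior_sum` are hypothetical, the curvature-dependent case of Equation 5 is omitted, the 27 positions are batched by slice, and Python floats are double precision rather than single precision.

```python
import itertools
import math

# Component sets for the seven tissue classes named in Section 3.1.
COMPONENTS = {
    "BG": {"BG"}, "CSF": {"CSF"}, "GM": {"GM"}, "WM": {"WM"},
    "WMGM": {"WM", "GM"}, "GMCSF": {"GM", "CSF"}, "CSFBG": {"CSF", "BG"},
}

# The 27 offsets of a 3x3x3 block (slice index first). Inverse distances are
# precomputed so that no division is needed per voxel; the centre gets 0.
OFFSETS = list(itertools.product((-1, 0, 1), repeat=3))
INV_DIST = [
    0.0 if o == (0, 0, 0) else 1.0 / math.sqrt(sum(c * c for c in o))
    for o in OFFSETS
]


def a_weight(ci, ck):
    """Simplified a_ik: the curvature-dependent case of Equation 5 is left out."""
    if ci == ck:
        return -2.0
    if COMPONENTS[ci] & COMPONENTS[ck]:
        return -1.0
    return 1.0


def tree_reduce(values):
    """Sum values pairwise, level by level; returns (total, number of levels)."""
    level = list(values)
    depth = 0
    while len(level) > 1:
        paired = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            paired.append(level[-1])  # odd element passes through unchanged
        level = paired
        depth += 1
    return level[0], depth


def prior_sum(ref_class, block):
    """Sum a_ik * (1 / d(i, k)) over a 27-label block ordered like OFFSETS."""
    terms = [a_weight(ref_class, ck) * inv for ck, inv in zip(block, INV_DIST)]
    total = 0.0  # sequential accumulator after the tree
    for start in range(0, 27, 9):  # three batches of nine "PE" outputs
        partial, _ = tree_reduce(terms[start:start + 9])
        total += partial
    return total


print(tree_reduce([1.0] * 9)[1])          # levels needed for nine inputs
print(prior_sum("GM", ["GM"] * 27))       # all neighbours share the label
```

Nine inputs reduce as 9, 5, 3, 2, 1, which is why four tree levels are enough; the exponent of Equation 4 is then obtained by scaling this sum with the MRF weighting parameter.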

The final $P(C)$ is then generated using a single-precision floating-point exponent operation as shown in Fig. 4. The generated $P(C)$ values are multiplied by the corresponding single-precision tissue probability densities in a pipelined fashion and the tissue type which has the maximum $P(C)P(x|c)$ is determined. Finally, the voxel $i$ is re-classified as the determined tissue type and is written to the output RAM.

Fig. 4. Hardware-implemented PV classification. (Diagram labels: reference tissue class and 26-neighbour voxels feed nine PEs computing $a_{ik}/d(i, k)$; summation over $k \in N_i$; EXP; $P(C)$; pre-calculated $P(x|c)$; compute $P(C)P(x|c)$ for all possible tissue types; evaluate the tissue type which has the maximum $P(C)P(x|c)$ value; new classified tissue.)

4.2. Results
The original CPU-based PVE algorithm and the host program of the FPGA-based implementation are developed in ANSI-C and run on a single 1.5 GHz Itanium 2. The FPGA-implemented sections are developed using SGI RASC Library 2.1 and Mitrion-C 1.2. Xilinx ISE 8.2i is used to synthesize and place-and-route the implementations in Section 4.1. The average time spent for the synthesis and place-and-route for the implementation is approximately 18 hours. The resources consumed by each implementation are given in Table 2.

Table 2. Hardware resources (Xilinx Virtex-4 LX200).

Resource (total)             Hardware-PVE
Slices (89,088)              77,954 (87%)
Flip Flops (178,176)         82,159 (46%)
4-input LUTs (178,176)       96,479 (54%)
Multipliers (96 18×18)       65 (67%)
18kb Block RAMs (336)        37 (11%)

The FPGA-based PVE algorithm is performed on several real human brain MR images to measure the performance improvement. An image, "colin27", is generated by averaging the 27 distinct MR scans of a single subject to increase the contrast between tissues and to minimize undesired noise [7]. Moreover, ten normal human images from the International Consortium for Brain Mapping (ICBM) database are randomly selected to compare the execution time spent by the CPU-based and FPGA-based implementations. Each image is composed of 181 slices and each slice contains 39,277 voxels of 1 mm³ (217 × 181). The average execution times and speedup obtained by both implementations on these subjects are shown in Table 3. To allow for a fair comparison, the hardware execution time includes all overhead associated with the preparation of the input buffer, sending of data, running of the bitstream and receiving of data.

Table 3. Average speedup achieved from 11 test MRI.

                     Software     Hardware-PVE
Total time           362s         100.2s
Time per itr         26.7s        2.3s
Speedup per itr                   11.6×
Overall speedup                   3.6×

The very same colin27 and ten ICBM images are used to verify the accuracy of the resulting images from the FPGA-implemented PVE algorithm. A single slice of the resulting image from each implementation is shown in Fig. 5(a) and Fig. 5(b), respectively. The mismatched 1 mm³ voxels between the two implementations are represented as black dots and overlaid on the T1 image in Fig. 5(c).

Fig. 5. Resulting images from two implementations. (a) CPU-based (b) FPGA-based (c) Difference overlaid on the T1 image

For a more accurate measure of the differences, the Similarity Index (SI) is calculated over the whole volume. The SI is derived from a reliability measure known as the kappa statistic [8] and is defined as:

$$\mathrm{SI} = \frac{2 \cdot n\{A_{SW} \cap A_{HW}\}}{n\{A_{SW}\} + n\{A_{HW}\}} \quad (6)$$

where $n\{A\}$ represents the number of elements in set $A$, while $A_{SW}$ and $A_{HW}$ are the sets of voxels classified as the tissue type $A$ resulting from the software and hardware implementations, respectively. If the two results are identical, the SI is 1. If there are no common elements between the results, the SI becomes 0. The evaluated SI for each pure tissue class is provided in Table 4.
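Equation 6 is straightforward to compute from two label volumes. The following Python sketch is illustrative only: `similarity_index` is a hypothetical helper, and the four-voxel label lists stand in for flattened classified images rather than real data.

```python
def similarity_index(sw_labels, hw_labels, tissue):
    """Similarity Index (Equation 6) for one tissue class.

    `sw_labels` and `hw_labels` are equally long sequences holding the class
    label of every voxel from the software and hardware results.
    """
    if len(sw_labels) != len(hw_labels):
        raise ValueError("both label volumes must have the same size")
    a_sw = {i for i, c in enumerate(sw_labels) if c == tissue}
    a_hw = {i for i, c in enumerate(hw_labels) if c == tissue}
    denominator = len(a_sw) + len(a_hw)
    if denominator == 0:
        raise ValueError("tissue class is absent from both results")
    return 2.0 * len(a_sw & a_hw) / denominator


# Toy example: the two results disagree on a single voxel.
software = ["GM", "GM", "WM", "CSF"]
hardware = ["GM", "WM", "WM", "CSF"]
for tissue in ("CSF", "WM", "GM"):
    print(tissue, round(similarity_index(software, hardware, tissue), 3))
```

A single disagreement lowers the SI of both tissue classes involved, while the class that matches everywhere keeps an SI of 1.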

Table 4. Similarity Index (SI).

Tissue     ICBM
CSF        0.995
WM         0.999
GM         0.997

(The colin27 column holds the values 0.994, 0.998 and 0.999; their assignment to tissue rows is not recoverable from this copy.)

5. DISCUSSION
The mismatched voxels overlaid on the T1 image in Fig. 5(c) and the calculated SI values indicate that the results from the two implementations were not identical. However, the difference was very slight and would not cause any significant errors in clinical studies. These mismatches were due to the different strategies used for updating the estimated tissue type between the two implementations. In the CPU-based algorithm, the new classified tissue type for a voxel is instantly stored into memory. The 26-neighbour voxels of the current voxel are also instantly updated as they are changed. However, the FPGA-based implementation first stores the updated tissue types in the FPGA external SRAMs and then transfers them to the memory on the Altix 350 system when the algorithm has processed all voxels. As a result, the newly updated voxels were computed from the 26-neighbour voxels generated during the previous iteration. The updating scheme of the hardware-implemented algorithm was necessary to fully exploit the parallelism offered by FPGAs. Since there were no data-dependencies between voxels, all voxels were computed in a pipelined fashion.

While double-precision floating-point was used to calculate $P(C)P(x|c)$ terms during the PV classification in the CPU-based PVE algorithm, single-precision floating-point was used in the FPGA-based implementation to reduce the required hardware resources. This difference did not cause any errors in the output images because the minimal precision error does not affect the determination of the tissue class that generates the maximum $P(C)P(x|c)$ term.

The hardware PVE algorithm was profiled again and the new report indicated that estimation of the probability densities is the new performance bottleneck, as shown in Table 1. The percentage of the total time spent in this function increased from 13% to 50%. As a result, although the average speedup of the portions implemented on the FPGAs reached 11.6× per iteration over the CPU-based parts, the actual overall speedup only varied from 3.5× to 4.2× (average 3.6×), since a different number of iterations was required for each subject. To achieve better overall speedup over the CPU-based implementation, a new design is currently under development, which will also implement the function that estimates the probability densities on the FPGA-accelerators.

As the resolution of MR imaging increases from 1 mm to 0.5 mm, the computation time of the proposed PVE algorithm is expected to increase by a factor of 8, since the number of voxels in an image increases by a factor of 8. With some minor memory structure modifications and new partitioning strategies, the proposed implementation can be performed on the 0.5 mm brain MRI data sets.

6. CONCLUSION
In the described study, a quantitative analysis algorithm for 3D MR imaging was accelerated using HPRC. The computationally intensive portions of the PVE algorithm were implemented on RASC RC100 FPGA-accelerators using the fine-grained optimization techniques provided by the Mitrion-C HLL. The CPU-based and FPGA-based PVE algorithms were used to process several real human brain MR imaging data sets. The images generated by the FPGA-based algorithm were then compared to those generated by the CPU-based algorithm as a means of verification. Moreover, the SI values for pure tissue classes were calculated to measure the accuracy of the resulting images. The PV classification function implemented on four FPGAs achieved approximately 11× speedup over the CPU-based implementation. The overall performance improvement of the FPGA-based PVE algorithm was approximately 3.5× with four FPGAs.

REFERENCES
[1] A. P. Zijdenbos, R. Forghani, and A. C. Evans, "Automatic "pipeline" analysis of 3-D MRI data for clinical trials: application to multiple sclerosis," IEEE Transactions on Medical Imaging, vol. 21, no. 10, pp. 1280–1291, Oct 2002.
[2] Silicon Graphics Inc., Reconfigurable Application-specific Computing User's Guide, 007th ed., 2006.
[3] S. Mohl, The Mitrion-C Programming Language, Mitrionics Inc.
[4] H. S. Choi, D. R. Haynor, and Y. Kim, "Partial volume tissue classification of multichannel magnetic resonance images - a mixel model," IEEE Transactions on Medical Imaging, vol. 10, no. 3, pp. 395–407, Sep 1991.
[5] J. Tohka, A. Zijdenbos, and A. Evans, "Fast and robust parameter estimation for statistical partial volume models in brain MRI," NeuroImage, vol. 23, no. 1, pp. 84–97, September 2004.
[6] V. Singh, "Use of a non-stationary Markov random field in brain tissue partial volume segmentation," master's thesis, Dept. of Electrical and Computer Eng., McGill University, Feb 2005.
[7] C. J. Holmes, R. Hoge, L. Collins, R. Woods, A. W. Toga, and A. C. Evans, "Enhancement of MR images using registration for signal averaging," Journal of Computer Assisted Tomography, vol. 22, no. 2, pp. 324–333, Mar 1998.
[8] A. P. Zijdenbos, B. M. Dawant, R. A. Margolin, and A. C. Palmer, "Morphometric analysis of white matter lesions in MR images: Method and validation," IEEE Transactions on Medical Imaging, vol. 13, no. 4, pp. 716–724, Dec 1994.