Cell image classification based on ensemble features and random forest

B.C. Ko, J.W. Gim and J.Y. Nam
An efficient white blood cell (WBC) image classification method using ensemble features and a random forest classification scheme is introduced. After WBC segmentation into the nucleus and cytoplasm, classification into five different categories is necessary for accurate disease diagnosis. First, from several experiments, it was proved that the nucleus alone is adequate for classifying WBCs without using the cytoplasm, because the cytoplasm of some WBCs presents a very weak difference against the background and touches neighbouring WBCs and red blood cells. Secondly, it was proved that the random forest is a reasonable classifier for WBC classification compared to other classification methods when using small training datasets.
Fig. 1 Segmentation results for five types of WBCs
a Basophil
b Eosinophil
c Lymphocyte
d Monocyte
e Neutrophil
White line is nucleus; black line is cytoplasm

Introduction: Peripheral blood contains five types of white blood cells (WBCs) along with red blood cells, and the count of each WBC type provides important clues for diagnosing blood-related diseases such as leukaemia or cancer. The five classes of WBCs in the peripheral blood, which differ in the size and shape of their nuclei, are neutrophils, eosinophils, basophils, monocytes, and lymphocytes. Therefore, differential WBC category counting is essential for accurate disease diagnosis. Even though current automated cell counters used in hospitals are based largely on laser-light scatter principles, a quarter of the blood samples require microscopic review by experts. However, few algorithms [1–4] allow automatic cell classification using image processing. Theera-Umpon and Dhompongsa [1] applied two conventional classifiers to classify WBCs: a Bayes classifier and neural networks using four granulometric nucleus features without cytoplasm. Adjouadi et al. [2] introduced an algorithm to optimise the pattern recognition of different white blood cell types in flow cytometry. They used a support vector machine (SVM) classifier to cluster parametric data in a multidimensional space. Ramoser [3] also presented an automated WBC classification method that uses a pairwise SVM classifier to catalogue cytoplasm and nucleus features. Reta et al. [4] presented a two-phase methodology to analyse the morphology of abnormal leukocyte images for the classification of acute leukaemia subtypes using image processing and data mining techniques. In this Letter, we propose a novel WBC classification method that combines a few characteristic features of WBCs and the random forest classifier. Furthermore, we prove that classification with only the nucleus gives better performance than that with both the nucleus and cytoplasm.
Using random forests with ensemble features, a test WBC image is classified into one of the five categories based on the highest probability. We demonstrate that our classification method is more robust than the conventional multi-class support vector machine (MSVM).

WBC classification: A random forest is a decision tree ensemble classifier with each tree grown using some type of randomisation. A random forest has the capacity to process huge amounts of data with high training speeds based on a decision tree [7]. In the training procedure, the random forest starts by choosing a random subset $I'$ from the training data $I$. At node $n$, the training data $I_n$ is iteratively split into left and right subsets $I_l$ and $I_r$ by using the threshold and split function. After training of the random forest, the test WBC images are applied to the trained random forest. The final class distribution is generated by an ensemble of each distribution of all trees $L = (l_1, l_2, \ldots, l_T)$ using (1). In (1), $T$ is the number of trees, and we choose $c_i$ as the final class $f$ of an input image if the average of $P(c_i \mid l_t)$ has the maximum value.

$$f = \arg\max_{c_i,\; i = 1, \ldots, 5} \frac{1}{T} \sum_{t=1}^{T} P(c_i \mid l_t) \qquad (1)$$
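The decision rule in (1) can be illustrated with a minimal Python sketch. It assumes that each tree has already routed the test image to a leaf and returned that leaf's class distribution; the function name `classify`, the class ordering and the example numbers are hypothetical and are not taken from the Letter.

```python
CLASSES = ["basophil", "eosinophil", "lymphocyte", "monocyte", "neutrophil"]


def classify(tree_posteriors):
    """Average the per-tree distributions and pick the most probable class.

    tree_posteriors[t][i] plays the role of P(c_i | l_t) in (1): the class
    distribution stored at the leaf l_t reached by the image in tree t.
    """
    n_trees = len(tree_posteriors)              # T in (1)
    n_classes = len(tree_posteriors[0])
    avg = [sum(p[i] for p in tree_posteriors) / n_trees
           for i in range(n_classes)]
    best = max(range(n_classes), key=lambda i: avg[i])   # arg max over i
    return best, avg


# Illustrative leaf distributions from T = 3 trees (made-up numbers).
posteriors = [
    [0.10, 0.10, 0.60, 0.10, 0.10],
    [0.05, 0.05, 0.30, 0.50, 0.10],
    [0.10, 0.10, 0.50, 0.20, 0.10],
]
index, avg = classify(posteriors)
label = CLASSES[index]
```

Averaging the distributions, rather than counting one vote per tree, lets a tree that is uncertain about an image contribute less to the final decision than a confident one.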

Results: The experimental data consisted of 60 colour peripheral blood cell images collected from the Cellavision reference library and 40 images, including 200 WBCs, from Severance Hospital; together they covered all five types of stained WBCs. First, because a WBC consists of a nucleus and cytoplasm, we compared the classification performance of the entire WBC region (nucleus plus cytoplasm) with that of the nucleus alone to determine the effective region for WBC classification. To train the classifier, the feature vector of the entire region combined the cytoplasm and nucleus feature vectors into 141 feature dimensions (the number-of-nuclei feature is excluded for the cytoplasm). The feature vector of the nucleus had 71 feature dimensions. Both approaches were trained with the random forest using half of the database images, including 120 WBCs. After training, testing was performed using all of the database images, including 240 WBCs.

Feature extraction: In our previous study [5], we proposed a new stained WBC image segmentation method using stepwise merging rules based on mean-shift clustering and boundary removal rules with a gradient vector flow (GVF) snake, as shown in Fig. 1. After cell segmentation, we need to extract the features from the segmented nucleus and cytoplasm. For fast and efficient classification, we extract 12 ensemble features, comprising shape, colour, and texture features, with 71 dimensions. We use two kinds of shape features: for the global shape feature, we use area, perimeter, and eccentricity; for the geometric feature, we choose the first and second invariant moment shape signatures, which are invariant to scaling, rotation, and translation of the shape. In addition, because the nuclei of neutrophils consist of two parts, as shown in Fig. 1, the number of nuclei is also used for classification. For the colour feature, we extract the average and standard deviation of the LUV colour associated with each nucleus instead of the colour histogram in order to reduce feature dimensions. The colour component L denotes the luminance of the colour, while components U and V together represent the chromatic information. Lastly, for the texture feature, we use 59 local binary patterns (LBPs) [6] because an LBP is robust against illumination changes, very fast to compute, and does not require many parameters to be set. LBP describes the greyscale local texture of the image with low computational complexity by detecting local patterns between adjacent pixels. Finally, we extract 71 properties of each WBC as a feature vector. After feature extraction, each feature vector is normalised to [0, 1] using Gaussian normalisation.
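As an illustration of where 59 texture dimensions can come from, the sketch below computes a uniform LBP histogram in plain Python. It assumes the common configuration of 8 neighbours at radius 1, in which the 58 "uniform" codes (at most two 0/1 transitions around the circle) each receive a bin and all remaining codes share one extra bin. The Letter does not state its exact LBP settings or whether the histogram is restricted to the nucleus mask, so these choices and all names here are assumptions.

```python
def transitions(code):
    """Count 0/1 changes when the 8 bits of code are read circularly."""
    bits = [(code >> k) & 1 for k in range(8)]
    return sum(bits[k] != bits[(k + 1) % 8] for k in range(8))


# Uniform codes get one bin each; every other code falls into the last bin.
UNIFORM = [c for c in range(256) if transitions(c) <= 2]
BIN = {c: i for i, c in enumerate(UNIFORM)}
N_BINS = len(UNIFORM) + 1

# The 8 neighbours of a pixel, clockwise from the top-left corner.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
           (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_histogram(gray):
    """Normalised uniform-LBP histogram of a greyscale image (list of rows).

    Border pixels are skipped because they lack a full neighbourhood.
    """
    hist = [0] * N_BINS
    for y in range(1, len(gray) - 1):
        for x in range(1, len(gray[0]) - 1):
            code = 0
            for bit, (dy, dx) in enumerate(OFFSETS):
                # Set the bit when the neighbour is at least as bright.
                if gray[y + dy][x + dx] >= gray[y][x]:
                    code |= 1 << bit
            hist[BIN.get(code, N_BINS - 1)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# A flat patch: every interior pixel yields code 255, which is uniform.
flat_patch = [[7] * 4 for _ in range(4)]
flat_hist = lbp_histogram(flat_patch)
```

Because each code depends only on the sign of the difference between a neighbour and the centre pixel, adding a constant brightness offset to the image leaves the histogram unchanged, which is consistent with the robustness to illumination changes noted above.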

Fig. 2 Performance comparison results between nucleus + cytoplasm and nucleus
a Precision
b Recall
Numbers labelling graphs represent WBC types: 1. basophil, 2. eosinophil, 3. lymphocyte, 4. monocyte, 5. neutrophil. Vertical axis: percentage (%); horizontal axis: class and average
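Per-class precision and recall of the kind plotted in Fig. 2 can be computed from true and predicted labels as in the following sketch. It assumes that the "average" bar is the unweighted mean over the five classes, which the Letter does not state explicitly; the function name and the toy labels are hypothetical and are not the experimental data.

```python
def precision_recall(y_true, y_pred, classes):
    """Per-class precision and recall plus their unweighted class averages."""
    precision, recall = {}, {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        predicted = sum(1 for p in y_pred if p == c)   # images labelled c
        actual = sum(1 for t in y_true if t == c)      # images truly of class c
        precision[c] = tp / predicted if predicted else 0.0
        recall[c] = tp / actual if actual else 0.0
    avg_p = sum(precision.values()) / len(classes)
    avg_r = sum(recall.values()) / len(classes)
    return precision, recall, avg_p, avg_r


# Toy labels for illustration only; classes are numbered 1-5 as in Fig. 2.
y_true = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_pred = [1, 2, 2, 2, 3, 4, 4, 4, 5, 1]
precision, recall, avg_p, avg_r = precision_recall(
    y_true, y_pred, [1, 2, 3, 4, 5])
```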

Overall, the second approach outperformed the first approach with an average precision rate of 72.5% compared to 61% and recall rate of 67.3% compared to 54.2%, as shown in Fig. 2. The main reason for the lower rate of the first approach is that the cytoplasm of some WBCs presented a very weak difference against the background, and they were in contact with neighbouring WBCs and red blood cells. Moreover, in certain cases, the cytoplasm had a complex texture or several different colour regions (granules). These results show that the nucleus alone is adequate for classifying WBCs without using cytoplasm.

Secondly, we compared the random forest using only the nucleus with the multi-class support vector machine (MSVM), which is known as a better WBC classifier than other conventional classifiers and has recently come into wide use [2, 3]. As shown in Figs. 3a and b, the random forest produced a better classification performance; in particular, the precision rate improved by 12.5 percentage points compared to the MSVM method (72.5% against 60%). In addition, the average recall rate of the random forest outperformed the MSVM method at 67.34% to 58.64%. These results prove that the random forest is a reasonable classifier for WBC classification with small training datasets.

Fig. 3 Performance comparison results between MSVM and random forest classifiers
a Precision
b Recall
Numbers labelling graphs represent WBC types: 1. basophil, 2. eosinophil, 3. lymphocyte, 4. monocyte, 5. neutrophil. Vertical axis: percentage (%); horizontal axis: class and average

Conclusion: This study demonstrated an efficient WBC image classification method using ensemble features from the nucleus and a random forest classification scheme. Classification using only the nucleus outperformed classification using the cytoplasm and nucleus because the cytoplasm of some WBCs presents a very weak difference against the background and touches neighbouring WBCs and red blood cells. In addition, experimental results showed that using the random forest with dynamic features could indeed improve the classification performance compared to other classification methods when using small training datasets.

Acknowledgment: This work was supported by grant no. RTI04-01-01 from the Regional Technology Innovation Program of the Ministry of Knowledge Economy (MKE).

© The Institution of Engineering and Technology 2011
25 March 2011
doi: 10.1049/el.2011.0831
One or more of the Figures in this Letter are available in colour online.

B.C. Ko, J.W. Gim and J.Y. Nam (Department of Computer Engineering, Keimyung University, Shindang-Dong Dalseo-Gu, Daegu 704-701, Republic of Korea)
E-mail: niceko@kmu.ac.kr

References
1 Theera-Umpon, N., and Dhompongsa, S.: ‘Morphological granulometric features of nucleus in automatic bone marrow white blood cell classification’, IEEE Trans. Inf. Technol. Biomed., 2007, 11, (3), pp. 353–359
2 Adjouadi, M., Zong, N., and Ayala, M.: ‘Multidimensional pattern recognition and classification of white blood cells using support vector machines’, Part. Part. Syst. Charact., 2005, 22, (2), pp. 107–118
3 Ramoser, H.: ‘Leukocyte segmentation and SVM classification in blood smear images’, Mach. Graph. Vis., 2008, 17, (1), pp. 187–200
4 Reta, C., Altamirano, L., Gonzalez, J.A., Diaz, R., and Guichard, J.S.: ‘Segmentation of bone marrow cell images for morphological classification of acute leukemia’. Proc. Int. FLAIRS Conf., Florida, May 2010, pp. 86–91
5 Gim, J.W., Park, J., Lee, J.H., Ko, B.C., and Nam, J.Y.: ‘A novel framework for white blood cell segmentation based on stepwise rules and morphological features’. Proc. SPIE Machine Vision Applications, San Francisco, USA, January 2011, Vol. 7877, pp. 1–6
6 Ojala, T., Pietikainen, M., and Maenpaa, T.: ‘Multiresolution gray-scale and rotation invariant texture classification with local binary patterns’, IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (7), pp. 971–987
7 Breiman, L.: ‘Random forests’, Mach. Learn., 2001, 45, (1), pp. 5–32

ELECTRONICS LETTERS 26th May 2011 Vol. 47 No. 11
