
Robust Band Profile Extraction Algorithm, Using Constrained N-P Machine Learning Technique
Shadab Khan, João Sanches and Rodrigo Ventura

Abstract—Poor image quality, typical of images of bone marrow cells taken during cell division, makes it difficult to extract an accurate band profile representing the intensity distribution over chromosomes, and calls for a robust method. An algorithm was therefore developed that estimates a single-line medial axis, the basis for computing the band profile. The medial axis is generated by computing a final prediction from a primary and a secondary prediction, obtained respectively from a nonparametric machine learning algorithm trained with data from the chromosome's skeleton and from geometrical properties of the medial axis. Experiments were performed on the LK1 dataset. The algorithm was found capable of estimating a satisfactory single-line medial axis, and the band profile obtained was found to represent accurately the intensity levels in different regions of the chromosomes. The algorithm was also found to be robust: it was able to grow a very small seed region into the desired medial axis and handled highly irregular chromosomes well.

Index Terms—Medial Axis, Discrete Curve Evolution, Band Profile, Biological Cells.

Fig. 1. A typical karyogram from the LK1 dataset.

S. Khan is with Manipal Inst. of Technology, India (shadab@arro.in). Rodrigo Ventura (yoda@isr.ist.utl.pt) and João Sanches (jmrs@ist.utl.pt) are with the Institute for Systems and Robotics, Technical Superior Institute, Lisbon, Portugal.

I. INTRODUCTION

In the field of cytogenetics, the karyotype is the set of features that can be used to study taxonomic relationships, chromosomal aberrations, or past steps of evolution. Karyotyping is the procedure by which these studies are carried out. The manual procedure demands considerable time from an expert, so automating it is highly desirable. In this regard, classifier design often suffers from measurement degradation in the features on which it relies for classification. The band profile is one such prominent feature and has been used widely, for example in [1]–[4]. For the classifier to discriminate with a high classification rate, an extracted band profile should accurately represent the spatial distribution of intensity over regions of the chromosome surface. An algorithm that can accurately extract band profiles for the chromosomes is therefore essential.

Although sufficient literature is available for processing images from high-quality datasets such as Copenhagen or Edinburgh, a satisfactory method for images of bone marrow cells, taken during mitosis, is still missing. Images of bone marrow cells suffer because the chromosomes are often distorted in shape, with considerable blur and unclear edges. Close observation of [3, Fig. 6d] reveals that the lines orthogonal to the medial axis, which were drawn for the computation of the band profile, often intersect in regions that are not close to the boundary. This is highly undesirable: integrals of intensity values along these lines were used to compute the band profile, so the contribution of the same pixel is counted multiple times. Jau Hong Kao et al. [4] proposed a method that renders the medial axis better visually than [3], but the problem due to unconstrained interpolation can be observed in [4, Fig. 4a]. Thus, a robust algorithm that adapts to changes in the contour of the chromosome was required.

The algorithm proposed here starts with the computation of the medial axis. It first trains a nonparametric machine learning algorithm with a training set taken from the skeleton of the chromosome, from which a primary prediction is obtained. Using the information available about the dependence of points on the medial axis on the contour of the chromosome, a secondary prediction is computed as well. Lastly, a final prediction is computed from the primary and secondary predictions and appended to the part of the skeleton with which the algorithm started (the seed region). The algorithm continues recursively until a complete single-line medial axis has been estimated, and as a last step the band profile is computed. The algorithm described is robust, computationally inexpensive, and capable of processing chromosomes with highly irregular contours.

II. ALGORITHM DESCRIPTION

To simplify further discussion, the medial axis of a closed contour is defined in this work as a single continuous curve traversing the length of the contour. A formal definition of the skeleton is given later. The complete algorithm to compute the band profile can be divided into four major steps: (1) preprocessing of the chromosome subimage; (2) skeletonizing the chromosome subimage; (3) seed region growing; and (4) mesh laying and reconstruction of the geometrically compensated image.

Fig. 2. Chromosome images. a) Chromosome skeleton marked white. b) Pruned skeleton, bifurcation points marked red. c) Seed region marked red. d) Boundary obtained using DCE with 20 vertices; original boundary of chromosome marked white, discrete curve evolved for chromosome marked blue. e) Structuring matrix. h) Original medial axis marked red. i) Smoothed medial axis. j) Seed region with 10 elements marked red. k) Medial axis grown from the seed region.

Structuring matrix of panel e):
0 0 0 1 0 0 0
0 0 1 1 1 0 0
0 1 1 1 1 1 0
1 1 1 1 1 1 1
0 1 1 1 1 1 0
0 0 1 1 1 0 0
0 0 0 1 0 0 0

A. Preprocessing of chromosome subimage

Chromosome subimages are extracted one by one from an ordered karyogram image prepared by experts, and each is then binarized using a suitable threshold. All subimages are then padded with additional rows and columns to aid the later skeletonization steps.

B. Skeletonizing the chromosome subimage

In this step, the skeleton of the chromosome subimage is developed. A few definitions are noted here to aid the explanation. According to Blum [6], the skeleton $S(D)$ of a set $D$ is the locus of the centres of closed disks in $D$ that touch $\delta D$ and are not a subset of any other disk in $D$. It is assumed that a planar set $D$ is the closure of a connected subset of $\mathbb{R}^2$ and that its boundary $\delta D$ is constituted by simple closed curves, which are analytic and polygonal. $Tan(s)$ is the set of points of intersection of the maximal disc centred at $s$ with $\delta D$. The degree of $s$, $\deg(s)$, is defined as the cardinality of $Tan(s)$. The set of bifurcation points is defined as $\{s \in S(D) : \deg(s) \ge 3\}$. The $S(D)$ so obtained is then "thinned" recursively until a one-pixel-wide skeleton is obtained.

Due to the noisy boundary edges of the chromosome, owing to the poor quality of the images from the LK1 dataset, several unwanted branches of the chromosome skeleton were observed, which demanded pruning. Numerous methods for skeleton pruning exist, with their own advantages and disadvantages. For this purpose, a skeleton pruning algorithm based on contour partitioning using discrete curve evolution (DCE) [?] was chosen, and is discussed here in brief. The method used here has the merit of fast computation and allows control over the desired number of branches in the skeleton. Skeleton pruning by contour partitioning is not prone to noise and produces a stable skeleton [?].

Since any closed digital curve can be assumed to represent a polygonal curve with many vertices (each pixel denoting a vertex), DCE returns a simpler representation of $D$ through the following method. The contour is divided into many segments, with the number of vertices initially equal to the number of pixels on $\delta D$. Then, during every evolution step, two neighboring line segments $s_1$, $s_2$ are replaced by a single segment joining the endpoints of $s_1 \cup s_2$. The substitution is done based on the relevance measure $K$ given by:

$$K(s_1, s_2) = \frac{\beta(s_1, s_2)\, l(s_1)\, l(s_2)}{l(s_1) + l(s_2)}$$

In the above formula, $s_1$, $s_2$ are line segments incident on a common vertex $v$, $\beta(s_1, s_2)$ is the turn angle at $v$, and $l$ is the length function normalized with respect to the total length of the polygonal curve. Properties of interest of $K$ are [3] and [4]: a higher value of $K(s_1, s_2)$ signifies a relatively higher contribution of $s_1 \cup s_2$ to the overall shape of the contour. Starting with an input polygon $P$ with $n$ vertices, DCE successively produces simpler polygons $P = P^n, P^{n-1}, P^{n-2}, \ldots, P^3$ such that $P^{n-m}$ is obtained by removing the vertex $v$ with the smallest $K$ in $P^{n-(m-1)}$. Thus, formally, skeleton pruning is defined as the elimination of all $s \in S(D)$ whose generating points lie in the same open segment.

C. Seed Region Growing

When DCE is forced to produce a skeleton with four endpoints, it also returns two bifurcation points. A "Seed Region", as it will be referred to here, is the region in $S(D)$ that connects the two bifurcation points. The medial axis $M(D)$ for the entire chromosome is obtained by growing this seed region until it includes two points from $\delta D$, which signals the algorithm to stop expanding the seed region. To simplify further discussion, the seed region will itself be referred to as $M(D)$; since the emphasis is on growing the seed region to produce the medial axis, this should not cause any confusion. Considering an image, $M(D)$ will have two column vectors, $X$ and $Y$, that describe the locations of the pixels constituting $M(D)$:

$$M(D) = [X \;\; Y]$$

$X$ always contains a series of numbers in increasing or decreasing order, since the $M(D)$ obtained for a chromosome with length greater than breadth will be spatially distributed along the length, as shown in Fig. 1d.

A "smoothed" form of $\delta D$ is also obtained, the purpose of which will become clear in further discussion. Since the boundary of the chromosome is very irregular, it is smoothed using a 3-step process:

1) An image opening operation on the chromosome subimage with a structuring element is performed as noted below:

$$A \circ B = \{\, z \mid (\hat{B})_z \cap \{\, w \mid (B)_w \subseteq A \,\} \neq \emptyset \,\}$$

where $A$ is the chromosome image and $B$ is the structuring element shown in Fig. 2(e).
2) Discrete curve evolution with 20 vertices on the boundary.
3) Discarding all other elements in $\delta D$ except for the 20 vertices found using DCE, followed by connecting neighboring vertices to form a closed polygonal curve.

This boundary will be referred to as $\delta D_2$.
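The relevance measure $K$ and the vertex-removal loop of DCE described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names (`turn_angle`, `relevance`, `dce`) are hypothetical, the contour is assumed to be an ordered list of `(x, y)` vertices, and lengths are assumed to be normalized once by the perimeter of the input polygon.

```python
import math

def turn_angle(prev_pt, v, next_pt):
    """Turn angle beta at vertex v, in radians within [0, pi]."""
    ax, ay = v[0] - prev_pt[0], v[1] - prev_pt[1]
    bx, by = next_pt[0] - v[0], next_pt[1] - v[1]
    cross = ax * by - ay * bx
    dot = ax * bx + ay * by
    return abs(math.atan2(cross, dot))

def perimeter(poly):
    """Total length of the closed polygonal curve."""
    n = len(poly)
    return sum(math.dist(poly[i], poly[(i + 1) % n]) for i in range(n))

def relevance(poly, i, total_len):
    """K(s1, s2) = beta * l(s1) * l(s2) / (l(s1) + l(s2)) at vertex i."""
    n = len(poly)
    prev_pt, v, next_pt = poly[i - 1], poly[i], poly[(i + 1) % n]
    l1 = math.dist(prev_pt, v) / total_len   # normalized length of s1
    l2 = math.dist(v, next_pt) / total_len   # normalized length of s2
    if l1 + l2 == 0:
        return 0.0                           # duplicate points carry no shape information
    return turn_angle(prev_pt, v, next_pt) * l1 * l2 / (l1 + l2)

def dce(polygon, n_keep):
    """Remove the least relevant vertex until n_keep vertices remain."""
    poly = list(polygon)
    total_len = perimeter(poly)  # assumption: normalize by the input polygon's length
    while len(poly) > max(n_keep, 3):
        k_values = [relevance(poly, i, total_len) for i in range(len(poly))]
        del poly[k_values.index(min(k_values))]
    return poly

# Usage: the collinear midpoints of a square have zero turn angle and go first.
square = [(0, 0), (5, 0), (10, 0), (10, 5), (10, 10), (5, 10), (0, 10), (0, 5)]
print(dce(square, 4))  # [(0, 0), (10, 0), (10, 10), (0, 10)]
```

Calling `dce(contour, 20)` on the ordered boundary pixels would give a 20-vertex polygon in the spirit of steps 2) and 3); recomputing every $K$ at each step keeps the sketch short but is quadratic in the number of vertices.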

To grow $M(D)$ is to include an extra element $E_M(X(i), Y(i))$. The algorithm begins by finding the plausible candidate $p$, the point to be included, which will be referred to as the primary prediction. This is done by first finding the next element in the sequence of $X$, and then predicting $p$ using a nonparametric machine learning algorithm, namely locally weighted linear regression. A training set $S$ is formed by taking $p_n$ elements from $X$ and $Y$ that lie towards the extremities of $X$ and $Y$, as shown in Fig. 1g. In further discussion, $S_x$ and $S_y$ denote the input feature vector component and the target variable vector component of $S$, respectively. A hypothesis $h_\theta$ is then defined such that:

$$h_\theta(x) = \theta_0 + \theta_1 x$$

$h_\theta(x)$ is the function used to predict output values for input $x$, and the $\theta$'s are the parameters of the hypothesis $h_\theta$. To constrain $h_\theta(x)$ to attain values as described by the vector $S_y$, a two-step process is performed:

1) Fit $\theta$ to minimize $\sum_{j=0}^{p_n} W(j)\,\{S_y(j) - h_\theta(S_x(j))\}^2$.
2) Output the hypothesis $h_\theta$ using $\theta^T S_x$.

Here the $W(j)$'s are nonnegative-valued weights, and $\theta^T$ is the transpose of the vector of parameters $\theta$. In the context of the work, the weight function used was a shifted version of a fairly standard model:

$$W(j) = \exp\left(-\frac{\big|\,|S_x(j) - x| - p_n\,\big|}{2\tau^2}\right)$$

Here $x$ is the query point and $\tau$ is the bandwidth parameter used to control the rate of decay of the exponential function. The shift introduced this way allows points lying away from the boundary elements of $M(D)$ to have higher weight. This is a valid operation since, during subsequent iterations of the algorithm to grow $M(D)$, the elements lying away from the endpoints $EP_1$ or $EP_2$ are more likely to have been a part of the seed region.

Considering the context of the work, $p$ should be such that:

1) If a curve $C$ is defined to be representative of the spatial distribution of $M(D)$ at $E_M$, then a normal $norm(p)$ to $C$ at $E_M$ intersects $\delta D$ at points $D_1$ and $D_2$ such that $\|D_1 - E_M\| \approx \|D_2 - E_M\|$, where the operator $\|pt_1 - pt_2\|$ denotes the Euclidean distance between points $pt_1$ and $pt_2$.
2) For $E_M$ with coordinates $(X(i), Y(i))$, $X(i) \notin X$ and $E_M$ lies close to the endpoints of $M(D)$, named $EP_1$ and $EP_2$. Here "close" implies $\|EP_{(1\,\text{or}\,2)} - E_M\| \le T$, where $T$ is a threshold parameter that ensures that $E_M$ lies close to either $EP_1$ or $EP_2$. This is an intuitive measure, considering that a predicted point lying too far away from $EP_1$ or $EP_2$ might induce oscillations in $C$.

Next, using the hypothesis function $h_\theta(x)$ with the $x$ coordinate of $p$ as input, the $y$ coordinate is predicted and a normal $norm(p)$ to $C$ at $p$ is found. The next step is to decide whether $p$ meets the criteria for $E_M$; if not, a secondary prediction is computed:

1) Let $norm(p) \cap \delta D = \{D_1, D_2\}$.
2) Calculate $\|D_1 - p\|$ and $\|D_2 - p\|$.
3) If $\big|\,\|D_1 - p\| - \|D_2 - p\|\,\big| \le 0.8$, then the prediction is reliable; if not, calculate $p_2$ such that $p_2 = (D_1 + D_2) \times 0.5$.
4) If $p_2$ is located such that $\|p_2 - p\| \le T_1$, then

$$E_M = \frac{Wt_1 \times p + Wt_2 \times p_2}{Wt_1 + Wt_2}$$

where $T_1$ is a threshold to either validate or discard a secondary observation $p_2$. A $p_2$ lying farther than $T_1$ from $p$ is discarded: such a deviation is unlikely considering the usual structure of chromosomes in the dataset, or can be attributed to a small protuberance on the boundary, and should be ignored. $T_1$ thus also aids in identifying small protuberances and irregularities in the contour that do not contribute significantly to the topology of the chromosome. $Wt_1$ and $Wt_2$ are the weights assigned to $p$ and $p_2$ for finding $E_M$. Depending on the shape of the chromosome, it might be desirable to choose the element to be included in $M(D)$ depending on $p$ or $p_2$. This allows control over how fast the slope of $C$ at $E_M$ will change.

Finally, $M(D)$ is updated with an extra element $E_M$ added beyond $EP_1$ or $EP_2$, depending on whether the growing was performed on the upper half or the lower half of the chromosome subimage. The algorithm then continues with the next iteration, this time with the $E_M$ obtained in the last iteration being $EP_1$ or $EP_2$. The process terminates when an $E_M$ is found that lies beyond the boundary of the chromosome, thus resulting in a single-line medial axis $M(D)$.

D. Mesh laying & geometrically compensated image

After an accurate medial axis has been obtained, the next step is geometrical compensation of the image (for feature extraction). For this purpose, the chromosome was assumed to be a two-dimensional deformable surface. To capture the geospatial distribution of intensity levels and the variation along the boundary, and to preserve them for an accurate shape reconstruction, a mesh was distributed on the chromosome. This mesh was then transformed to reconstruct the chromosome shape. Mesh laying over the chromosome and geometrical compensation of the chromosome subimage were performed through the following steps:

1) Smoothing of the medial axis using successive cubic splines, first with knots at an interval of 3, then 4, and finally 8.
2) Draw an orthogonal line $orth(s)\ \forall s \in M(D)$, and find the set $I_S = \{orth(s) \cap \delta D\}$.
3) Extract the profile of the chromosome subimage lying between the elements of $I_S$, and transport it into a new image to reconstruct the geometrically compensated chromosome subimage.

The band profile for the chromosome subimage is then computed as the average intensity value $H(n)$ across each row of the shape-compensated image:

$$H(n) = \frac{1}{N} \sum_{i=1}^{N} I(n, i)$$

Here $N$ is the number of columns in the shape-compensated chromosome subimage and $I$ is the intensity of a pixel.
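One growing step of Section II-C (primary prediction by locally weighted linear regression, then the secondary prediction and the weighted combination) can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the weight follows the shifted exponential as reconstructed from the text with $p_n$ taken as the number of training points, the default `t1` value is arbitrary, and falling back to $p$ when $p_2$ is rejected is an assumed reading of "should be ignored". Finding $D_1$ and $D_2$ (the intersections of the normal with the boundary) is left to the caller.

```python
import math

def shifted_weights(sx, x_query, pn, tau):
    """W(j) = exp(-| |Sx(j) - x| - pn | / (2 * tau^2)) for every training input."""
    return [math.exp(-abs(abs(s - x_query) - pn) / (2.0 * tau ** 2)) for s in sx]

def fit_weighted_line(sx, sy, w):
    """Closed-form minimizer of sum_j w_j * (sy_j - theta0 - theta1 * sx_j)^2."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, sx)) / sw
    my = sum(wi * yi for wi, yi in zip(w, sy)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, sx))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, sx, sy))
    theta1 = sxy / sxx if sxx > 0 else 0.0
    theta0 = my - theta1 * mx
    return theta0, theta1

def primary_prediction(sx, sy, x_query, tau=1.0):
    """Predict p = (x, h_theta(x)) from the training set (Sx, Sy)."""
    pn = len(sx)  # assumption: the shift equals the training-set size
    w = shifted_weights(sx, x_query, pn, tau)
    theta0, theta1 = fit_weighted_line(sx, sy, w)
    return (x_query, theta0 + theta1 * x_query)

def fuse(p, d1, d2, wt1=1.0, wt2=1.0, balance_tol=0.8, t1=3.0):
    """Combine primary prediction p with the midpoint p2 of boundary hits d1, d2."""
    dist1, dist2 = math.dist(d1, p), math.dist(d2, p)
    if abs(dist1 - dist2) <= balance_tol:
        return p  # p is already roughly equidistant from both boundary points
    p2 = ((d1[0] + d2[0]) * 0.5, (d1[1] + d2[1]) * 0.5)
    if math.dist(p2, p) <= t1:
        s = wt1 + wt2
        return ((wt1 * p[0] + wt2 * p2[0]) / s, (wt1 * p[1] + wt2 * p2[1]) / s)
    return p  # assumption: a rejected p2 is ignored and p is kept
```

Raising `wt2` relative to `wt1` pulls $E_M$ towards the midpoint between the two boundary points, which is one way to read the remark that the weights control how fast the slope of $C$ changes.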

III. RESULTS AND DISCUSSION

Fig. 4 and Fig. 5 show the effectiveness of the algorithm in finding a smooth medial axis, which can be used to compute the band profile. Fig. 5c and Fig. 5d demonstrate the robustness of the algorithm in accurately predicting the elements to be included in the medial axis, as it was able to grow a seed region with 10 elements (Fig. 5c) into an $M(D)$ with 60 elements (Fig. 5d). In the context of karyotyping, it is important that $orth(s)\ \forall s \in M(D)$ should be nonintersecting. This objective was met, as can be confirmed by visual inspection of the figures.

Although it was known that accurate results are not possible using only band profiles, it was of interest to see the pairing results based only on the band profiles of chromosomes. The distance between two chromosomes is the Euclidean distance between one profile and the other, shifted by $\sigma$:

$$d(i, j) = \| h_i(n) - h_j(n - \sigma) \|_2$$

To avoid measurement degradation during the comparison, the two band profiles were first aligned by estimating a shift constant $\sigma$ such that

$$\sigma = \arg\max_{\sigma} \{ \varphi(h_1(n), h_2(n - \sigma)) \}$$

where $\varphi(x, y) = x^T y$ is the correlation function of the column vectors $x$ and $y$, and $h_1(n)$ and $h_2(n)$ are the band profiles of the 1st and 2nd chromosomes of the same class. The optimization is performed by testing the correlation function for $\sigma = \{-10, \ldots, 10\}$ and choosing the value that maximizes it.

Figure 7 shows the band profiles for a pair of chromosomes from class 2. The correspondence between the band profiles can be confirmed visually and by considering the normalized correlation between them. In experiments on the LK1 dataset, on average, the algorithm was able to pair chromosomes from the same class to each other perfectly 10 out of 22 times.

Figure caption (panel b): Image displaying chromosomes from class 2 along with their band profiles; correspondence can be observed.

IV. CONCLUSION

In this paper, an algorithm to accurately extract the band profile of chromosomes was presented. The algorithm was found to be robust in its operation and computationally inexpensive to implement.

V. FUTURE WORK

An accurate algorithm to compute the medial axis of chromosomes from the LK1 dataset is very motivating. With this vital tool for automatic karyotyping in hand, it is intended from here to develop an accurate chromosome pairing system, using additional features such as geometrical properties of chromosomes along with statistical data.

REFERENCES

[1] J. Piper and E. Granum, "On fully automatic feature measurement for banded chromosome classification," Cytometry, vol. 10, pp. 242-255, 1989.
[2] R. J. Stanley, J. M. Keller, P. Gader, and C. W. Caldwell, "Datadriven homologue matching for chromosome identification," IEEE Trans. on Med. Imag., vol. 17, no. 3, pp. 451-462, 1998.
[3] A. Khmelinskii, R. Ventura and J. Sanches, "A Novel Metric for Bone Marrow Cells Chromosome Pairing," IEEE Trans. on Biomed. Eng., to be published.
[4] J. H. Kao, J. H. Chuang and T. P. Wang, "Automatic Chromosome Classification Using Medial Axis Approximation and Band Profile Similarity," LNCS, vol. 3852, pp. 274-283, P. J. Narayanan et al., Eds. Berlin: Springer Verlag, 2006.
[5] X. Bai, L. J. Latecki and W. Y. Liu, "Skeleton Pruning by Contour Partitioning with Discrete Curve Evolution," IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 29, no. 3, pp. 449-462, March 2007.
[6] H. Blum, "Biological Shape and Visual Science (Part I)," J. Theoretical Biology, vol. 38, pp. 205-287, 1973.
[7] L. J. Latecki and R. Lakamper, "Shape Similarity Measure based on Correspondence of Visual Parts," IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 22, no. 10, pp. 1185-1190, Oct. 2000.
[8] L. J. Latecki and R. Lakamper, "Application of Planar Shape Comparison to Object Retrieval in Image Databases," Pattern Recognition, vol. 35, no. 1, pp. 15-29, 2002.