Galaxy Classification.
All content following this page was uploaded by John Abela on 28 May 2014.
Adam Gauci1, Kristian Zarb Adami2, John Abela1
1 Department of Intelligent Computer Systems, Faculty of ICT, University of Malta
2 Department of Physics, Faculty of Science, University of Malta
arXiv:1005.0390v2 [astro-ph.GA] 1 Jun 2010
ABSTRACT
In this work, decision tree learning algorithms and fuzzy inferencing systems are applied to galaxy morphology classification. In particular, the CART, the C4.5, the Random Forest and fuzzy logic algorithms are studied and reliable classifiers are developed to distinguish between spiral galaxies, elliptical galaxies and star/unknown galactic objects. Morphology information for the training and testing datasets is obtained from the Galaxy Zoo project while the corresponding photometric and spectral parameters are downloaded from the SDSS DR7 catalogue.
Key words: SDSS, Galaxy Zoo, galaxy morphology classification, decision trees,
fuzzy logic, machine learning
Set Name Number of Objects No. of Ellipticals No. of Spirals No. of Stars/Unknown objects
3.1 C4.5
Following work done by Hunt in the late 1950s and early 1960s, Ross Quinlan continued to improve on the developed techniques and released the Iterative Dichotomiser 3 (ID3) and the improved C4.5 decision tree learners (Kohavi and Quinlan 1990). Even though the C5 algorithm is commercially available, the freely available C4.5 algorithm is the one considered here.
Here trees are built by recursively searching through and splitting the provided training set. If all samples in the set belong to the same class, the tree is taken to be made of just a leaf node. Otherwise, the values of the parameters are tested to determine a non-trivial partition that separates the samples into the corresponding classes. In C4.5, the selected splitting criterion is the one that maximises the information gain and the gain ratio.
Let $RF(C_j, S)$ be the relative frequency of samples in set $S$ that belong to class $C_j$. The information that identifies the class of a sample in set $S$ is:
$$I(S) = -\sum_{j=1}^{x} RF(C_j, S)\,\log\big(RF(C_j, S)\big)$$
After applying a test $T$ that separates set $S$ into $S_1, S_2, ..., S_t$, the information gained is:
$$G(S, T) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\, I(S_i)$$
The test that maximises $G(S, T)$ is selected at the respective node. The main problem with this approach is that it favours tests having a large number of outcomes, such as those producing a lot of subsets each with few samples. Hence the gain ratio, which also takes into account the potential information from the partition itself, is introduced:
$$P(S, T) = -\sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log\left(\frac{|S_i|}{|S|}\right)$$
Here $P(S, T)$ is the potential information of the partition, and the gain ratio is the quotient $G(S, T) / P(S, T)$.
If all samples are classified correctly, the tree may be overfitting the data and will fail when attempting to classify more general, unseen samples. Normally this is prevented by restricting some examples from being considered when building the tree or by pruning some of the branches after the tree is inferred. C4.5 adopts the latter strategy and removes some branches in a single bottom-up pass.
One of the main advantages of the C4.5 algorithm is that it is capable of dealing with real, non-nominal attributes and so renders itself compatible with continuous parameters. It can also handle missing attribute data.
3.2 CART
The Classification and Regression Tree (CART) scheme was developed by Friedman and Breiman (Kohavi and Quinlan 1990). Although a divide and conquer search strategy similar to that of C4.5 is used, the resulting tree structure, the splitting criteria, the pruning method and the way missing values are handled are redefined.
CART only allows for binary trees to be created. While this may simplify splitting and optimally partitions categorical attributes, there may be no good binary split for a parameter and inferior trees might be inferred. However, for multi-class problems, twoing may be used. This involves separating all samples into two mutually exclusive super-classes at each node and then applying the splitting criteria for a two-class problem.
As a splitting criterion, CART uses the Gini diversity index. Let $RF(C_j, S)$ again be the relative frequency of samples in set $S$ that belong to class $C_j$; then the Gini index is defined as:
$$I_{gini}(S) = 1 - \sum_{j=1}^{x} RF(C_j, S)^2$$
and the information gain due to a particular test $T$ can be computed from:
$$G(S, T) = I_{gini}(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\, I_{gini}(S_i)$$
As with the C4.5 algorithm, the split that maximises $G(S, T)$ is selected. If all samples in a given node have the same parameter value, then the samples are perfectly homogeneous and there is no impurity.
The CART algorithm also prunes the tree and uses cross-validation methods that may require more computation time. However, this will render shorter trees than those obtained from C4.5. Samples with missing data may also be processed.
3.3 Random Forests
Breiman and Cutler (2001), the pioneers of random forests, suggest using a classifier in which a number of decision trees are built. When processing a particular sample, the output of each of the individual trees is considered and the resulting mode is taken as the final classification. Each tree is grown from a different subset of examples, allowing for an unseen (out of bag) set of samples to be used for evaluation. Attributes for each node are chosen randomly and the one which produces the highest level of learning is selected. It is shown that the overall accuracy is increased when the trees are less correlated. Having each of the individual trees with a low error rate is also desirable.
Apart from producing a highly accurate classifier, such a scheme can also handle a very large number of samples and input variables. A proximity matrix, which shows how samples are related, is also generated. This is useful since such relations may be very difficult to detect just by inspection. With this strategy, good results may still be obtained even when a large portion of the data is missing or when the number of examples in each category is biased.
Table 2. Set of input parameters that are band independent, from the i band (≈ 700 nm to 1400 nm) and from the r band (≈ 700 nm)
Name                       Description
deVAB_i                    DeVaucouleurs fit axis ratio
expAB_i                    Exponential fit axis ratio
lnLExp_i                   Exponential disk fit log likelihood
lnLDeV_i                   DeVaucouleurs fit log likelihood
lnLStar_i                  Star log likelihood
petroR90_i / petroR50_i    Concentration
mRrCc_i                    Adaptive (+) shape measure
texture_i                  Texture parameter
mE1_i                      Adaptive E1 shape measure
mE2_i                      Adaptive E2 shape measure
mCr4_i                     Adaptive fourth moment

deVAB_r                    DeVaucouleurs fit axis ratio
expAB_r                    Exponential fit axis ratio
lnLExp_r                   Exponential disk fit log likelihood
lnLDeV_r                   DeVaucouleurs fit log likelihood
lnLStar_r                  Star log likelihood
petroR90_r / petroR50_r    Concentration
mRrCc_r                    Adaptive (+) shape measure
texture_r                  Texture parameter
mE1_r                      Adaptive E1 shape measure
mE2_r                      Adaptive E2 shape measure
mCr4_r                     Adaptive fourth moment

Fuzzy logic accommodates soft computing by allowing for an imprecise representation of the real world. In crisp logic a clear boundary is considered to separate the various classes and each element is categorised into one group such that samples in sets A and notA represent the entire dataset. Fuzzy logic extends this by giving every sample a degree of membership in each set, and hence also caters for situations in which simple boolean logic is not enough. If classically set membership was denoted by 0 (false) or 1 (true), now we can also have 0.25 or 0.75. In fuzzy logic, the truth of any statement becomes a matter of degree.
The mathematical function that maps each input to the corresponding membership value between 0 and 1 is known as the membership function. Although this can be arbitrary, such a function is normally chosen with computational efficiency and simplicity in mind. Common membership functions include the triangular function, the trapezoidal function, the Gaussian function and the bell function. The latter two are the most popular and, although they are smooth, concise and can attain non-zero values anywhere, they fail in specifying asymmetric membership functions. This limitation is alleviated through the use of sigmoid functions that can either open left or right.
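The membership functions named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the parameter names (a, b, c, d, width, slope, centre) and the particular generalised bell form are choices made here and are not taken from the paper.

```python
import math

# Hypothetical helper names and parameterisations, chosen for illustration only.

def triangular(x, a, b, c):
    """Triangular membership: 0 outside (a, c), peak of 1 at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: rises on [a, b], equals 1 on [b, c], falls on [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

def gaussian(x, mean, sigma):
    """Gaussian membership: smooth, symmetric, non-zero everywhere."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)

def bell(x, width, slope, centre):
    """Generalised bell membership (assumed form): 1 / (1 + |(x - centre) / width|^(2 * slope))."""
    return 1.0 / (1.0 + abs((x - centre) / width) ** (2 * slope))

def sigmoid(x, slope, centre):
    """Sigmoid membership: opens right for slope > 0 and left for slope < 0."""
    return 1.0 / (1.0 + math.exp(-slope * (x - centre)))
```

With these definitions, the sign of `slope` in `sigmoid` selects a right-open or left-open curve, which is what allows an asymmetric membership function to be expressed.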
For an inference system, if-then rules that deal with fuzzy consequents and fuzzy antecedents are defined. An aggregated fuzzy set is then output after these conditional rules are compared and combined by the fuzzy equivalents of the standard logical operators. Since the degree of membership can now attain any value between 0 and 1, the AND and OR operators are replaced by the min and max functions respectively. The resulting output is then defuzzified to obtain one output value.
Fuzzy inference systems are easily understood and can even be applied when dealing with imprecise data. Like decision tree classifiers, they provide a transparent model that experts can analyse and extend with other information. Such inference approaches have already been successfully applied in a number of applications that range from integration in consumer products to industrial process control, medical instrumentation and decision support systems.
5 INPUT PARAMETERS
In all machine learning algorithms, the set of input parameters strongly determines the overall accuracy of the classifier. Ideally, a minimum number of attributes that can differentiate between the three galaxy morphology classes is required. For this work, photometric and spectral values downloaded from the SDSS PhotoObjAll and SpecLineAll tables were used. Data for which classification information is available in the Galaxy Zoo catalogue were downloaded and used to test the various machine learning algorithms.
5.1 Photometric Attributes
In this study, the set of 13 parameters taken by Banerji et al. (2009), which are based on colour, profile fitting and adaptive moments, was used. However, we did not limit the evaluation to the i band but also aimed at testing whether the values derived from the r band give equal or better classification accuracies. The input parameters used are presented in Table 2.
The DeVaucouleurs law provides a measure of how the surface brightness of an elliptical galaxy varies with apparent distance from the centre. This should provide a good element of discrimination between spiral and elliptical profiles. The lnLStar parameter also helps to separate galaxy from star objects. The concentration parameter is given by the ratio of the radii containing 90% and 50% of the Petrosian flux in a given band. The texture parameter compares the range of fluctuations in the surface brightness of the object to the full dynamic range of the surface brightness. It is expected that this is negligible for smooth profiles but becomes significant in high variance regions such as spiral arms.
The other parameters used are based on the object's shape. In particular, the adaptive moments derived from the SDSS photometric pipeline are second moments of the object intensity, measured using a particular scheme designed to have an optimal signal-to-noise ratio. These moments are calculated by using a radial weight function that adapts to the shape and size of the object. Although theoretically there exists an optimal radial shape for the weight function related to the light profile of the object, a Gaussian with size matched to that of the object is used (Bernstein and Jarvis 2002).
Machine Learning for Galaxy Morphology Classification 5
Table 3. Wavelengths of spectra lines
Wave       Label        Wave       Label
3727.09    OII 3727     4960.30    OIII 4960
3729.88    OII 3730     5008.24    OIII 5008
3798.98    Hh 3799      5176.70    Mg 5177
3836.47    Oy 3836      5895.60    Na 5896
3889.00    HeI 3889     6302.05    OI 6302
3934.78    K 3935       6365.54    OI 6366
4072.30    SII 4072     6549.86    NII 6550
4102.89    Hd 4103      6564.61    Ha 6565
4305.61    G 4306       6585.27    NII 6585
4341.68    Hg 4342      6718.29    SII 6718
4364.44    OIII 4364    6732.67    SII 6733
4862.68    Hb 4863

The sum of the second moments in the CCD row and column direction (mRrCc) is calculated by:
$$\mathrm{mRrCc} = \langle c^2 \rangle + \langle r^2 \rangle$$
where $c$ and $r$ correspond to the columns and rows of the sensor respectively and the second moments are defined as
$$\langle c^2 \rangle = \frac{\sum \left[ I(r, c)\, w(r, c)\, c^2 \right]}{\sum \left[ I(r, c)\, w(r, c) \right]}$$
$I$ is the intensity of the object and $w$ is the weighting function. The ellipticity/polarisation components are defined by:
$$\mathrm{mE1} = \frac{\langle c^2 \rangle - \langle r^2 \rangle}{\mathrm{mRrCc}}$$
$$\mathrm{mE2} = 2\,\frac{\langle c\, r \rangle}{\mathrm{mRrCc}}$$
A fourth order moment is also defined as:
$$\mathrm{mCr4} = \frac{\langle (c^2 + r^2)^2 \rangle}{\sigma^4}$$
In this case, $\sigma$ is the width of the Gaussian weight function applied.
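As a rough illustration of the moment definitions above, the following Python sketch evaluates them on a small image stored as a list of rows. It is not the SDSS pipeline: the function name `weighted_moments` is hypothetical, the centre and the Gaussian width `sigma` are assumed to be supplied by the caller rather than adapted iteratively to the object, and `c` and `r` are assumed to be measured from that centre.

```python
import math

def weighted_moments(image, centre_r, centre_c, sigma):
    """Return (mRrCc, mE1, mE2, mCr4) for a 2-D list of pixel intensities.

    Assumptions (not from the paper): fixed centre, fixed circular Gaussian
    weight of width sigma, offsets c and r measured from the centre.
    """
    norm = cc = rr = cr = fourth = 0.0
    for i, row in enumerate(image):
        for j, intensity in enumerate(row):
            r = i - centre_r
            c = j - centre_c
            w = math.exp(-(r * r + c * c) / (2.0 * sigma * sigma))
            iw = intensity * w          # I(r, c) * w(r, c)
            norm += iw
            cc += iw * c * c            # numerator of <c^2>
            rr += iw * r * r            # numerator of <r^2>
            cr += iw * c * r            # numerator of <c r>
            fourth += iw * (c * c + r * r) ** 2
    cc, rr, cr, fourth = cc / norm, rr / norm, cr / norm, fourth / norm
    m_rr_cc = cc + rr
    m_e1 = (cc - rr) / m_rr_cc
    m_e2 = 2.0 * cr / m_rr_cc
    m_cr4 = fourth / sigma ** 4
    return m_rr_cc, m_e1, m_e2, m_cr4
```

For a round object the two ellipticity components are close to zero, while an object elongated along the column direction gives a positive mE1. The sketch does not guard against an empty image or a single lit pixel at the centre, where the normalisations vanish.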
6 RESULTS
Initially, the 13 photometric parameters derived from the i band were standardised and independent component analysis was performed to determine the most significant components. As can be seen from the resulting eigenvalues shown in Figure 3, all of the independent components attain a non-zero value. This implies that all attributes are important for galaxy classification and dimension reduction is unnecessary. Figure 4 and Figure 5 show the eigenvalues obtained when
Figure 4. Eigenvalues from the 13 r band parameters
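The standardisation step applied to the photometric parameters before the component analysis can be sketched as follows. The text does not say which variant was used, so this assumes the common z-score form (subtract the column mean, divide by the column standard deviation) with the population standard deviation; the function name `standardise` is hypothetical.

```python
import statistics

def standardise(rows):
    """Z-score each column of a list of equal-length numeric rows.

    Assumption: population standard deviation; constant columns map to 0.0.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]
```

After this step every parameter has zero mean and unit variance, so attributes measured on very different scales contribute comparably to the subsequent analysis.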
for testing and the other nine subsets were put together to
form the training set. The presented results are the com-
puted averages across all ten trials. By this approach, every
sample is part of the test set exactly once.
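The evaluation procedure described here, with one subset held out for testing, the remaining nine merged for training and the results averaged over ten trials, can be sketched as below. The helper names, the shuffling with a fixed seed and the `fit`/`predict` callables are assumptions made for illustration and are not taken from the paper.

```python
import random

def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs; each index is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k nearly equal subsets
    for held_out in range(k):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test

def cross_validated_accuracy(samples, labels, fit, predict, k=10, seed=0):
    """Average test accuracy over the k trials.

    fit(train_samples, train_labels) returns a model and
    predict(model, sample) returns a label; both are supplied by the caller.
    """
    scores = []
    for train, test in k_fold_splits(len(samples), k, seed):
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

Any classifier can be plugged in through `fit` and `predict`; the sketch requires at least `k` samples so that no fold is empty.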
Figure 9. Decision tree confusion matrices for r band input parameters
Figure 11. Decision tree confusion matrices for spectra input parameters
Figure 13. Samples of spiral (top), elliptical (middle) and unknown (bottom) galaxies that were incorrectly classified by the fuzzy inference system
In most cases, when processing photometric parameters, the adaptive shape measure (mRrCc) parameter was chosen as the root of the tree. First level nodes included the concentration (petroR90/petroR50) and the dered_g - dered_r parameters. For spectra data, the Ha line was determined to provide the highest information gain while the Hb and the K lines were chosen as first level nodes.
Figure 13 shows samples of galaxies incorrectly classified by the fuzzy inference system. Although this is not the most accurate technique described, the incorrectly classified spiral and elliptical samples are very faint in magnitude. Moreover, all incorrectly classified unknown objects have bright sources in the vicinity and this could have affected the parameters calculated by the SDSS photometric pipeline.
8 ACKNOWLEDGEMENTS
The GalaxyZoo data was supplied by Dr Steven Bamford on behalf of the Galaxy Zoo team. The authors would like to thank him for his comments and suggestions that helped to improve this paper.
Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web Site is http://www.sdss.org/.
The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, University of Cambridge, Case Western Reserve Uni-
REFERENCES
K. N. Abazajian et al. The seventh data release of the Sloan Digital Sky Survey. ApJS, 182:543–558, 2009.
R. Andrae and P. Melchior. Morphological galaxy classification with shapelets.
N. M. Ball, J. Loveday, M. Fukugita, O. Nakamura, S. Okamura, J. Brinkmann, and R. J. Brunner. Galaxy types in the Sloan Digital Sky Survey using supervised artificial neural networks. MNRAS, 348:1038–1046, 2009.
S. P. Bamford, R. C. Nichol, I. K. Baldry, K. Land, C. J. Lintott, K. Schawinski, A. Slosar, A. S. Szalay, D. Thomas, M. Torki, D. Andreescu, E. M. Edmondson, C. J. Miller, P. Murray, M. J. Raddick, and J. Vandenberg. Galaxy Zoo: the dependence of morphology and colour on environment. Monthly Notices of the Royal Astronomical Society, 393:1324–1352, 2009.
M. Banerji, O. Lahav, C. J. Lintott, F. B. Abdalla, K. Schawinski, D. Andreescu, S. Bamford, P. Murray, M. J. Raddick, A. Slosar, A. Szalay, D. Thomas, and J. Vandenberg. Galaxy Zoo: Reproducing galaxy morphologies via machine learning. 2009.
G. M. Bernstein and M. Jarvis. Shapes and shears, stars and smears: Optimal measurements for weak lensing. The Astrophysical Journal, 123:583–618, 2002.
L. Breiman and A. Cutler. Random forests, 2001.
J. Calleja and O. Fuentes. Automated classification of galaxy images. (3215):411–418, 2004.
M. Fukugita, O. Nakamura, S. Okamura, N. Yasuda, J. C. Barentine, J. Brinkmann, J. E. Gunn, M. Harvanek, T. Ichikawa, R. H. Lupton, D. P. Schneider, M. Strauss, and D. G. York. A catalog of morphologically classified galaxies from the Sloan Digital Sky Survey: North equatorial region. The Astronomical Journal, 134:579–593, 2007.
R. Kohavi and R. Quinlan. Decision tree discovery. 1990.
C. J. Lintott, K. Schawinski, A. Slosar, K. Land, S. Bamford, D. Thomas, M. J. Raddick, R. C. Nichol, A. Szalay, D. Andreescu, P. Murray, and J. VanDenBerg. Galaxy Zoo: Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey. MNRAS, 2008.
SDSS. Algorithms - emission and absorption line fitting.
M. A. Strauss, D. H. Weinberg, R. H. Lupton, V. K. Narayanan, J. Annis, M. Bernardi, M. Blanton, S. Burles, A. J. Connolly, J. Dalcanton, M. Doi, D. Eisenstein, J. A. Frieman, M. Fukugita, J. E. Gunn, Z. Ivezic, S. Kent, R. S. J. Kim, G. R. Knapp, R. G. Kron, J. A. Munn, H. J. Newberg, R. C. Nichol, S. Okamura, T. R. Quinn, M. W. Richmond, D. J. Schlegel, K. Shimasaku, M. SubbaRao, A. S. Szalay, D. V. Berk, M. S. Vogeley, B. Yanny, N. Yasuda, D. G. York, and I. Zehavi. Spectroscopic target selection in the Sloan Digital Sky Survey: The main galaxy sample. The Astronomical Journal, 124:1810–1824, 2002.