Professional Documents
Culture Documents
Textbook Computational Science and Technology 5Th Iccst 2018 Kota Kinabalu Malaysia 29 30 August 2018 Rayner Alfred Ebook All Chapter PDF
Textbook Computational Science and Technology 5Th Iccst 2018 Kota Kinabalu Malaysia 29 30 August 2018 Rayner Alfred Ebook All Chapter PDF
https://textbookfull.com/product/computational-science-and-
technology-6th-iccst-2019-kota-kinabalu-
malaysia-29-30-august-2019-rayner-alfred/
https://textbookfull.com/product/computational-science-and-
technology-7th-iccst-2020-pattaya-
thailand-29-30-august-2020-rayner-alfred-editor/
https://textbookfull.com/product/computational-science-and-
technology-4th-iccst-2017-kuala-lumpur-
malaysia-29-30-november-2017-1st-edition-rayner-alfred/
https://textbookfull.com/product/user-science-and-
engineering-5th-international-conference-i-user-2018-puchong-
malaysia-august-28-30-2018-proceedings-natrah-abdullah/
Static Analysis 25th International Symposium SAS 2018
Freiburg Germany August 29 31 2018 Proceedings Andreas
Podelski
https://textbookfull.com/product/static-analysis-25th-
international-symposium-sas-2018-freiburg-germany-
august-29-31-2018-proceedings-andreas-podelski/
https://textbookfull.com/product/innovative-technologies-and-
learning-first-international-conference-icitl-2018-portoroz-
slovenia-august-27-30-2018-proceedings-ting-ting-wu/
https://textbookfull.com/product/network-and-system-
security-12th-international-conference-nss-2018-hong-kong-china-
august-27-29-2018-proceedings-man-ho-au/
https://textbookfull.com/product/discovery-science-21st-
international-conference-ds-2018-limassol-cyprus-
october-29-31-2018-proceedings-larisa-soldatova/
https://textbookfull.com/product/natural-language-processing-and-
chinese-computing-7th-ccf-international-conference-
nlpcc-2018-hohhot-china-august-26-30-2018-proceedings-part-i-min-
Lecture Notes in Electrical Engineering 481
Rayner Alfred
Yuto Lim
Ag Asri Ag Ibrahim
Patricia Anthony Editors
Computational
Science and
Technology
5th ICCST 2018, Kota Kinabalu,
Malaysia, 29–30 August 2018
Lecture Notes in Electrical Engineering
Volume 481
Europe:
North America:
Editors
Computational Science
and Technology
5th ICCST 2018, Kota Kinabalu, Malaysia,
29–30 August 2018
123
Editors
Rayner Alfred Ag Asri Ag Ibrahim
Knowledge Technology Research Unit, Faculty of Computing and Informatics
Faculty of Computing and Informatics Universiti Malaysia Sabah
Universiti Malaysia Sabah Kota Kinabalu, Sabah, Malaysia
Kota Kinabalu, Sabah, Malaysia
Patricia Anthony
Yuto Lim Lincoln University
School of Information Science, Christchurch, New Zealand
Security and Networks Area
Japan Advanced Institute of Science
and Technology
Ishikawa, Japan
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Preface
v
vi Preface
vii
viii Contents
Hau Lee Tong1, Mohammad Faizal Ahmad Fauzi2, Su Cheng Haw1, Hu Ng1 and
Timothy Tzen Vun Yap1
1
Faculty of Computing and Informatics, Multimedia University, 63100 Cyberjaya, Selangor,
Malaysia
2
Faculty of Engineering, Multimedia University, 63100 Cyberjaya, Selangor, Malaysia
hltong@mmu.edu.my
Abstract. In this work, Computed Tomography (CT) brain images are adopted
for the annotation of different types of hemorrhages. The ultimate objective is
to devise the semantics-based retrieval system for retrieving the images based
on the different keywords. The adopted keywords are hemorrhagic slices, intra-
axial, subdural and extradural slices. The proposed approach is consisted of
three separated annotation processes are proposed which are annotation of
hemorrhagic slices, annotation of intra-axial and annotation of subdural and ex-
tradural. The dataset with 519 CT images is obtained from two collaborating
hospitals. For the classification, support vector machine (SVM) with radial ba-
sis function (RBF) kernel is considered. On overall, the classification results
from each experiment achieved precision and recall of more than 79%. After
the classification, the images will be annotated with the classified keywords to-
gether with the obtained decision values. During the retrieval, the relevant im-
ages will be retrieved and ranked correspondingly according to the decision
values.
1 Introduction
Intracranial hemorrhage detection is clinically significant for the patients having head
trauma and neurological disturbances. This is because early discovery and accurate
diagnosis of the brain abnormalities is crucial for the execution of the successful ther-
apy and proper treatment. Multi-slice CT scans are extensively utilized in today’s
analysis of head traumas due to its effectiveness to unveil some abnormalities such as
calcification, hemorrhage and bone fractures.
A lot of research works have been carried out to assist the visual interpretation of
the medical brain images. Brain hemorrhage can cause brain shift and make it lose its
symmetric property. As such, investigation of the symmetric information can assist in
hemorrhage detecting. Chawla et al. proposed the detection of line of symmetry based
on the physical structure of the skull [1]. Likewise, Saito et al. proposed a method to
detect the brain lesion based on the difference values of the extracted features be-
tween region of interests (ROIs) at symmetrical position [2]. The symmetric line is
determined from the midpoints of the skull contour. Besides the symmetric detection
approach, segmentation technique is commonly adopted to obtain the potential de-
sired regions. Bhadauria and Dewal proposed an approach by combining the features
of both fuzzy c-means and region-based active contour method to segment the hemor-
rhagic regions [3]. Al-Ayyoub et al. used Otsu’s thresholding method for the image
segmentation [4]. From the Otsu’s method, the potential ROIs are obtained. Jyoti et
al. proposed genetic FCM (GFCM), a clustering algorithm integrated with Genetic
Algorithm (GA) to segment the CT brain images with the modified objective function
[5]. Suryawanshi and Jadhao adopted a neural network approach with watershed seg-
mentation to detect the intracerebral haemorrhage and subdural [6].
Huge collections of medical images have contributed significant source of
knowledge and various areas of medical research. However, the major problem is that
the size of the medical image collection in hospitals faces constant daily growth. Thus
the amount of work and time required for retrieving a particular image for further
analysis is costly. As such, image retrieval particularly in medical domain has gained
some attention in the research area of content-based image retrieval (CBIR) [7-10].
Consequently, multi-slice CT brain images are adopted in this work to classify, an-
notate and retrieve the images by using different keywords based on the type of hem-
orrhages. The retrieval is based on the adopted keywords which are hemorrhagic slic-
es, intra-axial, subdural and extradural slices
The architecture diagram of our proposed system is shown in Fig. 1. The proposed
system consists of two main components, i.e., the offline mode and online mode. The
offline component will be the automated annotation part. Each CT scan will be anno-
tated by the keywords and the attached semantic keyword will be stored into a data-
base. On the other hand, the online component will be the retrieval part. CT images
can be retrieved according to the predefined keywords.
Offline
component
CT brain Midline
Preprocessing Clustering
images Detection
Feature
Classification
extraction
Online
component
CT scan
Query image by Labelling Retrieved CT
image
keyword database brain images
database
3 Preprocessing
The original CT images are lacking in dynamic range whereby only limited objects
are visible. The major aim of the enhancement is to augment the interpretability and
perception of images to contribute better input for the subsequent image processing
process. Firstly the histogram for the original image is constructed as shown in Fig.
2(a). The constructed histogram is consisted of several peaks. However, only the
rightmost peak is within the significant range for region of interest which is the intra-
cranial area. Then, the curve smoothing process is conducted where convolution oper-
ation with a vector is applied. Each element of the vector is of the value 10-3. The
smoothened curve is converted into absolute first difference (ABS) in order to gener-
ate two highest peaks. The two generated peaks will be the lower limit, IL and upper
limit, IU. The acquired IL and IU are utilized for the linear contrast stretching as shown
in equation (1).
( I (i, j ) I L )
F (i, j ) I max
(I U I L ) (1)
where Imax, I(i,j) and F(i,j) denote the maximum intensity of the image, pixel value
of the original image and pixel value of the contrast improved image respectively.
After contrast stretching, the contrast enhanced image is shown in Fig. 2(b).
The purpose of third preprocessing is to further enhance the parenchyma for the sub-
sequent clustering stage. Thus, this enhancement aim to make the hemorrhagic re-
gions more visible to clearly reveal the dissimilarity between the hemorrhagic hemi-
sphere and non-hemorrhagic hemisphere. Prior to contrast stretching, the appropriate
lower and upper limits need to be obtained. Firstly construct the histogram for the
acquired parenchyma. Then identify the lower limit, IL which is the peak position of
the constructed histogram. From the obtained lower limit, the upper limit can be de-
rived by equation (2). The determination of the appropriate lower and upper limits is
necessary to ensure the focus of the stretching is for the hemorrhagic regions rather
than normal regions.
IU= IL +Ie (2)
Lastly, input the auto-determined values of IL and IU into equation (1) for the con-
trast stretching and the result is depicted in Fig. 4.
The main goal for this section is to garner potential hemorrhagic regions into a single
cluster to be used for the annotation of the hemorrhages. Hemorrhagic regions always
appear as bright regions. Therefore only intensity of the images is considered for the
Automatic Classification and Retrieval of Brain Hemorrhages 5
clustering. In order to achieve this, firstly, the image is partitioned into two clusters.
From these two clusters, the low intensity cluster without potential hemorrhagic re-
gions is ignored. Only the high intensity cluster which consists of potential hemor-
rhagic regions is considered. Four clustering techniques which are Otsu thresholding,
fuzzy c-means (FCM), k-means and expectation-maximization (EM) are attempted in
order to select the most appropriate technique for subdural, extradural and intra-axial
hemorrhages annotation.
From the results obtained as shown in Fig. 5, Otsu thresholding, FCM and EM en-
countered over-segmentation as hemorrhagic region is merged together with neigh-
bouring pixels and more noises are present. On the other hand, k-means conserves the
appropriate shape most of the time and yields less noise. This directly contributes to
more precise shape properties acquisition at the later region-based feature extraction.
Consequently, k-means clustering is adopted in our system.
5 Midline detection
The proposed approach of the parenchyma midline acquisition consists of two stages.
Firstly, acquire the contour of the parenchyma as shown in Fig. 6(a). Then the line
scanning process will be executed: The scanning begins from the endpoints of the top
and bottom sub-contours to locate the local minima and local maxima for top and
bottom sub-contour respectively. The line scanning of local maxima, (x1,y1) for bot-
tom sub-contour is as illustrated in Fig. 6(d).
In the case where the local minima or maxima point is not detected, Radon trans-
form [11] will be utilized to shorten the searching line. Radon transform is defined as:
where , and f ( x, y ) denote distance from origin to the line, angle from the
X-axis to the normal direction of the line and pixel intensity at coordinate (x,y). Dirac
delta function is the obtained shortened line is thickened as depicted in Fig. 6(e).
Then moving average is applied to locate the points of interest, (x2,y2) from the
shortened contour line. Basically moving average is used for the computation of the
6 H. L. Tong et al.
average intensity for all the points along the contour line. The location of the points is
based on the highest average value of intensity as shown in Fig. 6 (f).
Eventually, midline is formed by using the linear interpolation as defined in equa-
tion (4). The detected midlines are shown in Fig. 7.
( x x1 )( y2 y1 ) (4)
y y1
( x2 x1 )
Fig. 6. (a) Contour of parenchyma area (b) Top sub-contour (c) Bottom sub-contour (d) Line
scanning for local maxima detection (e) Shortened searching line (f) Located highest average
value of intensity point
6 Feature Extraction
Features are extracted from the left and right hemispheres and used to classify the
images. There is a three-stage feature extraction which is feature extraction for hem-
orrhagic slices, feature extraction for intra-axial slices and feature extraction for sub-
Automatic Classification and Retrieval of Brain Hemorrhages 7
dural and extradural. Twenty three features are proposed based on their ability to
distinguish hemorrhagic and non-hemorrhagic slices. These twenty three features are
derived based on one feature from Local Binary Pattern (LBP), one from entropy,
four from four-bin histogram, one from intensity histogram and sixteen from four
Haralick texture features( energy, entropy, autocorrelation and maximum probability)
descriptors on four directions. Prior to the feature extraction for intra-axial, subdural
and extradural, the potential hemorrhagic regions will be divided into intra regions
and boundary regions. Intra regions will proceed with the intra-axial feature extrac-
tion with the twenty three that identical with the features of hemorrhagic slices. On
the other hand, the boundary regions will proceed with the region-based feature ex-
traction to annotate the subdural and extradural. The twelve shape features adopted
are region area, border contact area, orientation, linearity, concavity, ellipticity, circu-
larity, triangularity, solidity, extent, eccentricity and sum of centroid contour distance
curve Fourier descriptor(CCDCFD). Some of the adopted shape features are defined
in equation (4) to equation (8)
Linearity, minor axis
L( ROI ) 1
major axis (5)
Triangularity, ( ROI ) = 108I if I
1 ,
1 108 (6)
108 I otherwise
Where I= 2,0 ( ROI ) 0,2 ( ROI ) ( 1,1 ( ROI )) and i , j ( ROI ) is the (i,j)-moment
2
Sum of CCDC=
N 1
ft (k ) (9)
k 2 ft (1)
Where ft (k ) 1
N 1
i 2 kn , k=0, 1, 2…., N-1, z(n)= x1(n) +iy1(n) and
N n 0
z (n)exp
N
7 Classification
The acquired CT brain images are in DICOM format with their dimension 512 x 512.
The images are acquired from two collaborating hospitals, which are Serdang Hospi-
tal and Putrajaya Hospital, to test the feasibility of the proposed approach on datasets
generated from different scanners and different setup. The dataset consists of 181
normal slices, 209 intra-axial slices, 60 extradural slices and 86 subdural slices. The
adopted classification technique is SVM with RBF kernel. During the classification,
ten-fold cross validation method is performed. The features obtained from three sepa-
rate feature extraction process are channel into the classifier respectively in order to
categorize the different slices. The obtained recall and precision are shown in Table 1.
The recall and precision for all slices achieved satisfactory results with at least 0.79
for all. This is contributed by the features employed as they well describe the charac-
teristics of the different slices. However, the recall for intra-axial slice is relatively
lower compared with other slices. Some bright normal regions presented in non-intra-
axial slices increase the similarity of the features between some non-intra-axial and
intra-axial slices. Therefore, it generated the higher misclassification of intra-axial
slices.
8 Retrieval Results
This section presents the retrieval results based on the keywords which are “hemor-
rhage”, “intra-axial”, “subdural” and “extradural”. The retrieval results for each key-
word are exhibited based on the twenty five most relevant as shown in Fig. 8. The
relevancy or ranking is based decision values obtained from RBF SVM.
Besides the visual retrieval results, the accuracy of the retrieval is evaluated based
on the numbers of most relevant to the adopted keywords over 519 images. The test-
ing numbers for each keyword are different as the evaluation is according to the num-
ber of total slices of each keyword. From Table 2, the retrieval result based on the
keyword “hemorrhage” is 96% for the 300 most relevant retrievals. In this case, there
are 12 irrelevant images being retrieved together in the top 300. On the other hand, for
the 100 most intra-axial relevant results, there are three irrelevant images being re-
trieved. For 50 most extradural relevant retrieval results, the accuracy obtained is
96%. There are 2 irrelevant images presented in the retrieval results. Lastly for the 50
most subdural relevant retrieval results, the accuracy obtained is 92%. There are 4
irrelevant images being retrieved in this retrieval.
Automatic Classification and Retrieval of Brain Hemorrhages 9
9 Conclusion
This research work proposed three segregated feature extraction processes to classify
and retrieve the different types of slices based on their different features and loca-
tions. By individualizing each of the process, the appropriate approach such as mid-
line approach, global-based and region-based feature extractions can be adopted at
each level for the categorization of specific hemorrhage. Overall, the precision and
recall rates are over 79% for all the classification results. In our future work, we
would like to extend classification types and retrieval keywords to include more ab-
normalities of brain such as infarct, hydrocephalus and atrophy.
References
1. Chawla, M., Sharma, S., Sivaswamy, J., Kishore, L. T.: A method for automatic detection
and classification of stroke from brain ct images. Annual International Conference of the
IEEE Engineering in Medicine and Biology Society, pp. 3581-3584. (2009).
2. Saito, H., Katsuragawa, S., Hirai, T., Kakeda, S., Kourogi, Y.: A computerized method for
detection of acute cerebral infarction on ct images. Nippon Hoshasen Gijutsu Gakkai
Zasshi 66(9), 1169–1177.(2010).
3. Bhadauria, H. S., Dewal, M. L.: Intracranial hemorrhage detection using spatial fuzzy c-
mean and region-based active contour on brain CT imaging. Signal, Image and Video
Processing 8(2), 357-364 (2014).
Automatic Classification and Retrieval of Brain Hemorrhages 11
4. Al-Ayyoub, M., Alawad, D., Al-Darabsah, K., Aljarrah, I.: Automatic detection and
classification of brain hemorrhages. WSEAS Transactions on Computers 12(10), 395-405
(2013).
5. Jyoti, A., Mohanty, M., Kar, S., Biswal, B.: Optimized clustering method for CT brain
image segmentation. Advances in Intelligent Systems and Computing 327(1),317-324.
(2015).
6. Suryawanshi, S. H.,Jadhao, K. T.: Smart brain hemorrhage diagnosis using artificial neu-
ral networks. International Journal of Scientific and Technology Research 4(10), p267-271
(2015).
7. Müller, H., Deserno, T.: Content-based medical image retrieval. Biomedical image
processing, biological and medical physics, biomedical engineering, pp. 471–494. (2011).
8. Ramamurthy, B., Chandran, K., Aishwarya, S., Janaranjani, P.: CBMIR: content based
image retrieval using invariant moments, glcm and grayscale resolution for medical
images. European Journal of Scientific Research 59(4), 460-471 (2011).
9. Müller, H., de Herrera, A., Kalpathy-Cramer, J., Demner-Fushman, D., Antani, S., Eggel,
I.: Overview of the ImageCLEF 2012 medical image retrieval and classification tasks.
CLEF Online Working Notes, pp. 1–16. (2012).
10. de Herrera, A. G., Kalpathy-Cramer, J., Fushman, D. D., Antani, S., Muller, H.: Overview
of the ImageCLEF 2013 medical tasks. Working notes of CLEF, pp. 1-15. (2013).
11. Chen, B., Zhong, H.: line detection in image based on edge enhancement. Second
International Symposium on Information Science and Engineering, pp. 415-418. IEEE
Computer Society Washington, DC, USA (2009).
Towards Stemming Error Reduction for Malay Texts
Mohamad Nizam Kassim1, Shaiful Hisham Mat Jali2, Mohd Aizaini Maarof2 and
Anazida Zainal2
1
CyberSecurity Malaysia, 43300 Seri Kembangan, Selangor
2
Faculty of Computing, Universiti Teknologi Malaysia, 81310 Skudai Johore, Malaysia
nizam@cybersecurity.my shaiful.hisham@outlook.com
{aizaini,anazida}@utm.my
1 Introduction
tion word patterns and various bound morphemes of affixes, clitics and particles at-
tached to those word patterns. Hence, many current text stemmers were developed
based on various stemming approaches such as the selection of longest/shortest affix-
es, clitics or particles, ordering affixation stemming rules, ordering lookup dictionar-
ies and dictionary entries [5,6]. Despite the fact that the wide variety of text stemming
approaches were developed, the current text stemmers for the Malay language still
suffered from stemming errors. Therefore, there is a need for an enhanced stemming
approach that improves stemming accuracy. Hence, this paper proposes an enhanced
text stemmer that combines affixes removal method and dictionary lookup to address
possible stemming errors. This paper is organized into five sections. Section 2 high-
lights the related works of the current text stemmers. Section 3 describes the linguistic
aspects of the Malay language. Section 4, describes an enhanced text stemming ap-
proach to address possible stemming errors. Section 5 discusses experimental results
and explains our findings with respect to the proposed text stemmer. Finally, Section
6 concludes this paper with a brief summary.
2 Related Works
The early text stemmer was developed by Lovins [8] for the English language and
inspired other researchers to develop text stemmers in various natural languages such
as Latin, Dutch, and Arabic. This early text stemmer has also led to active research on
the development of other improved text stemmers for the English language such as
Porter's stemmer, Paice/Husk’s stemmer and Hull’s stemmer [3]. The Porter’s stem-
mer is considered as the most prominent text stemmer due to its application in infor-
mation retrieval. It has become a defacto text stemmer for the English language and
the same stemming approaches used in Porter’s stemmer have been adopted for use in
other natural languages such as the Romance (French, Italian, Portuguese and Span-
ish), Germanic (Dutch and German), Scandinavian (Danish, Norwegian and Swe-
dish), Finnish and Russian languages [3]. Similarly, the early text stemmer for the
Malay language was developed Othman [9] and also influenced other researchers to
develop an improved text stemmer for the Malay language. The subsequent text
stemmers applied various stemming approaches to stem affixation words. It is im-
portant to highlight that past researchers faced various stemming errors in developing
a perfect text stemmer for the Malay language [6]. This scenario led to various stem-
ming approaches has been developed to address these stemming errors. Currently, six
different text stemming approaches were proposed in the current text stemmers for
Malay language, i.e., rule-based text stemming [9,10], Rules Application Order
(RAO) [5,11,12], Rules Frequency Order (RFO) [13], Modified RFO [7,14], Modi-
fied Porter Stemmer [15], and Syllable-Based text stemming [16]. There are two dif-
ferent strategies for text stemming research, i.e., first, rule-based text stemming and
second, other text stemming. The first text stemming strategies were to improve upon
the rule-based text stemmer developed by Othman [9], Rule-based stemming applies
sets of affixation text stemming rules in alphabetical order and dictionary lookup for
checking that a word is not the root word prior to applying text stemming. However,
Towards Stemming Error Reduction for Malay Texts 15
Sembok et al. [17] highlighted the weaknesses of the rule-based stemming approach
in which produced many stemming errors. Consequently, the RAO stemming ap-
proach was introduced, which adds and rearranges affixation text stemming rules in
morphological order and also uses a dictionary lookup to check that the input word is
not the root word before and after applying text stemming [7]. Based on the RAO
stemming approach, the RFO stemming approach was then introduced, which consid-
ered only the most frequent affixes in Malay texts and used a dictionary lookup to
check that the word is not the root word before and after applying text stemming rules
[13]. Finally, Modified RFO stemming was introduced, which uses two types of dic-
tionaries: root word dictionaries and ‘background knowledge’ derivative word dic-
tionaries [14]. The second text stemming strategies were adopting other stemming
approaches such as the Modified Porter Stemmer [15] and Syllable-Based methods
[16]. However, these stemming approaches were not popular with researchers due to
the complexity of Malay morphology, which contains spelling variations and excep-
tions rules wherein affix removal is not sufficient for finding root words and recoding
the first letter is required to produce the correct root words. In short, these text stem-
ming approaches were developed to improve stemming accuracy to stem affixation
words in Malay texts.
3 Malay Language
Various possible affixation stemming errors have been associated with the current
Malay text stemmers [5,7,11-12, 16]. Researchers categorize these stemming errors in
terms of overstemming, understemming, unstem and spelling variations and excep-
tions errors. The root causes of these stemming errors will be further discussed in this
section.
Scenario I: Root words with similar morphemes to prefixation, suffixation and confixation
words
berat (heavy) rat when the prefix (be+) is removed (overstemming)
pasukan (team) pasu when the confix (ber+an) is removed (overstemming)
menteri (minister) ter when the confix (ber+an) is removed (overstemming)
To address these stemming errors, the current Malay text stemmers adopt a dictionary
lookup to accept these root words as root words. Thus, these root words are not
stemmed by affixation text stemming rules. However, it is challenging to apply text
stemming rules to affixation words that contain these root words, as the following
examples illustrate:
Scenario I: Affixation words containing root words with similar morphemes to prefixation,
suffixation and confixation words
berkesan (effective) kes when the confix (ber+an) is removed (overstemming)
pasukannya (his/her team) pasu when the suffix (+an) and enclitics (+nya) are re-
moved (overstemming)
menterinya (minister) ter when the confix (men+i) and enclitics (+nya) are removed
(overstemming)
Unfortunately, the use of a root word dictionary lookup only matches each word in
text documents if the word is a root word as opposed to an affixation word. Hence,
the proposed text stemmer will address these stemming errors by using the root word
and derivative dictionaries. The root word dictionaries accept root words with similar
morphemes to affixes in affixation words as root words whereas the derivative dic-
tionaries stem affixation words that contain root words with similar morphemes to
affixes in affixation words (e.g. berkesan, pasukannya, menterinya) to find the correct
root words.
Towards Stemming Error Reduction for Malay Texts 17
Root word ambiguity has not been addressed in current Malay text stemmers, as high-
lighted by Darwis et al. [7]. The root word ambiguity problem occurs when there is
more than one root word that can be selected during text stemming. The main prob-
lem associated with root word ambiguity involves selecting the correct root word
from affixation word that contains multiple root words. For example, affixation words
may contain two possible root words, depending on the type of affix matching select-
ed, which could lead to stemming errors as follows:
Scenario I:
Incorrect: beribu (thousands) ibu (mother) when the prefix (ber+) is removed
Correct: beribu (thousands) ribu (thousand) when the prefix (be+) is removed
Scenario II:
Incorrect: katakan (let say) katak (frog) when the suffix (+an) is removed
Correct: katakan (let say) kata (word) when the suffix (+kan) is removed
Scenario III:
Incorrect: perangkaan (statistics) rangka (skeleton) when the confix (pe+an) is
removed
Correct: perangkaan (statistics) angka (number) when the confix (per+kan) is removed
Unfortunately, the current Malay text stemmers use of a root word dictionary to check
against stemmed words being valid root words in which do not address these stem-
ming errors. Thus, the proposed text stemmer will use derivative dictionaries to stem
these words to avoid root word ambiguity.
There are other peculiarities found only in the aquatic species which
have not so obvious a relation to their habitat. In no genus that is
mainly aquatic in habit are the ova small and nearly unprovided with
yolk as in Lumbricus; the ova of aquatic forms are invariably large and
filled with abundant yolk.
Darwin has also pointed out the benefits to the agriculturist which
accrue from the industry of these Annelids. The soil is thoroughly mixed
and submitted to the action of the atmosphere. The secretions of the
worms themselves cannot but have a good effect upon its fertility, while
the burrows open up the deeper-lying layers to the rain. Mr. Alvan
Millson,[421] in detailing the labours of the remarkable Yoruba worm
(Siphonogaster millsoni Beddard), hints that they may serve as a check
upon the fatal malaria of the west coast of Africa. By their incessant
burrowings and ejecting of the undigested remains of their food many
poisonous germs may be brought up from below, where they flourish in
the absence of sunlight and oxygen, and submitted to the purifying
influence of sun and air.
Phosphorescence.—Phosphorescence has been observed in several
species of Oligochaeta. The most noteworthy instance of recent times
is the discovery by Giard of the small worm which he called Photodrilus
phosphoreus at Wimereux. During damp weather it was sufficient to
disturb the gravel upon the walks of a certain garden to excite the
luminosity of these Annelids. In all probability this species is identical
with one whose luminosity had been noticed some years before (in
1837) by Dugès, and named by him Lumbricus phosphoreus.
According to Giard, the light is produced by a series of glands in the
anterior region of the body debouching upon the exterior. This same
worm has since been found in other localities, where it has been shown
to be phosphorescent, by Moniez[422] and by Matzdorf[423]. It is
remarkable that in some other cases the luminosity, though it exists, is
very rarely seen. The exceedingly common Brandling (Allolobophora
foetida) of dunghills has been observed on occasions to emit a
phosphorescent light. This observation is due to Professor Vejdovsky,
[424] and was made "upon a warm July night of 1881." He thinks that
the seat of the light is in the secretion of the glandular cells of the
epidermis, for when this and other worms are handled the
phosphorescence clings to the fingers, as of course does the mucous
secretion voided by the glands.
Earthworms, on the other hand, have not such easy means of travelling
from country to country; the assistance which the cocoons in all
probability give to the smaller aquatic Oligochaeta cannot be held to be
of much importance in facilitating the migrations of the earthworms. In
the first place, the animals themselves are of greater bulk, and their
cocoons are naturally larger, and thus less easy of transportation.
Secondly, they are deposited as a rule upon dry land, where the
chances of their sticking to the feet of birds would be less; and thirdly,
they are often deposited deep in the ground, which is a further bar to
their being taken up. Another possible method by which earthworms
could cross the sea is by the help of floating tree-trunks; it is, however,
the case with many species that they are fatally injured by the contact
of salt water. There are, it is true, a few species, such as Pontodrilus of
the Mediterranean coast, which habitually live within reach of the
waves; but with the majority any such passage across the sea seems to
be impossible.[427] On the other hand, rivers and lakes are not a barrier
to the dispersal of the group. There are a few species, such as Allurus
tetragonurus, which live indifferently on land and in fresh water; and
even some habitually terrestrial species can be kept in water for many
weeks with impunity. A desert, on the other hand, is a complete barrier;
the animals are absolutely dependent upon moisture, and though in dry
weather the worms of tropical countries bury themselves deep in the
soil, and even make temporary cysts by the aid of their mucous
secretions, this would be of no avail except in countries where there
were at least occasional spells of wet weather.
The range of the existing genera and species is quite in keeping with
the suggestions and facts already put forward. But in considering them
we must first of all eliminate the direct influence of man. Every one who
studies this group of animals knows perfectly well that importations of
plants frequently contain accidentally-included earthworms; and there
are other ways in which the transference of species from one country to
another could be effected by man. There are various considerations
which enable us to form a fair opinion as to the probability of a given
species being really indigenous or imported. Oceanic islands afford one
test. There are species of earthworms known from a good many, but
with a few exceptions they are the same species as those which occur
on the nearest mainland; in those cases where it is supposed that the
animal inhabitants have reached an oceanic island by natural means of
transit, it is a rule that the species are different, and even the genera
are frequently different. That the bulk of them are the same seems to
argue either frequent natural communication with the mainland or a
great stability on the part of the species themselves. It is more probable
that the identity is in this case to be ascribed to accidental transference.
The Microdrili are, as a rule, small and aquatic in habit; they have
short sperm-ducts which open on to the exterior in the segment which
immediately follows that which contains the internal aperture; the
clitellum is only one cell thick; the egg-sacs are large; the epoch of
sexual maturity is at a fixed period. This group, to my thinking, includes
the Moniligastridae; although Professor Bourne has denied my
statement with regard to the clitellum, and in this case it is not so easy
to decide their systematic position.
I. Microdrili.
Fam. 1. Aphaneura.[430]—This name was originally given to the present
family by Vejdovsky; the family contains a single genus, Aeolosoma, of
which there are some seven species. The name is taken from, perhaps,
the most important though not the most salient characteristic of the
worms. The central nervous system appears in all of them to be
reduced to the cerebral ganglia, which, moreover, retain the embryonic
connexion with the epidermis. The worms of the genus are fairly
common in fresh waters of this country, and they have been also met
with in North and South America, and in Egypt, India, America, and
tropical Africa. They are all small, generally minute (1 to 2 mm. long),
and have a transparent body variously ornamented by brightly-coloured
oil globules secreted by the epidermis. These are reddish brown in A.
quaternarium, bright green in A. variegatum and A. headleyi, in the
latter even with a tinge of blue. In the largest species of the genus, A.
tenebrarum they are olive green. In A. niveum the spots are colourless,
and A. variegatum has colourless droplets mixed with the bright green
ones. Fig. 195 shows very well the general appearance of the species
of this genus. The body has less fixed outlines than in most worms, and
the movement of the creatures is not unsuggestive of a Planarian. As
the under side of the prostomium is ciliated, and as the movements of
these cilia conduce towards the general movement of the body, the
resemblance is intelligible. One species of Aeolosoma, at any rate, has
a curious habit which is unique in the Order. At certain times, for some
reason at present unknown, the worm secretes a chitinous capsule,
inside which it moves about with considerable freedom; these capsules
when first observed were mistaken for the cocoons of the worms; they
are really homologous with the viscid secretion which the common
earthworm throws off when in too dry soil, and with which it lines the
chamber excavated in the earth in which it is lying. The worms of this
genus multiply by fission; sexual reproduction has been but rarely
observed.
II. Megadrili.
There is no other definition which will distinguish this family from the
next two families, and even this definition is not absolutely distinctive.
There are Acanthodrilids which have a large number of chaetae in each
segment. The only difference is that in this case—in the genus
Plagiochaeta—the chaetae are implanted in twos; this is not the case in
the Perichaetidae. In all Perichaetidae that are known the sperm-ducts
open in common with the ducts of the spermiducal glands; they
generally open into them at some distance from the common external
pore. In Megascolex, Perichaeta, and Pleionogaster the nephridia are
of the diffuse type so widely spread among these worms, and the
spermiducal glands are lobate. Megascolex differs from the others in
the fact that in addition to the small scattered nephridia there are a pair
of large nephridia in each segment, and the chaetae do not form
absolutely continuous circles, but are interrupted above and below.
Pleionogaster has more than one gizzard but otherwise agrees with
Perichaeta; it is confined to the East. Perichaeta is tropical and occurs
—no doubt introduced—in Europe and America. Megascolex is Old
World only, and, like Perichaeta, Australian as well as Oriental. But
whereas Perichaeta is rare in the Australian region, Megascolex is
common there. Perionyx and Diporochaeta are the other genera which
it is possible to recognise. Both of them have paired nephridia, and
neither of them have intestinal caeca, a peculiarity which they both
share with Megascolex and Pleionogaster. Perionyx principally differs
from Diporochaeta in that the spermiducal glands are lobate, whereas
in the latter they are as in the Acanthodrilidae. Perionyx is Oriental;
Diporochaeta occurs in Australia and New Zealand.