
Hybrid text segmentation for Hungarian clinical records

György Orosz1,2, Attila Novák1,2, and Gábor Prószéky1,2


1 Pázmány Péter Catholic University, Faculty of Information Technology and Bionics, 50/a Práter street, Budapest, Hungary
2 MTA-PPKE Hungarian Language Technology Research Group, 50/a Práter street, Budapest, Hungary
{oroszgy,novak.attila,proszeky}@itk.ppke.hu

Abstract. Nowadays clinical documents are becoming widely available to researchers who aim to develop resources and tools that may help clinicians in their work. While several attempts exist for English medical text processing, there are only a few for other languages. Moreover, word and sentence segmentation tasks are commonly treated as simple engineering issues. In this study, we introduce the difficulties that arise during the segmentation of Hungarian clinical records, and describe a complex method that results in a normalized and segmented text. Our approach is a hybrid combination of a rule-based and an unsupervised statistical solution. The presented system is compared with other algorithms that are available and commonly used. These fail to segment clinical text (all of them reach F-scores below 75%), while our method scores above 90%. This means that only the hybrid tool described in this study can be used for the segmentation of Hungarian clinical texts in practical applications.

Keywords: text segmentation, clinical records, sentence boundary detection, log-likelihood ratios

1 Introduction

Hospitals produce a large amount of clinical records containing valuable information about patients: these documents might be utilized to help doctors make better diagnoses, but they are generally used only for archiving and documentation purposes. To be able to extract information from such textual data, they should be properly parsed and processed, thus proper text segmentation3 methods are needed. However, although they perform well on the general domain, existing preprocessing algorithms have diverse problems in the case of Hungarian medical records. These difficulties arise because documents produced in the clinical environment are very noisy. The typical sources of errors are the following: a) typing errors (i.e. mistyped tokens, nonexistent strings of falsely concatenated
3 While the term text segmentation is widely used for diverse tasks, in our work it is used as the process of dividing text into word tokens and sentences.


words), b) nonstandard usage of Hungarian. While errors of the first type can generally be corrected with a rule-based tool, others need advanced methods. For English, there are many solutions that deal with such noisy data, but for Hungarian only a few attempts have been made [24, 25]. Studies reporting on the processing of clinical texts generally do not include a description of the segmentation part. Furthermore, the existence of a reliable segmentation tool is essential, since error propagation in a text-processing chain is an important problem. This is even more notable in the case of clinical records, where noise is generally present at various levels of processing. In this study, a hybrid approach to the normalization and segmentation of Hungarian clinical records is presented. The method consists of three phases: first, a rule-based clean-up step is performed; then tokens are partially segmented; finally, sentence boundaries are determined. It is shown below how these processing units are built upon one another. Then the key elements of the sentence boundary detection (SBD) algorithm are described. The presented system is evaluated against a gold standard corpus, and is compared with other tools as well.

2 Related work

2.1 Previous work on text segmentation

The task of text segmentation is often composed of subtasks: normalization of noisy text (when necessary), segmentation of words, and sentence boundary detection. The latter may involve the identification of abbreviations as well. Furthermore, there are attempts (e.g. [32]) where text segmentation and normalization are treated as a unified tagging problem, and there are some that handle the problem with rule-based solutions (such as [13, 8]). Although the segmentation of tokens is generally treated as a simple engineering problem4 that aims to split punctuation marks from word forms, SBD is a much more researched topic. As Read et al. summarize [19], sentence segmentation approaches fall into three classes: 1) rule-based methods, which employ domain- or language-specific knowledge (such as abbreviations); 2) supervised machine learning approaches, which, since they are trained on a manually annotated corpus, may perform poorly on other domains; 3) unsupervised learning methods, which extract their knowledge from raw unannotated data. Riley [21] presents an application of decision-tree learners for disambiguating full stops, utilizing mainly lexical features, such as word length and case, and the probabilities of a word being sentence-initial or sentence-final. The SATZ system [16] is a framework which makes it possible to employ machine learning algorithms that are based not just on contextual clues but on PoS features as well. The utilization of the maximum entropy learning approach for SBD was introduced by Reynar and Ratnaparkhi [20]. Later, Gillick presented [7] a similar approach
4 In the case of alphabetic writing systems.


for English that relies on support vector machines, resulting in state-of-the-art performance. In [13], Mikheev utilizes a small set of rules that are able to detect sentence boundaries (SB) with high accuracy. In another system presented by him [12], the rule-based tokenizer described above is integrated into a PoS-tagging framework, which makes it possible to assign labels to punctuation marks as well: these can be labeled as a sentence boundary, an abbreviation or both. Punkt [9] is a tool presented by Kiss and Strunk that employs only unsupervised machine learning techniques: the scaled log-likelihood ratio method is used for deciding whether a (word, period) pair is a collocation or not. There are some freely available systems specific to Hungarian: Huntoken [8] is an open source tool that is mainly based on Mikheev's rule-based system; and magyarlanc [33] is a full natural language processing chain that contains an adapted version of MorphAdorner's tokenizer [10] and sentence splitter.

2.2 Processing medical text

For Hungarian clinical records, Siklósi et al. [24] presented a baseline system to correct spelling errors found in clinical documents. Their algorithm tries to fix a subset of possible spelling errors, relying on various types of language models. An improved version of their system was introduced recently [25], but sentence boundary detection is still not touched on in it. Tokenization and SBD are currently not a hot area in natural language processing, but there are some attempts at dealing with the segmentation of medical texts. Sentence segmentation methods employed by clinical text processing systems fall into two classes: many of them apply rule-based settings (e.g. [31]), while others employ supervised machine learning algorithms (such as [1, 3, 22, 27, 29]). Tools falling into the latter category mainly use maximum entropy or CRF learners, thus handcrafted training material is essential. This corpus is either a domain-specific one or derived from the general domain. In practice, training data from a related domain yields better performance, while others [28] argue that the domain of the training corpus is not critical.

3 Resources and metrics

3.1 Clinical records

Our aim was to develop a high-performance text segmentation algorithm for Hungarian clinical records, since the resulting output is intended to be used by shallow parsing tools, which may be parts of a larger text processing system. To ensure the high quality of the processed data, input texts had to be normalized, and a gold standard corpus was created for testing purposes. The normalization process had to deal with the following errors5:
5 Text normalization is performed using regular expressions, which are not detailed here.


1. doubly converted characters, such as >,
2. typewriter problems (e.g. 1 and 0 written as l and o),
3. dates and date intervals in various formats, with or without the necessary whitespaces (e.g. 2009.11.11, 06.01.08),
4. missing whitespace between tokens, which usually introduced various types of errors:
   (a) measurements erroneously attached to quantities (e.g. 0.12mg),
   (b) lack of whitespace around punctuation marks (e.g. törőközegek.Fundus: ép.),
5. various formulations of numerical expressions.

(A sketch of the kind of rules involved is given at the end of this subsection.)

In order to investigate possible pitfalls of the algorithm being developed, the gold standard data set was split into two sets of equal size: a development and a test set, containing 1320 and 1310 lines respectively. The first part was used to identify typical problems in the corpus and to develop the segmentation methods, while the second part was used to verify our results. The comparison of clinical texts and a corpus of general Hungarian (the Szeged Corpus [4]) reveals the following differences:

1. 2.68% of the tokens found in the clinical corpus sample are abbreviations, while the same ratio for general Hungarian is only 0.23%;
2. sentences taken from the Szeged Corpus almost always end in a sentence-final punctuation mark (98.96%), while these are entirely missing from clinical statements in 48.28% of the cases;
3. sentence-initial capitalization is a general rule in Hungarian (99.58% of the sentences are formulated properly in the Szeged Corpus), but its usage is not common among clinicians (12.81% of the sentences start with a word that is not capitalized);
4. the amount of numerical data is significant in medical records (13.50% of sentences consist exclusively of measurement data and abbreviations), while text taken from the general domain rarely contains statements that are full of measurements.

These special properties together make the creation of a specialized text segmentation algorithm for clinical texts necessary.
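As footnote 5 notes, normalization was implemented with regular expressions that are not detailed in the paper. The following minimal sketch is our own illustration, not the authors' rule set; it shows the kind of patterns that could handle error types 2 and 4 above:

```python
import re

def normalize(text):
    """Illustrative normalization rules of the kind footnote 5 refers to.
    These regexes are our own reconstruction, not the authors' actual rules."""
    # type 2: typewriter habits, e.g. the letters o/l typed inside numbers
    text = re.sub(r'(?<=\d)[oO](?=\d)', '0', text)   # "2o09" -> "2009"
    text = re.sub(r'(?<=\d)l(?=\d)', '1', text)      # "1l0"  -> "110"
    # type 4a: measurement units glued to quantities, e.g. "0.12mg"
    text = re.sub(r'(\d+(?:[.,]\d+)?)(mg|ml|mmol|g)\b', r'\1 \2', text)
    # type 4b: missing whitespace after punctuation glued to the next word
    text = re.sub(r'([.:;])(?=[A-ZÁÉÍÓÖŐÚÜŰ][a-záéíóöőúüű])', r'\1 ', text)
    return text

print(normalize("dózis 0.12mg.Fundus: ép."))   # -> "dózis 0.12 mg. Fundus: ép."
```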

3.2 Metrics used for comparison

There are no metrics that are commonly used across text segmentation tasks. Researchers specializing in machine learning approaches prefer to calculate precision, recall and F-measure, while others, typically having a background in speech recognition, prefer to use NIST and Word Error Rate. Recently, Read et al. have reviewed [19] the current state of the art in sentence boundary detection, and have proposed a unified metric that makes it possible to compare the performance of different approaches. Their method allows for the detection of sentence boundaries after any kind of character: characters can be labeled as sentence-final or non-sentence-final, and thus accuracy can be used for comparison.


In our work, the latter unified metric was generalized in order to be employed for the unified segmentation problem: viewing the text as a sequence of characters and empty strings between each character, the task can be treated as a classification problem, where all entities (characters and the empty strings between them) are labeled with the following tags: T if the entity is a token boundary, S if it is a sentence boundary, and None if the entity is neither. The usage of this classification scheme enables us to use accuracy as a measure. Moreover, one can calculate precision, recall and F-measure as well. In our work, we use accuracy as described above to monitor the progress of the whole segmentation task. Since it is important to measure the two subtasks separately, precision and recall values are calculated for both word tokenization and SBD. Precision becomes more important than recall during the automatic sentence segmentation of clinical documents: an erroneously split sentence may cause information loss, while statements might still be extracted from multi-sentence text. Thus the F0.5 measure is employed for SBD, while word tokenization is evaluated with the balanced F1 measure.
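A small sketch (our own illustration) of how precision, recall and the F_beta score can be computed over such label sequences, where every entity carries a T, S or None label:

```python
def prf(gold, predicted, label, beta=1.0):
    """Precision, recall and F_beta for one boundary label ('T' or 'S'),
    given gold and predicted label sequences over the same entities
    (characters and the positions between them)."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f

# F0.5 weights precision more heavily than recall, as argued above for SBD
gold = ['T', None, 'S', None, 'T', 'S']
pred = ['T', None, 'S', None, None, 'T']
print(prf(gold, pred, label='S', beta=0.5))   # -> (1.0, 0.5, 0.833...)
```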

4 The proposed method

Tasks not examined previously need a baseline method that can be used for comparison. In our case, the method detailed below is compared with others. This algorithm was found to perform well with unambiguous boundaries, thus it was kept as the base of the proposed tool. Extensions to the algorithm, which made it possible to deal with more complicated cases, are detailed in section 4.2.

4.1 Baseline word tokenization and sentence segmentation

The baseline rule-based algorithm is composed of two parts. The first performs the word tokenization (BWT), while the second marks the sentence boundaries (BSBD). One principle that was kept in mind for the word segmentation process was not to detach periods from the ends of words. This was necessary since BWT was not intended to recognize abbreviations: this task was left to the sentence segmentation process. The tokenization algorithm is not detailed here, since it performs tasks generally implemented in standard tokenizers. The subsequent system in the processing chain is BSBD: to avoid information loss (as described in section 3.2), it minimizes the possibility of making false-positive errors. Sentences are only split if there is high confidence of success. These cases are when:

1. a separate period or exclamation mark occurs before another token which is not a punctuation mark6;
6 Question marks are not considered as sentence-final punctuation marks, since they generally indicate a questionable finding in clinical texts.


2. a line starts with a full date and is followed by other words (the last whitespace character before the date is marked as an SB);
3. a line begins with the name of an examination followed by a semicolon and a sequence of measurements.

(A sketch of rules 1 and 2 is given at the end of this subsection.)

Pipelining these algorithms yields 100% precision and 73.38% recall for the token segmentation task7, while the corresponding values for sentence boundary detection are 98.48% and 42.60% respectively. The latter values mean that less than half of the sentence boundaries are discovered, which obviously needs to be improved. An analysis of errors produced by the whole process showed that the tokenization process has difficulties only with separating sentence-final periods. A detailed examination suggests that an improvement in sentence boundary detection can result in higher recall scores for word tokenization as well. Because of this, in what follows we concentrate only on improving the BSBD module.
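A minimal sketch (our own reconstruction, not the authors' implementation) of how rules 1 and 2 above could be expressed; rule 3 is omitted, since it would require a lexicon of examination names:

```python
import re

SENT_FINAL = {'.', '!'}                   # question marks excluded (footnote 6)
PUNCT = set('.,;:!?()[]')
FULL_DATE = re.compile(r'^\d{4}\.\s?\d{1,2}\.\s?\d{1,2}\.?')   # e.g. "2009.11.11"

def boundary_after(token, next_token):
    """Rule 1: a free-standing period or exclamation mark is a sentence
    boundary if the next token is not a punctuation mark."""
    return token in SENT_FINAL and next_token not in PUNCT

def line_initial_boundary(line):
    """Rule 2: a line starting with a full date followed by further material
    starts a new sentence (the boundary is placed before the date)."""
    m = FULL_DATE.match(line.lstrip())
    return m is not None and len(line.lstrip()) > m.end()
```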

4.2 Improvements on sentence boundary detection

Usually, there are two kinds of indicators of sentence boundaries: the first is when a period (•) is attached to a word: in this case a sentence boundary is found for sure only if the token is not an abbreviation; the second is when a word starts with a capital letter and is neither part of a proper name nor of an acronym. Unfortunately, these indicators are not directly applicable in our case, since medical abbreviations are diverse and clinicians usually introduce new ones that are not part of the standard. Furthermore, Latin words and abbreviations are sometimes capitalized by mistake, and there are subclauses that start with capitalized words. Moreover, as shown above, several sentence boundaries lack both of these indicators. However, these features can still be used in SBD. A full list of abbreviations is not necessary: it is enough to find evidence for the separateness of a word and the attached period to be able to mark a sentence boundary. We found that the utilization of the scaled log-likelihood ratio method worked well in similar scenarios [9], thus its possible adaptation was examined. The algorithm was first introduced in [6], where it was used for identifying collocations. Later it was adapted for the sentence segmentation task by Kiss and Strunk [9]. Their idea was to handle abbreviations as collocations of words and periods. In practice, this is formulated via a null hypothesis (1) and an alternative one (2).

H0: P(•|w) = p = P(•|¬w)  (1)

HA: P(•|w) = p1 ≠ p2 = P(•|¬w)  (2)

log λ = -2 log (L(H0) / L(HA))  (3)

(1) expresses the independence of a (word, •) pair, while (2) formulates that their co-occurrence is not just by chance. Based on these hypotheses, log λ is
7 The values that are presented in this section were measured on the development set.


calculated (3), which is asymptotically χ2-distributed, thus it could be applied as a statistical test [6]. Kiss and Strunk found that pure log λ performs poorly8 in abbreviation detection scenarios, and thus they introduced scaling factors [9]. In that way, their approach loses the asymptotic relation to χ2 and becomes a filtering algorithm. Since our aim is to find those candidates that co-occur only by chance, the scaled log-likelihood ratio method was used with an inverse score (iscore = 1/log λ). Several experiments were performed on the development set, which showed that the scaling factors described below give the best performance. Our case contrasts with [9]: counts and count ratios did not properly indicate whether a token and the period are related in a clinical record, since the frequencies of abbreviation types are relatively low. As the original work proposes, a good indicator of abbreviations is their length (len): shorter tokens tend to be abbreviations, while longer ones do not. Formulating this observation, a function is required that penalizes words that only have a few characters, while increasing the scores of others. Having a medical abbreviation list of almost 200 elements9, we found that more than 90% of the abbreviations are shorter than three characters. This fact encouraged us to set the first scaling factor as in (4). This enhancement is not just able to boost the score of a pair, but it can decrease it as well.

S_length(iscore) = iscore · exp(len/3 - 1)  (4)
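The computation of log λ and the length-scaled inverse score of (4) can be sketched from corpus counts as follows; this is our own reconstruction based on Dunning's formulation, not the authors' code:

```python
import math

def log_lambda(c_w, c_period, c_w_period, n):
    """Dunning-style log-likelihood ratio for a (word, period) pair:
    c_w occurrences of the word, c_period all periods, c_w_period periods
    directly attached to the word, n tokens in the corpus."""
    def log_l(k, count, x):
        x = min(max(x, 1e-12), 1 - 1e-12)        # guard against log(0)
        return k * math.log(x) + (count - k) * math.log(1 - x)

    p = c_period / n                              # H0: P(.|w) = P(.|~w) = p
    p1 = c_w_period / c_w                         # HA: P(.|w)
    p2 = (c_period - c_w_period) / (n - c_w)      # HA: P(.|~w)
    null = log_l(c_w_period, c_w, p) + log_l(c_period - c_w_period, n - c_w, p)
    alt = log_l(c_w_period, c_w, p1) + log_l(c_period - c_w_period, n - c_w, p2)
    return 2 * (alt - null)                       # = -2 log(L(H0)/L(HA))

def s_length(word, c_w, c_period, c_w_period, n):
    """Inverse score scaled by word length, as in (4): short tokens (likely
    abbreviations) are penalized, longer ones are boosted."""
    iscore = 1.0 / max(log_lambda(c_w, c_period, c_w_period, n), 1e-12)
    return iscore * math.exp(len(word) / 3 - 1)
```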

Recently, HuMor [17, 18], a finite-state morphological analyzer (MA) for Hungarian, was extended with the content of a medical dictionary [14]. Therefore this new version of the tool could be used to enhance the sentence segmentation algorithm. HuMor is able to analyze possible abbreviations as well as full words, thus its output is utilized by an indicator function (5). Since the output of an MA is very strong evidence, a scaling component based on it needs larger weights compared to the others. This led us to formulate (6).

indicator_morph(word) = 1 if the word has an analysis as a known full word; -1 if it has an analysis as a known abbreviation; 0 otherwise  (5)

S_morph(iscore) = iscore · exp(indicator_morph · len²)  (6)

Another indicator was discovered during the analysis of the data: a hyphen is generally not present in abbreviations but rather occurs in full words. This led us to the third factor (7), where indicator_hyphen is 1 only if the word contains a hyphen, otherwise it is 0.
8 In terms of precision.
9 The list is gathered with an automatic algorithm that uses word shape properties and frequencies, then the most frequent elements are manually verified and corrected.


S_hyphen(iscore) = iscore · exp(indicator_hyphen · len)  (7)

Scaled iscore is calculated for all (word, •) pairs that are not followed by another punctuation mark. If this value is higher than 1.5, the period is regarded as a sentence boundary and it is detached. Investigating the performance of the method described above, it was pipelined after the BSBD module, producing a recall of 77.14% and a precision of 97.10%. In order to utilize the second source of information (word capitalization), another component was introduced. It deals with words that begin with capital letters. Good SB candidates among these are the ones that do not follow a non-sentence-terminating10 punctuation mark and are not part of a named entity. Sequences of capitalized words are omitted, and the rest is processed with HuMor: words known to be common are marked as the beginning of a sentence. In our case, common words are those that do not have a proper noun analysis. Chaining only this component with BSBD results in a recall of 65.46% and a precision of 96.37%.
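Putting the three scaling factors together with the 1.5 threshold, the boundary decision can be sketched as below; the analyzer object and its two methods are hypothetical placeholders for a HuMor wrapper, not an existing API:

```python
import math

def indicator_morph(word, analyzer):
    """Eq. (5): +1 for a known full word, -1 for a known abbreviation,
    0 otherwise. `analyzer` is a hypothetical morphological-analyzer wrapper."""
    if analyzer.is_known_full_word(word):
        return 1
    if analyzer.is_known_abbreviation(word):
        return -1
    return 0

def period_is_sentence_boundary(word, iscore, analyzer, threshold=1.5):
    """Scale the inverse log-likelihood score with eqs. (4), (6) and (7)
    and detach the period as a sentence boundary above the 1.5 threshold."""
    n = len(word)
    score = iscore * math.exp(n / 3 - 1)                            # S_length
    score *= math.exp(indicator_morph(word, analyzer) * n ** 2)     # S_morph
    score *= math.exp((1 if '-' in word else 0) * n)                # S_hyphen
    return score > threshold
```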

5 Evaluation

Our hybrid algorithm was developed using the development set, thus it is evaluated against the rest of the data.
Table 1. Accuracy of the input text and the baseline segmented one

                         Accuracy
Preprocessed             97.55%
Segmented (baseline)     99.11%

Accuracy values in Table 1 can be used as a good basis for the comparison of the overall segmentation task, but one must note that this metric is not well balanced. Its values are relatively high even for the preprocessed text, thus the increase in accuracy needs to be measured. Relative error rate reduction scores, calculated over the baseline method, are provided in Table 2. We investigated each part of the sentence segmentation algorithm and examined their combination as well. The unsupervised SBD algorithm is marked with LLR11, while the second component is indicated by the term CAP. A more detailed analysis of the SBD task is made by comparing precision, recall and F0.5 values (see Table 3). Each component significantly increases the recall, while precision decreases only slightly. The combined hybrid approach12 brings a significant improvement over the well-established baseline, but a comparison with other available tools is also presented.
10 Sentence-terminating punctuation marks are the period and the exclamation mark for this task.
11 Referring to the term log-likelihood ratio.
12 It is the composition of the BWT, BSBD, LLR and CAP components.

Table 2. Error rate reduction over the accuracy of the baseline method

          Error rate reduction
LLR       58.62%
CAP        9.25%
LLR+CAP   65.50%

Table 3. Evaluation of the proposed sentence segmentation algorithm compared with the baseline

           Precision   Recall   F0.5
Baseline   96.57%      50.26%   81.54%
LLR        95.19%      78.19%   91.22%
CAP        94.60%      71.56%   88.88%
LLR+CAP    93.28%      86.73%   91.89%

Freely available tools for segmenting Hungarian texts are magyarlanc and Huntoken. The latter can be slightly configured by providing a set of abbreviations, thus two versions of Huntoken are compared: one which employs the built-in set of general Hungarian abbreviations (HTG), and another one that extends the previous list with the medical abbreviations described in section 4.2 (HTM). Punkt [9] and OpenNLP [2], a popular implementation of the maximum entropy SBD method [20], were involved in the comparison as well. Since the latter tool employs a supervised learning algorithm, the Szeged Corpus was used as its training material.
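For reference, the unsupervised Punkt baseline can be reproduced with its NLTK implementation; the following minimal sketch is our own setup (the file name is a placeholder), not necessarily the configuration used in this comparison:

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# A large, unannotated sample of the target-domain text; Punkt learns
# abbreviation and collocation statistics from it in an unsupervised way.
raw_training_text = open("clinical_sample.txt", encoding="utf-8").read()

tokenizer = PunktSentenceTokenizer(raw_training_text)   # trains on construction
for sentence in tokenizer.tokenize("Fundus: ép. Vérnyomás 120/80 Hgmm."):
    print(sentence)
```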
Table 4. Comparison of the proposed hybrid SBD method with competing ones

                Precision   Recall   F0.5
magyarlanc      72.59%      77.68%   73.55%
HTG             44.73%      49.23%   45.56%
HTM             43.19%      42.09%   42.97%
Punkt           58.78%      45.66%   55.59%
OpenNLP         52.10%      96.30%   57.37%
Hybrid system   93.28%      86.73%   91.89%

Values in Table 4 show that general segmentation methods failed on sentences of Hungarian clinical records. It is interesting that the maxent approach has high recall, but the boundaries marked by it are false positives in almost half of the cases. Rules provided by magyarlanc seem to be robust, but the overall performance inhibits its application to clinical texts. The other tools not only provide low recall, but their precision values are around 50%, which is too low for practical purposes. Our approach mainly focuses on the sentence segmentation task, but an improvement in word tokenization is expected as well. Better recognition of words that are not abbreviations results in a higher recall (see Table 5), while it does not significantly decrease precision.


Table 5. Comparing tokenization performance of the new tool with the baseline one

                Precision   Recall   F1
Baseline        99.74%      74.94%   85.58%
Hybrid system   98.54%      95.32%   96.90%

6 Conclusion

We presented a hybrid approach to tokenization and sentence boundary detection that consists of several rule-based algorithms and an unsupervised machine learning algorithm. Owing to the special properties of Hungarian clinical texts, emphasis was laid on the direct detection of sentence boundaries. The method performs word tokenization first, partially segmenting tokens. Attached periods are left untouched in order to help the subsequent sentence segmentation process. The SBD method described above is based mainly on the calculation of log λ, which was adapted to the task in order to improve the overall performance of our tool. A unique property of the presented algorithm is that it is able to incorporate the knowledge of a morphological analyzer as well, increasing the recall of sentence segmentation. The segmentation method successfully deals with several sorts of imperfect sentence boundaries. As described in section 5, the given algorithm performs better in terms of precision and recall than competing ones. Only magyarlanc reached an F0.5-score above 60%, which is still too low for practical applications. The method presented in this study achieved an F0.5-score of 92%, which allows the tool to be used in the Hungarian clinical domain.

Acknowledgement
We would like to thank Borbála Siklósi and Nóra Wenszky for their comments on preliminary versions of this paper. This work was partially supported by TÁMOP 4.2.1.B 11/2/KMR-2011-0002 and TÁMOP 4.2.2/B 10/120100014.

References
1. Emilia Apostolova, David S. Channin, Dina Demner-Fushman, J. Furst, S. Lytinen, and D. Raicu. Automatic segmentation of clinical texts. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. Annual International Conference of the IEEE, pages 5905-5908. IEEE, 2009.
2. Jason Baldridge, Thomas Morton, and Gann Bierner. The OpenNLP maximum entropy package, 2002.
3. Paul S. Cho, Ricky K. Taira, and Hooshang Kangarloo. Text boundary detection of medical reports. In Proceedings of the AMIA Symposium, page 998. American Medical Informatics Association, 2002.
4. Dóra Csendes, János Csirik, and Tibor Gyimóthy. The Szeged Corpus: A POS tagged and syntactically annotated Hungarian natural language corpus. In Proceedings of the 5th International Workshop on Linguistically Interpreted Corpora, pages 19-23, 2004.



5. Rebecca Dridan and Stephan Oepen. Tokenization: returning to a long solved problem – a survey, contrastive experiment, recommendations, and toolkit. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 378-382. Association for Computational Linguistics, 2012.
6. Ted Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
7. Dan Gillick. Sentence boundary detection and the problem with the U.S. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, pages 241-244. Association for Computational Linguistics, 2009.
8. Péter Halácsy, András Kornai, László Németh, András Rung, István Szakadát, and Viktor Trón. Creating open language resources for Hungarian. In Proceedings of the Language Resources and Evaluation Conference, 2004.
9. Tibor Kiss and Jan Strunk. Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4):485-525, 2006.
10. Amit Kumar. Monk project: Architecture overview. In Proceedings of JCDL 2009 Workshop: Integrating Digital Library Content with Computational Tools and Services, 2009.
11. S. M. Meystre, G. K. Savova, K. C. Kipper-Schuler, and J. F. Hurdle. Extracting information from textual documents in the electronic health record: a review of recent research. Yearbook of Medical Informatics, pages 128-144, 2008.
12. Andrei Mikheev. Tagging sentence boundaries. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 264-271. Association for Computational Linguistics, 2000.
13. Andrei Mikheev. Periods, capitalized words, etc. Computational Linguistics, 28(3):289-318, 2002.
14. György Orosz, Attila Novák, and Gábor Prószéky. Magyar nyelvű klinikai rekordok morfológiai egyértelműsítése. In IX. Magyar Számítógépes Nyelvészeti Konferencia, pages 159-169, Szeged, 2013. Szegedi Tudományegyetem.
15. David D. Palmer and Marti A. Hearst. Adaptive sentence boundary disambiguation. In Proceedings of the Fourth Conference on Applied Natural Language Processing, pages 78-83. Association for Computational Linguistics, 1994.
16. David D. Palmer and Marti A. Hearst. Adaptive multilingual sentence boundary disambiguation. Computational Linguistics, 23(2):241-267, 1997.
17. Gábor Prószéky. Industrial applications of unification morphology. In Proceedings of the Fourth Conference on Applied Natural Language Processing, page 213, Morristown, NJ, USA, 1994.
18. Gábor Prószéky and Attila Novák. Computational morphologies for small Uralic languages. In Inquiries into Words, Constraints and Contexts, pages 150-157, Stanford, California, 2005.
19. Jonathon Read, Rebecca Dridan, Stephan Oepen, and Lars Jørgen Solberg. Sentence boundary detection: A long solved problem? In 24th International Conference on Computational Linguistics (Coling 2012), India, 2012.
20. Jeffrey C. Reynar and Adwait Ratnaparkhi. A maximum entropy approach to identifying sentence boundaries. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 16-19. Association for Computational Linguistics, 1997.
21. Michael D. Riley. Some applications of tree-based modelling to speech and language. In Proceedings of the Workshop on Speech and Natural Language, pages 339-352. Association for Computational Linguistics, 1989.


22. Guergana K. Savova, James J. Masanz, Philip V. Ogren, Jiaping Zheng, Sunghwan Sohn, Karin Kipper Schuler, and Christopher G. Chute. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. Journal of the American Medical Informatics Association, 17(5):507-513, 2010.
23. Helmut Schmid. Unsupervised learning of period disambiguation for tokenisation. Technical report, 2000.
24. Borbála Siklósi, György Orosz, Attila Novák, and Gábor Prószéky. Automatic structuring and correction suggestion system for Hungarian clinical records. In Guy De Pauw, Gilles-Maurice De Schryver, Mike L. Forcada, Francis M. Tyers, and Peter Waiganjo Wagacha, editors, 8th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, pages 29-34, 2012.
25. Borbála Siklósi, Attila Novák, and Gábor Prószéky. Context-aware correction of spelling errors in Hungarian medical documents. In Adrian-Horia Dediu, Carlos Martín-Vide, Ruslan Mitkov, and Bianca Truthe, editors, Statistical Language and Speech Processing, volume 7978 of Lecture Notes in Computer Science, pages 248-259. Springer Berlin Heidelberg, 2013.
26. Mark Stevenson and Robert Gaizauskas. Experiments on sentence boundary detection. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 84-89. Association for Computational Linguistics, 2000.
27. Ricky K. Taira, Stephen G. Soderland, and Rex M. Jakobovits. Automatic structuring of radiology free-text reports. Radiographics, 21(1):237-245, 2001.
28. Katrin Tomanek, Joachim Wermter, and Udo Hahn. A reappraisal of sentence and token splitting for life sciences documents. Studies in Health Technology and Informatics, 129(Pt 1):524-528, 2006.
29. Katrin Tomanek, Joachim Wermter, and Udo Hahn. Sentence and token splitting based on conditional random fields. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, pages 49-57, 2007.
30. J. O. Wrenn, P. D. Stetson, and S. B. Johnson. An unsupervised machine learning approach to segmentation of clinician-entered free text. AMIA Annual Symposium Proceedings, pages 811-815, 2007.
31. Hua Xu, Shane P. Stenner, Son Doan, Kevin B. Johnson, Lemuel R. Waitman, and Joshua C. Denny. MedEx: a medication information extraction system for clinical narratives. Journal of the American Medical Informatics Association, 17(1):19-24, 2010.
32. Conghui Zhu, Jie Tang, Hang Li, Hwee Tou Ng, and Tiejun Zhao. A unified tagging approach to text normalization. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, pages 688-695, 2007.
33. János Zsibrita, Veronika Vincze, and Richárd Farkas. magyarlanc: A toolkit for morphological and dependency parsing of Hungarian. In Proceedings of Recent Advances in Natural Language Processing 2013, pages 763-771, Hissar, Bulgaria, 2013. Association for Computational Linguistics.
