
Acknowledgement

I would like to thank Professor Dr. Theptchai for accepting me as an intern during the previous summer and for supporting me with resources and encouragement. I also want to thank P’Chez for explaining the concepts and ideas of Natural Language Processing with extraordinary patience. Finally, I would like to thank my family and all of my friends for supporting me all the time.

Table of Contents

1. Introduction
2. Objectives
3. List of Works Done
4. Noun-phrase Translation
    4.1. Noun-phrase Removing
    4.2. Noun-phrase Replacing
    4.3. Actual Replacement of NP with Translated Results
5. Web-based Voting System
6. Tree Transducer
    6.1. Naive Tree Transducer
    6.2. Probabilistic Tree Transducer
7. Ontology-based Patient Recommendation System
8. Glossary

1. Introduction
This document reports all the work accomplished during the internship period (16th March 2009 to 22nd May 2009). The internship took place at the National Electronics and Computer Technology Center (NECTEC). Each piece of work is placed in its own section and explained in terms of inputs, procedure, and outputs. The report takes a diagram-oriented approach so that readers can grasp the main idea of each piece of work easily.

2. Objectives
The main objectives of working as an intern at NECTEC are as follows.

• To gain practical knowledge of Natural Language Processing.
• To explore text processing not only in theory but also in applications and research.
• To gain experience in a real working environment.
• To apply the knowledge learned in the classroom to useful applications that may benefit society.
• To fulfill the requirements for completing a Bachelor's degree.

3. List of Works Done

• Noun-phrase Translation
    o Noun-phrase Removing
    o Noun-phrase Replacing with generic patterns
    o Actual Replacement of NP with Translated Results
• Web-based Voting System
• Tree Transducer
    o Naive Tree Transducer
    o Probabilistic Tree Transducer
• Ontology-based Patient Recommendation System
    o Mapping between OWL concepts and Java Beans
    o Applying Java Beans to Instantiate
    o Adding Inference Rules

4. Noun-phrase Translation
Motivation

Machine translation between two languages involves many considerations, such as word reordering and classification. When translating from English to Thai, we must be aware of the structural differences between the two languages. Even a simple sentence may contain more than one noun phrase.

In some cases, a single English noun is translated into more than one word in the other language. With modules dedicated to noun-phrase translation, a machine translation system can produce more accurate results.
4.1. Noun-phrase Removing

4.1.1. Abstract Description

The core idea of this script is to remove each noun phrase from the source-language (English) sentence and replace it with the symbol [X]. Whenever a phrase in the source is removed and marked as [X], the corresponding phrase in the target-language document must be handled in the same way. The input resources for this script are:

• Parallel corpus
• Alignment and POS information (from the Stanford parser)

4.1.2. Input Documents

Src and trg denote the two languages between which we want to translate automatically. POS stands for part of speech; the POS data gives the grammatical tag of each word in a sentence. The alignment data tells which words at particular positions in the source language are translated into which words at particular positions in the target language. In the diagram, the two words labeled ‘a’ are translated into the first word of the Thai sentence.
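
As a minimal sketch of how such alignment data could be held in memory, the snippet below assumes a hypothetical one-line text format made of "source-target" index pairs; the report does not show the real file layout. The function name parse_alignment and the sample line are illustrative only. In the sample, source words 0 and 1 both map to target word 0, which mirrors the two words labeled ‘a’ in the diagram.

```python
from collections import defaultdict


def parse_alignment(line):
    """Turn 'src-trg' index pairs into {source_index: [target_index, ...]}.

    The 'i-j' pair format is an assumption made for this sketch.
    """
    links = defaultdict(list)
    for pair in line.split():
        src, trg = pair.split("-")
        links[int(src)].append(int(trg))
    return dict(links)


# Hypothetical alignment line: source words 0 and 1 share target word 0.
alignment = parse_alignment("0-0 1-0 2-1")
```

A dictionary keyed by source position makes it cheap to ask which target words belong to a given source span.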

4.1.3. Procedure

The program starts by extracting the noun phrases from an English sentence in the English-to-Thai parallel corpus. This step needs the source-language sentence and its part-of-speech information. Second, it loads the alignment data from the alignment input document into memory. Third, using these data, it replaces each noun phrase or noun in the English sentence with [X], and then does the same for the aligned words in the Thai sentence. Finally, the program writes the results into two output files.
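
The steps above can be sketched roughly as follows. This is not the original script: the set of tags treated as noun-phrase material, the function names, the in-memory data layout and the sample sentence pair are all assumptions chosen for illustration, and the sketch handles one noun phrase per call.

```python
# Tags treated as noun-phrase material (Penn Treebank style; an assumption).
NP_TAGS = {"DT", "JJ", "NN", "NNS", "NNP", "NNPS"}


def find_noun_phrases(pos_tags):
    """Return (start, end) spans, end exclusive, of tag runs that contain a noun."""
    spans, start = [], None
    for i, tag in enumerate(list(pos_tags) + ["<END>"]):
        if tag in NP_TAGS:
            if start is None:
                start = i
        elif start is not None:
            if any(t.startswith("NN") for t in pos_tags[start:i]):
                spans.append((start, i))
            start = None
    return spans


def remove_noun_phrase(src_tokens, trg_tokens, span, alignment):
    """Replace one source span, and the target words aligned to it, with [X]."""
    start, end = span
    new_src = src_tokens[:start] + ["[X]"] + src_tokens[end:]
    aligned = sorted({t for s in range(start, end) for t in alignment.get(s, [])})
    if not aligned:
        return new_src, list(trg_tokens)
    # Simplification: everything between the first and last aligned target
    # word is collapsed into a single [X].
    lo, hi = aligned[0], aligned[-1]
    new_trg = trg_tokens[:lo] + ["[X]"] + trg_tokens[hi + 1:]
    return new_src, new_trg


# Hypothetical sentence pair: "the red car stops" / "รถ สีแดง หยุด".
src = ["the", "red", "car", "stops"]
pos = ["DT", "JJ", "NN", "VBZ"]
trg = ["รถ", "สีแดง", "หยุด"]
alignment = {1: [1], 2: [0], 3: [2]}  # source index -> target indices

spans = find_noun_phrases(pos)
new_src, new_trg = remove_noun_phrase(src, trg, spans[0], alignment)
```

Here the noun phrase "the red car" becomes [X] on the English side, and the two Thai words aligned to it become a single [X] on the Thai side.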

4.1.4. Outputs

4.2. Noun-phrase Replacing

4.2.1. Abstract Description

The core idea of this script is to find the relations between the identifiers in the two languages of the parallel corpus, so that the identifiers can later be replaced with actual words. The phrase table gives the necessary information about patterns and their translations. The input resources for this script are sample English sentences containing the identifier [X], a Thai document that holds their translations together with segment information, and the phrase table.

4.2.2. Input Documents

The phrase table contains translations, segment information and word-alignment information. It is generated by the GIZA++ components, which need only the parallel corpus as input.

4.2.3. Procedure

The program begins by storing the segment information from the sample English sentences and their Thai translations. Then, using the phrase table, it checks whether any pattern matches. If it finds a matching pattern, it retrieves the required information and stores it in memory. It then calculates the actual position of the phrase. Here, the actual position means the position at which a particular pattern really occurs in the text, whether in the first line or in the middle of a paragraph.
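
The lookup step can be illustrated with a small sketch. The phrase-table layout used here ("source ||| target ||| alignment") is an assumption made for the example, since the report does not reproduce the file; the function names and the sample entry are hypothetical as well.

```python
def load_phrase_table(lines):
    """Map each source pattern (a tuple of tokens) to its list of translations."""
    table = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|||")]
        if len(fields) < 2:
            continue  # skip malformed lines
        pattern = tuple(fields[0].split())
        table.setdefault(pattern, []).append(fields[1].split())
    return table


def find_pattern(tokens, pattern):
    """Return the token position where the pattern occurs, or -1 if absent."""
    n = len(pattern)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == pattern:
            return i
    return -1


def match_patterns(tokens, table):
    """List (position, pattern, translations) for every pattern in the sentence."""
    matches = []
    for pattern, translations in table.items():
        position = find_pattern(tokens, pattern)
        if position >= 0:
            matches.append((position, pattern, translations))
    return matches


# Hypothetical phrase-table entry and sample sentence.
table = load_phrase_table(["[X] stops ||| [X] หยุด ||| 0-0 1-1"])
sentence = ["today", "[X]", "stops"]
matches = match_patterns(sentence, table)
```

In this example the pattern "[X] stops" is found at token position 1, which corresponds to the "actual position" of the pattern inside the sentence.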

4.2.4. Outputs

4.3. Actual Replacement of NP with Translated Results

4.3.1. Abstract Description

The main purpose of this script is to put the translated results into the places marked with identifiers. To do that, the script uses the relation information generated by the previous script, “Noun-phrase Replacing”. After doing the actual replacement, it writes the result into an output file.

4.3.2. Input Documents

4.3.3. Procedure

This script does not involve any complicated processing: it simply places the translated results into the target sentences, using the alignment data produced by the previous program.
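
A minimal sketch of this replacement step is shown below. It assumes that the relation information can be reduced to a mapping from the n-th [X] in the Thai sentence to the n-th [X] in the English sentence; that representation, the function name and the sample data are illustrative assumptions, not the original file format.

```python
def restore_noun_phrases(trg_tokens, translations, links):
    """Fill each [X] in the target sentence with its translated noun phrase.

    translations: translated phrases, indexed by source-side [X] order.
    links: {target_slot_index: source_slot_index} (assumed representation).
    """
    restored, slot = [], 0
    for token in trg_tokens:
        if token == "[X]":
            restored.extend(translations[links[slot]])
            slot += 1
        else:
            restored.append(token)
    return restored


# Hypothetical example: "[X] 's [X]" (John's car) -> "[X] ของ [X]".
# The first Thai slot holds the second English noun phrase, and vice versa.
translations = [["จอห์น"], ["รถ"]]  # English slot 0 = John, slot 1 = car
links = {0: 1, 1: 0}
restored = restore_noun_phrases(["[X]", "ของ", "[X]"], translations, links)
```

The result is the token sequence รถ ของ จอห์น, with the two noun phrases swapped relative to the English order.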

4.3.4. Outputs

4.4. Visualization of the Three Programs’ Roles

1. Remove the noun phrases from the English and Thai sentences in the parallel corpus.
2. Find the links between the identifiers that took the place of the noun phrases.
3. Translate the noun phrases and put them back into the sentences.

Conclusion and Future Work

With the scripts described above, detecting noun phrases in the English sentences of a parallel corpus, translating those phrases into Thai, finding word alignments, and putting the translated results back can all be done seamlessly. Possible future improvements include adding probabilities so that the translation of a noun becomes context-sensitive, and using an example-based approach to translate noun phrases.

5. Web-based Voting System


This system was developed to help the NP translation team collect the necessary number of votes within a limited time. Its development was straightforward: it is a simple database-driven website written in PHP. I therefore do not describe its inputs and outputs here; instead, I provide some screenshots of the system.

6. Tree Transducer
Motivation

• When translating an English sentence into Thai, we should take phrasal verbs and idioms into consideration.
• The words of a phrasal verb or idiom can occur consecutively or, sometimes, separately.
• To translate such phrases, we need a mechanism that uses the hierarchical structure of the words.
• Translating sentences that contain phrases with long-distance dependencies needs a tree-based approach.

6.1. Naive Tree Transducer

6.1.1. Abstract Description

A tree transducer is sometimes described in terms of transfer rules. The goal of this script is to find whether any sub-tree of an input tree matches one of the predefined tree patterns. If a pattern matches, the matched sub-tree is replaced with the desired resulting pattern. A sub-tree may contain the identifier [X].

6.1.2. Inputs

The input sentence in this case, encoded as a bracketed tree, is “(6((2((1((5)(4)))(3((11)(9)))))(5((2((3)))(1((7)))))))”.

6.1.3. Procedure

The program searches for any sub-tree of the input tree that matches a pattern in the predefined patterns document. In the diagram, the yellow nodes indicate a matched sub-tree, and the input tree is transformed into a new tree in which the replaced nodes are shown in green.
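
The matching and replacement described above can be sketched as follows. The sketch assumes that the bracket notation means "(label(child)(child)...)" with the child list wrapped in one extra pair of parentheses, which is consistent with the sample input but is not stated in the report. The rule used here (swap the two children of node 1 when the second child is 4) is hypothetical and is not taken from the patterns document.

```python
def parse(text):
    """Parse '(label((child)(child)))' into a (label, [children]) pair."""
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the '(' that opens this node
        start = pos
        while text[pos] not in "()":
            pos += 1
        label = text[start:pos]
        children = []
        if text[pos] == "(":          # '(' that opens the child list
            pos += 1
            while text[pos] == "(":
                children.append(node())
            pos += 1                  # ')' that closes the child list
        pos += 1                      # ')' that closes this node
        return (label, children)

    return node()


def to_string(tree):
    label, children = tree
    inner = "(" + "".join(to_string(c) for c in children) + ")" if children else ""
    return "(" + label + inner + ")"


def match(pattern, tree, captured):
    """Match a pattern against a tree; [X] matches and captures any sub-tree."""
    if pattern[0] == "[X]":
        captured.append(tree)
        return True
    if pattern[0] != tree[0] or len(pattern[1]) != len(tree[1]):
        return False
    return all(match(p, t, captured) for p, t in zip(pattern[1], tree[1]))


def fill(template, captured):
    """Build the output sub-tree, filling [X] slots in capture order."""
    if template[0] == "[X]":
        return captured.pop(0)
    return (template[0], [fill(child, captured) for child in template[1]])


def transduce(tree, rules):
    """Apply the first matching rule at the highest matching node (one pass)."""
    for pattern, template in rules:
        captured = []
        if match(pattern, tree, captured):
            return fill(template, captured)
    return (tree[0], [transduce(child, rules) for child in tree[1]])


SOURCE = "(6((2((1((5)(4)))(3((11)(9)))))(5((2((3)))(1((7)))))))"
# Hypothetical rule: (1 ([X]) (4)) becomes (1 (4) ([X])).
RULES = [(parse("(1(([X])(4)))"), parse("(1((4)([X])))"))]
result = to_string(transduce(parse(SOURCE), RULES))
```

With this rule, only the sub-tree rooted at the first node 1 changes; the second node 1 has a single child and therefore does not match the pattern.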

6.1.4. Outputs

6.2. Probabilistic Tree Transducer

6.2.1. Abstract Description

The probabilistic tree transducer is similar to the naive one. Its goal is also to find whether any sub-tree of an input tree matches one of the predefined tree patterns and, if so, to replace the matched sub-tree with the desired resulting pattern. A sub-tree may contain the identifier [X]. The difference is that, when several patterns conflict, probabilities decide which pattern is applied. In addition, successive replacement of patterns (the Successive Transducer) is added so that the results are more accurate.

6.2.2. Inputs

The input sentence in this case, encoded as a bracketed tree, is “(6((2((1((5((8)(0)))(4)))(3((11)(9)))))(5((2((3)))(1((7)))))))”.

6.2.3. Procedure

The program searches for any sub-tree of the input tree that matches a pattern in the predefined patterns document. In the diagram, the yellow nodes indicate a matched sub-tree, and the input tree is transformed into a new tree in which the replaced nodes are shown in green. When more than one transfer rule matches, the program chooses the pattern with the highest probability.
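
The two additions, choosing among conflicting rules by probability and applying rules successively, can be sketched as below. To keep the example short, trees are written directly as nested (label, children) tuples, patterns must match a sub-tree exactly (no [X] variables), and the rules and probabilities are invented for illustration.

```python
def leaf(label):
    return (label, ())


def best_rule(tree, rules):
    """Among the rules whose pattern equals this sub-tree, pick the most probable."""
    candidates = [rule for rule in rules if rule[0] == tree]
    return max(candidates, key=lambda rule: rule[2]) if candidates else None


def apply_once(tree, rules):
    """One top-down pass: rewrite the highest matching sub-trees."""
    rule = best_rule(tree, rules)
    if rule is not None:
        return rule[1]
    label, children = tree
    return (label, tuple(apply_once(child, rules) for child in children))


def transduce(tree, rules, max_passes=10):
    """Successive transduction: repeat passes until nothing changes."""
    for _ in range(max_passes):       # guard against rules that undo each other
        new_tree = apply_once(tree, rules)
        if new_tree == tree:
            break
        tree = new_tree
    return tree


five_80 = ("5", (leaf("8"), leaf("0")))
five_08 = ("5", (leaf("0"), leaf("8")))

# Hypothetical rules: (pattern, replacement, probability).
RULES = [
    (five_80, five_08, 0.7),          # conflicts with the next rule and wins
    (five_80, leaf("9"), 0.3),
    (("1", (five_08, leaf("4"))), ("1", (leaf("4"), five_08)), 1.0),
]

tree = ("1", (five_80, leaf("4")))    # the sub-tree (1((5((8)(0)))(4)))
result = transduce(tree, RULES)
```

In the first pass the sub-tree under node 5 is rewritten by the rule with probability 0.7. Only after that rewrite does the third rule match, which is why a second pass is needed.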

6.2.4. Outputs

Conclusion and Future Work

With the scripts described above, transfer rules and pattern transformations can be applied with or without probabilities; both are very useful for translating sentences that contain phrasal verbs and idioms. Possible future improvements include using a better data structure, instead of recursion, for successive pattern matching, and developing pattern-based statistical machine translation on top of the tree transducer.

7. Ontology-based Patient Recommendation System
Motivation

• Diabetes patients are sometimes unaware of routine health measurements such as blood pressure and pulse rate.
• Since these measurements are in fact quite important for diabetes patients, providing them with an easy-to-access recommendation system will be beneficial.
• A diabetes ontology and patients’ records are available from medical experts and hospitals.
• Using inference rules and a diabetes ontology that contains concepts, relationships and other specific data, the system can help patients take care of their health easily.

7.1. Open-source Software Involved

7.1.1. JENA (Java Semantic Framework)

Jena is a Java framework for building Semantic Web applications. It provides a programmer-friendly Application Programming Interface (API). It can parse Semantic Web documents such as RDF (including its XML serialization) and the OWL family of languages. By providing a common interface for these structured mark-ups, it can merge or split resources easily and seamlessly. Jena is a product of the HP Labs Semantic Web Program.

7.1.2. JASTOR

JASTOR is an open-source Java code generator that emits Java Beans from Web Ontologies (OWL), enabling convenient, type-safe access to, and event handling for, RDF stored in a Jena Semantic Web Framework model. JASTOR generates Java interfaces, implementations, factories and listeners based on the properties and class hierarchies in the Web Ontologies. JASTOR is based on several ideas from “Automatic Mapping of OWL Ontologies to Java”.

7.2. Abstract Description

The recommendation system receives patient information and medical records from hospitals. By analyzing these records, it generates appropriate recommendations for the users. It uses the necessary inference rules and the relationships between concepts and roles from a predefined ontology resource. The information about diseases and appropriate treatments is defined by our experts in a structured way. The medical records of the patients are stored in a CSV file. The system is developed in the Java programming language.

7.3. Inputs

7.4. Procedure

The system generates Java Beans from the ontology concepts by using JASTOR. Then, using the instance information from “records.csv”, it creates instances together with their related recommendations. When a user requests the recommendations, the system shows them in a message display.
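
The flow from records to recommendations can be illustrated with a language-neutral sketch, written here in Python rather than in the Java of the actual system. Everything specific in it is an assumption: the column names, the Patient class that stands in for a generated bean, and the two rules. The thresholds are placeholders for illustration only, not values from the diabetes ontology and not medical guidance.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Patient:
    """Stand-in for a bean generated from an ontology concept (hypothetical fields)."""
    name: str
    systolic: int
    pulse: int


# Hypothetical inference rules: (condition, recommendation text).
RULES = [
    (lambda p: p.systolic >= 140, "Blood pressure is high; please consult your doctor."),
    (lambda p: p.pulse > 100, "Pulse rate is high; please rest and measure again."),
]


def load_patients(csv_file):
    """Create one Patient instance per row of the records file."""
    return [
        Patient(row["name"], int(row["systolic"]), int(row["pulse"]))
        for row in csv.DictReader(csv_file)
    ]


def recommend(patient):
    """Return the recommendations whose condition holds for this patient."""
    return [message for condition, message in RULES if condition(patient)]


# In-memory sample standing in for "records.csv" (invented rows).
SAMPLE = "name,systolic,pulse\nA,150,80\nB,120,70\n"
patients = load_patients(io.StringIO(SAMPLE))
recommendations = {p.name: recommend(p) for p in patients}
```

Keeping the rules in a separate list from the instance data mirrors the separation between the ontology's inference rules and the patient records described above.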

7.5. Outputs

Conclusion and Future Work

This work produced a system that can help diabetes patients obtain the recommendations they need. Since the current work is only a prototype for approval, a complete system that makes use of all the information in the ontology concepts and medical records can be developed in the future.

8. Glossary
JENA - A Java Semantic Web framework that provides an application programming interface for developers building Semantic Web applications.

JASTOR - A tool that transforms ontology concepts into Java Beans, using JENA at its core.

Noun Phrase Translation - Translating the noun phrases of a sentence selectively instead of translating the whole sentence.

Ontology - A formal representation of a set of concepts within a domain and the relationships between those concepts.

OWL - Short for Web Ontology Language, a family of knowledge representation languages for authoring ontologies, endorsed by the World Wide Web Consortium.

Parallel corpus - A large pair of documents, known as the source document and the target document (sometimes called a dual-language series of texts).

PHP - A popular server-side scripting language that supports database interaction, dynamic content generation and so on.

Phrase Table - A document containing translated results, alignment positions and frequencies of occurrence in the corpus, used to assist translation.

POS - Part of speech: a linguistic category of words (or, more precisely, lexical items), generally defined by the syntactic or morphological behaviour of the lexical item in question.

Probabilistic - Describes any system or module that uses probability in its core processing functions.

Tree Transducer - A mechanism that changes the arrangement of any sub-tree of the original tree.
