
(Figure 2). The ultimate goal of these steps is to create a specific dataset with the appropriate lines of code and Java statement labels, which can be used for further source code analysis and language modeling.
B. Lexer Generating
ANTLR is a flexible tool that developers can use to analyze source code and build language-processing tools for different applications [2]. The generator can produce analysis code for any programming language, provided the user supplies the grammar for that language. ANTLR itself is based on Java but can generate source code in many other programming languages. This research uses ANTLR version 4, the latest version of the tool. In general, ANTLR generates lexers, listeners, and parsers; this study uses only the lexer. A lexer is a software component that reads a stream of program code characters and converts it into a stream of tokens representing a programming language's fundamental building blocks.

The first step in using ANTLR to build a lexer is to download the ANTLR library from the official site. After downloading the library, the next step is to provide the Java grammar that defines the rules and tokens of the programming language to analyze. Grammars must be written in the ANTLR grammar format. Given the Java grammar file as input, the ANTLR tool generates a lexer for Java source code in the user's specified target language. This study generates the lexer program in Python to make the subsequent classification easier.
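For illustration, the generation step and a quick sanity check might look like the sketch below; the jar version, grammar file name, and generated module name are assumptions, not details given in the paper.

    # Generate a Python lexer from the Java grammar (assumed names/version):
    #   java -jar antlr-4.13.1-complete.jar -Dlanguage=Python3 JavaLexer.g4
    # This emits JavaLexer.py, used together with the ANTLR Python runtime
    # (pip install antlr4-python3-runtime).
    from antlr4 import InputStream
    from JavaLexer import JavaLexer  # generated module (assumed name)

    tokens = JavaLexer(InputStream("int x = 0;")).getAllTokens()
    print([t.text for t in tokens])  # ['int', 'x', '=', '0', ';']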
C. Source Code Analysis
Source code analysis is the process of determining what the source code of a program means by examining its structure and syntax. The component used in this process is the lexer. The process is analogous to tokenization and POS tagging in natural language processing (NLP). Tokenization breaks source code text into its constituent words or tokens. POS tagging identifies each token's type, such as identifier, operator, or keyword. This information can be used to create a token stream, a list of token names and types representing a source code file. Figure 3 shows a sample of this research's tokenization and POS tagging.

Figure 3. Tokenization and POS Tagging in source code analysis

First, this source code analysis process imports the generated lexer program. Then it reads the lines of code (LOC) and their individual labels from the dataset rows to analyze. Next, the analysis iterates through each line of code and turns it into a stream of tokens using the lexer. We then collect all the tokens for each line of code into an array, concatenate the token types in each line, and join them with spaces to form natural-language-like sentences that are easy to process with NLP techniques.
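A minimal sketch of this per-line conversion, reusing the generated lexer from the previous step (module name assumed):

    from antlr4 import InputStream
    from JavaLexer import JavaLexer  # generated lexer module (assumed name)

    def line_to_sentence(line_of_code):
        # Map each token to its type name, then join the names with spaces
        # so the line reads like a natural language sentence.
        lexer = JavaLexer(InputStream(line_of_code))
        names = [JavaLexer.symbolicNames[t.type] for t in lexer.getAllTokens()]
        return " ".join(names)

    # Output depends on the grammar's token names,
    # e.g. "INT IDENTIFIER ASSIGN DECIMAL_LITERAL SEMI".
    print(line_to_sentence("int x = 0;"))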
D. Language Modeling
Language modeling is creating a statistical model that can predict the possible word order in a given language. The token stream is the sequence of text data produced by the previous process, and this research needs to convert it into a numerical representation. One common approach is the term frequency-inverse document frequency (TF-IDF) method [3], which assigns a weight to each token based on how frequently it occurs in the code and how unique it is across the dataset.

First, the language modeling step imports Python libraries such as CountVectorizer and TfidfTransformer, which let us use TF-IDF to convert the textual data in the token stream into a numeric format. For each line of code or statement, we use CountVectorizer to calculate the frequency of each token in each statement, then use TfidfTransformer to calculate the TF-IDF weight of each token. This turns the token stream into a numerical representation that reflects each token's importance in its source code context. The method yields 53 features from a dataset with 594 lines of code. These features can then be used to classify each statement into its type.
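A minimal sketch of this weighting step, assuming the token-type sentences have already been collected into a list (the example sentences are illustrative):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # One space-separated token-type sentence per line of code.
    sentences = [
        "INT IDENTIFIER ASSIGN DECIMAL_LITERAL SEMI",
        "IF LPAREN IDENTIFIER GT DECIMAL_LITERAL RPAREN",
    ]

    counts = CountVectorizer().fit_transform(sentences)  # raw token counts
    weights = TfidfTransformer().fit_transform(counts)   # TF-IDF weights
    print(weights.shape)  # (lines, features) -- 594 x 53 in this study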

E. Machine Learning Classification
Machine learning classification is a powerful way to sort text data into different groups. The method is beneficial in source code analysis for finding source code statements and assigning them to groups. Many machine learning models are available for text classification, including Naïve Bayes, SVM, kNN, Decision Tree, and Rocchio algorithms [4]. Each model has strengths and weaknesses, and choosing a suitable model for a particular task is essential. In order to determine the best model for source code statement classification, it is necessary to compare the performance of these different models.

We use Python libraries to build the machine learning models for the classification process. Both the classification and the evaluation use K-Fold Cross-Validation with four folds, dividing the dataset into 75% training data and 25% testing data in each fold. After splitting the dataset, the process constructs a classification model from the training data using each selected machine learning model in turn. The program records each model's accuracy and time consumption on the test dataset in each fold. Finally, it calculates the average accuracy and time consumption of each model. This process allows us to compare the performance of the different machine learning models and choose the best fit for our source code statement classification.
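The comparison loop can be sketched as follows, assuming X holds the 53 TF-IDF features and y the statement labels (both names are placeholders); RVM is omitted because it requires a package outside scikit-learn, and NearestCentroid stands in for the Rocchio classifier:

    import time
    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import SVC
    from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

    models = {
        "Decision Tree": DecisionTreeClassifier(),
        "Naive Bayes": MultinomialNB(),
        "SVM": SVC(),
        "kNN": KNeighborsClassifier(),
        "Rocchio": NearestCentroid(),
    }

    kfold = KFold(n_splits=4, shuffle=True, random_state=0)  # 75%/25% splits
    for name, model in models.items():
        accuracy, elapsed_ms = [], []
        for train_idx, test_idx in kfold.split(X):
            start = time.time()
            model.fit(X[train_idx], y[train_idx])        # train on 3 folds
            accuracy.append(model.score(X[test_idx], y[test_idx]))
            elapsed_ms.append((time.time() - start) * 1000)
        print(name, np.mean(accuracy), np.mean(elapsed_ms))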
IV. RESULT AND DISCUSSION

Accuracy and time consumption in Java statement classification:

    Model           Accuracy (%)    Time Consumption (ms)
    Decision Tree   95.3            4.53
    Naïve Bayes     94.4            3.79
    SVM             95.1            5.34
    RVM             96.1            64908.16
    kNN             87.3            49.13
    Rocchio         83.7            61.86

Figure 4. Accuracy in Java statement classification

Figure 5. Time consumption in Java statement classification

(Draft notes: display all resulting tables and graphs; provide the discussion; lead toward the conclusion.)
V. CONCLUSION
(Draft note: state the best result.)

ACKNOWLEDGEMENT
This work was supported by the Indonesian Government
through the Scholarship Schema of LPDP RI and Puslapdik
Kemendikbudristek.
REFERENCES

[1] S. Brudzinski, "Open LaTeX Studio," 2018. [Online]. Available: https://github.com/sebbrudzinski/Open-LaTeX-Studio/. [Accessed 2023].
[2] T. Parr, The Definitive ANTLR 4 Reference, The Pragmatic Programmers, 2013, pp. 1-326.
[3] Z. Yun-tao, G. Ling and W. Yong-cheng, "An Improved TF-IDF Approach for Text Classification," Journal of Zhejiang University-Science, vol. 6, pp. 49-55, 2005.
[4] B. Agarwal and N. Mittal, "Text Classification Using
Machine Learning Methods-A Survey," Proceedings of
the Second International Conference on Soft Computing
for Problem Solving, vol. 236, pp. 701-709, 2012.
Data Preparation
• Search GitHub for available Java open-source software (OSS)
• Open LaTeX Studio was selected
• 511 KB, 120 files
• 4 *.java files were taken → 37 KB
• Each line was moved to a CSV file → 806 lines of code
• Labeled: Declaration, Expression, Control
• The rest were discarded: Comment, Braces, Parentheses
• 594 dataset lines remain (see the loading sketch below)
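A hedged sketch of loading and filtering this dataset, assuming a CSV with "loc" and "label" columns (the column names are not specified in the notes):

    import pandas as pd

    # Assumed layout: one line of code per row, plus its statement label.
    df = pd.read_csv("dataset.csv")  # columns: loc, label (assumed)

    # Keep only the three labeled statement types listed in the notes.
    df = df[df["label"].isin(["Declaration", "Expression", "Control"])]
    print(len(df))  # 594 rows after filtering, per the notes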

Lexer Generating
• Download the ANTLR library & the Java grammar
• Set the ANTLR configuration
• Set the target language to Python
• Generate a Python program for processing Java source code
• This yields a Lexer, Parser, and Listener
• Take the Lexer

Source Code Analysis
• Import the generated Lexer program
• Read each dataset line
• Analyze it with the Lexer (tokenization & POS tagging)
• Result: a TokenStream (array)
• Tokenization: extract the tokens from the source code
• POS tagging: mark the type of each token
• Convert the TokenStream into a String containing the token types
• Join the token types of each line with a space separator
• This yields natural-language-like sentences, ready for NLP processing

Language Modeling
• Import the Python libraries: CountVectorizer & TfidfTransformer
• CountVectorizer: counts the occurrences of each token in every line and across all lines
• TfidfTransformer: computes the weights of the relevant tokens in each line
• Result: a dataframe with 53 weight features over 594 dataset lines
• Ready for classification

Statement Classification
• Import the Python libraries for the ML models: DecisionTree, NaiveBayes, SVM, RVM, kNN, Rocchio
• Create K-Fold Cross-Validation: 4 folds
• 75% training data & 25% testing data
• Select the ML method
• Execute 4 iterations:
  • Build the classification model with the selected method
  • Measure accuracy & time
• Compute the average accuracy & time for each ML method
