A Machine Learning Framework For Automated News Article Title Classification in Albanian
979-8-3503-6813-0/24/$31.00 ©2024 IEEE
Authorized licensed use limited to: National University Fast. Downloaded on November 01,2024 at [Link] UTC from IEEE Xplore. Restrictions apply.

Abstract—Automated news article classification is a method of categorizing textual data into predefined classes. Solving this problem has applications in diverse domains, including information retrieval, topic modeling, sentiment analysis and content recommendation systems. For Albanian, despite a rapid increase in digital content, text corpora remain scarce, which presents significant obstacles to the advancement of natural language processing research and applications.
The contribution of this paper is twofold. First, we introduce a dataset consisting of 9600 news article titles spanning various categories. Second, we use this dataset to assess the effectiveness of several machine learning algorithms for topic classification. Experimental results demonstrate the efficacy of recurrent neural networks in comparison to simpler classifiers and ensemble methods.

Index Terms—news classification, machine learning, NLP

I. INTRODUCTION

Text classification is a central problem in natural language processing with applications in information retrieval and summarization, aggregation of news sources by topic, customer feedback segmentation, and content personalization [1]. Due to the vast volume of digital content, it is necessary to understand, examine and organize text [2]. This work focuses on classification of news articles in the Albanian language. Currently available content is characterized by a small volume of information, a wide variety of lexical and grammatical structures, and a direct and formal writing style.

Recent advancements in machine learning, natural language processing (NLP) and large language models are playing a pivotal role in achieving a high level of accuracy in text understanding, generation and classification [3]. Such methods leverage large training data and sophisticated algorithms to extract text patterns, semantic representations and an understanding of language nuances. Despite these advancements, classification of news article titles in Albanian presents several significant challenges [4]. Currently, the majority of available text corpora on which algorithms are trained are in English. For under-represented languages such as Albanian, the existing text corpora are few and small in size. An additional challenge is caused by the inherent ambiguity of news articles, which often fall under multiple categories. For example, an article discussing sporting events may also include elements of social or cultural significance, making it difficult to assign a single, definitive label. Moreover, the grammatical structure and writing style of Albanian text differ significantly from those of English, posing further obstacles for accurate classification. Collectively, these limitations restrict the ability of machine learning algorithms to semantically understand text in low-resource languages [4], [5].

To address these challenges, we introduce a comprehensive dataset of 9600 news article titles in Albanian, covering six distinct categories. News titles are sourced from several newspapers, ensuring a balanced representation across the categories politics, economy, sport, culture, lifestyle and current affairs. This labeled dataset provides a substantial body of news article titles available for training. We leverage the collected dataset to address the task of news article classification by examining the effectiveness of various machine learning models. We begin by testing traditional models such as logistic regression, support vector machines and decision trees, which serve as benchmarks for comparison. Additionally, we investigate the performance of ensemble learning methods such as random forest and gradient boosting. Experimental results identify recurrent neural networks as the best-performing model, able to capture sequential and contextual information in news headline data.

II. PROBLEM DEFINITION AND RELATED WORK

Let $D$ be a dataset containing $N$ news article titles, where each title $x_i$ is associated with a category label $y_i$ such
that $y_i \in \{1, 2, \ldots, K\}$ with $K$ being the total number of categories. Each news article title $x_i$ is composed of a sequence of words represented as $x_i = (w_{i1}, w_{i2}, \ldots, w_{iN_i})$, where $N_i$ is the number of words in title $x_i$. These words form the features used for classification.

The goal is to leverage the training data to learn a function $f : X \rightarrow Y$, where $X$ denotes the set of news article titles, while $Y$ represents the set of corresponding true labels. Function $f$ maps each title $x_i$ to its corresponding category $y_i$, with the goal of accurately predicting the category of new, previously unseen titles. In other words, the objective is to identify the optimal function $f^{*}$ that minimizes a predefined loss function $L(f(x_i), y_i)$ over the entire dataset $D$, such that

$$f^{*} = \arg\min_{f} \sum_{i=1}^{N} L(f(x_i), y_i) \tag{1}$$

where $N$ denotes the total number of news article titles, while $L$ represents the loss incurred by the prediction of the model.

Various methods are employed to address this problem, from probabilistic models to recent deep neural architectures. Early contributions to topic modeling include methods such as Latent Semantic Indexing (LSI) [6] and Probabilistic Latent Semantic Indexing (PLSI) [7], which aim to discover hidden thematic structures within a body of text based on word usage statistics. While these approaches seek to discover latent topics without prior knowledge of categories, we use labeled training data to predict the category of previously unseen article titles. Traditional machine learning approaches such as logistic regression [8] model the likelihood of a given news article title belonging to a particular category. Support Vector Machines (SVMs) [9] aim to find an optimal hyperplane that separates data points into distinct categories by creating decision boundaries that maximize the separation between classes. Decision trees maximize information gain and assign categories based on the majority of data points [10], while ensemble models such as random forests leverage multiple decision trees and use majority votes to improve prediction accuracy [11]. Other ensemble

the effectiveness of deep learning methods for low-resource languages.

News article classification with Albanian text corpora has also been addressed by previous research. A study comparing the effectiveness of various classifiers such as Multinomial Naïve Bayes, Logistic Regression, SVM, and others in terms of accuracy and execution time has shown that the Passive Aggressive algorithm achieves the highest accuracy, while Random Forest performs the poorest [15]. In other work, besides traditional classifiers, bag-of-words models and hierarchical classifiers are also employed, focusing on semantic and syntactic similarities between words and resulting in models that achieve high accuracy in multi-label text classification [16], [17]. In a more closely related work, a series of traditional and ensemble classification algorithms is employed on a relatively small dataset of Albanian news article headlines. Experimental results demonstrate that basic models outperform ensemble learning methods [18]. Though previous research has examined news article topic classification in Albanian relying mostly on traditional classification algorithms, our approach distinguishes itself with a large dataset of over 9000 news article titles. In addition, we also employ deep learning methods, such as recurrent neural networks, and achieve a higher accuracy on news article topic classification.

III. PROPOSED METHODOLOGY

We employ an approach with three key modules. First, we construct a dataset by web scraping news article titles spanning several categories. Second, we process the dataset by tokenizing the titles and converting them to numerical tokens. Labels are also encoded into numerical values. Third, we build a variety of classification algorithms, from traditional approaches to ensemble and deep learning models, and train them with the objective of building a model that classifies previously unseen news article titles. Figure 1 provides an illustration.

[Figure 1: Data Extraction Module. Newspapers' online editions → web scraping → article raw data → filtering → article headlines.]
lifestyle. To ensure robustness of our approach, we include potentially overlapping categories, such as culture and lifestyle. An example is shown in Table I.

Once the data are assembled, a preprocessing phase ensures uniform formatting: it removes punctuation and special symbols, converts all text to lowercase and removes titles of insufficient length. We ensure a well-balanced representation of each category across the entire dataset. A human supervisor checks the data to ensure the quality, correctness and relevance of the training data.

TABLE II
ILLUSTRATION OF TEXT TOKENIZATION AND PADDING

Original Text: tirana mposht pastër teutën dhe rikthehet tek fitorja dopietë e florent hasanit
Vectorized and Padded Text: [0 0 0 0 0 0 0 0 0 0 0 35 1 29 12 148 8 360 39 4 38 118 363 32]

data into a format that is suitable to be utilized as input for machine learning algorithms.
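The vectorization and padding shown in Table II can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes word ids are assigned by descending frequency, that id 0 is reserved for padding, and that zeros are added on the left as in the table. The helper names and the toy titles are hypothetical.

```python
from collections import Counter


def build_vocab(titles):
    """Map each word to an integer id, most frequent first; id 0 is kept for padding."""
    counts = Counter(word for title in titles for word in title.split())
    # Ties are broken alphabetically so that the mapping is deterministic.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(ordered, start=1)}


def vectorize(title, vocab, max_len):
    """Convert a title to word ids and left-pad with zeros up to max_len."""
    ids = [vocab[w] for w in title.split() if w in vocab]
    if len(ids) > max_len:
        ids = ids[len(ids) - max_len:]  # keep the last max_len tokens
    return [0] * (max_len - len(ids)) + ids


# Toy titles built from words in the Table II example.
titles = ["tirana mposht teutën", "tirana rikthehet tek fitorja"]
vocab = build_vocab(titles)
padded = [vectorize(t, vocab, max_len=6) for t in titles]
```

Left padding keeps the actual words at the end of the sequence, so a recurrent layer reads them last; whether the original pipeline truncates long titles from the front or the back is not stated in the paper.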
Fig. 2. Illustration of Recurrent Neural Network Gates. The figure depicts the architecture of a typical LSTM cell, featuring a cell state, an input gate, a forget gate, an output gate, and a candidate cell state.

one. Values of zero denote information that is discarded, while values of one denote information that is remembered, with anything in between being partially remembered.

The role of the input gate is to identify the elements that need to be added to the cell state and the long-term memory of the network. The input gate decides which values will be updated. To achieve that, a sigmoid function is applied to the current input $X_t$ and the previous hidden state $h_{t-1}$, transforming the values into the range from zero (not relevant) to one (relevant). In this way, the input gate decides what information is relevant for updating the current cell state. At this point, the network has enough information to calculate the cell state and is ready to store the information in it.

The objective of the output gate is to decide what to output from the memory cell. It contains information on previous inputs and decides the value of the next hidden state.

1) Model Architecture: To address the news article topic classification problem, we propose a neural network architecture composed of several sequential layers, whose goal is to process vectorized text data and learn hidden relationships in order to classify new, unseen article titles.

To map each word in the input news article title to a dense vector representation, we use an embedding layer, whose goal is to capture the semantic meaning of words based on their contextual usage in the dataset. Next, we connect the embedding layer to an LSTM layer consisting of 128 memory units. The LSTM layer captures long-term dependencies in the input sequences, enabling the algorithm to gain the relevant context for topic understanding and, ultimately, accurate predictions. After the LSTM layer has processed the input sequences, our model uses a dense layer with 64 units to extract high-level features and abstract representations of the input data. This dense layer drives the classification process to discriminate between different news article topics. Finally, a dense layer with six units is used as the output layer. It performs the final classification, yielding a probability distribution from which the model predicts the most likely category for a given news article title.

2) Model Optimization: Once the architecture of the neural network is defined, the model is trained using the Adam optimizer, a well-known enhancement of gradient descent that is widely used in neural network training. Adam is renowned for its robustness to sparse gradients and noisy data. Categorical cross-entropy is chosen as the loss function for our multi-class classification task. It measures the discrepancy between the predicted probability distribution and the true distribution of the classes. Because neural network architectures tend to have a larger number of trainable parameters than more traditional approaches, we implement early stopping as a regularization technique to avoid overfitting. In addition, we employ other regularization techniques such as Dropout and Batch Normalization to ensure that the model is robust enough to learn meaningful patterns and generalizes well to previously unseen data.

C. Traditional Classification Algorithms

In our work, we consider several traditional classification models to address the task of news article title classification, including simpler models such as logistic regression, support vector machines and decision trees, and ensemble models including random forest and gradient boosting.

Logistic regression is a simple yet effective linear model whose objective is to estimate the probability of a data point belonging to a particular category [8]. Logistic regression uses the logistic function to model the relationship between the input features (that is, word embeddings) and the category outputs. Logistic regression is computationally efficient, highly interpretable and has been shown to work well, especially if classes are well separated from one another.

Support vector machines, on the other hand, identify a hyperplane to separate data points into distinct categories, maximizing the distance between categories [9]. In comparison with logistic regression, SVMs are better suited to handle high-dimensional data and capture complex non-linear relationships, while logistic regression inherently assumes a linear relationship between the input features and output categories.

Another traditional classification model we utilize is a decision tree [10]. The objective of decision trees, as non-parametric models, is to partition the input space into non-overlapping regions by recursively splitting this space at decision nodes defined by the most informative features. Decision trees aim to maximize information gain and assign categories based on the majority of data points within a region. Though highly interpretable, decision trees may suffer from poor generalization capabilities and be prone to overfitting.

To improve the performance of single decision trees, another common approach is to utilize random forests, an ensemble learning methodology that constructs multiple decision trees and makes a decision based on the majority vote [11]. In particular, each decision tree is trained on a random subset of the data, with the goal of introducing randomness and diversity in the ensemble. More concretely, a random forest model aggregates individual predictions to allow the model
to improve its overall performance. Because random forests are ensemble models, they are less prone to overfitting than single decision trees, and generally more robust.

Another ensemble learning method utilized in our approach is gradient boosting. The objective is to iteratively build a sequence of decision trees, each aiming to correct errors made by the preceding models [12]. In the context of news article topic classification, a gradient boosting model focuses on the most informative parts of the input space, gradually improving its predictive capabilities. Gradient boosting is able to capture nonlinear relationships in the input data, but is computationally expensive and sensitive to the choice of hyperparameters.

C. Performance of Classification Algorithms

A series of experiments was conducted to assess the performance of various classification models. Accuracy on the test data is used as the evaluation measure. Note that accuracy is calculated as the proportion of instances classified correctly (i.e., true positives and true negatives) over the total number of records. Figure 3 shows the performance of various classification algorithms.
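For a multi-class task such as this one, the accuracy measure described above reduces to the fraction of titles whose predicted category equals the true category. The sketch below is only an illustration: the function name and the toy label vectors are hypothetical and do not reproduce any result from the paper.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Toy labels over six categories encoded as 0..5 (made-up values).
y_true = [0, 1, 2, 2, 5, 3]
y_pred = [0, 1, 2, 4, 5, 1]
score = accuracy(y_true, y_pred)  # 4 of the 6 predictions are correct
```

Because the dataset is described as balanced across categories, plain accuracy is a reasonable summary here; on an imbalanced dataset, per-class measures would be more informative.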
not be able to fully capture the overlapping nuances of various news article topic headlines, leading to lower accuracy in comparison to RNNs and logistic regression.

Lastly, decision tree models achieved a low accuracy of only 58%. Decision trees segment the input feature space using decision nodes and categorize each news article headline based on a majority class. This procedure, however, tends to overfit the training data, resulting in poor generalization capabilities and, consequently, low performance on previously unseen test data.

D. Limitations

One limitation is the inherent ambiguity and overlap between some class categories. Exploring various cases of misclassification revealed that, in several cases, even a human supervisor would find it difficult to assign one unique category to a given news article headline. For example, distinguishing between the culture and lifestyle categories is challenging, as both classes share common themes. Similarly, current events, politics and economics topics can be ambiguous, as all three categories may intersect with one another.

Another limitation of neural networks, despite having the highest performance, is that they are less explainable and interpretable, making it challenging to understand the underlying decision-making process. To address these limitations, as future work, we will consider refining classification boundaries by integrating domain-specific knowledge. Furthermore, gradient-based attribution methods and human-in-the-loop techniques can help to increase the interpretability of deep learning models.

V. CONCLUSION

This paper presented an approach that utilizes machine learning classification algorithms to address the problem of categorizing Albanian news article headlines into predefined classes. A novel dataset consisting of 9600 records was meticulously constructed and leveraged to advance research in natural language processing for low-resource languages such as Albanian.

The findings of this work underscore the effectiveness of deep learning methods such as recurrent neural networks in automated news article classification. Moreover, a wide range of traditional machine learning approaches such as logistic regression, support vector machines and decision trees, in addition to ensemble methods including random forest and gradient boosting, were evaluated on the constructed dataset.

This work opens up several potential research directions. One approach is to further explore deep learning architectures through techniques such as attention mechanisms, or to leverage pre-trained large language models. Secondly, to overcome the inherent ambiguity between certain news article categories, we might explore more sophisticated models or human intervention to better capture subtle textual nuances. Thirdly, the expansion of the current dataset in terms of records, features and categories could increase the generalization capabilities of our model.

REFERENCES

[1] A. Palanivinayagam, C. El-Bayeh, R. Damaševičius, “Twenty Years of Machine-Learning-Based Text Classification: A Systematic Review,” Algorithms, 2023.
[2] A. Joshi, E. Fidalgo, E. Alegre, L. Fernández-Robles, “DeepSumm: Exploiting topic models and sequence to sequence networks for extractive text summarization,” Expert Systems with Applications, 2023.
[3] D. Khurana, A. Koli, K. Khatter, S. Singh, “Natural language processing: State of the art, current trends and challenges,” Multimedia Tools and Applications, 2023, vol. 82, pp. 3713–3744.
[4] E. Çano, D. Lamaj, “AlbNews: A Corpus of Headlines for Topic Modeling in Albanian,” arXiv preprint arXiv:2402.04028, 2024.
[5] V. Blaschke, H. Schuetze, B. Plank, “A survey of corpora for Germanic low-resource languages and dialects,” in Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), 2024, pp. 392–414.
[6] C. H. Papadimitriou, H. Tamaki, P. Raghavan, S. Vempala, “Latent semantic indexing: a probabilistic analysis,” in Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998, pp. 159–168.
[7] T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1999, pp. 50–57.
[8] D. R. Cox, “The regression analysis of binary sequences,” Journal of the Royal Statistical Society: Series B (Methodological), 1958, vol. 20(2), pp. 215–232.
[9] C. Cortes, V. Vapnik, “Support-vector networks,” Machine Learning, 1995, vol. 20(3), pp. 273–297.
[10] C. Sammut, G. I. Webb, “Decision Tree,” in Encyclopedia of Machine Learning, 2011, Springer, Boston, MA. [Link] 387-30164-8 832
[11] T. K. Ho, “Random decision forests,” in Proceedings of 3rd International Conference on Document Analysis and Recognition, 1995, vol. 1, pp. 278–282.
[12] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of Statistics, 2001, pp. 1189–1232.
[13] L. A. Qadi, H. E. Rifai, S. Obaid, A. Elnagar, “Arabic Text Classification of News Articles Using Classical Supervised Classifiers,” 2nd International Conference on new Trends in Computing Sciences (ICTCS), 2019, pp. 1–6.
[14] T. Walkowiak, P. Malak, “Polish Texts Topic Classification Evaluation,” in ICAART, 2018, vol. 2, pp. 515–522.
[15] A. Petukhova, N. Fachada, “MN-DS: A multilabeled news dataset for news articles hierarchical classification,” Data, 2023, vol. 8.
[16] S. Parida, P. Motlicek, S. R. Dash, “German News Article Classification: A Multichannel CNN Approach,” Advances in Systems, Control and Automations, Lecture Notes in Electrical Engineering, 2021, vol. 708.
[17] Z. Zhai, X. Zhang, F. Fang, L. Yao, “Text classification of Chinese news based on multi-scale CNN and LSTM hybrid model,” Multimedia Tools and Applications, 2023, vol. 82(14), pp. 20975–20988.
[18] L. Shkurti, F. Kabashi, V. Sofiu, A. Susuri, “Performance Comparison of Machine Learning Algorithms for Albanian News articles,” IFAC-PapersOnLine, 2022, vol. 55(39), pp. 292–295.
[19] A. Kadriu, L. Abazi, “A Comparison of Algorithms for Text Classification of Albanian News Articles,” ENTRENOVA - ENTerprise REsearch InNOVAtion, 2017, vol. 3(1), pp. 62–68.
[20] A. Kadriu, L. Abazi, H. Abazi, “Albanian Text Classification: Bag of Words Model and Word Analogies,” Business Systems Research, 2019, vol. 10(1), pp. 74–87.
[21] E. Çano, D. Lamaj, “AlbNews: A Corpus of Headlines for Topic Modeling in Albanian,” arXiv preprint arXiv:2402.04028, 2024.
[22] C. Sammut, G. I. Webb, “TF-IDF,” in Encyclopedia of Machine Learning, 2011, Springer, Boston, MA. [Link] 8 832
[23] R. M. Schmidt, “Recurrent neural networks (RNNs): A gentle introduction and overview,” arXiv preprint arXiv:1912.05911, 2019.
[24] S. F. Ahmed, M. S. B. Alam, M. Hassan, M. R. Rozbu, T. Ishtiak, N. Rafa, A. H. Gandomi, “Deep learning modelling techniques: current progress, applications, advantages, and challenges,” Artificial Intelligence Review, 2023, vol. 56(11), pp. 13521–13617.
[25] S. Hochreiter, J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997, vol. 9(8), pp. 1735–1780.