A Machine Learning Framework For Automated News Article Title Classification in Albanian

This paper presents a machine learning framework for automated classification of news article titles in Albanian, addressing the challenges posed by limited text corpora and the complexity of the Albanian language. A dataset of 9600 news titles across six categories is introduced, and various machine learning algorithms, particularly recurrent neural networks, are evaluated for their effectiveness in classifying these titles. The study demonstrates that deep learning methods outperform traditional classifiers in accurately categorizing news articles in low-resource languages like Albanian.

A Machine Learning Framework for Automated News Article Title Classification in Albanian

Evis Plaku∗, Klei Jahaj†, Arben Cela‡, and Nikolla Civici§

∗ AI Laboratory, University Metropolitan Tirana, Sotir Kolea Street, Tirana, Albania
Email: [Link]@[Link]
† Faculty of Computer Science and IT, University Metropolitan Tirana, Sotir Kolea Street, Tirana, Albania
Email: [Link]@[Link]
‡ Laboratory of Images Signals and Intelligent Systems, ESIEE Paris, Noisy-le-Grand CEDEX, Paris, France
Email: [Link]@[Link]
§ Faculty of Engineering, University Metropolitan Tirana, Sotir Kolea Street, Tirana, Albania
Email: ncivici@[Link]

Abstract—Automated news article classification is a method of categorizing textual data into predefined classes. Addressing this problem finds applications in diverse domains, including information retrieval, topic modeling, sentiment analysis and content recommendation systems. In Albanian, though there is a rapid increase of digital content, there is limited availability of text corpora, presenting significant obstacles for the advancement of natural language processing research and applications.

The contribution of this paper is twofold. First, we introduce a dataset consisting of 9600 news article titles spanning various categories. Second, we utilize this dataset to assess the effectiveness of several machine learning algorithms for topic classification. Experimental results demonstrate the efficacy of recurrent neural networks in comparison to simpler classifiers and ensemble methods.

Index Terms—news classification, machine learning, NLP for low-resource languages

979-8-3503-6813-0/24/$31.00 ©2024 IEEE
Authorized licensed use limited to: National University Fast. Downloaded on November 01,2024 at [Link] UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

Text classification is a central problem in natural language processing with applications in information retrieval and summarization, aggregation of news sources by topic, customer feedback segmentation, and content personalization [1]. Due to the vast volume of digital content, it is necessary to understand, examine and organize text [2]. This work focuses on classification of news articles in the Albanian language. Currently available content is characterized by a small volume of information provided, a wide variety of lexical and grammatical structures, and a direct and formal writing style.

Recent advancements in machine learning, natural language processing (NLP) and large language models are playing a pivotal role in achieving a high level of accuracy in text understanding, generation and classification [3]. Such methods leverage large training data and sophisticated algorithms to extract text patterns, semantic representations and an understanding of language nuances. Despite these advancements, classification of news article titles in Albanian presents several significant challenges [4]. Currently, the majority of available text corpora, which algorithms are trained on, are in English. For under-represented languages such as Albanian, the available text corpora are limited and small in size. An additional challenge is caused by the inherent ambiguity of news articles, which often fall under multiple categories. For example, an article discussing sporting events may also include elements of social or cultural significance, making it difficult to assign a single, definitive label. Moreover, the grammatical structure and writing style of Albanian text differ significantly from those of English, posing further obstacles for accurate classification. Collectively, these limitations affect the ability of machine learning algorithms to semantically understand text [4], [5].

To address these challenges, we introduce a comprehensive dataset of 9600 news article titles in Albanian, covering six distinct categories. News titles are sourced from several newspapers, ensuring a balanced representation across categories such as politics, economy, sport, culture, lifestyle and current affairs. This labeled dataset provides a substantial body of news article titles available for training. We leverage the collected dataset to address the task of news article classification by examining the effectiveness of various machine learning models. We begin by testing traditional models such as logistic regression, support vector machines and decision trees, which serve as benchmarks for comparison. Additionally, we investigate the performance of ensemble learning methods like random forest and gradient boosting. Experimental results identify recurrent neural networks as the best performing model, able to capture sequential and contextual information in news headline data.

II. PROBLEM DEFINITION AND RELATED WORK

Let $D$ be a dataset containing $N$ news article titles, where each title $x_i$ is associated with a category label $y_i$ such
that $y_i \in \{1, 2, \ldots, K\}$, with $K$ being the total number of categories. Each news article title $x_i$ is composed of a sequence of words represented as $x_i = (w_{i1}, w_{i2}, \ldots, w_{iN_i})$, where $N_i$ is the number of words in title $x_i$. These words form the features used for classification.

The goal is to leverage the training data to learn a function $f : X \to Y$, where $X$ denotes the set of news article titles, while $Y$ represents the set of corresponding true labels. Function $f$ maps each title $x_i$ to its corresponding category $y_i$, with the goal of accurately predicting the category of new, previously unseen titles. In other words, the objective is to identify the optimal function $f^*$ that minimizes a predefined loss function $L(f(x_i), y_i)$ over the entire dataset $D$, such that

$$f^* = \operatorname*{argmin}_{f} \sum_{i=1}^{N} L(f(x_i), y_i) \tag{1}$$

where $N$ denotes the total number of news article titles, while $L$ represents the loss incurred by the prediction of the model.

Various methods are employed to address this problem, from probabilistic models to recent deep neural architectures. Early contributions to topic modeling include methods such as Latent Semantic Indexing (LSI) [6] and Probabilistic Latent Semantic Indexing (PLSI) [7], aiming to discover hidden thematic structures within a body of text based on word usage statistics. While these approaches seek to discover latent topics without prior knowledge of categories, we use labeled training data to predict the category of previously unseen article titles. Traditional machine learning approaches such as logistic regression [8] model the likelihood of a given news article title belonging to a particular category. Support vector machines (SVMs) [9] aim to find an optimal hyperplane that separates data points into distinct categories by creating decision boundaries that maximize the separation between classes. Decision trees maximize information gain and assign categories based on the majority of data points [10], while ensemble models such as random forests leverage multiple decision trees and use majority votes to improve prediction accuracy [11]. Other ensemble models such as gradient boosting build a series of decision trees, each aiming to correct and improve the performance of the previous ones [12]. In comparison, our work extends beyond linear relationships in input data by incorporating deep learning models, such as recurrent neural networks, which have demonstrated high performance in capturing complex patterns and non-linear relationships in the data.

News article topic classification becomes even more challenging when addressing low-resource languages, where the available labeled data is scarce. Classical supervised learning classifiers have been developed for several languages besides English, including Arabic [10], Polish [11], Italian [12] and German [13], among others. Deep learning models have also been utilized to address news article headline classification, including convolutional neural networks and recurrent neural networks [14]. We specifically address the classification of Albanian news article titles, demonstrating the effectiveness of deep learning methods for low-resource languages.

News article classification with Albanian text corpora has also been addressed by previous research. A study comparing the effectiveness of various classifiers, such as Multinomial Naïve Bayes, Logistic Regression, SVM and others, in terms of accuracy and execution time has shown that the Passive Aggressive algorithm achieves the highest accuracy, while Random Forest performs the poorest [15]. In other work, besides traditional classifiers, bag-of-words models and hierarchical classifiers are also employed, focused on semantic and syntactic similarities between words, resulting in models that achieve high accuracy in multi-label text classification [16], [17]. In a more closely related work, a series of traditional and ensemble classification algorithms is employed on a relatively small dataset of Albanian news article headlines. Experimental results demonstrate that basic models outperform ensemble learning methods [18]. Though previous research has examined news article topic classification in Albanian relying mostly on traditional classification algorithms, our approach distinguishes itself with a large dataset of over 9000 news article titles. In addition, we also employ deep learning methods, such as recurrent neural networks, and achieve a higher accuracy on news article topic classification.

III. PROPOSED METHODOLOGY

We employ an approach with three key modules. First, we construct a dataset by web scraping news article titles spanning several categories. Second, we process the dataset by tokenizing the text and converting it to numerical tokens. Labels are also encoded into numerical values. Third, we build a variety of classification algorithms, from traditional approaches to ensemble and deep learning models, and train them with the objective of building a model to classify previously unseen news article titles. Figure 1 provides an illustration.

[Figure 1. Data Extraction Module: newspapers' online editions, web scraping, article raw data, filtering, article headlines. Data Processing Module: construct news article title dataset, preprocessing and tokenization, label encoding, train/test/validation split. Classification Module: vectorize input features, train classification model, predict new unseen data, evaluate performance.]

Fig. 1. Schematic representation of key modules in our approach

A. Data Preparation and Text Representation

We build a dataset of records by scraping the web for news article titles, including six high-interest distinct categories, namely: politics, economy, current affairs, sport, culture and lifestyle.
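The label-encoding step mentioned in the overview of the data processing module can be illustrated over these six categories. The following is a minimal sketch, not the authors' code: the category names come from the text, while the integer ids, their order, and the helper names `encode_labels` and `one_hot` are assumptions made for illustration.

```python
# Category names are taken from the text; the integer ids and their order
# are an illustrative assumption, not the encoding used by the authors.
CATEGORIES = ["politics", "economy", "current affairs", "sport", "culture", "lifestyle"]

LABEL_TO_ID = {name: idx for idx, name in enumerate(CATEGORIES)}
ID_TO_LABEL = {idx: name for name, idx in LABEL_TO_ID.items()}


def encode_labels(labels):
    """Map category names to integer ids (raises KeyError for unknown names)."""
    return [LABEL_TO_ID[label] for label in labels]


def one_hot(label_id, num_classes=len(CATEGORIES)):
    """Return a one-hot vector for one integer label."""
    return [1 if i == label_id else 0 for i in range(num_classes)]


encoded = encode_labels(["sport", "politics", "lifestyle"])
```

A one-hot form such as `one_hot(3)` is the target shape usually paired with a categorical cross-entropy loss; whether the authors used one-hot or integer targets is not stated in the paper.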

To ensure robustness of our approach, we include potentially overlapping categories, such as culture and lifestyle. An example is shown in Table I.

Once the data are assembled, to ensure uniform formatting, a preprocessing phase takes place to remove punctuation and special symbols, convert all data to lowercase and remove titles with insufficient length. We ensure a well-balanced representation of each category across the entire dataset. A human supervisor checks the quality and correctness of the data to ensure quality and relevance of the training data.

TABLE I
ILLUSTRATION OF DATA SAMPLES

Text | Topic
Liverpool mposht Atalanten por eliminohet nga Europa, Leverkusen mbetet e pathyeshme | spo
Reforma zgjedhore: vota e emigranteve, lehtesisht e arritshme | pol
Disa arsye pse mund te jeni beqare, sipas shkences | lif
Rritja ekonomike: Turizmi ne zhvillim, kete vit presim mbi 5 miliarde euro | eko
Drejtonte urbanin ne gjendje te dehur | kro
RTSH publikon vendimin e jurise | kul

Preprocessed, cleaned raw data are then transformed into an appropriate format that machine learning algorithms can be trained on. That involves tokenizing text, converting it to a numerical representation and adding padding to ensure all input sequences have the same length. Tokenization consists of splitting text into smaller units known as tokens; that is, news article titles are broken down into a series of individual words. The tokenized input is then converted into numerical representations, a form that can be understood and processed by classification algorithms. This process, known as text vectorization, aims to transform a series of tokens into numerical vectors. We leverage the Term Frequency - Inverse Document Frequency (TF-IDF) method [19], which assigns weights to words based on their relevance in a document (that is, a news article title) and across the entire dataset. This method combines term frequency, that is, word occurrence in a document, with inverse document frequency, a way to penalize terms that are common across all documents. The objective is to create a sparse matrix representation where values denote the importance of each word in the document relative to the entire dataset.

As is common in natural language processing tasks, uniformity in data size is required to effectively process text data as input. Since news article titles can vary in length, we use padding to achieve constant length for all inputs. That is, we add a series of zeros to the beginning of sequences to ensure that all inputs have the same length. Clearly, by adding a series of zeros we do not affect the semantic meaning of data points, but rather ensure efficient computation and processing. Table II provides an example of a single news article title, in its original form and then its representation after tokenization, vectorization and padding. The objective is to convert raw text data into a format that is suitable to be utilized as input for machine learning algorithms.

TABLE II
ILLUSTRATION OF TEXT TOKENIZATION AND PADDING

Original Text | Vectorized and Padded Text
tirana mposht pastër teutën dhe rikthehet tek fitorja dopietë e florent hasanit | [0 0 0 0 0 0 0 0 0 0 0 35 1 29 12 148 8 360 39 4 38 118 363 32]

B. Recurrent Neural Networks for News Article Classification

When analysing text, humans adopt an almost instinctive principle of breaking down larger pieces of content into smaller absorbable chunks, creating an internal model to remember the most relevant aspects that convey meaning and understanding. Recurrent neural networks (RNNs) refer to a deep learning approach that, in order to predict an output, relies on information from prior inputs while maintaining a state of what the network has observed up to that point [20]. RNNs possess an internal memory that empowers them to make past information persist. This characteristic has made RNNs widely applicable in natural language processing tasks such as news article topic classification. In our context, when analysing a news article title, the memory unit of an RNN will be utilized to maintain information about past words in the sequence. As new information comes in, the internal state will be continuously updated, therefore establishing connections between past and present elements in the text.

However, as shown in [21], RNNs in their basic architecture can face significant challenges in assigning the correct weight (importance) to words that are distant from each other in the sequence. This problem is known as the vanishing gradient problem, where gradients for inputs that are too distant become extremely small during the training process, thereby preventing the model from learning well. To address this issue, a variant of recurrent neural networks known as Long Short-Term Memory (LSTM) units was proposed [22]. LSTMs are able to handle long-term dependencies well, because they preserve relevant information from earlier sequences and carry it forward in the network. These types of networks can even learn from events that have a significant time lag between them.

The particularity of a long short-term memory unit is the way that the next state of the carried information is computed. The LSTM can add or remove information from the cell state based on regulations imposed by some special structures, known as gates. The role of gates is to decide if information will pass through or not. Three distinct transformations are involved, described by three types of gates that control the flow of information in the cell state. Figure 2 provides an illustration.

The forget gate has to decide which information is relevant to keep from the prior cell state. It considers the input at a given timestep $X_t$ and the hidden state $h_{t-1}$ and applies a sigmoid function, which generates a result between zero and one.

Values of zero denote information that is discarded, while values of one show information that is remembered, with anything in between being partially remembered.

Fig. 2. Illustration of Recurrent Neural Network Gates. The figure depicts the architecture of a typical LSTM cell, featuring a cell state, an input gate, a forget gate, an output gate, and a candidate cell state

The role of the input gate is to identify the elements that need to be added to the cell state and the long-term memory of the network. The input gate decides which values will be updated. To achieve that, a sigmoid function is applied on the current input $X_t$ and the hidden state $h_{t-1}$, transforming the values into the range from zero (not relevant) to one (relevant). The input gate is charged with the task of deciding what information is relevant to update in the current cell state. At this point, the network has enough information to calculate the cell state and it is ready to store the information in it.

The objective of the output gate is to decide what to output from the memory cell. It contains information on previous inputs and decides the value of the next hidden state.

1) Model Architecture: To address the news article topic classification problem, we propose a neural network architecture composed of several sequential layers, whose goal is to process vectorized text data and understand hidden relationships to perform classification for new unseen article titles.

To map each word in the input news article title to a dense vector representation, we use an embedding layer, whose goal is to capture the semantic meaning of words based on their contextual usage in the dataset. Next, we connect the embedding layer to an LSTM layer consisting of 128 memory units. The goal is to leverage the LSTM layer to capture long-term dependencies in the input sequences, enabling the algorithm to gain the relevant context for topic understanding, and ultimately accurate predictions. Next, after the LSTM layer has processed input sequences, our model uses a dense layer with 64 neuron units aiming to extract high-level features and abstract representations of the input data. This dense layer drives the classification process to discriminate between different news article topics. Finally, a dense layer with six units is used as the output layer, with the goal of performing the final classification, yielding the probability distributions that enable the model to predict the most likely category given a news article title.

2) Model Optimization: Once the architecture of the neural network is defined, the model is trained using the Adam optimizer, a well-known enhancement of gradient descent that is widely used in neural network training. Adam is renowned for its robustness in dealing with sparse gradients and noisy data. Categorical crossentropy is chosen as the loss function for our multi-class classification task. It measures the discrepancy between the predicted probability distribution and the true distribution of the classes. Because neural network architectures tend to have a larger number of trainable parameters in comparison to more traditional approaches, with the objective of avoiding overtraining, we implement early stopping as a regularization technique to optimize model performance. In addition, we employ other regularization techniques such as Dropout and Batch Normalization to ensure that the model is robust enough to learn meaningful patterns and generalizes well to previously unseen data.

C. Traditional Classification Algorithms

In our work, we consider several traditional classification models to address the task of news article title classification, including simpler models such as logistic regression, support vector machines and decision trees, and ensemble models including random forest and gradient boosting.

Logistic regression is a simple, yet effective linear model whose objective is to estimate the probability of a data point belonging to a particular category [8]. Logistic regression uses the logistic function to identify the underlying relationship between the input features (that is, word embeddings) and category outputs. Logistic regression is computationally efficient, highly interpretable and has been demonstrated to work well, especially if classes are well separated from one another.

Support vector machines, on the other hand, identify a hyperplane to separate data points into distinct categories, maximizing the distance between categories [9]. In comparison with logistic regression, SVMs are better suited to handle high-dimensional data and capture complex non-linear relationships, while logistic regression inherently assumes a linear relationship between the input features and output categories.

Another traditional classification model we utilize is a decision tree [10]. The objective of decision trees, as non-parametric models, is to partition the input space into non-overlapping regions and recursively split this space based on decision nodes represented by the most informative features. Decision trees aim to maximize information gain and assign categories based on the majority of data points within a region. Though highly interpretable, decision trees may suffer from poor generalization capabilities and be prone to overfitting.

To improve the performance of single decision trees, another common approach is to utilize random forests, an ensemble learning methodology that constructs multiple decision trees and makes a decision based on the majority vote [11]. In particular, each decision tree is trained on a random subset of the data, with the goal of introducing randomness and diversity in the ensemble. More concretely, a random forest model aggregates individual predictions to allow the model to improve its overall performance.
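The aggregation step of a random forest can be sketched as a majority vote over the labels predicted by the individual trees. This is only an illustration of the voting idea: the function name `majority_vote` and the example votes are hypothetical, and training the trees on random subsets of the data is not shown.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    if not tree_predictions:
        raise ValueError("at least one tree prediction is required")
    counts = Counter(tree_predictions)
    return counts.most_common(1)[0][0]


# Hypothetical votes from five trees for two news titles.
votes_title_a = ["sport", "culture", "sport", "lifestyle", "sport"]
votes_title_b = ["culture", "lifestyle", "culture", "culture", "lifestyle"]

label_a = majority_vote(votes_title_a)
label_b = majority_vote(votes_title_b)
```

The second example shows how a title on the border between culture and lifestyle can split the trees, with the ensemble still committing to a single label.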

Because random forests are ensemble models, they are less prone to overfitting in comparison with decision trees, and generally more robust.

Another ensemble learning method utilized in our approach is gradient boosting. The objective is to iteratively build a sequence of decision trees, each aiming to correct errors made by preceding models [12]. In the context of news article topic classification, a gradient boosting model focuses on the most informative parts of the input space, gradually improving predictive capabilities. Gradient boosting is able to capture nonlinear relationships of input data, but is computationally expensive and sensitive to the choice of hyperparameters.

IV. EXPERIMENTS AND RESULTS

A. Data Preparation

We constructed a dataset consisting of 9600 news article titles, equally distributed among six categories of 1600 titles each. The included categories, namely politics, economy, current affairs, sport, culture and lifestyle, provide a diverse representation of news topics. A balanced distribution across categories helps to train on a diverse range of topics, while minimizing bias that might result from uneven class representation. However, classification of news article headlines into these categories is far from being straightforward. Textual data are inherently ambiguous. Furthermore, several categories can overlap and the distinction between them might be blurred. For example, distinguishing cultural phenomena from lifestyle trends or, similarly, current affairs from political or economic events is challenging due to subjectivity and overlap of information categorization.

To better understand the structure of the dataset, we conducted character and token length analysis. News article titles range from 19 to 143 characters, with approximately 73 characters on average. In addition, each title is between 6 and 30 tokens long, with an average of 13 tokens. That means that news article titles are short and concise, yet contain sufficient information for classification.

The entire dataset was constructed by web scraping major online editions of Albanian newspapers. Titles were annotated with their respective category through an automated process, whenever such information was clearly available on the web. A human supervisor labeled the rest of the data and double checked the accuracy of headlines categorized automatically, to ensure accuracy and correctness of data.

B. Training Procedure and Data Splitting

The entire dataset comprising 9600 records was randomly divided into distinct subsets for training, validation and testing. In particular, following commonly established practices, 80% of the data was allocated for training and the remaining 20% for testing. To increase the robustness of the model and minimize the risk of overfitting, the training set was further divided, setting aside 25% of it for validation, which yields an overall split of 60% training, 20% validation and 20% testing. Reproducibility of results across different runs is also ensured.

C. Performance of Classification Algorithms

A series of experiments was conducted to assess the performance of various classification models. Accuracy score is used as the evaluation measure on test data. Note that accuracy is calculated as the proportion of instances classified correctly (i.e., true positives and true negatives) over the entire number of records. Figure 3 shows the performance of various classification algorithms.

Fig. 3. Performance of classification algorithms with the recurrent neural network achieving the highest accuracy and the decision tree the lowest.

The recurrent neural network architecture achieves the highest accuracy score of 90%. This performance can be attributed to several key factors. First, RNNs are well-suited to process sequential data involving text, as is the case of news article topic classification. RNN architectures are able to capture dependencies between words in a news article headline, leveraging their ability to remember information from previous tokens in the input. This capability empowers RNNs to extract the necessary context and meaning of news article titles by considering the entire headline when making predictions and not only words in isolation. Moreover, RNNs are highly adaptable and therefore such models are able to recognize subtle differences or nuances between news article titles belonging to different categories.

Besides RNNs, simpler and more traditional models such as logistic regression and support vector machines achieve a notably high accuracy score of 85%. Despite their simplicity, such models are effective when required to separate data points into distinct classes, especially when data exhibit clear boundaries between categories. That is often the case for some of the chosen categories, such as sport, politics and economy. Further digging into results shows that both models struggle more when boundaries between categories (as is the case of culture and lifestyle) are blurred and often overlapping.

In contrast, ensemble models such as random forest and gradient boosting showed poorer performance in comparison, with accuracy scores of 77% and 75%, respectively. Though generally such models are less prone to overfitting, they might struggle to distinguish between nuances present in the text. In addition, the dependence of random forest models on the mode of classes and the iterative correction of errors in gradient boosting might not be able to fully capture the overlapping nuances of various news article topic headlines, leading to lower accuracy in comparison to RNNs and logistic regression.
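The accuracy measure used in this subsection, the share of correctly classified titles among all evaluated titles, can be written as a short function. This is a minimal sketch with a hypothetical function name and made-up labels; the three-letter codes reuse the topic abbreviations shown in Table I.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Made-up example: three of four titles are classified correctly.
y_true = ["spo", "pol", "kul", "lif"]
y_pred = ["spo", "pol", "lif", "lif"]
score = accuracy(y_true, y_pred)
```

In the example, the single error confuses culture with lifestyle, the kind of boundary case the discussion above identifies as difficult.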

not be able to fully capture the overlapping nuances of various R EFERENCES
news article topic headlines, leading to lower accuracy in [1] A. Palanivinayagam, C. El-Bayeh, R. Damaševičius. “Twenty Years
comparison to RNNs and logistic regression. of Machine-Learning-Based Text Classification: A Systematic Review,”
Lastly, decision tree models achieved a low accuracy of Algorithms, 2023.
[2] A. Joshi, E. Fidalgo, E. Alegre, L. Fernández-Robles. “L DeepSumm:
only 58%. Decision trees segment the input feature space using Exploiting topic models and sequence to sequence networks for extrac-
decision nodes and categorize each news article headline based tive text summarization,” Expert Systems with Applications, 2023.
on a majority class. This procedure, however, tends to overfit [3] D. Khurana, A. Koli, K. Khatter, K, S. Singh, S. “Natural language
processing: State of the art, current trends and challenges,” Multimedia
the training data, resulting therefore in poor generalization tools and applications, 2023, vol 82, pp. 3713–3744.
capabilities, and consequently, low performance on previously [4] E. Çano, D. Lamaj, D. “AlbNews: A Corpus of Headlines for Topic
unseen test data. Modeling in Albanian, ” arXiv preprint arXiv:2402.04028, 2024.
[5] V. Blaschke, H. Schuetze, B Plank. “A survey of corpora for Germanic
lowresource languages and dialects,” In Proceedings of the 24th Nordic
D. Limitations
One limitation is the inherent ambiguity and overlap between
some class categories. Exploring various cases of misclassification
revealed that in several cases, even for a human supervisor, it
would be difficult to assign one unique category to a given news
article headline. For example, distinguishing between the culture
and lifestyle categories is challenging, as both classes share
common themes. Similarly, current events, politics and economics
topics can be ambiguous, as all three categories may intersect
with one another.

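One way to make this kind of overlap visible is to count which pairs of categories are mistaken for one another most often. The sketch below is illustrative only: the label lists are invented examples, not results from this study, and `confused_pairs` is a helper name chosen here.

```python
from collections import Counter

def confused_pairs(y_true, y_pred):
    """Count misclassifications per unordered pair of categories."""
    pairs = Counter()
    for t, p in zip(y_true, y_pred):
        if t != p:
            pairs[tuple(sorted((t, p)))] += 1
    return pairs.most_common()

# Hypothetical labels, invented for illustration (not results from the paper).
y_true = ["culture", "lifestyle", "culture", "politics",
          "economics", "politics", "lifestyle", "current events"]
y_pred = ["lifestyle", "culture", "culture", "economics",
          "economics", "current events", "culture", "politics"]

for (a, b), count in confused_pairs(y_true, y_pred):
    print(f"{a} <-> {b}: {count}")
```

Counting unordered pairs treats "culture predicted as lifestyle" and "lifestyle predicted as culture" as the same confusion, which matches the symmetric notion of overlap discussed in the text.
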
Another limitation of neural networks, despite having the
highest performance, is that they are less explainable and
interpretable, making it challenging to understand the underlying
decision-making process. To address these limitations, as future
work, we will consider refining classification boundaries by
integrating domain-specific knowledge. Furthermore, gradient-based
attribution methods and human-in-the-loop techniques can help to
increase the interpretability of deep learning models.

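As a rough illustration of what a gradient-based attribution computes, the sketch below applies gradient × input to a toy logistic scorer over word counts. This is a stand-in, not the recurrent network studied here: the placeholder vocabulary, the weights, the bias and the function names are all assumptions. For a deep model the gradient would come from automatic differentiation rather than the closed form used below.

```python
import math

# Hypothetical vocabulary and weights for a single "politics vs. rest" score.
VOCAB = ["government", "election", "concert", "festival"]
WEIGHTS = [1.2, 0.8, -0.9, -0.4]
BIAS = -0.1

def score(x):
    """Sigmoid of a linear score over word counts x."""
    z = BIAS + sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))

def gradient(x):
    """d score / d x_i = p * (1 - p) * w_i for this logistic model."""
    p = score(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def gradient_times_input(x):
    """Attribution per input feature: gradient multiplied by the input value."""
    return [g * xi for g, xi in zip(gradient(x), x)]

x = [1, 1, 0, 1]  # word counts of a hypothetical headline
for word, a in zip(VOCAB, gradient_times_input(x)):
    print(f"{word:>12}: {a:+.4f}")
```

Words that do not occur in the headline receive zero attribution, and the sign of each remaining value indicates whether the word pushes the score up or down.
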
V. CONCLUSION
This paper presented an approach that utilizes machine
learning classification algorithms to address the problem of
categorizing Albanian news article headlines into predefined
classes. A novel dataset consisting of 9600 records was
meticulously constructed and leveraged to advance research in
natural language processing for low-resource languages such
as Albanian.

The findings of this work underscore the effectiveness of
deep learning methods such as recurrent neural networks in
automated news article classification. Moreover, a wide range
of traditional machine learning approaches such as logistic
regression, support vector machines and decision trees, in
addition to ensemble methods including random forest and
gradient boosting, were evaluated on the constructed dataset.

This work opens up several potential research directions.
First, deep learning architectures can be explored further
through techniques such as attention mechanisms, or by leveraging
pre-trained large language models. Second, to overcome the
inherent ambiguity between certain news article categories,
we might explore more sophisticated models or human
intervention to better capture subtle textual nuances. Third,
expanding the current dataset in terms of records, features
and categories could increase the generalization capabilities
of our model.

