Event Categorization From News Articles Using Machine Learning Techniques
Dr. KOGILAVANI S. V.
Associate Professor (Sr. G) – Department of AI
Outline
Problem Statement
Objectives
Introduction
Literature Survey
Methodology
Performance Evaluation
Conclusion
References
Problem Statement
Secondary Objectives:
1. To evaluate the performance of various machine learning algorithms, including Logistic
Regression, Gaussian Naïve Bayes, Random Forest, Multinomial Naïve Bayes, K-NN,
Decision Tree, and Support Vector Classifier, for event detection.
2. To compare the effectiveness of Count Vectorizer and TF-IDF in transforming raw text data
into numerical representations for event detection.
3. To identify the key challenges and limitations in the current approach and propose potential
enhancements for more accurate event detection.
Tertiary Objectives:
2. To investigate the scalability of the proposed system for handling large volumes of news
articles in real-time.
Introduction
Categorizing events makes it easier to find and read the
news articles we need.
Without categorization, finding a desired article is
time-consuming.
We have built a machine learning system that
categorizes articles using the words that
characterize a particular event.
The events are detected using ML
classification algorithms.
Random Forest is the best-performing
model, with an accuracy of 98.43% and a precision of 0.98.
Literature Survey
S.No | Title | Authors | Algorithm Used
1 | A Universal System for Extracting Main Events from News Articles | Felix Hamborg, Corinna Breitinger, Bela Gipp | Random Forests
2 | Event Detection from News in Indian Languages using Linear SVC | Fazlourrahman Balouchzahi, H L Shashirekha | Space Syntax Analysis, Linear Regression Model, Spatial Regression Model, OLS Regression
3 | Temporal Information Retrieval and Text Classification | Shafiq Ur Rehman Khan, Muhammad Arshad Islam | Neural Networks, Random Forests, IDW, and Kriging
4 | Detect News Event Using a Label-Based Clustering Approach | Hassan Sayyadi, Alireza Sahraei, Hassan Abolhassani | XGBoost Regression; Gradient Boost; Ensemble Learning
Methodology
We chose two datasets, BBC News Train and BBC News Test, to train
and test the ML model.
The training dataset consists of 1490 rows x 30 columns.
1. Data Pre-processing
The training data must be pre-processed before it can be fed to the
system. The category column of the training data takes the values ‘business’,
‘tech’, ‘politics’, ‘sport’, and ‘entertainment’. These
object values are label-encoded as the numerical values 0, 1, 2, 3,
and 4 so that the machine learning algorithms can be applied.
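The label-encoding step can be sketched without any external library. This is an illustration only, not the code used in the project: `encode_labels` is a hypothetical helper, and assigning codes in alphabetical order is an assumption (it mirrors how scikit-learn's `LabelEncoder` orders classes). The five categories receive the five codes 0 to 4.

```python
# Minimal sketch of label encoding for the five BBC News categories.
# encode_labels is a hypothetical helper, not taken from the slides.
def encode_labels(labels):
    # Assumption: codes follow the alphabetical order of the class names.
    classes = sorted(set(labels))
    mapping = {name: code for code, name in enumerate(classes)}
    return [mapping[label] for label in labels], mapping


# Toy stand-in for the category column of the training data.
categories = ["business", "tech", "politics", "sport", "entertainment", "tech"]
codes, mapping = encode_labels(categories)

print(mapping)
print(codes)
```

Keeping `mapping` around makes it possible to translate predicted codes back into category names later.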
2. Data Analysis
The data must be analysed before further processing. The
count of each category in the dataset is visualized in
Fig. 1 and Fig. 2.
[Fig. 1 and Fig. 2: category counts in the dataset]
COUNT VECTORIZER :
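As a rough illustration of what a count vectorizer produces, and how TF-IDF reweights it, the sketch below builds both representations in plain Python. The three toy sentences, the whitespace tokenizer, and all function names are assumptions made for this example. The smoothed IDF formula ln((1 + n) / (1 + df)) + 1 follows scikit-learn's default, and the row normalization that scikit-learn also applies is omitted for brevity.

```python
import math
from collections import Counter

# Toy documents standing in for news articles (assumed for illustration).
docs = [
    "the match ended in a draw",
    "the market ended lower",
    "the film won an award",
]


def tokenize(text):
    # Simplistic tokenizer: lowercase and split on whitespace.
    return text.lower().split()


# Vocabulary: every distinct token, in a fixed (sorted) column order.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})


def count_vector(doc):
    # One count per vocabulary term: the Count Vectorizer representation.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


count_matrix = [count_vector(doc) for doc in docs]

# Document frequency: in how many documents each term occurs.
n_docs = len(docs)
doc_freq = [
    sum(1 for row in count_matrix if row[j] > 0)
    for j in range(len(vocabulary))
]

# Smoothed inverse document frequency: rarer terms get larger weights.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

# TF-IDF: raw counts scaled by the IDF weight of each term.
tfidf_matrix = [[c * w for c, w in zip(row, idf)] for row in count_matrix]

for term, weight in zip(vocabulary, idf):
    print(f"{term:>7}  idf={weight:.3f}")
```

In this toy corpus "the" occurs in every document and so receives the lowest IDF weight, while a word such as "match" that occurs in a single document is weighted higher. That is the sense in which TF-IDF emphasizes rare terms.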
Conclusion
Utilizing machine learning techniques, such as Count Vectorizer
and TF-IDF feature extraction, proves robust for event detection
in news articles.
Count Vectorizer captures word frequencies, while TF-IDF
emphasizes rare yet significant terms.
Employing various algorithms such as Logistic Regression and
Random Forest facilitates efficient processing, with Random Forest
performing best on TF-IDF features.
The results affirm the superiority of the Random Forest model
on both the training and test data.
References
[1] Hamborg, F., Breitinger, C., & Gipp, B. (2019). Giveme5W1H: A universal system for
extracting main events from news articles. arXiv preprint arXiv:1909.02766.
[2] Balouchzahi, F., & Shashirekha, H. L. (2020, December). An Approach for Event
Detection from News in Indian Languages using Linear SVC. In FIRE (Working
Notes) (pp. 829-834).
[3] Khan, S. U. R., & Islam, M. A. (2019). Event-Dataset:
Temporal information retrieval and text classification dataset. Data in Brief, 25,
104048.
[4] Toda, H., & Kataoka, R. (2005, November). A search result clustering method
using informatively named entities. In Proceedings of the 7th annual ACM international
workshop on Web information and data management (pp. 81-86).
[5] L. Hu, B. Zhang, L. Hou, J. Li, Adaptive online event detection in news streams,
Knowledge-Based Systems 138 (2017) 105–112.
[6] J. Weng, B.-S. Lee, Event detection in Twitter, ICWSM 11 (2011) 401–408.
[7] Khodra, M.L. 2015. Event extraction on Indonesian news article using multiclass
categorization. ICAICTA 2015 - 2015 International Conference on Advanced Informatics:
Concepts, Theory and Applications (2015).
[8] Lejeune, G. et al. 2015. Multilingual event extraction for epidemic detection.
Artificial Intelligence in Medicine. (2015).
[9] R. Campos, G. Dias, A.M. Jorge, A. Jatowt, Survey of temporal information
retrieval and related applications, ACM Comput. Surv. 47 (2) (2014) 15.
[10] P.K. Choubey, K. Raju, R. Huang, Identifying the most dominant event in a news
article by mining event coreference relations, in: Proceedings of the 2018 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, vol. 2, 2018, pp. 340–345. Short Papers.
[11] S. Upadhyay, C. Christodoulopoulos, D. Roth, 'Making the news': identifying
noteworthy events in news articles, in: Proceedings of the Fourth Workshop on Events,
2016, pp. 1–7.
[12] A. Jatowt, C. Man, A. Yeung, K. Tanaka, Generic method for detecting focus time
of documents, Inf. Process. Manag. 51 (6) (2015) 851–868.
May 8, 2024
THANK YOU