Interim Report
Project Title:
by
Name:
Date of submission:
Supervisor:
Moderator:
Introduction:
The aim of this project is to investigate the most effective methods for
scraping data from open-source websites and databases, to use feature
engineering techniques to extract relevant features from the data, and to use
these features to classify the data into target labels using machine learning
and deep learning approaches. The project will also aim to compare the
performance of different machine learning and deep learning models and to
evaluate the performance of these models using various metrics.
In this report, we will first discuss the background and motivation of the
project, followed by a review of the relevant literature. We will then describe
the methodology used to collect and preprocess the data, perform feature
engineering, train and evaluate the machine learning models, and compare
the performance of different models. We will also present and discuss the
results obtained, and finally, we will provide conclusions and
recommendations for future work.
1.1 Background and Motivation:
The COVID-19 pandemic has been a global crisis, affecting millions of people
around the world. As a result, there is a wealth of data available on COVID-19
patients, including information on their symptoms, treatment, and outcomes.
This data is often publicly available on websites or databases, making it an
ideal candidate for analysis using machine learning and deep learning
techniques.
The ability to accurately classify COVID-19 patients into target labels has
several potential applications. For example, it could be used to identify high-
risk patients who require immediate medical attention, to predict the likelihood
of recovery or survival, or to monitor the spread of the disease. However, the
process of manually classifying patients is time-consuming and resource-
intensive, making it impractical for large datasets.
Machine learning and deep learning techniques have been widely used to
classify and predict outcomes in various medical fields, including COVID-19.
These techniques have several advantages, including their ability to handle
large amounts of data, identify complex patterns and relationships, and
provide accurate and consistent predictions. However, the effectiveness of
these techniques depends on the quality of the data, the features used for
classification, and the choice of model.
The motivation behind this project is to investigate the most effective methods
for scraping data from open-source websites and databases, to use feature
engineering techniques to extract relevant features from the data, and to use
these features to classify the data into target labels using machine learning
and deep learning approaches. By comparing the performance of different
models, we can identify the most effective approach for classifying COVID-19
patients, and potentially improve the accuracy of predictions and outcomes.
1.2 Outline and Overall Aim of the Project:
As stated in the introduction, this project investigates the most effective methods for scraping data from open-source websites and databases, engineering relevant features from that data, and classifying it into target labels using machine learning and deep learning approaches, with the performance of the different models compared using various metrics.
Our first step in analysing COVID-19 data is to acquire relevant,
up-to-date, and comprehensive data from a reliable website, so that
further analysis and accurate predictions can be built on it.
The next step is data analysis, in which we will analyse the
acquired COVID-19 data using tools such as Pandas, Scikit-learn,
and Matplotlib. This step also includes data cleaning, an important
part of the analysis process: converting the data into a suitable
format, handling missing values, balancing the classes using
SMOTE (Synthetic Minority Over-sampling Technique), and
re-indexing the data.
After the data has been pre-processed, the next step is feature
engineering. In this step, we will create new features based on
the existing features in the dataset. We will perform this step by
using methods such as PCA (Principal Component Analysis) or
encoding techniques such as one-hot encoding. This step aims
to convert the data into a numerical form that machine learning
algorithms can process more easily.
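As a minimal sketch of the encoding step, the snippet below one-hot encodes a categorical column with pandas; the column names and values are purely illustrative, not the project's actual schema:

```python
import pandas as pd

# Hypothetical patient records; column names are illustrative only.
df = pd.DataFrame({
    "age": [34, 61, 47],
    "gender": ["female", "male", "female"],
    "outcome": [0, 1, 0],
})

# One-hot encode the categorical column so models receive numeric input.
encoded = pd.get_dummies(df, columns=["gender"])
print(sorted(encoded.columns))
```

Each category becomes its own indicator column (here `gender_female` and `gender_male`), leaving the numeric columns untouched.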
Literature Review:
The literature review for this project will focus on the most effective techniques
for scraping data from open-source websites and databases and for using
machine learning and deep learning approaches to classify the data into
target labels.
Scraping data from open-source websites and databases can be challenging
due to the unstructured nature of the data. Several techniques have been
developed for scraping data, including web scraping tools such as Beautiful
Soup, Selenium, and Scrapy. These tools can be used to extract data from
websites and databases in a structured format.
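As a minimal sketch of this kind of extraction, the snippet below parses an inline HTML fragment with Beautiful Soup; the table layout and column names are purely illustrative, and in a real pipeline the HTML would first be fetched from the target site (for example with the requests library):

```python
from bs4 import BeautifulSoup

# Fragment standing in for a page of COVID-19 case records (illustrative only).
html = """
<table id="cases">
  <tr><th>country</th><th>confirmed</th></tr>
  <tr><td>A</td><td>120</td></tr>
  <tr><td>B</td><td>75</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
# Skip the header row, then pull each data row into a structured record.
for tr in soup.find("table", id="cases").find_all("tr")[1:]:
    country, confirmed = [td.get_text() for td in tr.find_all("td")]
    rows.append({"country": country, "confirmed": int(confirmed)})

print(rows)
```

The resulting list of dictionaries can be loaded directly into a Pandas DataFrame for the analysis steps that follow.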
Pre-processing the data is an essential step in data analysis. Data cleaning
techniques can be used to handle missing values, outliers, and other unusual
values in the data. Balancing the dataset is also important, especially if the
data is imbalanced. Techniques such as undersampling, oversampling, or
SMOTE can be used to balance the dataset. Scaling the data is also
important to bring the values to a similar range; techniques such as
MinMax scaling or z-score standardization can be used for this.
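The balancing and scaling steps can be sketched as follows. The snippet uses simple random oversampling on a toy imbalanced dataset as a stand-in for SMOTE (which lives in the third-party imbalanced-learn package and synthesises new minority samples rather than duplicating them), followed by MinMax scaling with scikit-learn:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Toy imbalanced labels: 90 negatives, 10 positives (illustrative only).
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# Random oversampling of the minority class until the classes are balanced.
minority = np.where(y == 1)[0]
extra = rng.choice(minority, size=80, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])

# MinMax scaling brings every feature into the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_bal)

print(len(y_bal), int(y_bal.sum()))
```

After oversampling, both classes have 90 samples, and every feature column is rescaled to [0, 1].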
Feature engineering is the process of creating new features based on the
existing features in the dataset. Feature engineering can be performed using
methods such as PCA or encoding techniques such as one-hot encoding.
This step aims to convert the data into a numerical form that
machine learning algorithms can process more easily.
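A brief sketch of dimensionality reduction with PCA, run here on a random numeric matrix standing in for the real feature set:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(50, 8))  # hypothetical numeric feature matrix

# Project the 8 original features down to 3 principal components.
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)
```

The reduced matrix keeps one row per sample but only the requested number of components, which can speed up training and reduce noise.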
Several machine learning and deep learning models can be used to classify
data. Some of the most commonly used machine learning models include
SVM, Random Forest, Decision Tree, Logistic Regression, and Gradient
Boosting.
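As an illustration of how such a comparison might be wired up, the sketch below trains each of the named scikit-learn models on synthetic binary-classification data (a stand-in for the real patient dataset) and reports test accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the patient dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Fit each model and record its accuracy on the held-out test set.
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

In the real study, accuracy alone would be supplemented by the precision, recall, F1, and AUC-ROC metrics discussed later.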
2.1 Introduction to deep learning:
Deep learning is a subfield of machine learning that involves training artificial
neural networks to learn from large amounts of data. The neural networks are
modeled after the human brain and can be used for a variety of tasks,
including image and speech recognition, natural language processing, and
predictive analytics. In this project, we aim to investigate the most effective
methods for classifying scraped data on Covid-19 patients using machine
learning and deep learning techniques. We will compare the performance of
different models, including SVM, Gradient Boosting, and others, to determine
the best approach for this task. Our project will involve acquiring relevant data
from a reliable website, analyzing and preprocessing the data, creating
features using feature engineering techniques, and evaluating the
performance of different models using metrics such as precision, recall, and
F1 score. The proposed project is closely related to several courses in data
analytics, database management, and machine learning.
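As a small illustration (not the project's final architecture), a feed-forward neural network can be trained on synthetic stand-in data with scikit-learn's MLPClassifier; a production deep model would more likely be built with a framework such as TensorFlow or PyTorch:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the patient data (illustrative only).
X, y = make_classification(n_samples=400, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

# A small feed-forward network with two hidden layers.
net = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500, random_state=1)
net.fit(X_tr, y_tr)

acc = net.score(X_te, y_te)
print(f"test accuracy: {acc:.3f}")
```

The hidden-layer sizes here are arbitrary; in practice they would be tuned on the validation set.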
2.2 Applications of deep learning :
Deep learning has a vast number of applications in data analytics,
including the classification task addressed in this report, "Comparative
study of models like SVM, Gradient Boosting, and others for Deep Learning
in classifying Scraped Data on Covid-19 Patients using binary
classification". In this report,
the aim is to investigate the most effective methods for scraping data from
open-source websites and databases, using feature engineering techniques
to extract relevant features from the data, and classifying the data into target
labels using machine learning and deep learning approaches. The report also
aims to compare the performance of different machine learning and deep
learning models and to evaluate the performance of these models using
various metrics. The use of deep learning in this project can help improve the
accuracy of the classification task, especially with the abundance of data
available related to COVID-19 patients. Deep learning models such as
Convolutional Neural Networks (CNNs) can be used for image classification,
which can be useful for detecting patterns in X-ray images. Additionally,
Recurrent Neural Networks (RNNs) can be used for time-series analysis,
which can be beneficial for predicting the progression of the disease. Overall,
the use of deep learning in this project can help in providing accurate
classification of data related to COVID-19 patients, leading to better
predictions and decision-making.
Research Methodology:
The model architecture for the comparative study of machine learning and
deep learning models for classifying scraped data on COVID-19 patients
using binary classification involves several steps that must be followed
systematically. The steps are:
1. Data cleaning: Check for missing values, outliers, and other unusual
   values in the data. Various techniques can be used to handle missing
   values, such as dropping them, imputing them with the mean or median, or
   using machine learning models to predict them. Outliers are
   also removed if necessary. Additionally, the data must be in the correct
   format and converted if required.
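The imputation part of this step can be sketched with pandas; the columns and values below are toy data, not the project's dataset:

```python
import numpy as np
import pandas as pd

# Toy records with missing values; columns are illustrative only.
df = pd.DataFrame({
    "age": [34, np.nan, 47, 61],
    "temp": [37.1, 38.4, np.nan, 39.0],
})

# Impute each gap with its column median; df.dropna() would instead
# discard the incomplete rows entirely.
cleaned = df.fillna(df.median())

print(int(cleaned.isna().sum().sum()))
```

Median imputation is often preferred over the mean when a column contains outliers, since the median is unaffected by extreme values.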
2. Splitting the data: Once the data has been cleaned and standardized, it
must be split into training, validation, and test sets. The training set is
used to train the machine learning models, the validation set is used to
tune the hyperparameters, and the test set is used to evaluate the final
performance of the model.
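The three-way split described above can be sketched with two calls to scikit-learn's train_test_split; the 60/20/20 proportions and the synthetic data are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# First carve out the test set, then split the remainder into
# train/validation: roughly 60% train, 20% validation, 20% test.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

The test set is held back until the very end so that the reported performance reflects genuinely unseen data.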
3. Model testing and evaluation: Once the models have been trained, they
   must be tested on the test set to evaluate their performance. Metrics
   such as the confusion matrix, precision, recall (also called
   sensitivity), and F1 score are used to provide a comprehensive
   understanding of the model's ability to make accurate predictions on
   unseen data.
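These metrics can be computed directly with scikit-learn; the labels and predictions below are hypothetical, standing in for a model's output on a held-out test set:

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical true labels and model predictions (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, cols: predicted
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)   # recall is also called sensitivity
f1 = f1_score(y_true, y_pred)

print(cm)
print(precision, recall, f1)
```

Here the confusion matrix shows 3 true negatives, 1 false positive, 1 false negative, and 3 true positives, so precision, recall, and F1 all come out to 0.75.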
3.2 Tools and libraries used:
1. Pandas: Used for data analysis and manipulation.
2. Scikit-learn: Used for preprocessing, model training, and evaluation.
3. Matplotlib: Used for data visualisation.
The above tools and libraries were used in the project titled "Comparative
study of models like SVM, Gradient Boosting and others for Deep Learning in
classifying Scraped Data on Covid-19 Patients using binary classification."
These tools were used for data analysis, cleaning, preprocessing, feature
engineering, model training, and evaluation. The project aimed to investigate
the most effective methods for scraping data from open-source websites and
databases, to use feature engineering techniques to extract relevant features
from the data, and to use these features to classify the data into target labels
using machine learning and deep learning approaches.
Comparison of models like SVM, Gradient Boosting, and others for Deep
Learning in classifying Scraped Data on Covid-19 Patients using binary
classification can be a useful tool in identifying risk factors associated with
COVID-19 patients. Through this study, we aim to show how machine
learning and deep learning techniques can be used to effectively classify
COVID-19 data into target labels.
Once the models are trained, we will evaluate their performance on the test
set using various metrics such as accuracy, precision, recall, F1 score, and
AUC-ROC curve. These metrics will provide us with a comprehensive
understanding of the model's ability to make accurate predictions on unseen
data. We will also perform a comparative study of different machine learning
models to find the best model for the given dataset.
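A sketch of this comparative evaluation using AUC-ROC, run on synthetic data as a stand-in for the real dataset and with two of the candidate models for brevity:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in dataset (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

aucs = {}
for name, model in {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=2),
}.items():
    model.fit(X_tr, y_tr)
    # AUC-ROC is computed from predicted probabilities, not hard labels.
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

best = max(aucs, key=aucs.get)
print(best, round(aucs[best], 3))
```

The model with the highest AUC-ROC ranks positive cases above negative ones most reliably, which complements the threshold-dependent metrics above.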
3.5 Results
The project titled "Comparative study of models like SVM, Gradient Boosting
and others for Deep Learning in classifying Scraped Data on Covid-19
Patients using binary classification" aims to investigate the effectiveness of
different machine learning and deep learning approaches in scraping data
from open-source websites and databases, extracting relevant features, and
classifying the data into target labels. The research question posed in this
project is how to successfully classify scraped data into selected target labels
using machine learning and deep learning techniques by contrasting the
results of various models against each other.
The project's contribution to the field is to identify the machine learning
and deep learning models that can most accurately classify scraped data
from open-source websites and databases.
The results of this study can help researchers and data scientists working on
similar projects to choose the most suitable machine learning and deep
learning models for their data classification tasks. Additionally, this project can
help in providing a better understanding of the use of machine learning and
deep learning models in data classification tasks, especially during the
ongoing Covid-19 pandemic.
Comparison with state-of-the-art:
The comparative study of models such as SVM, Gradient Boosting, and others
for deep learning in classifying scraped data on COVID-19 patients using
binary classification is positioned against existing work in this area. The
project investigates the most effective methods for scraping data from
open-source websites and databases and compares the performance of
different machine learning and deep learning models. Its approach involves
a step-by-step analysis of the COVID-19 data, feature engineering, training
machine learning models, and evaluating their performance; the objectives
cover data cleaning, balancing, standardizing, splitting, model training,
testing, and evaluation. The literature review forms the core of this
comparison, drawing insights from how different models have performed in
related studies.
The project's proposed research question and tasks are aligned with the MSc
program stream, covering key tasks in data analytics and database
management. The report's comparison with state-of-the-art models will
provide insights into the most effective deep learning techniques for
classification tasks and highlight the project's unique contribution to the
research domain.
Discussion: