
DSBDA - Data Science and Big Data Analysis

Mini Project

“A Case Study on Chatbot for Student Information with Data Science”

Submitted by

Yashowardhan Shinde (TC69)


Rutuja Jangle (TC48)
Aryan Kenchappagol (TC14)
Akalbir Singh Chadha (TC04)

THIRD YEAR COMPUTER ENGINEERING

Under the Guidance of

Prof. Ashwini Jarali

Department of Computer Engineering
Hope Foundation's International Institute of Information Technology
Hinjawadi, Pune – 411057
AY 2020-2021
Semester-6
Introduction

The Data Science Lifecycle centers on applying machine learning and other
analytical methodologies to extract insights and predictions from data in
pursuit of a business goal. The complete process includes several stages,
among them data cleaning, preparation, modeling, and model evaluation, and
it can take many months to finish.
A data science life cycle is the series of steps you go through to complete
a project or analysis. Because each data science project and team is unique,
each data science life cycle is also unique; most data science projects,
however, follow a similar generic life cycle.

1. Phase 1: Discovery

A central objective of the discovery phase is to identify the key business
variables that the analysis needs to predict. We refer to these variables as the
model targets, and we use the metrics associated with them to determine the
success of the project. Two examples of such targets are sales forecasts or the
probability of an order being fraudulent.
Define the project goals by asking and refining "sharp" questions that are
relevant, specific, and unambiguous. Data science is a process that uses
names and numbers to answer such questions. We typically use data science
or machine learning to answer five types of questions:

1. How much or how many? (regression)


2. Which category? (classification)
3. Which group? (clustering)
4. Is this weird? (anomaly detection)
5. Which option should be taken? (recommendation)

Determine which of these questions you're asking and how answering it
achieves your business goals.

The initial problem statement was to build a chatbot that displays students'
results and test analysis. A chatbot can assist in increasing student
engagement.

Engagement can be driven by user data and deepened by deploying
conversational AI chatbots. Furthermore, bots can provide consistent
responses, which helps you avoid giving customers useless information.
Customers will stay longer on your website and continue the conversation if
you provide meaningful and timely responses.

Initial Hypothesis -
1. Examine the data and develop ideas for our chatbot model by
identifying the goals and motives of the project.
2. Explore Plotly's various features for developing the dashboard of
our project.
3. Search for the data required by the user and display it on the
dashboard as per the requirements.
2. Phase 2: Data Preparation

Data Preparation is an approach similar to initial data analysis, whereby a
data analyst uses visual exploration to understand what is in a dataset and
the characteristics of the data. Preparation comes before any statistical
analysis or machine learning model: we need to extract, load, and transform
the dataset before performing any further operations on it. This is critical
because summary indicators, such as the mean and standard deviation, hide an
insidious danger. Simpson's paradox is a well-known example of how such
global indicators can be superficial and misleading. It is of course an
academic example, but something similar can also happen in the real world.

Data Preparation happens inside an analytical sandbox, where a data analyst
uses visual exploration to understand what is in a dataset. Of course, it is
more complex than this. Imagine reading a huge table with thousands of rows
and tens of columns, full of numbers: you are visually exploring the data,
but there is no way you will gain any insight, because we are not designed
to crunch huge tables of numbers. We are great at reading the world in terms
of shapes, dimensions, and colors, and that is what data visualization
enables; once translated into lines, points, and angles, numbers are far
easier to read.

To better understand the nature of the data, data analysts use data
visualization and statistical tools to convey dataset characteristics such
as size, amount, and accuracy.

The data sources for this project are CSV files containing the students'
results, including the subjects, seat number, student name, internal and
theory marks, and grades. This data is then cleaned and made
visualization-ready for further analysis.
Data preparation involves data cleaning. Data scientists spend a large amount
of their time cleaning datasets and getting them down to a form with which
they can work. In fact, a lot of data scientists argue that the initial steps of
obtaining and cleaning data constitute 80% of the job.

Therefore, if you are just stepping into this field, or planning to, it is
important to be able to deal with messy data, whether that means missing
values, inconsistent formatting, malformed records, or nonsensical outliers.

Data Cleaning Includes:

● Dropping unnecessary columns in a DataFrame
● Changing the index of a DataFrame
● Using .str accessor methods to clean columns
● Using the DataFrame.applymap() function to clean the entire
dataset, element-wise
● Renaming columns to a more recognizable set of labels
● Skipping unnecessary rows in a CSV file

In our dataset, the raw data was very noisy, with multiple data types,
missing columns, and a complex file structure. Our data was both structured
and unstructured: the student result data was structured, while the chats
used to train the chatbot were unstructured. To clean the data we used
pandas and Excel filtering, and we segregated the data according to criteria
such as semester marks, summary, subjects, and all-student data. This made
it efficient to query the data whenever needed; a sketch of this kind of
cleaning follows.
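
The report does not include the actual cleaning script, so the following is
a minimal pandas sketch of the steps listed above; the file name, column
names, and the dropped "Remark" column are hypothetical:

import pandas as pd

# Hypothetical file and column names; a sketch, not the project's exact script
df = pd.read_csv("results_2018_19.csv", skiprows=2)  # skip unnecessary header rows

# Strip stray whitespace from every string cell, element-wise
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)

df = df.drop(columns=["Remark"])                     # drop an unneeded column
df = df.rename(columns={"SEAT_NO": "seat_no",        # more recognizable labels
                        "NAME": "name"})
df["name"] = df["name"].str.title()                  # .str accessor cleanup
df = df.set_index("seat_no")                         # change the index

# Segregate by criteria, e.g. one frame per subject, for efficient querying
by_subject = {subject: frame for subject, frame in df.groupby("SUBJECT")}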

3. Phase 3: Model Planning


Model Planning is the phase where we decide which methods, techniques, and
workflow to use for building our project. It is a prerequisite for the model
building phase, since the workflow is defined in this step itself.

The technologies we used to build our model, which integrates a chatbot with
a dashboard that displays students' exam results according to their seat
numbers, are as follows:

1. Chatbot - We used a Flask app to build our chatbot. The main libraries
are the Natural Language Toolkit (NLTK) and PyTorch, with a deep neural
network as the model.
2. Dashboard - We used a Dash app to build our dashboard, with Plotly for
visualization.

The common tools used in our project are the Python programming language,
Flask, Dash, and Plotly.
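
The report does not show how the two apps are wired together, so here is a
minimal sketch, assuming the standard Flask and Dash APIs, of how a Dash
dashboard can share one Flask server with a chatbot route; the route name
and texts are illustrative:

import dash
from dash import html
from flask import Flask

server = Flask(__name__)  # Flask app serving the chatbot routes

@server.route("/chat")
def chat():
    # Placeholder endpoint; the real one would run the chatbot model
    return {"reply": "Hello! Enter your seat number to see your results."}

# Mount the Dash dashboard on the same Flask server under /dashboard/
app = dash.Dash(__name__, server=server, url_base_pathname="/dashboard/")
app.layout = html.Div([html.H2("Student Result Dashboard")])

if __name__ == "__main__":
    server.run(debug=True)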

4. Phase 4: Model Building

Model building is the phase that uses the technologies chosen in the Model
Planning phase and follows the defined workflow to build the chatbot model.
Although the modeling techniques and logic required to develop models can
be highly complex, the actual duration of this phase can be short compared to
the time spent preparing the data and defining the approaches. Creating
robust models that are suitable to a specific situation requires thoughtful
consideration to ensure the models being developed ultimately meet the
objectives outlined in Phase 1.

The common tools that we used in this phase of model building are Visual
Studio Code and Live Server.
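
The report does not spell out the network itself, so the following is a
hedged sketch of a common NLTK-plus-PyTorch chatbot design: a bag-of-words
feed-forward classifier over tokenized, stemmed chat patterns. The
vocabulary, intents, and layer sizes are illustrative assumptions:

import nltk  # assumes nltk.download("punkt") has been run
import torch
import torch.nn as nn
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def bag_of_words(sentence, vocabulary):
    # 1.0 for each vocabulary word present in the tokenized, stemmed sentence
    words = {stemmer.stem(w.lower()) for w in nltk.word_tokenize(sentence)}
    return torch.tensor([1.0 if v in words else 0.0 for v in vocabulary])

class IntentClassifier(nn.Module):
    # Small feed-forward network mapping a bag-of-words vector to an intent
    def __init__(self, vocab_size, hidden, n_intents):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vocab_size, hidden), nn.ReLU(),
            nn.Linear(hidden, n_intents),
        )

    def forward(self, x):
        return self.net(x)  # raw logits; train with nn.CrossEntropyLoss

# Illustrative usage with a toy vocabulary and two intents
vocab = ["result", "hello", "seat", "number"]
model = IntentClassifier(len(vocab), hidden=8, n_intents=2)
logits = model(bag_of_words("Hello, show my result", vocab))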
The most important thing in model building is data visualization, which
helps in discovering trends in the data. After all, it is much easier to
observe trends when all the data is laid out in front of you in visual form
than in a table. For example, a Tableau dashboard of sales data might show
the sum of sales made by each customer in descending order, with red
denoting loss and gray denoting profit. From such a visualization it is easy
to see that even customers with huge sales may still be at a loss, which
would be very difficult to observe from a table.
The different data visualizations used in our Data Science Life Cycle are:

1. Bar charts
A barplot (or barchart) is one of the most common types of graphics. It
shows the relationship between a numeric and a categorical variable: each
entity of the categorical variable is represented as a bar, and the size of
the bar represents its numeric value. In our dashboard, a bar chart is used
to display the minimum, maximum, and average marks for the subjects given in
the Excel sheet.
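
A minimal Plotly sketch of such a grouped bar chart; the subject names and
marks are made up for illustration:

import plotly.graph_objects as go

subjects = ["DSBDA", "CNS", "WT"]  # illustrative subject names
fig = go.Figure([
    go.Bar(name="Min", x=subjects, y=[22, 18, 25]),
    go.Bar(name="Max", x=subjects, y=[68, 70, 66]),
    go.Bar(name="Average", x=subjects, y=[47, 51, 44]),
])
fig.update_layout(barmode="group", yaxis_title="Marks")
fig.show()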

2. Pie chart
A pie chart is a circle divided into slices. Each slice represents a
numerical value, with its size proportional to that value.

Pie chart types:

● Circle chart: a regular pie chart.
● 3D pie chart: the chart has a 3D look.
● Donut pie chart: the center circle of the pie chart is missing, giving
the chart the shape of a donut.
● Exploded pie chart: one slice is separated from the chart to draw focus
to it.

In this project we have used a pie chart to display how many students belong
to a specific class, i.e. Distinction, First Class, Pass, Fail, etc.
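
A minimal Plotly sketch of such a grade-distribution pie chart, with
illustrative counts:

import plotly.graph_objects as go

classes = ["Distinction", "First Class", "Pass", "Fail"]  # class labels
counts = [34, 51, 20, 5]                                  # illustrative counts
fig = go.Figure(go.Pie(labels=classes, values=counts))
fig.show()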
3. Scatter Plots
A scatter plot (also called a scatter graph, scatter chart, scattergram, or
scatter diagram) is a type of plot or mathematical diagram that uses
Cartesian coordinates to display values for, typically, two variables in a
set of data. In this project we have used scatter plots to compare the
different subjects with each other and to find the correlation between two
subjects.
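
A sketch of comparing two subjects' marks with a Plotly scatter plot; the
subject names and marks are assumptions:

import pandas as pd
import plotly.express as px

# Illustrative marks for two subjects; real data comes from the cleaned CSV
df = pd.DataFrame({"DSBDA": [45, 60, 52, 38, 66],
                   "CNS": [48, 58, 50, 41, 63]})
print(df["DSBDA"].corr(df["CNS"]))  # Pearson correlation between the subjects
fig = px.scatter(df, x="DSBDA", y="CNS")  # each point is one student
fig.show()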

4. Histograms
A histogram is a graphical representation that organizes a group of data
points into user-specified ranges. Similar in appearance to a bar graph, the
histogram condenses a data series into an easily interpreted visual by
taking many data points and grouping them into logical ranges or bins.
Here, we have used histograms to display the results of each individual
subject.
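
A minimal Plotly sketch of a per-subject marks histogram; the marks are
illustrative:

import plotly.express as px

marks = [35, 42, 47, 51, 55, 58, 60, 62, 66, 70]  # illustrative subject marks
fig = px.histogram(x=marks, nbins=5, labels={"x": "Marks"})
fig.show()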

5. Phase 5: Communicate Results

After executing the model, we compared the outcomes of various models to
decide whether our model is a success or a failure.
Our chatbot successfully displayed student results according to the choices
entered by the student, retrieving the data from the dataset, and our
dashboard successfully displayed the results summary for the academic year
2018-19 in several formats: subject-wise results, overall results,
semester-wise results, and so on.
We can check the result summary simply by choosing a subject; it is then
displayed in the form of scatter plots and histograms, while the grade-wise
distribution is displayed as a pie chart.
Our chatbot model gives accurate results on both the training and testing
datasets in a very short time.
We can use this chatbot in our college system, making it easy for all
students to check their results anytime by visiting the college portal and
interacting with the chatbot; they can also compare their results with the
overall result summary displayed on the dashboard.

6. Phase 6: Operationalize

The key observations of our project include the following points:


● Students can obtain their results for any particular semester or for the
overall academic year through our chatbot.
● We can easily check and compare the result data of either a single
student or the whole batch for any particular academic year.
● Our project benefits students as well as teachers and the college
authorities: the faculty can keep track of all student result data and
retrieve any of it at any time with the help of this project model.

Following are a few snapshots of our project results. The chatbot can be
seen showing a student's results after they enter their seat number. The
dashboard shows generalized results in pie charts, histograms, bar graphs,
and scatter plots.
