
DS Unit 1 Essay Answers.

1. Define Data science. What are the traits of Data science? Discuss the applications of Data science with suitable
examples -Mid 1 Question

Answer :

Data science
is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.
Data Science draws on computer science, statistics, machine learning, visualization and human-computer interaction to collect, clean, integrate, analyze, visualize and interact with data in order to create data products.
Traits of Data science / Big Data
1. Volume
How much data is there?
refers to the amount of data generated from many sources
2. Variety
How diverse are different types of data?
data can be structured, unstructured or semi-structured
3. Velocity
At what speed is new data generated? -> i.e. the speed at which data is generated
4. Veracity
How accurate is the data? -> i.e. how reliable the data is
5. Value
the ability to transform big data into valuable data and store it
Applications of Data science
1. Advanced Image Recognition
Eg : Face Mask detection
2. Recommendation System
Eg : YouTube video recommendations, Google News recommendations
3. Banking
Eg : Fraud detection, NPA risk modeling
4. Transport
Eg : Self driving cars

2. Write a brief note on various measures of data similarity and dissimilarity

Answer :

Data Similarity
is a numerical measure of how alike two data objects are.
It is higher when objects are more alike.
It often falls in the range [0, 1].
Data Dissimilarity
is a numerical measure of how different two data objects are.
It is lower when objects are more alike.
The minimum dissimilarity is often 0.
The upper limit varies.
Measures of Similarity/Dissimilarity for Simple Attributes
Distances
Minkowski distance
d(i, j) = ( |xi1 - xj1|^h + |xi2 - xj2|^h + ... + |xip - xjp|^h )^(1/h), where h >= 1 and p is the number of attributes

1. h = 1: Manhattan (city block, L1 norm) distance

2. h = 2: Euclidean (L2 norm) distance

3. h → infinity: "supremum" (Lmax norm, L-infinity norm) distance
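
A minimal Python sketch of the three special cases, assuming two illustrative data points (the values are not taken from the original notes):

import numpy as np

# two illustrative data points with p = 3 attributes
x = np.array([1.0, 4.0, 7.0])
y = np.array([3.0, 1.0, 5.0])

def minkowski(x, y, h):
    # general Minkowski distance: (sum of |xi - yi|^h)^(1/h)
    return np.sum(np.abs(x - y) ** h) ** (1.0 / h)

print(minkowski(x, y, 1))         # h = 1  -> Manhattan distance = 7.0
print(minkowski(x, y, 2))         # h = 2  -> Euclidean distance ~ 4.12
print(np.max(np.abs(x - y)))      # h -> infinity -> supremum distance = 3.0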

3. What is Data matrix? Explain using an example how to find a Dissimilarity matrix

Answer :

Data Matrix
is an n × p matrix representing n data points with p dimensions (attributes)

Dissimilarity matrix
is an n × n triangular matrix which represents n data points, but registers only the distances between pairs of points
Example using Euclidean distance
Considering the data matrix below

Solution
1. Calculating the Euclidean distances

2. Answer : Writing the distances in the form of a matrix (a worked sketch with sample data is shown below)
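
The original figures are not reproduced here, so the following is a minimal sketch with an assumed 4 × 2 data matrix, computing the Euclidean dissimilarity matrix:

import numpy as np

# assumed data matrix: n = 4 points, p = 2 dimensions
X = np.array([[1, 2],
              [3, 5],
              [2, 0],
              [4, 5]], dtype=float)

n = len(X)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i):
        # Euclidean distance between point i and point j
        D[i, j] = np.sqrt(np.sum((X[i] - X[j]) ** 2))

# D is lower triangular: the diagonal is 0 and only d(i, j) for i > j is stored
print(np.round(D, 2))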


4. Using an example discuss similarity of Binary variables

Answer :

Proximity Measure for Binary Attributes

For two objects i and j described by binary attributes, a 2 × 2 contingency table is built:
q = number of attributes where i = 1 and j = 1
r = number of attributes where i = 1 and j = 0
s = number of attributes where i = 0 and j = 1
t = number of attributes where i = 0 and j = 0

Dissimilarity in Binary Variables
Symmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
Asymmetric binary dissimilarity: d(i, j) = (r + s) / (q + r + s)

Similarity in Binary Variables
Asymmetric binary similarity (Jaccard coefficient): sim(i, j) = q / (q + r + s) = 1 - d(i, j)
Consider the example worked out in the sketch below
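A minimal sketch of these measures in Python, assuming two illustrative binary attribute vectors (not the original example data):

import numpy as np

# assumed binary attribute vectors for objects i and j
i = np.array([1, 0, 1, 1, 0, 0, 1])
j = np.array([1, 1, 0, 1, 0, 0, 0])

q = np.sum((i == 1) & (j == 1))   # both 1
r = np.sum((i == 1) & (j == 0))   # i = 1, j = 0
s = np.sum((i == 0) & (j == 1))   # i = 0, j = 1
t = np.sum((i == 0) & (j == 0))   # both 0

d_symmetric  = (r + s) / (q + r + s + t)   # symmetric binary dissimilarity
d_asymmetric = (r + s) / (q + r + s)       # asymmetric binary dissimilarity
jaccard      = q / (q + r + s)             # Jaccard similarity = 1 - d_asymmetric

print(q, r, s, t, d_symmetric, d_asymmetric, jaccard)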


5. Using the example table below, discuss the similarity of any two types of variables that you have identified

Answer :

Take the variables
1. the first as Half-Yearly - this is an ordinal attribute, so solve it in the same way as the 6th question
2. the second as Final - this is a numeric attribute, so use the Euclidean distance formula and solve it in the same way as the 3rd question

6. Define Proximity matrix. Find the similarity matrix for given DS Lab continuous evaluation grades (Ordinal attribute)
data set
Answer :

Proximity matrix
is a square matrix in which the entry in cell (j, k) is some measure of the similarity (or distance) between the items to
which row j and column k correspond.
Proximity matrices form the data for multidimensional scaling
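For an ordinal attribute, each value is first replaced by its rank r in {1, ..., M} and normalized as z = (r - 1) / (M - 1); the normalized values are then treated as numeric. A minimal sketch with assumed grades (not the actual DS Lab data set):

import numpy as np

# assumed ordinal grades for 4 students (M = 3 ordered states: C < B < A)
grades = ['A', 'B', 'A', 'C']
rank = {'C': 1, 'B': 2, 'A': 3}
M = 3

# normalize each rank to z = (r - 1) / (M - 1)
z = np.array([(rank[g] - 1) / (M - 1) for g in grades])

# dissimilarity matrix using |zi - zj|, then similarity = 1 - dissimilarity
n = len(z)
D = np.abs(z.reshape(n, 1) - z.reshape(1, n))
S = 1 - D
print(np.round(S, 2))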
7. What is data pre-processing and why do we need it? Explain cleaning of data in brief.

Answer :

Data preprocessing
is a technique that involves transforming raw data into a useful and efficient format so that data mining and analytics can be applied
Major Tasks in Data Preprocessing
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation and data discretization
Why Preprocess the Data?
1. Accuracy
correct or wrong, accurate or not
2. Completeness
not recorded, unavailable, …
3. Consistency
some modified but some not, dangling, …
4. Timeliness
timely update?
5. Believability
how trustworthy the data is, i.e. how far it can be believed to be correct?
6. Interpretability
how easily the data can be understood?
Explain cleaning of data in brief
Data Cleaning
real-world data can have many irrelevant and missing parts, so data cleaning is done; it involves handling missing data, smoothing noisy data and resolving inconsistencies in the data

1. Missing Data:
This situation arises when some values are missing in the dataset. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset we have is quite large and multiple values are missing within a
tuple.
Fill the Missing values:
There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or
the most probable value
2. Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc.
It can be handled in following ways :
Binning Method:
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size and then various methods are applied to each segment separately. One can replace all values in a segment by the segment mean, or the boundary values can be used (see the sketch at the end of this answer).
1. smoothing by bin means
2. smoothing by bin medians
3. smoothing by bin boundaries
Regression:
Here data can be made smooth by fitting it to a regression function.The regression used may be
1. linear (having one independent variable) or
2. multiple (having multiple independent variables).
Clustering:
This approach groups similar data into clusters. Outliers can then be detected as values that fall outside the clusters.
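
A minimal sketch of smoothing by bin means, assuming a small sorted list of values:

import numpy as np

# assumed sorted data, partitioned into 3 equal-frequency (equi-depth) bins of 3 values each
data = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bins = data.reshape(3, 3)

# smoothing by bin means: every value in a bin is replaced by that bin's mean
smoothed = np.repeat(bins.mean(axis=1), 3)
print(smoothed)    # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]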

8. Write a python code for reading a dataset and removing the NaN values or filling the NaN values.

Answer :

# Importing the libraries


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Importing dataset
df = pd.read_csv('user_data.csv')

#Checking whether null values are there or not


df.isnull().sum()

# use this command for all those attributes which have NaN values
# here 'X' is the attribute (column) which has NaN values; it is filled with a random 0/1 value
df['X'] = df['X'].fillna(np.random.randint(0, 2))

#Checking whether the null values are filled


df.isnull().sum()
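
Since the question also asks about removing the NaN values, a minimal alternative sketch (same assumed file name; 'X' is a hypothetical column):

import pandas as pd

df = pd.read_csv('user_data.csv')

# drop all rows that contain at least one NaN value
df_clean = df.dropna()

# or drop rows only when a specific column (here the hypothetical 'X') is NaN
df_clean_x = df.dropna(subset=['X'])

print(df_clean.isnull().sum())   # confirm no NaN values remain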

9. Explain data cleaning and munging with a suitable example


Answer :

Data munging, also known as data wrangling, is the data preparation process of manually transforming and cleaning the data for better decision making.
Data Munging includes the following steps:
1. Data exploration: In this process, the data is studied, analyzed and understood by visualizing representations of data.
2. Dealing with missing values: Most large datasets contain missing (NaN) values; these need to be taken care of by replacing them with the mean, mode or most frequent value of the column, or simply by dropping the rows having NaN values.
3. Reshaping data: In this process, data is manipulated according to the requirements, where new data can be added or
pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns which need to be removed or filtered out.
5. Other: After dealing with the raw dataset using the above steps we get an efficient dataset as per our requirements, which can then be used for the required purpose such as data analysis, machine learning, data visualization, model training, etc.
Eg :   Refer this link -> https://www.geeksforgeeks.org/data-wrangling-in-python/
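
A minimal pandas sketch of the munging steps above (exploration, missing values, reshaping, filtering), using a small assumed DataFrame rather than the linked example:

import pandas as pd
import numpy as np

# assumed raw data
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                   'marks': [56, np.nan, 78, 91],
                   'branch': ['CSE', 'CSE', 'ECE', 'CSE']})

print(df.describe())                                          # 1. data exploration
df['marks'] = df['marks'].fillna(df['marks'].mean())          # 2. dealing with missing values
df['result'] = np.where(df['marks'] >= 60, 'Pass', 'Fail')    # 3. reshaping: add a derived column
cse_only = df[df['branch'] == 'CSE']                          # 4. filtering unwanted rows
print(cse_only)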

10. Explain various ways of data transformation and data reduction techniques

Answer :

Data Transformation
data are transformed into forms appropriate for data analytic processing
Data transformation tasks:
1. Smoothing
Remove the noise from the data.
Techniques includes Binning, Regression, Clustering.
2. Normalization
the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0, 0.0 to 1.0
Types
1. Min-max normalization to [new_minA, new_maxA]
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA

2. Z-score normalization (μ: mean, σ: standard deviation)
v' = (v - μA) / σA

3. Normalization by decimal scaling
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
(a Python sketch of normalization is given at the end of this data transformation list)

3. Attribute construction (or feature construction), subset selection

new attributes are constructed and added from the given set of attributes to help make the analysis process more efficient
4. Aggregation
aggregation operations are applied to the data
Eg :  sales recorded per quarter can be aggregated to provide the annual sales
5. Discretization
Dividing the range of a continuous attribute into intervals
Eg :  values for numerical attributes, like age, may be mapped to higher-level concepts, like youth,
middle-aged, and senior
6. Generalization
low-level (raw) data are replaced by higher-level concepts through the use of concept hierarchies
Eg :  categorical attributes, like street, can be generalized to higher-level concepts, like city or country; numerical attributes, like age, may be mapped to youth, middle-aged and senior
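
A minimal sketch of the three normalization formulas above on an assumed column of values:

import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])   # assumed attribute values

# min-max normalization to [0, 1]
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
v_zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j, where j makes the largest absolute scaled value < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal)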

Data reduction
is a technique used to obtain a reduced representation of the data set that is much smaller in volume but yet produces
the same (or almost the same) analytical results
Data reduction strategies
1. Data compression
apply transformations to obtain reduced or compressed representation of original data
there are two types
1. Lossless
If the original data can be reconstructed from the compressed data without any loss of
information
2. Lossy
If the original data can be reconstructed from the compressed data with loss of information, then
the data reduction is called lossy
Eg :
Wavelet transforms
Principal components analysis.
2. Dimensionality reduction - remove unimportant attributes/variables; eliminate redundant attributes which are weakly important across the data.
Wavelet transforms
a wavelet transform is a linear signal processing technique that, when applied to a data vector, transforms it to a numerically different vector of wavelet coefficients. The two vectors are of the same length. When applying this technique to data reduction, we consider each tuple as an n-dimensional data vector, that is, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes
Eg : similar to a Fourier transform, the data can be reduced by keeping only the strongest wavelet coefficients

Principal Components Analysis (PCA)
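PCA projects the data onto a small number of new orthogonal axes (principal components) that capture most of the variance, so the weaker components can be dropped. A minimal sketch using scikit-learn, with randomly generated data for illustration:

import numpy as np
from sklearn.decomposition import PCA

# assumed data: 100 tuples with 5 attributes
X = np.random.rand(100, 5)

# keep only the 2 strongest principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2) -> reduced representation
print(pca.explained_variance_ratio_)  # variance captured by each component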


Feature subset selection
reduces the data set size by removing irrelevant or redundant attributes (or dimensions)
Typical heuristic attribute selection methods:
1. Best single attribute under the attribute independence assumption: choose by significance tests
2. Best step-wise( forward) feature selection:
The best single-attribute is picked first
Then next best attribute condition to the first, ...
3. Step-wise attribute( backward) elimination:
Repeatedly eliminate the worst attribute
4. Best combined attribute selection and elimination
5. Decision tree induction
Use attribute elimination and backtracking
feature creation
Create new attributes (features) that can capture the important information in a data set more effectively
than the original ones
Three general methodologies
1. Attribute extraction
Domain-specific
2. Mapping data to new space (see: data reduction)
E.g., Fourier transformation, wavelet transformation, manifold approaches (not covered)
3. Attribute construction
Combining features
Data discretization
3. Numerosity reduction- replace original data volume by smaller forms of data
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation

11. Explain the process of data discretization with a suitable example

Answer :

Typical methods of discretisation for numerical data


1. Binning
Top-down split, unsupervised
This method works on sorted data in order to smooth it. The whole data is divided into segments (bins) of equal size and then various methods are applied to each segment separately. One can replace all values in a segment by the segment mean, or the boundary values can be used (a pandas binning sketch is given at the end of this answer).
1. smoothing by bin means
2. smoothing by bin medians
3. smoothing by bin boundaries
Eg :

2. Histogram analysis
Top-down split
is an unsupervised discretization technique because it does not use class information
A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or bins.
Eg :
for the dataset
we do Histogram Analysis in below way

3. Clustering analysis 
Either top-down split or bottom-up merge, unsupervised 
4. Entropy-based discretization
supervised, top-down split 
Eg : for a worked example, refer to https://natmeurer.com/a-simple-guide-to-entropy-based-discretization/

5. Interval merging by χ² (chi-square) analysis

supervised, bottom-up merge
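
A minimal sketch of equal-width and equal-frequency binning with pandas, assuming an illustrative list of ages:

import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 22, 25, 30, 35, 40, 45, 52, 70])

# equal-width binning: the value range is split into 3 intervals of equal width
equal_width = pd.cut(ages, bins=3, labels=['youth', 'middle-aged', 'senior'])

# equal-frequency (equi-depth) binning: each bin gets roughly the same number of values
equal_freq = pd.qcut(ages, q=3, labels=['low', 'medium', 'high'])

print(pd.concat([ages, equal_width, equal_freq], axis=1))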
12. Explain the various ways of preparing the data for analysis.

Answer :

1. Questionnaire checking:
Questionnaire checking involves eliminating unacceptable questionnaires. These questionnaires may be incomplete,
instructions not followed, little variance, missing pages, past cutoff date or respondent not qualified.
2. Editing
Editing looks to correct illegible, incomplete, inconsistent and ambiguous answers.
3. Coding
Coding typically assigns alpha or numeric codes to answers that do not already have them so that statistical techniques
can be applied.
4. Transcribing 
Transcribing data involves transferring data so as to make it accessible to people or applications for further processing.
5. Cleaning 
Cleaning reviews the data for consistency. Inconsistencies may arise from faulty logic, out-of-range or extreme values.
6. Statistical adjustments 
Statistical adjustments applies to data that requires weighting and scale transformations.
7. Analysis strategy selection 
Finally, selection of a data analysis strategy is based on earlier work in designing the research project but is finalized
after consideration of the characteristics of the data that has been gathered.

13. What is the need for data visualization? Write on the libraries supported by Python for data visualization
Answer :

Data Visualization
is the graphical representation of information and data by using visual elements like charts, graphs, and maps
data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data
Need for Data Visualisation
1. To make data easier to understand and remember
2. To discover unknown facts, outliers, and trends
3. To visualize relationships and patterns quickly
4. To ask better questions and make better decisions
5. To perform competitive analysis
6. To improve insights
7. Data visualization can identify areas that need improvement or modifications
8. Data visualization can clarify which factors influence customer behavior
9. Data visualization helps you to understand which products to place where
10. Data visualization can predict sales volumes
Libraries supported by Python for data visualization
1. Matplotlib
used for exploration & data visualisation
can do charts, plots & also can be customised
2. Seaborn
visualisation for large data
has advanced plots
3. Plotly
provides high quality plots
provides more advanced plots & features than Matplotlib
4. Bokeh
5. Altair
6. ggplot

14. Write a python code for plotting 5 different graphs with an example

Answer :

The plots below are plotted using Matplotlib (a consolidated code sketch for all five plots is given at the end of this answer)


Common in all plots
Define the x-axis and corresponding y-axis values as lists.
Plot them on the canvas using the .plot() function.
Give a name to the x-axis and y-axis using the .xlabel() and .ylabel() functions.
Give a title to your plot using the .title() function.
Finally, to view the plot, use the .show() function.

1. Line Plot
Code
Output

2. Bar Chart
Code

Output
3. Histogram
Code

Output

4. Scatter plot
Code
Output

5. Pie Plot
Code

Output
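
Since the original code blocks are not reproduced above, here is one consolidated minimal sketch of all five plots (the data values are assumed for illustration):

import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

# 1. Line plot
plt.plot(x, y)
plt.xlabel('x-axis'); plt.ylabel('y-axis'); plt.title('Line Plot')
plt.show()

# 2. Bar chart
plt.bar(['A', 'B', 'C', 'D', 'E'], y)
plt.xlabel('category'); plt.ylabel('value'); plt.title('Bar Chart')
plt.show()

# 3. Histogram
marks = [45, 52, 56, 61, 65, 68, 70, 72, 75, 78, 82, 88, 91]
plt.hist(marks, bins=5)
plt.xlabel('marks'); plt.ylabel('frequency'); plt.title('Histogram')
plt.show()

# 4. Scatter plot
plt.scatter(x, y)
plt.xlabel('x-axis'); plt.ylabel('y-axis'); plt.title('Scatter Plot')
plt.show()

# 5. Pie plot
plt.pie(y, labels=['A', 'B', 'C', 'D', 'E'], autopct='%1.1f%%')
plt.title('Pie Plot')
plt.show()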

15. Write a brief note on scraping the web using the Twitter data API

Answer :

Web scraping is extracting data from websites


It is also called web crawling (bot) or web harvesting
Scraping includes actions like
1. fetching the page,
2. Parsing HTML pages
The browser parses HTML into a DOM tree.
HTML parsing involves tokenization and tree construction.
HTML tokens include start and end tags, as well as attribute names and values.
Beautiful Soup, a Python library for pulling data out of HTML and XML files, is used
It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the
parse tree
3. extracting data from pages/ web sites
Use an API
Use a web scraping tool to walk the pages and extract data
Scraping the web using the Twitter data API
The Twitter API lets you read and write Twitter data.
You can use it to compose tweets, read profiles, and access your followers' data as well as a high volume of tweets on particular
subjects in specific locations.
Process
To use the Twitter API you need to have a developer-access Twitter account.
After completing the setup and creating an app, we get keys and tokens, which help us retrieve data from Twitter.
They act as login credentials.
Save these credentials for further use.
To extract the Twitter data you should stay logged in (authenticated) until the data is extracted.
Types of credentials needed
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""
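
A minimal sketch using the tweepy library (the keys are placeholders; the exact endpoints you can call depend on your Twitter API access level, so treat the search call as illustrative):

import tweepy

# placeholder credentials from the Twitter developer app (do not hard-code these in real projects)
consumer_key = ""
consumer_secret = ""
access_token = ""
access_token_secret = ""

# authenticate with the keys and tokens
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# fetch recent tweets on a subject (hypothetical query)
for tweet in api.search_tweets(q="data science", count=10):
    print(tweet.user.screen_name, ':', tweet.text)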

16. a) List the visualization tools in Python. b) Discuss the steps needed to perform web scraping to retrieve the III-B.Tech-I sem students' results from the CVR website.

Answer :

Visualisation tools -  provide an accessible way to see and understand trends, outliers, and patterns in data
1. Matplotlib
used for exploration & data visualisation
can do charts, plots & also can be customised
2. Seaborn
visualisation for large data
has advanced plots
3. Plotly
provides high quality plots
provides more advanced plots & features than Matplotlib

Discuss the steps needed to perform web scraping to retrieve the III-B.Tech-I sem students' results from the CVR website
Instead of Amazon use CVR Website -> https://www.youtube.com/watch?v=ecAJfHHppVs
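
A minimal sketch of the steps (fetch the results page, parse the HTML, extract the table), assuming a hypothetical results URL and an HTML table layout on the CVR website:

import requests
from bs4 import BeautifulSoup

# 1. fetch the page (the URL below is hypothetical; use the actual CVR results URL)
url = 'http://results.cvr.ac.in/iii-btech-i-sem'
response = requests.get(url)

# 2. parse the HTML with Beautiful Soup
soup = BeautifulSoup(response.text, 'html.parser')

# 3. extract the data: here we assume the results are laid out in an HTML <table>
rows = []
for tr in soup.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:
        rows.append(cells)   # e.g. [roll number, name, SGPA, ...]

print(rows[:5])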

You might also like