1. Define Data science. What are the traits of Data science? Discuss the applications of Data science with suitable
examples -Mid 1 Question
Answer :
Data science
is a multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights
from structured and unstructured data
Data science draws on computer science, statistics, machine learning, visualization, and human-computer
interaction to collect, clean, integrate, analyze, visualize, and interact with data in order to create data products
Traits of Data science / Big Data
1. Volume
How much data is there?
refers to the amount of data generated from many sources
2. Variety
How diverse are the types of data?
data can be structured, unstructured, or semi-structured
3. Velocity
At what speed is new data generated? -> ie the speed at which data is generated
4. Veracity
How accurate is the data? -> ie how reliable the data is
5. Value
the ability to transform big data into valuable data and store it
Applications of Data science
1. Advanced Image Recognition
Eg : Face Mask detection
2. Recommendation System
Eg : YouTube and Google News recommendations
3. Banking
Eg : Fraud detection, NPA risk modeling
4. Transport
Eg : Self driving cars
Answer :
Data Similarity
is a numerical measure of how alike two data objects are.
It is higher when objects are more alike.
It often falls in the range [0, 1]
Data Dissimilarity
is a numerical measure of how different two data objects are.
It is lower when objects are more alike.
The minimum dissimilarity is often 0;
the upper limit varies
Measures of Similarity/Dissimilarity for Simple Attributes
Distances
Minkowski distance: d(x, y) = (Σ |x_i - y_i|^p)^(1/p); p = 1 gives the Manhattan distance and p = 2 the Euclidean distance
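The Minkowski distance d(x, y) = (Σ |x_i - y_i|^p)^(1/p) can be sketched in Python; p = 1 gives the Manhattan distance and p = 2 the Euclidean distance. The points used here are hypothetical:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1, 2), (4, 6)
print(minkowski(x, y, 1))  # Manhattan distance: |1-4| + |2-6| = 7.0
print(minkowski(x, y, 2))  # Euclidean distance: sqrt(9 + 16) = 5.0
```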
3. What is Data matrix? Explain using an example how to find a Dissimilarity matrix
Answer :
Data Matrix
an n x p matrix representing n data points with p dimensions (attributes)
Dissimilarity matrix
is a triangular n x n matrix which represents n data points but registers only the distance between each pair
Example using Euclidean distance
Considering the data matrix below
Solution
1. Calculate the Euclidean distances between each pair of points
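The worked example's data matrix was not preserved here, so the sketch below uses a hypothetical 4 x 2 data matrix and builds the lower-triangular dissimilarity matrix from pairwise Euclidean distances:

```python
import math

# Hypothetical data matrix: 4 points with 2 attributes each
points = [(1, 2), (3, 5), (2, 0), (4, 5)]

def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

n = len(points)
# Lower-triangular dissimilarity matrix: row i holds d(i, j) for j <= i;
# the diagonal is 0 since each point has zero distance to itself
dissim = [[round(euclidean(points[i], points[j]), 2) for j in range(i + 1)]
          for i in range(n)]
for row in dissim:
    print(row)
```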
Answer :
Take the variables one at a time:
1. Half-Yearly - this is an ordinal attribute, so solve it the same way as question 6
2. Final - this is a numeric attribute, so use the distance formula and solve it the same way as question 3
6. Define Proximity matrix. Find the similarity matrix for given DS Lab continuous evaluation grades (Ordinal attribute)
data set
Answer :
Proximity matrix
is a square matrix in which the entry in cell (j, k) is some measure of the similarity (or distance) between the items to
which row j and column k correspond.
Proximity matrices form the input data for multidimensional scaling
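For ordinal grades, a common approach maps each rank r in {1, …, M} onto z = (r - 1)/(M - 1) in [0, 1], takes dissimilarity as |z_i - z_j|, and similarity as 1 minus that. A sketch with hypothetical grades (A > B > C):

```python
# Hypothetical ordinal grades for 4 students (A > B > C)
grades = ["A", "C", "B", "A"]
rank = {"C": 1, "B": 2, "A": 3}  # ordinal ranks, M = 3
M = len(rank)

# Map each rank r onto [0, 1]: z = (r - 1) / (M - 1)
z = [(rank[g] - 1) / (M - 1) for g in grades]

# Similarity matrix: sim(i, j) = 1 - |z_i - z_j|
sim = [[round(1 - abs(z[i] - z[j]), 2) for j in range(len(z))]
       for i in range(len(z))]
for row in sim:
    print(row)
```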
7. What is data pre-processing and why do we need it? Explain cleaning of data in brief.
Answer :
Data preprocessing
is a technique that involves transforming raw data into a useful and efficient format so that data mining analytics can be
applied
Major Tasks in Data Preprocessing
1. Data cleaning
2. Data integration
3. Data reduction
4. Data transformation and data discretization
Why Preprocess the Data?
1. Accuracy
correct or wrong, accurate or not
2. Completeness
not recorded, unavailable, …
3. Consistency
some modified but some not, dangling, …
4. Timeliness
timely update?
5. Believability
how trustworthy is the data?
6. Interpretability
how easily the data can be understood?
Explain cleaning of data in brief
Data Cleaning
raw data can have many irrelevant and missing parts, so data cleaning is done; it involves handling missing
data and noisy data, and resolving inconsistencies in the data
1. Missing Data:
This situation arises when some values are missing in the dataset. It can be handled in various ways.
Some of them are:
Ignore the tuples:
This approach is suitable only when the dataset is quite large and multiple values are missing within a
tuple.
Fill in the missing values:
There are various ways to do this. You can choose to fill the missing values manually, with the attribute mean, or
with the most probable value
2. Noisy Data:
Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data
collection, data entry errors, etc.
It can be handled in the following ways:
Binning Method:
This method works on sorted data in order to smooth it. The whole dataset is divided into segments of equal size
and then various methods are applied. Each segment is handled separately: one can replace all the values
in a segment by its mean, or the boundary values can be used.
1. smoothing by bin means
2. smoothing by bin medians
3. smoothing by bin boundaries
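The binning methods above can be sketched as follows, using a hypothetical sorted price list split into three equal-frequency bins:

```python
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])  # hypothetical sorted prices
k = 3  # equal-frequency bins of size 3

bins = [data[i:i + k] for i in range(0, len(data), k)]

# Smoothing by bin means: replace every value with its bin's mean
by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace each value with the nearer boundary
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # e.g. bin [4, 8, 15] becomes [9.0, 9.0, 9.0]
print(by_bounds)  # e.g. bin [4, 8, 15] becomes [4, 4, 15]
```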
Regression:
Here data can be smoothed by fitting it to a regression function. The regression used may be
1. linear (having one independent variable) or
2. multiple (having multiple independent variables).
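A sketch of regression-based smoothing: fit a linear function to hypothetical noisy data with NumPy and replace the values with the fitted ones:

```python
import numpy as np

# Hypothetical noisy 1-D data: y roughly follows 2x + 1
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 2.8, 5.1, 6.9, 9.2, 10.8])

# Fit a linear regression y = a*x + b, then replace values with fitted ones
a, b = np.polyfit(x, y, deg=1)
smoothed = a * x + b
print(np.round(smoothed, 2))
```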
Clustering:
This approach groups similar data into clusters. Outliers fall outside the clusters and can thus be
detected.
8. Write a python code for reading a dataset and removing the NaN values or filling the NaN values.
Answer :
# Importing libraries and dataset
import pandas as pd
import numpy as np

df = pd.read_csv('user_data.csv')

# Use this for every attribute that has NaN values
# (here X is a placeholder for the column name of that attribute)
df['X'] = df['X'].fillna(np.random.randint(0, 2))
Data munging, also known as data wrangling, is the data preparation process of manually transforming and
cleansing/cleaning the data for better decision making.
Data Munging includes the following steps:
1. Data exploration: In this process, the data is studied, analyzed and understood by visualizing representations of data.
2. Dealing with missing values: Most large datasets contain missing or NaN values; these need
to be taken care of by replacing them with the mean, the mode, or the most frequent value of the column, or simply by
dropping the rows having NaN values.
3. Reshaping data: In this process, data is manipulated according to the requirements, where new data can be added or
pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns, which need to be removed or
filtered out
5. Other: After dealing with the raw dataset using the above steps, we get an efficient dataset as per our
requirements; it can then be used for purposes like data analysis, machine learning, data visualization,
model training, etc.
Eg : Refer this link -> https://www.geeksforgeeks.org/data-wrangling-in-python/
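The wrangling steps above can be sketched with pandas on a small hypothetical dataset (the column names are made up for illustration):

```python
import pandas as pd
import numpy as np

# Hypothetical raw dataset with a missing value and an unwanted column
df = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meena"],
    "marks": [85.0, np.nan, 72.0],
    "junk": [0, 0, 0],
})

df["marks"] = df["marks"].fillna(df["marks"].mean())  # dealing with missing values
df = df.drop(columns=["junk"])                        # filtering unwanted columns
df["grade"] = np.where(df["marks"] >= 80, "A", "B")   # reshaping: adding a derived column
print(df)
```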
10. Explain various ways of data transformation and data reduction techniques
Answer :
Data Transformation
data are transformed into forms appropriate for data analytic processing
Data transformation tasks:
1. Smoothing
Remove the noise from the data.
Techniques include Binning, Regression, and Clustering.
2. Normalization
the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0
Types
1. Min-max normalization to [new_minA , new_maxA ]:
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
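Min-max normalization rescales each value v via v' = ((v - min_A)/(max_A - min_A)) * (new_max_A - new_min_A) + new_min_A. A sketch with hypothetical attribute values:

```python
def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max normalization: scale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

incomes = [200, 300, 400, 600, 1000]  # hypothetical attribute values
print(min_max(incomes))               # scaled to [0, 1]
```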
Data reduction
is a technique used to obtain a reduced representation of the data set that is much smaller in volume yet produces
the same (or almost the same) analytical results
Data reduction strategies
1. Data compression
apply transformations to obtain reduced or compressed representation of original data
there are two types:
1. Lossless
if the original data can be reconstructed from the compressed data without any loss of
information, the compression is called lossless
2. Lossy
if the original data can only be reconstructed from the compressed data with some loss of information,
the data reduction is called lossy
Eg :
Wavelet transforms
Principal components analysis.
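Lossless compression can be illustrated with Python's standard zlib module: the decompressed bytes match the original exactly, which is what distinguishes it from lossy schemes:

```python
import zlib

# Lossless compression: the original bytes are recovered exactly
original = b"data reduction " * 50
compressed = zlib.compress(original)

print(len(original), "->", len(compressed), "bytes")
assert zlib.decompress(compressed) == original  # no information lost
```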
2. Dimensionality reduction - remove unimportant attributes/variables; eliminate redundant attributes that are weakly
important across the data.
Wavelet transforms
is a linear signal-processing technique that, when applied to a data vector, transforms it into a numerically
different vector of wavelet coefficients. The two vectors are of the same length. When applying this
technique to data reduction, we consider each tuple as an n-dimensional data vector, that is,
X = (x1, x2, …, xn), depicting n measurements made on the tuple from n database attributes
Eg : using Fourier transform to reduce the data
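A sketch of Fourier-based reduction with NumPy: keep only the few largest frequency coefficients of a signal and reconstruct an approximation from them (a lossy reduction; the signal here is made up and chosen to be exactly periodic):

```python
import numpy as np

# Hypothetical signal: one tuple viewed as an n-dimensional data vector
t = np.linspace(0, 4 * np.pi, 64, endpoint=False)
x = np.sin(t) + 0.05 * np.cos(10 * t)

coeffs = np.fft.rfft(x)
k = 4                                   # keep only the k largest coefficients
keep = np.argsort(np.abs(coeffs))[-k:]
reduced = np.zeros_like(coeffs)
reduced[keep] = coeffs[keep]

# Reconstruct an approximation from the few retained coefficients
approx = np.fft.irfft(reduced, n=len(x))
error = np.max(np.abs(x - approx))
print("max reconstruction error:", round(error, 6))
```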
Answer :
2. Histogram analysis
Top-down split; this is an unsupervised discretization technique because it does not use class information.
A histogram partitions the values of an attribute, A, into disjoint ranges called buckets or bins.
Eg : for a given dataset, partition the attribute's values into buckets (e.g. equal-width ranges) and record the count of values in each bucket
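A sketch of histogram analysis with equal-width buckets on a hypothetical list of prices:

```python
from collections import Counter

# Hypothetical attribute values (e.g. prices)
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25, 28, 30]

width = 10  # equal-width buckets: [0-10), [10-20), [20-30]
# clamp the top value into the last bucket so 30 lands in [20-30]
buckets = Counter(min((p // width) * width, 20) for p in prices)
for lo in sorted(buckets):
    print(f"[{lo}-{lo + width}): {buckets[lo]} values")
```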
3. Clustering analysis
Either top-down split or bottom-up merge, unsupervised
4. Entropy-based discretization
supervised, top-down split
Eg : if want example then refer this https://natmeurer.com/a-simple-guide-to-entropy-based-discretization/
Answer :
1. Questionnaire checking:
Questionnaire checking involves eliminating unacceptable questionnaires. These questionnaires may be incomplete,
instructions not followed, little variance, missing pages, past cutoff date or respondent not qualified.
2. Editing
Editing looks to correct illegible, incomplete, inconsistent and ambiguous answers.
3. Coding
Coding typically assigns alpha or numeric codes to answers that do not already have them so that statistical techniques
can be applied.
4. Transcribing
Transcribing data involves transferring data so as to make it accessible to people or applications for further processing.
5. Cleaning
Cleaning reviews the data for consistency. Inconsistencies may arise from faulty logic, or from out-of-range or extreme values.
6. Statistical adjustments
Statistical adjustments apply to data that require weighting and scale transformations.
7. Analysis strategy selection
Finally, selection of a data analysis strategy is based on earlier work in designing the research project but is finalized
after consideration of the characteristics of the data that has been gathered.
13. What is the need for data visualization? Write on the libraries supported by Python for data visualizations
Answer :
Data Visualization
is the graphical representation of information and data by using visual elements like charts, graphs, and maps
data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data
Need for Data Visualisation
1. To make data easier to understand and remember
2. To discover unknown facts, outliers, and trends
3. To visualize relationships and patterns quickly
4. To ask better questions and make better decisions
5. To analyze competitors
6. To improve insights
7. Data visualization can identify areas that need improvement or modification
8. Data visualization can clarify which factors influence customer behavior
9. Data visualization helps you to understand which products to place where
10. Data visualization can predict sales volumes
Libraries supported by Python for data visualizations
1. Matplotlib
used for exploration & data visualisation
can do charts, plots & also can be customised
2. Seaborn
visualisation for large data
has advanced plots
3. Plotly
provides high-quality plots
provides more advanced plots & features than Matplotlib
4. Bokeh
5. Altair
6. ggplot
14. Write a python code for plotting 5 different graphs with an example
Answer :
1. Line Plot
2. Bar Chart
3. Histogram
4. Scatter plot
5. Pie Plot
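The original code and output figures were not preserved, so here is one possible sketch of the five plots with Matplotlib (the data values are made up; the Agg backend renders off-screen, so use plt.show() instead of savefig for interactive display):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen (no display needed)
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y = [2, 4, 1, 5, 3]

fig, axes = plt.subplots(1, 5, figsize=(20, 3))
axes[0].plot(x, y)                           # 1. line plot
axes[1].bar(x, y)                            # 2. bar chart
axes[2].hist([1, 2, 2, 3, 3, 3, 4], bins=4)  # 3. histogram
axes[3].scatter(x, y)                        # 4. scatter plot
axes[4].pie(y, labels=["a", "b", "c", "d", "e"])  # 5. pie plot
fig.savefig("five_plots.png")
```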
15. Write a brief note on scraping the web using the Twitter data API
Answer :
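In outline, scraping Twitter data via its API involves: (1) creating a Twitter developer account and obtaining credentials (e.g. a bearer token), (2) authenticating a client, (3) sending search queries to the API, and (4) parsing the returned tweets. A hedged sketch using the tweepy library; the actual API call is shown commented out because it needs real credentials, and the build_query helper is purely illustrative, not part of any official API:

```python
def build_query(keywords, lang="en"):
    """Build a Twitter search query string (pure helper, no network)."""
    return "(" + " OR ".join(keywords) + f") lang:{lang} -is:retweet"

query = build_query(["data science", "machine learning"])
print(query)

# The actual API call needs a developer account and a bearer token (assumed
# here to be in the TWITTER_BEARER_TOKEN environment variable), e.g. with tweepy:
#
#   import os, tweepy
#   client = tweepy.Client(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])
#   resp = client.search_recent_tweets(query=query, max_results=10)
#   for tweet in resp.data or []:
#       print(tweet.text)
```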
16. a) List the visualization tools in Python. b) Discuss the steps needed to perform web scraping to retrieve the III-
B.Tech-I sem students' results from the CVR website.
Answer :
Visualisation tools - provide an accessible way to see and understand trends, outliers, and patterns in data
1. Matplotlib
used for exploration & data visualisation
can do charts, plots & also can be customised
2. Seaborn
visualisation for large data
has advanced plots
3. Plotly
provides high-quality plots
provides more advanced plots & features than Matplotlib
Discuss the steps needed to perform web scraping to retrieve the III-B.Tech-I sem students' results from the CVR website
Instead of Amazon, use the CVR website -> https://www.youtube.com/watch?v=ecAJfHHppVs
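In outline the steps are: (1) inspect the results page to find where the data sits in the HTML, (2) fetch the page (e.g. with requests or urllib), (3) parse the HTML, (4) extract roll numbers and results, and (5) store them (CSV, DataFrame, database). Since the real CVR page layout is unknown here, the sketch below parses a made-up results table using only the standard-library HTMLParser:

```python
from html.parser import HTMLParser

# Hypothetical snippet of a results page (the real CVR page layout is unknown)
SAMPLE = """
<table>
  <tr><td>20B81A0501</td><td>8.2</td></tr>
  <tr><td>20B81A0502</td><td>7.5</td></tr>
</table>
"""

class ResultParser(HTMLParser):
    """Collect the text of every <td> cell; the table has two cells per row."""
    def __init__(self):
        super().__init__()
        self.cells, self.in_td = [], False
    def handle_starttag(self, tag, attrs):
        self.in_td = tag == "td"
    def handle_endtag(self, tag):
        self.in_td = False
    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = ResultParser()
parser.feed(SAMPLE)
# Pair every odd cell (roll number) with the following even cell (result)
results = dict(zip(parser.cells[::2], parser.cells[1::2]))
print(results)
```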