IMP’S
By
GTU MEDIUM
Regression is defined as a statistical method that helps us analyze and understand the
relationship between two or more variables of interest. The process adopted to perform
regression analysis helps us understand which factors are important, which factors can be
ignored, and how they influence each other.
In regression, we normally have one dependent variable and one or more independent
variables. Here we try to “regress” the value of the dependent variable “Y” with the help of the
independent variables. In other words, we are trying to understand, how the value of ‘Y’
changes w.r.t change in ‘X’.
2
SEM 5 | PDS IMPS
Therefore,
y = mx + c
3.6 = 0.4 * 3 + c
We get c = 2.4
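This intercept calculation can be checked with NumPy's polyfit, a least-squares line fit; the sample points below are assumed to lie exactly on y = 0.4x + 2.4:

```python
import numpy as np

# points generated from y = 0.4 * x + 2.4 (assumed sample data)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.4 * x + 2.4

# fit a degree-1 polynomial: polyfit returns [slope, intercept]
m, c = np.polyfit(x, y, 1)
print(m, c)   # slope ≈ 0.4, intercept ≈ 2.4
```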
Supervised learning
Supervised learning, as the name indicates, has the presence of a supervisor as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well
labelled, which means some data is already tagged with the correct answer. After that, the
machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (set of training examples) and produces a correct
outcome from the labelled data.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first
step is to train the machine with all the different fruits one by one like this:
If the shape of the object is rounded and has a depression at the top,
is red in color, then it will be labeled as –Apple.
Now suppose that after training, you give the machine a new fruit from the basket, say a
Banana, and ask it to identify it.
Since the machine has already learned from the previous data, it now has to use that
learning wisely: it will first classify the fruit by its shape and color, confirm the fruit
name as BANANA, and put it in the Banana category. Thus the machine learns from
training data (the basket containing fruits) and then applies the knowledge to test
data (the new fruit).
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Types:-
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees
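The fruit example above amounts to a nearest-neighbour classifier. A minimal sketch in plain Python, with made-up shape/colour feature values:

```python
# labelled training data: (roundness, redness) -> fruit (feature values assumed)
training = [
    ((0.9, 0.8), 'Apple'),
    ((0.9, 0.9), 'Apple'),
    ((0.2, 0.1), 'Banana'),
    ((0.3, 0.2), 'Banana'),
]

def classify(features):
    """Label a new fruit by its closest labelled example (1-nearest neighbour)."""
    def dist(point):
        return sum((a - b) ** 2 for a, b in zip(point, features))
    return min(training, key=lambda item: dist(item[0]))[1]

print(classify((0.25, 0.15)))   # closest to the Banana examples
```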
Advantages:-
Supervised learning allows collecting data and produces data output from previous
experiences.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages:-
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled and allowing the algorithm to act on that information without guidance. Here the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will be given to
the machine. Therefore the machine is left to find the hidden structure in unlabeled data
by itself.
For instance, suppose the machine is given an image containing both dogs and cats that it
has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot label them as
'dogs' and 'cats'. But it can group them according to their similarities, patterns, and
differences: the picture can easily be split into two parts, the first containing all the
pictures with dogs in them and the second containing all the pictures with cats in
them. Here the machine learned nothing beforehand, which means there is no training data or examples.
It allows the model to work on its own to discover patterns and information that was
previously undetected. It mainly deals with unlabelled data.
Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
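The k-means idea from the list above can be sketched in a few lines of plain Python (1-D points and starting centres are assumed for illustration):

```python
def kmeans_1d(points, centres, iterations=10):
    """Repeatedly assign each point to its nearest centre, then move each
    centre to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centres = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centres)

points = [1, 2, 3, 10, 11, 12]          # two obvious groups
print(kmeans_1d(points, [1, 12]))       # centres converge to [2.0, 11.0]
```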
Input Data: supervised algorithms are trained using labeled data; unsupervised algorithms
are used against data that is not labeled.
Labels help people understand the significance of each axis of any graph you create.
Without labels, the values portrayed don’t have any significance. In addition to a moniker,
such as rainfall, you can also add units of measure, such as inches or centimeters, so that your
audience knows how to interpret the data shown. The following example shows how to add
labels to your graph:
import matplotlib.pyplot as plt
values = [5, 8, 9, 4, 3, 2, 5, 9, 6, 7]   # sample data (assumed)
plt.xlabel('Entries')
plt.ylabel('Values')
plt.plot(range(1,11), values)
plt.show()
The call to xlabel() documents the x-axis of your graph, while the call to ylabel() documents
the y-axis of your graph.
Annotation: You use annotation to draw special attention to points of interest on a graph. For
example, you may want to point out that a specific data point is outside the usual range
expected for a particular data set. The following example shows how to add annotation to a
graph.
import matplotlib.pyplot as plt
values = [5, 8, 9, 4, 3, 2, 5, 9, 6, 7]   # sample data (assumed)
plt.annotate(xy=[3, 9], s='Third Entry')  # 's' was renamed 'text' in newer Matplotlib
plt.plot(range(1,11), values)
plt.show()
The call to annotate() provides the labeling you need. You must provide a location for the
annotation by using the xy parameter, as well as provide text to place at the location by using
the s parameter. The annotate() function also provides other parameters that you can use to
create special formatting or placement on-screen.
Legend: A legend documents the individual elements of a plot. Each line is presented in a
table that contains a label for it so that people can differentiate between each line. For
example, one line may represent sales from the first store location and another line may
represent sales from a second store location, so you include an entry in the legend for each
line that is labeled first and second. The following example shows how to add a legend to your
plot:
import matplotlib.pyplot as plt
values = [1, 5, 8, 9, 2, 0, 3, 10, 4, 7]   # sample data (assumed)
values2 = [3, 8, 9, 2, 1, 2, 4, 7, 6, 6]
line1, = plt.plot(range(1,11), values)
line2, = plt.plot(range(1,11), values2)
plt.legend(handles=[line1, line2], labels=['First', 'Second'], loc=4)
plt.show()
The call to legend() occurs after you create the plots, not before. You must provide a handle
to each of the plots. Notice how line1 is set equal to the first plot() call and line2 is set equal to
the second plot() call.
The six most commonly used Plots come under Matplotlib. These are:
Line Plot
Bar Plot
Scatter Plot
Pie Plot
Area Plot
Histogram Plot
Line plots: are drawn by joining straight lines connecting data points where the x-axis and
y-axis values intersect. Line plots are the simplest form of representing data. In Matplotlib,
the plot() function represents this.
Example:
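A minimal line plot using plot() might look like this (the values are assumed sample data):

```python
import matplotlib
matplotlib.use('Agg')     # render off-screen; drop this line when running interactively
import matplotlib.pyplot as plt

values = [5, 8, 9, 4, 3, 2, 5, 9, 6, 7]   # sample data (assumed)
line, = plt.plot(range(1, 11), values)    # join the data points with straight lines
plt.show()
```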
Bar plots: are vertical/horizontal rectangular graphs that show data comparison where you
can gauge the changes over a period represented in another axis (mostly the X-axis). Each bar
can store the value of one or multiple data divided in a ratio. The longer a bar becomes, the
greater the value it holds. In Matplotlib, we use the bar() or barh() function to represent it.
Example:
from matplotlib import pyplot
pyplot.bar([0.25,2.25,3.25,5.25,7.25],[300,400,200,600,700],
label="Carpenter",color='b',width=0.5)
pyplot.bar([0.75,1.75,2.75,3.75,4.75],[50,30,20,50,60],
label="Plumber", color='g',width=.5)
pyplot.legend()
pyplot.xlabel('Days')
pyplot.ylabel('Wage')
pyplot.title('Details')
pyplot.show()
Scatter Plot: We can implement the scatter (previously called XY) plots while comparing various
data variables to determine the connection between dependent and independent variables.
The data gets expressed as a collection of points clustered together meaningfully. Here each
value has one variable (x) determining the relationship with the other (Y).
Example:
from matplotlib import pyplot
x1 = [1, 2.5, 3, 4.5, 5, 6.5, 7]
y1 = [1, 2, 3, 2, 1, 3, 4]
pyplot.scatter(x1, y1, label='scatter', color='k')
pyplot.xlabel('x')
pyplot.ylabel('y')
pyplot.legend()
pyplot.show()
Example:
from matplotlib import pyplot
# sample data (assumed; the original values are not shown)
slice = [12, 25, 50, 36, 19]
activities = ['Python', 'Java', 'C++', 'SQL', 'Excel']
cols = ['c', 'm', 'r', 'b', 'g']
pyplot.pie(slice,
labels =activities,
colors = cols,
startangle = 90,
shadow = True,
explode =(0,0.1,0,0,0),
autopct ='%1.1f%%')
pyplot.title('Training Subjects')
pyplot.show()
Area Plots: The area plots spread across certain areas with bumps and drops (highs and lows)
and are also known as stack plots. They look identical to the line plots and help track the
changes over time for two or multiple related groups to make it one whole category. In
Matplotlib, the stackplot() function represents it.
Example:
from matplotlib import pyplot
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]   # sample series (assumed)
eating = [2, 3, 4, 3, 2]
pyplot.stackplot(days, sleeping, eating, labels=['Sleeping', 'Eating'])
pyplot.xlabel('Days')
pyplot.ylabel('Hours')
pyplot.legend()
pyplot.show()
Histogram plot: We use a histogram when the data is distributed over continuous ranges,
whereas we use a bar graph to compare discrete entities. Histograms and bar plots look
alike but are used in different scenarios. In Matplotlib, the hist() function represents this.
Example:
from matplotlib import pyplot
pop = [22, 55, 62, 45, 21, 22, 34, 42, 42, 4, 2, 8]
bins = [1, 10, 20, 30, 40, 50]
pyplot.hist(pop, bins, histtype='bar', rwidth=0.8)
pyplot.xlabel('age groups')
pyplot.ylabel('Number of people')
pyplot.title('Histogram')
pyplot.show()
NetworkX is a Python language software package for the creation, manipulation, and study of
the structure, dynamics, and function of complex networks. It is used to study large complex
networks represented in form of graphs with nodes and edges. Using networkx we can load
and store complex networks. We can generate many types of random and classic networks,
analyze network structure, build network models, design new network algorithms and draw
networks.
Example:
import networkx as nx
import matplotlib.pyplot as plt
G1 = nx.Graph()
G1.add_edge(1, 2)
G1.add_edge(3, 2)
G1.add_edge(1, 4)
G1.add_edge(4, 2)
pos = nx.circular_layout(G1)
nx.draw(G1, pos, with_labels=True)
plt.show()
Both join() and merge() are used to combine pandas DataFrames on columns, merging all
columns from two or more DataFrames into a single DataFrame. The main difference between
join and merge is that join() combines two DataFrames on the index rather than on columns,
whereas merge() is primarily used to specify the columns to join on, while also supporting
joins on indexes and on a combination of index and columns.
import pandas as pd
Join: The join method takes two dataframes and joins them on their indexes (technically, you
can pick the column to join on for the left dataframe). If there are overlapping columns, the
join will want you to add a suffix to the overlapping column name from the left dataframe. Our
two dataframes do have an overlapping column name P.
# sample dataframes with an overlapping column 'P' (assumed)
df1 = pd.DataFrame({'P': [1, 2], 'Q': [3, 4]})
df2 = pd.DataFrame({'P': [5, 6], 'R': [7, 8]})
joined_df = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(joined_df)
merge
At a basic level, merge does more or less the same thing as join. Both methods are used to
combine two dataframes, but merge is more versatile: it requires specifying the columns to
use as a merge key. We can specify the overlapping columns with the parameter on, or
specify them separately with the left_on and right_on parameters.
# sample dataframes (assumed)
df1 = pd.DataFrame({'P': [1, 2], 'Q': [3, 4]})
df2 = pd.DataFrame({'P': [1, 2], 'R': [7, 8]})
merged_df = df1.merge(df2, on='P')
print(merged_df)
join() method is used to perform join on row indices and doesn’t support joining on
columns unless setting column as index.
merge() method is used to perform join on indices, columns and combination of these
two.
Both these methods support inner, left, right, outer join types. merge additionally
supports the cross join.
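A small sketch of the difference, using assumed sample DataFrames with an overlapping column P:

```python
import pandas as pd

df1 = pd.DataFrame({'P': [1, 2], 'Q': [3, 4]})
df2 = pd.DataFrame({'P': [1, 5], 'R': [6, 7]})

# join(): combines on the index; the overlapping column 'P' needs suffixes
j = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(j.columns.tolist())    # ['P_left', 'Q', 'P_right', 'R']

# merge(): combines on a named column; only matching 'P' values survive an inner join
m = df1.merge(df2, on='P', how='inner')
print(m)                     # one row, where P == 1
```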
Bag of words is a Natural Language Processing technique (Using Natural Language Processing,
we make use of the text data available across the internet to generate insights for the business.) of
text modelling. In technical terms, we can say that it is a method of feature extraction with
text data. This approach is a simple and flexible way of extracting features from documents.
A bag of words is a representation of text that describes the occurrence of words within a
document. We just keep track of word counts and disregard the grammatical details and the
word order. It is called a “bag” of words because any information about the order or
structure of words in the document is discarded. The model is only concerned with whether
known words occur in the document, not where in the document.
Example:
Step 1: Convert the above sentences in lower case as the case of the word does not hold
any information.
Step 2: Remove special characters and stopwords from the text. Stopwords are words
that do not carry much information about the text, such as 'is', 'a', 'the', and many more.
Although the sentences may no longer read naturally after this step, the maximum
information is contained in the remaining words.
Step 3: Go through all the words in the above text and make a list of all of the words in our
model vocabulary.
welcome
great
learning
now
start
good
practice
Now as the vocabulary has only 7 words, we can use a fixed-length document
representation of 7, with one position in the vector to score each word.
The scoring method we use here is the same as used in the previous example. For sentence 1,
the count of words is as follows:
Word Frequency
welcome 1
great 1
learning 2
now 1
start 1
good 0
practice 0
Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
For sentence 2, the count of words is as follows:
Word Frequency
welcome 0
great 0
learning 1
now 0
start 0
good 1
practice 1
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
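The whole scheme fits in a few lines of plain Python; the two sentences below are assumed reconstructions chosen to be consistent with the counts above:

```python
vocabulary = ['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']

def bag_of_words(sentence):
    """Score a sentence as word counts over the fixed vocabulary."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

# assumed sentences (already lower-cased, stopwords removed)
s1 = "welcome great learning now start learning"
s2 = "good learning practice"
print(bag_of_words(s1))   # [1, 1, 2, 1, 1, 0, 0]
print(bag_of_words(s2))   # [0, 0, 1, 0, 0, 1, 1]
```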
Pandas treat None and NaN as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful functions for detecting, removing, and
replacing null values in Pandas DataFrame :
isnull(): To check for null values in a Pandas DataFrame, we use the isnull() function. It
returns a DataFrame of Boolean values which are True for NaN values.
notnull(): To check for non-null values in a Pandas DataFrame, we use the notnull()
function. It returns a DataFrame of Boolean values which are False for NaN values.
dropna(): To drop null values from a DataFrame, we use the dropna() function. It drops
rows/columns of the dataset with null values in different ways.
fillna()
replace()
interpolate()
To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions.
These functions replace NaN values with some value of their own; all of them help in
filling null values in a DataFrame. The interpolate() function is basically used to fill NA
values in the dataframe, but it uses various interpolation techniques to fill the missing
values rather than hard-coding a value.
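A quick sketch of these functions on a small Series with missing values:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.isnull().tolist())          # [False, True, False, True, False]
print(s.fillna(0).tolist())         # [1.0, 0.0, 3.0, 0.0, 5.0]
print(s.dropna().tolist())          # [1.0, 3.0, 5.0]
print(s.interpolate().tolist())     # [1.0, 2.0, 3.0, 4.0, 5.0]  (linear by default)
```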
The groupby() function is used to group a DataFrame or Series using a mapper or by a series
of columns. A groupby operation involves some combination of splitting the object, applying
a function, and combining the results. It can be used to group large amounts of data and
compute operations on these groups.
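For example, splitting rows by a key column and summing each group (sample data assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'points': [10, 20, 5, 15],
})

# split by 'team', apply sum, combine the results
totals = df.groupby('team')['points'].sum()
print(totals['A'], totals['B'])   # 30 20
```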
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45
1) np.random.rand
np.random.rand returns a random numpy array or scalar whose element(s) are drawn
randomly from the uniform distribution over [0, 1). (including 0 but excluding 1)
Syntax
np.random.rand(d0,d1,d2,.. dn)
d0,d1,d2,.. dn (optional) – It represents the dimension of the required array given as int. It is
optional, if not specified, it will return a single python float.
In [11]:
np.random.rand(3)
Out[11]:
In [19]:
np.random.rand(5,3)
Out[19]:
In [16]:
np.random.rand(3,2,4)
Out[16]:
In [17]:
np.random.rand()
Out[17]:
0.5747916494126569
2) np.random.randn
np.random.randn returns a random numpy array or scalar of sample(s), drawn randomly from
the standard normal distribution.
Syntax
np.random.randn(d0,d1,d2,.. dn)
d0,d1,d2,.. dn (optional) – It represents the dimension of the required array given as int. It is
optional, if not specified, it will return a single python float.
In [32]:
np.random.randn(6)
Out[32]:
In [34]:
np.random.randn(6,4)
Out[34]:
In [35]:
np.random.randn(3,4,2)
Out[35]:
array([[[-0.13509054, 1.31253658],
[ 0.79514661, -0.15733937],
[-0.42428779, 0.07816613],
[[-0.49349802, -0.1929593 ],
[-0.51593638, -1.08389571],
[-0.72854643, -0.44708392],
[-0.01845007, 2.02125787]],
[[-1.8826071 , 1.65592025],
[-2.18326764, -0.07711314],
[-2.9275772 , 2.3173623 ],
[ 0.94757097, -0.13646251]]])
In [36]:
np.random.randn()
Out[36]:
-0.36506602839929475
Data Compatibility: NumPy works with numerical data; Pandas works with tabular data.
Access Methods: NumPy arrays are accessed only by index position; Pandas Series can be
accessed by index position or by index labels.
Indexing: Indexing in NumPy arrays is very fast; indexing in Pandas Series is comparatively
slow.
Operations: NumPy does not have additional functions; Pandas provides special utilities
such as groupby() to access and manipulate subsets.
External Data: NumPy arrays are generally created from data supplied by the user or by
built-in functions; Pandas objects are usually created from external data such as CSV,
Excel, or SQL.
Usage in ML and AI: Toolkits like TensorFlow and scikit-learn can be fed NumPy arrays
directly; Pandas Series cannot be fed directly as input to these toolkits.
Core Language: NumPy was initially written in C; Pandas took R's data.frame as its
reference model.
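The access-method difference in the comparison above can be seen directly:

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(arr[1])       # position only
print(ser['b'])     # label-based access
print(ser.iloc[1])  # position also works on a Series
```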
def factorial(n):
    if n == 1 or n == 0:
        return 1
    else:
        return n * factorial(n-1)

num = int(input("Enter a number: "))
print("Factorial of", num, "is", factorial(num))
Output:
Enter a number: 3
Factorial of 3 is 6
Q. Write a program to interchange the List elements on two positions entered by a user.
lst = [1, 6, 4, 8]

def swap(lst, a, b):
    lst[a], lst[b] = lst[b], lst[a]
    return lst

a = int(input('Enter 1st Index : '))
b = int(input('Enter 2nd Index: '))
print(swap(lst, a, b))
Output:
Enter 1st Index:
0
Enter 2nd Index:
2
[4, 6, 1, 8]
Q. Write a program which takes 2 digits, X,Y as input and generates a 2- dimensional array of
size X * Y. The element value in the i-th row and j-th column of the array should be i*j.
row = int(input("Input number of rows: "))
col = int(input("Input number of columns: "))
# build the X * Y array where the element in row i, column j is i*j
multi_list = [[i * j for j in range(col)] for i in range(row)]
print(multi_list)
Output:
Input number of rows: 3
Input number of columns: 4
[[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 4, 6]]
def pypart(n):
    if n == 0:
        return
    pypart(n-1)
    print("* " * n)

n = 5
pypart(n)
Output:
*
**
***
****
*****
num = 11

# If the given number is greater than 1
if num > 1:
    # Iterate from 2 to n / 2
    for i in range(2, int(num/2) + 1):
        # If num is divisible by any number between
        # 2 and n / 2, it is not prime
        if (num % i) == 0:
            print(num, "is not a prime number")
            break
    else:
        print(num, "is a prime number")
else:
    print(num, "is not a prime number")
Output:
11 is a prime number
# Import libraries
from matplotlib import pyplot as plt
import numpy as np
# Creating dataset
cars = ['AUDI', 'BMW', 'FORD',
'TESLA', 'JAGUAR', 'MERCEDES']
data = [23, 17, 35, 29, 12, 41]
# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(data, labels = cars)
# show plot
plt.show()
Output:
Q. Write a program using Numpy to count number of “p” element wise in a given array.
import numpy as np
x1 = np.array(['Python', 'PHP', 'JS', 'examples', 'html'])  # dtype=np.str was removed in newer NumPy; str dtype is inferred
print("\nOriginal Array:")
print(x1)
print("Number of ‘P’:")
r = np.char.count(x1, "P")
print(r)
Output:
Original Array:
['Python' 'PHP' 'JS' 'examples' 'html']
Number of ‘P’:
[1 2 0 0 0]
# Creating a file
file1 = open("myfile.txt", "w")
L = ["This is Delhi \n", "This is Paris \n", "This is London \n"]

# Writing data to the file (reconstructed; implied by the output below)
file1.write("Hello \n")
file1.writelines(L)
file1.close()

# Reopening the file for reading
file1 = open("myfile.txt", "r+")

# read function
print("Output of Read function is ")
print(file1.read())
print()

# seek(0) takes the file handle back to the beginning of the file
file1.seek(0)

# readline function
print("Output of Readline function is ")
print(file1.readline())
print()

file1.seek(0)
# readlines function
print("Output of Readlines function is ")
print(file1.readlines())
print()
file1.close()
Output:
Output of Read function is
Hello
This is Delhi
This is Paris
This is London
def Fibonacci(n):
    # Check if the input is valid
    if n < 0:
        print("Incorrect input")
    # Check if n is 0
    # then it will return 0
    elif n == 0:
        return 0
    # Check if n is 1, 2
    # it will return 1
    elif n == 1 or n == 2:
        return 1
    else:
        return Fibonacci(n-1) + Fibonacci(n-2)

# Driver Program
x = int(input("Enter a number:"))
print(Fibonacci(x))
Output:
Enter a number:9
34
1. Tail() Examples
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'animal':['snake', 'bat', 'tiger', 'lion',
'fox', 'eagle', 'shark', 'dog', 'deer']})
df
Out[2]:
animal
0 snake
1 bat
2 tiger
3 lion
4 fox
5 eagle
6 shark
7 dog
8 deer
In [3]:
df.tail()
Out[3]:
animal
4 fox
5 eagle
6 shark
7 dog
8 deer
In [4]:
df.tail(4)
Out[4]:
animal
5 eagle
6 shark
7 dog
8 deer
2. Shape( ) Example
import numpy as np

arr1 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])           # 2 x 4 array (assumed)
print(arr1.shape)

arr2 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # 2 x 2 x 2 array (assumed)
print(arr2.shape)
Output:
(2, 4)
(2, 2, 2)
3. Describe() Example
import pandas as pd
import numpy as np
a1 = pd.Series([1, 2, 3])
a1.describe()
Output:
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64
Covariance captures the variation across variables: we use covariance to measure how
much two variables change with each other. Correlation reveals the relation between the
variables: we use correlation to determine how strongly linked two variables are to each other.
import pandas as pd
import numpy as np

# sample dataframe (assumed)
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 8]})
print(df.var())    # Variance
print(df.cov())    # Covariance
print(df.corr())   # Correlation
Data Wrangling is the process of gathering, collecting, and transforming Raw data into another
format for better understanding, decision-making, accessing, and analysis in less time. Data
Wrangling is also known as Data Munging.
Data Wrangling is a crucial topic for Data Science and Data Analysis. The Pandas framework
of Python is used for Data Wrangling. Pandas is an open-source library specifically developed
for Data Analysis and Data Science, covering processes like data sorting and filtration, data
grouping, and so on.
1. Data exploration: In this process, the data is studied, analyzed and understood by
visualizing representations of data.
2. Dealing with missing values: Most datasets with a vast amount of data contain
missing values (NaN). They need to be taken care of by replacing them with the
mean, mode, or most frequent value of the column, or simply by dropping the rows
having a NaN value.
3. Reshaping data: In this process, data is manipulated according to the requirements,
where new data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns which
are required to be removed or filtered.
5. Other: After dealing with the raw dataset with the above functionalities we get an
efficient dataset as per our requirements and then it can be used for a required purpose
like data analyzing, machine learning, data visualization, model training etc.
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It
is used to discover trends, patterns, or to check assumptions with the help of statistical
summary and graphical representations.
Steps in EDA:
ways to detect outliers, and the removal process is the same as removing a data item
from a pandas DataFrame.
For removing an outlier, one must follow the same process of removing an entry from
the dataset using its exact position in the dataset, because all of the above methods of
detecting outliers end with a list of the data items that satisfy the outlier definition
according to the method used.
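For example, detecting outliers with the common IQR rule and then dropping them by their positions (sample data assumed; the IQR rule is one of several detection methods):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95]})   # 95 is the outlier

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1

# rows outside [q1 - 1.5*IQR, q3 + 1.5*IQR] satisfy the outlier definition
outliers = df[(df['value'] < q1 - 1.5 * iqr) | (df['value'] > q3 + 1.5 * iqr)]
cleaned = df.drop(outliers.index)
print(cleaned['value'].tolist())   # [10, 12, 11, 13, 12]
```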
Q. What is Scikit-learn?
Scikit-learn is a key library for the Python programming language that is typically used in
machine learning projects. Scikit-learn is focused on machine learning tools including
mathematical, statistical and general purpose algorithms that form the basis for many machine
learning technologies. As a free tool, Scikit-learn is tremendously important in many different
types of algorithm development for machine learning and related technologies.
Installation:
pip install scikit-learn
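A minimal scikit-learn sketch, fitting the straight line from the regression section (assumed sample data on y = 0.4x + 2.4):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # features must be 2-D
y = 0.4 * X.ravel() + 2.4

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)      # ≈ 0.4 and ≈ 2.4
```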
The bar() function is used to create a bar plot that is bounded with a rectangle depending
on the given parameters of the function.
import numpy as np
import matplotlib.pyplot as plt

# sample data (assumed)
courses = ['C', 'C++', 'Java', 'Python']
values = [20, 15, 30, 35]

plt.bar(courses, values, color='magenta')
plt.xlabel("Books offered")
plt.ylabel("No. of books provided")
plt.title("Books provided by the institute")
plt.show()
Output:
Here plt.bar(courses, values, color='magenta') is basically specifying the bar chart that is to be
plotted using "Books offered"(by the college) column as the X-axis, and the "No. of books" as
the Y-axis.
The color attribute is basically used to set the color of the bars(magenta ).
Q. What do you understand by Data visualization? Discuss some Python’s data visualization
techniques.
Data visualization provides a good, organized pictorial representation of the data, which
makes it easier to understand, observe, and analyze. In this section, we discuss how to
visualize data using Python.
Python offers several plotting libraries, namely Matplotlib, Seaborn and many other such data
visualization packages with different features for creating informative, customized, and
appealing plots to present data in the most simple and effective way.
Matplotlib and Seaborn are python libraries that are used for data visualization. They have
inbuilt modules for plotting different graphs. While Matplotlib is used to embed graphs into
applications, Seaborn is primarily used for statistical graphs.
Data Science is an interdisciplinary field that focuses on extracting knowledge from data sets
that are typically huge in amount. The field encompasses analysis, preparing data for analysis,
and presenting findings to inform high-level decisions in an organization.
A pipeline in data science is “a set of actions which changes the raw (and confusing) data
from various sources (surveys, feedbacks, list of purchases, votes, etc.), to an understandable
format so that we can store it and use it for analysis.”
The raw data undergoes different stages within a pipeline which are:
1) Fetching/Obtaining the Data: This stage involves identifying data from the internet
or internal/external databases and extracting it into useful formats.
2) Scrubbing/Cleaning the Data: This is the most time-consuming stage and requires more
effort. It is further divided into two stages:
Examining Data:
a) identifying errors
b) identifying missing values
c) identifying corrupt records
Cleaning of data:
a) replace or fill missing values/errors
3) Exploratory Data Analysis: When data reaches this stage of the pipeline, it is
free from errors and missing values, and hence is suitable for finding patterns
using visualizations and charts.
4) Modeling the Data: This is that stage of the data science pipeline where machine learning
comes to play. With the help of machine learning, we create data models. Data models are
nothing but general rules in a statistical sense, which is used as a predictive tool to enhance
our business decision-making.
5) Interpreting the Data: This is similar to paraphrasing your data science model. Always
remember: if you can't explain it to a six-year-old, you don't understand it yourself. So,
communication becomes the key!! This is the most crucial stage of the pipeline, where,
with the use of correct business domain knowledge and your storytelling abilities, you
explain your model to a non-technical audience.
6) Revision: As the nature of the business changes, there is the introduction of new features
that may degrade your existing models. Therefore, periodic reviews and updates are very
important from both business’s and data scientist’s point of view.
Beautiful Soup is a library that is used to scrape the data from web pages. It is used to parse
HTML and XML content in Python.
First of all import the requests module and the BeautyfulSoup module from bs4 as shown
below.
import requests
from bs4 import BeautifulSoup
# Url of website
url="http://170.187.134.184"
rawdata=requests.get(url)
html=rawdata.content
Now we will use html.parser to parse the content of the html and prettify it using
BeautifulSoup. Once the content is parsed, we can use the different methods of Beautiful
Soup to get the relevant data from the website.
# Parsing html content with beautifulsoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)
paragraphs = soup.find_all('p')
print(paragraphs)
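The same methods work on any HTML string, so Beautiful Soup can be tried without a network connection (the document below is made up):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo page</title></head><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)                            # Demo page
print([p.get_text() for p in soup.find_all('p')])   # ['first', 'second']
```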
Output: