IMP’S
By
GTU MEDIUM
Regression is defined as a statistical method that helps us analyze and understand the
relationship between two or more variables of interest. The process adopted to perform
regression analysis helps us understand which factors are important, which factors can be
ignored, and how they influence each other.
In regression, we normally have one dependent variable and one or more independent
variables. Here we try to “regress” the value of the dependent variable “Y” with the help of the
independent variables. In other words, we are trying to understand, how the value of ‘Y’
changes w.r.t change in ‘X’.
2
SEM 5 | PDS IMPS
Therefore,
y = mx + c
3.6 = 0.4 * 3 + c
We get c = 2.4
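This intercept calculation can be checked with NumPy's polyfit, a least-squares line fit; the sample points below are assumed to lie exactly on y = 0.4x + 2.4:

```python
import numpy as np

# points generated from y = 0.4 * x + 2.4 (assumed sample data)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 0.4 * x + 2.4

# fit a degree-1 polynomial: polyfit returns [slope, intercept]
m, c = np.polyfit(x, y, 1)
print(m, c)   # slope ≈ 0.4, intercept ≈ 2.4
```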
Supervised learning
Supervised learning, as the name indicates, has the presence of a supervisor as a teacher.
Basically, supervised learning is when we teach or train the machine using data that is well
labelled, which means some data is already tagged with the correct answer. After that, the
machine is provided with a new set of examples (data) so that the supervised learning
algorithm analyses the training data (set of training examples) and produces a correct
outcome from the labelled data.
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first
step is to train the machine with all the different fruits one by one like this:
If the shape of the object is rounded and has a depression at the top,
is red in color, then it will be labeled as –Apple.
Now suppose that after training, you give the machine a new fruit from the basket, say a
Banana, and ask it to identify it.
Since the machine has already learned from the previous data, it now has to use that
learning wisely: it will first classify the fruit by its shape and color, confirm the fruit
name as BANANA, and put it in the Banana category. Thus the machine learns from
training data (the basket containing fruits) and then applies the knowledge to test
data (the new fruit).
Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Types:-
Regression
Logistic Regression
Classification
Naive Bayes Classifiers
K-NN (k nearest neighbors)
Decision Trees
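The fruit example above amounts to a nearest-neighbour classifier. A minimal sketch in plain Python, with made-up shape/colour feature values:

```python
# labelled training data: (roundness, redness) -> fruit (feature values assumed)
training = [
    ((0.9, 0.8), 'Apple'),
    ((0.9, 0.9), 'Apple'),
    ((0.2, 0.1), 'Banana'),
    ((0.3, 0.2), 'Banana'),
]

def classify(features):
    """Label a new fruit by its closest labelled example (1-nearest neighbour)."""
    def dist(point):
        return sum((a - b) ** 2 for a, b in zip(point, features))
    return min(training, key=lambda item: dist(item[0]))[1]

print(classify((0.25, 0.15)))   # closest to the Banana examples
```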
Advantages:-
Supervised learning allows collecting data and produces data output from previous
experiences.
Helps to optimize performance criteria with the help of experience.
Supervised machine learning helps to solve various types of real-world computation
problems.
Disadvantages:-
Unsupervised learning
Unsupervised learning is the training of a machine using information that is neither classified
nor labeled and allowing the algorithm to act on that information without guidance. Here the
task of the machine is to group unsorted information according to similarities, patterns, and
differences without any prior training of data.
Unlike supervised learning, no teacher is provided, which means no training will be given to
the machine. Therefore the machine is left to find the hidden structure in unlabeled data
by itself.
For instance, suppose the machine is given an image containing both dogs and cats that it
has never seen before.
The machine has no idea about the features of dogs and cats, so it cannot label them as
'dogs' and 'cats'. But it can group them according to their similarities, patterns, and
differences: the picture can easily be split into two parts, the first containing all the
pictures with dogs in them and the second containing all the pictures with cats in
them. Here the machine learned nothing beforehand, which means there is no training data or examples.
It allows the model to work on its own to discover patterns and information that was
previously undetected. It mainly deals with unlabelled data.
Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
Association: An association rule learning problem is where you want to discover rules
that describe large portions of your data, such as people that buy X also tend to buy Y.
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:-
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
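The k-means idea from the list above can be sketched in a few lines of plain Python (1-D points and starting centres are assumed for illustration):

```python
def kmeans_1d(points, centres, iterations=10):
    """Repeatedly assign each point to its nearest centre, then move each
    centre to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = {c: [] for c in centres}
        for p in points:
            nearest = min(centres, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centres = [sum(v) / len(v) for v in clusters.values() if v]
    return sorted(centres)

points = [1, 2, 3, 10, 11, 12]          # two obvious groups
print(kmeans_1d(points, [1, 12]))       # centres converge to [2.0, 11.0]
```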
Input Data: supervised algorithms are trained using labeled data; unsupervised algorithms
are used against data that is not labeled.
Labels help people understand the significance of each axis of any graph you create.
Without labels, the values portrayed don’t have any significance. In addition to a moniker,
such as rainfall, you can also add units of measure, such as inches or centimeters, so that your
audience knows how to interpret the data shown. The following example shows how to add
labels to your graph:
import matplotlib.pyplot as plt
values = [5, 8, 9, 4, 3, 2, 5, 9, 6, 7]   # sample data (assumed)
plt.xlabel('Entries')
plt.ylabel('Values')
plt.plot(range(1,11), values)
plt.show()
The call to xlabel() documents the x-axis of your graph, while the call to ylabel() documents
the y-axis of your graph.
Annotation: You use annotation to draw special attention to points of interest on a graph. For
example, you may want to point out that a specific data point is outside the usual range
expected for a particular data set. The following example shows how to add annotation to a
graph.
import matplotlib.pyplot as plt
values = [5, 8, 9, 4, 3, 2, 5, 9, 6, 7]   # sample data (assumed)
plt.annotate(xy=[3, 9], s='Third Entry')  # 's' was renamed 'text' in newer Matplotlib
plt.plot(range(1,11), values)
plt.show()
The call to annotate() provides the labeling you need. You must provide a location for the
annotation by using the xy parameter, as well as provide text to place at the location by using
the s parameter. The annotate() function also provides other parameters that you can use to
create special formatting or placement on-screen.
Legend: A legend documents the individual elements of a plot. Each line is presented in a
table that contains a label for it so that people can differentiate between each line. For
example, one line may represent sales from the first store location and another line may
represent sales from a second store location, so you include an entry in the legend for each
line that is labeled first and second. The following example shows how to add a legend to your
plot:
import matplotlib.pyplot as plt
values = [1, 5, 8, 9, 2, 0, 3, 10, 4, 7]   # sample data (assumed)
values2 = [3, 8, 9, 2, 1, 2, 4, 7, 6, 6]
line1, = plt.plot(range(1,11), values)
line2, = plt.plot(range(1,11), values2)
plt.legend(handles=[line1, line2], labels=['First', 'Second'], loc=4)
plt.show()
The call to legend() occurs after you create the plots, not before. You must provide a handle
to each of the plots. Notice how line1 is set equal to the first plot() call and line2 is set equal to
the second plot() call.
The six most commonly used Plots come under Matplotlib. These are:
Line Plot
Bar Plot
Scatter Plot
Pie Plot
Area Plot
Histogram Plot
Line plots: are drawn by joining straight lines connecting data points where the x-axis and
y-axis values intersect. Line plots are the simplest form of representing data. In Matplotlib,
the plot() function represents this.
Example:
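A minimal line plot using plot() might look like this (the values are assumed sample data):

```python
import matplotlib
matplotlib.use('Agg')     # render off-screen; drop this line when running interactively
import matplotlib.pyplot as plt

values = [5, 8, 9, 4, 3, 2, 5, 9, 6, 7]   # sample data (assumed)
line, = plt.plot(range(1, 11), values)    # join the data points with straight lines
plt.show()
```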
Bar plots: are vertical/horizontal rectangular graphs that show data comparison where you
can gauge the changes over a period represented in another axis (mostly the X-axis). Each bar
can store the value of one or multiple data divided in a ratio. The longer a bar becomes, the
greater the value it holds. In Matplotlib, we use the bar() or barh() function to represent it.
Example:
from matplotlib import pyplot
pyplot.bar([0.25,2.25,3.25,5.25,7.25],[300,400,200,600,700],
label="Carpenter",color='b',width=0.5)
pyplot.bar([0.75,1.75,2.75,3.75,4.75],[50,30,20,50,60],
label="Plumber", color='g',width=.5)
pyplot.legend()
pyplot.xlabel('Days')
pyplot.ylabel('Wage')
pyplot.title('Details')
pyplot.show()
Scatter Plot: We can implement the scatter (previously called XY) plots while comparing various
data variables to determine the connection between dependent and independent variables.
The data gets expressed as a collection of points clustered together meaningfully. Here each
value has one variable (x) determining the relationship with the other (Y).
Example:
from matplotlib import pyplot
x1 = [1, 2.5, 3, 4.5, 5, 6.5, 7]
y1 = [1, 2, 3, 2, 1, 3, 4]
pyplot.scatter(x1, y1, label='scatter', color='k')
pyplot.xlabel('x')
pyplot.ylabel('y')
pyplot.legend()
pyplot.show()
Example:
from matplotlib import pyplot
# sample data (assumed; the original values are not shown)
slice = [12, 25, 50, 36, 19]
activities = ['Python', 'Java', 'C++', 'SQL', 'Excel']
cols = ['c', 'm', 'r', 'b', 'g']
pyplot.pie(slice,
labels =activities,
colors = cols,
startangle = 90,
shadow = True,
explode =(0,0.1,0,0,0),
autopct ='%1.1f%%')
pyplot.title('Training Subjects')
pyplot.show()
Area Plots: The area plots spread across certain areas with bumps and drops (highs and lows)
and are also known as stack plots. They look identical to the line plots and help track the
changes over time for two or multiple related groups to make it one whole category. In
Matplotlib, the stackplot() function represents it.
Example:
from matplotlib import pyplot
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]   # sample series (assumed)
eating = [2, 3, 4, 3, 2]
pyplot.stackplot(days, sleeping, eating, labels=['Sleeping', 'Eating'])
pyplot.xlabel('Days')
pyplot.ylabel('Hours')
pyplot.legend()
pyplot.show()
Histogram plot: We use a histogram when the data is distributed over continuous ranges,
whereas we use a bar graph to compare discrete entities. Histograms and bar plots look
alike but are used in different scenarios. In Matplotlib, the hist() function represents this.
Example:
from matplotlib import pyplot
pop = [22, 55, 62, 45, 21, 22, 34, 42, 42, 4, 2, 8]
bins = [1, 10, 20, 30, 40, 50]
pyplot.hist(pop, bins, histtype='bar', rwidth=0.8)
pyplot.xlabel('age groups')
pyplot.ylabel('Number of people')
pyplot.title('Histogram')
pyplot.show()
NetworkX is a Python language software package for the creation, manipulation, and study of
the structure, dynamics, and function of complex networks. It is used to study large complex
networks represented in form of graphs with nodes and edges. Using networkx we can load
and store complex networks. We can generate many types of random and classic networks,
analyze network structure, build network models, design new network algorithms and draw
networks.
Example:
import networkx as nx
import matplotlib.pyplot as plt
G1 = nx.Graph()
G1.add_edge(1, 2)
G1.add_edge(3, 2)
G1.add_edge(1, 4)
G1.add_edge(4, 2)
pos = nx.circular_layout(G1)
nx.draw(G1, pos, with_labels=True)
plt.show()
Both join() and merge() are used to combine pandas DataFrames on columns, merging all
columns from two or more DataFrames into a single DataFrame. The main difference between
join and merge is that join() combines two DataFrames on the index rather than on columns,
whereas merge() is primarily used to specify the columns to join on, while also supporting
joins on indexes and on a combination of index and columns.
import pandas as pd
Join: The join method takes two dataframes and joins them on their indexes (technically, you
can pick the column to join on for the left dataframe). If there are overlapping columns, the
join will want you to add a suffix to the overlapping column name from the left dataframe. Our
two dataframes do have an overlapping column name P.
# sample dataframes with an overlapping column 'P' (assumed)
df1 = pd.DataFrame({'P': [1, 2], 'Q': [3, 4]})
df2 = pd.DataFrame({'P': [5, 6], 'R': [7, 8]})
joined_df = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(joined_df)
merge
At a basic level, merge does more or less the same thing as join. Both methods are used to
combine two dataframes, but merge is more versatile: it requires specifying the columns to
use as a merge key. We can specify the overlapping columns with the parameter on, or
specify them separately with the left_on and right_on parameters.
# sample dataframes (assumed)
df1 = pd.DataFrame({'P': [1, 2], 'Q': [3, 4]})
df2 = pd.DataFrame({'P': [1, 2], 'R': [7, 8]})
merged_df = df1.merge(df2, on='P')
print(merged_df)
join() method is used to perform join on row indices and doesn’t support joining on
columns unless setting column as index.
merge() method is used to perform join on indices, columns and combination of these
two.
Both these methods support inner, left, right, outer join types. merge additionally
supports the cross join.
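A small sketch of the difference, using assumed sample DataFrames with an overlapping column P:

```python
import pandas as pd

df1 = pd.DataFrame({'P': [1, 2], 'Q': [3, 4]})
df2 = pd.DataFrame({'P': [1, 5], 'R': [6, 7]})

# join(): combines on the index; the overlapping column 'P' needs suffixes
j = df1.join(df2, lsuffix='_left', rsuffix='_right')
print(j.columns.tolist())    # ['P_left', 'Q', 'P_right', 'R']

# merge(): combines on a named column; only matching 'P' values survive an inner join
m = df1.merge(df2, on='P', how='inner')
print(m)                     # one row, where P == 1
```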
Bag of words is a Natural Language Processing technique (Using Natural Language Processing,
we make use of the text data available across the internet to generate insights for the business.) of
text modelling. In technical terms, we can say that it is a method of feature extraction with
text data. This approach is a simple and flexible way of extracting features from documents.
A bag of words is a representation of text that describes the occurrence of words within a
document. We just keep track of word counts and disregard the grammatical details and the
word order. It is called a “bag” of words because any information about the order or
structure of words in the document is discarded. The model is only concerned with whether
known words occur in the document, not where in the document.
Example:
Step 1: Convert the above sentences in lower case as the case of the word does not hold
any information.
Step 2: Remove special characters and stopwords from the text. Stopwords are words
that do not carry much information about the text, such as 'is', 'a', 'the', and many more.
Although the sentences may no longer read naturally after this step, the maximum
information is contained in the remaining words.
Step 3: Go through all the words in the above text and make a list of all of the words in our
model vocabulary.
welcome
great
learning
now
start
good
practice
Now as the vocabulary has only 7 words, we can use a fixed-length document
representation of 7, with one position in the vector to score each word.
The scoring method we use here is the same as used in the previous example. For sentence 1,
the count of words is as follows:
Word Frequency
welcome 1
great 1
learning 2
now 1
start 1
good 0
practice 0
Sentence 1 ➝ [ 1,1,2,1,1,0,0 ]
For sentence 2, the count of words is as follows:
Word Frequency
welcome 0
great 0
learning 1
now 0
start 0
good 1
practice 1
Sentence 2 ➝ [ 0,0,1,0,0,1,1 ]
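The whole scheme fits in a few lines of plain Python; the two sentences below are assumed reconstructions chosen to be consistent with the counts above:

```python
vocabulary = ['welcome', 'great', 'learning', 'now', 'start', 'good', 'practice']

def bag_of_words(sentence):
    """Score a sentence as word counts over the fixed vocabulary."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

# assumed sentences (already lower-cased, stopwords removed)
s1 = "welcome great learning now start learning"
s2 = "good learning practice"
print(bag_of_words(s1))   # [1, 1, 2, 1, 1, 0, 0]
print(bag_of_words(s2))   # [0, 0, 1, 0, 0, 1, 1]
```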
Pandas treat None and NaN as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful functions for detecting, removing, and
replacing null values in Pandas DataFrame :
isnull(): To check for null values in a Pandas DataFrame, we use the isnull() function. It
returns a DataFrame of Boolean values which are True for NaN values.
notnull(): To check for non-null values in a Pandas DataFrame, we use the notnull()
function. It returns a DataFrame of Boolean values which are False for NaN values.
dropna(): To drop null values from a DataFrame, we use the dropna() function. It drops
rows/columns of the dataset with null values in different ways.
fillna()
replace()
interpolate()
To fill null values in a dataset, we use the fillna(), replace() and interpolate() functions.
These functions replace NaN values with some value of their own; all of them help in
filling null values in a DataFrame. The interpolate() function is basically used to fill NA
values in the dataframe, but it uses various interpolation techniques to fill the missing
values rather than hard-coding a value.
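A quick sketch of these functions on a small Series with missing values:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.isnull().tolist())          # [False, True, False, True, False]
print(s.fillna(0).tolist())         # [1.0, 0.0, 3.0, 0.0, 5.0]
print(s.dropna().tolist())          # [1.0, 3.0, 5.0]
print(s.interpolate().tolist())     # [1.0, 2.0, 3.0, 4.0, 5.0]  (linear by default)
```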
The groupby() function is used to group a DataFrame or Series using a mapper or by a series
of columns. A groupby operation involves some combination of splitting the object, applying
a function, and combining the results. It can be used to group large amounts of data and
compute operations on these groups.
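For example, splitting rows by a key column and summing each group (sample data assumed):

```python
import pandas as pd

df = pd.DataFrame({
    'team': ['A', 'A', 'B', 'B'],
    'points': [10, 20, 5, 15],
})

# split by 'team', apply sum, combine the results
totals = df.groupby('team')['points'].sum()
print(totals['A'], totals['B'])   # 30 20
```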
import pandas as pd
data = {
"calories": [420, 380, 390],
"duration": [50, 40, 45]
}
df = pd.DataFrame(data)
print(df)
Output:
calories duration
0 420 50
1 380 40
2 390 45
1) np.random.rand
np.random.rand returns a random numpy array or scalar whose element(s) are drawn
randomly from the uniform distribution over [0, 1). (including 0 but excluding 1)
Syntax
np.random.rand(d0,d1,d2,.. dn)
d0,d1,d2,.. dn (optional) – It represents the dimension of the required array given as int. It is
optional, if not specified, it will return a single python float.
In [11]:
np.random.rand(3)
Out[11]:
In [19]:
np.random.rand(5,3)
Out[19]:
In [16]:
np.random.rand(3,2,4)
Out[16]:
In [17]:
np.random.rand()
Out[17]:
0.5747916494126569
2) np.random.randn
np.random.randn returns a random numpy array or scalar of sample(s), drawn randomly from
the standard normal distribution.
Syntax
np.random.randn(d0,d1,d2,.. dn)
d0,d1,d2,.. dn (optional) – It represents the dimension of the required array given as int. It is
optional, if not specified, it will return a single python float.
In [32]:
np.random.randn(6)
Out[32]:
In [34]:
np.random.randn(6,4)
Out[34]:
In [35]:
np.random.randn(3,4,2)
Out[35]:
array([[[-0.13509054, 1.31253658],
[ 0.79514661, -0.15733937],
[-0.42428779, 0.07816613],
[[-0.49349802, -0.1929593 ],
[-0.51593638, -1.08389571],
[-0.72854643, -0.44708392],
[-0.01845007, 2.02125787]],
[[-1.8826071 , 1.65592025],
[-2.18326764, -0.07711314],
[-2.9275772 , 2.3173623 ],
[ 0.94757097, -0.13646251]]])
In [36]:
np.random.randn()
Out[36]:
-0.36506602839929475
Data Compatibility: NumPy works with numerical data; Pandas works with tabular data.
Access Methods: NumPy arrays are accessed only by index position; Pandas Series can be
accessed by index position or by index labels.
Indexing: Indexing in NumPy arrays is very fast; indexing in Pandas Series is comparatively
slow.
Operations: NumPy does not have additional functions; Pandas provides special utilities
such as groupby() to access and manipulate subsets.
External Data: NumPy arrays are generally created from data supplied by the user or by
built-in functions; Pandas objects are usually created from external data such as CSV,
Excel, or SQL.
Usage in ML and AI: Toolkits like TensorFlow and scikit-learn can be fed NumPy arrays
directly; Pandas Series cannot be fed directly as input to these toolkits.
Core Language: NumPy was initially written in C; Pandas took R's data.frame as its
reference model.
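The access-method difference in the comparison above can be seen directly:

```python
import numpy as np
import pandas as pd

arr = np.array([10, 20, 30])
ser = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print(arr[1])       # position only
print(ser['b'])     # label-based access
print(ser.iloc[1])  # position also works on a Series
```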
def factorial(n):
    if n == 1 or n == 0:
        return 1
    else:
        return n * factorial(n-1)

num = int(input("Enter a number: "))
print("Factorial of", num, "is", factorial(num))
Output:
Enter a number: 3
Factorial of 3 is 6
Q. Write a program to interchange the List elements on two positions entered by a user.
lst = [1, 6, 4, 8]

def swap(lst, a, b):
    lst[a], lst[b] = lst[b], lst[a]
    return lst

a = int(input('Enter 1st Index : '))
b = int(input('Enter 2nd Index: '))
print(swap(lst, a, b))
Output:
Enter 1st Index:
0
Enter 2nd Index:
2
[4, 6, 1, 8]
Q. Write a program which takes 2 digits, X,Y as input and generates a 2- dimensional array of
size X * Y. The element value in the i-th row and j-th column of the array should be i*j.
row = int(input("Input number of rows: "))
col = int(input("Input number of columns: "))
# build the X * Y array where the element in row i, column j is i*j
multi_list = [[i * j for j in range(col)] for i in range(row)]
print(multi_list)
Output:
Input number of rows: 3
Input number of columns: 4
[[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 4, 6]]
def pypart(n):
    if n == 0:
        return
    pypart(n-1)
    print("* " * n)

n = 5
pypart(n)
Output:
*
**
***
****
*****
num = 11

# If the given number is greater than 1
if num > 1:
    # Iterate from 2 to n / 2
    for i in range(2, int(num/2) + 1):
        # If num is divisible by any number between
        # 2 and n / 2, it is not prime
        if (num % i) == 0:
            print(num, "is not a prime number")
            break
    else:
        print(num, "is a prime number")
else:
    print(num, "is not a prime number")
Output:
11 is a prime number
# Import libraries
from matplotlib import pyplot as plt
import numpy as np
# Creating dataset
cars = ['AUDI', 'BMW', 'FORD',
'TESLA', 'JAGUAR', 'MERCEDES']
data = [23, 17, 35, 29, 12, 41]
# Creating plot
fig = plt.figure(figsize =(10, 7))
plt.pie(data, labels = cars)
# show plot
plt.show()
Output:
Q. Write a program using Numpy to count number of “p” element wise in a given array.
import numpy as np
x1 = np.array(['Python', 'PHP', 'JS', 'examples', 'html'])  # dtype=np.str was removed in newer NumPy; str dtype is inferred
print("\nOriginal Array:")
print(x1)
print("Number of ‘P’:")
r = np.char.count(x1, "P")
print(r)
Output:
Original Array:
['Python' 'PHP' 'JS' 'examples' 'html']
Number of ‘P’:
[1 2 0 0 0]
# Creating a file
file1 = open("myfile.txt", "w")
L = ["This is Delhi \n", "This is Paris \n", "This is London \n"]

# Writing data to the file (reconstructed; implied by the output below)
file1.write("Hello \n")
file1.writelines(L)
file1.close()

# Reopening the file for reading
file1 = open("myfile.txt", "r+")

# read function
print("Output of Read function is ")
print(file1.read())
print()

# seek(0) takes the file handle back to the beginning of the file
file1.seek(0)

# readline function
print("Output of Readline function is ")
print(file1.readline())
print()

file1.seek(0)
# readlines function
print("Output of Readlines function is ")
print(file1.readlines())
print()
file1.close()
Output:
Output of Read function is
Hello
This is Delhi
This is Paris
This is London
def Fibonacci(n):
    # Check if the input is valid
    if n < 0:
        print("Incorrect input")
    # Check if n is 0
    # then it will return 0
    elif n == 0:
        return 0
    # Check if n is 1, 2
    # it will return 1
    elif n == 1 or n == 2:
        return 1
    else:
        return Fibonacci(n-1) + Fibonacci(n-2)

# Driver Program
x = int(input("Enter a number:"))
print(Fibonacci(x))
Output:
Enter a number:9
34
1. Tail() Examples
In [1]:
import numpy as np
import pandas as pd
In [2]:
df = pd.DataFrame({'animal':['snake', 'bat', 'tiger', 'lion',
'fox', 'eagle', 'shark', 'dog', 'deer']})
df
Out[2]:
animal
0 snake
1 bat
2 tiger
3 lion
4 fox
5 eagle
6 shark
7 dog
8 deer
In [3]:
df.tail()
Out[3]:
animal
4 fox
5 eagle
6 shark
7 dog
8 deer
In [4]:
df.tail(4)
Out[4]:
animal
5 eagle
6 shark
7 dog
8 deer
2. Shape( ) Example
import numpy as np

arr1 = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])           # 2 x 4 array (assumed)
print(arr1.shape)

arr2 = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])   # 2 x 2 x 2 array (assumed)
print(arr2.shape)
Output:
(2, 4)
(2, 2, 2)
3. Describe() Example
import pandas as pd
import numpy as np
a1 = pd.Series([1, 2, 3])
a1.describe()
Output:
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64
Covariance captures the variation across variables: we use covariance to measure how
much two variables change with each other. Correlation reveals the relation between the
variables: we use correlation to determine how strongly linked two variables are to each other.
import pandas as pd
import numpy as np

# sample dataframe (assumed)
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [2, 4, 6, 8]})
print(df.var())    # Variance
print(df.cov())    # Covariance
print(df.corr())   # Correlation
Data Wrangling is the process of gathering, collecting, and transforming Raw data into another
format for better understanding, decision-making, accessing, and analysis in less time. Data
Wrangling is also known as Data Munging.
Data Wrangling is a crucial topic for Data Science and Data Analysis. The Pandas framework
of Python is used for Data Wrangling. Pandas is an open-source library specifically developed
for Data Analysis and Data Science, covering processes like data sorting and filtration, data
grouping, and so on.
1. Data exploration: In this process, the data is studied, analyzed and understood by
visualizing representations of data.
2. Dealing with missing values: Most datasets with a vast amount of data contain
missing values (NaN). They need to be taken care of by replacing them with the
mean, mode, or most frequent value of the column, or simply by dropping the rows
having a NaN value.
3. Reshaping data: In this process, data is manipulated according to the requirements,
where new data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns which
are required to be removed or filtered.
5. Other: After dealing with the raw dataset with the above functionalities we get an
efficient dataset as per our requirements and then it can be used for a required purpose
like data analyzing, machine learning, data visualization, model training etc.
Exploratory Data Analysis (EDA) is an approach to analyze the data using visual techniques. It
is used to discover trends, patterns, or to check assumptions with the help of statistical
summary and graphical representations.
Steps in EDA:
ways to detect outliers, and the removal process is the same as removing a data item
from a pandas DataFrame.
For removing an outlier, one must follow the same process of removing an entry from
the dataset using its exact position in the dataset, because all of the above methods of
detecting outliers end with a list of the data items that satisfy the outlier definition
according to the method used.
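For example, detecting outliers with the common IQR rule and then dropping them by their positions (sample data assumed; the IQR rule is one of several detection methods):

```python
import pandas as pd

df = pd.DataFrame({'value': [10, 12, 11, 13, 12, 95]})   # 95 is the outlier

q1 = df['value'].quantile(0.25)
q3 = df['value'].quantile(0.75)
iqr = q3 - q1

# rows outside [q1 - 1.5*IQR, q3 + 1.5*IQR] satisfy the outlier definition
outliers = df[(df['value'] < q1 - 1.5 * iqr) | (df['value'] > q3 + 1.5 * iqr)]
cleaned = df.drop(outliers.index)
print(cleaned['value'].tolist())   # [10, 12, 11, 13, 12]
```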
Q. What is Scikit-learn?
Scikit-learn is a key library for the Python programming language that is typically used in
machine learning projects. Scikit-learn is focused on machine learning tools including
mathematical, statistical and general purpose algorithms that form the basis for many machine
learning technologies. As a free tool, Scikit-learn is tremendously important in many different
types of algorithm development for machine learning and related technologies.
Installation:
pip install scikit-learn
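A minimal scikit-learn sketch, fitting the straight line from the regression section (assumed sample data on y = 0.4x + 2.4):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # features must be 2-D
y = 0.4 * X.ravel() + 2.4

model = LinearRegression().fit(X, y)
print(model.coef_[0], model.intercept_)      # ≈ 0.4 and ≈ 2.4
```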
The bar() function is used to create a bar plot that is bounded with a rectangle depending
on the given parameters of the function.
import numpy as np
import matplotlib.pyplot as plt

# sample data (assumed)
courses = ['C', 'C++', 'Java', 'Python']
values = [20, 15, 30, 35]

plt.bar(courses, values, color='magenta')
plt.xlabel("Books offered")
plt.ylabel("No. of books provided")
plt.title("Books provided by the institute")
plt.show()
Output:
Here plt.bar(courses, values, color='magenta') is basically specifying the bar chart that is to be
plotted using "Books offered"(by the college) column as the X-axis, and the "No. of books" as
the Y-axis.
The color attribute is basically used to set the color of the bars(magenta ).
Q. What do you understand by Data visualization? Discuss some Python’s data visualization
techniques.
Data visualization provides a good, organized pictorial representation of the data, which
makes it easier to understand, observe, and analyze. In this section, we discuss how to
visualize data using Python.
Python offers several plotting libraries, namely Matplotlib, Seaborn and many other such data
visualization packages with different features for creating informative, customized, and
appealing plots to present data in the most simple and effective way.
Matplotlib and Seaborn are python libraries that are used for data visualization. They have
inbuilt modules for plotting different graphs. While Matplotlib is used to embed graphs into
applications, Seaborn is primarily used for statistical graphs.
Data Science is an interdisciplinary field that focuses on extracting knowledge from data sets
that are typically huge in amount. The field encompasses analysis, preparing data for analysis,
and presenting findings to inform high-level decisions in an organization.
A pipeline in data science is “a set of actions which changes the raw (and confusing) data
from various sources (surveys, feedbacks, list of purchases, votes, etc.), to an understandable
format so that we can store it and use it for analysis.”
The raw data undergoes different stages within a pipeline which are:
1) Fetching/Obtaining the Data: This stage involves identifying data from the internet
or internal/external databases and extracting it into useful formats.
2) Scrubbing/Cleaning the Data: This is the most time-consuming stage and requires more
effort. It is further divided into two stages:
Examining Data:
a) identifying errors
b) identifying missing values
c) identifying corrupt records
Cleaning of data:
a) replace or fill missing values/errors
3) Exploratory Data Analysis: When data reaches this stage of the pipeline, it is
free from errors and missing values, and hence is suitable for finding patterns
using visualizations and charts.
4) Modeling the Data: This is that stage of the data science pipeline where machine learning
comes to play. With the help of machine learning, we create data models. Data models are
nothing but general rules in a statistical sense, which is used as a predictive tool to enhance
our business decision-making.
5) Interpreting the Data: This is similar to paraphrasing your data science model. Always
remember: if you can't explain it to a six-year-old, you don't understand it yourself. So,
communication becomes the key!! This is the most crucial stage of the pipeline, where,
with the use of correct business domain knowledge and your storytelling abilities, you
explain your model to a non-technical audience.
6) Revision: As the nature of the business changes, there is the introduction of new features
that may degrade your existing models. Therefore, periodic reviews and updates are very
important from both business’s and data scientist’s point of view.
Beautiful Soup is a library that is used to scrape the data from web pages. It is used to parse
HTML and XML content in Python.
First of all import the requests module and the BeautyfulSoup module from bs4 as shown
below.
import requests
from bs4 import BeautifulSoup
# Url of website
url="http://170.187.134.184"
rawdata=requests.get(url)
html=rawdata.content
Now we will use html.parser to parse the content of the html and prettify it using
BeautifulSoup. Once the content is parsed, we can use the different methods of Beautiful
Soup to get the relevant data from the website.
# Parsing html content with beautifulsoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)
paragraphs = soup.find_all('p')
print(paragraphs)
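The same methods work on any HTML string, so Beautiful Soup can be tried without a network connection (the document below is made up):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Demo page</title></head><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.string)                            # Demo page
print([p.get_text() for p in soup.find_all('p')])   # ['first', 'second']
```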
Output: