FODS Lab Manual - Organized
EX.NO | DATE | LIST OF EXPERIMENTS | PAGE NO | MARK | SIGNATURE
4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and Pima Indians Diabetes data set for performing the following:
   a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
   b. Bivariate analysis: Linear and logistic regression modeling
   c. Multiple Regression analysis
   d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets.
   a. Normal curves
   b. Density and contour plots
   c. Correlation and scatter plots
   d. Histograms
   e. Three dimensional plotting
7. Visualizing Geographic Data with Basemap
8. Read the following file formats
   a. Pickle files
   b. Image files using PIL
   c. Multiple files using Glob
   d. Importing data from database
Ex No : 1
Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
Date :
Aim: To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.
Methods:
Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization,
machine learning, and much more.
Jupyter has support for over 40 different programming languages and Python is one of them.
Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter
Notebook itself.
PIP is a package management system used to install and manage software packages/libraries written in Python. These files are stored in a large online repository termed the Python Package Index (PyPI). pip uses PyPI as the default source for packages and their dependencies. So whenever you type:

pip install package_name

pip will look for that package on PyPI and, if found, it will download and install the package on your local system.
python --version
Python 3.10.0
PIP can be downloaded and installed from the command line by going through the following steps:
Step 1: Download the get-pip.py bootstrap file (available at https://bootstrap.pypa.io/get-pip.py) and save it in a convenient directory.
Step 2: Change the current path of the directory in the command line to the path of the directory where the above file exists.
Step 3: get-pip.py is a bootstrapping script that enables users to install pip in Python
environments. Run the command given below:
python get-pip.py
Step 4: Now wait through the installation process. Voila! pip is now installed on your system.
One can easily verify whether pip has been installed correctly by performing a version check. Just go to the command line and execute the following command:

pip --version
Adding PIP to Windows Environment Variables
If you are facing any path error then you can follow the following steps to add the pip to your
PATH. You can follow the following steps to set the Path:
Go to System and Security > System in the Control Panel once it has been opened.
On the left side, click the Advanced system settings link.
Then select Environment Variables.
Double-click the PATH variable under System Variables.
Click New, and add the directory where pip is installed, e.g. C:\Python33\Scripts, and select OK.
It may sometimes happen that your current pip version does not support your version of Python or your machine; in that case you can downgrade pip with the following command (mention the version you want to install):

python -m pip install pip==<version>
Using PIP:
Install Jupyter using the PIP package manager. As described above, pip uses PyPI as the default source for packages and their dependencies.
To install Jupyter using pip, we need to first check if pip is updated in our system. Use the following command to update pip:

python -m pip install --upgrade pip

After updating the pip version, install Jupyter with:

pip install jupyter
Beginning Installation:
Installing Packages:
Finished Installation:
Launching Jupyter:
Use the following command to launch Jupyter using command-line:
jupyter notebook
Installing Python on Windows:
Start the installer and select Customize installation. On the next screen leave all the optional features checked. Finally, on the Advanced Options screen make sure to check Install for all users, Add Python to environment variables and Precompile standard library. Optionally, you can customize the install location. I've used C:\Python38. You should see something like this:
Press the Install button and in a few minutes, depending on the speed of your computer, you should be
ready. On the last page of the installer, you should also press the Disable path length limit:
Now, to check if Python was correctly installed, open a Command Prompt (or a PowerShell) window.
Press and hold the SHIFT key and right click with your mouse somewhere on your desktop, select Open
command window here. Alternatively, on Windows 10, use the bottom left search box to search for cmd.
Write python in the command window and press Enter, you should see something like this:
Exit from the Python interpreter by writing quit() and pressing the Enter key.
Now, open a cmd window like before. Use the next set of commands to install NumPy, SciPy and Matplotlib:

pip install numpy
pip install scipy
pip install matplotlib

After each of the above commands you should see Successfully installed ....
Launch Python from a cmd window and check the version of Scipy, you should see something like this:
C:\>python
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy as sp
>>> sp.version.version
'1.4.1'
>>>
Let's try something a bit more interesting now: let's plot a simple function with Matplotlib. First, we'll import SciPy and Matplotlib, then build a range of values and plot t squared:

import scipy as sp
import matplotlib.pylab as plt

t = sp.linspace(0, 1, 100)
plt.plot(t, t**2)
plt.show()
Pandas in Python is a package written for data analysis and manipulation. Pandas offers various operations and data structures for numerical data manipulation and time series. Pandas is an open-source library built on top of NumPy. It is known for its high productivity and high performance, and it is popular because it makes importing and analyzing data much easier.
Pandas programs can be written on any plain text editor like notepad, notepad++, or anything of
that sort and saved with a .py extension. To begin with, writing Pandas Codes and performing
various intriguing and useful operations, one must have Python installed on their System. This
can be done by following the step by step instructions provided below:
To check if your device is pre-installed with Python or not, just go to the command line (search for cmd, or press Windows key + R and type cmd in the Run dialog).
Now run the following command:
python --version
If Python is already installed, it will generate a message with the Python version available.
Pandas can be installed in two ways:
Using pip
Using Anaconda
As noted earlier, PIP installs packages from the Python Package Index (PyPI). Pandas can be installed using PIP with the following command:

pip install pandas
Installing statsmodels
The easiest way to install statsmodels is to install it as part of the Anaconda distribution, a cross-
platform distribution for data analysis and scientific computing. This is the recommended
installation method for most users.
Instructions for installing from PyPI, source or a development version are also provided.
Python Support
statsmodels supports Python 3.7, 3.8, and 3.9.
Anaconda
statsmodels is available through conda provided by Anaconda. The latest release can be installed
using: conda install -c conda-forge statsmodels
PyPI (pip)
To obtain the latest released version of statsmodels using pip: pip install statsmodels
You will need a C compiler installed to build statsmodels. If you are building from the github
source and not a source release, then you will also need Cython. You can follow the instructions
below to get a C compiler setup for Windows.
If your system is already set up with pip, a compiler, and git, you can try:

pip install git+https://github.com/statsmodels/statsmodels

If you do not have pip installed or want to do the installation more manually, you can download the source and type:

python setup.py install

statsmodels can also be installed in develop mode (python setup.py develop), which installs statsmodels into the current Python environment in-place. The advantage of this is that edited modules will immediately be re-interpreted when the Python interpreter restarts without having to re-install statsmodels.
If you aspire to a career in machine learning, statsmodels is one more package worth exploring. So, let's see what statsmodels is and what its features are.
Statsmodels is a popular library in Python that enables us to estimate and analyze various
statistical models. It is built on numeric and scientific libraries like NumPy and SciPy.
1. It includes various models of linear regression like ordinary least squares, generalized least
squares, weighted least squares, etc.
2. It provides some efficient functions for time series analysis.
3. It also has some datasets for examples and testing.
4. Models based on survival analysis are also available.
5. A wide range of statistical tests for large-scale data is available.
Installing statsmodels
1. Check the version of Python installed on your PC (discussed earlier, but summarized again here).
There are two ways to check the version of Python in Windows-
Using Powershell
Using Command Prompt
Using PowerShell or Command Prompt
Follow the steps below to check the version of Python:
Type 'PowerShell' or 'Command Prompt' in the taskbar's search pane and click the icon to open it (or click the icon directly if it is pinned to the taskbar). Then run python --version.
Checking the Version of Python in Linux
In Linux, commands typed into the shell are interpreted and tell the operating system what the user wants.
Start your system and switch to the Linux operating system (you might find it under the name Ubuntu).
Once the Linux desktop appears, click on 'Terminal' to open it.
In the terminal window, type python --version and press 'Enter'.
The next line will display the current version of Python installed on your system.
Installation of statsmodels
Now let us discuss the steps for installing statsmodels in our system. We will look at two
methods of installation
In the first method, we will open the Anaconda Prompt and type the following command:

conda install -c conda-forge statsmodels

In the second method, we will open the Command Prompt, type the following command and press 'Enter':

pip install statsmodels
Here, we will perform OLS (Ordinary Least Squares) regression, a technique that minimizes the sum of squared differences between the predicted and observed values.
Example -
import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("/content/SampleSuperstore.csv")
df.head()
x = df['Sales']
y = df['Profit']
model = sm.OLS(y, x).fit()
model_summary = model.summary()
print(model_summary)
Output-
Result: Thus the Jupyter notebook and the Python libraries Pandas, NumPy, SciPy and Statsmodels have been successfully downloaded and installed.
Ex No : 2 (a)
Numpy Aggregate functions
Date :
Aim: To write python code using numpy to create an array and apply the different aggregate functions. i)
sum of elements ii)Max iii) Min iv) standard deviation v) variance vi) index of minimum and maximum
value.
Algorithm:
Step 1: Create a 4x3 array of random integers in the interval [0, 10)
Step 2: calculate sum using np.sum()
Step 3: calculate max using np.max()
step 4: calculate min using np.min()
Step 5: calculate standard deviation using np.std()
Step 6: calculate variance using np.var()
Step 7: calculate index of minimum and maximum value np.argmin() and np.argmax()
Program:
import numpy as np
a=np.random.randint(0, 10, (4, 3))
print(a)
print(np.sum(a))
print(np.min(a))
print(np.max(a))
print(np.std(a))
print(np.var(a))
print(np.argmin(a))
print(np.argmax(a))
Output:
4 x 3 array - a
[[5 9 4]
[4 0 7]
[0 6 9]
[9 1 3]]
Sum of a: 57
Minimum of a: 0
Maximum of a: 9
Standard Deviation of a: 3.2177890960513036
Variance of a: 10.354166666666666
Index of the minimum of a: 4
Index of the maximum of a: 1
Result: The python program using NumPy to find aggregates of a given array has been executed successfully.
Ex No : 2 (b)
Numpy Attributes and Indexing
Date :
Aim: To write a python program to work with numpy attributes and indexing.
Algorithm:
Step 1: import the numpy package
Step 2: create a random array of one dimensional, two dimensional and three dimensional arrays
Step 3: use different numpy attributes like ndim, shape, size and dtype.
Step 4: use positive and negative indexing to find how the elements are accessed
Program:
import numpy as np
np.random.seed(0)
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)
print(x1)
print(x1.dtype)
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print(x2)
print(x2.dtype)
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print(x3)
print("data type of x3", x3.dtype)
#numpy indexing
print(x1)
print("The index'0' of x1:", x1[0])
print("The reverse index of x1:", x1[-1])
print("Array x2:", x2)
print("The first index in a 2 dimensional array:", x2[0,0])
print(x2[2,-1])
x2[0,0]=12
print("The new value of x2[0,0]:", x2)
Output:
x1 ndim: 1
x1 shape: (6,)
x1 size: 6
[5 0 3 3 7 9]
int32
x2 ndim: 2
x2 shape: (3, 4)
x2 size: 12
[[3 5 2 4]
[7 6 8 8]
[1 6 7 7]]
int32
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
[[[8 1 5 9 8]
[9 4 3 0 3]
[5 0 2 3 8]
[1 3 3 3 7]]
[[0 1 9 9 0]
[4 7 3 2 7]
[2 0 0 4 5]
[5 6 8 4 1]]
[[4 9 8 1 1]
[7 9 9 3 6]
[7 2 0 3 5]
[9 4 4 6 4]]]
data type of x3 int32
[5 0 3 3 7 9]
The index'0' of x1: 5
The reverse index of x1: 9
Array x2:
[[3 5 2 4]
[7 6 8 8]
[1 6 7 7]]
The first index in a 2 dimensional array: 3
7
The new value of x2[0,0]:
[[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]
Result: The python program using Numpy has been written and executed successfully to demonstrate its
attributes and indexing.
Ex No : 2 (c)
Working with Concatenation, Slicing of Numpy arrays
Date :
Aim: To work and practice with Numpy arrays concatenation and slicing.
Algorithm:
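The concatenation program itself is missing from this copy. A minimal sketch that reproduces the output below, using np.concatenate (the array values are inferred from the printed results):

import numpy as np
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
z = np.array([99, 99, 99])
# concatenate one-dimensional arrays end to end
print("Concatenating x, y, z:", np.concatenate([x, y, z]))
grid = np.array([[1, 2, 3], [4, 5, 6]])
# concatenate along the first axis (stack rows)
print(np.concatenate([grid, grid]))
# concatenate along the second axis (stack columns)
print(np.concatenate([grid, grid], axis=1))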
Output
Concatenating x, y, z: [ 1 2 3 3 2 1 99 99 99]
[[1 2 3]
[4 5 6]
[1 2 3]
[4 5 6]]
[[1 2 3 1 2 3]
[4 5 6 4 5 6]]
import numpy as np
np.random.seed(0)
x = np.arange(10)
print("The array X:", x)
#numpy slicing
print("The First five elements:",x[:5])
print("The Last five elements:",x[5:])
print("The elements in between:",x[4:7])
print("The even elements:", x[::2])
print("The odd elements:", x[1::2])
print("The elements in the reverse order:", x[::-1])
print(x[5::-2])
x2 = np.random.randint(10, size=(3, 4))
print("The resized array x2:", x2)
print(x2[:2,:3])
print(x2[:,0])
Output
The array X: [0 1 2 3 4 5 6 7 8 9]
The First five elements: [0 1 2 3 4]
The Last five elements: [5 6 7 8 9]
The elements in between: [4 5 6]
The even elements: [0 2 4 6 8]
The odd elements: [1 3 5 7 9]
The elements in the reverse order: [9 8 7 6 5 4 3 2 1 0]
[5 3 1]
The resized array x2: [[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
[[5 0 3]
[7 9 3]]
[5 7 2]
Result: Thus the numpy program to demonstrate concatenation and slicing techniques on numpy arrays has been written and executed successfully.
Ex No : 2 (d)
Working with Reshaping and Splitting of Numpy arrays
Date :
Aim: To work and practice with Numpy arrays splitting and reshaping.
Algorithm:
import numpy as np
grid = np.arange(1, 10)
print("The array'grid':", grid)
grid1 = np.arange(1, 10).reshape((3, 3))
print("The reshaped array 'grid1':",grid1)
x = np.array([1, 2, 3])
print(x)
print(x.reshape((1, 3)))
print(x[np.newaxis, :])
Output
The array'grid': [1 2 3 4 5 6 7 8 9]
The reshaped array 'grid1': [[1 2 3]
[4 5 6]
[7 8 9]]
[1 2 3]
[[1 2 3]]
[[1 2 3]]
import numpy as np
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print("Splitting the array x into x1, x2, x3:", x1, x2, x3)
Output:
Splitting the array x into x1, x2, x3: [1 2 3] [99 99] [3 2 1]
Program: To demonstrate numpy splitting – vsplit()
import numpy as np
grid = np.arange(16).reshape((4, 4))
print(grid)
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)
Output
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]
Program: To demonstrate numpy splitting – hsplit()
import numpy as np
grid = np.arange(16).reshape((4, 4))
print(grid)
left, right = np.hsplit(grid, [2])
print(left)
print(right)
Output
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]
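The dsplit() program is missing from this copy. A minimal sketch that reproduces the output below (mirroring the NumPy documentation example for np.dsplit):

import numpy as np
a = np.arange(12.0).reshape(2, 2, 3)
print("The array 'a':", a)
# split into three equal slabs along the third axis (depth)
D1 = np.dsplit(a, 3)
print("The dsplit array 'D1':", D1)
# split at depth indices 2 and 6 (the piece past the end is empty)
D2 = np.dsplit(a, np.array([2, 6]))
print("The dsplit array 'D2':", D2)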
Output
The array 'a': [[[ 0. 1. 2.]
[ 3. 4. 5.]]
[[ 6. 7. 8.]
[ 9. 10. 11.]]]
The dsplit array 'D1': [array([[[0.],
[3.]],
[[6.],
[9.]]]), array([[[ 1.],
[ 4.]],
[[ 7.],
[10.]]]), array([[[ 2.],
[ 5.]],
[[ 8.],
[11.]]])]
The dsplit array 'D2': [array([[[ 0., 1.],
[ 3., 4.]],
[[ 6., 7.],
[ 9., 10.]]]), array([[[ 2.],
[ 5.]],
[[ 8.],
[11.]]]), array([], shape=(2, 2, 0), dtype=float64)]
Result: The python program using Numpy has been written and executed successfully to demonstrate its
Concatenation, Slicing, Reshaping and Splitting.
Ex No : 3 (a)
Working with Pandas Series objects.
Date :
Aim: To write a Python program to create Pandas Series objects and perform various operations on them.
Algorithm:
Creating a series from array: In order to create a series from array, we have to import a numpy module
and have to use array() function.
Program
# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['p','a','n','d','a', 's'])
ser = pd.Series(data)
print(ser)
Output :
0 p
1 a
2 n
3 d
4 a
5 s
dtype: object
Creating a series from array with an index: In order to create a series by explicitly providing an index instead of the default, we have to provide a list of elements to the index parameter with the same number of elements as the array.
import pandas as pd
# a simple list
list = ['p','a','n','d','a','s']
# create series from a list with an explicit index
ser = pd.Series(list, index=[10, 11, 12, 13, 14, 15])
print(ser)
Output
10 p
11 a
12 n
13 d
14 a
15 s
dtype: object
Creating a series from Dictionary: In order to create a series from a dictionary, we first create the dictionary and then make a series from it. Dictionary keys are used to construct the index of the Series.
import pandas as pd
# a simple dictionary
dict = {'Bala': 10,
'Chander': 20,
'Vijay': 30}
# create series from dictionary
ser = pd.Series(dict)
print(ser)
Output
Bala 10
Chander 20
Vijay 30
dtype: int64
Creating a series using NumPy functions: In order to create a series using numpy functions, we can use functions like numpy.linspace() and numpy.random.randn().
ser1 = pd.Series(np.linspace(3, 33, 3))
print(ser1)
Output
0 3.0
1 18.0
2 33.0
dtype: float64
ser2 = pd.Series(np.linspace(1, 100, 10))
print(ser2)
Output
0 1.0
1 12.0
2 23.0
3 34.0
4 45.0
5 56.0
6 67.0
7 78.0
8 89.0
9 100.0
dtype: float64
There are two ways through which we can access elements of a series:
Accessing Element from Series with Position: To access a series element, refer to its index number. Use the index operator [ ] to access an element in a series. The index must be an integer. In order to access multiple elements from a series, we use the slice operation.
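The program for this step is missing from this copy. A minimal sketch consistent with the output below (the series values are inferred from the printed results):

import pandas as pd
ser = pd.Series(['I', 'L', 'O', 'V', 'E'])
print(ser)
# access the first element by position
print(ser[0])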
Output
0 I
1 L
2 O
3 V
4 E
dtype: object
Output
I
# df is a DataFrame of student records loaded earlier (the read_csv step is not shown in this copy)
ser = pd.Series(df['Name'])
data = ser.head(10)
data
Output
0 ABIRAMI T
1 BACHU MANEESH
2 BENSIHA A
3 DEVANAND C
4 DHANALAKSHMI S
5 DHANUSH J
6 DHIVYA LAKSHMI B
7 DORAGALU NAVADEEP
8 DRAVID M
9 GNANESWARAN B
Name: Name, dtype: object
Indexing a Series using .loc[ ] :
This function selects data by referring to the explicit index. The df.loc indexer selects data in a different way than the plain indexing operator: it can select subsets of data, and its slices include the end label.
data.loc[3:6]
Output
3 DEVANAND C
4 DHANALAKSHMI S
5 DHANUSH J
6 DHIVYA LAKSHMI B
Name: Name, dtype: object
Indexing a Series using .iloc[ ] :
This function selects data by implicit integer position; unlike .loc, its slices exclude the endpoint.
data.iloc[3:6]
Output
3 DEVANAND C
4 DHANALAKSHMI S
5 DHANUSH J
Name: Name, dtype: object
We can perform binary operations on series such as addition, subtraction and many others. In order to perform binary operations on series we use functions like .add(), .sub(), etc.
# creating a series
data = pd.Series([5, 2, 3, 7], index=['a', 'b', 'c', 'd'])
# creating another series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])
print(data)
print(data1)
# add the two series, treating missing labels as 0
data.add(data1, fill_value=0)
Output
a 5
b 2
c 3
d 7
dtype: int64
a 1
b 6
d 4
e 9
dtype: int64
a 6.0
b 8.0
c 3.0
d 11.0
e 9.0
dtype: float64
data.sub(data1, fill_value=0)
a 4.0
b -4.0
c 3.0
d 3.0
e -9.0
dtype: float64
Result: Thus the pandas program to create Series objects has been written and various operations have been performed on them successfully.
Ex No : 3 (b)
Working with Pandas Data Frame objects.
Date :
Aim: To write a Python program to demonstrate Pandas Data Frame objects.
Algorithm:
Program
import pandas as pd
import numpy as np
rng = np.random.RandomState(42)
A = rng.randint(10, size=(3, 4))
print("Array A", A)
print(A - A[0])
df = pd.DataFrame(A, columns=list('QRST'))
print("The Data Frame df",df)
#Indexing
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
#Concatenation
print(pd.concat([ser1, ser2]))
Output
Array A [[6 3 7 4]
[6 9 2 6]
[7 4 3 7]]
[[ 0 0 0 0]
[ 0 6 -5 2]
[ 1 1 -4 3]]
Q R S T
0 0 0 0 0
1 0 6 -5 2
2 1 1 -4 3
Q R S T
0 3 0 4 1
1 -3 0 -7 -3
2 3 0 -1 3
Q R S T
0 12 9 13 10
1 12 15 8 12
2 14 11 10 14
0 False
1 True
2 False
3 True
dtype: bool
0 1
2 hello
dtype: object
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
Result: Thus the pandas program to create a data frame has been written and various operations on the data frame have been performed successfully.
Ex No : 3 (c)
Working with combining data sets.
Date :
Aim: To write a Python program to combine multiple data sets using the merge(), concat() and join() functions.
Algorithm:
The data required for a data-analysis task usually comes from multiple sources, so it is important to learn the methods to bring this data together. There are three common and time-saving ways to combine multiple datasets using Python pandas methods.
They are
merge(): To combine the datasets on common column or index or both.
concat(): To combine the datasets across rows or columns.
join(): To combine the datasets on key column or index.
We will be using a dummy course_data dataset which has two sheets, "Fees" and "Discounts".
Program
import pandas as pd
df1=pd.read_excel(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Dummy_course_data.xls",
sheet_name="Fees")
df2=pd.read_excel(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Dummy_course_data.xls",
sheet_name="Discounts")
print("The fees data set:", df1)
print("The discount data set:",df2)
Output
The fees data set:
Course Country Fee_USD
0 Maths India 15500
1 Physics Germany 16700
2 Applied Maths Germany 11100
3 General Science United Kingdom 18000
4 Social Science Austria 18400
5 History Poland 23000
6 Politics India 21600
7 Computer Graphics United States 27000
The discount data set:
Course Country Discount_USD
0 Maths India 1000
1 Physics Germany 2300
2 German language Germany 1500
3 Information Technology United Kingdom 1200
4 Social Science Austria 1500
5 History Poland 3200
6 Marketing India 2000
7 Computer Graphics United States 2500
pd.merge() automatically detects the common columns between two datasets and combines them on these columns.
pd.merge(df1, df2)
Output
Program to demonstrate ‘outer’ option in merge().
This ‘outer’ join is similar to the one done in SQL. It returns matching rows from both datasets plus
non matching rows. Some cells are filled with NaN as these columns do not have matching records in
either of the two datasets.
df4 = pd.merge(df1, df2, how='outer')
df4
Output
As the second dataset df2 has 3 rows different from df1 for the columns Course and Country, the final output after the merge contains 10 rows, i.e. 7 rows from df1 plus 3 additional rows from df2.
Program to demonstrate the 'right' option in merge(), which keeps all rows of the second DataFrame:
pd.merge(df1, df2, how='right')
Output
You can get the same results by using how='left' as well. All you need to do is change the order of the DataFrames mentioned in pd.merge() from df1, df2 to df2, df1.
The order of the columns in the final output will change based on the order in which you mention
DataFrames in pd.merge()
Output
Join columns of another DataFrame.
Join columns with other DataFrame either on index or on a key column. Efficiently join multiple
DataFrame objects by index at once by passing a list.
Parameters
other : DataFrame, Series, or a list containing any combination of them. The index should be similar to one of the columns in this DataFrame. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.
on : column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values are given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.
lsuffix : str, default ''
rsuffix : str, default ''
sort : bool, default False. Order the result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).
validate : str, optional. If specified, checks if the join is of the specified type. "one_to_one" or "1:1": check if join keys are unique in both left and right datasets. "one_to_many" or "1:m": check if join keys are unique in the left dataset. "many_to_one" or "m:1": check if join keys are unique in the right dataset. "many_to_many" or "m:m": allowed, but does not result in checks. New in version 1.5.0.
Returns
DataFrame
Output
By definition join() combines two DataFrames on the index (by default), and that is why the output contains all the rows and columns from both DataFrames.
If you want to join both DataFrames using the common column — Country, you need to set Country to
be the index in both df1 and df2. It can be done like below.
For the sake of simplicity, I am copying df1 and df2 into df11 and df22 respectively.
df11 = df1.copy()
df11.set_index('Course', inplace=True)
print(df11)
df22 = df2.copy()
df22.set_index('Course', inplace=True)
print(df22)
The above block of code makes the column Course the index in both datasets.
Output
Country Fee_USD
Course
Maths India 15500
Physics Germany 16700
Applied Maths Germany 11100
General Science United Kingdom 18000
Social Science Austria 18400
History Poland 23000
Politics India 21600
Computer Graphics United States 27000
Country Discount_USD
Course
Maths India 1000
Physics Germany 2300
German language Germany 1500
Information Technology United Kingdom 1200
Social Science Austria 1500
History Poland 3200
Marketing India 2000
Computer Graphics United States 2500
Output
concat()
Concat function concatenates datasets along rows or columns. So it simply stacks multiple DataFrames
together one over other or side by side when aligned on index.
Syntax: concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
Parameters:
Returns: type of objs (Series or DataFrame)
Both datasets can be stacked side by side as well by making the axis = 1, as shown below.
pd.concat([df1,df2], axis=1,keys = ["df1_data","df2_data"])
Output
Result: Thus the pandas program has been written to demonstrate combining data sets using the merge(), concat() and join() functions and executed successfully.
Ex No : 3 (d)
Working with pivot table
Date :
Aim: To write a python program using pandas which demonstrates pivot table.
Algorithm:
Step 1: Import the pandas package.
Step 2: Read the tested.csv data set using read_csv().
Step 3: Create a data frame df to load the data from tested.csv and display it.
Step 4: Construct a table that shows the proportion of people who survived in each passenger class. (There are three classes: 1, 2, and 3. Replacing them with strings will look better as an index.)
Step 5: Generate a tabular sheet to show the proportion of people who survived in each class, segregated by gender.
Step 6: Calculate the average fare in each class for both males and females.
Syntax: pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean')
Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index
and columns of the result DataFrame.
Parameters:
data : DataFrame
values : column to aggregate, optional
index: column, Grouper, array, or list of the previous
columns: column, Grouper, array, or list of the previous
Returns: DataFrame
Program to demonstrate pandas.pivot_table()
To load and display first five records from tested.csv data set.
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\tested.csv")
df.head()
Output
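The statement for the class-wise survival proportion (algorithm Step 4) is missing from this copy. A plausible sketch, assuming the Titanic-style column names shown by df.head():

# proportion of survivors in each passenger class (a sketch, not the original statement)
df.pivot_table('Survived', index='Pclass', aggfunc='mean')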
Output
Output
Generate a tabular sheet to show the proportion of people who survived in each class segregated by
gender.
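The corresponding statement is missing from this copy; a plausible sketch (column names assumed as above):

df.pivot_table('Survived', index='Sex', columns='Pclass')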
Output
Calculate the average fare in each class for both males and females.
df.pivot_table("Fare", index=['Pclass', "Sex"], aggfunc=np.mean)
Output
Result
Thus the python program using pandas which demonstrates pivot table has been written and executed
successfully.
Ex No : 4
Reading data from text files, Excel files and the web; descriptive analytics on the Iris data set.
Date :
Aim: To write Pandas programs to read data from text files, Excel files and web pages, and to explore various commands for doing descriptive analytics on the Iris data set.
Algorithm:
We can read a text file (txt) by using the pandas read_fwf() function, fwf stands for fixed-width
lines, We can use this to read fixed length or variable length text files.
# Syntax of read_fwf()
pandas.read_fwf(filepath_or_buffer, colspecs='infer', widths=None, infer_nrows=100, **kwds)
Parameters
colspecs : list of (int, int) tuples or 'infer', default 'infer'
A list of tuples giving the extents of the fixed-width fields of each line as half-open intervals (i.e., [from, to[ ). The string value 'infer' can be used to instruct the parser to try detecting the column specifications from the first 100 rows of the data which are not being skipped via skiprows.
widths : list of int, optional
A list of field widths which can be used instead of 'colspecs' if the intervals are contiguous.
infer_nrows : int, default 100
The number of rows to consider when letting the parser determine the colspecs.
**kwds : optional
Returns
DataFrame or TextFileReader
Method 2
We can read a text file (txt) by using the pandas read_csv() function.
Parameters:
filepath_or_buffer: It is the location of the file which is to be retrieved using this function. It
accepts any string path or URL of the file.
sep: It stands for separator; the default is ',' as in CSV (comma-separated values).
header: It accepts int, a list of int, row numbers to use as the column names, and the start of the
data. If no names are passed, i.e., header=None, then, it will display the first column as 0, the
second as 1, and so on.
usecols: It is used to retrieve only selected columns from the CSV file.
nrows: It means a number of rows to be displayed from the dataset.
index_col: If None, there are no index numbers displayed along with records.
skiprows: Skips passed rows in the new data frame.
Method 1
We can read a text file (txt) by using the pandas read_fwf() function.
Sample Text file
birds.txt
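The read_fwf() call itself is missing from this copy. A minimal sketch consistent with the output below (the file path is assumed):

import pandas as pd
# read the fixed-width text file; the first line becomes the header
df = pd.read_fwf('birds.txt')
print(df)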
Output
STRAY BIRDS
0 BY
1 RABINDRANATH TAGORE
2 STRAY birds of summer come to my
3 window to sing and fly away.
4 And yellow leaves of autumn, which
5 have no songs, flutter and fall there
6 with a sigh.
Method 2
We can read a text file (txt) by using the pandas read_csv() function.
II CSE A.txt
# importing pandas
import pandas as pd
# read the text file; with the default comma separator the tab-separated
# header stays in a single column (as the output below shows)
df = pd.read_csv('II CSE A.txt')
# display DataFrame
print(df)
Output
R.No\tNAME
0 113321104001 ABINAYA P
1 113321104003 ALAGANENI SUMI
2 113321104004 ANGA SRI AJAY KUMAR
3 113321104005 ASHOK KUMAR A
4 113321104006 ASHWINI E
5 113321104009 BHANDATMAKURU.S.V.S.VISWANATHA SARMA
6 113321104010 BOTTA SRIDHAR
7 113321104011 DASETTI MAHESH
8 113321104013 DEYAA ASMI M
9 113321104015 DHANUSH D
10 113321104017 DHARSHINI R
11 113321104019 DONKALA KAMAKSHI HARSHITHA
12 113321104022 DUVVURU BHAVANA REDDY
13 113321104023 GIRISH KUMAR V V
14 113321104026 HARI KRISHNAN R
Pandas - Reading an Excel file.
Algorithm
Multiple Sheets
Reading an Iris data set using read_excel()
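The read_excel() call is missing from this copy. A minimal sketch consistent with the output below (the file path is the one used later in this exercise):

import pandas as pd
df = pd.read_excel(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Iris.xls")
print(df)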
Output
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8
Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica
Output
S.No Player Mat Runs HS
0 1 V Kohli (INDIA) 75 2633 94*
1 2 RG Sharma (INDIA) 104 2633 118
2 3 MJ Guptill (NZ) 83 2436 105
3 4 Shoaib Malik (ICC/PAK) 111 2263 75
4 5 BB McCullum (NZ) 71 2140 123
... ... ... ... ... ...
1758 1759 G Wijekoon (SL) 3 1 1*
1759 1760 JD Wildermuth (AUS) 2 1 1*
1760 1761 Yamin Ahmadzai (AFG) 2 1 1*
1761 1762 Zaki Ul Hassan (Belg) 3 1 1
1762 1763 Zeeshan Abbas (BAH) 1 1 1
Making our own index – In this program we are making S.No as index.
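The program is missing from this copy. A plausible sketch (the workbook name is a hypothetical placeholder; the cricket statistics shown above come from the same data):

import pandas as pd
# make the first column (S.No) the index while reading
df = pd.read_excel("t20_batting.xlsx", index_col=0)
print(df)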
Output
Player Mat Runs HS
S.No
1 V Kohli (INDIA) 75 2633 94*
2 RG Sharma (INDIA) 104 2633 118
3 MJ Guptill (NZ) 83 2436 105
4 Shoaib Malik (ICC/PAK) 111 2263 75
5 BB McCullum (NZ) 71 2140 123
... ... ... ... ...
1759 G Wijekoon (SL) 3 1 1*
1760 JD Wildermuth (AUS) 2 1 1*
1761 Yamin Ahmadzai (AFG) 2 1 1*
1762 Zaki Ul Hassan (Belg) 3 1 1
1763 Zeeshan Abbas (BAH) 1 1 1
[1763 rows x 4 columns]
Concatenating two sheets in an Excel file
import pandas as pds
file =(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Iris.xls")
newData1 = pds.read_excel(file, sheet_name = 'IA1', index_col = 0)
newData2 = pds.read_excel(file, sheet_name = 'IA2', index_col = 0)
newData3 = pds.concat([newData1, newData2])
print(newData3)
Output
REGISTER NUMBER NAME Mark
S.No
1 113321104001 ABINAYA P 83.333333
2 113321104003 ALAGANENI SUMI 78.333333
3 113321104004 ANGA SRI AJAY KUMAR 76.666667
4 113321104005 ASHOK KUMAR A 83.333333
5 113321104006 ASHWINI E 81.666667
... ... ... ...
56 113321104099 SUBIKSHYA KUMAR V 25
57 113321104101 SWARNA DARSINI V 76.666667
58 113321104105 VADDIBOINA PATTABHIRAMI REDDY AB
59 113321104106 VAISHALI G 58.333333
60 113321104120 YUVASHREE S 83.333333
newData3.head()
Output
newData3.tail()
Output
The shape attribute can be used to view the number of rows and columns in the data frame.
newData3.shape
Output
(120, 3)
Now, suppose our data is mostly numerical. We can get the statistical information like mean, max,
min, etc. about the data frame using the describe() method.
newData3.describe()
Output
print(newData3.columns.tolist())
Output
['REGISTER NUMBER', 'NAME ', 'Mark ']
The mean of a numerical column can be computed with mean() (note the trailing space in the column name, as listed above):
newData3['Mark '].mean()
Output
70.65277777777777
If any column contains numerical data, we can sort that column using the sort_values() method.
sorted_column = newData3.sort_values(['Mark '], ascending = False)
print(sorted_column)
Output
REGISTER NUMBER NAME Mark
S.No
12 113321104019 DONKALA KAMAKSHI HARSHITHA 91.666667
55 113321104098 SUBASHREE S 91.666667
51 113321104090 SHALINI M 91.666667
44 113321104082 SAHANA S 90.000000
1 113321104001 ABINAYA P 90.000000
... ... ... ...
26 113321104048 KONDUBOUINA PATHEK KUMAR 0.000000
48 113321104087 SENTHUR MURUGAN T 0.000000
58 113321104105 VADDIBOINA PATTABHIRAMI REDDY 0.000000
40 113321104076 PRAAJEET M R 0.000000
49 113321104088 SHAIK ALFIYAZ 0.000000
Like in excel, formulas can also be applied and calculated columns can be created as follows:
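The statement creating the calculated column is missing from this copy. A plausible sketch, assuming the column adds the marks from the two sheets loaded earlier (an inference from the output values):

newData1['calculated_column'] = newData1['Mark '] + newData2['Mark ']
newData1['calculated_column'].head()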
Output
S.No
1 173.333333
2 165.000000
3 156.666667
4 170.000000
5 168.333333
Name: calculated_column, dtype: float64
import pandas as pd
pd.read_html('https://www.icc-cricket.com/match/100697#scorecard')
Output
[ Batters South Africa Batting R B 4s 6s
\
0 Dean ElgarD Elgar CPT run out (Marnus Labusc... 26 68.0 2.0 0.0
1 Sarel ErweeSJ Erwee c Usman Khawaja b Scott B... 18 31.0 3.0 0.0
2 Theunis Booysen de BruynTBdB de Bruyn c Alex ... 12 31.0 2.0 0.0
3 Temba BavumaT Bavuma c Alex Carey b Mitchell ... 1 8.0 0.0 0.0
4 Khaya ZondoK Zondo c Marnus Labuschagne b Mit... 5 19.0 0.0 0.0
5 Kyle VerreynneK Verreynne WKT c Steve Smith ... 52 99.0 3.0 0.0
6 Marco JansenM Jansen c Alex Carey b Cameron G... 59 136.0 10.0 0.0
7 Keshav MaharajKA Maharaj c Pat Cummins b Nath... 2 9.0 0.0 0.0
8 Kagiso RabadaK Rabada b Cameron Green 4 5.0 1.0 0.0
9 Anrich NortjeA Nortje NOT OUT 1 1.0 0.0 0.0
10 Lungi NgidiL Ngidi b Cameron Green 2 6.0 0.0 0.0
11 Extras (nb 1, b 3, lb 3) 7 NaN NaN NaN
12 Total (all out, 68.4 overs) 189 NaN NaN NaN
SR
0 38.23
1 58.06
2 38.70
3 12.50
4 26.31
5 52.52
6 43.38
7 22.22
8 80.00
9 100.00
10 33.33
11 NaN
12 NaN ,
Fall of Wickets
0 1-29 (SJ Erwee, 10.4 ov) , 2-56 (TBdB de Bruyn...,
Bowlers Australia Bowling O M R W Econ Dots
0 Mitchell StarcMA Starc 13.0 2 39 2 3.00 60
1 Pat CumminsPJ Cummins 14.0 4 30 0 2.14 67
2 Scott BolandSM Boland 14.0 2 34 1 2.42 67
3 Nathan LyonNM Lyon 17.0 3 53 1 3.11 74
4 Cameron GreenC Green 10.4 3 27 5 2.53 51,
Batters Australia Batting R B 4s 6s \
0 David WarnerDA Warner Retired Hurt 200 254 16 2
1 Usman KhawajaUT Khawaja c Kyle Verreynne b Ka... 1 11 0 0
2 Marnus LabuschagneM Labuschagne run out (Dean... 14 35 1 0
3 Steve SmithSPD Smith c Theunis Booysen de Bru... 85 161 9 1
4 Travis HeadTM Head NOT OUT 48 48 7 1
5 Cameron GreenC Green Retired Hurt 6 20 1 0
6 Alex CareyAT Carey WKT NOT OUT 9 22 1 0
7 Pat CumminsPJ Cummins - - - -
8 Scott BolandSM Boland - - - -
9 Mitchell StarcMA Starc - - - -
10 Nathan LyonNM Lyon - - - -
11 Extras (nb 5, w 1, b 5, lb 12) 23 NaN NaN NaN
12 Total (3 wickets, 91 overs) 386 NaN NaN NaN
SR
0 78.74
1 9.09
2 40.00
3 52.79
4 100.00
5 30.00
6 40.90
7 -
8 -
9 -
10 -
11 NaN
12 NaN ,
Fall of Wickets
0 1-21 (UT Khawaja, 6.4 ov) , 2-75 (M Labuschagn...,
Bowlers South Africa Bowling O M R W Econ Dots
0 Kagiso RabadaK Rabada 18.0 1 94 1 5.22 65
1 Lungi NgidiL Ngidi 15.1 2 62 0 4.08 63
2 Marco JansenM Jansen 16.0 1 56 0 3.50 71
3 Anrich NortjeA Nortje 16.0 1 50 1 3.12 70
4 Keshav MaharajKA Maharaj 25.5 0 107 0 4.14 97]
Result: Thus the pandas programs for reading data from text files, Excel Files and web pages has been
written and executed successfully.
Ex No : 5(a)
Univariate Analysis on UCI and Pima Indian diabetes data sets.
Date :
Aim: To write python programs using numpy, pandas, seaborn to demonstrate univariate analysis on UCI
diabetes_data_upload data set and Pima Indian diabetes data set
Algorithm
Step 1: Import Numpy, pandas and seaborn
Step 2: Create Data Frames to load the data from UCI diabetes dataset and Pima India diabetes dataset
Step 3: Perform basic analysis on the data set.
Step 4: Perform univariate analysis on the UCI data set and Pima India diabetes dataset.
Step 5: Plot a density plot to present the skewness and kurtosis.
Univariate analysis on UCI diabetes_data_upload data set.
UCI Diabetes data set:UCI_diabetes_upload.csv
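The loading program is missing from this copy. A minimal sketch consistent with the outputs below (the file name is given above; the working directory is an assumption):

import numpy as np
import pandas as pd
import seaborn as sns
df = pd.read_csv("UCI_diabetes_upload.csv")
df.head()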
Output
[first five rows of the UCI diabetes data set, with columns: Age, Gender, Polyuria, Polydipsia, sudden weight loss, weakness, Polyphagia, Genital thrush, visual blurring, Itching, Irritability, delayed healing, partial paresis, muscle stiffness, Alopecia, Obesity, class]
df.shape
Output
(520, 17)
df.dtypes
Output
Age int64
Gender object
Polyuria object
Polydipsia object
sudden weight loss object
weakness object
Polyphagia object
Genital thrush object
visual blurring object
Itching object
Irritability object
delayed healing object
partial paresis object
muscle stiffness object
Alopecia object
Obesity object
class object
dtype: object
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 520 non-null int64
1 Gender 520 non-null object
2 Polyuria 520 non-null object
3 Polydipsia 520 non-null object
4 sudden weight loss 520 non-null object
5 weakness 520 non-null object
6 Polyphagia 520 non-null object
7 Genital thrush 520 non-null object
8 visual blurring 520 non-null object
9 Itching 520 non-null object
10 Irritability 520 non-null object
11 delayed healing 520 non-null object
12 partial paresis 520 non-null object
13 muscle stiffness 520 non-null object
14 Alopecia 520 non-null object
15 Obesity 520 non-null object
16 class 520 non-null object
dtypes: int64(1), object(16)
memory usage: 69.2+ KB
df.describe().T
Output
Skewness
Skewness is used to measure the symmetry of data around the mean value. Symmetry means an equal distribution of observations above and below the mean.
df['Age'].skew()
Output
0.3293593578272701
Kurtosis
Kurtosis is used to describe the peakedness (or flatness) of a density plot relative to the normal distribution. Dr. Wheeler defines kurtosis as: "The kurtosis parameter is a measure of the combined weight of the tails relative to the rest of the distribution." This means we measure the tail heaviness of the given distribution.
kurtosis = Positive: the graph is more peaked than the normal distribution; Negative: the graph is flatter than the normal distribution.
df['Age'].kurt()
Output
-0.19170941407070163
The graph representation of a single variable lets us interpret the skewness and peakedness of its distribution (the plotting statement mirrors the one used for the Pima data set below):
sns.distplot(df['Age'], hist=True, kde=True)
Output
skewness = 0.3293593578272701
o Positive: the data is not symmetric; the right-side tail is longer than the left-side tail in the density plot.
kurtosis = -0.19170941407070163
o Negative: the peakedness of the graph is less than the normal distribution (a flatter plot).
Univariate analysis on Pima Indian diabetes data set
import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\pima_diabetes.csv")
df.head()
Output
df['Outcome'].value_counts()
Output
0 500
1 268
Name: Outcome, dtype: int64
df.shape
Output
(768, 9)
df.dtypes
Output
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object
df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
df.describe().T
Output
Mode of the column ‘Pregnancies’
df['Pregnancies'].mode()
Output
0 1
Name: Pregnancies, dtype: int64
Variance — it gives the average squared deviation from the mean value.
Variance of the column ‘Pregnancies’
df['Pregnancies'].var()
Output
11.354056320621465
Standard Deviation of the column ‘Pregnancies’
df['Pregnancies'].std()
Output
3.3695780626988694
Skewness
Skewness is used to measure the symmetry of data around the mean value. Symmetry means an equal distribution of observations above and below the mean.
df['Pregnancies'].skew()
Output
0.9016739791518588
Kurtosis
Kurtosis is used to describe the peakedness (or flatness) of a density plot relative to the normal distribution. Dr. Wheeler defines kurtosis as: "The kurtosis parameter is a measure of the combined weight of the tails relative to the rest of the distribution." This means we measure the tail heaviness of the given distribution.
kurtosis = Positive: the graph is more peaked than the normal distribution (a more peaked plot).
df['Pregnancies'].kurt()
Output
0.15921977754746486
The graph representation of a single variable lets us interpret the skewness and peakedness of its distribution:
sns.distplot(df['Pregnancies'],hist=True,kde=True)
Output
Result: Thus python programs have been written using numpy, pandas and seaborn to demonstrate univariate analysis on the UCI diabetes_data_upload data set and the Pima Indian diabetes data set, and executed successfully.
Ex No : 5(b)
Bivariate Analysis on UCI and Pima Indian diabetes data sets.
Date :
Aim: To write python programs using numpy, pandas, seaborn and sklearn to demonstrate bivariate
analysis(Linear regression and Logistic regression) on UCI diabetes_data_upload data set and Pima
Indian diabetes data set.
Procedure
Load sklearn libraries.
Load the diabetes dataset.
Split the dataset.
Create the Linear Regression and Logistic Regression models.
Make predictions using the testing set.
Find the coefficients and the mean squared error.
Program
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# To calculate accuracy measures and confusion matrix
from sklearn import metrics
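# The data-loading and splitting statements are missing from this copy.
# A sketch following the scikit-learn diabetes example (which matches the
# coefficient and error values printed in the output below):
diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)
diabetes_X = diabetes_X[:, np.newaxis, 2]   # use a single feature
diabetes_X_train = diabetes_X[:-20]         # training split
diabetes_X_test = diabetes_X[-20:]          # testing split
diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
regr = linear_model.LinearRegression()      # create the linear regression model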
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
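# The prediction and evaluation statements are missing from this copy;
# a sketch consistent with the output below:
diabetes_y_pred = regr.predict(diabetes_X_test)
print("Coefficients:", regr.coef_)
print("Mean squared error: %.2f" % mean_squared_error(diabetes_y_test, diabetes_y_pred))
print("Coefficient of determination: %.2f" % r2_score(diabetes_y_test, diabetes_y_pred))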
Output
Coefficients:
[938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
Result: Thus python programs for bivariate analysis (linear regression and logistic regression) using pandas, NumPy, seaborn and sklearn have been written and executed successfully.
Ex No : 5(c)
Multiple Linear Regression Analysis on UCI and Pima Indian diabetes data set.
Date :
Aim: To write python programs using numpy, pandas, seaborn and sklearn to demonstrate multiple
regression analysis on UCI diabetes_data_upload data set and Pima Indian diabetes data set.
Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple
linear regression is to model the linear relationship between the explanatory (independent) variables and
response (dependent) variables. In essence, multiple regression is the extension of ordinary least-squares
(OLS) regression because it involves more than one explanatory variable.
Procedure:
Load sklearn Libraries.
Load Data
Load the diabetes dataset
Split Dataset
Fitting multiple linear regression to the training
Predict the Test set results.
Finding Coefficient and Mean Square Error
Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
diabetes_df = pd.read_csv('https://raw.githubusercontent.com/ammishra08/MachineLearning/master/Datasets/diabetes.csv')
diabetes_df.head()
diabetes_df.isnull().sum()
X = diabetes_df.drop(['Outcome'], axis=1)
X
Y = diabetes_df['Outcome']
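# The scaling and train/test-split statements are missing from this copy.
# A sketch (MinMaxScaler is inferred from the scaled values shown below;
# the test_size and random_state are assumptions):
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)   # scale every feature to [0, 1]
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, test_size=0.25, random_state=0)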
[0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556,
0.03333333]])
## Linear Regression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
# fit - training
lin_reg.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
lin_reg.score(X_test, Y_test)
0.32230203252064193
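# The prediction statement is missing from this copy; a sketch whose
# result tail is shown below:
from sklearn.metrics import mean_squared_error, r2_score
predictions = lin_reg.predict(X_test)
predictions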
0.08541831, 0.59232843, 0.1303017 , 0.23072725])
mean_squared_error(Y_test, predictions)
0.14370648838141728
r2_score(Y_test, predictions)
0.32230203252064193
Result: Thus python programs for multiple linear regression using pandas, NumPy, seaborn and sklearn have been written and executed successfully.
Ex No : 5(d)
Comparing the results of the analysis for the two data sets.
Date :
Aim : To write a python program, which compares the results of the two different data sets.
Procedure
Step 1: Prepare the datasets to be compared (the first stored in car1.csv, with the data below stored in a second CSV file called car2.csv).
Step 2: Based on the above data, create the two DataFrames.
Step 3: Compare the values between the two Pandas DataFrames.
Program
import pandas as pd
import numpy as np
data_1 = pd.read_csv(r'd:\car1.csv')
df1 = pd.DataFrame(data_1)
data_2 = pd.read_csv(r'd:\car2.csv')
df2 = pd.DataFrame(data_2)
df1['amount1'] = df2['amount1']
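# The comparison statements are missing from this copy. A sketch inferred
# from the True/0 columns in the output below (the column names are assumptions):
df1['amounts_match'] = np.where(df1['amount'] == df1['amount1'], True, False)
df1['amount_diff'] = df1['amount'] - df1['amount1']
print(df1)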
Output
6 Audi Chennai 2022 1200000 1200000 True 0
7 Ertiga Chennai 2022 1300000 1300000 True 0
Ex No : 6(a)
Apply and explore various plotting functions on UCI data sets: Normal Curve
Date :
Aim: To apply and explore normal curve on UCI data sets using python programs and libraries Numpy,
pandas, seaborn.
Procedure:
Step 1) Import the required packages such as numpy and matplotlib.
Step 2) Import the norm function from SciPy's stats library.
Step 3) Initialize the mean and standard deviation.
Step 4) Calculate the z-transforms z1 and z2.
Step 5) Set the title of the graph.
Step 6) Set the labels and limits of the graph.
Step 7) Plot the graph and save it in a file.
Program:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline
# define constants
mu = 998.8
sigma = 73.10
x1 = 900
x2 = 1100
# calculate the z-transform
z1 = ( x1 - mu ) / sigma
z2 = ( x2 - mu ) / sigma
x = np.arange(z1, z2, 0.001) # range of x in spec
x_all = np.arange(-10, 10, 0.001) # entire range of x, both in and out of spec
# mean = 0, stddev = 1, since Z-transform was calculated
y = norm.pdf(x,0,1)
y2 = norm.pdf(x_all,0,1)
# build the plot
fig, ax = plt.subplots(figsize=(9,6))
plt.style.use('fivethirtyeight')
ax.plot(x_all,y2)
ax.fill_between(x,y,0, alpha=0.3, color='b')
ax.fill_between(x_all,y2,0, alpha=0.1)
ax.set_xlim([-4,4])
ax.set_xlabel('# of Standard Deviations Outside the Mean')
ax.set_yticklabels([])
ax.set_title('Normal Gaussian Curve')
Output:
Result:
The program to explore normal curve using python and its libraries has been written and executed
successfully.
Ex No : 6(b)
Density and Contour Plots
Date :
Aim: To write python programs to demonstrate density and contour plots using UCI data sets.
Procedure:
Step 1) Import the required packages such as numpy, matplotlib and seaborn.
Step 2) load the iris data set
step 3) Set the title of the graph
step 4) Set the label and limits of the graph.
step 5) plot the density plot and contour plot using the functions
Program: To demonstrate density plot
import pandas as pd
import matplotlib.pyplot as plt
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
print(iris.head())
# DENSITY PLOT
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
iris.plot(ax = ax, kind='density', subplots=True, layout=(4,4), sharex=False)
plt.show()
Program: To demonstrate contour plots
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
data = sns.load_dataset("iris")
data.head()
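The contour-plotting statements are missing from this copy. A minimal sketch using seaborn's kdeplot to draw density contours for two iris features (the choice of features is an assumption):

# bivariate density contours of sepal length vs sepal width
sns.kdeplot(x=data["sepal_length"], y=data["sepal_width"])
plt.title("Iris: density contours")
plt.show()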
Result: The python programs to explore density plots and contour plots on the iris data set have been written and executed successfully.
Ex No : 6(c)
Correlation and Scatter plots
Date :
Aim: To write python programs to demonstrate Correlation and Scatter plots using UCI data sets.
Procedure:
Step 1) Import the required packages such as numpy, matplotlib and seaborn.
Step 2) load the iris data set
step 3) Derive correlation matrix
step 4) Set the title of the graph
step 5) Set the label and limits of the graph.
step 6) plot the Correlation plot
step 7)using the iris data plot the scatter plot
Program: To demonstrate correlation plot
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
iris.corr()
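A correlation heatmap is a common way to visualize this matrix; a sketch using the seaborn alias imported above (the styling options are assumptions):

# heatmap of the feature correlation matrix (older pandas drops the
# non-numeric Species column automatically in corr())
sb.heatmap(iris.corr(), annot=True, cmap='coolwarm')
plt.title('Iris feature correlations')
plt.show()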
Program: To demonstrate scatter plot
import pandas as pd
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
print(iris.head())
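# The figure and scatter statements are missing from this copy; a sketch
# (the column names sepallength/sepalwidth match the axis labels below):
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.scatter(iris['sepallength'], iris['sepalwidth'])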
ax.set_title('Iris Dataset')
ax.set_xlabel('sepallength')
ax.set_ylabel('sepalwidth')
Result
Thus the python program to demonstrate correlation plot and scatter plot has been written and executed
successfully.
Ex No : 6(d)
Histogram
Date :
Aim: To write python programs to demonstrate histogram using UCI data sets.
Procedure:
Step 1) Import the required packages such as numpy, matplotlib and seaborn.
Step 2) load the iris data set
step 3) Set the label and limits of the graph.
step 4) plot the histogram
Program: To demonstrate a histogram.
import pandas as pd
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
print(iris.head())
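The histogram statement is missing from this copy; a minimal sketch (one histogram per numeric column):

import matplotlib.pyplot as plt
iris.hist(figsize=(8, 6))
plt.show()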
Result
Thus the python program to demonstrate histogram has been written and executed successfully.
Ex No : 6(e)
Three dimensional plotting
Date :
Aim: To apply and explore three dimensional plotting functions on UCI data sets.
Procedure
fig = plt.figure(figsize=(4,4))
ax = fig.add_subplot(111, projection='3d')
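# The data-loading statements are missing from this copy. A sketch assuming
# the Pima diabetes CSV used in Ex. 5 (it has the Age, Glucose and Outcome
# columns plotted below):
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection
data = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\pima_diabetes.csv")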
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
x = data['Age'].values
y = data['Glucose'].values
z = data['Outcome'].values
ax.set_xlabel("Age (Year)")
ax.set_ylabel("Glucose (Reading)")
ax.set_zlabel("Outcome (0 or 1)")
ax.scatter(x, y, z, c='r', marker='o')
plt.show()
Result: Thus a python program to demonstrate three dimensional plotting has been written and executed
successfully.
Ex No : 7
Visualizing Geographic Data with Basemap.
Date :
Procedure:
Installing Basemap package
Import mpl_toolkits from basemap, matplotlib
Adding vector layers to a map
Projection, bounding box, & resolution
Plotting a specific region
Background relief maps
Plotting geographic data using Basemap
Programs:
The drawcountries() function uses similar arguments to drawcoastlines(), as shown below:
The fillcontinents() function can take the following arguments:
Draw map boundary
The drawmapboundary() function is used to draw the earth boundary on the map
The drawmapboundary() function can take the following arguments:
linewidth: sets line width for boundary line (default: 1)
color: sets the color of the boundary line (default: black)
fill_color: fills the map background region
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmeridians(range(0, 360, 20), color='k', linewidth=1.0, dashes=[4, 4], labels=[0, 0, 0, 1])
m.drawparallels(range(-90, 100, 10), color='k', linewidth=1.0, dashes=[4, 4], labels=[1, 0, 0, 0])
plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)
plt.show()
The Basemap() function is used to set projection, bounding box, & resolution of a map
Map projection:
Inside the Basemap() function, the projection=" argument can take several pre-defined
projections listed in the table below or visit this site to get more information.
To specify the desired projection, use the general syntax shown below:
m = Basemap(projection='aeqd')
m = Basemap(projection='cyl')
Some projections require setting the bounding box, map center, and map size using the following arguments:
Example:
m = Basemap(projection='cyl', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)
b) Map center: set with the lat_0 and lon_0 arguments.
Example:
m = Basemap(projection='ortho', lat_0=20, lon_0=78)
c) Map resolution: The map resolution argument determines the quality of vector layers such as
coastlines, lakes, & rivers etc. The available options are:
c: crude
l: low
i: intermediate
h: high
f: full
Let’s see some examples on how the map projection, bounding box, map center, & map
resolution arguments used to create and modify maps:
m.drawmapboundary(fill_color='lightblue')
plt.title(" Cylindrical Equidistant Projection", fontsize=20)
plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', lat_0=20, lon_0=78)
m.drawcoastlines()
m.drawcountries()
m.bluemarble(scale=0.5);
plt.show()
import numpy as np
from itertools import chain
#plt.figure(figsize=(16, 12),edgecolor='w')
m = Basemap(projection='ortho', lat_0=20, lon_0=78)
m.drawcoastlines()
m.drawcountries()
m.bluemarble(scale=0.5);
draw_map(m)  # helper that overlays latitude/longitude lines (its definition is not shown in this copy)
plt.show()
Plotting a specific region
Mapping Geographical Data with Basemap using a dataset-
datasets_557_1096_cities_r2.csv
# importing packages
import pandas as pd
import numpy as np
from numpy import array
import matplotlib as mpl
# for plots
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap import Basemap
%matplotlib inline
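# The statement loading the cities DataFrame is missing from this copy.
# A sketch using the dataset named above (the path is an assumption):
cities = pd.read_csv("datasets_557_1096_cities_r2.csv")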
fig = plt.figure(figsize=(20,20))
states = cities.groupby('state_name')['name_of_city'].count().sort_values(ascending=True)
states.plot(kind="barh", fontsize = 20)
plt.grid(b=True, which='both', color='Black',linestyle='-')
plt.xlabel('No of cities taken for analysis', fontsize = 20)
plt.show ()
cities['latitude'] = cities['location'].apply(lambda x: x.split(',')[0])
cities['longitude'] = cities['location'].apply(lambda x: x.split(',')[1])
print("The Top 10 Cities sorted according to the Total Population (Descending Order)")
top_pop_cities = cities.sort_values(by='population_total', ascending=False)
top10_pop_cities = top_pop_cities.head(10)
The Top 10 Cities sorted according to the Total Population (Descending Order)
plt.subplots(figsize=(20, 15))
map = Basemap(width=1200000,height=900000,projection='lcc',resolution='l',
llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=77)
map.drawmapboundary ()
map.drawcountries ()
map.drawcoastlines ()
lg=array(top10_pop_cities['longitude'])
lt=array(top10_pop_cities['latitude'])
pt=array(top10_pop_cities['population_total'])
nc=array(top10_pop_cities['name_of_city'])
x, y = map(lg, lt)
population_sizes = top10_pop_cities["population_total"].apply(lambda x: int(x / 5000))
plt.scatter(x, y, s=population_sizes, marker="o", c=population_sizes, cmap=cm.Dark2, alpha=0.7)
Result:
Thus the python program to demonstrate Visualizing Geographic Data with Basemap has been written
and executed successfully.