FODS Lab Manual - Organized

The document outlines a series of experiments focused on data analysis using Python libraries such as NumPy, SciPy, Jupyter, Statsmodels, and Pandas. It includes instructions for installation, usage of various data manipulation techniques, and statistical analysis methods on datasets like Iris and diabetes data. Additionally, it provides guidance on visualizing data and using aggregate functions in NumPy.

INDEX

LIST OF EXPERIMENTS

1. Download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas packages.
2. Working with NumPy arrays.
3. Working with Pandas data frames.
4. Reading data from text files, Excel and the web and exploring various commands for doing descriptive analytics on the Iris data set.
5. Use the diabetes data set from UCI and the Pima Indians Diabetes data set for performing the following:
   a. Univariate analysis: Frequency, Mean, Median, Mode, Variance, Standard Deviation, Skewness and Kurtosis.
   b. Bivariate analysis: Linear and logistic regression modeling.
   c. Multiple Regression analysis.
   d. Also compare the results of the above analysis for the two data sets.
6. Apply and explore various plotting functions on UCI data sets:
   a. Normal curves
   b. Density and contour plots
   c. Correlation and scatter plots
   d. Histograms
   e. Three-dimensional plotting
7. Visualizing Geographic Data with Basemap.
8. Read the following file formats:
   a. Pickle files
   b. Image files using PIL
   c. Multiple files using Glob
   d. Importing data from database
Ex No : 1 Download, install and explore the features of NumPy, SciPy, Jupyter,
Date : Statsmodels and Pandas packages.

Aim: To download, install and explore the features of NumPy, SciPy, Jupyter, Statsmodels and Pandas
packages.

Methods:

How to install Jupyter Notebook on Windows?

Jupyter Notebook is an open-source web application that allows you to create and share
documents that contain live code, equations, visualizations, and narrative text. Uses include data
cleaning and transformation, numerical simulation, statistical modeling, data visualization,
machine learning, and much more.

Jupyter has support for over 40 different programming languages and Python is one of them.
Python is a requirement (Python 3.3 or greater, or Python 2.7) for installing the Jupyter
Notebook itself.

How to Install PIP on Windows?

PIP is a package management system used to install and manage software packages/libraries written in
Python. These files are stored in a large “online repository” termed as Python Package Index (PyPI). pip
uses PyPI as the default source for packages and their dependencies. So whenever you type:

pip install package_name

pip will look for that package on PyPI and if found, it will download and install the package on your local
system.

Check if Python is installed

Run the following command to test if python is installed.

python --version

If it is installed, you will see something like this:

Python 3.10.0

Download and Install pip

The PIP can be downloaded and installed using the command line by going through the
following steps:

Method 1: Manually install PIP on Windows


Pip must be manually installed on Windows. You might need to use the correct version of the
file from pypa.org if you’re using an earlier version of Python or pip. Get the file and save it to a
folder on your PC.

Step 1: Download the get-pip.py (https://bootstrap.pypa.io/get-pip.py) file and store it in the


same directory as python is installed.

Step 2: Change the current path of the directory in the command line to the path of the directory
where the above file exists.

Step 3: get-pip.py is a bootstrapping script that enables users to install pip in Python
environments. Run the command given below:

python get-pip.py

Step 4: Now wait through the installation process. Voila! pip is now installed on your system.

Verification of the installation process

One can easily verify if the pip has been installed correctly by performing a version check on the
same. Just go to the command line and execute the following command:

pip -V or pip --version

Adding PIP to Windows Environment Variables

If you are facing any path error then you can follow the following steps to add the pip to your
PATH. You can follow the following steps to set the Path:

 Go to System and Security > System in the Control Panel once it has been opened.
 On the left side, click the Advanced system settings link.
 Then select Environment Variables.
 Double-click the PATH variable under System Variables.
 Click New, add the directory where pip is installed, e.g. C:\Python33\Scripts, and select OK.

Upgrading Pip On Windows

pip can be upgraded using the following command.

python -m pip install -U pip

Downgrading Pip On Windows

Sometimes your current pip version does not support your current version of Python or your machine; in that case you can downgrade pip with the following command.
Note: You can mention the version you want to install.

python -m pip install pip==17.0

Using PIP:
Install Jupyter using the PIP package manager used to install and manage software packages/libraries
written in Python.

Installing Jupyter Notebook using pip:

PIP is a package management system used to install and manage software packages/libraries
written in Python. These files are stored in a large “on-line repository” termed as Python Package
Index(PyPI).
pip uses PyPI as the default source for packages and their dependencies.

To install Jupyter using pip, we need to first check if pip is updated in our system. Use the
following command to update pip:

python -m pip install --upgrade pip

After updating the pip version, follow the instructions provided below to install Jupyter:

 Command to install Jupyter: python -m pip install jupyter

 Beginning Installation:

 Downloading Files and Data:

 Installing Packages:

 Finished Installation:

Launching Jupyter:
Use the following command to launch Jupyter using command-line:

jupyter notebook

Install NumPy, SciPy, Matplotlib with Python 3 on Windows

Start the installer and select Customize installation. On the next screen leave all the optional features
checked. Finally, on the Advanced Options screen make sure to check Install for all users, Add Python to
environment variables and Precompile standard library. Optionally, you can customize the install
location. I’ve used C:\Python38. You should see something like this:

Press the Install button and in a few minutes, depending on the speed of your computer, you should be
ready. On the last page of the installer, you should also press the Disable path length limit:

Now, to check if Python was correctly installed, open a Command Prompt (or a PowerShell) window.
Press and hold the SHIFT key and right click with your mouse somewhere on your desktop, select Open
command window here. Alternatively, on Windows 10, use the bottom left search box to search for cmd.

Write python in the command window and press Enter, you should see something like this:

Exit from the Python interpreter by writing quit() and pressing the Enter key.

Now, open a cmd window like before. Use the next set of commands to install NumPy, SciPy and
Matplotlib:

python -m pip install numpy
python -m pip install scipy
python -m pip install matplotlib

After each of the above commands you should see Successfully installed ….

Launch Python from a cmd window and check the version of Scipy, you should see something like this:

C:\>python
Python 3.8.1 (tags/v3.8.1:1b293b6, Dec 18 2019, 22:39:24) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import scipy as sp
>>> sp.version.version
'1.4.1'
>>>

Let’s try something a bit more interesting now, let’s plot a simple function with Matplotlib. First, we’ll import SciPy and Matplotlib with:

import scipy as sp
import matplotlib.pylab as plt

Next, we can define some points on the (0, 1) interval with:

t = sp.linspace(0, 1, 100)  # sp.linspace is deprecated in recent SciPy; np.linspace is the preferred equivalent

Now, let’s plot a parabola defined by the above interval:

plt.plot(t, t**2)
plt.show()

You should see something like this:

How to Install Python Pandas on Windows and Linux?

Pandas in Python is a package that is written for data analysis and manipulation. Pandas offer
various operations and data structures to perform numerical data manipulations and time series.
Pandas is an open-source library that is built over Numpy libraries. Pandas library is known for
its high productivity and high performance. Pandas is popular because it makes importing and
analyzing data much easier.

Pandas programs can be written on any plain text editor like notepad, notepad++, or anything of
that sort and saved with a .py extension. To begin with, writing Pandas Codes and performing
various intriguing and useful operations, one must have Python installed on their System. This
can be done by following the step by step instructions provided below:

What if Python already exists? Let’s check

To check if your device is pre-installed with Python or not, just go to the command line (search for cmd in the Run dialog, Win + R).
Now run the following command:

python --version

If Python is already installed, it will generate a message with the Python version available.

Downloading and Installing Pandas


Pandas can be installed in multiple ways on Windows and on Linux. Various different ways are
listed below:
Windows

Python Pandas can be installed on Windows in two ways:

 Using pip
 Using Anaconda

Install Pandas using pip

PIP is a package management system used to install and manage software packages/libraries
written in Python. These files are stored in a large “on-line repository” termed as Python Package
Index (PyPI).
Pandas can be installed using PIP by the use of the following command:

pip install pandas

Installing statsmodels

The easiest way to install statsmodels is to install it as part of the Anaconda distribution, a cross-
platform distribution for data analysis and scientific computing. This is the recommended
installation method for most users.

Instructions for installing from PyPI, source or a development version are also provided.

Python Support
statsmodels supports Python 3.7, 3.8, and 3.9.

Anaconda
statsmodels is available through conda provided by Anaconda. The latest release can be installed
using: conda install -c conda-forge statsmodels

PyPI (pip)
To obtain the latest released version of statsmodels using pip: pip install statsmodels

Installation from Source

You will need a C compiler installed to build statsmodels. If you are building from the github
source and not a source release, then you will also need Cython. You can follow the instructions
below to get a C compiler setup for Windows.

If your system is already set up with pip, a compiler, and git, you can try:

pip install git+https://github.com/statsmodels/statsmodels

If you do not have pip installed or want to do the installation more manually, you can also type:

python setup.py install

Or even more manually

python setup.py build


python setup.py install

statsmodels can also be installed in develop mode which installs statsmodels into the current
python environment in-place. The advantage of this is that edited modules will immediately be
re-interpreted when the python interpreter restarts without having to re-install statsmodels.

python setup.py develop

How to install statsmodels in Python?

If you aspire to a flourishing career in the field of machine learning, let us introduce you to one more interesting package whose functionalities would leave you awestruck.

So, let's see what statsmodels is and what its features are.

Statsmodels is a popular library in Python that enables us to estimate and analyze various
statistical models. It is built on numeric and scientific libraries like NumPy and SciPy.

Some of the essential features of this package are-

1. It includes various models of linear regression like ordinary least squares, generalized least
squares, weighted least squares, etc.
2. It provides some efficient functions for time series analysis.
3. It also has some datasets for examples and testing.
4. Models based on survival analysis are also available.
5. A wide range of statistical tests for large-scale data is also available.

Installing statsmodels

Let's have a look at the steps of installing statsmodels in Python-

1. Checking the version of Python installed in our PCs, we have discussed this already in the
previous articles but let's talk about this again-
There are two ways to check the version of Python in Windows-

 Using Powershell
 Using Command Prompt

Using PowerShell

Follow the below steps to check the version of Python using PowerShell.

1. Press 'Win+R' or type 'Run' in the taskbar's search pane.


2. Type 'Powershell'
3. A window will appear on your screen named 'Windows Powershell'
4. Press 'Enter'
5. Type python --version and press 'Enter'
6. The version will be displayed on the next line.

Using Command Prompt

Type 'Command Prompt' on the taskbar's search pane and you'll see its icon. Click on it to open
the command prompt.

Also, you can directly click on its icon if it is pinned on the taskbar.

1. Once the 'Command Prompt' screen is visible on your screen,
2. type python --version and press 'Enter'.
3. The version installed on your system will be displayed on the next line.

Checking the Version of Python in Linux

In Linux, we have a shell where we type commands; the shell interprets them and tells the operating system what the user wants.

The steps to check the version of Python in Linux are:

 Start your system and switch on to the Linux operating system (you might find it with the name
Ubuntu).
 Once the desktop screen of Linux appears, click on 'Terminal' to open it.
 In the terminal window, type python --version and press 'Enter'.

In the next line, it will display the current version of python installed in your system.

Installation of statsmodels

Now let us discuss the steps for installing statsmodels in our system. We will look at two
methods of installation

1. Using Anaconda Prompt


2. Using Command Prompt

In the first method, we will open the Anaconda Prompt and type the following command-

conda install -c conda-forge statsmodels

In the second method, we will open the Command Prompt, type the following command and
click on 'Enter'.

pip install statsmodels

It's time to look at a program in which we will import statsmodels.

Here, we will perform OLS (Ordinary Least Squares) regression. In this technique we try to minimize the sum of squared differences between the fitted values and the observed values.

Example -

import pandas as pd
import statsmodels.api as sm
df = pd.read_csv("/content/SampleSuperstore.csv")
df.head()
x = df['Sales']
y = df['Profit']
# note: sm.OLS fits without an intercept; wrap x in sm.add_constant(x) to include one
model = sm.OLS(y, x).fit()
model_summary = model.summary()
print(model_summary)

Output-

Result: Thus Jupyter Notebook and the Python libraries Pandas, NumPy, SciPy and Statsmodels have been successfully downloaded and installed.

Ex No : 2 (a)
Numpy Aggregate functions
Date :

Aim: To write python code using numpy to create an array and apply the different aggregate functions. i)
sum of elements ii)Max iii) Min iv) standard deviation v) variance vi) index of minimum and maximum
value.
Algorithm:
Step 1: Create a 4x3 array of random integers in the interval [0, 10)
Step 2: calculate sum using np.sum()
Step 3: calculate max using np.max()
step 4: calculate min using np.min()
Step 5: calculate standard deviation using np.std()
Step 6: calculate variance using np.var()
Step 7: calculate index of minimum and maximum value np.argmin() and np.argmax()
Program:
import numpy as np
a=np.random.randint(0, 10, (4, 3))
print(a)
print(np.sum(a))
print(np.min(a))
print(np.max(a))
print(np.std(a))
print(np.var(a))
print(np.argmin(a))
print(np.argmax(a))

Output:
4 x 3 array - a
[[5 9 4]
[4 0 7]
[0 6 9]
[9 1 3]]
Sum of a: 57
Minimum of a: 0
Maximum of a: 9
Standard Deviation of a: 3.2177890960513036
Variance of a: 10.354166666666666
Index of the minimum of a: 4
Index of the maximum of a: 1
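The argmin/argmax results above (4 and 1) are positions in the flattened array. A small sketch, using the array printed above, of how np.unravel_index converts such a flat index back to (row, column) coordinates:

import numpy as np
a = np.array([[5, 9, 4], [4, 0, 7], [0, 6, 9], [9, 1, 3]])
flat_min = np.argmin(a)                     # 4, a position in the flattened array
print(np.unravel_index(flat_min, a.shape))  # (1, 1): row 1, column 1 holds the 0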

Result: The python program using Numpy to find aggregates of a given value has been executed
successfully.

Ex No : 2 (b)
Numpy Attributes and Indexing
Date :

Aim: To write a python program to work with numpy attributes and indexing.
Algorithm:
Step 1: import the numpy package
Step 2: create a random array of one dimensional, two dimensional and three dimensional arrays
Step 3: use different numpy attributes like ndim, shape, size and dtype.
Step 4: use positive and negative indexing to find how the elements are accessed

Program:
import numpy as np
np.random.seed(0)
x1 = np.random.randint(10, size=6) # One-dimensional array
x2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)
print(x1)
print(x1.dtype)
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print(x2)
print(x2.dtype)
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)
print(x3)
print("data type of x3", x3.dtype)
#numpy indexing
print(x1)
print("The index'0' of x1:", x1[0])
print("The reverse index of x1:", x1[-1])
print("Array x2:", x2)
print("The first index in a 2 dimensional array:", x2[0,0])
print(x2[2,-1])
x2[0,0]=12
print("The new value of x2[0,0]:", x2)

Output:
x1 ndim: 1
x1 shape: (6,)
x1 size: 6
[5 0 3 3 7 9]
int32
x2 ndim: 2
x2 shape: (3, 4)
x2 size: 12
[[3 5 2 4]
[7 6 8 8]
[1 6 7 7]]
int32
x3 ndim: 3
x3 shape: (3, 4, 5)
x3 size: 60
[[[8 1 5 9 8]
[9 4 3 0 3]
[5 0 2 3 8]
[1 3 3 3 7]]

[[0 1 9 9 0]
[4 7 3 2 7]
[2 0 0 4 5]
[5 6 8 4 1]]

[[4 9 8 1 1]
[7 9 9 3 6]
[7 2 0 3 5]
[9 4 4 6 4]]]
data type of x3 int32
[5 0 3 3 7 9]
The index'0' of x1: 5
The reverse index of x1: 9
Array x2:
[[3 5 2 4]
[7 6 8 8]
[1 6 7 7]]
The first index in a 2 dimensional array: 3
7
The new value of x2[0,0]:
[[12 5 2 4]
[ 7 6 8 8]
[ 1 6 7 7]]

Result: The python program using Numpy has been written and executed successfully to demonstrate its
attributes and indexing.

Ex No : 2 (c)
Working with Concatenation, Slicing of Numpy arrays
Date :

Aim: To work and practice with Numpy arrays concatenation and slicing.

Algorithm:

Step 1: import the numpy package


Step 2: create a random array of one dimensional, two dimensional and three dimensional array
Step 3: use concatenate function
Step 4: use slicing using the syntax x[start:stop:step]
Step 5: Print the result.

Program: To demonstrate Concatenation


import numpy as np
x = np.array([1, 2, 3])
y = np.array([3, 2, 1]) #numpy concatenation
np.concatenate([x, y])
z = [99, 99, 99]
print("Concatenating x, y, z:", np.concatenate([x, y, z]))
grid = np.array([[1, 2, 3],
[4, 5, 6]])
print(np.concatenate([grid, grid]))
print(np.concatenate([grid, grid], axis=1))

Output
Concatenating x, y, z: [ 1 2 3 3 2 1 99 99 99]
[[1 2 3]
[4 5 6]
[1 2 3]
[4 5 6]]
[[1 2 3 1 2 3]
[4 5 6 4 5 6]]

Program: To demonstrate Slicing

import numpy as np
np.random.seed(0)
x = np.arange(10)
print("The array X:", x)
#numpy slicing
print("The First five elements:",x[:5])
print("The Last five elements:",x[5:])
print("The elements in between:",x[4:7])
print("The even elements:", x[::2])
print("The odd elements:", x[1::2])
print("The elements in the reverse order:", x[::-1])

print(x[5::-2])
x2 = np.random.randint(10, size=(3, 4))
print("The resized array x2:", x2)
print(x2[:2,:3])
print(x2[:,0])

Output

The array X: [0 1 2 3 4 5 6 7 8 9]
The First five elements: [0 1 2 3 4]
The Last five elements: [5 6 7 8 9]
The elements in between: [4 5 6]
The even elements: [0 2 4 6 8]
The odd elements: [1 3 5 7 9]
The elements in the reverse order: [9 8 7 6 5 4 3 2 1 0]
[5 3 1]
The resized array x2: [[5 0 3 3]
[7 9 3 5]
[2 4 7 6]]
[[5 0 3]
[7 9 3]]
[5 7 2]

Result: Thus numpy program to demonstrate concatenation and slicing techniques in numpy arrays has
been written and executed successfully.

Ex No : 2 (d)
Working with Reshaping and Splitting of Numpy arrays
Date :

Aim: To work and practice with Numpy arrays splitting and reshaping.

Algorithm:

Step 1: import the numpy package


Step 2: create a random array of one dimensional, two dimensional and three dimensional array
Step 3: use splitting, reshape, arange functions
Step 4: Print the result.

Program: To demonstrate numpy reshape

import numpy as np
grid = np.arange(1, 10)
print("The array'grid':", grid)
grid1 = np.arange(1, 10).reshape((3, 3))
print("The reshaped array 'grid1':",grid1)
x = np.array([1, 2, 3])
print(x)
print(x.reshape((1, 3)))
print(x[np.newaxis, :])

Output
The array'grid': [1 2 3 4 5 6 7 8 9]
The reshaped array 'grid1': [[1 2 3]
[4 5 6]
[7 8 9]]
[1 2 3]
[[1 2 3]]
[[1 2 3]]

Program: To demonstrate numpy splitting

import numpy as np
x = [1, 2, 3, 99, 99, 3, 2, 1]
x1, x2, x3 = np.split(x, [3, 5])
print("Splitting the array x into x1, x2, x3:", x1, x2, x3)

Output:

Splitting the array x into x1, x2, x3: [1 2 3] [99 99] [3 2 1]

Program: To demonstrate numpy splitting – vsplit()

import numpy as np
grid = np.arange(16).reshape((4, 4))
print(grid)
upper, lower = np.vsplit(grid, [2])
print(upper)
print(lower)

Output

[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[0 1 2 3]
[4 5 6 7]]
[[ 8 9 10 11]
[12 13 14 15]]

Program: To demonstrate numpy splitting – hsplit()

import numpy as np
grid = np.arange(16).reshape((4, 4))
print(grid)
left, right = np.hsplit(grid, [2])
print(left)
print(right)

Output
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
[[ 0 1]
[ 4 5]
[ 8 9]
[12 13]]
[[ 2 3]
[ 6 7]
[10 11]
[14 15]]

Program: To demonstrate numpy splitting – dsplit()


import numpy as np
a = np.arange(12.0).reshape(2, 2, 3)
print("The array 'a':", a)
D1 = np.dsplit(a,3)
print("The dsplit array 'D1':",D1)
D2 = np.dsplit(a, np.array([2,6]))
print("The dsplit array 'D2':",D2)

Output
The array 'a': [[[ 0. 1. 2.]
[ 3. 4. 5.]]

[[ 6. 7. 8.]
[ 9. 10. 11.]]]
The dsplit array 'D1': [array([[[0.],
[3.]],

[[6.],
[9.]]]), array([[[ 1.],
[ 4.]],

[[ 7.],
[10.]]]), array([[[ 2.],
[ 5.]],

[[ 8.],
[11.]]])]
The dsplit array 'D2': [array([[[ 0., 1.],
[ 3., 4.]],

[[ 6., 7.],
[ 9., 10.]]]), array([[[ 2.],
[ 5.]],

[[ 8.],
[11.]]]), array([], shape=(2, 2, 0), dtype=float64)]

Result: The python program using Numpy has been written and executed successfully to demonstrate its
Concatenation, Slicing, Reshaping and Splitting.

Ex No : 3 (a)
Working with Pandas Series objects.
Date :

Aim: To write a Python program to demonstrate Pandas Series objects.

Algorithm:

Step1: Import the numpy and pandas libraries.


Step2: Generate random numbers using randint function.
Step3: Create pandas Series objects
Step4: Do the index alignment of the pandas objects
Step5: Perform indexing operation.
Step6: Perform Slicing operation.
Step7: Perform Binary operation on Series

Creating a series from an array: In order to create a series from an array, we have to import the numpy module and use the array() function.

Program

# import pandas as pd
import pandas as pd
# import numpy as np
import numpy as np
# simple array
data = np.array(['p','a','n','d','a', 's'])
ser = pd.Series(data)
print(ser)

Output :
0    p
1    a
2    n
3    d
4    a
5    s
dtype: object

Creating a series from an array with an index: In order to create a series by explicitly providing an index instead of the default, we have to pass a list of elements to the index parameter with the same number of elements as the array.

import pandas as pd

# a simple list
data = ['p','a','n','d','a','s']

# create series from a list
ser = pd.Series(data)
print(ser)

# create series with an explicit index
ser = pd.Series(data, index=[10, 11, 12, 13, 14, 15])
print(ser)

Output
0 p
1 a
2 n
3 d
4 a
5 s
dtype: object
10 p
11 a
12 n
13 d
14 a
15 s
dtype: object

Creating a series from a dictionary: In order to create a series from a dictionary, we first create the dictionary; then we can make a series from it. Dictionary keys are used to construct the index of the Series.
import pandas as pd
# a simple dictionary
dict = {'Bala': 10,
'Chander': 20,
'Vijay': 30}
# create series from dictionary
ser = pd.Series(dict)
print(ser)
Output
Bala 10
Chander 20
Vijay 30
dtype: int64

Creating a series using NumPy functions: In order to create a series using numpy functions, we can use different functions of numpy like numpy.linspace() and numpy.random.randn().

# import pandas and numpy


import pandas as pd
import numpy as np

# series with numpy linspace()

ser1 = pd.Series(np.linspace(3, 33, 3))
print(ser1)

# series with numpy linspace()


ser2 = pd.Series(np.linspace(1, 100, 10))
print(ser2)

Output
0 3.0
1 18.0
2 33.0
dtype: float64
0 1.0
1 12.0
2 23.0
3 34.0
4 45.0
5 56.0
6 67.0
7 78.0
8 89.0
9 100.0
dtype: float64

Accessing element of Series

There are two ways through which we can access element of series, they are :

 Accessing Element from Series with Position


 Accessing Element Using Label (index)

Accessing Element from Series with Position: To access a series element, refer to its index number. Use the index operator [ ] to access an element in a series; the index must be an integer. To access multiple elements from a series, we use the slice operation.

Accessing first 5 elements of Series

# import pandas and numpy


import pandas as pd
import numpy as np

# creating simple array


data = np.array(['I','L','O','V','E','I','N','D','I','A'])
ser = pd.Series(data)

# retrieve the first five elements


print(ser[:5])

Output
0 I
1 L

2 O
3 V
4 E
dtype: object

Accessing Element Using Label (index):

In order to access an element from a series, we refer to its index label. A Series is like a fixed-size dictionary in that you can get and set values by index label.

Accessing a single element using index label

# import pandas and numpy


import pandas as pd
import numpy as np

# creating simple array


data = np.array(['D','A','T','A','S','C','I','E','N','C','E'])
ser = pd.Series(data,index=[10,11,12,13,14,15,16,17,18,19,20])

# accessing a element using index element


print(ser[16])

Output
I

Indexing a Series using indexing operator [] :


Indexing operator is used to refer to the square brackets following an object.

# importing pandas module


import pandas as pd

# making data frame


df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\II CSE B.csv")

ser = pd.Series(df['Name'])
data = ser.head(10)
data

Output
0 ABIRAMI T
1 BACHU MANEESH
2 BENSIHA A
3 DEVANAND C
4 DHANALAKSHMI S
5 DHANUSH J
6 DHIVYA LAKSHMI B
7 DORAGALU NAVADEEP
8 DRAVID M
9 GNANESWARAN B
Name: Name, dtype: object

Indexing a Series using .loc[ ]:
This function selects data by referring to the explicit index. The df.loc indexer selects data in a different way than just the indexing operator. It can select subsets of data.

We access the element of series using .loc[] function.

data.loc[3:6]

Output
3 DEVANAND C
4 DHANALAKSHMI S
5 DHANUSH J
6 DHIVYA LAKSHMI B
Name: Name, dtype: object

Indexing a Series using .iloc[ ] :


This function allows us to retrieve data by position. In order to do that, we’ll need to specify the positions
of the data that we want. The df.iloc indexer is very similar to df.loc but only uses integer locations
to make its selections.

data.iloc[3:6]

Output
3 DEVANAND C
4 DHANALAKSHMI S
5 DHANUSH J
Name: Name, dtype: object

Binary Operation on Series

We can perform binary operations on series like addition, subtraction and many other operations. In order to perform binary operations on series we have to use functions like .add(), .sub(), etc.

# importing pandas module


import pandas as pd

# creating a series
data = pd.Series([5, 2, 3,7], index=['a', 'b', 'c', 'd'])

# creating a series
data1 = pd.Series([1, 6, 4, 9], index=['a', 'b', 'd', 'e'])

print(data, "\n\n", data1)


# adding two series using
# .add
data.add(data1, fill_value=0)

Output
a 5

b 2
c 3
d 7
dtype: int64

a 1
b 6
d 4
e 9
dtype: int64

a 6.0
b 8.0
c 3.0
d 11.0
e 9.0
dtype: float64

data.sub(data1, fill_value=0)

a 4.0
b -4.0
c 3.0
d 3.0
e -9.0
dtype: float64
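For comparison, calling .add() without fill_value leaves NaN for the labels that occur in only one of the two Series ('c' and 'e'):

data.add(data1)

a     6.0
b     8.0
c     NaN
d    11.0
e     NaN
dtype: float64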

Result: Thus the pandas program to create Series objects has been written and various operations on them have been performed successfully.

Ex No : 3 (b)
Working with Pandas Data Frame objects.
Date :

Aim: To write a Python program to demonstrate Pandas Data Frame objects.

Algorithm:

Step1: Import the numpy and pandas libraries.


Step2: Generate random numbers using randint function.
Step3: Create pandas DataFrame and Series objects
Step4: Do the index alignment of the pandas objects
Step5: Perform subtraction operation between dataframes.
Step6: Perform addition operation between dataframes.
Step7: Check for null values in dataframe
Step8: Concatenate Data Frame objects using concat function.

Program
import pandas as pd
import numpy as np
rng = np.random.RandomState(42)
A = rng.randint(10, size=(3, 4))
print("Array A", A)
print(A - A[0])
df = pd.DataFrame(A, columns=list('QRST'))
print("The Data Frame df",df)

#subtraction operation between dataframes.


print(df - df.iloc[0])
print(df.subtract(df['R'], axis=0))

#Addition operation between dataframes.


print(df.add(df['Q'],axis=0))
data = pd.Series([1, np.nan, 'hello', None])

#Check for null values.


print(data.isnull())
print(data[data.notnull()])

#Indexing
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
#Concatenation
print(pd.concat([ser1, ser2]))

Output
Array A [[6 3 7 4]
[6 9 2 6]
[7 4 3 7]]

[[ 0 0 0 0]
[ 0 6 -5 2]
[ 1 1 -4 3]]

The Data Frame df Q R S T


0 6 3 7 4
1 6 9 2 6
2 7 4 3 7

Q R S T
0 0 0 0 0
1 0 6 -5 2
2 1 1 -4 3

Q R S T
0 3 0 4 1
1 -3 0 -7 -3
2 3 0 -1 3

Q R S T
0 12 9 13 10
1 12 15 8 12
2 14 11 10 14

0 False
1 True
2 False
3 True
dtype: bool

0 1
2 hello
dtype: object

1 A
2 B
3 C
4 D
5 E
6 F
dtype: object

Result: Thus the pandas program to create data frame has been written and also various operations on
data frame has been performed successfully.

Ex No : 3 (c)
Working with combining data sets.
Date :

Aim: To write pandas programs to work with combining data sets.

Algorithm:

Step 1: import the pandas package


Step 2: Read Dummy_course data set by using read_excel()
Step 3: Create two data frames df1 (for course fees) and df2 (for course discounts)
Step 3: Display the data in df1 and df2
Step 4: Combine the two data sets by using pandas.merge()
Step 5: Display the combined data set.
Step 6: Apply all the options in the pandas.merge() functions.

The data required for a data-analysis task usually comes from multiple sources, so it is important to learn the methods for bringing this data together. pandas provides three convenient and time-saving methods for combining multiple datasets.
They are
merge(): To combine the datasets on common column or index or both.
concat(): To combine the datasets across rows or columns.
join(): To combine the datasets on key column or index.

We will be using dummy course_data dataset which has two sheets “Fees” and “Discounts”

Program
import pandas as pd
df1=pd.read_excel(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Dummy_course_data.xls",
sheet_name="Fees")
df2=pd.read_excel(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Dummy_course_data.xls",
sheet_name="Discounts")
print("The fees data set:", df1)
print("The discount data set:",df2)

Output
The fees data set:
Course Country Fee_USD
0 Maths India 15500
1 Physics Germany 16700
2 Applied Maths Germany 11100
3 General Science United Kingdom 18000
4 Social Science Austria 18400
5 History Poland 23000
6 Politics India 21600
7 Computer Graphics United States 27000
The discount data set:
Course Country Discount_USD
0 Maths India 1000
1 Physics Germany 2300
2 German language Germany 1500
3 Information Technology United Kingdom 1200
4 Social Science Austria 1500
5 History Poland 3200
6 Marketing India 2000
7 Computer Graphics United States 2500

Program: To merge two data sets


df3 = pd.merge(df1,df2)
df3

pd.merge() automatically detects the common column between two datasets and combines them on this
column.
Output

              Course        Country  Fee_USD  Discount_USD
0              Maths          India    15500          1000
1            Physics        Germany    16700          2300
2     Social Science        Austria    18400          1500
3            History         Poland    23000          3200
4  Computer Graphics  United States    27000          2500
Program to demonstrate the ‘outer’ option in merge().
This ‘outer’ join is similar to the one in SQL. It returns matching rows from both datasets plus the non-matching rows. Some cells are filled with NaN because those columns have no matching records in the other of the two datasets.
df4 = pd.merge(df1, df2, how='outer')
df4

Output

As the second dataset df2 has 3 rows different from df1 for the columns Course and Country, the final output after the merge contains 11 rows: the 8 rows from df1 plus 3 additional rows from df2.

Program to demonstrate the ‘left’ option in merge().

As per its definition, a left join returns all the rows from the left DataFrame and only the matching rows from the right DataFrame.
Exactly that happened here: for the rows which have no value in the Discount_USD column, NaN is substituted.

df6 = pd.merge(df1, df2, how='left')


df6

Output

Program to demonstrate the ‘right’ option in merge().

The right join returns all rows from the right DataFrame, i.e. df2, and only the matching rows from the left DataFrame, i.e. df1.

You can get the same results by using how='left' as well; all you need to do is swap the order of the DataFrames in pd.merge() from df1, df2 to df2, df1.

The order of the columns in the final output will change based on the order in which you mention the DataFrames in pd.merge().

df7 = pd.merge(df1, df2, how='right')


df7

Output

DataFrame.join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False, validate=None)

Join columns of another DataFrame.

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple
DataFrame objects by index at once by passing a list.

Parameters
other : DataFrame, Series, or a list containing any combination of them

Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.

on : str, list of str, or array-like, optional

Column or index level name(s) in the caller to join on the index in other, otherwise joins
index-on-index. If multiple values given, the other DataFrame must have a MultiIndex.
Can pass an array as the join key if it is not already contained in the calling DataFrame.
Like an Excel VLOOKUP operation.

how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’

How to handle the operation of the two objects.

 left: use calling frame’s index (or column if on is specified)


 right: use other’s index.
 outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.
 inner: form intersection of calling frame’s index (or column if on is specified)
with other’s index, preserving the order of the calling’s one.
 cross: creates the cartesian product from both frames, preserves the order of the
left keys.

New in version 1.2.0.

lsuffix : str, default ‘’

Suffix to use from left frame’s overlapping columns.

rsuffix : str, default ‘’

Suffix to use from right frame’s overlapping columns.

sort : bool, default False

Order result DataFrame lexicographically by the join key. If False, the order of the join
key depends on the join type (how keyword).

validate : str, optional

If specified, checks if the join is of the specified type.

 “one_to_one” or “1:1”: check if join keys are unique in both left and right datasets.
 “one_to_many” or “1:m”: check if join keys are unique in the left dataset.
 “many_to_one” or “m:1”: check if join keys are unique in the right dataset.
 “many_to_many” or “m:m”: allowed, but does not result in checks.

New in version 1.5.0.

Returns
DataFrame

A dataframe containing columns from both the caller and other.

Program : To demonstrate DataFrame.join()


df8 = df1.join(df2,lsuffix = '_df1',rsuffix = '_df2')
df8

Output

As per its definition, join() combines two DataFrames on the index (by default), and that is why the output contains all the rows and columns from both DataFrames.
If you want to join both DataFrames using the common column Course, you need to set Course as the index in both df1 and df2. It can be done as below.
For the sake of simplicity, df1 and df2 are copied into df11 and df22 respectively.
df11 = df1.copy()
df11.set_index('Course', inplace=True)
print(df11)
df22 = df2.copy()
df22.set_index('Course', inplace=True)
print(df22)

The above block of code makes the column Course the index in both datasets.
Output
Country Fee_USD
Course
Maths India 15500
Physics Germany 16700
Applied Maths Germany 11100
General Science United Kingdom 18000

Social Science Austria 18400
History Poland 23000
Politics India 21600
Computer Graphics United States 27000
Country Discount_USD
Course
Maths India 1000
Physics Germany 2300
German language Germany 1500
Information Technology United Kingdom 1200
Social Science Austria 1500
History Poland 3200
Marketing India 2000
Computer Graphics United States 2500

df11.join(df22, lsuffix='_df1', rsuffix='_df2')

Output

concat()
The concat function concatenates datasets along rows or columns. It simply stacks multiple DataFrames together, one over the other, or side by side when aligned on the index.

Syntax: concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)

Parameters:

 objs: Series or DataFrame objects


 axis: axis to concatenate along; default = 0
 join: way to handle indexes on other axis; default = ‘outer’
 ignore_index: if True, do not use the index values along the concatenation axis; default = False
 keys: sequence to add an identifier to the result indexes; default = None
 levels: specific levels (unique values) to use for constructing a MultiIndex; default = None
 names: names for the levels in the resulting hierarchical index; default = None
 verify_integrity: check whether the new concatenated axis contains duplicates; default = False
 sort: sort non-concatenation axis if it is not already aligned when join is ‘outer’; default = False
 copy: if False, do not copy data unnecessarily; default = True

Returns: type of objs (Series of DataFrame)

Program to demonstrate concat()


pd.concat([df1, df2])
Output

Both datasets can be stacked side by side as well by making the axis = 1, as shown below.
pd.concat([df1,df2], axis=1,keys = ["df1_data","df2_data"])

Output

Result: Thus the pandas program to demonstrate combining data sets using the merge(), concat() and join() functions has been written and executed successfully.

Ex No : 3 (d)
Working with pivot table
Date :

Aim: To write a python program using pandas which demonstrates pivot table.
Algorithm:
Step 1: import the pandas package
Step 2: Read the tested.csv data set by using read_csv()
Step 3: Create a data frame df to load data from tested.csv
Step 4: Display the data in df.
Step 5: Construct a table that shows the proportion of people who survived in each passenger class. (There are three classes: 1, 2, and 3. Replacing them with strings would look better as an index.)
Step 6: Generate a tabular sheet to show the proportion of people who survived in each class, segregated by gender.
Step 7: Calculate the average fare in each class for both males and females.
Syntax:

pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc=’mean’, fill_value=None, margins=False, dropna=True, margins_name=’All’) creates a spreadsheet-style pivot table as a DataFrame.

Levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index
and columns of the result DataFrame.

Parameters:
data : DataFrame
values : column to aggregate, optional
index: column, Grouper, array, or list of the previous
columns: column, Grouper, array, or list of the previous

aggfunc: function, list of functions, dict, default numpy.mean


-> If list of functions passed, the resulting pivot table will have hierarchical columns whose top
level are the function names.
-> If dict is passed, the key is column to aggregate and value is function or list of functions

fill_value [scalar, default None]: Value to replace missing values with

margins [boolean, default False]: Add all rows / columns (e.g. for subtotal / grand totals)
dropna [boolean, default True]: Do not include columns whose entries are all NaN
margins_name [string, default ‘All’]: Name of the row / column that will contain the totals when margins is True.

Returns: DataFrame
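As a small sketch of the aggfunc parameter (using the same tested.csv DataFrame df that is loaded below), passing a list of functions produces hierarchical columns whose top level are the function names:

# mean and maximum fare per passenger class; 'mean' and 'max' become
# the top level of the resulting hierarchical columns
df.pivot_table('Fare', index='Pclass', aggfunc=['mean', 'max'])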

Program to demonstrate pandas.pivot_table()
To load and display first five records from tested.csv data set.
import pandas as pd
import numpy as np
df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\tested.csv")
df.head()

Output

Drop some columns to simplify our analysis


df.drop(['PassengerId','Ticket','Name'],inplace=True,axis=1)
df

Output

Grouping data using index in a Pivot Table


grouped = pd.pivot_table(data=df,index=['Sex'])
grouped

Output

Pivot Table with multi-index

Generate a tabular sheet to show the proportion of people who survived in each class, segregated by gender. Since Survived is coded 0/1, the default mean aggregation yields the survival proportion directly.

df.pivot_table("Survived", index=['Pclass', "Sex"])

Output

Calculate the average fare in each class for both males and females.
df.pivot_table("Fare", index=['Pclass', "Sex"], aggfunc=np.mean)

Output

Result
Thus the python program using pandas which demonstrates pivot table has been written and executed
successfully.

Ex No : 4
Reading data from text files, Excel and the web; descriptive analytics on the Iris data set.
Date :

Aim: To write Pandas programs to read data from text files, Excel files and web pages, and also to explore various commands for doing descriptive analytics on the Iris data set.

Pandas - Reading a Text file.

Algorithm:

Step1: Import the numpy and pandas libraries.


Step2: Create a text file and store it in the Python directory.
Step3: By using pandas read_fwf() read the data from the file.
Step4: Print the output.
Step5: By using pandas read_csv() read the data from the file.
Step6: Print the output.
Method 1

We can read a text file (txt) by using the pandas read_fwf() function; fwf stands for fixed-width formatted lines. We can use this to read fixed-length or variable-length text files.

# Syntax of read_fwf()

pandas.read_fwf(filepath_or_buffer, colspecs='infer', widths=None, infer_nrows=100, **kwds)

Parameters

filepath_or_buffer : str, path object, or file-like object

String, path object (implementing os.PathLike[str]), or file-like object implementing a text read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

colspecs : list of tuple (int, int) or ‘infer’, optional

A list of tuples giving the extents of the fixed-width fields of each line as half-open
intervals (i.e., [from, to[ ). String value ‘infer’ can be used to instruct the parser to try
detecting the column specifications from the first 100 rows of the data which are not
being skipped via skiprows (default=’infer’).

widths : list of int, optional

A list of field widths which can be used instead of ‘colspecs’ if the intervals are
contiguous.

infer_nrows : int, default 100

The number of rows to consider when letting the parser determine the colspecs.

**kwds : optional

Optional keyword arguments can be passed to TextFileReader.

Returns
DataFrame or TextFileReader

A comma-separated values (csv) file is returned as a two-dimensional data structure with labeled axes.
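A minimal sketch of the widths parameter; the file name, column widths and column names here are hypothetical:

import pandas as pd
# a fixed-width file whose first 10 characters hold a roll number and
# the next 20 characters a name (hypothetical layout)
df = pd.read_fwf("students.txt", widths=[10, 20], names=["RollNo", "Name"])
print(df)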

Method 2

We can read a text file (txt) by using the pandas read_csv() function.

Syntax: pd.read_csv(filepath_or_buffer, sep=’,’, header=’infer’, index_col=None, usecols=None, engine=None, skiprows=None, nrows=None)

Parameters:

 filepath_or_buffer: It is the location of the file which is to be retrieved using this function. It
accepts any string path or URL of the file.
 sep: It stands for separator, default is ‘, ‘ as in CSV(comma separated values).
 header: It accepts int, a list of int, row numbers to use as the column names, and the start of the
data. If no names are passed, i.e., header=None, then, it will display the first column as 0, the
second as 1, and so on.
 usecols: It is used to retrieve only selected columns from the CSV file.
 nrows: It means a number of rows to be displayed from the dataset.
 index_col: If None, there are no index numbers displayed along with records.
 skiprows: Skips passed rows in the new data frame.

Method 1

We can read a text file (txt) by using the pandas read_fwf() function.

Sample Text file
birds.txt

# Read Variable length text file


import pandas as pd
df1 = pd.read_fwf(r"C:\Users\New\AppData\Local\Programs\Python\Python39\birds.txt")
print(df1)

Output
STRAY BIRDS
0 BY
1 RABINDRANATH TAGORE
2 STRAY birds of summer come to my
3 window to sing and fly away.
4 And yellow leaves of autumn, which
5 have no songs, flutter and fall there
6 with a sigh.

Method 2

We can read a text file (txt) by using the pandas read_csv() function.

II CSE A.txt

Program to read a text data set using read_csv()

# Read Text Files with Pandas using read_csv()

# importing pandas
import pandas as pd

# read text file into pandas DataFrame


df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\II CSE A.txt")

# display DataFrame
print(df)

Output
R.No\tNAME
0 113321104001 ABINAYA P
1 113321104003 ALAGANENI SUMI
2 113321104004 ANGA SRI AJAY KUMAR
3 113321104005 ASHOK KUMAR A
4 113321104006 ASHWINI E
5 113321104009 BHANDATMAKURU.S.V.S.VISWANATHA SARMA
6 113321104010 BOTTA SRIDHAR
7 113321104011 DASETTI MAHESH
8 113321104013 DEYAA ASMI M
9 113321104015 DHANUSH D
10 113321104017 DHARSHINI R
11 113321104019 DONKALA KAMAKSHI HARSHITHA
12 113321104022 DUVVURU BHAVANA REDDY
13 113321104023 GIRISH KUMAR V V
14 113321104026 HARI KRISHNAN R
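Note that the file is tab-separated while read_csv() defaults to a comma separator, so everything landed in one column named 'R.No\tNAME'. A sketch of the fix, assuming the same file path:

# pass sep='\t' so the tab-separated file splits into proper columns
df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\II CSE A.txt", sep='\t')
print(df)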

Pandas - Reading an Excel file.

Algorithm

Step1: Import the numpy and pandas libraries.


Step2: Create an Excel file and store it in the Python directory.
Step3: By using pandas read_excel()read the data from the file.
Step4: Print the output.
Step5: Apply various descriptive commands

Iris Data Set

Multiple Sheets

Reading an Iris data set using read_excel()

import pandas as pds


file =(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Iris.xls")
newData = pds.read_excel(file)
print(newData)

Output
Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \
0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[150 rows x 6 columns]

Reading data from multiple sheets from an excel file.


import pandas as pds
file =(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Iris.xls")
newData = pds.read_excel(file, sheet_name = 'cricket')
print(newData)
newDat1 = pds.read_excel(file, sheet_name = 'Iris1')
print(newDat1)

Output
S.No Player Mat Runs HS
0 1 V Kohli (INDIA) 75 2633 94*
1 2 RG Sharma (INDIA) 104 2633 118
2 3 MJ Guptill (NZ) 83 2436 105
3 4 Shoaib Malik (ICC/PAK) 111 2263 75
4 5 BB McCullum (NZ) 71 2140 123
... ... ... ... ... ...
1758 1759 G Wijekoon (SL) 3 1 1*
1759 1760 JD Wildermuth (AUS) 2 1 1*
1760 1761 Yamin Ahmadzai (AFG) 2 1 1*

1761 1762 Zaki Ul Hassan (Belg) 3 1 1
1762 1763 Zeeshan Abbas (BAH) 1 1 1

[1763 rows x 5 columns]

Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm \


0 1 5.1 3.5 1.4 0.2
1 2 4.9 3.0 1.4 0.2
2 3 4.7 3.2 1.3 0.2
3 4 4.6 3.1 1.5 0.2
4 5 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 146 6.7 3.0 5.2 2.3
146 147 6.3 2.5 5.0 1.9
147 148 6.5 3.0 5.2 2.0
148 149 6.2 3.4 5.4 2.3
149 150 5.9 3.0 5.1 1.8

Species
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
.. ...
145 Iris-virginica
146 Iris-virginica
147 Iris-virginica
148 Iris-virginica
149 Iris-virginica

[150 rows x 6 columns]

Various Descriptive Commands

Making our own index: in this program we make S.No the index.

import pandas as pds


file =(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Iris.xls")
newData = pds.read_excel(file, sheet_name = 'cricket', index_col = 0)
print(newData)

Output
Player Mat Runs HS
S.No
1 V Kohli (INDIA) 75 2633 94*
2 RG Sharma (INDIA) 104 2633 118
3 MJ Guptill (NZ) 83 2436 105
4 Shoaib Malik (ICC/PAK) 111 2263 75
5 BB McCullum (NZ) 71 2140 123
... ... ... ... ...
1759 G Wijekoon (SL) 3 1 1*
1760 JD Wildermuth (AUS) 2 1 1*
1761 Yamin Ahmadzai (AFG) 2 1 1*
1762 Zaki Ul Hassan (Belg) 3 1 1
1763 Zeeshan Abbas (BAH) 1 1 1
[1763 rows x 4 columns]

Concatenating two sheets in an excel file
import pandas as pds
file =(r"C:\Users\New\AppData\Local\Programs\Python\Python39\Iris.xls")
newData1 = pds.read_excel(file, sheet_name = 'IA1', index_col = 0)
newData2 = pds.read_excel(file, sheet_name = 'IA2', index_col = 0)
newData3 = pds.concat([newData1, newData2])
print(newData3)

Output
REGISTER NUMBER NAME Mark
S.No
1 113321104001 ABINAYA P 83.333333
2 113321104003 ALAGANENI SUMI 78.333333
3 113321104004 ANGA SRI AJAY KUMAR 76.666667
4 113321104005 ASHOK KUMAR A 83.333333
5 113321104006 ASHWINI E 81.666667
... ... ... ...
56 113321104099 SUBIKSHYA KUMAR V 25
57 113321104101 SWARNA DARSINI V 76.666667
58 113321104105 VADDIBOINA PATTABHIRAMI REDDY AB
59 113321104106 VAISHALI G 58.333333
60 113321104120 YUVASHREE S 83.333333

[120 rows x 3 columns]

Displaying first five records

newData3.head()

Output

Displaying last five records

newData3.tail()

Output

The shape attribute can be used to view the number of rows and columns in the data frame.

newData3.shape

Output
(120, 3)

Now, suppose our data is mostly numerical. We can get the statistical information like mean, max,
min, etc. about the data frame using the describe() method.

newData3.describe()

Output

To View all the Column Names

print(newData3.columns.tolist())

Output
['REGISTER NUMBER', 'NAME ', 'Mark ']

To find the Mean value of a single Column


newData3['Mark '].mean()

Output
70.65277777777777

If any column contains numerical data, we can sort that column using the sort_values() method.
sorted_column = newData3.sort_values(['Mark '], ascending = False)
print(sorted_column)

Output
REGISTER NUMBER NAME Mark
S.No
12 113321104019 DONKALA KAMAKSHI HARSHITHA 91.666667
55 113321104098 SUBASHREE S 91.666667
51 113321104090 SHALINI M 91.666667
44 113321104082 SAHANA S 90.000000
1 113321104001 ABINAYA P 90.000000
... ... ... ...
26 113321104048 KONDUBOUINA PATHEK KUMAR 0.000000
48 113321104087 SENTHUR MURUGAN T 0.000000
58 113321104105 VADDIBOINA PATTABHIRAMI REDDY 0.000000
40 113321104076 PRAAJEET M R 0.000000
49 113321104088 SHAIK ALFIYAZ 0.000000

As in Excel, formulas can also be applied and calculated columns can be created as follows:

newData3['calculated_column']= newData2['Mark '] + newData1['Mark ']


newData3['calculated_column'].head()

Output
S.No
1 173.333333
2 165.000000
3 156.666667
4 170.000000
5 168.333333
Name: calculated_column, dtype: float64

Reading data from the web

Web Page ‘https://www.icc-cricket.com/match/100697#scorecard’

read_html() returns a list of DataFrames, one for each HTML table found on the page.

import pandas as pd
pd.read_html('https://www.icc-cricket.com/match/100697#scorecard')

Output
[ Batters South Africa Batting R B 4s 6s
\
0 Dean ElgarD Elgar CPT run out (Marnus Labusc... 26 68.0 2.0 0.0
1 Sarel ErweeSJ Erwee c Usman Khawaja b Scott B... 18 31.0 3.0 0.0
2 Theunis Booysen de BruynTBdB de Bruyn c Alex ... 12 31.0 2.0 0.0
3 Temba BavumaT Bavuma c Alex Carey b Mitchell ... 1 8.0 0.0 0.0
4 Khaya ZondoK Zondo c Marnus Labuschagne b Mit... 5 19.0 0.0 0.0
5 Kyle VerreynneK Verreynne WKT c Steve Smith ... 52 99.0 3.0 0.0
6 Marco JansenM Jansen c Alex Carey b Cameron G... 59 136.0 10.0 0.0
7 Keshav MaharajKA Maharaj c Pat Cummins b Nath... 2 9.0 0.0 0.0
8 Kagiso RabadaK Rabada b Cameron Green 4 5.0 1.0 0.0
9 Anrich NortjeA Nortje NOT OUT 1 1.0 0.0 0.0
10 Lungi NgidiL Ngidi b Cameron Green 2 6.0 0.0 0.0
11 Extras (nb 1, b 3, lb 3) 7 NaN NaN NaN
12 Total (all out, 68.4 overs) 189 NaN NaN NaN

SR
0 38.23
1 58.06
2 38.70
3 12.50
4 26.31
5 52.52
6 43.38
7 22.22
8 80.00
9 100.00
10 33.33
11 NaN
12 NaN ,
Fall of Wickets
0 1-29 (SJ Erwee, 10.4 ov) , 2-56 (TBdB de Bruyn...,
Bowlers Australia Bowling O M R W Econ Dots
0 Mitchell StarcMA Starc 13.0 2 39 2 3.00 60
1 Pat CumminsPJ Cummins 14.0 4 30 0 2.14 67
2 Scott BolandSM Boland 14.0 2 34 1 2.42 67
3 Nathan LyonNM Lyon 17.0 3 53 1 3.11 74
4 Cameron GreenC Green 10.4 3 27 5 2.53 51,
Batters Australia Batting R B 4s 6s \
0 David WarnerDA Warner Retired Hurt 200 254 16 2
1 Usman KhawajaUT Khawaja c Kyle Verreynne b Ka... 1 11 0 0
2 Marnus LabuschagneM Labuschagne run out (Dean... 14 35 1 0
3 Steve SmithSPD Smith c Theunis Booysen de Bru... 85 161 9 1
4 Travis HeadTM Head NOT OUT 48 48 7 1
5 Cameron GreenC Green Retired Hurt 6 20 1 0
6 Alex CareyAT Carey WKT NOT OUT 9 22 1 0
7 Pat CumminsPJ Cummins - - - -
8 Scott BolandSM Boland - - - -
9 Mitchell StarcMA Starc - - - -
10 Nathan LyonNM Lyon - - - -
11 Extras (nb 5, w 1, b 5, lb 12) 23 NaN NaN NaN
12 Total (3 wickets, 91 overs) 386 NaN NaN NaN

SR

0 78.74
1 9.09
2 40.00
3 52.79
4 100.00
5 30.00
6 40.90
7 -
8 -
9 -
10 -
11 NaN
12 NaN ,
Fall of Wickets
0 1-21 (UT Khawaja, 6.4 ov) , 2-75 (M Labuschagn...,
Bowlers South Africa Bowling O M R W Econ Dots
0 Kagiso RabadaK Rabada 18.0 1 94 1 5.22 65
1 Lungi NgidiL Ngidi 15.1 2 62 0 4.08 63
2 Marco JansenM Jansen 16.0 1 56 0 3.50 71
3 Anrich NortjeA Nortje 16.0 1 50 1 3.12 70
4 Keshav MaharajKA Maharaj 25.5 0 107 0 4.14 97]

Result: Thus the pandas programs for reading data from text files, Excel Files and web pages has been
written and executed successfully.

Ex No : 5(a)
Univariate Analysis on the UCI and Pima Indian diabetes data sets.
Date :

Aim: To write Python programs using numpy, pandas and seaborn to demonstrate univariate analysis on the UCI diabetes_data_upload data set and the Pima Indian diabetes data set.
Algorithm
Step 1: Import Numpy, pandas and seaborn
Step 2: Create Data Frames to load the data from the UCI diabetes dataset and the Pima Indian diabetes dataset
Step 3: Perform basic analysis on the data set.
Step 4: Perform univariate analysis on the UCI data set and the Pima Indian diabetes dataset.
Step 5: Plot a density plot to present the skewness and kurtosis.
Univariate analysis on UCI diabetes_data_upload data set.
UCI Diabetes data set:UCI_diabetes_upload.csv

UCI Data Set


import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\UCI_diabetes_data_upload.csv")
df.head()

Output (first five rows; the columns are Age, Gender, Polyuria, Polydipsia, sudden weight loss, weakness, Polyphagia, Genital thrush, visual blurring, Itching, Irritability, delayed healing, partial paresis, muscle stiffness, Alopecia, Obesity, class)

0  40  Male  No   Yes  No   Yes  No   No   No   Yes  No   Yes  No   Yes  Yes  Yes  Positive
1  58  Male  No   No   No   Yes  No   No   Yes  No   No   No   Yes  No   Yes  No   Positive
2  41  Male  Yes  No   No   Yes  Yes  No   No   Yes  No   Yes  No   Yes  Yes  No   Positive
3  45  Male  No   No   Yes  Yes  Yes  Yes  No   Yes  No   Yes  No   No   No   No   Positive
4  60  Male  Yes  Yes  Yes  Yes  Yes  No   Yes  Yes  Yes  Yes  Yes  Yes  Yes  Yes  Positive

df.shape
Output
(520, 17)

df.dtypes
Output
Age int64
Gender object
Polyuria object
Polydipsia object
sudden weight loss object
weakness object
Polyphagia object
Genital thrush object
visual blurring object
Itching object
Irritability object
delayed healing object
partial paresis object
muscle stiffness object
Alopecia object
Obesity object
class object
dtype: object

df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 520 entries, 0 to 519
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Age 520 non-null int64
1 Gender 520 non-null object
2 Polyuria 520 non-null object
3 Polydipsia 520 non-null object
4 sudden weight loss 520 non-null object
5 weakness 520 non-null object
6 Polyphagia 520 non-null object
7 Genital thrush 520 non-null object
8 visual blurring 520 non-null object
9 Itching 520 non-null object
10 Irritability 520 non-null object
11 delayed healing 520 non-null object
12 partial paresis 520 non-null object
13 muscle stiffness 520 non-null object
14 Alopecia 520 non-null object
15 Obesity 520 non-null object
16 class 520 non-null object
dtypes: int64(1), object(16)
memory usage: 69.2+ KB

df.describe().T

Output

Frequency of the column ‘class’


sr = pd.Series(df['class'])
pos = sr.value_counts()
print(pos)
Output
Positive 320
Negative 200
Name: class, dtype: int64
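
The counts can also be reported as relative frequencies. A minimal sketch, using the same df and
the normalize option of value_counts():

df['class'].value_counts(normalize=True)   # Positive ~0.615, Negative ~0.385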

Mean of the column ‘Age’


df['Age'].mean()
Output
48.02884615384615
Median of the column ‘Age’
df['Age'].median()
Output
47.5
Mode of the column ‘Age’
df['Age'].mode()
Output
0 35
Name: Age, dtype: int64
Variance gives the average squared deviation from the mean value.
Variance of the column ‘Age’
df['Age'].var()
Output
147.65812583370388
Standard Deviation of the column ‘Age’
df['Age'].std()
Output
12.151465995249458
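
Note that the standard deviation is simply the square root of the variance. A quick sanity check
(a minimal sketch, with numpy imported as np, as above):

import numpy as np
print(np.sqrt(df['Age'].var()))   # ~12.1515
print(df['Age'].std())            # ~12.1515, the same value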

Skewness
Skewness measures the symmetry of the data about the mean value. Symmetry means an equal
distribution of observations above and below the mean.

 skewness = 0: the data is symmetric about the mean.

 skewness = Negative: the data is not symmetric and the left-side tail is longer than the
right-side tail of the density plot.
 skewness = Positive: the data is not symmetric and the right-side tail is longer than the
left-side tail of the density plot.

df['Age'].skew()
Output
0.3293593578272701

Kurtosis

Kurtosis describes the peakedness (or flatness) of a density plot relative to the normal
distribution plot. Dr. Wheeler defines kurtosis as: “The kurtosis parameter is a measure of the
combined weight of the tails relative to the rest of the distribution.” In other words, it
measures the tail heaviness of a given distribution.

kurtosis = 0: the peakedness of the graph equals that of the normal distribution.

kurtosis = Negative: the peakedness of the graph is less than the normal distribution (flatter plot).

kurtosis = Positive: the peakedness of the graph is more than the normal distribution (more peaked plot).

We can find the kurtosis of a given variable with the kurt() function; the formula behind it is
sketched below.
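
For a variable with n observations, sample mean and standard deviation s, the excess kurtosis is,
in its simplest form:

kurtosis = (1/n) * sum((xi - mean)^4) / s^4 - 3

(pandas' kurt() applies a bias-corrected variant of this estimator, so results on small samples
differ slightly from the raw formula.)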

df['Age'].kurt()
Output
-0.19170941407070163

The graph representation of a single variable and interpretation of skewness and peakedness of
distribution from it.

import seaborn as sns


# distplot is deprecated in recent seaborn releases; histplot(..., kde=True)
# draws the equivalent histogram-plus-density view
sns.histplot(df['Age'], kde=True)

Output

In the above graph we can clearly see that

 skewness = 0.3293593578272701
o Positive: the data is not symmetric and the right-side tail is longer than the left-side
tail of the density plot.
 kurtosis = -0.19170941407070163
o Negative: the peakedness of the graph is less than the normal distribution (flatter plot).

Univariate analysis on Pima Indian diabetes data set

Pima Indian Diabetes data set: pima_diabetes.csv

import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\pima_diabetes.csv")
df.head()

Output

Frequency of the column ‘Outcome’


import numpy as np
import pandas as pd
df = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\pima_diabetes.csv")
sr = pd.Series(df['Outcome'])
pos = sr.value_counts()
print(pos)

Output
0 500
1 268
Name: Outcome, dtype: int64

df.shape
Output

(768, 9)

df.dtypes
Output
Pregnancies int64
Glucose int64
BloodPressure int64
SkinThickness int64
Insulin int64
BMI float64
DiabetesPedigreeFunction float64
Age int64
Outcome int64
dtype: object

df.info()
Output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Pregnancies 768 non-null int64
1 Glucose 768 non-null int64
2 BloodPressure 768 non-null int64
3 SkinThickness 768 non-null int64
4 Insulin 768 non-null int64
5 BMI 768 non-null float64
6 DiabetesPedigreeFunction 768 non-null float64
7 Age 768 non-null int64
8 Outcome 768 non-null int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

df.describe().T
Output

Mean of the column ‘Pregnancies’


df['Pregnancies'].mean()
Output
3.8450520833333335
Median of the column ‘Pregnancies’
df['Pregnancies'].median()
Output
3.0

Mode of the column ‘Pregnancies’
df['Pregnancies'].mode()
Output
0 1
Name: Pregnancies, dtype: int64
Variance gives the average squared deviation from the mean value.
Variance of the column ‘Pregnancies’
df['Pregnancies'].var()
Output
11.354056320621465
Standard Deviation of the column ‘Pregnancies’
df['Pregnancies'].std()
Output
3.3695780626988694

Skewness
Skewness measures the symmetry of the data about the mean value. Symmetry means an equal
distribution of observations above and below the mean.

 skewness = 0: the data is symmetric about the mean.

 skewness = Negative: the data is not symmetric and the left-side tail is longer than the
right-side tail of the density plot.
 skewness = Positive: the data is not symmetric and the right-side tail is longer than the
left-side tail of the density plot.

df['Pregnancies'].skew()
Output
0.9016739791518588

Kurtosis

Kurtosis describes the peakedness (or flatness) of a density plot relative to the normal
distribution plot. Dr. Wheeler defines kurtosis as: “The kurtosis parameter is a measure of the
combined weight of the tails relative to the rest of the distribution.” In other words, it
measures the tail heaviness of a given distribution.

kurtosis = 0: the peakedness of the graph equals that of the normal distribution.

kurtosis = Negative: the peakedness of the graph is less than the normal distribution (flatter plot).

kurtosis = Positive: the peakedness of the graph is more than the normal distribution (more peaked plot).

We can find the kurtosis of a given variable with the kurt() function (the formula was sketched in
the previous section).

df['Pregnancies'].kurt()
Output

0.15921977754746486
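
As a cross-check, SciPy reproduces the bias-corrected skewness and kurtosis that pandas reports.
A sketch, assuming SciPy is installed:

from scipy.stats import skew, kurtosis
print(skew(df['Pregnancies'], bias=False))      # ~0.9017, matches df['Pregnancies'].skew()
print(kurtosis(df['Pregnancies'], bias=False))  # ~0.1592, matches df['Pregnancies'].kurt()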

The graph representation of a single variable and interpretation of skewness and peakedness of
distribution from it.

import seaborn as sns

# histplot replaces the deprecated distplot
sns.histplot(df['Pregnancies'], kde=True)

Output

Result: Thus Python programs using numpy, pandas and seaborn to demonstrate univariate
analysis on the UCI diabetes_data_upload and Pima Indian diabetes data sets have been written
and executed successfully.

Ex No : 5(b)
Bivariate Analysis on the UCI and Pima Indian diabetes data sets.
Date :

Aim: To write Python programs using numpy, pandas, seaborn and sklearn to demonstrate bivariate
analysis (linear regression and logistic regression) on the UCI diabetes_data_upload data set and the
Pima Indian diabetes data set.

Procedure
Load the sklearn libraries.
Load the data: load the diabetes dataset.
Split the dataset.
Create the linear regression and logistic regression models.
Make predictions using the testing set.
Find the coefficients and the mean squared error.

Program
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# To calculate accuracy measures and confusion matrix
from sklearn import metrics

diabetes_X, diabetes_y = datasets.load_diabetes(return_X_y=True)


diabetes_X = diabetes_X[:, np.newaxis, 2]
# Split the data into training/testing sets
diabetes_X_train = diabetes_X[:-20]
diabetes_X_test = diabetes_X[-20:]

# Split the targets into training/testing sets


diabetes_y_train = diabetes_y[:-20]
diabetes_y_test = diabetes_y[-20:]
# Create linear regression object
regr = linear_model.LinearRegression()

# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)

# Make predictions using the testing set


diabetes_y_pred = regr.predict(diabetes_X_test)

# Create Logistic regression object


Logistic_model = LogisticRegression()
Logistic_model.fit(diabetes_X_train, diabetes_y_train)
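# Note: load_diabetes() returns a continuous target, so LogisticRegression here
# treats every distinct target value as its own class; a binary label (such as the
# Pima 'Outcome' column) is the more natural input for logistic regression.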
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print('Mean squared error: %.2f' % mean_squared_error(diabetes_y_test, diabetes_y_pred))
# The coefficient of determination: 1 is perfect prediction
print('Coefficient of determination: %.2f' % r2_score(diabetes_y_test, diabetes_y_pred))
y_predict = Logistic_model.predict(diabetes_X_train)
#print("Y predict/hat ", y_predict)
y_predict

Output
Coefficients:
[938.23786125]
Mean squared error: 2548.07
Coefficient of determination: 0.47
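
Since only one explanatory column was kept (column index 2, the body-mass-index feature of the
scikit-learn diabetes data), regr.coef_ holds a single slope: each unit increase in the normalized
BMI feature raises the predicted disease-progression score by roughly 938. A coefficient of
determination of 0.47 means this one feature explains about 47% of the variance in the test targets.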

Result: Thus Python programs for bivariate analysis (linear regression and logistic regression) using
pandas, numpy, seaborn and sklearn have been written and executed successfully.

Ex No : 5(c) Multiple Linear Regression Analysis on the UCI and Pima Indian diabetes
Date : data sets.

Aim: To write python programs using numpy, pandas, seaborn and sklearn to demonstrate multiple
regression analysis on UCI diabetes_data_upload data set and Pima Indian diabetes data set.

Multiple linear regression (MLR), also known simply as multiple regression, is a statistical technique
that uses several explanatory variables to predict the outcome of a response variable. The goal of multiple
linear regression is to model the linear relationship between the explanatory (independent) variables and
response (dependent) variables. In essence, multiple regression is the extension of ordinary least-squares
(OLS) regression because it involves more than one explanatory variable.
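
In equation form, with explanatory variables x1, x2, ..., xn, the fitted model is

y = b0 + b1*x1 + b2*x2 + ... + bn*xn + e

where b0 is the intercept, b1...bn are the coefficients estimated during fitting, and e is the
error term.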

Procedure:
Load sklearn Libraries.
Load Data
Load the diabetes dataset
Split Dataset
Fitting multiple linear regression to the training
Predict the Test set results.
Finding Coefficient and Mean Square Error

Program
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
diabetes_df = pd.read_csv('https://raw.githubusercontent.com/ammishra08/MachineLearning/master/Datasets/diabetes.csv')
diabetes_df.head()

diabetes_df.isnull().sum()

X = diabetes_df.drop(['Outcome'], axis=1)
X

Y = diabetes_df['Outcome']

from sklearn.preprocessing import MinMaxScaler


scaler = MinMaxScaler(feature_range=(0,1))
scaled_data=scaler.fit_transform(X)
scaled_data

array([[0.35294118, 0.74371859, 0.59016393, ..., 0.50074516, 0.23441503, 0.48333333],
       [0.05882353, 0.42713568, 0.54098361, ..., 0.39642325, 0.11656704, 0.16666667],
       [0.47058824, 0.91959799, 0.52459016, ..., 0.34724292, 0.25362938, 0.18333333],
       ...,
       [0.29411765, 0.6080402 , 0.59016393, ..., 0.390462  , 0.07130658, 0.15      ],
       [0.05882353, 0.63316583, 0.49180328, ..., 0.4485842 , 0.11571307, 0.43333333],
       [0.05882353, 0.46733668, 0.57377049, ..., 0.45305514, 0.10119556, 0.03333333]])
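
These scaled values come from the min-max formula, applied to each column independently:

x_scaled = (x - x_min) / (x_max - x_min)

which maps every feature into the requested [0, 1] range.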

from sklearn.model_selection import train_test_split


X_train, X_test, Y_train, Y_test = train_test_split(scaled_data, Y, test_size = 0.2, random_state = 0)

## Linear Regression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()

# fit - training
lin_reg.fit(X_train, Y_train)
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)

lin_reg.score(X_test, Y_test)

0.32230203252064193

from sklearn.metrics import mean_squared_error, r2_score


predictions = lin_reg.predict(X_test)
predictions

array([ 1.02722998, 0.21253413, 0.10381276, 0.60393473, 0.17002471,


-0.05457028, 0.675663 , 0.79325937, 0.42166086, 0.39622603,
0.54701077, 1.02912189, 0.35228136, 0.23350894, 0.15728248,
0.21826961, 0.81373747, -0.1082133 , 0.45245761, 0.30879222,
0.60311668, 0.42373503, 0.29284079, 0.03061766, 0.00431851,
0.39418932, -0.0082751 , 0.87734866, 0.15116385, 0.19811451,
0.48032035, 0.295183 , 0.10840742, 0.46328402, 0.13681791,
0.65855667, 0.47003279, 0.09510088, 0.37817058, 0.68072097,
0.332749 , 0.2586776 , 0.21951081, 0.76684137, 0.68821076,
-0.27441513, 0.0987557 , 0.27087554, 0.38939506, 0.35041871,
0.43935041, 0.24781632, 0.81815957, 0.49109104, 0.16853431,
-0.51171306, 0.03572902, 0.50088809, 0.33684171, 0.11553777,
0.65166002, 0.46991751, 0.14554516, 0.66679449, 0.63866328,
0.89808907, 0.66213959, 0.20150269, 0.40340029, 0.13516788,
0.16370524, 0.4534224 , 0.11121275, 0.99828137, 0.78204325,
0.36595794, 0.13108785, 0.59067726, 0.05657107, 0.23855416,
0.36823904, 0.4121898 , 0.26336844, -0.07299263, 0.25818704,
0.25693108, 0.35301164, 0.42926256, 0.82843918, 0.22926609,
0.22632564, 0.23801857, 0.2944592 , -0.03628869, 0.60420943,
0.24917705, 0.42906185, 0.49482025, 0.56608061, 0.31039227,
0.29288979, 0.11395321, 0.26668827, -0.0456346 , 0.58228827,
0.39490815, 0.17984688, 0.35329218, -0.02655586, 0.7066657 ,
0.13448725, 0.34978366, 0.58066325, 0.4222968 , 0.51017991,
0.58575295, 0.14365557, 0.62060328, 0.12038252, 0.67041524,
0.40225373, 0.39184056, 0.33543204, 0.47604064, 0.28144448,
-0.01233853, 0.35715677, 0.39592963, 0.49658979, 0.41584511,
0.409619 , -0.00508886, 0.02264073, 0.6765481 , 0.34839629,
0.42568242, 0.18874433, 0.4067827 , 0.62798988, 0.26791715,
0.09692232, 0.52634048, 0.0608477 , 0.0670275 , 0.3885636 ,
0.09995645, 0.06983883, 0.11027726, 0.17709799, 0.24256162,

0.08541831, 0.59232843, 0.1303017 , 0.23072725])

mean_squared_error(Y_test, predictions)

0.14370648838141728

r2_score(Y_test, predictions)

0.32230203252064193
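
Because linear regression predicts a continuous score for a 0/1 outcome, a common follow-up
(a sketch, not part of the original program) is to threshold the predictions at 0.5 and measure
classification accuracy:

from sklearn.metrics import accuracy_score
binary_predictions = (predictions >= 0.5).astype(int)
print(accuracy_score(Y_test, binary_predictions))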

Result: Thus Python programs for multiple regression analysis using pandas, numpy, seaborn and sklearn
have been written and executed successfully.

Ex No : 5(d)
Comparing the results of the above analyses for the two data sets.
Date :

Aim : To write a Python program that compares the results of the two different data sets.

Procedure
Step 1: Prepare the datasets to be compared.
Step 2: Create the two DataFrames. Based on the prepared data, you can create the two
DataFrames shown in the program below.
Step 3: Compare the values between the two Pandas DataFrames. In this step, you’ll need to
import the NumPy package.

Let’s say that you have the following data stored in a CSV file called car1.csv, while the
data below is stored in a second CSV file called car2.csv.
Program

import pandas as pd
import numpy as np

data_1 = pd.read_csv(r'd:\car1.csv')
df1 = pd.DataFrame(data_1)
data_2 = pd.read_csv(r'd:\car2.csv')
df2 = pd.DataFrame(data_2)
df1['amount1'] = df2['amount1']

df1['prices_match'] = np.where(df1['amount'] == df2['amount1'], 'True', 'False')
df1['price_diff'] = np.where(df1['amount'] == df2['amount1'], 0, df1['amount'] - df2['amount1'])
print(df1)

Output

Model City Year amount amount1 prices_match price_diff


0 Maruti Chennai 2022 600000 600000 True 0
1 Hyndai Chennai 2022 700000 700000 True 0
2 Ford Chennai 2022 800000 850000 False -50000
3 Kia Chennai 2022 900000 900000 True 0
4 XL6 Chennai 2022 1000000 1000000 True 0
5 Tata Chennai 2022 1100000 1150000 False -50000

6 Audi Chennai 2022 1200000 1200000 True 0
7 Ertiga Chennai 2022 1300000 1300000 True 0

Dataset 1: car1.csv
Dataset 2: car2.csv

Result: Thus the two results have been compared.

Ex No : 6(a) Apply and explore various plotting functions on UCI data sets.
Date : Normal Curve

Aim: To apply and explore normal curve on UCI data sets using python programs and libraries Numpy,
pandas, seaborn.

Procedure:
Step 1) Import the required packages such as numpy and matplotlib.
Step 2) Import the norm function from SciPy’s stats library.
Step 3) Initialize the mean and standard deviation.
Step 4) Calculate the z-transforms z1 & z2.
Step 5) Set the title of the graph.
Step 6) Set the labels and limits of the graph.
Step 7) Plot the graph and save it to a file.

Program:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
%matplotlib inline
# define constants
mu = 998.8
sigma = 73.10
x1 = 900
x2 = 1100
# calculate the z-transform
z1 = ( x1 - mu ) / sigma
z2 = ( x2 - mu ) / sigma
x = np.arange(z1, z2, 0.001) # range of x in spec
x_all = np.arange(-10, 10, 0.001) # entire range of x, both in and out of spec
# mean = 0, stddev = 1, since Z-transform was calculated
y = norm.pdf(x,0,1)
y2 = norm.pdf(x_all,0,1)
# build the plot
fig, ax = plt.subplots(figsize=(9,6))
plt.style.use('fivethirtyeight')
ax.plot(x_all,y2)

ax.fill_between(x,y,0, alpha=0.3, color='b')
ax.fill_between(x_all,y2,0, alpha=0.1)
ax.set_xlim([-4,4])
ax.set_xlabel('# of Standard Deviations Outside the Mean')
ax.set_yticklabels([])
ax.set_title('Normal Gaussian Curve')

plt.savefig('normal_curve.png', dpi=72, bbox_inches='tight')


plt.show()

Output:

Result:
The program to explore the normal curve using Python and its libraries has been written and executed
successfully.

Ex No : 6(b)
Density and Contour Plots
Date :

Aim: To write python programs to demonstrate density and contour plots using UCI data sets.
Procedure:
Step 1) Import the required packages such as numpy, matplotlib and seaborn.
Step 2) Load the iris data set.
Step 3) Set the title of the graph.
Step 4) Set the labels and limits of the graph.
Step 5) Plot the density plot and contour plot using the corresponding functions.
Program: To demonstrate a density plot
import pandas as pd
import matplotlib.pyplot as plt
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
print(iris.head())

sepallength sepalwidth petallength petalwidth species


0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

# DENSITY PLOT
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
iris.plot(ax = ax, kind='density', subplots=True, layout=(4,4), sharex=False)
plt.show()

Program: To demonstrate contour plots
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
data = sns.load_dataset("iris")
data.head()

setosa = data[data.species == 'setosa']
virginica = data[data.species == 'virginica']

plt.title("Flowers (Setosa & Virginica)")

# 'shade' and 'shade_lowest' were deprecated in seaborn 0.11+; 'fill=True' (with the
# default thresh, which leaves the lowest level unshaded) gives the same filled contours
sns.kdeplot(x=setosa.sepal_length, y=setosa.sepal_width, fill=True, cmap='Reds')
sns.kdeplot(x=virginica.sepal_length, y=virginica.sepal_width, fill=True, cmap='Blues');

Result: The Python programs to explore the density plot and contour plot on the iris data set have been
written and executed successfully.

Ex No : 6(c)
Correlation and Scatter plots
Date :

Aim: To write python programs to demonstrate Correlation and Scatter plots using UCI data sets.
Procedure:
Step 1) Import the required packages such as numpy, matplotlib and seaborn.
Step 2) Load the iris data set.
Step 3) Derive the correlation matrix.
Step 4) Set the title of the graph.
Step 5) Set the labels and limits of the graph.
Step 6) Plot the correlation plot.
Step 7) Using the iris data, plot the scatter plot.
Program: To demonstrate correlation plot
import matplotlib.pyplot as plt
import seaborn as sb
import pandas as pd
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
# numeric_only=True is needed on pandas >= 2.0, where corr() no longer drops the
# non-numeric 'species' column automatically
iris.corr(numeric_only=True)

fig = plt.figure(figsize = (15,15))


ax = fig.gca()
plt.title("Iris Correlation Plot")
sb.heatmap(iris.corr(numeric_only=True), annot=True, ax=ax)
plt.show()
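
Each cell of the heatmap holds the Pearson correlation coefficient between two columns. It ranges
from -1 (perfect negative linear relationship) through 0 (no linear relationship) to +1 (perfect
positive linear relationship), which is why the diagonal is always 1.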

Program: To demonstrate a scatter plot
import pandas as pd
import matplotlib.pyplot as plt
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
print(iris.head())

sepallength sepalwidth petallength petalwidth species


0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

# create color dictionary


colors = {'Iris-setosa':'r', 'Iris-versicolor':'g', 'Iris-virginica':'b'}
# create a figure and axis
fig, ax = plt.subplots()
# plot each data-point
for i in range(len(iris['sepallength'])):
    ax.scatter(iris['sepallength'][i], iris['sepalwidth'][i], color=colors[iris['species'][i]])
# set a title and labels

ax.set_title('Iris Dataset')
ax.set_xlabel('sepallength')
ax.set_ylabel('sepalwidth')

Result
Thus the Python programs to demonstrate the correlation plot and scatter plot have been written and
executed successfully.

Ex No : 6(d)
Histogram
Date :

Aim: To write python programs to demonstrate histogram using UCI data sets.
Procedure:
Step 1) Import the required packages such as numpy, matplotlib and seaborn.
Step 2) Load the iris data set.
Step 3) Set the labels and limits of the graph.
Step 4) Plot the histogram.
Program: To demonstrate a histogram
import pandas as pd
import matplotlib.pyplot as plt
iris = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\iris.csv")
print(iris.head())

sepallength sepalwidth petallength petalwidth species


0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

fig = plt.figure(figsize = (15,20))


ax = fig.gca()
iris.hist(ax = ax)
plt.show()

Result
Thus the Python program to demonstrate a histogram has been written and executed successfully.

Ex No : 6(e)
Three dimensional plotting
Date :

Aim: To apply and explore three dimensional plotting functions on UCI data sets.
Procedure

Download the pima_diabetes.csv file.

Import all required libraries

Load the diabetes data

Set the labels

Using fig.add_subplot(111, projection='3d') plot the three dimensional plot


Program:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # To visualize
from mpl_toolkits.mplot3d import Axes3D
data = pd.read_csv(r"C:\Users\New\AppData\Local\Programs\Python\Python39\pima_diabetes.csv")
data.head()

fig = plt.figure(figsize=(4, 4))
ax = fig.add_subplot(111, projection='3d')
x = data['Age'].values
y = data['Glucose'].values
z = data['Outcome'].values
ax.set_xlabel("Age (Year)")
ax.set_ylabel("Glucose (Reading)")

ax.set_zlabel("Outcome (0 or 1)")
ax.scatter(x, y, z, c='r', marker='o')
plt.show()

Result: Thus a python program to demonstrate three dimensional plotting has been written and executed
successfully.

Ex No : 7
Visualizing Geographic Data with Basemap.
Date :

Aim: To explore Geographic Data with Basemap using python.

Procedure:
Installing Basemap package
Import Basemap from mpl_toolkits.basemap, and import matplotlib
Adding vector layers to a map
Projection, bounding box, & resolution
Plotting a specific region
Background relief maps
Plotting geographic data using Basemap

Programs:

The drawcoastlines function has the following main arguments:


 linewidth: 1.0, 2.0, 3.0…
 linestyle: solid, dashed…
 color: black, red…

Let’s apply some changes to the coastlines

from mpl_toolkits.basemap import Basemap


import matplotlib.pyplot as plt
fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='dashed', color='red')
plt.title("Coastlines", fontsize=20)
plt.show()

The drawcountries() function accepts arguments similar to drawcoastlines(), as shown below:

fig = plt.figure(figsize = (12,12))


m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
plt.title("Country boundaries", fontsize=20)
plt.show()

Draw major rivers

 Use the drawrivers() function to add major rivers on the map


 The drawrivers() function can take linewidth, linestyle, color arguments

fig = plt.figure(figsize = (12,12))


m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.drawrivers(linewidth=0.5, linestyle='solid', color='#0000ff')
plt.title("Major rivers", fontsize=20)
plt.show()

The fillcontinents() function can take the following arguments:

 color: fills continents (default gray)


 lake_color: fills inland lakes
 alpha: sets transparency for continents

fig = plt.figure(figsize = (12,12))


m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua', alpha=0.9)
plt.title("Color filled continents", fontsize=20)
plt.show()

Draw map boundary

 The drawmapboundary() function is used to draw the earth boundary on the map
 The drawmapboundary() function can take the following arguments:
 linewidth: sets line width for boundary line (default: 1)
 color: sets the color of the boundary line (default: black)
 fill_color: fills the map background region

fig = plt.figure(figsize = (12,12))


m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmapboundary(color='b', linewidth=2.0, fill_color='aqua')
plt.title("Filled map boundary", fontsize=20)
plt.show()

Draw and label longitude lines

 The drawmeridians() function is used to draw & label meridians/longitude lines


 The drawmeridians()function can take the following arguments:
 List of longitude values created with range() for integer values & np.arange() for float
values
 color: sets the color of the line longitude lines (default: black)
 textcolor: sets the color of labels (default: black)
 linewidth: sets the line width for the longitude lines
 dashes: sets the dash pattern for the longitude lines (default: [1,1])
 labels: sets the label’s position with four values [0,0,0,0] representing left, right, top, & bottom.
Change these values to 1 where you want the labels to appear

fig = plt.figure(figsize = (12,12))
m = Basemap()
m.drawcoastlines(linewidth=1.0, linestyle='solid', color='black')
m.drawcountries(linewidth=1.0, linestyle='solid', color='k')
m.fillcontinents(color='coral',lake_color='aqua')
m.drawmeridians(range(0, 360, 20), color='k', linewidth=1.0, dashes=[4, 4], labels=[0, 0, 0, 1])
m.drawparallels(range(-90, 100, 10), color='k', linewidth=1.0, dashes=[4, 4], labels=[1, 0, 0, 0])
plt.ylabel("Latitude", fontsize=15, labelpad=35)
plt.xlabel("Longitude", fontsize=15, labelpad=20)
plt.show()

Projection, bounding box, & resolution

The Basemap() function is used to set projection, bounding box, & resolution of a map

Map projection:

 Inside the Basemap() function, the projection= argument can take several pre-defined
projections listed in the table below or visit this site to get more information.
 To specify the desired projection, use the general syntax shown below:

m = Basemap(projection='aeqd')
m = Basemap(projection='cyl')

Some projections require setting the bounding box, map center, & map size of the map using the
following arguments:

a) Bounding box/map corners:

 llcrnrlat: lower-left corner geographical latitude


 urcrnrlat: upper-right corner geographical latitude
 llcrnrlon: lower-left corner geographical longitude
 urcrnrlon: upper-right corner geographical longitude

Example:

m = Basemap(projection='cyl', llcrnrlat=-80, urcrnrlat=80, llcrnrlon=-180, urcrnrlon=180)

b) Map center:

 lon_0: central longitude


 lat_0: central latitude

Example:

m = Basemap(projection='ortho', lon_0 = 25, lat_0 = 10)

c) Map resolution: The map resolution argument determines the quality of vector layers such as
coastlines, lakes, & rivers etc. The available options are:

 c: crude
 l: low
 i: intermediate
 h: high
 f: full

Let’s see some examples on how the map projection, bounding box, map center, & map
resolution arguments used to create and modify maps:

Create a global map with a Mercator Projection

fig = plt.figure(figsize = (10,8))


m = Basemap(projection='merc',llcrnrlat=-80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )
m.drawmapboundary(fill_color='lightblue')
plt.title("Mercator Projection", fontsize=20)

Create a global map with a Cylindrical Equidistant Projection.


fig = plt.figure(figsize = (10,8))
m = Basemap(projection='cyl',llcrnrlat=-80,urcrnrlat=80,llcrnrlon=-180,urcrnrlon=180)
m.drawcoastlines()
m.fillcontinents(color='tan',lake_color='lightblue')
m.drawcountries(linewidth=1, linestyle='solid', color='k' )

m.drawmapboundary(fill_color='lightblue')
plt.title(" Cylindrical Equidistant Projection", fontsize=20)

Create a global map with Orthographic Projection


plt.figure(figsize=(16, 12))

m = Basemap(projection='ortho', lat_0=20, lon_0=78)


m.drawcoastlines()
m.drawcountries()
m.etopo(scale=0.5,alpha=0.5);
plt.show()

plt.figure(figsize=(8, 8))
m = Basemap(projection='ortho', lat_0=20, lon_0=78)
m.drawcoastlines()
m.drawcountries()
m.bluemarble(scale=0.5);

plt.show()

import numpy as np
from itertools import chain

def draw_map(m, scale=0.2):
    # draw a shaded-relief image
    m.shadedrelief(scale=scale)
    # lats and longs are returned as a dictionary
    lats = m.drawparallels(np.linspace(-90, 90, 13))
    lons = m.drawmeridians(np.linspace(-180, 180, 13))

#plt.figure(figsize=(16, 12),edgecolor='w')
m = Basemap(projection='ortho', lat_0=20, lon_0=78)
m.drawcoastlines()
m.drawcountries()

m.bluemarble(scale=0.5);
draw_map(m)
plt.show()

Plotting a specific region

By passing bounding box information

fig = plt.figure(figsize=(8, 8))


m = Basemap(projection='lcc', resolution=None,
width=8E6, height=8E6,
lat_0=45, lon_0=-100,)
m.etopo(scale=0.5, alpha=0.5)

# Map (long, lat) to (x, y) for plotting


x, y = m(-122.3, 47.6)
plt.plot(x, y, 'ok', markersize=5)
plt.text(x, y, ' Seattle', fontsize=12);

Mapping Geographical Data with Basemap using a dataset:
datasets_557_1096_cities_r2.csv

# importing packages
import pandas as pd
import numpy as np
from numpy import array
import matplotlib as mpl
# for plots
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.basemap import Basemap
%matplotlib inline

cities = pd.read_csv(r"F:/datasets_557_1096_cities_r2.csv")

fig = plt.figure(figsize=(20,20))
states = cities.groupby('state_name')['name_of_city'].count().sort_values(ascending=True)
states.plot(kind="barh", fontsize = 20)
plt.grid(visible=True, which='both', color='Black', linestyle='-')  # 'b' was renamed 'visible' in matplotlib 3.5+
plt.xlabel('No of cities taken for analysis', fontsize = 20)
plt.show ()

cities['latitude'] = cities['location'].apply(lambda x: x.split(',')[0])
cities['longitude'] = cities['location'].apply(lambda x: x.split(',')[1])
print("The Top 10 Cities sorted according to the Total Population (Descending Order)")
top_pop_cities = cities.sort_values(by='population_total', ascending=False)
top10_pop_cities = top_pop_cities.head(10)  # head(10) keeps the top 10, not the default 5

The Top 10 Cities sorted according to the Total Population (Descending Order)
plt.subplots(figsize=(20, 15))
map = Basemap(width=1200000,height=900000,projection='lcc',resolution='l',
llcrnrlon=67,llcrnrlat=5,urcrnrlon=99,urcrnrlat=37,lat_0=28,lon_0=77)

map.drawmapboundary ()
map.drawcountries ()
map.drawcoastlines ()

lg=array(top10_pop_cities['longitude'])
lt=array(top10_pop_cities['latitude'])
pt=array(top10_pop_cities['population_total'])
nc=array(top10_pop_cities['name_of_city'])

x, y = map(lg, lt)
population_sizes = top10_pop_cities["population_total"].apply(lambda x: int(x / 5000))
plt.scatter(x, y, s=population_sizes, marker="o", c=population_sizes, cmap=cm.Dark2, alpha=0.7)

for ncs, xpt, ypt in zip(nc, x, y):
    plt.text(xpt + 60000, ypt + 30000, ncs, fontsize=10, fontweight='bold')

plt.title('Top 10 Populated Cities in India',fontsize=20)

Text(0.5, 1.0, 'Top 10 Populated Cities in India')

Result:
Thus the python program to demonstrate Visualizing Geographic Data with Basemap has been written
and executed successfully.
