Dva Lab Manual

DATA VISUALIZATION AND ANALYTICS
LAB MANUAL
NAME: HIYA CHOPRA

SAP: 500083441
ROLL NUMBER: R214220519
BTECH-CSE-BAO
TOPIC 1:
INTRODUCTION TO PYTHON
Python is an interpreted, high-level programming language used for building
large software. Its design philosophy emphasizes code readability and
clarity.
Points being covered:
1. Creation of tuples
2. Lists
3. Dictionary
4. Series
TUPLES:
A tuple is an ordered collection of Python objects separated by commas. Tuples
are immutable: their elements cannot be reassigned after creation.
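As a minimal sketch (the names and values below are illustrative assumptions, not the original lab code), a tuple can be created and inspected as follows:

```python
# Illustrative values only; not taken from the original lab code.
point = (3, 4)
single = (5,)            # a trailing comma makes a one-element tuple
mixed = 1, "two", 3.0    # the parentheses are optional

print(point[0])          # access by position
print(len(mixed))        # number of elements

# Assigning to an element raises TypeError because tuples are immutable.
try:
    point[0] = 10
except TypeError as err:
    print("Tuples are immutable:", err)
```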
CODE:
OUTPUT:
LISTS:
Lists are dynamic arrays that may contain integers, strings and other Python
objects, separated by commas. Unlike tuples, lists are mutable, so they can
be altered after creation.
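A short sketch of list mutability, using made-up values (an assumption for illustration, not the original lab code):

```python
# Illustrative values only.
items = [1, "two", 3.0]

items.append(4)       # add an element at the end
items[0] = "one"      # replace an element in place
items.remove(3.0)     # delete the first element equal to 3.0

first_two = items[:2] # slicing returns a new list
print(items)
print(first_two)
```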
CODE:
OUTPUT:
DICTIONARY:
A dictionary is a collection of data values stored in a key: value format, and
it can hold more than one data type. Since Python 3.7, dictionaries preserve
the order in which keys were inserted.
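The key: value format can be sketched as follows; the record and its field names are hypothetical:

```python
# Hypothetical record; keys are strings, values are of mixed types.
record = {"name": "Asha", "sem": 5, "cgpa": 8.2}

record["course"] = "CSE"   # add a new key
record["sem"] = 6          # update an existing key

print(record["name"])                 # look up by key
print(record.get("city", "unknown"))  # get() returns a default for a missing key

for key, value in record.items():
    print(key, "->", value)
```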
CODE:
OUTPUT:
SERIES:
A Series is a 1D labelled array from the pandas library, capable of storing
values of any data type. Each value is associated with an index label.
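A minimal sketch of a pandas Series, assuming illustrative marks and index labels (not the original lab code):

```python
import pandas as pd

# Illustrative data with explicit index labels.
marks = pd.Series([72, 85, 90], index=["a", "b", "c"])
print(marks)
print(marks["b"])      # access by index label
print(marks.iloc[0])   # access by position

# A Series holding mixed types falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```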
CODE:
OUTPUT:
TOPIC 2:
INTRODUCTION TO PANDAS PACKAGES
Pandas is a library/package for the Python programming language that provides
fast, flexible and expressive data structures. It contains functions used in
data manipulation and data analysis.
1. Create data frames using lists
2. Sort rows
3. Select all rows for a specific column
4. Select a few rows for multiple columns
5. Select all rows of the 4th column using iloc[row number, column number]
6. Calculate the row-wise mean
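The six tasks above can be sketched as follows. The student names, column names and marks are made up for illustration and are not the data used in the original lab.

```python
import pandas as pd

# Illustrative data: each inner list is one row.
rows = [
    ["Asha", 78, 82, 91],
    ["Ravi", 65, 70, 58],
    ["Meera", 88, 94, 79],
]
df = pd.DataFrame(rows, columns=["name", "test1", "test2", "test3"])

# Sort rows by one column, highest first.
sorted_df = df.sort_values(by="test1", ascending=False)

# All rows for a specific column.
one_column = df["name"]

# A few rows for multiple columns (loc slices include the end label).
few_rows = df.loc[0:1, ["name", "test2"]]

# All rows of the 4th column; iloc is position-based, so the index is 3.
fourth_column = df.iloc[:, 3]

# Row-wise mean over the numeric columns.
row_mean = df[["test1", "test2", "test3"]].mean(axis=1)

print(sorted_df, one_column, few_rows, fourth_column, row_mean, sep="\n\n")
```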
CODE: CREATING AND SORTING OF DATA
OUTPUT:
CODE: SELECTING ROWS
OUTPUT:
CODE:
OUTPUT:
TOPIC 3:
INTRODUCTION TO NUMPY
NumPy stands for Numerical Python. It is a library that provides
multidimensional array objects and is used when we are dealing with arrays.
1. Create 1D, 2D and 3D arrays of random numbers using the numpy library
2. Perform indexing and slicing of arrays
1 DIMENSIONAL ARRAY:
A 1D array is a structured collection of components that can be accessed
individually by specifying the position of a component with a single index
value.
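A minimal sketch of creating, indexing and slicing a 1D array of random numbers. The seed, value range and size are arbitrary assumptions.

```python
import numpy as np

# Seeded generator so the run is repeatable; the seed value is arbitrary.
rng = np.random.default_rng(seed=0)
arr1d = rng.integers(low=0, high=100, size=6)

print(arr1d)
print(arr1d[0])     # first element
print(arr1d[-1])    # last element
print(arr1d[1:4])   # elements at positions 1, 2 and 3
print(arr1d[::2])   # every second element
```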
CODE:
OUTPUT:
2 DIMENSIONAL ARRAY:
It’s an array organised in the form of a matrix and can be represented as a
collection of rows and columns.
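The row and column structure can be sketched with a random 3 x 4 array (the shape and seed are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)
arr2d = rng.random((3, 4))      # 3 rows, 4 columns, values in [0, 1)

print(arr2d)
print(arr2d[0, 1])      # element in row 0, column 1
print(arr2d[1])         # the whole of row 1
print(arr2d[:, 2])      # the whole of column 2
print(arr2d[:2, 1:3])   # rows 0-1 and columns 1-2
```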
CODE:
OUTPUT:
3 DIMENSIONAL ARRAY:
It’s a multidimensional array, which can be represented as a collection of 2D
arrays.
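A sketch of a 3D array viewed as a stack of 2D arrays, assuming an arbitrary shape of 2 x 3 x 4:

```python
import numpy as np

rng = np.random.default_rng(seed=0)
arr3d = rng.random((2, 3, 4))   # 2 blocks, each a 3 x 4 2D array

print(arr3d.ndim)        # number of dimensions
print(arr3d[0])          # the first 2D array
print(arr3d[1, 2, 3])    # a single element: block 1, row 2, column 3
print(arr3d[:, 0, :])    # row 0 of every block
```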
CODE:
OUTPUT:
TOPIC 4:
DATA MERGING
Data merging is a process of merging two data sets into one, and aligning the
rows from each based on common attributes and columns.
Points to be covered:
1. Merging 2 data frames
2. Default Inner Join
3. Left Outer Join
4. Right Outer Join
5. Outer Join
6. Concatenation
7. Hierarchical Index
8. To remove Index use ignore_index=True
CODE:
import pandas as pd
import numpy as np

# Example use case: student file1(name, age, address, mob no) and
# file2(name, sap id, sem, course, university, mob no)
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
print("df1:\n", df1, "\n\ndf2:\n", df2)

# Merging 2 data frames: with no arguments, merge joins on the shared column
df_merge = pd.merge(df1, df2)
print(df_merge)

# Each join below starts from the original df1 and df2 and stores its result
# under its own name, so one join does not feed into the next.
# Default Inner Join - use intersection of keys
inner_join = pd.merge(df1, df2, on='key')
print(inner_join)
# Left Outer Join - use keys from left object
left_join = pd.merge(df1, df2, how='left', on='key')
print(left_join)
# Right Outer Join - use keys from right object
right_join = pd.merge(df1, df2, how='right', on='key')
print(right_join)
# Outer Join - use union of both object keys
outer_join = pd.merge(df1, df2, how='outer', on='key')
print(outer_join)

# Merge a column of the left frame against the index of the right frame
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
print("left1:\n", left1, "\n\nright1:\n", right1, "\n\nAfter Merge:\n")
index_merge = pd.merge(left1, right1, left_on='key', right_index=True,
                       how='outer')
print(index_merge)

# Concatenation
# Example use case: file1(sap_id, name, sem, cgpa) file2(sap_id, name, sem, cgpa)
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
# Calling concat with these objects in a list glues together the values and
# indexes
print(pd.concat([s1, s2, s3]))
# axis=1 puts each Series in its own column; missing entries become NaN
print(pd.concat([s1, s2, s3], axis=1))

# Same operation on DataFrames
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
print("df1:\n", df1, "\n\ndf2:\n", df2, "\n\nAfter Concat:\n")
# keys adds an outer level to the column index
print(pd.concat([df1, df2], axis=1, keys=['level1', 'level2']))
# Hierarchical Index: names labels the two column levels
print(pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                names=['upper', 'lower']))
# To remove Index use ignore_index=True
print(pd.concat([df1, df2], ignore_index=True))
OUTPUT:
TOPIC 5:
DATA CLEANING
Data cleaning or cleansing is the process of detecting, removing or correcting
corrupt or inaccurate records from a record set, table or dataset.
1. Fill the NaNs with preceding values
2. Remove duplicates
3. Handle missing values
4. Remove extreme values (outliers)
5. Select all rows having a certain value using a condition
6. Remove rows and columns containing nulls
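The steps above can be sketched as follows. The data frame is made up for illustration, and the 1.5 x IQR rule used for outliers is one common convention assumed here, not necessarily the method used in the original lab.

```python
import numpy as np
import pandas as pd

# Illustrative data with one missing value and one extreme value.
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Pune", "Pune", "Goa", "Goa"],
    "temp": [31.0, np.nan, 29.0, 29.0, 250.0, 30.0],
})

# 1. Fill NaNs with the preceding value (forward fill).
filled = df.ffill()

# 2. Remove duplicate rows.
deduped = filled.drop_duplicates()

# 3. Handle missing values another way: replace them with the column mean.
mean_filled = df.fillna({"temp": df["temp"].mean()})

# 4. Remove outliers: keep values within 1.5 * IQR of the quartiles.
q1, q3 = filled["temp"].quantile([0.25, 0.75])
iqr = q3 - q1
no_outliers = filled[filled["temp"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 5. Select all rows having a certain value using a condition.
pune_rows = df[df["city"] == "Pune"]

# 6. Remove rows, then columns, that contain nulls.
no_null_rows = df.dropna()
no_null_cols = df.dropna(axis=1)

print(filled, deduped, no_outliers, pune_rows, sep="\n\n")
```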
CODE: FILLING NAN WITH PRECEDING VALUES
OUTPUT:
CODE: REMOVING DUPLICATES
OUTPUT:
CODE: REMOVING ROWS/COLUMNS CONTAINING VALUES BASED ON CONDITION
OUTPUT:
TOPIC 6:
DATA TRANSFORMATION
Data transformation is the process of converting raw data into a format or
structure that would be more suitable for an algorithm.
1. Show the use of applymap
2. get_dummies function
3. Convert categorical variable into "dummy" matrix used in statistical and
machine learning
4. Passing own bin names
5. Show bin counts
6. Data smoothing by binning: continuous data is discretized by sorting it
into bins for analysis
7. Vectorised operation
8. lambda function -> anonymous function with any number of arguments
but a single expression
CODE:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=['1', '2', '3'])
print("\ndf:\n", df, "\n")
# apply runs a function along an axis: axis=0 gives one result per column,
# axis=1 gives one result per row
print("Column-wise mean equals to\n", df.apply(np.mean, axis=0))
print("Row-wise mean equals to\n", df.apply(np.mean, axis=1))

# lambda function -> anonymous function with any number of arguments but a
# single expression
add_ten = lambda a: a + 10
print(type(add_ten))
print(add_ten(10))
multiply = lambda a, b: a * b
print(multiply(10, 12))
# Range (max - min) of each column
print(df.apply(lambda col: col.max() - col.min()))

# applymap applies a function to every element of the data frame
# (pandas 2.1 and later deprecate applymap in favour of DataFrame.map)
print(df.applymap(lambda x: x * 100))
# Vectorized operation: the same result without a per-element function call
print(df * 100)

# Data Smoothing
# Binning -> continuous data is discretized by sorting it into bins for analysis
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# Binning into discrete age buckets:
# creates bins of (18,25], (25,35], (35,60], (60,100]
bins = [18, 25, 35, 60, 100]
age_groups = pd.cut(ages, bins)
print(age_groups)
# Bin counts
print(age_groups.value_counts())
# Passing own bin names
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
print(pd.cut(ages, bins, labels=group_names))

# Convert categorical variable into "dummy" matrix
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
print("\ndf:\n", df, "\n\nDummy matrix:\n")
print(pd.get_dummies(df['key']))
# To add a prefix to the dummy columns
dummies = pd.get_dummies(df['key'], prefix='Dummy')
print("\ndummies:\n", dummies, "\n")
df_with_dummy = df[['data1']].join(dummies)
print("\ndf_with_dummy:\n", df_with_dummy, "\n")
OUTPUT:
TOPIC 7:
INTRODUCTION TO MATPLOTLIB
matplotlib.pyplot is the plotting interface of Matplotlib, a multi-platform
visualization library in Python used for creating 2D plots of arrays. It
supports several plot types, such as line, bar, scatter and histogram.
Using matplotlib, create different graphs with proper labelling:
1. Histogram
2. Stack plot
3. Line plot
4. Pie
5. Scatter
HISTOGRAM:
A histogram is an approximate representation of the distribution of numerical
data.
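A minimal histogram sketch. The marks are synthetic random values generated for illustration, and the axis labels are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data for illustration: 200 marks centred around 70.
rng = np.random.default_rng(seed=0)
marks = rng.normal(loc=70, scale=10, size=200)

fig, ax = plt.subplots()
# hist returns the count per bin, the bin edges and the drawn bars.
counts, bin_edges, patches = ax.hist(marks, bins=10, edgecolor="black")
ax.set_xlabel("Marks")
ax.set_ylabel("Number of students")
ax.set_title("Distribution of marks")
plt.show()
```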
CODE:
OUTPUT:
STACK PLOT:
A stack plot draws a stacked area plot in which the data series are stacked
vertically on top of each other rather than overlapping one another.
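A sketch of a stacked area plot, assuming made-up hours spent on three activities over five days:

```python
import matplotlib.pyplot as plt

# Illustrative data: hours per activity on each day.
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 11, 7]
eating = [2, 3, 4, 3, 2]
working = [7, 8, 7, 2, 2]

fig, ax = plt.subplots()
# Each series is drawn on top of the previous one.
areas = ax.stackplot(days, sleeping, eating, working,
                     labels=["Sleeping", "Eating", "Working"])
ax.set_xlabel("Day")
ax.set_ylabel("Hours")
ax.set_title("Hours spent per activity")
ax.legend(loc="upper left")
plt.show()
```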
CODE:
OUTPUT:
LINE PLOT:
A line plot is a graph that displays data as points connected by straight line
segments.
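A minimal line plot sketch with illustrative monthly sales figures (assumed values):

```python
import matplotlib.pyplot as plt

# Illustrative data.
months = [1, 2, 3, 4, 5, 6]
sales = [120, 135, 128, 150, 162, 158]

fig, ax = plt.subplots()
# plot returns a list with one Line2D object per plotted series.
lines = ax.plot(months, sales, marker="o")
ax.set_xlabel("Month")
ax.set_ylabel("Sales")
ax.set_title("Monthly sales")
plt.show()
```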
CODE:
OUTPUT:
PIE PLOT:
A pie plot is a circular statistical graph divided into slices to illustrate
numerical proportions.
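A pie plot sketch, assuming four made-up categories whose shares sum to 100:

```python
import matplotlib.pyplot as plt

# Illustrative categories and shares.
labels = ["A", "B", "C", "D"]
shares = [40, 30, 20, 10]

fig, ax = plt.subplots()
# autopct prints each slice's percentage inside the slice.
ax.pie(shares, labels=labels, autopct="%1.1f%%")
ax.set_title("Share per category")
plt.show()
```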
CODE:
OUTPUT:
SCATTER PLOT:
A scatter plot describes the relationship between two variables, with each
observation represented by a dot.
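A scatter plot sketch using hypothetical study hours and marks:

```python
import matplotlib.pyplot as plt

# Illustrative paired observations.
hours = [1, 2, 3, 4, 5, 6]
marks = [35, 45, 50, 62, 70, 85]

fig, ax = plt.subplots()
points = ax.scatter(hours, marks)   # one dot per (hours, marks) pair
ax.set_xlabel("Hours studied")
ax.set_ylabel("Marks")
ax.set_title("Marks against hours studied")
plt.show()
```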
CODE:
OUTPUT:
TOPIC 8:
CUSTOMIZING PLOT WITH MATPLOTLIB
1. Use of subplots
2. Grid histograms
3. Graph overlays using box and whisker plots
4. Proper labelling of graphs: x and y axis labels, title and legend
USE OF SUBPLOTS
The subplots function in matplotlib's pyplot module creates a figure and a set
of subplots. It is used to place multiple axes in one figure.
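A minimal sketch of two subplots side by side. The sine and cosine curves are chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure with a 1 x 2 grid of axes; axes is a NumPy array of Axes objects.
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

axes[0].plot(x, np.sin(x))
axes[0].set_title("sin(x)")

axes[1].plot(x, np.cos(x))
axes[1].set_title("cos(x)")

fig.tight_layout()   # keep titles and labels from overlapping
plt.show()
```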
CODE:
OUTPUT:
GRID HISTOGRAM
CODE:
OUTPUT:
BOX PLOT
A box plot is a way of displaying the distribution of data based on a
five-number summary: minimum, first quartile, median, third quartile and
maximum.
CODE:
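import matplotlib.pyplot as plt
import numpy as np

# total_revenue is not defined in this excerpt; the two random samples below
# are placeholder data (one per country) so that the example runs.
rng = np.random.default_rng(0)
total_revenue = [rng.normal(5e6, 1e6, 100), rng.normal(8e6, 2e6, 100)]

fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111)
# notch must be a boolean, not the string 'True'; vert=False draws the boxes
# horizontally
ax.boxplot(total_revenue, patch_artist=True, notch=True, vert=False)
ax.set_yticklabels(['mexico', 'australia'])
plt.xlabel("revenue in exponential form")
plt.ylabel('country')
plt.show()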
OUTPUT:
LABELLING OF GRAPHS AND USE OF LEGENDS
A legend is an area describing the elements of the graph.
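A sketch of axis labels, a title and a legend on one plot. The two curves and their label strings are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 3, 50)

fig, ax = plt.subplots()
# The label given to each line is the text shown in the legend.
ax.plot(x, x ** 2, label="x squared")
ax.plot(x, x ** 3, label="x cubed")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Powers of x")
legend = ax.legend(loc="upper left")
plt.show()
```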
CODE:
OUTPUT:
TOPIC 9:
USING SEABORN FOR DATA VISUALIZATION
Seaborn is an open source Python library built on top of matplotlib. It is used
for data visualization and exploratory data analysis. It works directly with
pandas data frames and alongside other Python libraries.
Points to be covered: show 4 visualizations
1. seaborn plot functions
2. Multi plot grid
3. plotting categorical data
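As one sketch of a seaborn plot function, a heat map of a correlation matrix can be drawn as follows. The data frame is synthetic and its column names are hypothetical; the dataset used in the original lab is not shown.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data for illustration: 50 rows, 4 numeric columns.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

# Pairwise correlations between the columns.
corr = df.corr()

# annot=True writes each correlation value inside its cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heat map")
plt.show()
```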
VISUALIZATION 1: HEAT MAP

CODE:
OUTPUT:
VISUALIZATION 2
CODE:
OUTPUT:
VISUALIZATION 3: MULTI PLOT GRID

CODE:
OUTPUT:
VISUALIZATION 4: CATEGORICAL DATA
CODE:
OUTPUT:
TOPIC 10:
COURSERA
