
DATA VISUALIZATION AND ANALYTICS

LAB MANUAL

NAME: HIYA CHOPRA


SAP: 500083441
ROLL NUMBER: R214220519
BTECH-CSE-BAO

TOPIC 1:
INTRODUCTION TO PYTHON

Python is an interpreted, high-level programming language suitable for building
large software systems. Its design philosophy emphasizes code readability and
clarity.
Points being covered:
1. Creation of tuples
2. Lists
3. Dictionary
4. Series

TUPLES:
A tuple is an ordered sequence of Python objects separated by commas. Tuples are
immutable: once created, their items cannot be reassigned.
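A minimal sketch of tuple creation and immutability follows. The names and values are illustrative assumptions, not taken from the lab sheet.

```python
# Illustrative values only.
point = (3, 4)                 # parentheses are optional: point = 3, 4
record = ("Asha", 21, 8.7)     # a tuple may mix data types
single = (5,)                  # the trailing comma makes a one-item tuple

print(point[0], record[-1])    # indexing works as it does for lists
name, age, cgpa = record       # unpacking into separate variables

try:
    point[0] = 10              # tuples do not support item assignment
except TypeError as err:
    print("Cannot modify a tuple:", err)
```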

CODE:
OUTPUT:

LISTS:
Lists are like dynamic arrays: comma-separated sequences that may contain
integers, strings or any other Python objects. Unlike tuples, lists are mutable,
hence they can be altered after creation.
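The mutability described above can be sketched as follows, assuming a small list of made-up marks.

```python
# Illustrative values only.
marks = [72, 85, 90]
marks.append(64)       # add an item to the end
marks[0] = 75          # replace an item in place
marks.remove(85)       # delete the first matching value
marks.sort()           # sort in place, ascending
print(marks)

mixed = [1, "two", 3.0, (4, 5)]   # a list may hold different data types
print(len(mixed))
```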

CODE:

OUTPUT:

DICTIONARY:
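A dictionary is a collection of data values stored in a key: value format; the
values can be of more than one data type. Keys are unique, and since Python 3.7
a dictionary preserves insertion order, although items are accessed by key
rather than by position.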
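A short sketch of creating and updating a dictionary, using a hypothetical student record:

```python
# Hypothetical record; keys and values are illustrative.
student = {"name": "Asha", "sem": 5, "cgpa": 8.7}

student["course"] = "CSE"      # add a new key
student["sem"] = 6             # update an existing key
del student["cgpa"]            # remove a key

# get() returns a default instead of raising KeyError for a missing key
print(student["name"], student.get("city", "unknown"))

for key, value in student.items():
    print(key, "->", value)
```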
CODE:

OUTPUT:

SERIES:
A Series is a one-dimensional labelled array from the pandas library, capable of
storing values of any data type. Each value is associated with an index label.
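A minimal sketch of a Series with a custom index, assuming pandas is installed; the values are illustrative.

```python
import pandas as pd

# Illustrative values with explicit index labels
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

print(s["b"])        # access by label
print(s.iloc[0])     # access by position
print(s * 2)         # arithmetic applies to every value

# Mixed data types are stored with the generic 'object' dtype
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```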

CODE:

OUTPUT:
TOPIC 2:
INTRODUCTION TO PANDAS PACKAGES

Pandas is a Python library/package that provides fast, flexible and expressive
data structures. It contains functions for data manipulation and data analysis.
Points being covered:
1. Creating data frames using lists
2. Sorting rows
3. Selecting all rows for a specific column
4. Selecting a few rows for multiple columns
5. Selecting all rows for the 4th column of a data frame built from a list,
using iloc[row number, column number]
6. Calculating the row-wise mean
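The points above can be sketched in one short script. The student names, subjects and marks are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical data: one inner list per row
rows = [["Asha", 72, 85, 90], ["Ravi", 65, 70, 58], ["Meera", 88, 92, 79]]
df = pd.DataFrame(rows, columns=["name", "maths", "physics", "chemistry"])

sorted_df = df.sort_values(by="maths", ascending=False)   # sort rows
one_column = df["physics"]                  # all rows, one column
subset = df.loc[0:1, ["name", "maths"]]     # rows 0 and 1, two columns
fourth_column = df.iloc[:, 3]               # all rows, 4th column (position 3)

# Row-wise mean over the numeric columns (axis=1 works across columns)
df["mean"] = df[["maths", "physics", "chemistry"]].mean(axis=1)
print(df)
```

Note that loc slices by label and includes the end label, whereas iloc slices by position and excludes the end position.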

CODE: CREATING AND SORTING OF DATA

OUTPUT:
CODE: SELECTING ROWS

OUTPUT:

CODE:
OUTPUT:
TOPIC 3:
INTRODUCTION TO NUMPY

NumPy stands for Numerical Python. It is a library that provides
multidimensional array objects and is used when working with arrays.
Points being covered:
1. Create 1D, 2D and 3D arrays of random numbers using the numpy library
2. Perform indexing and slicing of arrays

1 DIMENSIONAL ARRAY:
A 1D array is a structured collection of components that can be accessed
individually by specifying the position of a component with a single index
value.
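A minimal sketch of a 1D array of random integers with indexing and slicing. The seed and value range are arbitrary assumptions chosen for repeatability.

```python
import numpy as np

rng = np.random.default_rng(seed=0)     # seeded so the run is repeatable
a = rng.integers(0, 100, size=8)        # 8 random integers in [0, 100)
print(a)

print(a[0], a[-1])    # first and last element
print(a[2:5])         # elements at positions 2, 3 and 4
print(a[::2])         # every second element
```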

CODE:
OUTPUT:

2 DIMENSIONAL ARRAY:
It’s an array organised in the form of a matrix and can be represented as a
collection of rows and columns.
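A 2D array of random numbers can be sketched as below. The shape (3 rows, 4 columns) is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
m = rng.random((3, 4))      # 3 rows, 4 columns, values in [0, 1)
print(m)

print(m[1, 2])      # single element: row 1, column 2
print(m[0])         # the whole first row
print(m[:, 1])      # the whole second column
print(m[:2, 1:3])   # first two rows, columns 1 and 2
```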

CODE:

OUTPUT:
3 DIMENSIONAL ARRAY
It’s a multidimensional array, which can be represented as a collection of 2D
arrays.
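A 3D array can be sketched in the same way. Here it is assumed to hold 2 blocks, each a 3 x 4 matrix.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
t = rng.random((2, 3, 4))   # 2 blocks of 3 rows x 4 columns
print(t)

print(t[0])           # the first 2D block
print(t[1, 2])        # row 2 of the second block
print(t[:, :, 0])     # first column of every block
print(t[0, 1, 2])     # a single element
```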

CODE:

OUTPUT:
TOPIC 4:
DATA MERGING

Data merging is a process of merging two data sets into one, and aligning the
rows from each based on common attributes and columns.
Points to be covered:
1. Merging 2 data frames
2. default Inner Join
3. Left Outer Join
4. Right Outer Join
5. Outer Join
6. concatenation
7. Hierarchical Index
8. To remove Index use ignore_index=True

CODE:
import pandas as pd
import numpy as np

# student -> file1(name, age, address, mob no)
#            file2(name, sap id, sem, course, university, mob no)
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})
print("df1:\n", df1, "\n\ndf2:\n", df2)

# With no arguments, merge joins on the common column ('key')
df_merge = pd.merge(df1, df2)
print(df_merge)

# Each join below starts from the original df1 and df2 so the results
# can be compared directly.
# Default Inner Join - use intersection of keys
inner = pd.merge(df1, df2, on='key')
print(inner)
# Left Outer Join - use keys from left object
left = pd.merge(df1, df2, how='left', on='key')
print(left)
# Right Outer Join - use keys from right object
right = pd.merge(df1, df2, how='right', on='key')
print(right)
# Outer Join - use union of both object keys
outer = pd.merge(df1, df2, how='outer', on='key')
print(outer)

# Merging a column of the left frame against the index of the right frame
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
print("left1:\n", left1, "\n\nright1:\n", right1, "\n\nAfter Merge:\n")
merged_on_index = pd.merge(left1, right1, left_on='key', right_index=True,
                           how='outer')
print(merged_on_index)

# Concatenation
# file1(sap_id, name, sem, cgpa) file2(sap_id, name, sem, cgpa)
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])
# Calling concat with these objects in a list glues together the values
# and indexes
print(pd.concat([s1, s2, s3]))
# axis=1 places each Series in its own column instead
print(pd.concat([s1, s2, s3], axis=1))

# Same operation on data frames
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])
print("df1:\n", df1, "\n\ndf2:\n", df2, "\n\nAfter Concat:\n")
keyed = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
print(keyed)

# Hierarchical Index - name the two column levels
hierarchical = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                         names=['upper', 'lower'])
print(hierarchical)

# To remove Index use ignore_index=True
reindexed = pd.concat([df1, df2], ignore_index=True)
print(reindexed)

OUTPUT:
TOPIC 5:
DATA CLEANING

Data cleaning, or cleansing, is the process of detecting and then removing or
correcting corrupt or inaccurate records in a record set, table or dataset.
Points to be covered:
1. Fill the NaNs with preceding values
2. Remove duplicates
3. Handle missing values
4. Remove extreme values (outliers)
5. Select all rows having a certain value using a condition
6. Remove rows and columns containing nulls
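The points above can be sketched on a small made-up table. The city names and sales figures are hypothetical, and the 1.5 x IQR rule used for outliers is one common convention, assumed here rather than taken from the lab sheet.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, duplicates and one extreme value
df = pd.DataFrame({
    "city": ["Delhi", "Delhi", "Pune", "Pune", "Goa", "Goa"],
    "sales": [100.0, np.nan, 120.0, 120.0, 5000.0, 110.0],
})

filled = df.ffill()                      # fill NaN with the preceding value
deduped = filled.drop_duplicates()       # drop fully repeated rows
mean_filled = df.fillna({"sales": df["sales"].mean()})   # alternative fill

# Outliers: keep values within 1.5 * IQR of the quartiles (assumed rule)
q1, q3 = deduped["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = deduped["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = deduped[in_range]

pune_rows = deduped[deduped["city"] == "Pune"]   # select rows by condition
rows_dropped = df.dropna()               # remove rows containing nulls
cols_dropped = df.dropna(axis=1)         # remove columns containing nulls
print(no_outliers)
```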

CODE: FILLING NAN WITH PRECEDING VALUES

OUTPUT:
CODE: REMOVING DUPLICATES

OUTPUT:

CODE: REMOVING ROWS/COLUMNS CONTAINING VALUES BASED ON CONDITION
OUTPUT:

TOPIC 6:
DATA TRANSFORMATION

Data transformation is the process of converting raw data into a format or
structure that is more suitable for an algorithm.
Points to be covered:
1. Use of applymap
2. get_dummies function
3. Converting a categorical variable into a "dummy" matrix, as used in
statistics and machine learning
4. Passing own bin names
5. Showing bin counts
6. Data smoothing and binning: continuous data is discretized by placing
values into bins for analysis
7. Vectorised operations
8. lambda function: an anonymous function with any number of arguments
but a single expression
CODE:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=['1', '2', '3'])
print("\ndf:\n", df, "\n")
print("Column-wise mean:\n", df.apply(np.mean, axis=0))
print("Row-wise mean:\n", df.apply(np.mean, axis=1))

# lambda function: anonymous function with a single expression
x = lambda a: a + 10
print(type(x))
print(x(10))
x = lambda a, b: a * b
print(x(10, 12))

# apply with a lambda: range (max - min) of each column
print("\ndf:\n", df, "\n\n")
print(df.apply(lambda col: col.max() - col.min()))

# Element-wise operation with applymap
# (pandas 2.1 and later name this method DataFrame.map and deprecate applymap)
print("\ndf:\n", df, "\n\n")
print(df.applymap(lambda v: v * 100))
# Vectorised operation: same result without calling a function per element
print(df * 100)

# Data Smoothing
# Binning -> continuous data is discretized by placing values into bins
Age = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
# Bin edges give the intervals (18, 25], (25, 35], (35, 60], (60, 100]
bins = [18, 25, 35, 60, 100]
Humans = pd.cut(Age, bins)
print(Humans)

# Bin counts
print(pd.Series(Humans).value_counts())

# Passing own bin names
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
print(pd.cut(Age, bins, labels=group_names))

# Convert categorical variable into "dummy" matrix
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'], 'data1': range(6)})
print("\ndf:\n", df, "\n\nDummy matrix:\n")
print(pd.get_dummies(df['key']))

# To add a prefix to the dummy columns
print("\ndf:\n", df, "\n")
dummies = pd.get_dummies(df['key'], prefix='Dummy')
print("\ndummies:\n", dummies, "\n")
df_with_dummy = df[['data1']].join(dummies)
print("\ndf_with_dummy:\n", df_with_dummy, "\n")

OUTPUT:
TOPIC 7:
INTRODUCTION TO MATPLOTLIB

Matplotlib is a multi-platform visualization library in Python used for creating
2D plots of arrays; matplotlib.pyplot is its plotting interface. It supports
several plot types such as line, bar, scatter and histogram.
Points to be covered:
1. Using matplotlib, create different graphs with proper labelling
2. histogram
3. stack plot
4. line plot
5. pie
6. scatter

HISTOGRAM:
A histogram is an approximate representation of the distribution of numerical
data.
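A minimal histogram sketch, assuming synthetic normally distributed scores; the data, bin count and labels are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(seed=1)
scores = rng.normal(loc=70, scale=10, size=200)   # synthetic data

fig, ax = plt.subplots()
# hist returns the count in each bin and the bin edges
counts, edges, _ = ax.hist(scores, bins=10, edgecolor="black")
ax.set_xlabel("Score")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of scores")
# Call plt.show() here to display the figure in an interactive session.
```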
CODE:
OUTPUT:

STACK PLOT

A stack plot draws a stacked area plot, in which the data series are plotted
vertically on top of each other rather than overlapping one another.
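A stack plot can be sketched as below, assuming made-up hours spent on three activities over five days.

```python
import matplotlib.pyplot as plt

# Hypothetical hours per day for three activities
days = [1, 2, 3, 4, 5]
sleeping = [7, 8, 6, 7, 8]
working = [8, 7, 9, 8, 6]
playing = [2, 3, 2, 3, 4]

fig, ax = plt.subplots()
# Each series is drawn on top of the running total of the previous ones
areas = ax.stackplot(days, sleeping, working, playing,
                     labels=["Sleeping", "Working", "Playing"])
ax.set_xlabel("Day")
ax.set_ylabel("Hours")
ax.set_title("Time spent per day")
ax.legend(loc="upper left")
# Call plt.show() here to display the figure in an interactive session.
```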

CODE:
OUTPUT:

LINE PLOT:
A line plot displays data as a series of points connected by straight line
segments.
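A minimal line plot sketch, using a sine curve as assumed sample data:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)    # 50 evenly spaced points

fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x), marker="o", label="sin(x)")
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.set_title("Line plot")
ax.legend()
# Call plt.show() here to display the figure in an interactive session.
```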

CODE:
OUTPUT:
PIE PLOT:
It’s a circular statistical graph which is divided into slices to illustrate numerical
proportion.
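A pie chart can be sketched as follows; the category names and shares are illustrative assumptions.

```python
import matplotlib.pyplot as plt

# Hypothetical shares that sum to 100
labels = ["Product A", "Product B", "Product C"]
sizes = [50, 30, 20]

fig, ax = plt.subplots()
# autopct prints each slice's percentage inside the slice
ax.pie(sizes, labels=labels, autopct="%1.1f%%", startangle=90)
ax.set_title("Share of sales")
# Call plt.show() here to display the figure in an interactive session.
```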

CODE:

OUTPUT:
SCATTER PLOT:
It is used to describe the relationship between two variables, represented by
dots.
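A minimal scatter plot sketch, assuming made-up pairs of study hours and marks:

```python
import matplotlib.pyplot as plt

# Hypothetical paired observations
hours = [1, 2, 3, 4, 5, 6]
marks = [35, 45, 50, 62, 70, 81]

fig, ax = plt.subplots()
points = ax.scatter(hours, marks, color="tab:blue")   # one dot per pair
ax.set_xlabel("Hours studied")
ax.set_ylabel("Marks")
ax.set_title("Marks against hours studied")
# Call plt.show() here to display the figure in an interactive session.
```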

CODE:
OUTPUT:

TOPIC 8:
CUSTOMIZING PLOT WITH MATPLOTLIB
Points to be covered:
1. use of subplots
2. Grid histograms
3. graphs overlays using box and whisker plots
4. proper labelling of graph x, y axis title using legend

USE OF SUBPLOTS
subplots is a function in the pyplot module of the matplotlib library that
creates a figure and a set of subplots. It is used for creating multiple axes
in one figure.
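A sketch of two plots placed side by side in one figure, using sine and cosine curves as assumed sample data:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One row, two columns: axes is an array holding both Axes objects
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))
axes[0].plot(x, np.sin(x))
axes[0].set_title("sin(x)")
axes[1].plot(x, np.cos(x), color="tab:orange")
axes[1].set_title("cos(x)")
fig.tight_layout()     # avoid overlapping titles and labels
# Call plt.show() here to display the figure in an interactive session.
```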

CODE:

OUTPUT:

GRID HISTOGRAM
CODE:

OUTPUT:

BOX PLOT
Boxplot is a way of displaying the distribution of data based on a five number
summary: minimum, first quartile, median, third quartile and maximum.
CODE:
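import matplotlib.pyplot as plt

# total_revenue is assumed to be defined earlier in the lab as a list of two
# revenue sequences, one for Mexico and one for Australia.
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111)
# notch=True draws notched boxes; vert=False draws them horizontally
ax.boxplot(total_revenue, patch_artist=True, notch=True, vert=False)
ax.set_xlabel("revenue in exponential form")
ax.set_yticklabels(['mexico', 'australia'])
ax.set_ylabel('country')
plt.show()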

OUTPUT:

LABELLING OF GRAPHS AND USE OF LEGENDS


A legend is an area describing the elements of the graph.
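A minimal sketch of axis labels, a title and a legend, assuming two simple curves as sample data:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 5, 50)

fig, ax = plt.subplots()
# The label given to each plot call becomes its legend entry
ax.plot(x, x, label="linear")
ax.plot(x, x ** 2, label="quadratic")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("Linear and quadratic growth")
legend = ax.legend(loc="upper left")
# Call plt.show() here to display the figure in an interactive session.
```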

CODE:
OUTPUT:

TOPIC 9:
USING SEABORN FOR DATA VISUALIZATION

Seaborn is an open-source Python library built on top of matplotlib. It is used
for data visualization and exploratory data analysis. It works easily with
pandas data frames and with other Python libraries.
Points to be covered: show 4 visualizations
1. seaborn plot functions
2. Multi plot grid
3. plotting categorical data
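As one example of a seaborn plot function, a correlation heat map can be sketched as below. The data frame is synthetic and the column names are assumptions; the datasets used in the lab are not reproduced here.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: 50 rows of three numeric columns
rng = np.random.default_rng(seed=0)
df = pd.DataFrame(rng.normal(size=(50, 3)),
                  columns=["height", "weight", "age"])

corr = df.corr()      # pairwise correlation between the columns

fig, ax = plt.subplots()
# annot=True writes each correlation value inside its cell
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation heat map")
# Call plt.show() here to display the figure in an interactive session.
```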

VISUALIZATION 1: HEAT MAP


CODE:
OUTPUT:

VISUALIZATION 2
CODE:
OUTPUT:

VISUALIZATION 3: MULTI PLOT GRID


CODE:

OUTPUT:
VISUALIZATION 4: CATEGORICAL DATA

CODE:

OUTPUT:

TOPIC 10:
COURSERA
