Unit6 - Working With Data

WORKING WITH DATA
Ms. Margaret Salve,

Department of Computer Science
• READING AND WRITING FILES
• LOADING DATA WITH PANDAS
• WORKING WITH AND SAVING DATA WITH PANDAS
• DATA CLEANING IN PYTHON
File Handling in Python:
• Python has various functions for file handling:
open() – opens a file in a given mode
read() – reads the contents of a file
write() – writes contents to a file
close() – closes a file
Opening a File:
• Python provides an open() function that accepts two arguments: the filename and the mode in which to open the file
• open() returns a file object, which is used to perform various file operations
Syntax:
fileobj = open("filename", "access mode")

where,
access modes are: r, w, a, rb, wb, ab etc
Example:
fileobj = open("demo.txt", "r")

if fileobj:
    print("File opened successfully")
Output:
File opened successfully
Reading a File:
• The read() method reads all the contents of the file opened
• readline() reads only a single line from the file
• read(count) reads the specified count of characters from the file
Example:
# File handling
fileobj = open("demo.txt", "r")  # Opening a file

if fileobj:
    print("File opened successfully")
    print("All Contents")
    print(fileobj.read())        # Read all contents of the file
    fileobj.seek(0)              # Return to the start of the file before reading again
    print("Only 10 characters")
    print(fileobj.read(10))      # Read only 10 characters of the file
    fileobj.seek(0)
    print("Single Line")
    print(fileobj.readline())    # Read only a single line of the file
Output:
File opened successfully
All Contents
Hello Everyone
Welcome to Python Programming
Only 10 characters
Hello Ever
Single Line
Hello Everyone
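The three read calls can be tried without an existing demo.txt. The following is only a sketch: it writes a small temporary file (the path and the two lines of text are assumptions chosen to match the slide) and shows that read() leaves the file position at the end, so seek(0) is needed before reading again.

```python
import os
import tempfile

# Create a throwaway file so the example does not depend on demo.txt existing.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("Hello Everyone\nWelcome to Python Programming\n")

fileobj = open(path, "r")
everything = fileobj.read()      # whole file; the position is now at the end
after_end = fileobj.read()       # nothing is left to read, so this is ""
fileobj.seek(0)                  # go back to the beginning
first_ten = fileobj.read(10)     # first 10 characters
fileobj.seek(0)
first_line = fileobj.readline()  # one line, including its trailing newline
fileobj.close()

print(repr(after_end))
print(first_ten)
print(repr(first_line))
```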
Writing to/Creating a File:
• To create a new file or write to an existing file, the file needs to be opened in “w” or “a” mode
Example:
fileptr = open("new.txt","a")
fileptr.write("This is a new file")
Closing a File:
• Once all the operations are done on the file, we must close it through our Python script using the
close() method.
Example:
fileptr.close()
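A common alternative to calling close() by hand is the with statement, which closes the file automatically when the block ends. The sketch below is illustrative only; the temporary path and the text written are assumptions. It also contrasts the "w" and "a" modes.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "new.txt")  # hypothetical location

# "w" creates the file (or truncates an existing one).
with open(path, "w") as fileptr:
    fileptr.write("This is a new file\n")

# "a" keeps the existing content and appends at the end.
with open(path, "a") as fileptr:
    fileptr.write("One more line\n")

with open(path, "r") as fileptr:
    lines = fileptr.readlines()

print(lines)
print(fileptr.closed)   # the with block has already closed the file
```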
NumPy in Python:
• NumPy (Numerical Python) is a library used for computation on and processing of single-dimensional and multidimensional arrays.
• NumPy operations are faster than the same operations on lists because array elements are stored in one contiguous block of memory.
• The NumPy package needs to be installed using the following command:
pip install numpy
• Once installed, the package has to be imported into the Python program using the import keyword
• The n-dimensional array object in NumPy is known as ndarray
• We can create a NumPy ndarray object by using the array() function
Example:
import numpy as np               # import the numpy package
arr = np.array([1, 2, 3, 4, 5])  # Creating ndarray object
print(arr)
print(type(arr))
Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
Array Indexing and Slicing:

• Any array element can be accessed by using the index number
• The indexes in NumPy arrays start with 0
• As with lists, slicing can also be performed on arrays using the slice operator [:]

Type of array   Creating the array                                     Accessing an element using the index
1-D array       arr = np.array([1, 2, 3, 4])                           print(arr[1])     # Prints 2
2-D array       arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])    print(arr[1, 3])  # Prints 9
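Indexing and slicing on 1-D and 2-D arrays can be sketched as follows. The array names arr1 and arr2 are chosen here only for illustration.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(arr1[1:4])      # elements at positions 1, 2 and 3
print(arr1[-1])       # a negative index counts from the end
print(arr2[1, 3])     # row 1, column 3
print(arr2[0, 1:3])   # row 0, columns 1 and 2
print(arr2[:, 0])     # first column of every row
```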
Pandas Library:
• Pandas is an open-source Python library that provides high-performance data manipulation.
• It provides various data structures and operations for manipulating numerical data and time series
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name Pandas is derived from the term "Panel Data"; its development was started by Wes McKinney in 2008
• Pandas is built on top of the NumPy package, which means NumPy is required for Pandas to work.
• It also has to be installed as follows:
pip install pandas
Pandas DataFrames:
• A Pandas DataFrame is a two-dimensional data structure, i.e. a table with rows and columns.
• A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
• Syntax: pandas.DataFrame( data, index, columns, dtype, copy)
Creating a simple DataFrame:
• A simple DataFrame can be created using a single list or a list of lists
import pandas as pd

data = ["C", "C++", "Java", "Python", "PHP"]  # Create list
df = pd.DataFrame(data)                       # Create dataframe using the list
print(df)
Output:
        0
0       C
1     C++
2    Java
3  Python
4     PHP
Dataframes:
Example:
import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],  # column labels
                   index=["a", "b", "c", "d", "e"])              # row labels (index)
print(df1)
Output:
   Roll Name  Percent
a   101  Abc       90
b   102  Xyz       80
c   103  Pqr       75
d   104  Mno       85
e   105  Def       77
Loading Data with Pandas:
• Using the Pandas library, data can be imported from CSV (comma-separated values) files and loaded into Pandas DataFrames
Example:
import pandas as pd
data = pd.read_csv("IRIS.csv")  # Read the csv file; read_csv() already returns a DataFrame

df = pd.DataFrame(data)         # Wrapping the result in DataFrame() is optional here
print(df)
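To try read_csv() without a file on disk, the CSV text can be supplied through io.StringIO. This is a minimal sketch; the column names and values are made up for illustration and do not come from IRIS.csv.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the columns and values are assumptions.
csv_text = "Roll,Name,Percent\n101,Abc,90\n102,Xyz,80\n103,Pqr,75\n"

df = pd.read_csv(io.StringIO(csv_text))   # read_csv returns a DataFrame directly
print(type(df))
print(df.head(2))                         # first two rows only
```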
Additional information about Dataframes:
Example:
print(len(df1))     # Returns number of rows
print(df1.shape)    # Returns rows and columns as a tuple
print(df1.size)     # Returns the number of 'cells' in the table
print(df1.columns)  # Returns the column names
print(df1.dtypes)   # Returns data types of the columns
Output:
5
(5, 3)
15
Index(['Roll', 'Name', 'Percent'], dtype='object')
Roll        int64
Name       object
Percent     int64
dtype: object
Creating Dataframes:
We can create a DataFrame from any of the following:
• dict
• Lists
• NumPy ndarrays
• Series
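Creating a DataFrame from a list was shown earlier; the other three sources can be sketched as below. The variable names and sample values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the column names.
df_dict = pd.DataFrame({"Roll": [101, 102], "Name": ["Abc", "Xyz"]})

# From a NumPy ndarray: column names are supplied separately.
df_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# From a Series: the Series name becomes the column name.
df_series = pd.DataFrame(pd.Series([90, 80], name="Percent"))

print(df_dict)
print(df_array)
print(df_series)
```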
1. Accessing any column in DataFrame:
The columns of a dataframe can be accessed by their column names.
Example:
print(df1["Name"])  # Accessing a column of the dataframe
Output:
a    Abc
b    Xyz
c    Pqr
d    Mno
e    Def
2. Accessing any row in DataFrame:
• Pandas provides a special indexer to retrieve rows from a DataFrame.
• The loc[] indexer retrieves a row by its index label
Example:
data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
print(df1.loc['b'])  # Retrieving data at the row indexed "b"
data = df1.loc['b']  # Retrieving data at index 'b' and storing it in a variable

print(data)
Output:
Roll       102
Name       Xyz
Percent     80
3. Accessing any row in DataFrame using integer index:
• Rows can be selected by passing an integer position to the iloc[] indexer
Example:
data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
print(df1.iloc[2])  # Retrieving data at the row at integer position 2
Output:
Roll       103
Name       Pqr
Percent     75

4. Slicing a Dataframe:
• Multiple rows can be selected using the : (slice) operator.
Example:
data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
print(df1[2:4])  # Retrieving rows from position 2 to position (4 - 1)
Output:
   Roll Name  Percent
c   103  Pqr       75
d   104  Mno       85
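Slices can also be written explicitly with iloc[] (by position) or loc[] (by label). The sketch below rebuilds the same sample table; note that a position slice excludes its end, while a label slice includes it. The variable names by_position, by_label and cell are chosen here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[101, "Abc", 90], [102, "Xyz", 80], [103, "Pqr", 75],
     [104, "Mno", 85], [105, "Def", 77]],
    columns=["Roll", "Name", "Percent"],
    index=["a", "b", "c", "d", "e"],
)

by_position = df1.iloc[2:4]    # positions 2 and 3; the end position is excluded
by_label = df1.loc["c":"d"]    # labels c to d; the end label is included
cell = df1.loc["b", "Name"]    # a single value by row label and column name

print(by_position)
print(by_label)
print(cell)
```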

5. Appending/Adding row to Dataframe:
• New rows can be added to a dataframe by concatenating it with another dataframe using pd.concat(). The older DataFrame.append() method was removed in pandas 2.0.
Example:
data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame([[106, 'AAA', 60], [107, 'asfd', 65]],
                   columns=["Roll", "Name", "Percent"], index=['f', 'g'])
df1 = pd.concat([df1, df2])  # Appending the rows of df2 to df1
print(df1)
Output:
   Roll  Name  Percent
a   101   Abc       90
b   102   Xyz       80
c   103   Pqr       75
d   104   Mno       85
e   105   Def       77
f   106   AAA       60
g   107  asfd       65
6. Dropping or deleting rows from a Dataframe:
• Rows can be deleted by passing their index label to the drop() method
Example:
df1 = df1.drop('a')
print(df1)
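drop() also accepts a list of labels and, by default, returns a new DataFrame rather than changing the one it is called on. A minimal sketch on the same sample table (the name smaller is an illustrative choice):

```python
import pandas as pd

df1 = pd.DataFrame(
    [[101, "Abc", 90], [102, "Xyz", 80], [103, "Pqr", 75],
     [104, "Mno", 85], [105, "Def", 77]],
    columns=["Roll", "Name", "Percent"],
    index=["a", "b", "c", "d", "e"],
)

smaller = df1.drop(["a", "c"])   # new DataFrame without rows a and c
print(smaller)
print(len(df1))                  # df1 itself still has all 5 rows
```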
Working with and saving data with pandas:
• A Pandas DataFrame can be saved as a CSV file using the to_csv() method.
data1 = [[101, "Abc", 90],
[102, "Xyz", 80],
[103, "Pqr", 75],
[104, "Mno", 85],
[105, "Def", 77]
]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a","b","c","d","e"])
df1.to_csv("StudentData.csv")
Saving Dataframe without header and index:
• Setting the parameter header=False removes the column heading and setting index=False removes
the index while writing DataFrame to CSV file.
data1 = [[101, "Abc", 90],

[102, "Xyz", 80],
[103, "Pqr", 75],
[104, "Mno", 85],
[105, "Def", 77]
]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a","b","c","d","e"])
df1.to_csv("StudentData.csv", header=False, index=False)  # Saving file without header and index
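The effect of header and index can be inspected without writing a file: when no path is given, to_csv() returns the CSV text as a string. The sketch below uses a two-row table; the names with_labels and plain are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[101, "Abc", 90], [102, "Xyz", 80]],
    columns=["Roll", "Name", "Percent"],
    index=["a", "b"],
)

with_labels = df1.to_csv()                     # header row and index column included
plain = df1.to_csv(header=False, index=False)  # data values only

print(with_labels)
print(plain)
```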
Data Cleaning in Python:
• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
• Data cleaning aims at filling missing values, smoothing out noise while determining outliers and
rectifying inconsistencies in the data
• Data cleaning includes removing null records, dropping unnecessary columns, treating missing values, rectifying junk values (otherwise called outliers), and restructuring the data into a more readable format.
• There are various techniques to clean the data like:

1. Removing Null/Duplicate Records/Ignore tuple
2. Dropping unnecessary Columns
3. Treating missing values
Removing Null/Duplicate Records/Ignore tuple:
• If a significant amount of data is missing in a particular row, it is better to drop that row, as it would not add any value to our model.
• Sometimes a csv file has null values, which are later displayed as NaN in the DataFrame
Example (blank cells are missing values):
RollNo  Name  Percent  City
101     Abc   80       Pune
102     Xyz   85       Mumbai
103     Def   90       Goa
104           75       Delhi
105                    Pune
106     Bbb
        Pqr   60       Nagpur
108                    Nashik
109           50       Mumbai
110     dgfg  74       Pune
Output:
   RollNo  Name  Percent    City
0   101.0   Abc     80.0    Pune
1   102.0   Xyz     85.0  Mumbai
2   103.0   Def     90.0     Goa
3   104.0   NaN     75.0   Delhi
4   105.0   NaN      NaN    Pune
5   106.0   Bbb      NaN     NaN
6     NaN   Pqr     60.0  Nagpur
7   108.0   NaN      NaN  Nashik
8   109.0   NaN     50.0  Mumbai
9   110.0  dgfg     74.0    Pune

Removing Null/Duplicate Records/Ignore tuple:
• The Pandas DataFrame dropna() function is used to remove rows or columns with Null/NaN values
• This function returns a new DataFrame and the source DataFrame remains unchanged.
import pandas as pd
data = pd.read_csv("Cleaning.csv")  # Read the csv file
df = pd.DataFrame(data)             # Load the imported data into dataframe
print(df)
newdf = df.dropna()                 # Drop the rows containing NaN values from dataframe df
print(newdf)
Output of print(newdf):
   RollNo  Name  Percent    City
0   101.0   Abc     80.0    Pune
1   102.0   Xyz     85.0  Mumbai
2   103.0   Def     90.0     Goa
9   110.0  dgfg     74.0    Pune
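The section title also mentions duplicate records. The sketch below, on made-up data, shows drop_duplicates() next to two dropna() options: subset limits the NaN check to chosen columns, and thresh keeps rows that have at least the given number of non-missing values.

```python
import numpy as np
import pandas as pd

# Illustrative data only: row 2 repeats row 1, and rows 3 and 4 have gaps.
df = pd.DataFrame({
    "RollNo": [101, 102, 102, 104, 105],
    "Name": ["Abc", "Xyz", "Xyz", np.nan, np.nan],
    "Percent": [80, 85, 85, 75, np.nan],
})

no_dupes = df.drop_duplicates()       # removes the repeated 102 row
named = df.dropna(subset=["Name"])    # drops rows only where Name is missing
mostly_full = df.dropna(thresh=2)     # keeps rows with at least 2 non-missing values

print(no_dupes)
print(named)
print(mostly_full)
```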
2. Dropping unnecessary Columns
• Sometimes the dataset we receive is huge, with a large number of rows and columns
• Some columns from the dataset may not be useful for our model
• Such data is better removed, as it would otherwise consume valuable resources like memory and processing time.
• Example: In the previous data the column City is not required for finding the student progression, hence it can be dropped using the drop() method
new2 = df.drop(['City'], axis=1, inplace=False)

print(new2)
where,
axis : Takes value 0 for 'index' and 1 for 'columns'
inplace : inplace=True makes the changes in the original dataframe, and inplace=False returns a new dataframe
3. Treating missing values
• Missing data can occur when no information is provided for one or more items or for a whole unit
• Missing data can be a very big problem in real-life scenarios.
• It must be handled carefully, as it can be an indication of something important.
• We can fill in missing values using the fillna() method, which lets the user replace NaN values with a value of their own.
Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
where,
value : Scalar, dictionary, Series or DataFrame used to fill in place of NaN.
method : Used if the user doesn't pass any value. Pandas has fill methods such as bfill/backfill and ffill
axis : Takes an int or string value for rows/columns.
inplace : A boolean; if True, the changes are made in the dataframe itself.
Example:
new3 = df['City'].fillna("No Data", inplace=False)
print(new3)
Output:
0       Pune
1     Mumbai
2        Goa
3      Delhi
4       Pune
5    No Data
6     Nagpur
7     Nashik
8     Mumbai
9       Pune
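fillna() can also take a dict so that each column gets its own replacement, and a numeric column is often filled with a statistic such as the mean. This is a sketch on made-up data; the replacement values "Unknown" and "No Data" are illustrative choices.

```python
import numpy as np
import pandas as pd

# Illustrative data with one gap in each column.
df = pd.DataFrame({
    "Name": ["Abc", np.nan, "Pqr"],
    "Percent": [80.0, np.nan, 60.0],
    "City": ["Pune", "Delhi", np.nan],
})

# A dict gives each listed column its own replacement value.
filled = df.fillna({"Name": "Unknown", "City": "No Data"})

# Fill the numeric column with the mean of its known values.
filled["Percent"] = filled["Percent"].fillna(df["Percent"].mean())

print(filled)
```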
