
WORKING WITH DATA

Ms. Margaret Salve,


Department of Computer Science
• READING AND WRITING FILES
• LOADING DATA WITH PANDAS
• WORKING WITH AND SAVING DATA WITH PANDAS
• DATA CLEANING IN PYTHON
File Handling in Python:
• Python has various functions for file handling:
open() – Opens a file in various modes
read() – Reads the contents of a file
write() – Writes contents to a file
close() – Closes a file
Opening a File:
• Python provides an open() function that accepts two arguments: the file name and the mode in which to open the file.
• open() returns a file object, which is used to perform the various file operations.
Syntax:

fileobj = open("filename", "access mode")

where the access modes are: r, w, a, rb, wb, ab, etc.

Example:

fileobj = open("demo.txt", "r")
if fileobj:
    print("File opened successfully")

Output:

File opened successfully
Reading a File:
• The read() method reads all the contents of the file opened
• readline() reads only a single line from the file
• read(count) reads the specified count of characters from the file
Example:

# File handling
fileobj = open("demo.txt", "r")    # Opening a file
if fileobj:
    print("File opened successfully")
    print("All Contents")
    print(fileobj.read())          # Read all contents of the file
    fileobj.seek(0)                # Move back to the start before reading again
    print("Only 10 characters")
    print(fileobj.read(10))        # Read only 10 characters of the file
    fileobj.seek(0)                # Move back to the start before reading again
    print("Single Line")
    print(fileobj.readline())      # Read only a single line of the file

Output:

File opened successfully
All Contents
Hello Everyone
Welcome to Python Programming
Only 10 characters
Hello Ever
Single Line
Hello Everyone
Writing to/Creating a File:
• To create a new file or write to an existing file, the file needs to be opened in "w" (write, which overwrites any existing contents) or "a" (append) mode.
Example:

fileptr = open("new.txt","a")
fileptr.write("This is a new file")

Closing a File:
• Once all the operations are done on the file, we must close it through our Python script using the
close() method.

Example:
fileptr.close()
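The open, write, read and close steps above can also be sketched with a with statement, which closes the file automatically when the block ends. This is a minimal sketch, not part of the original slides: the temporary directory and the variable name path are assumptions made so the example does not touch any existing file.

```python
import os
import tempfile

# Hypothetical location: a file inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "new.txt")

# "w" creates the file, or empties it if it already exists
with open(path, "w") as fileptr:
    fileptr.write("This is a new file\n")

# "a" adds to the end of the file and keeps the existing contents
with open(path, "a") as fileptr:
    fileptr.write("This line was appended\n")

# "r" opens the file for reading
with open(path, "r") as fileptr:
    lines = fileptr.readlines()

print(lines)
print(fileptr.closed)  # the with block has already closed the file
```

Because the file is closed on leaving the with block, an explicit close() call is not needed in this style.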
NumPy in Python:
• NumPy (Numerical Python) is a library used for computation and processing of single-dimensional and multidimensional array elements.
• NumPy arrays are faster to work with than lists, as the array elements are stored in one contiguous block of memory.
• The NumPy package needs to be installed using the following command:
pip install numpy
• Once installed, the package has to be imported in the Python program using the import keyword.
• The n-dimensional array object in NumPy is known as ndarray.
• We can create a NumPy ndarray object by using the array() function.
Example:

import numpy as np                  # Import the numpy package
arr = np.array([1, 2, 3, 4, 5])     # Creating an ndarray object
print(arr)
print(type(arr))

Output:

[1 2 3 4 5]
<class 'numpy.ndarray'>

Array Indexing and Slicing:


• Any array element can be accessed by using the index number
• The indexes in NumPy arrays start with 0
• As with lists, slicing can also be performed on arrays using the slice operator [:]
NumPy in Python:

Type of array   Creating the array                                     Accessing an element using the array index
1-D array       arr = np.array([1, 2, 3, 4])                           print(arr[1])      # Prints 2
2-D array       arr = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])    print(arr[1, 3])   # Prints 9
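Slicing with the [:] operator, mentioned above, can be sketched as follows. The array values and variable names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

arr = np.array([10, 20, 30, 40, 50])

first_three = arr[:3]    # elements at index 0, 1 and 2
middle = arr[1:4]        # index 1 up to, but not including, index 4
every_other = arr[::2]   # every second element, starting at index 0

grid = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

second_row = grid[1, :]  # all columns of the row at index 1
block = grid[:, 1:3]     # columns 1 and 2 of every row

print(first_three, middle, every_other)
print(second_row)
print(block)
```

As with lists, the end index of a slice is excluded from the result.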
Pandas Library:
• Python Pandas is an open-source library that provides high-performance data manipulation in Python.
• It provides various data structures and operations for manipulating numerical data and time series.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name Pandas is derived from the term "panel data"; the library was developed by Wes McKinney in 2008.
• Pandas is built on top of the NumPy package, which means NumPy is required for Pandas to work.
• It also has to be installed, as follows:
pip install pandas
Pandas DataFrames:
• A Pandas DataFrame is a two-dimensional data structure: a table with rows and columns.
• A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
• Syntax: pandas.DataFrame(data, index, columns, dtype, copy)
Creating a simple DataFrame:
• A simple DataFrame can be created using a single list or a list of lists.

import pandas as pd

data = ["C", "C++", "Java", "Python", "PHP"]   # Create a list
df = pd.DataFrame(data)                        # Create a DataFrame using the list
print(df)

Output:

        0
0       C
1     C++
2    Java
3  Python
4     PHP
Dataframes:

Example:

import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
print(df1)

Output:

   Roll Name  Percent
a   101  Abc       90
b   102  Xyz       80
c   103  Pqr       75
d   104  Mno       85
e   105  Def       77

(The header row Roll, Name, Percent holds the columns; the leftmost labels a to e form the index.)
Loading Data with Pandas:
• Using the Pandas library, data can be imported from csv (comma separated values) files and loaded into Pandas DataFrames.
Example:

import pandas as pd

data = pd.read_csv("IRIS.csv")   # Read the csv file; read_csv() already returns a DataFrame
df = pd.DataFrame(data)          # Load the imported data into a DataFrame (optional step)
print(df)
Additional information about Dataframes:
Example:

print(len(df1))       # Returns the number of rows
print(df1.shape)      # Returns rows and columns as a tuple
print(df1.size)       # Returns the number of 'cells' in the table
print(df1.columns)    # Returns the column names
print(df1.dtypes)     # Returns the data types of the columns

Output:

5
(5, 3)
15
Index(['Roll', 'Name', 'Percent'], dtype='object')
Roll        int64
Name       object
Percent     int64
dtype: object
Creating Dataframes:
We can create a DataFrame from any of the following:
• dict
• Lists
• NumPy ndarrays
• Series
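The four sources listed above can be sketched as follows. The sample values and the variable names (df_dict, df_list, df_array, df_series) are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the column names
df_dict = pd.DataFrame({"Roll": [101, 102], "Name": ["Abc", "Xyz"]})

# From a list of lists: the column names are supplied separately
df_list = pd.DataFrame([[101, "Abc"], [102, "Xyz"]], columns=["Roll", "Name"])

# From a NumPy ndarray: each inner array becomes one row
df_array = pd.DataFrame(np.array([[101, 90], [102, 80]]),
                        columns=["Roll", "Percent"])

# From a Series: the name of the Series becomes the column name
percent = pd.Series([90, 80], name="Percent")
df_series = pd.DataFrame(percent)

print(df_dict)
print(df_list)
print(df_array)
print(df_series)
```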
1. Accessing any column in DataFrame:
The columns of a DataFrame can be accessed by their column names.
Example:
print(df1["Name"])   # Accessing a column of the DataFrame

Output:
a    Abc
b    Xyz
c    Pqr
d    Mno
e    Def
Name: Name, dtype: object
2. Accessing any row in DataFrame:
• Pandas provides a special indexer to retrieve rows from a DataFrame.
• The loc[] indexer retrieves a row by its index label.
Example:

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

print(df1.loc['b'])   # Retrieving data at the row indexed 'b'
data = df1.loc['b']   # Retrieving data at index 'b' and storing it in a variable
print(data)

Output (each print call displays the same row):

Roll       102
Name       Xyz
Percent     80
Name: b, dtype: object
3. Accessing any row in DataFrame using integer index:
• Rows can be selected by passing an integer position to the iloc[] indexer.
• Example:

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

print(df1.iloc[2])   # Retrieving data at the row at index number 2

Output:

Roll       103
Name       Pqr
Percent     75
Name: c, dtype: object


4. Slicing a Dataframe:
• Multiple rows can be selected using the : (slice) operator.
Example:

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

print(df1.iloc[2:4])   # Retrieving rows from index 2 to index (4 - 1)

Output:

   Roll Name  Percent
c   103  Pqr       75
d   104  Mno       85
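Slices can be taken either by position or by label, and the two treat the end of the slice differently. The sketch below reuses the student data from the slides; the variable names by_position, by_label, cell and subset are assumptions.

```python
import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

by_position = df1.iloc[2:4]    # positions 2 and 3; the end position is excluded
by_label = df1.loc["c":"d"]    # labels 'c' to 'd'; the end label is included

cell = df1.loc["b", "Name"]                       # one value: row 'b', column 'Name'
subset = df1.loc[["a", "e"], ["Roll", "Percent"]]  # chosen rows and columns

print(by_position)
print(by_label)
print(cell)
print(subset)
```

Both slices select rows c and d here, because a label slice with loc[] includes its end label while a position slice with iloc[] does not include its end position.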


5. Appending/Adding row to Dataframe:
• New rows can be added to a DataFrame using the pd.concat() function. (The older DataFrame.append() method was removed in pandas 2.0.)
Example:

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame([[106, 'AAA', 60], [107, 'asfd', 65]],
                   columns=["Roll", "Name", "Percent"], index=['f', 'g'])
df1 = pd.concat([df1, df2])   # Appending the rows of df2 to df1
print(df1)

Output:

   Roll  Name  Percent
a   101   Abc       90
b   102   Xyz       80
c   103   Pqr       75
d   104   Mno       85
e   105   Def       77
f   106   AAA       60
g   107  asfd       65
6. Dropping or deleting rows from a Dataframe:
• Rows can be deleted by passing their index label to the drop() method.
Example:

df1 = df1.drop('a')   # Returns a new DataFrame without the row labelled 'a'
print(df1)
Working with and saving data with pandas:
• A Pandas DataFrame can be saved as a CSV file using the to_csv() method.
data1 = [[101, "Abc", 90],
[102, "Xyz", 80],
[103, "Pqr", 75],
[104, "Mno", 85],
[105, "Def", 77]
]

df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a","b","c","d","e"])

df1.to_csv("StudentData.csv")
Saving Dataframe without header and index:
• Setting the parameter header=False removes the column heading and setting index=False removes
the index while writing DataFrame to CSV file.

data1 = [[101, "Abc", 90],


[102, "Xyz", 80],
[103, "Pqr", 75],
[104, "Mno", 85],
[105, "Def", 77]
]

df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a","b","c","d","e"])

df1.to_csv("StudentData.csv", header=False, index=False)   # Saving the file without header and index
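The effect of the header and index parameters can be checked without creating a file, as in the sketch below. It assumes an in-memory io.StringIO buffer in place of StudentData.csv; the names buffer, csv_text, restored and bare_text are illustrative.

```python
import io

import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

# to_csv() accepts a file path or a file-like object such as StringIO
buffer = io.StringIO()
df1.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back, using the first column as the index
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(restored)

# With no path given, to_csv() returns the CSV text as a string
bare_text = df1.to_csv(header=False, index=False)
print(bare_text)
```

When a file is saved without its header and index, that information is not stored in the file, so column names have to be supplied again when the file is read back.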
Data Cleaning in Python:
• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
• Data cleaning aims at filling missing values, smoothing out noise while determining outliers and
rectifying inconsistencies in the data
• Data cleaning consists of removing null records, dropping unnecessary columns, treating missing values, rectifying junk values (otherwise called outliers), restructuring the data into a more readable format, etc.

• There are various techniques to clean the data like:


1. Removing Null/Duplicate Records/Ignore tuple
2. Dropping unnecessary Columns
3. Treating missing values
Removing Null/Duplicate Records/Ignore tuple:
• If a significant amount of data is missing in a particular row, it is better to drop that row, as it would not add any value to our model.
• Sometimes a csv file has null values, which are later displayed as NaN in the DataFrame.
Example (contents of the csv file; blank cells are missing values):

RollNo   Name   Percent   City
101      Abc    80        Pune
102      Xyz    85        Mumbai
103      Def    90        Goa
104             75        Delhi
105                       Pune
106      Bbb
         Pqr    60        Nagpur
108                       Nashik
109             50        Mumbai
110      dgfg   74        Pune

Output (the same data loaded into a DataFrame):

   RollNo  Name  Percent    City
0   101.0   Abc     80.0    Pune
1   102.0   Xyz     85.0  Mumbai
2   103.0   Def     90.0     Goa
3   104.0   NaN     75.0   Delhi
4   105.0   NaN      NaN    Pune
5   106.0   Bbb      NaN     NaN
6     NaN   Pqr     60.0  Nagpur
7   108.0   NaN      NaN  Nashik
8   109.0   NaN     50.0  Mumbai
9   110.0  dgfg     74.0    Pune


Removing Null/Duplicate Records/Ignore tuple:
• The Pandas DataFrame dropna() function is used to remove rows or columns that contain Null/NaN values.
• This function returns a new DataFrame and the source DataFrame remains unchanged.
Example:

import pandas as pd

data = pd.read_csv("Cleaning.csv")   # Read the csv file
df = pd.DataFrame(data)              # Load the imported data into a DataFrame
print(df)
newdf = df.dropna()                  # Drop the rows with NaN values from DataFrame df
print(newdf)

Output of print(newdf):

   RollNo  Name  Percent    City
0   101.0   Abc     80.0    Pune
1   102.0   Xyz     85.0  Mumbai
2   103.0   Def     90.0     Goa
9   110.0  dgfg     74.0    Pune
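The section title also mentions duplicate records, and dropna() can be made less strict than dropping every row that has any NaN. The sketch below illustrates both under stated assumptions: the small table is made-up data (not the Cleaning.csv file), and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 2 repeats row 1, rows 3 and 4 have missing values
df = pd.DataFrame({
    "RollNo": [101, 102, 102, 105, 106],
    "Name": ["Abc", "Xyz", "Xyz", np.nan, "Bbb"],
    "Percent": [80.0, 85.0, 85.0, np.nan, np.nan],
    "City": ["Pune", "Mumbai", "Mumbai", "Pune", "Nagpur"],
})

# Remove rows that are exact copies of an earlier row
no_duplicates = df.drop_duplicates()

# Keep only the rows that have at least 3 non-missing values
at_least_three = df.dropna(thresh=3)

# Drop a row only when its Percent value is missing
needs_percent = df.dropna(subset=["Percent"])

print(no_duplicates)
print(at_least_three)
print(needs_percent)
```

Like dropna(), drop_duplicates() returns a new DataFrame and leaves df unchanged.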
2. Dropping unnecessary Columns
• Sometimes the data set we receive is huge, with a large number of rows and columns.
• Some columns from the dataset may not be useful for our model.
• Such data is better removed, as it would otherwise consume valuable resources like memory and processing time.
• Example: In the previous data the column City is not required for finding the student progression, hence it can be dropped using the drop() method.

new2 = df.drop(['City'], axis=1, inplace=False)
print(new2)

where,
axis : Takes value 0 for 'index' and 1 for 'columns'
inplace : inplace=True makes the changes in the original DataFrame, and inplace=False leaves the original unchanged and returns a new DataFrame
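The difference between the two inplace settings can be sketched as follows. The two-row table is made-up data for illustration, and result is a hypothetical variable name.

```python
import pandas as pd

# Hypothetical two-row table with the same columns as the example above
df = pd.DataFrame({
    "RollNo": [101, 102],
    "Name": ["Abc", "Xyz"],
    "Percent": [80, 85],
    "City": ["Pune", "Mumbai"],
})

# inplace=False (the default): df is untouched and a new DataFrame is returned
new2 = df.drop(columns=["City"])   # same effect as df.drop(['City'], axis=1)

# inplace=True: df itself is changed and the call returns None
result = df.drop(["Name"], axis=1, inplace=True)

print(new2.columns.tolist())
print(df.columns.tolist())
print(result)
```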
3. Treating missing values
• Missing data can occur when no information is provided for one or more items or for a whole unit.
• Missing data can be a very big problem in real-life scenarios.
• Missing values must be handled carefully, as they can be an indication of something important.
• We can fill in missing values using the fillna() method, which lets the user replace NaN values with a value of their own.
Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
where,
value : A scalar, dictionary, Series or DataFrame to fill in instead of NaN.
method : Used if the user doesn't pass any value; the options include bfill, backfill and ffill. (Recent pandas versions deprecate this parameter in favour of the separate ffill() and bfill() methods.)
axis : Takes an int or string value for rows/columns.
inplace : A boolean which makes the changes in the DataFrame itself if True.
3. Treating missing values
Example:
new3 = df['City'].fillna("No Data", inplace=False)
print(new3)

Output:
0       Pune
1     Mumbai
2        Goa
3      Delhi
4       Pune
5    No Data
6     Nagpur
7     Nashik
8     Mumbai
9       Pune
Name: City, dtype: object
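Since value may also be a dictionary, different columns can receive different replacement values in one call. The sketch below assumes a small made-up table; filling Percent with the column mean is one possible choice for illustration, not a recommendation from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "Name": ["Abc", np.nan, "Def"],
    "Percent": [80.0, np.nan, 90.0],
    "City": ["Pune", "Delhi", np.nan],
})

# A dict maps each column name to its own replacement value
filled = df.fillna({"Name": "Unknown", "City": "No Data"})

# A numeric column can be filled with a computed value, here the column mean
filled["Percent"] = filled["Percent"].fillna(df["Percent"].mean())

print(filled)
```

The original df still contains its NaN values, because fillna() returns a new object unless inplace=True is passed.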
