
EX_NO: 1 WORKING WITH PANDAS DATA FRAMES

Aim:
To write a Python program to work with pandas data frames.

Algorithm:
1. Import the pandas library: To use pandas, you need to import the library first.
2. Read data into a pandas data frame: Use the read_csv(), read_excel(), or read_sql()
functions to read data from a file or database into a pandas data frame.
3. Explore the data: Use functions like head(), tail(), info(), describe(), and shape to get a
sense of the data you're working with.
4. Select and filter data: Use indexing and slicing to select subsets of the data. You can use
the loc[] and iloc[] methods to select rows and columns based on labels or indices. You can
also filter rows based on conditions using the query() or loc[] method.
5. Manipulate the data: Use pandas functions to manipulate the data, such as adding or
dropping columns, renaming columns, and aggregating data.
6. Handle missing data: Use functions like isnull(), dropna(), fillna(), and interpolate() to
handle missing data in the data frame.
7. Save the data: Use the to_csv(), to_excel(), or to_sql() methods to save the data frame to
a file or database.
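Steps 2 to 7 above can be sketched as follows. This is a minimal illustration, not part of the original exercise: the CSV text, the missing entries, and the new column names (Nation, NameLength) are made up for the example, and an in-memory buffer stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample data; the empty fields stand in for missing values.
csv_text = """Name,Age,Country
Alice,25,USA
Bob,,Canada
Charlie,35,USA
Dave,40,
Emily,45,USA
"""

# Step 2: read CSV text into a data frame (a file path works the same way).
df = pd.read_csv(io.StringIO(csv_text))

# Step 3: explore the data.
print(df.shape)           # (rows, columns)
print(df.head(2))         # first two rows
print(df.isnull().sum())  # missing values per column

# Step 4: select by label (loc), by position (iloc), and by condition (query).
first_two_names = df.loc[0:1, "Name"]  # loc slices include the end label
first_row = df.iloc[0]
over_30 = df.query("Age > 30")

# Step 5: manipulate the data: rename a column and add a derived one.
df = df.rename(columns={"Country": "Nation"})
df["NameLength"] = df["Name"].str.len()

# Step 6: handle missing data: fill missing ages with the mean age,
# then drop rows that still lack a nation.
df["Age"] = df["Age"].fillna(df["Age"].mean())
cleaned = df.dropna(subset=["Nation"])

# Step 7: save; an in-memory buffer is used here instead of a real file.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing a file name to read_csv() and to_csv() instead of the buffers reads from and writes to disk.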

Program:
import pandas as pd
# Create a data frame from a dictionary of lists
data = {'Name': ['Alice', 'Bob', 'Charlie', 'Dave', 'Emily'],
'Age': [25, 30, 35, 40, 45],
'Country': ['USA', 'Canada', 'USA', 'Canada', 'USA']}
df = pd.DataFrame(data)
# Print the data frame
print("Original data frame:")
print(df)
# Select rows based on a condition
print("\nSelect rows where Age > 30:")
print(df[df['Age'] > 30])
# Add a new column
df['Gender'] = ['F', 'M', 'M', 'M', 'F']
print("\nData frame with Gender column:")
print(df)
# Group the data by Country and calculate the mean age
grouped = df.groupby('Country')
mean_age = grouped['Age'].mean()
print("\nMean age by Country:")
print(mean_age)
# Sort the data frame by Age in descending order
sorted_df = df.sort_values('Age', ascending=False)
print("\nData frame sorted by Age in descending order:")
print(sorted_df)

Output:
Original data frame:
      Name  Age Country
0    Alice   25     USA
1      Bob   30  Canada
2  Charlie   35     USA
3     Dave   40  Canada
4    Emily   45     USA

Data frame sorted by Age in descending order:
      Name  Age Country Gender
4    Emily   45     USA      F
3     Dave   40  Canada      M
2  Charlie   35     USA      M
1      Bob   30  Canada      M
0    Alice   25     USA      F

Result: Thus the Python program to work with pandas data frames was executed successfully.
EX_NO: 02 BASIC PLOTS USING MATPLOTLIB
Aim: To write a Python program to draw basic plots using Matplotlib.

Algorithm:
1. Import the necessary libraries: You'll need to import both NumPy and Matplotlib in order to create
plots.
2. Create data: You need data to plot! Create your data as NumPy arrays.
3. Create a figure and axes: Before you can plot anything, you need to create a figure and axes. The
figure is the canvas that holds your plot, while the axes are the actual plot area.
4. Plot the data: Now it's time to plot the data. You can do this using the plot() function.
5. Customize the plot: You can customize your plot in many ways, including adding a title, changing the
axis labels, and adding a legend.
6. Save the plot: Finally, you can save your plot to a file using the savefig() function.
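The six steps above can be sketched with Matplotlib's figure-and-axes interface. This is an illustrative sketch only: the data values, labels, and title are assumptions, and the plot is saved to an in-memory buffer rather than a named file.

```python
import io

import numpy as np
import matplotlib.pyplot as plt

# Step 2: create data as NumPy arrays (hypothetical values for illustration).
days = np.arange(1, 6)
temps = np.array([4, 2, 6, 8, 3])

# Step 3: create the figure (canvas) and the axes (plot area).
fig, ax = plt.subplots()

# Step 4: plot the data.
ax.plot(days, temps, marker="o", label="Temp")

# Step 5: customize the plot with a title, axis labels, and a legend.
ax.set_title("Temperature by day")
ax.set_xlabel("Day")
ax.set_ylabel("Temp")
ax.legend()

# Step 6: save the plot. An in-memory buffer is used here; pass a file name
# such as "plot.png" (hypothetical) to write to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Calling plt.show() before plt.close(fig) displays the figure in an interactive session.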

Program:
import matplotlib.pyplot as plt
# Data
a = list(range(1, 6))
b = [0, 0.6, 0.2, 15, 10]
c = [4, 2, 6, 8, 3]
# Plotting
plt.plot(a, label='1st Rep')
plt.plot(b, 'or', label='2nd Rep')
plt.plot(list(range(0, 22, 3)), label='3rd Rep')
plt.plot(c, label='4th Rep')
# Axes labeling
plt.xlabel('Day ->')
plt.ylabel('Temp ->')
# Spines
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_bounds(0, 20) # Adjusted bounds
# Ticks
plt.xticks(range(0, 6))
plt.yticks(range(0, 21, 3))
# Legend
ax.legend()
# Annotation
plt.annotate('Temperature V/s Days', xy=(1.01, -2.15))
# Title
plt.title('All Features Discussed')
# Show plot
plt.show()

Output:

Result: Thus the Python program to draw basic plots using Matplotlib was executed successfully.
EX_NO: 03 FREQUENCY DISTRIBUTIONS, AVERAGES, VARIABILITY

Aim:
To write a Python program to find frequency distributions, averages, and variability.

Algorithm:
1. Initialize an empty dictionary called frequency_distribution.
2. Calculate the sum of the data and store it in a variable called total_sum.
3. Calculate the length of the data and store it in a variable called n.
4. Calculate the mean by dividing total_sum by n and store it in a variable called mean.
5. Calculate the sum of the squared differences between each value in the data and the
mean and store it in a variable called squared_diff_sum.
6. Calculate the variance by dividing squared_diff_sum by n-1 and store it in a variable
called variance.
7. Loop over each value in the data:
a. If the value is not in the frequency_distribution dictionary, add it with a value of 1.
b. If the value is already in the frequency_distribution dictionary, increment its value by 1.
8. Return the frequency_distribution, mean, and variance.
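The algorithm above translates directly into plain Python. The sketch below follows the eight steps; the function name describe is an assumption made for illustration, and the input list is the one used for the variance example in the program.

```python
def describe(data):
    """Return (frequency_distribution, mean, sample variance) for a list of numbers."""
    frequency_distribution = {}                                    # step 1
    total_sum = sum(data)                                          # step 2
    n = len(data)                                                  # step 3
    mean = total_sum / n                                           # step 4
    squared_diff_sum = sum((value - mean) ** 2 for value in data)  # step 5
    variance = squared_diff_sum / (n - 1)                          # step 6, needs n >= 2
    for value in data:                                             # step 7
        if value not in frequency_distribution:
            frequency_distribution[value] = 1
        else:
            frequency_distribution[value] += 1
    return frequency_distribution, mean, variance                  # step 8


freq, mean, variance = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(freq)      # {2: 1, 4: 3, 5: 2, 7: 1, 9: 1}
print(mean)      # 5.0
print(variance)  # 32/7, about 4.571
```

Because step 6 divides by n-1, the result is the sample variance. For the same list, np.var() returns 4.0 because it divides by n unless ddof=1 is passed.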

Program:
# Python program to get the average, variance, and standard deviation of a list
# Importing the NumPy module
import numpy as np

# Calculating the average using average()
my_list1 = [2, 40, 2, 502, 177, 7, 9]
print(np.average(my_list1))
# Output:
# 105.57142857142857

# Calculating the variance using var()
# Note: var() divides by n (population variance) by default; pass ddof=1
# to divide by n-1 as in step 6 of the algorithm.
my_list2 = [2, 4, 4, 4, 5, 5, 7, 9]
print(np.var(my_list2))
# Output:
# 4.0

# Calculating the standard deviation using std() (also divides by n by default)
my_list3 = [290, 124, 127, 899]
print(np.std(my_list3))
# Output:
# 318.35750344541907

Result: Thus the Python program to find frequency distributions, averages, and variability was executed
successfully.
EX_NO: 04 NORMAL CURVES, CORRELATION AND SCATTER PLOTS, CORRELATION COEFFICIENT
Aim:
The aim of finding normal curves, correlation and scatter plots, and correlation coefficients is to better
understand the relationship between variables in a dataset.

Algorithm:
1. Normal Curves: To find a normal curve, we first calculate the mean and standard deviation of
the dataset. Then we use statistical software or a calculator to generate a graph of the normal
distribution. The algorithm for finding a normal curve is as follows:
a. Calculate the mean (µ) and standard deviation (σ) of the dataset.
b. Use statistical software or a calculator to generate a graph of the normal distribution.
c. Plot the normal curve on a graph with the x-axis representing the variable of interest and the
y-axis representing the frequency or probability.

2. Scatter Plots: To create a scatter plot, we first collect data for the two variables that we want to
investigate. Then we plot the data points on a graph and look for patterns or trends. The algorithm
for creating a scatter plot is as follows:
a. Collect data for two variables.
b. Plot the data points on a graph with one variable on the x-axis and the other variable on the y-axis.
c. Look for patterns or trends in the data points.

3. Correlation Coefficient: The correlation coefficient is a measure of the strength and direction of the
relationship between two variables. To calculate the correlation coefficient, we need to use a statistical
formula or software. The algorithm for calculating the correlation coefficient is as follows:
a. Collect data for two variables.
b. Calculate the mean (µ) and standard deviation (σ) of each variable.
c. Calculate the covariance (cov) between the two variables.
d. Calculate the correlation coefficient (r) using the formula:
r = cov / (σ1 * σ2)
where cov is the covariance between the two variables, σ1 is the standard deviation of variable 1, and
σ2 is the standard deviation of variable 2.
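Steps a to d for the correlation coefficient can be sketched in plain Python. The function name pearson_r is an assumption made for illustration. The covariance and both standard deviations are computed with the same divisor (n), so the divisor cancels in the ratio and dividing by n-1 throughout would give the same r.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r = cov / (sigma_x * sigma_y)."""
    n = len(xs)
    # Step b: means and standard deviations of each variable.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sigma_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sigma_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    # Step c: covariance between the two variables.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Step d: correlation coefficient.
    return cov / (sigma_x * sigma_y)


r = pearson_r([15, 18, 21, 24, 27], [25, 25, 27, 31, 32])
print(round(r, 6))  # 0.953463
```

The function assumes both lists have the same length and that neither variable is constant, since a zero standard deviation would cause a division by zero.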

Program:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
# Normal curves
mu, sigma = 0.5, 0.1
s = np.random.normal(mu, sigma, 1000)
# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)
# Correlation and scatter plots
y = pd.Series([1, 2, 3, 4, 3, 5, 4])
x = pd.Series([1, 2, 3, 4, 5, 6, 7])
correlation = y.corr(x)
print("Correlation coefficient:", correlation)

Output: 0.8603090020146067
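The normal curve and the scatter plot described in the algorithm can be drawn as in the sketch below. It is an illustration under stated assumptions rather than part of the original program: the seeded random generator, the two-panel layout, and the axis titles are choices made here for repeatability and readability.

```python
import numpy as np
import matplotlib.pyplot as plt

mu, sigma = 0.5, 0.1
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
s = rng.normal(mu, sigma, 1000)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of the samples with the theoretical normal density drawn on top.
count, bins, _ = ax_hist.hist(s, 20, density=True)
pdf = np.exp(-((bins - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
ax_hist.plot(bins, pdf, linewidth=2, color="r")
ax_hist.set_title("Normal curve")

# Scatter plot of the two series used for the correlation coefficient.
x = np.array([1, 2, 3, 4, 5, 6, 7])
y = np.array([1, 2, 3, 4, 3, 5, 4])
ax_scatter.scatter(x, y)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

fig.tight_layout()
# plt.show()  # uncomment to display the figure interactively
```

With density=True the histogram bars have a total area of 1, so they are on the same scale as the density curve.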

# Function that returns correlation coefficient.
def correlationCoefficient(X, Y, n):
    sum_X = 0
    sum_Y = 0
    sum_XY = 0
    squareSum_X = 0
    squareSum_Y = 0
    i = 0
    while i < n:
        # Sum of elements of array X.
        sum_X += X[i]
        # Sum of elements of array Y.
        sum_Y += Y[i]
        # Sum of X[i] * Y[i].
        sum_XY += X[i] * Y[i]
        # Sum of square of array elements.
        squareSum_X += X[i] * X[i]
        squareSum_Y += Y[i] * Y[i]
        i += 1
    # Use formula for calculating correlation coefficient.
    corr = (n * sum_XY - sum_X * sum_Y) / math.sqrt(
        (n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y)
    )
    return corr


# Driver code
X = [15, 18, 21, 24, 27]
Y = [25, 25, 27, 31, 32]
# Find the size of array.
n = len(X)
# Function call to correlationCoefficient.
print("Correlation coefficient:", '{0:.6f}'.format(correlationCoefficient(X, Y, n)))

Output: 0.953463

Result:
Thus the Python program to find normal curves, correlation and scatter plots, and the
correlation coefficient was executed successfully.
EX_NO: 05 REGRESSION
Aim:
The aim of regression analysis is to examine the relationship between a dependent variable and one or
more independent variables.

Algorithm:
The algorithm for finding a regression equation involves the following steps:
1. Collect data: Collect data on the dependent variable and one or more independent variables.
2. Check for linearity: Check whether there is a linear relationship between the dependent variable and
the independent variables. This can be done by plotting the data on a scatter plot and looking for a
linear pattern.
3. Determine the regression equation: Determine the regression equation that best fits the data. This can
be done using the least squares method.
4. Test the regression equation: Test the regression equation to see if it accurately predicts the value of
the dependent variable. This can be done by comparing the predicted values with the actual values.
5. Interpret the results: Interpret the results to draw conclusions about the relationship between the
dependent variable and the independent variables.
The formula for the regression equation is:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
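Steps 3 and 4 can be sketched for the single-variable case y = b0 + b1*x. The sketch below uses np.polyfit for the least squares fit and the coefficient of determination (R^2) as one possible way to compare predicted and actual values; both are choices made here for illustration, and the data is the same as in the program for this exercise.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

# Step 3: least squares fit of a degree-1 polynomial.
# polyfit returns the coefficients from the highest degree down: (b1, b0).
b1, b0 = np.polyfit(x, y, 1)

# Step 4: compare the predicted values with the actual values.
y_pred = b0 + b1 * x
residuals = y - y_pred
ss_res = np.sum(residuals**2)          # unexplained variation
ss_tot = np.sum((y - np.mean(y)) ** 2)  # total variation
r_squared = 1 - ss_res / ss_tot

print("b0 =", b0, "b1 =", b1)
print("R^2 =", r_squared)
```

An R^2 close to 1 indicates that the fitted line accounts for most of the variation in y; for this data it is about 0.95.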

Program:
import numpy as np
import matplotlib.pyplot as plt


def estimate_coef(x, y):
    # number of observations/points
    n = np.size(x)
    # mean of x and y vector
    m_x = np.mean(x)
    m_y = np.mean(y)
    # calculating cross-deviation and deviation about x
    SS_xy = np.sum(y*x) - n*m_y*m_x
    SS_xx = np.sum(x*x) - n*m_x*m_x
    # calculating regression coefficients
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1*m_x
    return (b_0, b_1)


def plot_regression_line(x, y, b):
    # plotting the actual points as scatter plot
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector
    y_pred = b[0] + b[1]*x
    # plotting the regression line
    plt.plot(x, y_pred, color="g")
    # putting labels
    plt.xlabel('x')
    plt.ylabel('y')
    # function to show plot
    plt.show()


def main():
    # observations / data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    # estimating coefficients
    b = estimate_coef(x, y)
    print("Estimated coefficients:\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    # plotting regression line
    plot_regression_line(x, y, b)


if __name__ == "__main__":
    main()

OUTPUT:

Result: Thus the Python program for regression was executed successfully.
