
Rayat Shikshan Sanstha's

KARMAVEER BHAURAO PATIL


COLLEGE, VASHI
[Autonomous College]

Reaccredited by NAAC with Grade 'A+' (CGPA 3.53) | ISO 9001:2008 Certified Institute

‘Best College’ Award by University of Mumbai

[ DEPARTMENT OF INFORMATION TECHNOLOGY ]

CERTIFICATE

This Is To Certify That


Mr./Ms. YASH VIJAY PANWAL
Student of T.Y.B.Sc.IT. Class From Karmaveer Bhaurao Patil College,
Vashi [Autonomous], Navi Mumbai Has Satisfactorily Completed The
Practical Course In Subject DATA SCIENCE As per The Syllabus Laid By
The University Of Mumbai During The Academic Year 2023-24.

ROLL NO.: 237738
EXAM NO.: 237738

SAHIL VICHARE MADHURI GABHANE


Date:__/__/2023 Head, Department of IT

External Examiner
INDEX
SR. No.   Practical Name                                      Sign
1         Practical of Principal Component Analysis
2         Practical of Clustering
3         Practical of Time-series forecasting
4         Practical of Simple/Multiple Linear Regression
5         Practical of Logistic Regression
6         Practical of Hypothesis testing
7         Practical of Analysis of Variance
8         Practical of Decision Tree

Yash Vijay Panwal


Roll No.: 237738
Batch : A2
PRACTICAL NO.: 1

Aim: Practical of Principal Component Analysis.

Program :
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Generate synthetic data
np.random.seed(0)
X = np.random.rand(100, 3) # 100 samples with 3 features
# Standardise the data
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
# Create a PCA object with the number of components you want
pca = PCA(n_components=2) # Reduce to 2 principal components
# Fit the PCA model to the standardised data
pca.fit(X_std)
# Transform the data to the first 2 principal components
X_pca = pca.transform(X_std)
explained_variance = pca.explained_variance_ratio_
print("Explained Variance Ratios:", explained_variance)
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title('PCA of Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

Output :

Explanation :
Import the necessary libraries:
numpy for numerical operations.
sklearn.decomposition.PCA for performing PCA.
matplotlib.pyplot for data visualisation.
sklearn.preprocessing.StandardScaler for standardising the data.
Generate synthetic data:
A random 3-dimensional dataset with 100 samples is created using
np.random.rand.
Standardise the data:
The data is standardised using StandardScaler to have a mean of 0 and a
standard deviation of 1 for each feature.

Create a PCA object:
An instance of the PCA class is created with the number of components set to 2.
This means the data will be reduced to 2 principal components.
Fit the PCA model:
The PCA model is fitted to the standardised data using the .fit method.
Transform the data:
The data is transformed into the first two principal components using
the .transform method and stored in the X_pca variable.
Calculate and print explained variance ratios:
The explained variance ratios for each principal component are calculated using
the .explained_variance_ratio_ attribute and printed.
Create a scatter plot:
A scatter plot is created to visualise the data in the reduced 2-dimensional space.
The x-axis represents the first principal component, and the y-axis represents
the second principal component.
Display the plot:
Finally, the plot is displayed using plt.show().
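
As a cross-check on what the explained variance ratios mean, the same values can be recovered by hand from the eigenvalues of the covariance matrix of the standardised data. The following is a minimal NumPy-only sketch under that assumption; it is not how scikit-learn computes PCA internally, and the helper name explained_variance_ratio is hypothetical.

```python
import numpy as np

def explained_variance_ratio(X, n_components):
    """Hypothetical helper: variance share of the top principal components."""
    # Standardise each feature to mean 0 and standard deviation 1
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # Covariance matrix of the standardised features
    cov = np.cov(X_std, rowvar=False)
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending
    # order, so reverse them to get the largest first
    eigenvalues = np.linalg.eigvalsh(cov)[::-1]
    return eigenvalues[:n_components] / eigenvalues.sum()

np.random.seed(0)
X = np.random.rand(100, 3)  # same synthetic data as the program above
ratios = explained_variance_ratio(X, n_components=2)
print("Explained Variance Ratios:", ratios)
```

Each ratio is one eigenvalue divided by the sum of all eigenvalues, so with 3 features the two retained components always account for at least two thirds of the total variance.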

PRACTICAL NO.: 2

Aim: Practical of Clustering

Program :

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data
n_samples = 300
n_features = 2
n_clusters = 3
X, _ = make_blobs(n_samples=n_samples, n_features=n_features,
                  centers=n_clusters, random_state=42)

# Create a KMeans clustering model
kmeans = KMeans(n_clusters=n_clusters, random_state=42)

# Fit the model to your data
kmeans.fit(X)

# Predict the cluster labels for each data point
labels = kmeans.predict(X)

# Plot the data points and color them by cluster
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=200, c='red', label='Cluster Centers')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.title('Clustering Results')
plt.show()

Output :

Explanation :

Import the necessary libraries:
numpy for numerical operations (imported, but not used directly in this program).
matplotlib.pyplot for data visualization.
sklearn.cluster.KMeans for performing k-means clustering.
sklearn.datasets.make_blobs for generating synthetic clustered data.
Generate synthetic data:
A 2-dimensional dataset with 300 samples grouped around 3 centers is created
using make_blobs. The true labels returned by make_blobs are discarded
(assigned to _), and random_state=42 makes the data reproducible.
Create a KMeans clustering model:
An instance of the KMeans class is created with n_clusters set to 3, so the
algorithm looks for 3 clusters. random_state=42 fixes the centroid initialisation.
Fit the model:
The model is fitted to the data using the .fit method, which estimates the
cluster centers.
Predict the cluster labels:
Each data point is assigned to its nearest cluster center using the .predict
method, and the result is stored in the labels variable.
Create a scatter plot:
The data points are plotted and colored by cluster label using the viridis color
map. The cluster centers, taken from the .cluster_centers_ attribute, are drawn
on top as large red markers.
Set labels, legend and title:
The axis labels are set using plt.xlabel and plt.ylabel, a legend identifying the
cluster centers is added using plt.legend, and the title is set using plt.title.
Display the plot:
Finally, the plot is displayed using plt.show().
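
To make the fitting step of the KMeans program less of a black box, the alternating assign-and-update loop at the heart of k-means can be written out directly. This is a minimal sketch assuming plain Lloyd iterations with randomly chosen starting points; scikit-learn's KMeans uses a more careful initialisation, and the names kmeans_sketch, n_iter and seed are hypothetical. The test data is built by hand rather than with make_blobs.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=20, seed=0):
    """Hypothetical helper: plain Lloyd iterations with random initial centers."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centers
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: distance from every point to every center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(axis=0)
    # Final assignment so the labels match the returned centers
    distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1), centers

# Three hand-made groups of 2-D points (assumed data for illustration)
rng = np.random.default_rng(42)
group_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in group_centers])

labels, centers = kmeans_sketch(X, k=3)
print("Cluster sizes:", np.bincount(labels, minlength=3))
print("Cluster centers:\n", centers)
```

With a poor random start, plain Lloyd iterations can settle on a worse grouping than the obvious one, which is one reason library implementations restart or initialise more carefully.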
PRACTICAL NO.: 3

Aim: Practical of Time-series forecasting

Program :

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA

# Generate synthetic time series data
np.random.seed(0)
n = 100
time = pd.date_range(start='2023-01-01', periods=n, freq='D')
values = np.cumsum(np.random.randn(n))
time_series = pd.Series(values, index=time)

plt.figure(figsize=(10, 6))
plt.plot(time_series)
plt.title('Time Series Data')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()

# Use the entire time series for training
train_data = time_series

# Create and fit the ARIMA model (p, d, q)
p, d, q = 1, 1, 1
model = ARIMA(train_data, order=(p, d, q))
results = model.fit()

# Make forecasts for future time points (e.g., next 10 periods)
forecast_periods = 10
forecast_values = results.forecast(steps=forecast_periods)

# Create a date range for the forecasted periods
forecast_index = pd.date_range(start=time_series.index[-1] + pd.DateOffset(1),
                               periods=forecast_periods, freq='D')

# Plot the original time series and forecasts
plt.figure(figsize=(10, 6))
plt.plot(time_series, label='Original Time Series')
plt.plot(forecast_index, forecast_values, 'r--', label='Forecast')
plt.title('Time Series Forecasting with ARIMA')
plt.xlabel('Date')
plt.ylabel('Value')
plt.legend()
plt.show()

Output :

Explanation :

Import the necessary libraries:

numpy for numerical operations.
pandas for time series data manipulation.
matplotlib.pyplot for data visualisation.
statsmodels.tsa.arima.model.ARIMA for ARIMA modelling.
Generate synthetic time series data:
A synthetic time series dataset is created with 100 data points. The time series is
generated with a random walk process using np.cumsum to accumulate random
values over time. The time index is created with daily frequency starting from
'2023-01-01'.
Plot the time series data:
The code creates a line plot to visualise the generated time series data. The x-
axis represents the date, and the y-axis represents the value of the time series.
This is done using plt.plot.
Use the entire time series for training:
The entire time series is used for training the ARIMA model. This is stored in
the train_data variable.
Create and fit the ARIMA model:
An ARIMA model is created with specified orders (p, d, q) where p is the
autoregressive order, d is the differencing order, and q is the moving average
order. The model is fitted to the training data using model.fit().
Make forecasts for future time points:
The code makes forecasts for future time points by calling
results.forecast(steps=forecast_periods). In this example, forecasts are made for
the next 10 periods.
Plot the original time series and forecasts:
Another line plot is created to visualise both the original time series and the
forecasts. The original time series is plotted in blue, and the forecasted values
are plotted in red with a dashed line. The forecasted values are aligned with the
corresponding future time points. A legend is added to distinguish the original
time series from the forecasts.
Display the forecasted plot:

Finally, the plot showing the original time series and the forecasted values is
displayed using plt.show().
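
The differencing order d = 1 in the ARIMA order (1, 1, 1) means the model works on the changes between consecutive values rather than on the values themselves. The short NumPy sketch below illustrates only that one idea, assuming the same random-walk data as the program; it does not fit an ARIMA model.

```python
import numpy as np

np.random.seed(0)
steps = np.random.randn(100)      # the random increments
values = np.cumsum(steps)         # random walk, as in the program above

# First difference: values[t] - values[t-1]
diffed = np.diff(values)

# Differencing a random walk recovers its increments (all but the first)
assert np.allclose(diffed, steps[1:])

# Summing the differences back up from the first value rebuilds the series
rebuilt = values[0] + np.concatenate(([0.0], np.cumsum(diffed)))
assert np.allclose(rebuilt, values)

print("Std of series:     ", values.std())
print("Std of differences:", diffed.std())
```

Because a random walk is just accumulated noise, its first difference is stationary noise, which is why one round of differencing is a natural choice for this synthetic series.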

PRACTICAL NO.:4

Aim: Practical of Simple/Multiple Linear Regression

Program :

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate synthetic data: y = 2x + 1 plus Gaussian noise
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X + 1 + np.random.randn(100, 1)

# Create the model and fit it to the data
model = LinearRegression()
model.fit(X, y)

# Predict y for the same X values
y_pred = model.predict(X)

# Plot the data points and the fitted regression line
plt.scatter(X, y, label='Data')
plt.plot(X, y_pred, color='red', linewidth=2, label='Linear Regression')
plt.xlabel('Independent Variable (X)')
plt.ylabel('Dependent Variable (y)')
plt.legend()
plt.show()

Output :

Explanation :

Import the necessary libraries:

numpy for numerical operations.

sklearn.linear_model.LinearRegression for linear regression modelling.

matplotlib.pyplot for data visualisation.

Generate synthetic data:

Random data is generated for the independent variable (X) and dependent
variable (y). The relationship between X and y is linear with some random
noise. The np.random.randn function is used to add random noise to the linear
relationship.

Create a LinearRegression model:

An instance of the LinearRegression model is created, which will be used to fit
the linear regression to the data.

Fit the model to the data:

The model.fit(X, y) method is used to fit the linear regression model to the data.
This process estimates the coefficients of the linear relationship between X and
y.

Make predictions:

The model is used to make predictions on the same independent variable (X),
and the predicted values are stored in the y_pred variable using the
model.predict(X) method.

Create a scatter plot:

A scatter plot is created to visualize the original data points. The data points are
represented by blue dots using plt.scatter.

Plot the linear regression line:

A red line is plotted to represent the linear regression model's fit to the data. The
y_pred values are plotted against the original X values. This line is drawn using
plt.plot with a specified color and linewidth.

Set labels and legend:

X and y-axis labels are set using plt.xlabel and plt.ylabel. A legend is added to
distinguish between the data and the linear regression line using plt.legend.

Display the plot:

Finally, the plot is displayed using plt.show().
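
The program above covers the simple case with one independent variable. For the multiple case named in the aim, the same least-squares idea extends to several variables. The sketch below is one possible illustration using NumPy's least-squares solver instead of scikit-learn; the two-variable relationship y = 1 + 2*x1 - 3*x2 and the noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.random((n, 2))                      # two independent variables
noise = rng.normal(scale=0.1, size=n)
# Assumed true relationship for this sketch
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + noise

# Add a column of ones so the first coefficient acts as the intercept
A = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, slopes = coef[0], coef[1:]
print("Intercept:", intercept)
print("Slopes:", slopes)
```

The estimated intercept and slopes should land close to the assumed values of 1, 2 and -3, with small deviations caused by the added noise.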

PRACTICAL NO.:5

Aim: Practical of Logistic Regression

Program :

import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Generate values for the x-axis
x = np.linspace(-6, 6, 100)

# Calculate corresponding y values using the sigmoid function
y = sigmoid(x)

# Generate some data points
data_x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4])
data_y = np.array([0.05, 0.1, 0.15, 0.4, 0.6, 0.7, 0.85, 0.9, 0.95])

# Create a plot to visualize the S-shaped curve and data points
plt.figure(figsize=(8, 6))
plt.plot(x, y, label='Sigmoid Function', color='blue', linewidth=2)
plt.scatter(data_x, data_y, color='red', marker='o', label='Data Points', s=100)
plt.xlabel('X')
plt.ylabel('Sigmoid(X)')
plt.title('S-shaped Curve with Data Points')
plt.axhline(0.5, color='green', linestyle='--', label='Threshold (0.5)')
plt.legend()
plt.grid(True)
plt.show()

Output :

Explanation :

Importing libraries:

import numpy as np: This imports the NumPy library and gives it the alias np to
make it easier to use in the code.

import matplotlib.pyplot as plt: This imports the Matplotlib library's pyplot
module and gives it the alias plt for creating plots.

Sigmoid function definition:

The sigmoid function takes a value x as input and returns the sigmoid function's
output, which is calculated as 1 / (1 + np.exp(-x)). This is the formula for the
sigmoid function.

Generating values for the x-axis:

x = np.linspace(-6, 6, 100): This line creates an array x that spans from -6 to 6
with 100 evenly spaced values. These values will be used as the x-coordinates
for the sigmoid curve.

Calculating corresponding y values using the sigmoid function:

y = sigmoid(x): This line computes the corresponding y-values for the sigmoid
curve by applying the sigmoid function to each value in the x array.

Generating data points:

data_x and data_y represent some data points that you want to plot on the same
graph.

data_x is an array of x-coordinates for the data points.

data_y is an array of y-coordinates for the data points.

Creating the plot:

plt.figure(figsize=(8, 6)): This line creates a figure for the plot with a specified
size of 8x6 inches.

plt.plot(x, y, label='Sigmoid Function', color='blue', linewidth=2): This plots the
sigmoid curve using the x and y values, with a blue color and a line width of 2.
It's given the label 'Sigmoid Function' for the legend.

plt.scatter(data_x, data_y, color='red', marker='o', label='Data Points', s=100):
This plots the data points as red dots (markers) with a label 'Data Points'. The s
parameter sets the size of the markers.

plt.xlabel('X') and plt.ylabel('Sigmoid(X)'): These set the labels for the x-axis
and y-axis, respectively.

plt.title('S-shaped Curve with Data Points'): This sets the title for the plot.

plt.axhline(0.5, color='green', linestyle='--', label='Threshold (0.5)'): This adds a
horizontal dashed line at y = 0.5 to represent the threshold of 0.5 for the sigmoid
function.

plt.legend(): This displays the legend on the plot to label the sigmoid curve and
data points.

plt.grid(True): This adds a grid to the plot.

plt.show(): This displays the plot.
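
The program above only draws the sigmoid curve; it does not estimate any model parameters. The sketch below shows one way a logistic regression could actually be fitted, using plain gradient descent on the log-loss. It rests on an assumption not made in the original program: the data_y values are thresholded at 0.5 to obtain 0/1 class labels. The learning rate and iteration count are likewise illustrative choices, and in practice a library class such as scikit-learn's LogisticRegression would normally be used instead.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# x values from the program above; the 0/1 labels are an assumption made
# for this sketch (data_y thresholded at 0.5)
data_x = np.array([-4, -3, -2, -1, 0, 1, 2, 3, 4], dtype=float)
data_y = np.array([0.05, 0.1, 0.15, 0.4, 0.6, 0.7, 0.85, 0.9, 0.95])
labels = (data_y >= 0.5).astype(float)

w, b = 0.0, 0.0          # weight and intercept, both starting at zero
learning_rate = 0.1
for _ in range(10000):
    p = sigmoid(w * data_x + b)              # predicted probabilities
    # Gradient of the mean log-loss with respect to w and b
    grad_w = np.mean((p - labels) * data_x)
    grad_b = np.mean(p - labels)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Classify each point by comparing its predicted probability with 0.5
predictions = (sigmoid(w * data_x + b) >= 0.5).astype(float)
print("w:", w, "b:", b)
print("Predicted classes:", predictions)
```

Because these nine points are perfectly separable, the weight keeps growing the longer the loop runs. This is a known property of unregularised logistic regression and is why library implementations usually apply regularisation by default.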

PRACTICAL NO.:6

Aim: Practical of Hypothesis testing.

Program :

import numpy as np
from scipy import stats

# Generate sample data for Group A and Group B
np.random.seed(0)
group_a = np.random.normal(loc=100, scale=10, size=30) # Group A data
group_b = np.random.normal(loc=105, scale=10, size=30) # Group B data

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(group_a, group_b)

# Define the significance level (alpha)
alpha = 0.05

# Print the results
print(f'T-Statistic: {t_statistic}')
print(f'P-Value: {p_value}')

# Perform a significance test
if p_value < alpha:
    print('Reject the null hypothesis. There is a significant difference between the groups.')
else:
    print('Fail to reject the null hypothesis. There is no significant difference between the groups.')

Output :

Explanation :

Importing libraries:

import numpy as np: This imports the NumPy library and gives it the alias np to
make it easier to use in the code.

from scipy import stats: This imports the stats module from the SciPy library,
which provides statistical tests, including the t-test.

Generating sample data:

np.random.seed(0): This fixes the random seed so that the same samples are
generated on every run.

group_a = np.random.normal(loc=100, scale=10, size=30): This draws 30 values
for Group A from a normal distribution with mean 100 and standard deviation 10.

group_b = np.random.normal(loc=105, scale=10, size=30): This draws 30 values
for Group B from a normal distribution with mean 105 and standard deviation 10.

Performing the two-sample t-test:

t_statistic, p_value = stats.ttest_ind(group_a, group_b): This performs an
independent two-sample t-test. The null hypothesis is that the two groups have
the same population mean. By default, ttest_ind assumes equal variances in the
two groups and returns a two-sided p-value.

Defining the significance level:

alpha = 0.05: This sets the significance level to 0.05, the threshold against
which the p-value is compared.

Printing the results:

The two print statements display the t-statistic and the p-value returned by the
test.

Performing the significance test:

if p_value < alpha: If the p-value is below the significance level, the null
hypothesis is rejected and the code reports a significant difference between the
groups. Otherwise, the code reports that it fails to reject the null hypothesis.
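
The t-statistic reported by stats.ttest_ind can be reproduced by hand, which helps show what the test measures: the difference in sample means divided by its standard error. The following is a minimal sketch of the equal-variance (pooled) form, which is the default behaviour of ttest_ind; the helper name pooled_t_statistic is hypothetical.

```python
import numpy as np

def pooled_t_statistic(a, b):
    """Hypothetical helper: equal-variance two-sample t-statistic."""
    n_a, n_b = len(a), len(b)
    # Unbiased sample variances (ddof=1)
    var_a, var_b = np.var(a, ddof=1), np.var(b, ddof=1)
    # Pooled variance weights each group by its degrees of freedom
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    standard_error = np.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (np.mean(a) - np.mean(b)) / standard_error

np.random.seed(0)
group_a = np.random.normal(loc=100, scale=10, size=30)
group_b = np.random.normal(loc=105, scale=10, size=30)

t_manual = pooled_t_statistic(group_a, group_b)
print("Manual t-statistic:", t_manual)
print("Degrees of freedom:", len(group_a) + len(group_b) - 2)
```

The p-value is then obtained from the t distribution with n_a + n_b - 2 degrees of freedom, which is the step the library performs internally.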

PRACTICAL NO.:7

Aim: Practical of Analysis of Variance.

Program :

import numpy as np
from scipy import stats

# Create sample data for three groups (A, B, and C)
group_A = [65, 70, 75, 80, 85]
group_B = [75, 80, 85, 90, 95]
group_C = [60, 70, 80, 90, 100]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(group_A, group_B, group_C)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05 # Set the significance level
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference among group means.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference among group means.")

Output :

Explanation :

1. Importing libraries:

- `import numpy as np`: Imports the NumPy library with the alias `np` for
numerical operations.

- `from scipy import stats`: Imports the `stats` module from the SciPy library,
which contains statistical functions and tests, including ANOVA.

2. Creating sample data for three groups:

- `group_A`, `group_B`, and `group_C` are Python lists representing the data
for three groups. Each group contains five sample values.

3. Performing one-way ANOVA:

- `f_statistic, p_value = stats.f_oneway(group_A, group_B, group_C)`: This
line performs a one-way ANOVA test on the three groups. It calculates the
F-statistic and the corresponding p-value.

- The F-statistic is a test statistic that measures the variance between group
means relative to the variance within the groups.

- The p-value is the probability of obtaining an F-statistic at least as large as
the one observed, assuming the null hypothesis (all group means are equal) is
true.

4. Printing the results:

- `print("F-statistic:", f_statistic)`: This line prints the F-statistic obtained from
the ANOVA test.

- `print("p-value:", p_value)`: This line prints the p-value obtained from the
ANOVA test.

5. Interpreting the results:


- `alpha = 0.05`: The significance level, `alpha`, is set to 0.05. This is the
threshold at which you'll decide whether to reject or fail to reject the null
hypothesis.

- `if p_value < alpha:`: This condition checks whether the p-value is less than
the significance level. If it is, you reject the null hypothesis.

- `print("Reject the null hypothesis: There is a significant difference among
group means.")`: If the condition is met, this message is printed, indicating that
there is a significant difference among the group means.

- If the condition is not met, the code prints, "Fail to reject the null hypothesis:
There is no significant difference among group means."
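
The F-statistic for these three groups is simple enough to work out by hand, which makes the "between versus within" description concrete. The sketch below recomputes it with NumPy from the sums of squares; the variable names are chosen for this illustration.

```python
import numpy as np

groups = [
    np.array([65, 70, 75, 80, 85], dtype=float),    # group_A
    np.array([75, 80, 85, 90, 95], dtype=float),    # group_B
    np.array([60, 70, 80, 90, 100], dtype=float),   # group_C
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

print("SS between:", ss_between)   # 250.0
print("SS within:", ss_within)     # 1500.0
print("F-statistic:", f_manual)    # 1.0
```

Both mean squares equal 125, so F = 1.0 with (2, 12) degrees of freedom. The corresponding p-value is roughly 0.40, well above 0.05, so the program takes the "fail to reject" branch.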

PRACTICAL NO.:8

Aim: Practical of Decision Tree

Program :

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import tree

data = load_iris()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

plt.figure(figsize=(12, 8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names,
               class_names=data.target_names)
plt.show()

Output :

Explanation :
1.Importing libraries :
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from sklearn import tree

load_iris: This function is used to load the Iris dataset, a well-known dataset in
machine learning.
DecisionTreeClassifier: This class is used to create a Decision Tree Classifier
model.
train_test_split: This function is used to split the dataset into training and testing
sets.
accuracy_score: This function is used to calculate the accuracy of the model.
matplotlib.pyplot: This library is used for data visualization.
tree: This module is used for visualizing the decision tree.

2.Load the Iris Dataset:

data = load_iris()
X = data.data
y = data.target
These lines load the Iris dataset into the data variable, then store the feature
matrix in X and the class labels in y.

3.Split the Dataset into Training and Testing Sets:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
This code splits the dataset into training (X_train, y_train) and testing (X_test,
y_test) sets. It uses 70% of the data for training and 30% for testing, and
random_state is set for reproducibility.

4.Create a Decision Tree Classifier:

clf = DecisionTreeClassifier(random_state=42)
This line initialises a Decision Tree Classifier model. Setting random_state
makes the fitted tree reproducible.

5.Train the Model:


clf.fit(X_train, y_train)
Here, the model is trained on the training data.

6.Make Predictions on the Test Data:


y_pred = clf.predict(X_test)
The model makes predictions on the test data.

7.Evaluate Model Accuracy:


accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
This code calculates and prints the accuracy of the model on the test data.

8.Visualize the Decision Tree:


plt.figure(figsize=(12, 8))
tree.plot_tree(clf, filled=True, feature_names=data.feature_names,
class_names=data.target_names)
plt.show()
This part of the code creates a visualization of the decision tree model using
matplotlib and the tree.plot_tree function. It shows the structure of the decision
tree, including the split criteria and class labels.
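
Each node in the plotted tree reports a gini value, because DecisionTreeClassifier uses the Gini impurity as its default split criterion. The sketch below shows how that number is computed and how a candidate split is scored. The helper names and the toy labels are hypothetical and are not taken from the Iris data.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_split_impurity(left, right):
    """Impurity of a split, weighting each child by its share of samples."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Toy labels (hypothetical, for illustration only)
parent = ["a"] * 5 + ["b"] * 5
left = ["a"] * 4 + ["b"] * 1
right = ["a"] * 1 + ["b"] * 4

print("Parent impurity:", gini_impurity(parent))   # 0.5
print("Split impurity: ", weighted_split_impurity(left, right))
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent. Here it drops from 0.5 to about 0.32, and a node containing a single class has impurity 0.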
