
Experiment No: 6 Z-TEST

Aim:
The aim of a z-test is to determine whether the mean of a sample is statistically
significantly different from the known or hypothesized population mean.

Algorithm:
The algorithm for a one-sample z-test involves the following steps:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no significant
difference between the sample mean and the population mean, while the alternative hypothesis (Ha) is
that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to test the
hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect a random sample from the population of interest, and calculate the sample mean
and sample standard deviation.
4. Calculate the test statistic: Calculate the z-test statistic using the formula z = (x̄ - μ) / (σ / √n), where x̄ is the sample mean, μ is the population mean, σ is the population standard deviation, and n is the sample size (a small worked sketch follows this algorithm).
5. Determine the critical value: Determine the critical value of z at the chosen level of significance from the standard normal distribution (unlike the t-test, the z-test does not involve degrees of freedom).
6. Compare the test statistic to the critical value: If the absolute value of the test statistic is greater than the critical value, reject the null hypothesis; otherwise, fail to reject the null hypothesis.
7. Interpret the results: If the null hypothesis is rejected, it can be concluded that the sample mean is
significantly different from the population mean at the chosen level of significance. If the null hypothesis
is not rejected, it can be concluded that there is not enough evidence to support the alternative
hypothesis.
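
The following is a minimal worked sketch of steps 4-6; the sample values and population parameters here are illustrative assumptions, not data from this experiment:

import numpy as np
from scipy.stats import norm

pop_mean = 100          # hypothesized population mean (assumption)
pop_sd = 15             # known population standard deviation (assumption)
sample = np.array([112, 105, 118, 99, 121, 108, 95, 116, 110, 103])  # hypothetical sample
n = len(sample)
# Step 4: z = (x̄ - μ) / (σ / √n)
z = (sample.mean() - pop_mean) / (pop_sd / np.sqrt(n))
# Step 5: two-sided critical value at alpha = 0.05
z_crit = norm.ppf(1 - 0.05 / 2)
# Step 6: compare |z| with the critical value
print("z =", z, "critical value =", z_crit, "reject H0:", abs(z) > z_crit)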

Program:
import math
import numpy as np
from numpy.random import randn
from statsmodels.stats.weightstats import ztest
# Generate a random array of 50 numbers with mean 110 and standard deviation
# 15/sqrt(50), i.e. the standard error of the mean for IQ scores (population sd 15, n = 50)
mean_iq = 110
sd_iq = 15 / math.sqrt(50)
alpha = 0.05
null_mean = 100
data = sd_iq * randn(50) + mean_iq
# Print mean and standard deviation
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))
# Perform the test
ztest_Score, p_value = ztest(data, value=null_mean, alternative='larger')
# Compare the p-value with alpha
if p_value < alpha:
    print("Reject Null Hypothesis")
else:
    print("Fail to Reject Null Hypothesis")

Output:
mean=109.65 stdv=2.02

Reject Null Hypothesis

Result:
Thus the python program for performing the z-test is executed successfully.
Experiment No: 7 T-TEST

Aim:
The aim of a t-test is to determine whether the mean of a sample is statistically significantly different from
the hypothesized population mean.

Algorithm:
The algorithm for a one-sample t-test involves the following steps:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no significant
difference between the sample mean and the population mean, while the alternative hypothesis (Ha) is
that there is a significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to test the
hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect a random sample from the population of interest, and calculate the sample mean
and sample standard deviation.
4. Calculate the test statistic: Calculate the t-test statistic using the formula t = (x̄ - μ) / (s / √n), where x̄ is the sample mean, μ is the hypothesized population mean, s is the sample standard deviation, and n is the sample size (a small worked sketch follows this algorithm).
5. Determine the degrees of freedom: Determine the degrees of freedom for the t distribution using the
formula: df = n - 1.
6. Determine the critical value: Determine the critical value of t at the chosen level of significance and
degrees of freedom.
7. Compare the test statistic to the critical value: If the absolute value of the test statistic is greater than
the critical value, reject the null hypothesis. If the absolute value of the test statistic is less than the critical
value, fail to reject the null hypothesis.
8. Interpret the results: If the null hypothesis is rejected, it can be concluded that the sample mean is
significantly different from the hypothesized population mean at the chosen level of significance. If the
null hypothesis is not rejected, it can be concluded that there is not enough evidence to support the
alternative hypothesis.
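
Note that the program below illustrates a two-sample (independent) t-test. A minimal sketch of the one-sample test described above, on hypothetical data, is:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.5, scale=1.0, size=20)   # hypothetical sample
hypothesized_mean = 2.0                            # hypothesized population mean (assumption)
# Step 4: t = (x̄ - μ) / (s / √n)
t_manual = (sample.mean() - hypothesized_mean) / (sample.std(ddof=1) / np.sqrt(len(sample)))
# Cross-check with SciPy's built-in one-sample t-test
t_stat, p_val = stats.ttest_1samp(sample, popmean=hypothesized_mean)
print("manual t =", t_manual)
print("scipy t =", t_stat, "p =", p_val)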

Program:
import numpy as np
from scipy import stats
# Defining sample size
N = 10
# Gaussian distributed data with mean = 2 and var = 1
x = np.random.randn(N) + 2
# Gaussian distributed data with mean = 0 and var = 1
y = np.random.randn(N)
# Calculating the variance to get the standard deviation
var_x = x.var(ddof=1)
var_y = y.var(ddof=1)
# Pooled standard deviation (assumes equal sample sizes and variances)
SD = np.sqrt((var_x + var_y) / 2)
print("Standard Deviation =", SD)
# Calculating the T-Statistics
tval = (x.mean() - y.mean()) / (SD * np.sqrt(2 / N))
# Degrees of freedom
dof = 2 * N - 2
# Upper-tail probability from the t distribution
pval = 1 - stats.t.cdf(tval, df=dof)
print("t = " + str(tval))
print("p = " + str(2 * pval))  # two-sided p-value (double the one-sided tail)
## Cross Checking using the internal function from SciPy Package
tval2, pval2 = stats.ttest_ind(x, y)
print("t = " + str(tval2))
print("p = " + str(pval2))

Output:
Standard Deviation = 1.123877698207986
t = 4.198892678764282
p = 0.001675908259288206
t = 4.198892678764282
p = 0.001675908259288206

Result:
Thus the python program for performing the t-test is executed successfully.
Experiment No: 8 ANOVA

Aim:
The aim of an ANOVA (Analysis of Variance) is to determine whether there is a significant difference
between the means of three or more groups.

Algorithm:
The algorithm for a one-way ANOVA involves the following steps:
1. State the null and alternative hypotheses: The null hypothesis (H0) is that there is no significant
difference between the means of the groups, while the alternative hypothesis (Ha) is that there is a
significant difference.
2. Determine the level of significance: Choose the level of significance, α, that will be used to test the
hypothesis. Typically, α is set at 0.05 or 0.01.
3. Collect data: Collect data from three or more groups, and calculate the mean and variance for each
group.
4. Calculate the sum of squares between groups: Calculate the sum of squares between groups using the formula SSbetween = Σ ni(x̄i - x̄)², where ni is the sample size for group i, x̄i is the mean of group i, and x̄ is the overall mean.
5. Calculate the sum of squares within groups: Calculate the sum of squares within groups using the formula SSwithin = ΣΣ(xij - x̄j)², where xij is the ith observation in the jth group and x̄j is the mean of group j (a small Python sketch of steps 4-6 follows this algorithm).
6. Calculate the F-statistic: Calculate the F-statistic using the formula: F = (SSbetween / (k-1)) / (SSwithin /
(N-k)) where k is the number of groups, and N is the total number of observations.
7. Determine the critical value: Determine the critical value of F at the chosen level of significance and
degrees of freedom.
8. Compare the F-statistic to the critical value: If the F-statistic is greater than the critical value, reject the
null hypothesis. If the F-statistic is less than the critical value, fail to reject the null hypothesis.
9. Interpret the results: If the null hypothesis is rejected, it can be concluded that there is a significant
difference between the means of the groups at the chosen level of significance. If the null hypothesis is
not rejected, it can be concluded that there is not enough evidence to support the alternative hypothesis.
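
The following is a minimal Python sketch of steps 4-6 on three hypothetical groups, cross-checked against scipy.stats.f_oneway; the group values are illustrative assumptions:

import numpy as np
from scipy import stats

groups = [np.array([23, 25, 27, 22]), np.array([30, 32, 29, 31]), np.array([26, 28, 27, 25])]  # hypothetical data
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()
k, N = len(groups), len(all_obs)
# Step 4: SSbetween = Σ ni (x̄i - x̄)²
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Step 5: SSwithin = ΣΣ (xij - x̄j)²
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Step 6: F = (SSbetween / (k - 1)) / (SSwithin / (N - k))
F = (ss_between / (k - 1)) / (ss_within / (N - k))
F_scipy, p_value = stats.f_oneway(*groups)
print("manual F =", F, "scipy F =", F_scipy, "p =", p_value)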

Program:
# Installing the package
install.packages("dplyr")
# Loading the package
library(dplyr)
# Visualize within-group and between-group variation in displacement
boxplot(mtcars$disp ~ factor(mtcars$gear), xlab = "gear", ylab = "disp")
# Step 1: Setup Null Hypothesis and Alternate Hypothesis
# H0: mu1 = mu2 = mu3 (There is no difference between average displacement for different gear)
# H1: Not all means are equal
# Step 2: Calculate test statistics using aov function
mtcars_aov <- aov(mtcars$disp ~ factor(mtcars$gear))
summary(mtcars_aov)
# Step 3: Choose the significance level, alpha = 0.05
# Step 4: Compare the p-value from the ANOVA table (Pr(>F)) with alpha and conclude the test
p_value <- summary(mtcars_aov)[[1]]$`Pr(>F)`[1]
alpha <- 0.05
if (p_value < alpha) {
  print("Reject Null Hypothesis: There is a significant difference between at least two group means.")
} else {
  print("Fail to Reject Null Hypothesis: There is no significant difference between group means.")
}

Result:
Thus the R program for performing ANOVA is executed successfully.
Experiment No: 9 BUILDING AND VALIDATING LINEAR MODELS
Aim:
The aim of building and validating linear models is to create a model that accurately describes the
relationship between a dependent variable and one or more independent variables, and to determine
whether the model is a good fit for the data.
Algorithm: The algorithm for building and validating linear models involves the following steps:
1. Collect data: Collect data on the dependent variable and one or more independent variables.
2. Choose a linear model: Choose a linear model that describes the relationship between the dependent
variable and independent variable(s). A simple linear model has one independent variable, while a
multiple linear model has two or more independent variables.
3. Estimate model coefficients: Use a statistical software package to estimate the coefficients of the linear model that best fit the data. The most common method for doing this is least squares regression.
4. Evaluate model fit: Evaluate the fit of the model by examining the residual plots, which show the difference between the predicted and actual values of the dependent variable. A good model will have residuals that are randomly distributed around zero, with no discernible patterns.
5. Test for significance: Test the significance of the model by calculating the p-value for the overall F-test
of the model. A low p-value indicates that the model is a good fit for the data.
6. Evaluate individual coefficients: Evaluate the significance of individual coefficients in the model by
calculating their t-values and p-values. A low p-value indicates that the coefficient is significant and should
be included in the model.
7. Validate the model: Validate the model by testing it on new data that was not used to estimate the coefficients. This can be done by using a hold-out sample, or by using cross-validation techniques.
8. Refine the model: Refine the model by making adjustments to the model specification, such as adding or removing variables, transforming variables, or adding interaction terms.
9. Interpret the results: Interpret the coefficients of the model in terms of the relationship between the
dependent variable and independent variable(s).
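
A minimal sketch of steps 3, 5 and 6, fitting an ordinary least squares model with statsmodels on simulated data; the variables and coefficients here are assumptions for illustration only:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=50)                 # hypothetical independent variable
y = 3.0 + 2.0 * x + rng.normal(0, 1, size=50)   # hypothetical dependent variable with noise
X = sm.add_constant(x)                          # add an intercept term
model = sm.OLS(y, X).fit()                      # least squares estimation (step 3)
print(model.summary())                          # overall F-test, t-values and p-values (steps 5-6)
print("Residual mean (should be close to zero):", model.resid.mean())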

Program:
# Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston  # note: load_boston is deprecated and was removed in scikit-learn 1.2; this requires an older scikit-learn version
sns.set(style="ticks", color_codes=True)
plt.rcParams['figure.figsize'] = (8, 5)
plt.rcParams['figure.dpi'] = 150
# Loading the data
boston = load_boston()
# You can check those keys with the following code.
print(boston.keys())
# The output will be as follows:
# dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
print(boston.DESCR)
# You will find these details in the output:
# Attribute Information (in order):
# — CRIM: per capita crime rate by town
# — ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
# — INDUS: proportion of non-retail business acres per town
# — CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
# — NOX: nitric oxides concentration (parts per 10 million)
# — RM: average number of rooms per dwelling
# — AGE: proportion of owner-occupied units built prior to 1940
# — DIS: weighted distances to five Boston employment centres
# — RAD: index of accessibility to radial highways
# — TAX: full-value property-tax rate per $10,000
# — PTRATIO: pupil-teacher ratio by town
# — B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
# — LSTAT: % lower status of the population
# — MEDV: Median value of owner-occupied homes in $1000's
# Missing Attribute Values: None
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target  # add the target (median home value) as a column so it can be correlated and plotted below
print(df.head()) # print the top 5 rows in the dataset
print(df.columns) # print the columns present in the dataset

# Plotting heatmap for overall dataset
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
plt.title('Correlation Heatmap of Overall Dataset')
plt.show()
# Plotting regression plot to visualize correlation between 'RM' and 'MEDV'
sns.lmplot(x='RM', y='MEDV', data=df)
plt.title('Regression Plot: RM vs MEDV')
plt.xlabel('Average Number of Rooms per Dwelling (RM)')
plt.ylabel('Median Value of Owner-Occupied Homes (MEDV)')
plt.show()
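
The program above only loads and explores the data. A possible continuation (a sketch, not part of the original program) fits a simple linear model of MEDV on RM and validates it on a hold-out sample, following steps 3 and 7 of the algorithm:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Hold-out validation: fit on the training split, evaluate on unseen data
X = df[['RM']]
y_target = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y_target, test_size=0.3, random_state=42)
lin_model = LinearRegression().fit(X_train, y_train)
y_pred = lin_model.predict(X_test)
print("Intercept:", lin_model.intercept_, "Slope:", lin_model.coef_[0])
print("Hold-out R^2:", r2_score(y_test, y_pred))
print("Hold-out MSE:", mean_squared_error(y_test, y_pred))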

Result:
Thus the python program for building and validating linear models is executed successfully.
Experiment No: 10 BUILDING AND VALIDATING LOGISTIC MODELS
Aim:
The aim of building and validating logistic models is to create a model that accurately predicts the
probability of a binary outcome (e.g., success or failure) based on one or more independent variables, and
to determine whether the model is a good fit for the data.

Algorithm: The algorithm for building and validating logistic models involves the following steps:
1. Collect data: Collect data on the binary outcome variable and one or more independent variables.
2. Choose a logistic model: Choose a logistic model that describes the relationship between the dependent variable and independent variable(s). A simple logistic model has one independent variable, while a multiple logistic model has two or more independent variables.
3. Estimate model coefficients: Use a statistical software package to estimate the coefficients of the
logistic model that best fit the data. The most common method for doing this is maximum likelihood
estimation.
4. Evaluate model fit: Evaluate the fit of the model by examining the goodness-of-fit statistics, such as the
deviance, the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC). A good
model will have a low deviance and low values of AIC and BIC.
5. Test for significance: Test the significance of the model by calculating the p-value for the overall chi-square test of the model. A low p-value indicates that the model is a good fit for the data.
6. Evaluate individual coefficients: Evaluate the significance of individual coefficients in the model by
calculating their Wald test statistics and p-values. A low p-value indicates that the coefficient is significant
and should be included in the model.
7. Validate the model: Validate the model by testing it on new data that was not used to estimate the coefficients. This can be done by using a hold-out sample, or by using cross-validation techniques.
8. Refine the model: Refine the model by making adjustments to the model specification, such as adding or removing variables, transforming variables, or adding interaction terms.
9. Interpret the results: Interpret the coefficients of the model in terms of the relationship between the
independent variable(s) and the probability of the binary outcome.
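
A minimal self-contained sketch of steps 3, 4 and 6 on simulated data, assuming the statsmodels Logit results expose the aic, bic and llf attributes; the simulated predictor and coefficients are assumptions for illustration:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x1 = rng.normal(size=100)                               # hypothetical predictor
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x1)))                 # logistic probability of success
ybin = rng.binomial(1, p)                               # hypothetical binary outcome
res = sm.Logit(ybin, sm.add_constant(x1)).fit(disp=0)   # maximum likelihood estimation (step 3)
print("AIC:", res.aic, "BIC:", res.bic)                 # information criteria (step 4)
print("Deviance (-2 * log-likelihood):", -2 * res.llf)
print(res.summary())                                    # Wald statistics and p-values (step 6)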

Program:
# importing libraries
import statsmodels.api as sm
import pandas as pd
# loading the training dataset
df = pd.read_csv('logit_train1.csv', index_col=0)
# Handling missing values if any
df.dropna(inplace=True)
# defining the dependent and independent variables
Xtrain = df[['gmat', 'gpa', 'work_experience']]
ytrain = df['admitted']
# building the model and fitting the data
log_reg = sm.Logit(ytrain, sm.add_constant(Xtrain)).fit()
# Print the summary of the logistic regression model
print(log_reg.summary())
Output:
(Logistic regression summary table from log_reg.summary(), not reproduced here)

Predicting on New Data:


# loading the testing dataset
df = pd.read_csv('logit_test1.csv', index_col=0)
# defining the dependent and independent variables
Xtest = df[['gmat', 'gpa', 'work_experience']]
ytest = df['admitted']
# performing predictions on the test dataset
yhat = log_reg.predict(sm.add_constant(Xtest))
predictions = list(map(round, yhat))
# comparing original and predicted values of y
print('Actual values:', list(ytest.values))
print('Predictions:', predictions)

Output:
Actual values: [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]
Predictions: [1, 0, 1, 0, 1, 0, 1, 0, 0, 1]

Testing the accuracy of the model:


from sklearn.metrics import confusion_matrix, accuracy_score
# confusion matrix
cm = confusion_matrix(ytest, predictions)
print("Confusion Matrix:\n", cm)
# accuracy score of the model
print('Test accuracy =', accuracy_score(ytest, predictions))

Output:
(Confusion matrix and test accuracy, not reproduced here)

Result:
Thus the python program for building and validating logistic models is executed successfully.
Experiment No: 11 TIME SERIES ANALYSIS

Aim:
The aim of performing time series analysis is to model and forecast the behaviour of a time series data
over a period of time, using statistical methods, in order to identify patterns, trends, and seasonality in
the data.

Algorithm:
The algorithm for performing time series analysis involves the following steps:
1. Collect data: Collect data on the time series variable over a period of time.
2. Visualize the data: Plot the time series data to identify patterns, trends, and seasonality.
3. Decompose the time series: Decompose the time series into its components, which are trend,
seasonality, and residual variation. This can be done using techniques such as moving averages,
exponential smoothing, or the Box-Jenkins method.
4. Model the trend: Model the trend component of the time series using techniques such as linear
regression, exponential smoothing, or ARIMA models.
5. Model the seasonality: Model the seasonality component of the time series using techniques such as
seasonal decomposition, dummy variables, or Fourier series.
6. Model the residual variation: Model the residual variation component of the time series using
techniques such as autoregressive models, moving average models, or ARIMA models.
7. Choose the best model: Evaluate the fit of the different models using measures such as AIC, BIC, and
RMSE, and choose the model that best fits the data.
8. Forecast future values: Use the chosen model to forecast future values of the time series variable.
9. Validate the model: Validate the model by comparing the forecasted values with actual values from a
hold-out sample, or by using cross-validation techniques.
10. Refine the model: Refine the model by making adjustments to the model specification, such as adding
or removing variables, transforming variables, or adding interaction terms.
11. Interpret the results: Interpret the results of the time series analysis in terms of the patterns, trends,
and seasonality of the data, and use the forecasted values to make predictions and inform decision-
making.
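
A minimal sketch of step 3 (decomposition) using statsmodels' seasonal_decompose on a simulated monthly series; the additive model splits the series into trend, seasonal and residual components, and the series itself is an assumption for illustration:

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range('2015-01-01', periods=48, freq='MS')                      # 4 years of monthly dates
values = 2 * np.arange(48) + 10 * np.sin(2 * np.pi * np.arange(48) / 12) + np.random.randn(48)
series = pd.Series(values, index=idx)                                         # trend + yearly seasonality + noise
decomposition = seasonal_decompose(series, model='additive', period=12)
print(decomposition.trend.dropna().head())
print(decomposition.seasonal.head())
print(decomposition.resid.dropna().head())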

Program:
import warnings
import itertools
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
# Ignore warnings
warnings.filterwarnings("ignore")
# Set plot style
plt.style.use('fivethirtyeight')
# Read the data from Excel file
df = pd.read_excel("Superstore.xls")
# Filter the data to focus only on furniture sales
furniture = df.loc[df['Category'] == 'Furniture'].copy()  # .copy() avoids SettingWithCopyWarning when columns are dropped below
# Display the range of dates for furniture sales
print("Furniture Sales Data Range:")
print("Start Date:", furniture['Order Date'].min())
print("End Date:", furniture['Order Date'].max())

Data Pre-processing:
# Define the columns to be dropped
cols_to_drop = ['Row ID', 'Order ID', 'Ship Date', 'Ship Mode', 'Customer ID',
'Customer Name', 'Segment', 'Country', 'City', 'State',
'Postal Code', 'Region', 'Product ID', 'Category',
'Sub-Category', 'Product Name', 'Quantity', 'Discount', 'Profit']
# Drop the unnecessary columns
furniture.drop(cols_to_drop, axis=1, inplace=True)
# Sort the data by 'Order Date'
furniture.sort_values('Order Date', inplace=True)
# Check for missing values
print("Missing values in furniture dataset:")
print(furniture.isnull().sum())
# Aggregate sales by date
furniture_sales_by_date = furniture.groupby('Order Date')['Sales'].sum().reset_index()
# Display the aggregated data
print("\nAggregated furniture sales by date:")
print(furniture_sales_by_date.head())

Indexing with Time Series Data


# Set 'Order Date' column as index
furniture = furniture.set_index('Order Date')
# Access the index
print(furniture.index)
Visualizing Furniture Sales Time Series Data
# Resample the daily sales to monthly averages so the series is evenly spaced,
# then store it in 'y' for plotting and modelling
y = furniture['Sales'].resample('MS').mean()
y.plot(figsize=(15, 6))
plt.show()
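
A possible continuation of the program (a sketch, assuming 'y' is the monthly series defined above): fitting a seasonal ARIMA model and producing a short forecast, covering steps 6-8 of the algorithm; the (1, 1, 1)x(1, 1, 1, 12) order is an illustrative assumption, not a tuned choice.

mod = sm.tsa.statespace.SARIMAX(y,
                                order=(1, 1, 1),
                                seasonal_order=(1, 1, 1, 12),
                                enforce_stationarity=False,
                                enforce_invertibility=False)
results = mod.fit(disp=False)
print(results.summary().tables[1])           # coefficient table
print("AIC:", results.aic)                   # model comparison criterion (step 7)
forecast = results.get_forecast(steps=12)    # forecast the next 12 months (step 8)
print(forecast.predicted_mean)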

Result: Thus the python program for performing time series analysis is executed successfully.
