
DATA ANALYSIS FOR MANAGERS

CIA -II

By

ARATHY SUJAYA VENU

REGISTER NUMBER

2227610

Under the Guidance of

PROF. ANCHAL PATIL

MBA PROGRAMME
SCHOOL OF BUSINESS AND MANAGEMENT
CHRIST (DEEMED TO BE UNIVERSITY), BANGALORE

OCTOBER 2022
INTRODUCTION

In this study, a variety of hypothesis testing approaches are used, and their findings, inferences, and
conclusions are examined. The dataset used covers Audi used-car listings. Various hypothesis testing
techniques are applied to it to provide insights on business issues faced by Audi.

Total number of fields in the dataset: 9

Total number of records in the dataset: 10668

Source link: https://www.kaggle.com/datasets/mysarahmadbhat/audi-used-car-listings

ABOUT THE SOURCE:

The data set includes details on the pricing, engine size, fuel type, transmission, mileage, and miles per
gallon (mpg).

METHOD

The original data used for the hypothesis testing consists of 10668 records. To gain a
broad understanding of the sample dataset, the point and interval estimates of the sample as well as the
proportions of categorical data will be generated before delving into the hypothesis testing.
For the hypothesis testing, a variety of methodologies will be used, including:

➢ One-way ANOVA test


➢ One-tailed t-test
➢ Two-tailed t-test
➢ Regression and Correlation analysis
➢ Chi-Square Test
➢ Point and Interval Estimate
➢ Central limit theorem
Business Problem: The company sells cars. Using statistical techniques, we identify patterns
that emerge in the sale of a car.
REGRESSION:

Regression is used to ascertain whether an independent variable X and a dependent variable Y have a
meaningful linear relationship. The regression line's slope is the main focus of the test.

Y = a + bX

CHI-SQUARE

The chi-square statistic measures the difference between the expected and observed
frequencies of a group of events or variables. It is helpful for analysing these
differences in categorical variables, especially those having nominal values.
Degrees of freedom, sample size, and the degree of disparity between expected and observed
values all play a part. It may be applied to test whether two variables are related.
Additionally, it can be used to assess how well an observed distribution fits a hypothetical
distribution of frequencies.

ANOVA:

A one-way ANOVA (analysis of variance) tests whether the means of three or more groups defined by a
single independent variable are equal. A two-way ANOVA is an extension of the one-way ANOVA that exposes
the effects of two independent variables on a dependent variable.
ANOVA is used extensively in finance, economics, research, medicine, and social science.

CORRELATION:

The degree of a relationship between two variables is expressed by their correlation coefficient, one of
the most widely used statistical measures. This metric evaluates the direction and strength of a
linear relationship between two variables. Its value always falls between -1 (a perfect negative
relationship) and +1 (a perfect positive relationship). Values close to zero indicate a weak or
nonexistent linear relationship.

ONE SAMPLE T-TEST:

The one-sample t-test determines whether a sample mean is significantly greater or less than a
given value by comparing the mean to that value. An independent-samples t-test, by contrast, compares
the means of two groups.

TWO SAMPLE T-TESTS:

A two-sample t-test, commonly referred to as an independent t-test, analyses the difference between
two unknown population means.
POINT AND INTERVAL ESTIMATE

A point estimate is a single value estimate of a parameter. For instance, a sample mean is
a point estimate of a population mean. An interval estimate gives you a range of values where the
parameter is expected to lie. A confidence interval is the most common type of interval estimate.

TEST

1. ANALYSIS OF VARIANCE (ANOVA):

HYPOTHESIS
Null Hypothesis: The mean selling prices for all of the years are the same.
Alternate Hypothesis: At least one year's mean selling price differs from the others.

H0: μ1 = μ2 = μ3

Ha: Means are not all equal.


TEST

ACTION
From the table, we observe that the test statistic is 135.32 and the F critical value is 2.99.
Since F calculated (test statistic) > F critical value, we reject the null hypothesis at a
5% level of significance.
Also, the P value is 0.000000…… and the alpha value is 0.05. Since p-value < alpha
value, we reject the null hypothesis at a 5% level of significance.
So, the mean selling prices for the years 2018, 2019, and 2020 are found not to be equal.
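
The decision rule above can be sketched in a few lines of Python. This is only an illustration, not the Excel procedure used in the report: the function names and the toy groups are assumptions, and only the figures 135.32 and 2.99 come from the ANOVA table.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]
    # Between-group sum of squares: spread of group means around the grand mean
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Within-group sum of squares: spread of observations around their group mean
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def reject_null(f_stat, f_critical):
    """Reject H0 when the F statistic exceeds the critical value."""
    return f_stat > f_critical


# Toy groups standing in for three model years (illustrative values, not the Audi data)
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_toy, df_b, df_w = one_way_anova_f(toy_groups)

# Decision using the figures reported in the table
reported_decision = reject_null(135.32, 2.99)  # -> True
```

With the real data, each group would hold the selling prices for one year, and the critical value would have to match the resulting degrees of freedom.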

BUSINESS INFERENCE

The mean selling prices for the years 2018, 2019, and 2020 are not all equal. This indicates that the
price and the year are related.
2. ONE SAMPLE T-TEST

Null Hypothesis: The average mileage driven is less than or equal to 24827.244
Alternate Hypothesis: The average mileage driven is more than 24827.244

H0: 𝜇 ≤ 24827.244
Ha: 𝜇 > 24827.244

TEST
We use the upper-tail test, since Ha: 𝜇 > 24827.244, with a level of significance of 𝛼 = 0.05
A random sample of size n = 30 was selected with sample mean x̄ = 18335.36
The test statistic is calculated as t = (x̄ − 𝜇) / (s / √n), where s is the sample standard deviation
The t critical value is found to be 1.699 for 𝛼 = 0.05, df = 29
We check if the test statistic is greater than the critical value. If so, we reject the null
hypothesis.
We also check if the corresponding p-value is less than 𝛼
ACTION
The t statistic −1.90 < critical value 1.699, so it does not fall in the rejection region. The reported
p-value of 0.033 is the lower-tail probability; the upper-tail p-value for this test is
1 − 0.033 = 0.967 > 𝛼 = 0.05. Therefore we fail to reject the null hypothesis at the 5% level of
significance
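
A minimal Python sketch of this test follows. For the stated hypotheses (H0: 𝜇 ≤ 𝜇0 against Ha: 𝜇 > 𝜇0), H0 is rejected only when t exceeds the critical value. The function names and the toy sample are assumptions for illustration; the values −1.90 and 1.699 are the ones reported above.

```python
import math
from statistics import mean, stdev


def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)
    return (mean(sample) - mu0) / standard_error


def reject_upper_tail(t_stat, t_critical):
    """For Ha: mu > mu0, reject H0 only when t exceeds the critical value."""
    return t_stat > t_critical


# Toy sample (illustrative values, not the mileage data)
t_toy = one_sample_t([1, 2, 3, 4, 5], 2)

# Decision using the figures reported in the text
reported_decision = reject_upper_tail(-1.90, 1.699)  # -> False, fail to reject H0
```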

BUSINESS INFERENCE

At the 5% level of significance, there is no evidence that the average mileage exceeds 24827.244 miles.
The sample is consistent with the company selling cars whose average mileage is at most 24827.244 miles.

3. TWO SAMPLE T-TEST

Hypothesis Testing -Two Population Mean (Two Mean):


Null hypothesis: The average selling price of diesel cars is equal to that of petrol cars
Alternate Hypothesis: The average selling price of diesel cars is not equal to that of petrol cars
H0: 𝜇d = 𝜇p
Ha: 𝜇d ≠ 𝜇p
TEST

The level of significance is taken as 𝛼=0.05


The t critical value is found to be 2.007 for 𝛼=0.05
ACTION

The test statistic 2.72 > critical value 2.007, so we reject the null hypothesis at the 5% level of
significance.
The p-value is 0.0087 < 0.05, so we reject the null hypothesis at a 5% level of significance.
This means that we reject the null hypothesis stating that the average selling price of diesel
cars is equal to the average selling price of petrol cars.
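
The two-tailed comparison can be sketched in Python as below. The report does not say whether equal variances were assumed, so the pooled-variance form used here is an assumption, as are the function names and toy samples. Only 2.72 and 2.007 are taken from the text.

```python
import math
from statistics import mean, variance


def pooled_two_sample_t(a, b):
    """Return (t, df) for a two-sample t-test assuming equal population variances."""
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_stat = (mean(a) - mean(b)) / standard_error
    return t_stat, n1 + n2 - 2


def reject_two_tailed(t_stat, t_critical):
    """For Ha: mu1 != mu2, reject H0 when |t| exceeds the critical value."""
    return abs(t_stat) > t_critical


# Toy samples (illustrative values, not the diesel and petrol prices)
t_toy, df_toy = pooled_two_sample_t([1, 2, 3], [4, 5, 6])

# Decision using the figures reported in the text
reported_decision = reject_two_tailed(2.72, 2.007)  # -> True
```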

BUSINESS INFERENCE

Diesel and petrol cars sell for significantly different prices on average. This suggests that
the average selling price of diesel cars is higher than that of petrol cars. One possible
explanation is that diesel cars offer better fuel economy, so buyers are willing to pay more for
them, although the test itself does not establish the reason.

4. CORRELATION:

Scatter Diagram: A scatter diagram is a useful tool for understanding the nature of
correlation. An investigator can use a scatter diagram to assess whether two variables
have a positive, negative, or no association. A scatter diagram presents paired
observations of two variables graphically in the X-Y plane.

Correlation coefficient:

For a sample of n observations selected on two variables X and Y, the sample correlation
coefficient r, measures the degree to which there is a linear association between two
interval-scaled variables.
Using the data set, we examine the correlation between the Selling Price of the cars and
how long each car has been in use. Taking X as Years and Y as the Selling Price for a sample
size n=10668, the following scatter plot was obtained.
The correlation could be positive, negative, or zero. The correlation coefficient is always
between -1 and +1.
Here the correlation coefficient r is negative, at -0.592.
The correlation is negative because one variable decreases as the other increases.
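
As a rough illustration of how r is computed, the sketch below implements the sample correlation coefficient directly from its definition. The function name and the small age and price lists are hypothetical. They are not the Audi records, so the resulting value is not the -0.592 reported above.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical car ages (years) and prices: price falls as age rises
ages = [1, 2, 3, 4, 5]
prices = [30000, 27000, 25000, 20000, 18000]
r_toy = pearson_r(ages, prices)  # negative, since price decreases with age
```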

[Figure: scatter plot of price (vertical axis) against years (horizontal axis)]

BUSINESS INFERENCE

According to the above analysis, the selling price of a car decreases as the period of the car's usage
increases. This means that more recent cars tend to be more expensive.
5. REGRESSION

Simple linear regression models a dependent variable as a linear function of a single independent
variable. A model that connects the dependent and independent variables using a linear equation is
developed based on the sample data collected for the dependent and independent variables. The sample
regression line is represented symbolically as follows:

Y = a + bX, where
Y is the dependent/response variable
X is the independent variable/feature/predictor
a and b are constants.
This is the line of best fit.

a and b are determined by the least squares method; b is called the regression
coefficient (slope), and a is the constant term (intercept).
[Figure: Normal Probability Plot of price against Sample Percentile]

Here b is the slope of the regression line and a is its intercept.

The dependent variable of this test is the selling price, while the independent variable is
duration. In regression statistics, the R Square value represents the proportion of the
variance in the response variable of a regression model that the predictor variables can
explain. This value ranges from 0 to 1

The R Square value of 0.355152 tells us that duration explains around 35.5% of the variance in the selling price.

a= 38588.128, b= -3202.77, from this the line of best fit can be written as

Y= 38588.128 – 3202.77X
Since the R square value is not very high, this means that the regression model is not
very accurate
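
The least squares calculation and the use of the fitted line for prediction can be sketched as follows. The function names and the toy data are assumptions for illustration; the coefficients 38588.128 and -3202.77 are the ones reported above.

```python
def fit_line(xs, ys):
    """Least squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(xs, ys, a, b):
    """Share of the variance in ys explained by the line a + b*x."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total


# Coefficients reported in the text
A_REPORTED = 38588.128
B_REPORTED = -3202.77


def predict_price(x):
    """Predicted selling price from the reported line of best fit."""
    return A_REPORTED + B_REPORTED * x


# Toy data lying exactly on Y = 1 + 2X (illustrative only)
a_toy, b_toy = fit_line([0, 1, 2], [1, 3, 5])
r2_toy = r_squared([0, 1, 2], [1, 3, 5], a_toy, b_toy)

price_at_5 = predict_price(5)  # 38588.128 - 3202.77 * 5 = 22574.278
```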

We can validate the model statistically by looking at R^2 as well as the F statistic in the
ANOVA that tests the null hypothesis of no linear relationship. After statistical
validation, the model can be used for estimation and prediction.

REGRESSION ANOVA:
We can check if there exists a linear relationship between the variables by using
ANOVA.
HYPOTHESIS
Null Hypothesis: There is no linear relationship between Y and X in the population
regression line
Alternative Hypothesis: There is a linear relationship between Y and X in the Population
Regression Line
H0: β1 = 0

Ha: β1 ≠ 0

The F statistic is the ratio of the regression mean square to the residual mean square.
It is also given in the Excel ANOVA output.

ACTION
Here, Significance F (the p-value) is shown as 0.00. At a 5% level of significance,
we reject the null hypothesis since the p-value is less than 0.05.
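
The F ratio can be sketched in Python as below. For a regression with a single predictor, the same ratio can also be written in terms of R Square and the number of records n. The function names are assumptions. The value computed from R Square = 0.355152 and n = 10668 is only what those two figures imply, not a number copied from the Excel output.

```python
def f_from_mean_squares(ms_regression, ms_residual):
    """F statistic as the ratio of regression to residual mean square."""
    return ms_regression / ms_residual


def f_from_r_squared(r_squared, n):
    """F for simple linear regression: (R^2 / 1) / ((1 - R^2) / (n - 2))."""
    return r_squared / ((1 - r_squared) / (n - 2))


# Toy check: R^2 = 0.5 with n = 12 gives F = 10
f_toy = f_from_r_squared(0.5, 12)

# F implied by the R Square and record count quoted in this report
f_implied = f_from_r_squared(0.355152, 10668)
```

An F value this large is consistent with the near-zero p-value shown in the ANOVA output.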
BUSINESS INFERENCE

There is a linear relationship between the selling price and the length of time an automobile has been
in use. The selling price of an automobile decreases by 3202.77 units for every one-unit increase in the
years that it has been driven. The fitted line puts the selling price of a car at 38588.128 when X = 0,
that is, before it has been used.

6. POINT AND INTERVAL ESTIMATE


The point estimate for the price is 20619.6. Interval estimates for the price at a
confidence level of 95%
➢ Upper confidence limit (95%) = 20619.6 + 1085.16 ≈ 21704.79
➢ Lower confidence limit (95%) = 20619.6 – 1085.16 ≈ 19534.46
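
The interval arithmetic can be checked with a short Python sketch. The function names are assumptions, and the use of a normal (z) critical value in `mean_confidence_interval` is also an assumption, since the report does not say how the margin of error was obtained. With the rounded figures 20619.6 and 1085.16, the limits come out within a few hundredths of the reported values.

```python
import math
from statistics import NormalDist, mean, stdev


def interval_from_margin(point, margin):
    """Lower and upper confidence limits from a point estimate and margin of error."""
    return point - margin, point + margin


def mean_confidence_interval(sample, confidence=0.95):
    """Return (point, margin, lower, upper) for the mean, using a z critical value."""
    n = len(sample)
    point = mean(sample)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / math.sqrt(n)
    return point, margin, point - margin, point + margin


# Limits from the point estimate and margin of error quoted in the text
lower, upper = interval_from_margin(20619.6, 1085.16)  # 19534.44 and 21704.76

# Toy sample (illustrative values, not the price data)
point_toy, margin_toy, lower_toy, upper_toy = mean_confidence_interval([1, 2, 3, 4, 5])
```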

7. CENTRAL LIMIT THEOREM


The central limit theorem (CLT) states that the distribution of sample means
approximates a normal distribution as the sample size gets larger, regardless of the
population's distribution.

Sample sizes equal to or greater than 30 are often considered sufficient for the CLT to
hold.
So, we are going to apply the CLT to Fuel Consumption in the city/litre.
In the first step, we draw a random sample of 30 from the population using
the sampling function in the data analysis tool.
This step is repeated 30 times, giving 30 samples of size 30.
After repeating this process 30 times, we take the average of each sample column.
Once this is done, we can see that the sample means are approximately normally
distributed.
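
The sampling steps above can be imitated with a small simulation. This is a sketch under stated assumptions: the population is a made-up right-skewed set of 10668 values rather than the Audi column, and the function name, seeds, and scale are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev


def sample_means(population, sample_size=30, n_samples=30, seed=0):
    """Draw n_samples random samples without replacement and return their means."""
    rng = random.Random(seed)
    return [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]


# Hypothetical right-skewed population of 10668 values (not the real data)
population_rng = random.Random(42)
population = [population_rng.expovariate(1 / 20000) for _ in range(10668)]

means = sample_means(population)

# The sample means cluster around the population mean with a much smaller spread
centre_of_means = mean(means)
spread_of_means = stdev(means)
population_mean = mean(population)
population_spread = stdev(population)
```

Plotting `means` as a histogram would show the bell shape the CLT predicts, even though the population itself is skewed.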

[Figure: Histogram of the sample means (Frequency by Bin, with a 2-period moving average of Frequency)]
BUSINESS INFERENCE

The population for this test consists of 10668 records. A random sample of 30 is drawn 30 times, and the
mean of all the sample averages is 23433.33, which lies near the peak of the histogram. The sample means
are approximately normally distributed.

8. CHI-SQUARE

Null Hypothesis: Fuel type (petrol or diesel) and price are independent


Alternate Hypothesis: Fuel type (petrol or diesel) and price are dependent
ACTION
We find that the test statistic 170.7 > critical value 5.991. Since the test statistic is
greater than the critical value, we reject the null hypothesis at a 5% level of significance.
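
A minimal sketch of the chi-square calculation for a contingency table is given below. The function name and the 2 x 3 table of counts are hypothetical, since the observed frequencies are not shown in the text. Only 170.7 and 5.991 come from the report. A critical value of 5.991 corresponds to 2 degrees of freedom at the 5% level, which is why a 2 x 3 table is used here.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts: rows are two fuel types, columns are three categories
toy_table = [[10, 20, 30],
             [30, 20, 10]]
chi2_toy, df_toy = chi_square_independence(toy_table)  # 20.0 with 2 degrees of freedom

# Decision using the figures reported in the text
reported_decision = 170.7 > 5.991  # -> True, reject H0
```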

BUSINESS INFERENCE
This suggests that the price of a car and the type of fuel it uses are associated.
Therefore, we might contend that fuel type is one of the factors related to the price of the
car, alongside others such as the vehicle's quality and maintenance.
