
Statistics and Econometrics ICA: R

Project
Meenal Chaturvedi
2023-11-19

Research Question Formulation


The research question for this assignment is: what is the impact of weight and the
presence of a sale on the average price of a product?

Data: Electronic Products and Pricing Data
The data contains information on maximum and minimum prices, the weight of the
products, whether a sale is present or not, the company of production, and more.

-For this assignment 350 observations were taken into consideration.

-The data was taken from the Data World website.

-As the data consisted of more qualitative variables than quantitative variables, dummy
variables are considered in the construction of the model.
-Time period: the data covers 2014 to 2018, but the observations are not evenly
spaced over time, so no time-series analysis is performed.

Information on Variables
-The three variables taken into consideration are: Sales, Weight of the product and
Average price.

-Average price was calculated from the maximum and minimum price of the product.

-Presence of a sale is taken as a dummy variable.

-Weight is taken in pounds.

-The model has Sales and Weight as independent variables and Average price as the
dependent variable.

-Abbreviations used are as follows:

1. a_p = Average price
2. Sales = presence of a sale (dummy variable)
3. Weight = weight of the product in pounds

Exploratory Data Analysis


Exploratory Data Analysis (EDA) is a crucial phase in the data analysis process where

the primary goal is to understand the underlying patterns, relationships, and

distributions within the data set. It involves utilizing statistical and visual techniques to

derive insights, identify anomalies, and inform subsequent modeling or hypothesis

testing.

#running data
mydata <- read.csv("C:\\Users\\HP-PC\\Downloads\\file_p.csv")

#Header names
header_names<-c('a_p','Weight','Sale')

#summary
summary(mydata)
## a_p Weight Sale
## Min. : 11.99 Min. : 0.00875 Min. :0.0000
## 1st Qu.: 44.73 1st Qu.: 0.41875 1st Qu.:0.0000
## Median : 89.99 Median : 1.60000 Median :0.0000
## Mean :171.96 Mean : 9.34370 Mean :0.2171
## 3rd Qu.:224.99 3rd Qu.: 8.00000 3rd Qu.:0.0000
## Max. :899.98 Max. :150.00000 Max. :1.0000

The summary above is informative for the a_p and Weight variables but not for the
Sale variable, since Sale is a dummy variable. The average price results suggest that
the values of a_p vary widely, with a mean of about $171.96 and a large range from
the minimum to the maximum. The Weight variable also shows a wide range of values:
the mean is 9.3437 pounds, with a substantial difference between the median and the
mean, indicating a right-skewed distribution.

Scatter Plot
Scatter plots are fundamental for visualizing relationships between two continuous

variables. Each point on the plot represents an observation, making it easy to identify

patterns, clusters, or trends.

Key Insights from Scatter Plots:

Linear Relationships: The pattern of points may suggest a linear correlation.

Clusters: Groups of points may indicate underlying subpopulations.

Outliers: Observations deviating from the general pattern.

#Scatter Plot
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2

ggplot(mydata, aes(x = Weight, y = a_p)) + geom_point()


From the plot we can conclude that the observations are dense on the left. The
concentration of data points on the left side of the scatter plot indicates a cluster of
observations with lower values of a_p, Weight, or both.

Few Outliers on the Right: On the right side of the scatter plot, a few data points
deviate from the main cluster. These outliers have relatively high values compared to
the bulk of the data.

Right Skewness: The overall pattern of the scatter plot is indicative of right (positive)
skewness. In a right-skewed distribution, the tail on the right side is longer than the tail
on the left side; here the skewness is driven by the presence of a few high-valued
outliers on the right.
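
One way to look past this dense cluster is, for example, to redraw the same scatter
plot with both axes on a log scale, which spreads out the many small values:

#same scatter plot with both axes on a log10 scale
ggplot(mydata, aes(x = Weight, y = a_p)) +
  geom_point(alpha = 0.5) +
  scale_x_log10() +
  scale_y_log10()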

Histogram
Histograms are powerful tools for revealing the distribution of a single variable. A
histogram divides the data into bins and represents the frequency or density of
observations within each bin. By examining the shape of the histogram, one can
identify characteristics such as central tendency, spread, skewness, and potential
outliers.

Key Insights from Histograms:


Symmetry: A symmetric distribution shows balance around the center.

Skewness: Skewed distributions (left or right) indicate asymmetry.


Outliers: Extreme values beyond the bulk of the data may be outliers.

#Histogram
ggplot(mydata, aes(x = Weight)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(mydata, aes(x = a_p)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


Similar conclusions can be drawn from the histograms. The bulk of the data points is
concentrated on the left side of each histogram, which highlights the right skewness of
the data set. The degree of this skewness is examined below.
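
The degree of skewness can be quantified, for example, with a simple moment-based
measure in base R, where positive values indicate right skew:

#moment-based sample skewness; positive values indicate a right-skewed variable
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}
skewness(mydata$a_p)
skewness(mydata$Weight)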

Correlation
Correlation measures the degree to which two variables move in relation to each

other. It provides insights into the strength and direction of a linear relationship

between two sets of data. The correlation coefficient, often denoted by “r,” ranges from

-1 to 1, with specific values indicating different types and strengths of correlation.

Strength of Correlation:
A correlation coefficient closer to 1 indicates a strong positive correlation, implying that
as one variable increases, the other tends to increase as well. Conversely, a
coefficient closer to -1 signifies a strong negative correlation, meaning that as one
variable increases, the other tends to decrease.

Direction of Correlation:

Positive correlation implies that the two variables move in the same direction.
Negative correlation suggests that the variables move in opposite directions.

Interpretation of Coefficient Values:

A correlation coefficient near 0 indicates a weak or no linear relationship. The closer


the coefficient is to 1 or -1, the stronger the relationship.

In our case, Sales is a dummy variable, so the as.numeric function is used to convert
the factor variable Sales into a numeric variable before calculating the correlation.
With this we calculate the correlation between a binary variable and a continuous
variable.

#correlation
#Sales vs Weight correlation (Sale is the dummy column in mydata;
#convert the factor to numeric before computing the correlation)
Sales <- factor(mydata$Sale)
Weight <- mydata$Weight
correlation <- cor(as.numeric(Sales), Weight)
print(correlation)

## [1] -0.06787065

#Sales vs a_p correlation

a_p <- mydata$a_p
correlation2 <- cor(as.numeric(Sales), a_p)
print(correlation2)

## [1] 0.002762597

#a_p vs Weight correlation


correlation3<- cor(a_p,Weight)
print(correlation3)

## [1] 0.05753285

Now let us look at the correlation graphs. To draw the correlation plot and matrix I have
installed the corrplot package in R, and for studying the binary variable "Sales" I have
used the box plot method, as shown below.
#installing corrplot package
options(repos = c(CRAN = "https://cran.r-project.org"))
install.packages("corrplot")

## Installing package into 'C:/Users/HP-PC/AppData/Local/R/win-library/4.3'


## (as 'lib' is unspecified)

## package 'corrplot' successfully unpacked and MD5 sums checked


##
## The downloaded binary packages are in
## C:\Users\HP-PC\AppData\Local\Temp\RtmpuGDSNv\downloaded_packages

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.3.2

## corrplot 0.92 loaded

#Correlation matrix
cor_matrix<- cor(mydata)
corrplot(cor_matrix, method = "color")
#Plot for Sales and a_p (plot() with a factor on the x-axis produces box plots)
plot(Sales, a_p,
     main = "Box Plot of a_p by Sales",
     xlab = "Sales",
     ylab = "a_p")

#Plot for Sales and Weight
plot(Sales, Weight,
     main = "Box Plot of Weight by Sales",
     xlab = "Sales",
     ylab = "Weight")
1. Correlation between Sales and Weight: From the plots and matrix it is clear that
Weight and Sales have a negative relationship. The correlation coefficient of
-0.07660315 between Weight and Sale suggests a weak negative linear
relationship between these two variables. The interpretation is as follows:

Strength of the Relationship: The absolute value of the correlation coefficient is
relatively close to 0 (0.0766), indicating a weak linear relationship.
Direction of the Relationship: The negative sign of the correlation coefficient (-0.0766)
indicates a negative or inverse relationship. This means that as weight increases,
sales tend to decrease slightly, and vice versa. However, the strength of this
relationship is weak.

Interpretation of Magnitude: The magnitude of the correlation coefficient is important


for understanding the strength of the relationship. In general, values closer to 1 or -1
represent stronger relationships, while values closer to 0 suggest weaker
relationships.

From the results obtained we can observe that the practical significance of the
relationship may be limited. In practical terms, changes in weight are not strongly
associated with changes in sales.

2. Correlation between Sales and a_p: From the plots and matrix we can see that
there is an extremely weak correlation between Sales and a_p. The correlation
coefficient of -0.009827771 between a_p and Sale suggests an almost negligible
negative linear relationship between these two variables.
Here's the interpretation:

Strength of the Relationship: The absolute value of the correlation coefficient is very
close to 0 (0.0098), indicating an extremely weak linear relationship.
Direction of the Relationship: The negative sign of the correlation coefficient (-0.0098)
suggests a very weak negative or inverse relationship, meaning that as a_p changes
there is only a very slight tendency for Sale to move in the opposite direction.
Interpretation of Magnitude: The value is extremely close to 0, indicating an almost
nonexistent linear relationship.
From these results, the practical significance of this relationship is likely minimal:
changes in a_p are not meaningfully associated with changes in Sale based on this
correlation.
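
If a formal check is desired, base R's cor.test could be used to test whether these
weak correlations are statistically distinguishable from zero, for example:

#significance tests for the pairwise Pearson correlations
cor.test(mydata$Sale, mydata$Weight)
cor.test(mydata$Sale, mydata$a_p)
cor.test(mydata$a_p, mydata$Weight)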

Model Specification
For this model multiple regression was performed. Multiple Regression technique

allows us to explore the relationship between a dependent variable and two or more

independent variables simultaneously. Unlike simple linear regression, which

considers the relationship between a dependent variable and a single independent

variable, multiple regression accounts for the potential influence of multiple predictors.

Key Components of Multiple Regression:

Dependent Variable (Y): The variable we are trying to predict or explain. It is the
outcome or response variable.

Independent Variables (X1, X2, …, Xn): The predictors or features that are believed to
have an impact on the dependent variable. In multiple regression, there can be more
than one independent variable.
Coefficients (β0, β1, β2, …, βn): The coefficients represent the estimated change in
the dependent variable for a one-unit change in the corresponding independent
variable, holding other variables constant.

Intercept (β0): The intercept represents the predicted value of the dependent variable
when all independent variables are set to zero.

Residuals: The differences between the observed and predicted values of the
dependent variable. The goal is to minimize these differences.

For this model the components are as follows:

Dependent variable: a_p
Independent variables: Sales (dummy variable) and Weight
Intercept: β0
Coefficient of Sales: β1
Coefficient of Weight: β2
Residual: ε

Model:

a_p = β0 + β1*Sales + β2*Weight + ε
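
In R this specification could be estimated with a single lm() call on the raw variables,
for example as sketched below; the estimation reported later uses log-transformed
versions of a_p and Weight to address the skewness found in the EDA:

#untransformed version of the specified model: a_p = β0 + β1*Sale + β2*Weight + error
raw_model <- lm(a_p ~ factor(Sale) + Weight, data = mydata)
summary(raw_model)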

Assumptions Checking and Necessary Transformation of Variables
Before running the multiple regression model, it is important that the model fulfils the

below assumptions. Understanding and verifying these assumptions are essential for

the robustness and interpretability of the regression model.

1. Linearity: The assumption of linearity in multiple regression asserts that the


relationship between the dependent variable and each independent variable is
linear. This implies that changes in the independent variables have a consistent
and proportional effect on the dependent variable. Visualization through scatter
plots aids in assessing linearity, where a reasonably straight line should capture
the relationship between each independent variable and the dependent variable.
2. Independence: Independence assumes that observations in the dataset are
independent of each other. In other words, the values of the dependent variable
for one observation should not be influenced by the values of the independent
variables for other observations. Ensuring independence is crucial, and the data
points should be collected independently, without any systematic relationship or
grouping that could violate this assumption.
3. Homoscedasticity: Homoscedasticity asserts that the variance of the residuals
(the differences between observed and predicted values) is constant across all
levels of the independent variables. Residuals should not exhibit a pattern where
the spread systematically widens or narrows as the predicted values change.
Checking residual plots, such as the Residuals vs. Fitted Values plot, helps
ensure a relatively constant spread of residuals.
4. Normality of Residuals: Normality of residuals is the assumption that the
residuals follow a normal distribution. This is crucial for valid statistical inferences
derived from the model. Assessment methods include statistical tests or visual
inspection using tools like a Q-Q plot or histogram to evaluate the normality of
residuals.

5. No Perfect Multicollinearity: The assumption of no perfect multicollinearity posits


that independent variables are not perfectly correlated with each other. Perfect
multicollinearity arises when one or more independent variables can be precisely
predicted using the others, leading to numerical instability in the regression
coefficients. To assess multicollinearity, Variance Inflation Factors (VIF) are
calculated for each independent variable.

6. No Endogeneity: The no-endogeneity assumption states that there is no correlation


between the independent variables and the residuals. Endogeneity can bias
coefficient estimates and compromise the causal interpretation of the model.
Awareness of potential endogeneity issues, careful consideration of study design,
and the application of instrumental variable techniques, if necessary, contribute
to addressing this assumption.
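
Most of these assumptions can be inspected visually with base R's standard
diagnostic plots for a fitted lm object, for example applied to the Regression object
estimated later in this report:

#standard residual diagnostics: Residuals vs Fitted (linearity, homoscedasticity),
#Normal Q-Q (normality), Scale-Location (homoscedasticity), Residuals vs Leverage (influence)
par(mfrow = c(2, 2))
plot(Regression)
par(mfrow = c(1, 1))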
Necessary Transformation of Variables
The exploratory data analysis shows that a_p and Weight are positively skewed. This
should be treated in order to fulfil the assumptions of the multiple regression model.
The essence of the log transformation lies in its ability to compress larger values and
expand smaller ones, effectively mitigating the right skewness commonly encountered
in data sets. In the context of multiple regression, applying a log transformation to
skewed predictors or to the dependent variable can enhance model performance and
uphold the assumptions of the analysis.

One primary advantage of the log transformation is its capacity to stabilize variance
across different levels of the independent variable. By compressing extreme values,
the log transformation diminishes the impact of outliers, fostering a more
homoscedastic distribution of residuals. This, in turn, aligns with the assumptions of
multiple regression, ensuring that the spread of the errors remains relatively constant.

#transformation of a_p (log is defined only for positive values;
#all observed prices and weights are positive, so no observations are dropped)
positive_a_p <- a_p[a_p > 0]
a_p_log_transformed <- log(positive_a_p)

#transformation of Weight
positive_Weight <- Weight[Weight > 0]
Weight_log_transformed <- log(positive_Weight)
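
A quick way to confirm that the log transformation reduces the right skew is, for
example, to compare histograms of the variable before and after the transformation:

#histograms of average price before and after the log transformation;
#the transformed version should look much less right-skewed
ggplot(mydata, aes(x = a_p)) + geom_histogram(bins = 30)
ggplot(mydata, aes(x = log(a_p))) + geom_histogram(bins = 30)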

Model Estimation and Evaluation
library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':


##
## filter, lag

## The following objects are masked from 'package:base':


##
## intersect, setdiff, setequal, union

mydata$log_transformed_a_p <- log(mydata$a_p)


mydata$log_transformed_Weight <- log(mydata$Weight)

Regression<- lm(log_transformed_a_p ~ factor(Sales) + log_transformed_Weight, data = mydata)


summary(Regression)
##
## Call:
## lm(formula = log_transformed_a_p ~ factor(Sales) + log_transformed_Weight,
## data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.61124 -0.76879 -0.05544 0.84251 2.00836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.49529 0.07194 62.486 < 2e-16 ***
## factor(Sales)1 0.01570 0.10308 0.152 0.879
## log_transformed_Weight 0.19917 0.02483 8.020 1.64e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9622 on 347 degrees of freedom
## Multiple R-squared: 0.1567, Adjusted R-squared: 0.1518
## F-statistic: 32.24 on 2 and 347 DF, p-value: 1.439e-13

Interpretation of the results


The output shown above is based on the transformed variables, in order to satisfy the
assumptions mentioned above. Here the dependent variable is log_transformed_a_p,
and the independent variables are factor(Sales) (a categorical variable with two levels)
and log_transformed_Weight. The interpretation of the results is as follows:

Coefficients:
Intercept: The intercept is 4.49529. In the context of the model, when all other
variables are zero, the estimated mean of log_transformed_a_p is 4.49529.

Sales: This is the coefficient for the factor variable Sales at level 1. The estimated
effect on log_transformed_a_p for this level of Sales is 0.01570.
log_transformed_Weight: The coefficient is 0.19917, indicating that for a one-unit
increase in log_transformed_Weight, we expect an increase of approximately 0.19917
in the estimated mean of log_transformed_a_p.
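
Because the dependent variable is in logs, these coefficients can also be read on the
original price scale, for example:

#back-transform the coefficients to the original price scale
coefs <- coef(Regression)

#Sales dummy: approximate percentage change in average price when Sales = 1
(exp(coefs["factor(Sales)1"]) - 1) * 100

#Weight: in a log-log specification the slope is roughly an elasticity, i.e. a 1%
#increase in weight is associated with about a 0.2% increase in average price
coefs["log_transformed_Weight"]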

Statistical Significance:

The intercept is highly significant (p-value < 2e-16), suggesting that the intercept is
different from zero.
The coefficient for factor(Sales)1 is not statistically significant (p-value = 0.879),
indicating that the effect of this variable on log_transformed_a_p is not reliably
different from zero.
The coefficient for log_transformed_Weight is highly significant (p-value < 0.001),
suggesting a reliable effect on log_transformed_a_p.

Model Fit:
The residual standard error is 0.9622, representing the standard deviation of the
model's residuals.

The Multiple R-squared is 0.1567, indicating that the model explains approximately
15.67% of the variance in the dependent variable.
The Adjusted R-squared is 0.1518, accounting for the number of predictors in the
model.

The F-statistic is 32.24 with a very low p-value (1.439e-13), suggesting that the overall
model is statistically significant.

In summary, the model suggests that log_transformed_Weight has a significant


positive effect on log_transformed_a_p, while the factor variable Sales does not have
a significant effect. The model explains a modest proportion of the variability in
log_transformed_a_p.

Model Diagnostic
This part includes checking the residuals and testing for multicollinearity.

res<- resid(Regression)
plot(density(res))
From the above graph we can see that the distribution of the residuals is not normal;
it is bimodal. The bimodal distribution suggests that the linear regression model might
not be the most appropriate representation of the underlying relationship between the
dependent and independent variables. The presence of two modes implies the
existence of two distinct patterns or subpopulations within the data that the current
model is unable to capture adequately. Based on these results, a non-linear regression
model could give better results.
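
This visual impression could be backed up with a normal Q-Q plot and a Shapiro-Wilk
test, both available in base R:

#normal Q-Q plot of the residuals; points should lie close to the line if normal
qqnorm(res)
qqline(res)

#Shapiro-Wilk test; a small p-value indicates departure from normality
shapiro.test(res)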

Independence of errors
From the plot below we can conclude that the errors are independent of each other. To
test the independence of the errors, a residuals-vs-fitted plot was plotted. Ideally, the
variability of the residuals should not exhibit any systematic pattern or relationship with
the predicted values. The plot does not show any particular pattern, hence we can say
the errors are independent of each other.

residuals_vs_fitted <- residuals(Regression)


fitted_values <- fitted(Regression)

plot(fitted_values, residuals_vs_fitted,
main = "Residuals vs. Fitted Values Plot",
xlab = "Fitted Values", ylab = "Residuals")

Multicollinearity test
Multicollinearity is a common issue in multiple regression when two or more

independent variables in a model are highly correlated. Variance Inflation Factor (VIF)

is a statistical measure that quantifies the extent of multicollinearity in a regression

model. High VIF values suggest that the variance of the estimated regression

coefficients is inflated, making it challenging to identify the true relationship between

the independent variables and the dependent variable.

To test for multicollinearity and run the VIF test, I have installed the car package.

#Multicollinearity test
install.packages("car")

## Installing package into 'C:/Users/HP-PC/AppData/Local/R/win-library/4.3'


## (as 'lib' is unspecified)

## package 'car' successfully unpacked and MD5 sums checked


##
## The downloaded binary packages are in
## C:\Users\HP-PC\AppData\Local\Temp\RtmpuGDSNv\downloaded_packages

library(car)

## Warning: package 'car' was built under R version 4.3.2

## Loading required package: carData

##
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':


##
## recode
vif_values <- vif(Regression)

# Display VIF values


print(vif_values)

## factor(Sales) log_transformed_Weight
## 1.000979 1.000979

Interpretation: A VIF close to 1 suggests minimal multicollinearity, while values greater
than 10 are often considered indicative of problematic multicollinearity. In this output,
both VIF values are very close to 1:

factor(Sales): The VIF is 1.000979, indicating that the factor variable Sales does not
have a high correlation with the other predictor in the model. This suggests that there
is minimal multicollinearity associated with this factor.

log_transformed_Weight: Similarly, the VIF for the variable log_transformed_Weight is
1.000979, suggesting that this predictor is not highly correlated with the other predictor
in the model.

In summary, based on the VIF values obtained from this test, there is no evidence of
problematic multicollinearity between the predictors factor(Sales) and
log_transformed_Weight in the regression model. The low VIF values indicate that
these variables are relatively independent of each other, which is favourable for the
stability and interpretability of the regression model.

Sensitivity analysis
1. Better modelling: Running the model through a non-linear regression technique
might produce better results.

2. Residuals: The residuals are bimodally distributed, which points to two distinct
subpopulations within the data set. If these groups were identified and separate
regressions were run for each, the results could improve (a sketch of this idea
follows this list).
3. Better transformation method: Alternative transformation methods could be
explored to help satisfy the required assumptions, since the log transformation did
not resolve the non-normal distribution of the residuals.
4. The available data had limited scope for research. More information could be
obtained from other sources in order to run a more complex and realistic model.
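
For point 2, one possible approach is sketched below using a purely hypothetical split
at the median weight; the actual subpopulations behind the bimodal residuals would
need to be identified from additional product information:

#hypothetical split of the sample at the median weight (for illustration only)
light <- subset(mydata, Weight <= median(Weight))
heavy <- subset(mydata, Weight > median(Weight))

fit_light <- lm(log_transformed_a_p ~ factor(Sale) + log_transformed_Weight, data = light)
fit_heavy <- lm(log_transformed_a_p ~ factor(Sale) + log_transformed_Weight, data = heavy)

summary(fit_light)
summary(fit_heavy)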

Conclusion
The regression model, primarily driven by the log-transformed weight, offers a
meaningful understanding of the relationships within the data. The analysis points to
the substantive influence of weight on average price, aligning with expectations. This
provides insights for firms looking into pricing strategies or market dynamics for
weight-based products. The non-significant effect of the sales factor highlights the
need for a deeper exploration of sales dynamics and their interplay with pricing. The
impact of sales on other variables is a natural direction for further research.
