Project
Meenal Chaturvedi
2023-11-19
products, whether a sale was present, the company of production, and many more.
-As the data consisted of more qualitative variables than quantitative ones, dummy
variables are used in constructing the model.
-Time Period: The data cover 2014 to 2018; however, because the observations are not
evenly spaced over time, time-series analysis is not performed.
Information on Variables
-The three variables taken into consideration are: Sales, Weight of the product and
Average price.
-Average price was calculated from the maximum and minimum price of the product.
-The model has Sales and Weight as independent variables and Average price as the
dependent variable.
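The midpoint calculation described above can be sketched directly; note that the price columns and values here are illustrative assumptions, not taken from the report's data set:

```r
# Hypothetical minimum and maximum listed prices for three products
min_price <- c(10.00, 39.99, 80.00)
max_price <- c(13.98, 49.47, 99.98)

# Average price (a_p) as the midpoint of the listed price range
a_p <- (min_price + max_price) / 2
print(a_p)
```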
distributions within the data set. It involves using statistical and visual techniques to
explore the data before formal testing.
#reading data
mydata <- read.csv("C:\\Users\\HP-PC\\Downloads\\file_p.csv")
#assigning header names
colnames(mydata) <- c('a_p','Weight','Sale')
#summary
summary(mydata)
## a_p Weight Sale
## Min. : 11.99 Min. : 0.00875 Min. :0.0000
## 1st Qu.: 44.73 1st Qu.: 0.41875 1st Qu.:0.0000
## Median : 89.99 Median : 1.60000 Median :0.0000
## Mean :171.96 Mean : 9.34370 Mean :0.2171
## 3rd Qu.:224.99 3rd Qu.: 8.00000 3rd Qu.:0.0000
## Max. :899.98 Max. :150.00000 Max. :1.0000
The summary above is meaningful for the a_p and Weight variables but not for the Sale
variable, which is a dummy variable. The a_p results suggest that average prices vary
widely, with a mean around $171.96 and a large range from the minimum to the
maximum. The Weight variable also shows a wide range of values: the mean is 9.3437
pounds, and the substantial difference between the median (1.60) and the mean
indicates a right-skewed distribution.
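The mean-versus-median comparison used above is a quick skewness check: in a right-skewed distribution the mean is pulled above the median by the long right tail. A minimal sketch on simulated right-skewed data (illustrative, not the report's data):

```r
# A log-normal sample is right-skewed by construction
set.seed(42)
x <- rlnorm(10000, meanlog = 0, sdlog = 1)

# For right-skewed data the mean exceeds the median
mean(x) > median(x)
```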
Scatter Plot
Scatter plots are fundamental for visualizing relationships between two continuous
variables. Each point on the plot represents an observation, making it easy to identify
patterns, clusters, and outliers.
Key Insights from Scatter Plots: Linear Relationships: The pattern of points may
suggest a linear correlation.
#Scatter Plot
library(ggplot2)
#scatter of Weight against a_p
ggplot(mydata, aes(x = Weight, y = a_p)) + geom_point()
Few Outliers on the Right:
-On the right side of the scatter plot, there are a few data points that deviate from the
main cluster.
-These outliers have relatively higher values compared to the bulk of the data.
Right Skewness:
The overall pattern of the scatter plot is indicative of right skewness or positive
skewness. In a right-skewed distribution, the tail on the right side is longer than the left
side. The skewness is driven by the presence of higher values (outliers) on the right.
Histogram
Histograms are powerful tools for revealing the distribution of a single variable. A
histogram divides the data into bins and represents the frequency or density of
observations within each bin. By examining the shape of the histogram, one can
identify characteristics such as central tendency, spread, skewness, and potential
outliers.
#Histogram
ggplot(mydata, aes(x = Weight)) + geom_histogram()
Correlation
Correlation measures the degree to which two variables move in relation to each
other. It provides insights into the strength and direction of a linear relationship
between two sets of data. The correlation coefficient, often denoted by "r," ranges from
-1 to +1.
Strength of Correlation:
A correlation coefficient closer to 1 indicates a strong positive correlation, implying that
as one variable increases, the other tends to increase as well. Conversely, a
coefficient closer to -1 signifies a strong negative correlation, meaning that as one
variable increases, the other tends to decrease.
Direction of Correlation:
Positive correlation implies that the two variables move in the same direction.
Negative correlation suggests that the variables move in opposite directions.
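These direction conventions can be verified with `cor()` on toy vectors (illustrative data, not the report's):

```r
x <- 1:10
y_up <- 2 * x + 3    # moves with x
y_down <- -2 * x + 3 # moves against x

cor(x, y_up)    # exactly 1 for a perfect increasing linear relation
cor(x, y_down)  # exactly -1 for a perfect decreasing linear relation
```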
In our case, Sale is a dummy variable, so to check correlation the as.numeric function
is used to convert the factor variable into a numeric variable before calculating the
correlation. With this we calculate the correlation between a binary variable and a
continuous variable.
#correlation
#Sale vs Weight, Sale vs a_p, and Weight vs a_p
print(cor(as.numeric(factor(mydata$Sale)), mydata$Weight))
print(cor(as.numeric(factor(mydata$Sale)), mydata$a_p))
print(cor(mydata$Weight, mydata$a_p))
## [1] -0.06787065
## [1] 0.002762597
## [1] 0.05753285
Now looking at the correlation graphs. To draw the correlation plot and matrix I have
installed the corrplot package in R, and to study the binary variable "Sales" I have
used the box-plot method, which is shown below.
#installing corrplot package
options(repos = c(CRAN = "https://cran.r-project.org"))
install.packages("corrplot")
library(corrplot)
#Correlation matrix
cor_matrix <- cor(mydata)
corrplot(cor_matrix, method = "color")
#Box plot of a_p by Sale
plot(factor(mydata$Sale), mydata$a_p,
main = "Box Plot of a_p by Sale",
xlab = "Sale",
ylab = "a_p")
From the results obtained we can observe that the practical significance of the
relationship may be limited. In practical terms, changes in weight are not strongly
associated with changes in sales.
2. Correlation between Sales and a_p: From the plots and matrix we can see that there
is an extremely weak correlation between Sales and a_p. The correlation coefficient of
-0.009827771 is extremely close to 0 and suggests an almost negligible negative
linear relationship between these two variables.
Here’s the interpretation:
Strength of the Relationship: The absolute value of the correlation coefficient is very
close to 0 (0.0098), indicating an extremely weak linear relationship.
Direction of the Relationship: The negative sign of the correlation coefficient (-0.0098)
suggests a very weak negative or inverse relationship. This means that as the values
of a_p change, there is a very slight tendency for sale to decrease, and vice versa.
Interpretation of Magnitude: The magnitude of the correlation coefficient is crucial for
understanding the strength of the relationship. In this case, the value is extremely
close to 0, indicating an almost nonexistent linear relationship.
From the results obtained we can observe that the practical significance of this
relationship is likely minimal. Changes in a_p are not meaningfully associated with
changes in sale based on this correlation.
Model Specification
For this model multiple regression was performed. The multiple regression technique
allows us to explore the relationship between a dependent variable and two or more
independent variables; unlike simple regression with a single predictor, multiple
regression accounts for the potential influence of multiple predictors.
Dependent Variable (Y): The variable we are trying to predict or explain. It is the
outcome or response variable.
Independent Variables (X1, X2, …, Xn): The predictors or features that are believed to
have an impact on the dependent variable. In multiple regression, there can be more
than one independent variable.
Coefficients (β0, β1, β2, …, βn): The coefficients represent the estimated change in
the dependent variable for a one-unit change in the corresponding independent
variable, holding other variables constant.
Intercept (β0): The intercept represents the predicted value of the dependent variable
when all independent variables are set to zero.
Residuals: The differences between the observed and predicted values of the
dependent variable. The goal is to minimize these differences.
Dependent variable: a_p
Independent variables: Sales (dummy variable) and Weight
Intercept: β0; Coefficient of Sales: β1; Coefficient of Weight: β2; Residual: ε
Model: log(a_p) = β0 + β1·Sales + β2·log(Weight) + ε
The validity of multiple regression rests on the below assumptions. Understanding and
verifying these assumptions is essential for reliable inference.
#transformation of a_p
log_transformed_a_p <- log(mydata$a_p)
#transformation of Weight
log_transformed_Weight <- log(mydata$Weight)
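The fitting step itself is not shown in the excerpt. Based on the coefficient names discussed below, it was presumably a call of the form `lm(log_transformed_a_p ~ factor(Sale) + log_transformed_Weight)`. A self-contained sketch of that presumed fit, using toy stand-ins since the report's data set is not reproduced here:

```r
# Toy stand-ins for the report's variables (not the original data)
set.seed(1)
Sale <- rbinom(100, 1, 0.22)
log_transformed_Weight <- rnorm(100)
log_transformed_a_p <- 4.5 + 0.2 * log_transformed_Weight + rnorm(100)

# Presumed form of the fit: log price on the Sale dummy and log weight
Regression <- lm(log_transformed_a_p ~ factor(Sale) + log_transformed_Weight)
summary(Regression)$coefficients
```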
Coefficients:
Intercept: The intercept is 4.53236. In the context of the model, when all other
variables are zero, the estimated mean of log_transformed_a_p is 4.53236.
Sales: This represents the coefficient for the factor variable Sales with level 1. The
estimated effect on log_transformed_a_p for the first level of Sales is -0.06017.
Log_transformed_Weight: The coefficient is 0.19987, indicating that for a one-unit
increase in log_transformed_Weight, we expect an increase of approximately 0.19987
in the estimated mean of log_transformed_a_p.
Statistical Significance:
The intercept is highly significant (p-value < 2e-16), suggesting that the intercept is
different from zero.
The coefficient for factor(Sales)1 is not statistically significant (p-value = 0.559),
indicating that the effect of this variable on log_transformed_a_p is not reliably
different from zero.
The coefficient for log_transformed_Weight is highly significant (p-value < 0.001),
suggesting a reliable effect on log_transformed_a_p.
Model Fit:
The residual standard error is 0.9617, representing the standard deviation of the
model’s residuals.
The Multiple R-squared is 0.1575, indicating that the model explains approximately
15.75% of the variance in the dependent variable.
The Adjusted R-squared is 0.1526, accounting for the number of predictors in the
model.
The F-statistic is 32.43 with a very low p-value (1.227e-13), suggesting that the overall
model is statistically significant.
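The Multiple R-squared reported above can be recovered from the residuals as 1 − SS_res/SS_tot. A quick sketch on toy data (not the report's) confirming the identity against R's own summary:

```r
# Toy regression
set.seed(7)
x <- rnorm(50)
y <- 1 + 0.5 * x + rnorm(50)
fit <- lm(y ~ x)

# R-squared = 1 - residual sum of squares / total sum of squares
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared)  # TRUE
```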
Model Diagnostics
This part covers checking the residuals and testing for multicollinearity.
#density of residuals
res <- resid(Regression)
plot(density(res))
From the above graph we can see that the distribution of residuals is not normally
distributed. This is a bimodal distribution. The bimodal distribution suggests that the
linear regression model might not be the most appropriate representation of the
underlying relationship between the dependent and independent variables. The
presence of two modes implies the existence of two distinct patterns or
subpopulations within the data that the current model is unable to capture adequately.
Based on these results, a non-linear regression model may yield a better fit.
Independence of errors
From the below plot we can conclude that the errors are independent of each other. To
satisfy this assumption, the residuals should not exhibit any systematic pattern or
relationship with the fitted values. The plot doesn't show any particular pattern, hence
we can say the errors are independent.
#residuals vs fitted values
fitted_values <- fitted(Regression)
plot(fitted_values, res,
main = "Residuals vs. Fitted Values Plot",
xlab = "Fitted Values", ylab = "Residuals")
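Alongside this plot, a common numeric check of error independence is the Durbin-Watson statistic, the sum of squared successive residual differences divided by the residual sum of squares; values near 2 indicate no first-order autocorrelation. A sketch on simulated residuals (not the report's):

```r
# Stand-in residuals with no autocorrelation
set.seed(3)
e <- rnorm(200)

# Durbin-Watson statistic: near 2 for independent errors
dw <- sum(diff(e)^2) / sum(e^2)
dw
```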
Multicollinearity test
Multicollinearity is a common issue in multiple regression when two or more
independent variables in a model are highly correlated. The Variance Inflation Factor
(VIF) quantifies how much this correlation inflates the variance of the estimated
regression coefficients; high VIF values suggest unstable coefficient estimates.
To check multicollinearity and run the VIF test, I have installed the car package.
#Multicollinearity test
install.packages("car")
library(car)
vif(Regression)
##
## Attaching package: 'car'
## factor(Sales) log_transformed_Weight
## 1.000979 1.000979
In summary, based on the VIF values resulted from this test, there is no evidence of
problematic multicollinearity between the predictors “factor(Sales)” and
“log_transformed_Weight” in your regression model. The low VIF values indicate that
these variables are relatively independent of each other, which is favourable for the
stability and interpretability of your regression model.
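The VIF values above follow the definition VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the other predictors. A sketch of that calculation without the car package, on toy data (not the report's):

```r
# Two mildly correlated predictors
set.seed(11)
x1 <- rnorm(100)
x2 <- 0.5 * x1 + rnorm(100)

# Regress x1 on the other predictor, then apply 1 / (1 - R^2)
r2 <- summary(lm(x1 ~ x2))$r.squared
vif_x1 <- 1 / (1 - r2)
vif_x1  # values near 1 indicate little collinearity
```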
Sensitivity Analysis
1. Better modelling: Running the model through non-linear regression techniques
might yield better results.
Conclusion
The regression model, primarily driven by the log-transformed weight, offers a
meaningful understanding of the relationships within the data. The analysis points to
the substantive influence of weight on average price, aligning with expectations, and
provides insight for firms looking into pricing strategies or market dynamics for
weight-based products. The non-significant effect of the sales factor highlights the
need for deeper exploration of sales dynamics and their interplay with pricing; the
impact of Sales on other variables is a promising direction for further research.