
SEPTEMBER 15, 2020

BIG DATA AND BUSINESS ANALYTICS
EVALUATION - I

Submitted to
Dr. Gulnaz Banu

By Shruti Arora
Question 1: Develop a simple linear regression model between the sold price and
batting strike rate. Is there a statistically significant relationship between sold
price and batting strike rate?
Taking Sold Price as the dependent variable and SR-B (batting strike rate) as the
independent variable, the following conclusions may be drawn:

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.18426907
R Square 0.03395509
Adjusted R Square 0.02640786
Standard Error 401399.957
Observations 130

ANOVA
df SS MS F Significance F
Regression 1 7.2489E+11 7.2489E+11 4.49901599 0.03584286
Residual 128 2.0624E+13 1.6112E+11
Total 129 2.1348E+13

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 289521.427 114770.006 2.52262274 0.01287373 62429.3614 516613.492 62429.3614 516613.492
SR -B 2086.39366 983.642957 2.1210884 0.03584286 140.088017 4032.69931 140.088017 4032.69931

RESIDUAL OUTPUT (Dependent variable: SOLD PRICE)

Observation Predicted SOLD PRICE Residuals
1 289521.427 -239521.43
2 289521.427 -239521.43
3 542005.297 -192005.3
4 448746.206 401253.794
5 541380.313 258619.687
6 488677.185 -438677.19
7 440205.413 59794.5867
8 635612.407 64387.5928
9 528885.083 421114.917
10 555553.499 -105553.5
11 554746.318 -354746.32
12 457768.211 -257768.21

[Figure: plot of SOLD PRICE (vertical axis, 0 to 2000000) against a horizontal axis running from 0.00 to 250.00]

Here R2 = 0.034

P-value = 0.036

• R2 is the proportion of the variation in the dependent variable that is explained by
the model; in a simple linear regression it is the square of the correlation between x
and y. The closer R2 is to 1, the better the model fits. The P-value of a coefficient is
used to determine whether the independent variable is statistically significant: a value
below the chosen significance level (0.05 here) indicates a significant relationship.

Conclusion: Despite the low R2, the P-value is 0.036, which is less than the significance
level of 0.05. This indicates a statistically significant, although weak, relationship
between Sold price and Batting strike rate: strike rate explains only about 3.4% of the
variation in sold price.
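
The figures in the summary output for Question 1 can be cross-checked against one another. The following is a minimal Python sketch that uses only values copied from that table; the variable names are illustrative and are not part of the original workbook. For a regression with a single predictor, the t statistic is the coefficient divided by its standard error, F equals the square of t, and R2 equals the regression sum of squares divided by the total sum of squares.

```python
# Values copied from the SUMMARY OUTPUT for Sold Price vs SR-B.
coef_sr_b = 2086.39366      # slope for SR-B
se_sr_b = 983.642957        # standard error of the slope
ss_regression = 7.2489e11   # ANOVA: regression sum of squares
ss_total = 2.1348e13        # ANOVA: total sum of squares

# t statistic of the slope: coefficient / standard error.
t_stat = coef_sr_b / se_sr_b

# With one predictor, the overall F statistic is the square of t.
f_stat = t_stat ** 2

# R2 is the share of the total variation captured by the regression.
r_squared = ss_regression / ss_total

print(f"t = {t_stat:.4f}")
print(f"F = {f_stat:.4f}")
print(f"R2 = {r_squared:.4f}")
```

Small differences from the reported R Square are expected because the sums of squares in the ANOVA table are shown rounded to five significant figures.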
Question 2: What is the impact of ability to score “SIXERS” on the player’s
price?

Dependent Variable: Base price

Independent Variable: Sixers

Equation used: Y= a+bx


SIXERS BASE PRICE
0 50000
0 50000
5 200000
0 100000
28 100000
0 50000
1 100000
1 200000
3 200000
13 200000
38 200000
0 200000
9 125000
42 200000
36 100000
64 400000
24 150000
0 100000
23 400000
35 300000
0 150000
0 150000
3 350000
2 950000
32 220000
42 200000
129 250000
31 250000
86 300000
5 50000
22 250000
3 200000
44 225000

SUMMARY OUTPUT

Y = a + bx
Base Price = 173828.162677175 + 1040*Sixers

Regression Statistics
Multiple R 0.16188909
R Square 0.02620808
Adjusted R Square 0.01860033
Standard Error 151666.788
Observations 130

ANOVA
df SS MS F Significance F
Regression 1 7.9243E+10 7.9243E+10 3.44491862 0.06574617
Residual 128 2.9444E+12 2.3003E+10
Total 129 3.0236E+12

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 173828.163 16590.6798 10.4774587 6.2814E-19 141000.668 206655.657 141000.668 206655.657
SIXERS 1040.14733 560.409351 1.8560492 0.06574617 -68.718321 2149.01297 -68.718321 2149.01297

RESIDUAL OUTPUT (Dependent Variable: BASE PRICE)

Observation Predicted BASE PRICE Residuals
1 173828.163 -123828.16
2 173828.163 -123828.16
3 179028.899 20971.1007
4 173828.163 -73828.163
5 202952.288 -102952.29
6 173828.163 -123828.16
7 174868.31 -74868.31
8 174868.31 25131.69
9 176948.605 23051.3953
10 187350.078 12649.9221

[Figure: plot of BASE PRICE (vertical axis, 0 to 1600000) against a horizontal axis running from 0 to 140]

CONCLUSION:

Here R2 = 0.026 (ideally close to 1), P-value = 0.066 (more than 0.05)

• For each additional six hit by the player, the predicted base price increases by
1040.147.
• A low R2 does not by itself rule out the importance of a significant variable.
• Here, however, R2 is close to 0, so the model explains very little of the variation.
In addition, the P-value / Significance F is greater than 0.05, so the relationship is
not statistically significant at the 5% level (a very low R2 together with a P-value
greater than 0.05 is the worst possible scenario).

This means that the ability to score sixes has very little impact on the player’s price.
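
As an illustration of how the fitted line for Question 2 is used, the sketch below recomputes one predicted value and its residual from the reported coefficients, and checks whether the 95% interval for the slope contains zero. It is a minimal Python sketch; the function name `predict_base_price` is an assumption introduced here, and the numbers are copied from the summary and residual output.

```python
# Coefficients copied from the Base Price vs Sixers summary output.
INTERCEPT = 173828.163
SLOPE_SIXERS = 1040.14733
CI_LOWER, CI_UPPER = -68.718321, 2149.01297  # 95% interval for the slope


def predict_base_price(sixers):
    """Fitted base price for a given number of sixers (hypothetical helper)."""
    return INTERCEPT + SLOPE_SIXERS * sixers


# Observation 3 in the listed data has 5 sixers and a base price of 200000.
predicted = predict_base_price(5)
residual = 200000 - predicted

# An interval that contains zero means the slope is not significant at 5%.
slope_interval_contains_zero = CI_LOWER < 0 < CI_UPPER

print(round(predicted, 3), round(residual, 3), slope_interval_contains_zero)
```

The recomputed values agree with observation 3 of the residual output, and the interval check gives the same verdict as the P-value of 0.066.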
Question 3: Develop a multiple linear regression model between Sold price
and batting strike rate and Sixers. What do you conclude from this
model?

Dependent Variable: Sold price, Independent Variables: SR-B, Sixers

SIXERS SR-B SOLD PRICE
0 0.00 50000
0 0.00 50000
5 121.01 350000
0 76.32 850000
28 120.71 800000
0 95.45 50000
1 72.22 500000
1 165.88 700000
3 114.73 950000
13 127.51 450000
38 127.12 200000
0 80.64 200000
9 113.09 400000
42 128.53 300000
36 122.32 300000
64 136.45 1500000
24 117.83 250000
0 33.33 375000
23 116.88 500000
35 119.27 300000
0 80.00 150000
0 133.33 150000
3 118.78 350000
2 116.98 1550000
32 128.90 725000
42 106.81 400000

SUMMARY OUTPUT

Sold Price = 395337.53 + 7758(Sixers) + (-102.52)(SR-B)

Regression Statistics
Multiple R 0.4506833
R Square 0.20311544
Adjusted R Square 0.19056608
Standard Error 365998.658
Observations 130

ANOVA
df SS MS F Significance F
Regression 2 4.3362E+12 2.1681E+12 16.1853185 5.4778E-07
Residual 127 1.7012E+13 1.3396E+11
Total 129 2.1348E+13

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 395337.539 106613.879 3.70812452 0.00031071 184367.913 606307.165 184367.913 606307.165
SIXERS 7758.80457 1494.3123 5.19222426 8.02E-07 4801.8302 10715.7789 4801.8302 10715.7789
SR-B -102.52359 991.029657 -0.1034516 0.91776776 -2063.5924 1858.54526 -2063.5924 1858.54526

RESIDUAL OUTPUT

Observation Predicted SOLD PRICE Residuals
1 395337.539 -345337.54
2 395337.539 -345337.54

CONCLUSION:

Here R2 = 0.203 (ideally close to 1), Significance F = 5.4778E-07 (approximately 0.00)

From above, it can be observed that:

• R2, the coefficient of determination, is low, whereas a value close to 1 is expected in
the ideal scenario. It shows that about 20% of the variation in Sold Price is explained
by the variation in the independent variables.

The F-test for the regression is highly significant; thus, we can conclude that the model
explains a statistically significant amount of the variance in Sold price.

From the coefficient table, it can be seen that SIXERS has a P-value of 8.02E-07 (less than
0.05, highly significant), while SR-B has a P-value of 0.918 (much higher than 0.05, not
significant).

The scenario we have is a low R2 with a low P-value, which is not ideal. It means that the
model does not explain much of the variation in the data but is still statistically
significant (better than not having a model at all; not the worst scenario).
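
The Adjusted R Square and F values reported for the two-predictor model follow from R Square, the number of observations and the number of predictors. The sketch below is a minimal Python illustration using the standard textbook formulas; the function names are assumptions introduced here, and the inputs are copied from the summary output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


def overall_f(r_squared, n_obs, n_predictors):
    """Overall F statistic written in terms of R2."""
    explained = r_squared / n_predictors
    unexplained = (1 - r_squared) / (n_obs - n_predictors - 1)
    return explained / unexplained


# R Square = 0.20311544, Observations = 130, predictors = SIXERS and SR-B.
adj = adjusted_r_squared(0.20311544, 130, 2)
f_stat = overall_f(0.20311544, 130, 2)

print(f"Adjusted R2 = {adj:.4f}, F = {f_stat:.2f}")
```

Because Adjusted R Square penalises every extra predictor, it is the more suitable figure for comparing this model with the single-predictor models in the earlier questions.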
QUESTION 4: Cricket in the T20 format is considered a young man’s sport. Is
there evidence that the player’s price is influenced by age?

Dependent Variable: Base price

Independent variable: Age

AGE BASE PRICE
2 50000
2 50000
2 200000
1 100000
2 100000
2 50000
2 100000
2 200000
2 200000
2 200000
2 200000
3 200000
1 125000
2 200000
2 100000
2 400000
2 150000
2 100000
3 400000
2 300000
2 150000
2 150000
3 350000
2 950000
2 220000
3 200000
2 250000
3 250000
3 300000
2 50000
2 250000
2 200000
3 225000
3 100000
1 50000

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.20927339
R Square 0.04379535
Adjusted R Square 0.036325
Standard Error 150290.95
Observations 130

ANOVA
df SS MS F Significance F
Regression 1 1.3242E+11 1.3242E+11 5.86255758 0.01686853
Residual 128 2.8912E+12 2.2587E+10
Total 129 3.0236E+12

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 75975.6098 49790.5769 1.52590338 0.12950109 -22543.553 174494.773 -22543.553 174494.773
AGE 55563.1277 22947.9092 2.42127189 0.01686853 10156.7686 100969.487 10156.7686 100969.487

RESIDUAL OUTPUT

Observation Predicted BASE PRICE Residuals
1 187101.865 -137101.87
2 187101.865 -137101.87
3 187101.865 12898.1349
4 131538.737 -31538.737
5 187101.865 -87101.865
6 187101.865 -137101.87
7 187101.865 -87101.865
8 187101.865 12898.1349
9 187101.865 12898.1349
10 187101.865 12898.1349
11 187101.865 12898.1349
12 242664.993 -42664.993

[Figure: BASE PRICE chart; vertical axis 0 to 1600000, horizontal axis 0 to 3.5]

CONCLUSION:

Here R2 = 0.044, P-Value = 0.017, R = 0.209

Since the P-value is less than 0.05, the effect of the independent variable ‘Age’ on the
dependent variable Price is statistically significant.

R2 shows that the model explains only about 4% of the variance in the data. The R value of
0.209 indicates only a weak correlation between the variables.

Since R2 is low (not close to 1) while the P-value is less than 0.05, most of the variation
in the data cannot be explained by age. Therefore, although the relationship is statistically
significant, age on its own cannot be considered a strong factor affecting the price of the
player.
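
Because AGE takes only the coded values 1, 2 and 3 in the data shown (the report does not state what each code means), the fitted line yields just three possible predictions. The sketch below is a minimal Python illustration that recomputes them from the reported coefficients and confirms that Multiple R is the square root of R Square when there is one predictor; the variable names are illustrative.

```python
import math

# Coefficients copied from the Base Price vs Age summary output.
INTERCEPT = 75975.6098
SLOPE_AGE = 55563.1277
R_SQUARED = 0.04379535

# AGE is coded as a category (1, 2 or 3) in the listed data.
predicted_by_age = {age: INTERCEPT + SLOPE_AGE * age for age in (1, 2, 3)}

# With a single predictor, Multiple R is the square root of R2.
multiple_r = math.sqrt(R_SQUARED)

for age, price in predicted_by_age.items():
    print(age, round(price, 3))
print("Multiple R =", round(multiple_r, 4))
```

The three predictions match the distinct values in the Predicted BASE PRICE column, and the positive slope means the fitted base price rises with the age code.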
QUESTION 5: Are players of Indian origin paid more than players of other
countries?

Dependent Variable: Sold price, Independent Variable: Country, Dummy Variable is taken
with India =1 and rest taken 0.

Considering Dummy Country Variable, India = 1 and all other countries = 0

COUNTRY Dummy Variable-Country SOLD PRICE
SA 0 50000
BAN 0 50000
IND 1 350000
IND 1 850000
IND 1 800000
AUS 0 50000
IND 1 500000
AUS 0 700000
SA 0 950000
SA 0 450000
WI 0 200000
WI 0 200000
IND 1 400000
SA 0 300000
IND 1 300000
IND 1 1500000
SL 0 250000
IND 1 375000
IND 1 500000
SA 0 300000
WI 0 150000
SL 0 150000
NZ 0 350000
ENG 0 1550000
IND 1 725000
IND 1 400000
WI 0 800000
SA 0 575000

SUMMARY OUTPUT

Y = B1 + B2(X)

Regression Statistics
Multiple R 0.26843436
R Square 0.072057
Adjusted R Square 0.06480745
Standard Error 393404.49
Observations 130

ANOVA
df SS MS F Significance F
Regression 1 1.53831E+12 1.53831E+12 9.93950763 0.00201547
Residual 128 1.98102E+13 1.54767E+11
Total 129 2.13485E+13

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 430974.026 44832.60242 9.612960272 8.4429E-17 342265.062 519682.99 342265.062 519682.99
Dummy Variable-Country 221365.597 70214.64277 3.152698468 0.00201547 82433.9298 360297.264 82433.9298 360297.264

RESIDUAL OUTPUT

Observation Predicted SOLD PRICE Residuals
1 430974.026 -380974.026
2 430974.026 -380974.026

We know that the regression equation for predicting an outcome variable Y on the basis of a
predictor variable X can be written as:

Y = B1 + B2(X)

In order to understand whether Indian players are paid more than players from other countries,
let us assume the dummy variable:

India = 1, Other Countries = 0

With the help of this variable, we arrive at the following model:

Sold Price = 430974.026 + 221365.597 (Dummy Variable-Country)

From the Coefficients table: B1 = 430974.026, B2 = 221365.597

The coefficients can be interpreted as follows:

B1 = Average Selling Price of other country players = 430974.026

B1 + B2 = Average Selling Price of Indian players = 652339.623

B2 = Average Difference in the selling price between Indian and Other Countries Selling Price

= 652339.623 - 430974.026

Hence, B2 = 221365.597
Hence, we can conclude that, on average, Indian players are paid more than players from other
countries during IPL.

The P-value / Significance F for the India dummy variable is 0.002 < 0.05, which is highly
significant. This suggests that there is statistically significant evidence of a difference in
average selling price between Indian players and players from other countries.
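
The interpretation of B1 and B2 as group averages can be verified directly: in a least-squares fit on a single 0/1 dummy, the intercept equals the mean of the group coded 0 and the slope equals the difference between the two group means. The sketch below is a minimal Python illustration using only the 28 players listed in the data excerpt for this question. The full regression uses 130 observations, so the numbers it prints are not expected to match the reported coefficients; only the identity is being demonstrated.

```python
from statistics import mean

# (dummy, sold price) pairs for the 28 players listed in the excerpt.
# Assumption: this is only part of the 130-row data set used in the report.
rows = [
    (0, 50000), (0, 50000), (1, 350000), (1, 850000), (1, 800000),
    (0, 50000), (1, 500000), (0, 700000), (0, 950000), (0, 450000),
    (0, 200000), (0, 200000), (1, 400000), (0, 300000), (1, 300000),
    (1, 1500000), (0, 250000), (1, 375000), (1, 500000), (0, 300000),
    (0, 150000), (0, 150000), (0, 350000), (0, 1550000), (1, 725000),
    (1, 400000), (0, 800000), (0, 575000),
]

x = [d for d, _ in rows]
y = [p for _, p in rows]

# Ordinary least squares with one predictor:
# B2 = Sxy / Sxx and B1 = mean(y) - B2 * mean(x).
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b2 = s_xy / s_xx
b1 = y_bar - b2 * x_bar

# Group means computed directly from the same rows.
mean_other = mean([p for d, p in rows if d == 0])
mean_india = mean([p for d, p in rows if d == 1])

print("B1 =", round(b1, 2), "| mean of other countries =", round(mean_other, 2))
print("B2 =", round(b2, 2), "| difference in means =", round(mean_india - mean_other, 2))
```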

QUESTION 6: Develop the model which can be used by franchises to predict the Sold Price.

Step 1: Dependent Variable: Sold price

Others: Base Price, SR-B, RUNS-S, ODI-SR-B, T-RUNS, AVE, SIXERS, ODI-RUNS-S, HS

ODI-RUNS-S ODI-SR-B T-RUNS RUNS-S HS AVE SR-B SIXERS Base Price Sold Price
0 0 0 0 0 0.00 0.00 0 50000 50000
657 71.41 214 0 0 0.00 0.00 0 50000 50000
1269 80.62 571 167 39 18.56 121.01 5 200000 350000
241 84.56 284 58 11 5.80 76.32 0 100000 850000
79 45.93 63 1317 71 32.93 120.71 28 100000 800000
172 72.26 0 63 48 21.00 95.45 0 50000 50000
120 78.94 51 26 15 4.33 72.22 1 100000 500000
50 92.59 54 21 16 21.00 165.88 1 200000 700000
609 85.77 83 335 67 30.45 114.73 3 200000 950000
4686 84.76 5515 394 50 28.14 127.51 13 200000 450000
2004 81.39 2200 839 70 27.97 127.12 38 200000 200000
8778 70.74 9918 25 16 8.33 80.64 0 200000 200000
38 65.51 5 337 24 13.48 113.09 9 125000 400000
4998 93.19 5457 1302 105 34.26 128.53 42 200000 300000
69 56.09 0 1540 95 31.43 122.32 36 100000 300000
6773 88.19 3509 1782 70 37.13 136.45 64 400000 1500000
6455 86.8 4722 1077 76 28.34 117.83 24 150000 250000
18 60 0 6 2 1.00 33.33 0 100000 375000
10889 71.24 13288 1703 75 27.92 116.88 23 400000 500000
2536 84 654 978 74 36.22 119.27 35 300000 300000
73 45.62 380 4 3 4.00 80.00 0 150000 150000
239 60.96 249 4 2 0.00 133.33 0 150000 150000
8037 71.49 7172 196 45 21.77 118.78 3 350000 350000
3394 88.82 3845 62 24 31.00 116.98 2 950000 1550000
4819 86.17 3712 2065 93 33.31 128.90 32 220000 725000
11363 73.7 7212 1349 91 25.45 106.81 42 200000 400000
8087 83.95 6373 1804 128 50.11 161.79 129 250000 800000
8094 83.26 6167 886 69 27.69 109.79 31 250000 575000

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.71612017
R Square 0.5128281
Adjusted R Square 0.4762902
Standard Error 294397.516
Observations 130

ANOVA
df SS MS F Significance F
Regression 9 1.0948E+13 1.2165E+12 14.0355138 3.0438E-15
Residual 120 1.04E+13 8.667E+10
Total 129 2.1348E+13

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 219297.209 100457.576 2.18298328 0.03098381 20398.1999 418196.219 20398.1999 418196.219
ODI-RUNS-S 47.7630784 18.0069401 2.65248166 0.00907078 12.1105902 83.4155665 12.1105902 83.4155665
ODI-SR-B -659.17764 1159.21324 -0.5686423 0.5706616 -2954.3392 1635.98391 -2954.3392 1635.98391
T-RUNS -64.320895 19.4967236 -3.2990618 0.0012772 -102.92305 -25.718739 -102.92305 -25.718739
RUNS-S 389.81159 108.030984 3.60833139 0.00045053 175.917761 603.70542 175.917761 603.70542
HS -3362.2887 1843.02662 -1.82433 0.07059015 -7011.3532 286.775704 -7011.3532 286.775704
AVE -148.06087 5469.70497 -0.0270693 0.97844946 -10977.696 10681.5743 -10977.696 10681.5743
SR-B 282.103364 951.77783 0.29639623 0.76743986 -1602.3505 2166.55723 -1602.3505 2166.55723
SIXERS 815.543884 2478.90349 0.3289938 0.74273385 -4092.5125 5723.60027 -4092.5125 5723.60027
Base Price 1.47950713 0.20618189 7.17573763 6.4733E-11 1.07128134 1.88773292 1.07128134 1.88773292

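P-value is less than 0.05 for ODI-RUNS-S, T-RUNS, RUNS-S and Base Price.

• Next, with Sold Price again as the dependent variable, the bowling variables are used:
Base Price, SR-BL, T-WKTS, ODI-SR-BL, WKTS, ODI-WKTS, RUNS-C, AVE-BL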


T-WKTS ODI-WKTS ODI-SR-BL RUNS-C WKTS AVE-BL SR-BL BASE PRICE SOLD PRICE
0 0 0 307 15 20.47 13.93 50000 50000
18 185 37.6 29 0 0.00 0.00 50000 50000
58 288 32.9 1059 29 36.52 24.90 200000 350000
31 51 36.8 1125 49 22.96 22.14 100000 850000
0 0 0 0 0 0.00 0.00 100000 800000
0 0 0 0 0 0.00 0.00 50000 50000
27 34 42.5 1342 52 25.81 19.40 100000 500000
50 62 31.3 693 37 18.73 15.57 200000 700000
17 72 53 610 19 32.11 28.11 200000 950000
1 0 0 0 0 0.00 0.00 200000 450000
86 142 34.1 1338 47 28.47 21.11 200000 200000
9 14 52.8 0 0 0.00 0.00 200000 200000
3 32 41 1819 73 126.30 100.20 125000 400000
2 0 0 0 0 0.00 0.00 200000 300000
0 0 0 66 4 16.50 12.00 100000 300000
0 1 12 0 0 0.00 0.00 400000 1500000
32 67 58.3 356 5 71.20 53.00 150000 250000
0 5 61.4 926 36 25.72 21.19 100000 375000
1 4 46.5 0 0 0.00 0.00 400000 500000
11 25 47.6 377 10 37.70 31.80 300000 300000
157 60 35.6 154 5 30.80 28.00 150000 150000
97 187 34.7 298 17 17.53 13.76 150000 150000
0 1 29 0 0 0.00 0.00 350000 350000
226 169 33.2 105 2 52.50 33.00 950000 1550000
0 0 0 0 0 0.00 0.00 220000 725000
32 100 45.6 363 10 36.30 27.60 200000 400000
72 156 44.4 606 13 46.62 34.85 250000 800000

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.59468623
R Square 0.35365171
Adjusted R Square 0.31091794
Standard Error 337694.715
Observations 130

ANOVA
df SS MS F Significance F
Regression 8 7.5499E+12 9.4374E+11 8.2756963 6.7229E-09
Residual 121 1.3799E+13 1.1404E+11
Total 129 2.1348E+13

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 194720.954 64593.6361 3.01455322 0.00313611 66840.8138 322601.094 66840.8138 322601.094
T-WKTS -709.05246 372.022132 -1.9059416 0.05903002 -1445.5684 27.4634752 -1445.5684 27.4634752
ODI-WKTS 572.394625 482.789441 1.18559889 0.23810383 -383.41442 1528.20367 -383.41442 1528.20367
ODI-SR-BL -986.00082 1239.95067 -0.7951936 0.42805799 -3440.8102 1468.80858 -3440.8102 1468.80858
RUNS-C 493.177075 213.203991 2.31317 0.02240173 71.0835402 915.27061 71.0835402 915.27061
WKTS -9528.9536 5269.66306 -1.8082662 0.07304868 -19961.642 903.734329 -19961.642 903.734329
AVE-BL 9259.99221 8775.86792 1.0551654 0.29345199 -8114.1531 26634.1375 -8114.1531 26634.1375
SR-BL -11736.571 12148.5645 -0.9660871 0.33592756 -35787.859 12314.7169 -35787.859 12314.7169
BASE PRICE 1.47109812 0.20264544 7.25946827 4.0881E-11 1.06990803 1.87228821 1.06990803 1.87228821

Significant Bowler Variables: RUNS-C (number of runs conceded by a player, P-value: 0.02),
along with BASE PRICE (P-value: 4.0881E-11)
The rest of the variables are not significant, as their P-values are greater than 0.05
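
One plausible reading of how the predictors for the combined model were chosen is a simple screen on the P-values from the batting and bowling regressions, keeping every variable below 0.05. The report does not spell this step out, so the Python sketch below is an assumption about the procedure rather than a record of it; the P-values are copied from the two coefficient tables, with the intercepts left out.

```python
ALPHA = 0.05  # significance level used throughout the report

# P-values copied from the batting coefficient table.
batting_p_values = {
    "ODI-RUNS-S": 0.00907078, "ODI-SR-B": 0.5706616, "T-RUNS": 0.0012772,
    "RUNS-S": 0.00045053, "HS": 0.07059015, "AVE": 0.97844946,
    "SR-B": 0.76743986, "SIXERS": 0.74273385, "BASE PRICE": 6.4733e-11,
}

# P-values copied from the bowling coefficient table.
bowling_p_values = {
    "T-WKTS": 0.05903002, "ODI-WKTS": 0.23810383, "ODI-SR-BL": 0.42805799,
    "RUNS-C": 0.02240173, "WKTS": 0.07304868, "AVE-BL": 0.29345199,
    "SR-BL": 0.33592756, "BASE PRICE": 4.0881e-11,
}


def significant(p_values, alpha=ALPHA):
    """Names of the variables whose P-value is below alpha."""
    return {name for name, p in p_values.items() if p < alpha}


# Union of the significant variables from both regressions.
selected = sorted(significant(batting_p_values) | significant(bowling_p_values))
print(selected)
```

The five names this screen keeps are the same five predictors that appear in the final coefficient table.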
SUMMARY OUTPUT

SOLD PRICE = 70184.55 + (50.81)(ODI-RUNS-S) + (-64.47)(T-RUNS) + (136.67)(RUNS-C) + (262.70)(RUNS-S) + (1.37)(BASE PRICE)

Regression Statistics
Multiple R 0.72061096
R Square 0.51928016
Adjusted R Square 0.4998963
Standard Error 287686.066
Observations 130

ANOVA
df SS MS F Significance F
Regression 5 1.1086E+13 2.2172E+12 26.7893001 2.7409E-18
Residual 124 1.0263E+13 8.2763E+10
Total 129 2.1348E+13

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 70184.549 50638.8362 1.38598266 0.16823963 -30043.893 170412.991 -30043.893 170412.991
ODI-RUNS-S 50.8104732 17.0578821 2.97870936 0.00348313 17.0481463 84.5728001 17.0481463 84.5728001
T-RUNS -64.466029 17.6481175 -3.6528558 0.00038112 -99.396597 -29.535461 -99.396597 -29.535461
RUNS-C 136.673637 47.3109699 2.88883609 0.00456491 43.0319751 230.315299 43.0319751 230.315299
RUNS-S 262.696543 48.9130106 5.3706885 3.7273E-07 165.883994 359.509093 165.883994 359.509093
Base Price 1.36900935 0.18469117 7.41242433 1.6864E-11 1.00345378 1.73456492 1.00345378 1.73456492

FINAL ANALYSIS

R2 = 0.519, which means the model explains about 52% of the variance in Sold Price.

The model is statistically significant, as the Significance F value is 2.7409E-18
(approximately 0.000).

THE MODEL EQUATION:

SOLD PRICE = 70184.549 - 64.466 (T-RUNS) + 50.810 (ODI-RUNS-S) + 262.697 (RUNS-S) + 1.369 (BASE PRICE) + 136.674 (RUNS-C)
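
To show how a franchise might apply the final model, the sketch below wraps the coefficients from the final coefficient table in a small prediction function. It is a minimal Python illustration: the function name `predict_sold_price` and the example player are hypothetical, and the intercept from the coefficient table is included.

```python
# Coefficients copied from the final coefficient table.
INTERCEPT = 70184.549
COEFFICIENTS = {
    "ODI-RUNS-S": 50.8104732,
    "T-RUNS": -64.466029,
    "RUNS-C": 136.673637,
    "RUNS-S": 262.696543,
    "BASE PRICE": 1.36900935,
}


def predict_sold_price(player):
    """Fitted sold price for a dict of predictor values (hypothetical helper)."""
    return INTERCEPT + sum(
        coefficient * player[name] for name, coefficient in COEFFICIENTS.items()
    )


# Hypothetical player, used only to illustrate the call.
example_player = {
    "ODI-RUNS-S": 1000, "T-RUNS": 500, "RUNS-C": 200,
    "RUNS-S": 300, "BASE PRICE": 100000,
}
print(round(predict_sold_price(example_player), 2))
```

With R2 of about 0.52 and a standard error near 287686, any single prediction from this function carries a wide margin of uncertainty and is best treated as a rough guide.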
