Summarizing Data & Probability
UNIT 2

Summary Statistics: summarizing data, probability, random variables, bivariate random variables, probability distribution, Central Limit Theorem, etc.
Work effectively with Colleagues: introduction to work effectively, team work, effective communication skills, etc.

> grass
  rich graze
1   12   mow
2   15   mow
3   17   mow
4   11   mow
5   15   mow
6    8 unmow
7    9 unmow
8    7 unmow
9    9 unmow

a) summary(): It gives the summary statistics of data values: min, max, 1st quartile, 3rd quartile and mean/median.
> x <- c(1,2,3,4,5,6,7,8,9,10,11)
> summary(x)
> summary(grass)
      rich          graze
 Min.   : 7.00   mow  :5
 1st Qu.: 9.00   unmow:4
 Median :11.00
 Mean   :11.44
 3rd Qu.:15.00
 Max.   :17.00
> summary(graze)
   Length     Class      Mode
        9 character character
> summary(grass$graze)
  mow unmow
    5     4

b) str(): It gives the structure of the data object: the class of the object, the number of observations, and each variable's class and sample data.
Example:
> str(mtcars)
'data.frame': 32 obs. of 11 variables:
 $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num 160 160 108 258 360 ...
 $ hp  : num 110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num 2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num 16.5 17 18.6 19.4 17 ...
 $ vs  : num 0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num 1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
> str(grass)
'data.frame': 9 obs. of 2 variables:
 $ rich : int 12 15 17 11 15 8 9 7 9
 $ graze: Factor w/ 2 levels "mow","unmow": 1 1 1 1 1 2 2 2 2

c) tail(): It gives the last 6 observations of the given data object.
Examples:
> tail(iris)
> tail(mtcars)
                mpg cyl  disp  hp drat    wt qsec vs am gear carb
Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.7  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.9  1  1    5    2
Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.5  0  1    5    4
Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.5  0  1    5    6
Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.6  0  1    5    8
Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.6  1  1    4    2
> tail(HairEyeColor, 2)
> tail(state.x77, 2)
          Population Income Illiteracy Life Exp Murder HS Grad Frost  Area
Wisconsin       4589   4468        0.7    72.48    3.0    54.5   149 54464
Wyoming          376   4566        0.6    70.29    6.9    62.9   173 97203
> tail(grass)
  rich graze
4   11   mow
5   15   mow
6    8 unmow
7    9 unmow
8    7 unmow
9    9 unmow

d) head(): It displays the top 6 observations from the dataset.
Example:
> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> head(iris, 2)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
> head(grass)
  rich graze
1   12   mow
2   15   mow
3   17   mow
4   11   mow
5   15   mow
6    8 unmow

e) names(): It returns the column names.
> names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear" "carb"
> names(grass)
[1] "rich"  "graze"

f) nrow(): It returns the number of observations in the given dataset; ncol() returns the number of columns and dim() returns both.
> dim(mtcars)
[1] 32 11
> nrow(mtcars)
[1] 32
> ncol(mtcars)
[1] 11
> nrow(grass)
[1] 9

g) fix(): To edit the data in the given dataset. It opens the data in a Data Editor window.
> fix(mydFrame1)
[Figure: Data Editor window showing the mydFrame1 data frame]

h) with(): To refer to attribute (column) names directly, in place of using $ along with the attribute names.

i) aggregate(): To get the summary statistic of a specific column with respect to the different levels in the class attribute.
aggregate(x ~ y, data, mean)
Here x is numeric and y is factor type.
> aggregate(rich~graze, grass, mean)
  graze  rich
1   mow 14.00
2 unmow  8.25

j) subset(): To subset the data based on a condition.
subset(data, x>7, select=c(x,y))
x is one of the variables in data.
select: to get the subset with the columns in the specified order.
> subset(grass, rich>7, select=c(graze,rich))
  graze rich
1   mow   12
2   mow   15
3   mow   17
4   mow   11
5   mow   15
6 unmow    8
7 unmow    9
9 unmow    9

Lab activity:
A researcher wants to understand the data collected by him about 3 species of flowers. He wants the following:

1. The summary of 150 flower data including Sepal Length, Sepal Width, Petal Length and Petal Width. He also wants the summary of Sepal Length vs Petal Length.
Solution:
To summarize data in R Studio we use majorly two functions, summary and aggregate.
Using summary command:
> summary_iris <- summary(iris)
> summary_iris
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species   ratio_sepal_petal
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50   Min.   :1.050
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50   1st Qu.:1.230
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50   Median :1.411
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                   Mean   :2.018
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                   3rd Qu.:3.176
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500                   Max.   :4.833
We get Min, Max, 1st Quartile, 3rd Quartile, Median and Mean as an output of the summary() command.

2. He wants to understand the mean of Sepal Length of each species.
Solution:
For getting detailed output of one or more functions we use the aggregate() command.
Using aggregate() command:
> aggregate(Sepal.Length ~ Species, iris, mean)
     Species Sepal.Length
1     setosa        5.006
2 versicolor        5.936
3  virginica        6.588

3. He wants to segregate the data of flowers having Sepal length greater than 7.
Note on the aggregate() example: there we calculated the mean sepal length of the different species. Similarly we can calculate other functions also, like frequency, median, etc. For more details on the arguments of the function we use the ?aggregate command to get help.

Solution (flowers having Sepal length greater than 7):
We use the subset() function to form subsets of data.
Using subset() command:
> sepalsub <- subset(iris, Sepal.Length>7)
> sepalsub
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species ratio_sepal_petal
103          7.1         3.0          5.9         2.1 virginica          1.203390
106          7.6         3.0          6.6         2.1 virginica          1.151515
108          7.3         2.9          6.3         1.8 virginica          1.158730
110          7.2         3.6          6.1         2.5 virginica          1.180328
118          7.7         3.8          6.7         2.2 virginica          1.149254
119          7.7         2.6          6.9         2.3 virginica          1.115942
123          7.7         2.8          6.7         2.0 virginica          1.149254
126          7.2         3.2          6.0         1.8 virginica          1.200000
130          7.2         3.0          5.8         1.6 virginica          1.241379
131          7.4         2.8          6.1         1.9 virginica          1.213115
132          7.9         3.8          6.4         2.0 virginica          1.234375
136          7.7         3.0          6.1         2.3 virginica          1.262295

4. He wants to segregate the data of flowers having Sepal length greater than 7 and Sepal width greater than 3 simultaneously.
Solution:
When we have to use more than 1 condition, we combine the conditions with & as shown below.
> sepalsub <- subset(iris, Sepal.Length>7 & Sepal.Width>3)
> sepalsub
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species ratio_sepal_petal
110          7.2         3.6          6.1         2.5 virginica          1.180328
118          7.7         3.8          6.7         2.2 virginica          1.149254
126          7.2         3.2          6.0         1.8 virginica          1.200000
132          7.9         3.8          6.4         2.0 virginica          1.234375

For getting only the few columns that are required, we use the select argument of subset():
> sepalsub <- subset(iris, Sepal.Length>7 & Sepal.Width>3, select=c(Sepal.Length, Sepal.Width))
> sepalsub
    Sepal.Length Sepal.Width
110          7.2         3.6
118          7.7         3.8
126          7.2         3.2
132          7.9         3.8

5. He wants to view the first 7 rows of data.
Solution:
For subsetting data without any condition, just based on rows and columns, we use square brackets.
> iris1 <- iris[1:7,]
> iris1
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species ratio_sepal_petal
1          5.1         3.5          1.4         0.2  setosa          3.642857
2          4.9         3.0          1.4         0.2  setosa          3.500000
3          4.7         3.2          1.3         0.2  setosa          3.615385
4          4.6         3.1          1.5         0.2  setosa          3.066667
5          5.0         3.6          1.4         0.2  setosa          3.571429
6          5.4         3.9          1.7         0.4  setosa          3.176471
7          4.6         3.4          1.4         0.3  setosa          3.285714

6. He wants to view the first 3 rows and first 3 columns of data.
Solution:
> iris11 <- iris[1:3,1:3]
> iris11
  Sepal.Length Sepal.Width Petal.Length
1          5.1         3.5          1.4
2          4.9         3.0          1.4
3          4.7         3.2          1.3

2.2 Basics of Probability
We shall introduce some of the basic concepts of probability theory by defining some terminology relating to random experiments (i.e., experiments whose outcomes are not predictable).

2.2.1 Terminology

Def. Outcome
The end result of an experiment. For example, if the experiment consists of throwing a die, the outcome would be any one of the six faces; in tossing a coin, the outcome is H or T.

Def. Random experiment
An experiment whose outcomes are not known in advance. If an experiment is conducted any number of times under essentially identical conditions, has a set of all possible outcomes associated with it, and its result is not certain but is any one of the several possible outcomes, it is known as a Random Experiment.
Ex: Throwing a fair die, tossing an honest coin, measuring the noise voltage at the terminals of a resistor, etc.

Def. Sample space
The sample space of a random experiment is the set of all possible outcomes of the experiment. We denote the sample space by S.
Ex: In a random experiment of tossing 2 coins, S = {HH, HT, TH, TT}. In the case of a die, S = {1,2,3,4,5,6}.
Each outcome of the experiment is represented by a point in S and is called a sample point. We use s (with or without a subscript) to denote a sample point.
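The sample spaces given above can also be listed in R. The following is only a minimal sketch: the variable names (S_coins, S_die, S_two_dice) are illustrative and do not come from the notes, and expand.grid() is just one convenient way to enumerate every combination of outcomes.

```r
# Sample space for tossing two coins: every ordered pair of H and T.
coins <- expand.grid(first = c("H", "T"), second = c("H", "T"),
                     stringsAsFactors = FALSE)
S_coins <- paste0(coins$first, coins$second)

# Sample space for throwing one die: the six faces.
S_die <- 1:6

# Sample space for throwing a pair of dice: one row per sample point.
S_two_dice <- expand.grid(die1 = 1:6, die2 = 1:6)

print(S_coins)            # the four sample points, in expand.grid order
print(length(S_die))      # number of sample points for one die
print(nrow(S_two_dice))   # number of sample points for a pair of dice
```

Each element of S_coins and each row of S_two_dice is one sample point s of the corresponding sample space.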
An event on the sample space is represented by an appropriate collection of sample point(s).

Def. Equally Likely Events
Events are said to be equally likely when there is no reason to expect any one of them rather than any one of the others.

Def. Exhaustive Events
All possible events in any trial are known as exhaustive events.
Ex: In tossing a coin, there are two exhaustive elementary events, Head and Tail. In drawing 3 balls out of 9 in a box, there are $\binom{9}{3}$ exhaustive elementary events.

Def. Mutually exclusive (disjoint) events
Two events A and B are said to be mutually exclusive if they have no common elements (or outcomes). Hence if A and B are mutually exclusive, they cannot occur together, i.e., the happening of any one of the events in a trial excludes the happening of any of the others. (Two or more of the events can't happen simultaneously in the same trial.)
Ex: In tossing a coin, the occurrence of the outcome 'Head' excludes the occurrence of 'Tail'.

Classical Definition of Probability:
In a random experiment, let there be n mutually exclusive and equally likely elementary events. Let E be an event of the experiment. If m elementary events are favourable to E, then the probability of E (chance of occurrence of E) is defined as

$P(E) = \frac{m}{n} = \frac{\text{No. of events favourable to } E}{\text{Total no. of events}}$

Note: Since $0 \le m \le n$, we have $0 \le P(E) \le 1$.

[Figure: Probability distributions by type of random variable. Discrete random variable: Binomial, Hypergeometric, Poisson. Continuous random variable: Normal, Uniform, Exponential.]

2.7 Probability Distribution Function
It defines the probability of outcomes based on certain conditions.
Based on the conditions, there are majorly 5 types of probability distribution functions (PDF).

Types of Probability Distribution:
* Binomial Distribution
* Poisson Distribution
* Continuous Uniform Distribution
* Exponential Distribution
* Normal Distribution
* Chi-squared Distribution
* Student t Distribution
* F Distribution

Binomial Distribution
The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is p, then the probability of having x successful outcomes in an experiment of n independent trials is as follows.

$f(x) = \binom{n}{x} p^{x} (1-p)^{n-x}$, where x = 0, 1, 2, ..., n

Problem
Ex: Find the probability of getting 3 doublets when a pair of fair dice are thrown 10 times.
Solution
n = no. of trials = 10
p = probability of success, i.e., getting a doublet = 6/36 = 1/6
q = probability of failure = 1 - p = 1 - (1/6) = 5/6
r = no. of successes expected = 3

$P(X = 3) = \binom{n}{r} p^{r} q^{n-r} = \binom{10}{3} \left(\frac{1}{6}\right)^{3} \left(\frac{5}{6}\right)^{7}$

This can be computed in R as:
> choose(10,3)*((1/6)^3*(5/6)^7)
[1] 0.1550454
This binomial probability can also be found using the built-in function in R as:
> dbinom(3, size=10, prob=(1/6))
[1] 0.1550454

Problem: From the above problem find the probability of getting 3 or fewer doublets.
Solution:
> choose(10,0)*((1/6)^0*(5/6)^10) +
+ choose(10,1)*((1/6)^1*(5/6)^9) +
+ choose(10,2)*((1/6)^2*(5/6)^8) +
+ choose(10,3)*((1/6)^3*(5/6)^7)
[1] 0.9302722
This can be obtained using the cumulative binomial distribution function as:
> pbinom(3, size=10, prob=(1/6), lower.tail=T)
[1] 0.9302722
> pbinom(3, size=10, prob=(1/6), lower.tail=F)   # probability of getting 4 or more doublets
[1] 0.06972784
Note: 0.9302722 + 0.06972784 = 1

Problem 2
Suppose there are twelve multiple choice questions in an English class quiz.
Each question has five possible answers, and only one of them is correct. Find the probability of having four or fewer correct answers if a student attempts to answer every question at random.

Solution
Since only one out of five possible answers is correct, the probability of answering a question correctly at random is 1/5 = 0.2. We can find the probability of having exactly 4 correct answers by random attempts as follows.

dbinom(x, size, prob)
x: no. of successful outcomes (favourable)
size: n, no. of independent trials
prob: p, probability of a successful trial

> dbinom(4, size=12, prob=0.2)
[1] 0.1329

To find the probability of having four or fewer correct answers by random attempts, we apply the function dbinom with x = 0,...,4.
> dbinom(0, size=12, prob=0.2) +
+ dbinom(1, size=12, prob=0.2) +
+ dbinom(2, size=12, prob=0.2) +
+ dbinom(3, size=12, prob=0.2) +
+ dbinom(4, size=12, prob=0.2)
[1] 0.92744

Alternatively, we can use the cumulative probability function for the binomial distribution, pbinom.
> pbinom(4, size=12, prob=0.2)
[1] 0.92744

Answer
The probability of four or fewer questions answered correctly at random in a twelve question multiple choice quiz is 92.7%.

Note: dbinom gives the density, pbinom gives the cumulative distribution function, qbinom gives the quantile function and rbinom generates random deviates.
* dbinom(x, size, prob) finds the probability P(X = x) for the given x values. The probabilities for x = 0,...,10 can be stored in a vector, say probability, and plotted with plot(0:10, probability, type = "l").
* pbinom(q, size, prob) finds the cumulative probability P(X <= q).
* qbinom(p, size, prob) finds the value of x for a given cumulative probability p.
* rbinom(n, size, prob) generates n random values from the distribution.
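The relationship between the four binomial functions above can be checked with a short script. This is a minimal sketch for the quiz problem (n = 12, p = 0.2); the variable names and the seed are illustrative choices, not part of the notes.

```r
n <- 12      # number of questions (trials)
p <- 0.2     # probability of a correct answer by guessing
x <- 0:n     # every possible number of correct answers

pmf <- dbinom(x, size = n, prob = p)   # P(X = x) for each x
cdf <- pbinom(x, size = n, prob = p)   # P(X <= x) for each x

# The cumulative distribution is the running sum of the probabilities.
same_cdf <- isTRUE(all.equal(cumsum(pmf), cdf))

# Lower and upper tails around 4 correct answers.
p_at_most_4  <- pbinom(4, size = n, prob = p)
p_at_least_5 <- pbinom(4, size = n, prob = p, lower.tail = FALSE)

# Quantile function: smallest x with P(X <= x) >= 0.5.
median_correct <- qbinom(0.5, size = n, prob = p)

# Random deviates: five simulated quiz scores (seed chosen arbitrarily).
set.seed(1)
simulated_scores <- rbinom(5, size = n, prob = p)

print(same_cdf)
print(round(p_at_most_4, 5))
print(p_at_most_4 + p_at_least_5)
```

The two tails always add up to 1, which is the same check as the note 0.9302722 + 0.06972784 = 1 in the doublets problem.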
Poisson Distribution
The Poisson distribution is the probability distribution of independent event occurrences in an interval. If $\lambda$ is the mean occurrence per interval, then the probability of having x occurrences within a given interval is:

$f(x) = \frac{\lambda^{x} e^{-\lambda}}{x!}$, where x = 0, 1, 2, ...

Problem
If there are twelve cars crossing a bridge per minute on average, find the probability of having seventeen or more cars crossing the bridge in a particular minute.

Solution
The probability of having sixteen or fewer cars crossing the bridge in a particular minute is given by the function ppois.
> ppois(16, lambda=12)   # lower tail
[1] 0.89871
Hence the probability of having seventeen or more cars crossing the bridge in a minute is in the upper tail of the distribution.
> ppois(16, lambda=12, lower.tail=FALSE)   # upper tail
[1] 0.10129
Similarly we can find the following:
> rpois(10, lambda=12)
 [1] 17 10  8 22  5 10 12 12  7 12
> dpois(16, lambda=12)
[1] 0.05429334

Answer
If there are twelve cars crossing a bridge per minute on average, the probability of having seventeen or more cars crossing the bridge in a particular minute is 10.1%.

Normal Distribution
The normal distribution is defined by the following probability density function, where $\mu$ is the population mean and $\sigma^{2}$ is the variance.

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^{2}/(2\sigma^{2})}$

If a random variable X follows the normal distribution, then we write $X \sim N(\mu, \sigma^{2})$.
In particular, the normal distribution with $\mu = 0$ and $\sigma = 1$ is called the standard normal distribution, and is denoted as N(0,1). It can be graphed as follows.

[Figure 1: Normal Distribution (y-axis: dnorm(x))]

Figure 1 shows the normal distribution of sample data. The shape of a normal curve is highly dependent on the standard deviation.

Importance of Normal Distribution:
* Normal distribution is a continuous distribution that is "bell-shaped".
* Data are often assumed to be normal.
* Normal distributions can estimate probabilities over a continuous interval of data values.

Properties:
The normal distribution f(x), with any mean $\mu$ and any positive deviation $\sigma$, has the following properties:
* It is symmetric around the point x = $\mu$, which is at the same time the mode, the median and the mean of the distribution.
* It is unimodal: its first derivative is positive for x < $\mu$, negative for x > $\mu$, and zero only at x = $\mu$.
* Its density has two inflection points (where the second derivative of f is zero and changes sign), located one standard deviation away from the mean, at x = $\mu - \sigma$ and x = $\mu + \sigma$.
* Its density is log-concave.
* Its density is infinitely differentiable, indeed supersmooth of order 2.
* Its second derivative f''(x) is equal to twice its derivative with respect to its variance $\sigma^{2}$.

[Figure: The Normal Distribution, showing the probability of cases in portions of the curve by standard deviations from the mean]

Example: for exam scores taken to be normally distributed with mean 72 and standard deviation 15.2, the proportion scoring 84 or more is the upper tail:
> pnorm(84, mean=72, sd=15.2, lower.tail=FALSE)
[1] 0.21492
Answer
The percentage of students scoring 84 or more in the college entrance exam is 21.5%.

Usage
dnorm(x, mean = 0, sd = 1, log = FALSE)
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

Arguments
x, q: vector of quantiles.
p: vector of probabilities.
n: number of observations. If length(n) > 1, the length is taken to be the number required.
mean: vector of means.
sd: vector of standard deviations.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X <= x], otherwise, P[X > x].
The defaults are mean = 0 and sd = 1, as in rnorm(n, mean = 0, sd = 1).

The central limit theorem and the law of large numbers are the two fundamental theorems of probability.
Roughly, the central limit theorem states that the distribution of the sum (or average) of a large number of independent, identically distributed variables will be approximately normal, regardless of the underlying distribution. The importance of the central limit theorem is hard to overstate; indeed it is the reason that many statistical procedures work.
The CLT says that if you take many repeated samples from a population, and calculate the average or sum of each one, the collection of those averages will be approximately normally distributed, and it doesn't matter what the shape of the source distribution is.

Lab Activity:
To generate 20 numbers with a mean of 5 and a standard deviation of 1:
> rnorm(20, mean = 5, sd = 1)
 [1] 5.610090 5.042731 5.120978 4.582450 5.015839 3.577376 5.159308 6.496983
 [9] 3.071729 6.187525 5.027074 3.517274 4.393562 3.866088 4.533490 6.021554
[17] 5.359491 5.265780 3.817124 5.855315
> pnorm(5, mean = 5, sd = 1)
[1] 0.5
> dnorm(c(4,5,6), mean = 5, sd = 1)
[1] 0.2419707 0.3989423 0.2419707

Lab Activity 3: Probability Theories
1. If you throw a die 20 times, what is the probability that you get the following results:
a. 3 sixes
Solution:
> dbinom(x=3, 20, prob=1/6)
[1] 0.2378866
b. 6 sixes
> dbinom(x=6, 20, prob=1/6)
[1] 0.06470515
c. 3 or fewer sixes (cumulative probability)
> pbinom(q=3, 20, prob=1/6, lower.tail=T)
[1] 0.5665456

2. Check whether Sepal Length is normally distributed or not.
To find if the Sepal Length is normally distributed or not we use 2 commands, qqnorm() and qqline().
> qqnorm(iris$Sepal.Length)
> qqline(iris$Sepal.Length, col="red")
qqnorm() plots the sample quantiles of the data against the theoretical normal quantiles, while qqline() draws the line on which the points would lie if the data were normally distributed. The deviation of the plot from the line shows that the data is not normally distributed.

[Figure 3: Normal Q-Q Plot of Sepal Length]

3.
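The central limit theorem stated earlier can be tied to the Q-Q check used above with a small simulation. This is only a sketch under assumed settings: the exponential source distribution, the sample size of 30, the 5000 repetitions and the seed are arbitrary choices, and the correlation between theoretical and sample quantiles is used here merely as a rough numeric stand-in for "how close the Q-Q points are to a straight line".

```r
set.seed(42)            # arbitrary seed, only for reproducibility
n_per_sample <- 30      # size of each repeated sample (assumed)
n_samples <- 5000       # number of repeated samples (assumed)

# A clearly non-normal source: exponential with mean 1 and sd 1.
raw_draws <- rexp(n_samples, rate = 1)

# Average of each repeated sample drawn from the same source.
sample_means <- replicate(n_samples, mean(rexp(n_per_sample, rate = 1)))

# Q-Q coordinates without drawing; a correlation near 1 means a nearly straight Q-Q plot.
qq_raw <- qqnorm(raw_draws, plot.it = FALSE)
qq_means <- qqnorm(sample_means, plot.it = FALSE)
cor_raw <- cor(qq_raw$x, qq_raw$y)
cor_means <- cor(qq_means$x, qq_means$y)

cat("mean of sample means:", mean(sample_means), "\n")
cat("sd of sample means:  ", sd(sample_means), "\n")
cat("Q-Q correlation, raw draws:   ", cor_raw, "\n")
cat("Q-Q correlation, sample means:", cor_means, "\n")

# Draw the plot only in an interactive session.
if (interactive()) {
  qqnorm(sample_means)
  qqline(sample_means, col = "red")
}
```

The averages should centre near the source mean of 1 with a spread near 1/sqrt(30), and their Q-Q points should follow the line much more closely than those of the raw exponential draws.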
Test whether the mean of the first 10 Sepal Length values is significantly different from the population mean of Sepal Length.

T-Test of a sample subset of the iris data set:
> mean(iris$Sepal.Length)
[1] 5.843333
> iris.sub <- iris[1:10,1:1]
> t.test(iris.sub, alternative="less", mu=5.843)

        One Sample t-test

data:  iris.sub
t = -10.669, df = 9, p-value = 1.041e-06
alternative hypothesis: true mean is less than 5.843
95 percent confidence interval:
     -Inf 5.028894
sample estimates:
mean of x
     4.86

Here the p-value is much less than 0.05. So we reject the null hypothesis and accept the alternative hypothesis, which says that the mean of the sample is less than the population mean.
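The decision step described above can also be read directly from the object that t.test() returns, instead of from the printed report. A minimal sketch, assuming the same first 10 rows and the same reference value 5.843 as in the activity; the variable names and the 0.05 threshold variable are illustrative.

```r
# First 10 Sepal.Length values, as in the lab activity (iris ships with R).
sample_lengths <- iris$Sepal.Length[1:10]
reference_mean <- 5.843   # value passed as mu in the activity

# One-sided, one-sample t-test: is the true mean less than reference_mean?
result <- t.test(sample_lengths, alternative = "less", mu = reference_mean)

alpha <- 0.05             # significance level used in the text
reject_null <- result$p.value < alpha

cat("t statistic:", unname(result$statistic), "\n")
cat("p-value:", result$p.value, "\n")
cat("sample mean:", unname(result$estimate), "\n")
cat("reject H0 at the 5% level:", reject_null, "\n")
```

The components statistic, p.value and estimate hold the same numbers that appear in the printed One Sample t-test output.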
