In probability theory, the normal (or Gaussian) distribution is a very common continuous
probability distribution. Normal distributions are important in statistics and are often used
in the natural and social sciences to represent real-valued random variables whose
distributions are not known.
rnorm()
To draw random numbers from the normal distribution use the rnorm function, which,
optionally, allows the specification of the mean and standard deviation.
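For example, five draws from the standard normal, and five from a normal with mean 10 and standard deviation 2 (being random, your values will differ):
> rnorm(5)
> rnorm(5, mean = 10, sd = 2)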
dnorm()
The probability density function contains information about probability, but its value is not itself a probability: it can be any positive number, even larger than one. It is defined only for continuous random variables.
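For instance, dnorm returns the height of the density curve at a point; with a small standard deviation that height can exceed one:
> dnorm(0)
[1] 0.3989423
> dnorm(0, sd = 0.1)
[1] 3.989423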
pnorm()
Similarly, pnorm calculates the cumulative distribution function of the normal distribution; that is, the probability that a given number, or a smaller number, occurs. This is defined as
F(x) = P(X \le x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-\mu)^2/(2\sigma^2)}\, dt
If you wish to find the probability that a number is larger than the given number you can
use the lower.tail option:
> pnorm(0,lower.tail=FALSE)
> pnorm(1,lower.tail=FALSE)
> pnorm(0,mean=2,lower.tail=FALSE)
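The lower.tail=FALSE result is simply the complement of the usual cumulative probability, as a quick check shows:
> pnorm(1, lower.tail = FALSE)
[1] 0.1586553
> 1 - pnorm(1)
[1] 0.1586553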
Examples:
Question:
Suppose widgits produced at Acme Widgit Works have weights that are normally distributed with mean 17.46 grams and variance 375.67 grams². What is the probability that a randomly chosen widgit weighs more than 19 grams?
Answer: What is P(X > 19) when X has the N(17.46, 375.67) distribution?
R wants the s. d. as the parameter, not the variance. We'll need to take a square root!
> pnorm(19, mean=17.46, sd=sqrt(375.67), lower.tail=FALSE)
[1] 0.4683356
qnorm()
qnorm() is the inverse of pnorm(): you give it a probability, and it returns the number whose cumulative distribution matches that probability. For example, if you have a normally distributed random variable with mean zero and standard deviation one, then if you give the function a probability it returns the associated Z-score:
> qnorm(0.5)
> qnorm(0.5,mean=1)
> qnorm(0.5,mean=1,sd=2)
> qnorm(0.5,mean=2,sd=2)
> qnorm(0.5,mean=2,sd=4)
> qnorm(0.25,mean=2,sd=2)
> qnorm(0.333)
> qnorm(0.333,sd=3)
> qnorm(0.75,mean=5,sd=2)
> x <- seq(0,1,by=.01)
> y <- qnorm(x)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=2)
> plot(x,y)
> y <- qnorm(x,mean=3,sd=0.1)
> plot(x,y)
Example:
Question: Suppose IQ scores are normally distributed with mean 100 and standard
deviation 15. What is the 95th percentile of the distribution of IQ scores?
Answer: What is F⁻¹(0.95) when X has the N(100, 15²) distribution?
> qnorm(0.95, mean=100, sd=15)
[1] 124.6728
dnorm(x, mean, sd) – density function – takes a number and returns the height of the density curve at that point
pnorm(q, mean, sd) – distribution function – takes a number and returns the cumulative probability up to it
qnorm(p, mean, sd) – quantile function – takes a probability and returns a number
rnorm(n, mean, sd) – random number generation function – takes a count n and generates that many random numbers
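As a quick check that qnorm and pnorm are inverses of one another:
> pnorm(qnorm(0.75))
[1] 0.75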
BINOMIAL DISTRIBUTION:
The probability distribution of the random variable X is called a binomial distribution, and is given by the formula
P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n
where
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
and n is the number of trials and p is the probability of success of a trial. The mean is np and the variance is np(1 − p). When n = 1 this reduces to the Bernoulli distribution.
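These mean and variance facts can be checked numerically with dbinom; a small sketch for n = 10 and p = 0.4, so np = 4 and np(1 − p) = 2.4:
> n <- 10; p <- 0.4; k <- 0:n
> sum(k * dbinom(k, n, p))             # mean = np
[1] 4
> sum((k - n*p)^2 * dbinom(k, n, p))   # variance = np(1-p)
[1] 2.4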
The binomial distribution model deals with finding the probability of success of an event which
has only two possible outcomes in a series of experiments.
In probability theory and statistics, the binomial distribution with parameters n and p is the discrete probability distribution of the number of successes in a sequence of n independent experiments, each asking a yes–no question, and each with its own Boolean-valued outcome: a random variable containing a single bit of information: success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p).
R has four built-in functions for the binomial distribution: dbinom(x, size, prob), pbinom(q, size, prob), qbinom(p, size, prob) and rbinom(n, size, prob), where
• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.
rbinom()
Generating random numbers from the binomial distribution is not simply generating uniform random numbers but rather generating the number of successes in a series of independent trials.
A random value with n=1 (one observation), size=10 (10 trials) and prob=0.4 (probability of success 0.4) is
> rbinom(n=1,size=10,prob=0.4)
> rbinom(n=8,size=150,prob=.4)
> rbinom(n=10,size=150,prob=.4)
> rbinom(8,150,.4)
Setting size to 1 turns the numbers into a Bernoulli random variable, which can take on only the
value 1 (success) or 0 (failure).
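For example, ten Bernoulli draws, each either 0 or 1 (your random values will differ):
> rbinom(10, size = 1, prob = 0.5)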
dbinom()
The density function dbinom gives the probability of an exact value in the distribution, using the probability mass function given above.
Question: Suppose widgits produced at Acme Widgit Works have probability 0.005 of being
defective. Suppose widgits are shipped in cartons containing 25 widgits. What is the probability
that a randomly chosen carton contains exactly one defective widgit?
Answer: What is P(X = 1) when n=25, prob=0.005?
> dbinom(1,25,.005)
[1] 0.1108317
pbinom()
The cumulative distribution function gives the cumulative probability up to and including a given value:
> pbinom(q = 3, size = 10, prob = 0.3)
> pbinom(q = 1:10, size = 10, prob = 0.3)
> x <- 1:50
> y<-pbinom(x,10,.3)
> plot(x,y,main="pbinom()",xlab="random values",ylab="cumulative probability CDF")
Question: Suppose widgits produced at Acme Widgit Works have probability 0.005 of being
defective. Suppose widgits are shipped in cartons containing 25 widgits. What is the probability
that a randomly chosen carton contains no more than one defective widgit?
Answer: What is P(X ≤ 1) when size=25, prob=0.005?
> pbinom(1,25,.005)
[1] 0.9930519
qbinom()
This function takes a probability and returns the number whose cumulative probability matches it; that is, given a certain probability, qbinom returns the quantile, which for this distribution is the number of successes.
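For instance, the median and quartiles of a binomial distribution with size=10 and prob=0.3:
> qbinom(0.5, size = 10, prob = 0.3)
[1] 3
> qbinom(c(0.25, 0.5, 0.75), size = 10, prob = 0.3)
[1] 2 3 4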
POISSON DISTRIBUTION:
In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space. It is popular for modelling the number of times an event occurs in an interval of time or space.
A discrete random variable X is said to have a Poisson distribution with parameter λ > 0 if, for k = 0, 1, 2, ..., the probability mass function of X is given by
P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}
where
• e is Euler's number (e = 2.71828...)
• k! is the factorial of k.
R provides the usual four functions, dpois(x, lambda), ppois(q, lambda), qpois(p, lambda) and rpois(n, lambda), where
• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations.
• lambda is the mean number of events per interval (the rate parameter λ).
rpois()
> rpois(1,lambda=1)
> rpois(10,lambda=1)
dpois()
The Poisson probability mass function evaluated by dpois is the formula given above:
> dpois(1,lambda=1)
> dpois(10,lambda=1)
> dpois(1:10,lambda=1)
> dpois(c(1,5,9),lambda=1)
Question: If there are twelve cars crossing a bridge per minute on average, find the probability that exactly sixteen cars cross the bridge in a particular minute.
Answer: The probability of having sixteen cars crossing the bridge in a particular minute is given
by the function dpois.
> dpois(16,lambda = 12)
[1] 0.05429334
ppois()
The cumulative distribution function of a Poisson distribution is given as
P(X \le k) = e^{-\lambda} \sum_{i=0}^{k} \frac{\lambda^i}{i!}
> ppois(1,lambda=1)
> ppois(10,lambda=1)
> ppois(1:10,lambda=1)
> ppois(c(1,5,9),lambda=1)
Question: If there are twelve cars crossing a bridge per minute on average, find the probability of
having seventeen or more cars crossing the bridge in a particular minute.
Answer: The probability of having sixteen or fewer cars crossing the bridge in a particular minute is given by the function ppois; the probability of seventeen or more is its complement, obtained with lower.tail=FALSE.
> ppois(16, lambda = 12)
[1] 0.898709
> ppois(16, lambda = 12, lower.tail = FALSE)
[1] 0.101291
qpois()
This function takes a probability as input and returns the smallest number of events k such that the cumulative probability P(X ≤ k) is at least that probability.
> qpois(1,lambda=1)
> qpois(0.5,lambda=1)
> qpois(seq(0,1,.1),lambda=1)
> qpois(c(0.1,0.5,0.9),lambda=1)
OTHER DISTRIBUTIONS:
R supports many distributions, some of which are very common, while others are quite obscure. All follow the same naming convention seen above: a d, p, q or r prefix attached to the distribution's root name, for example unif (uniform), exp (exponential), gamma, beta, t (Student's t), chisq (chi-squared) and f (F).
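A quick illustration of this convention with the exponential distribution (rate 1):
> dexp(1)      # density at 1
[1] 0.3678794
> pexp(1)      # P(X <= 1)
[1] 0.6321206
> qexp(0.5)    # median
[1] 0.6931472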
BASIC STATISTICS:
The sample mean is the average and is computed as the sum of all the observed outcomes from the sample divided by the total number of events. We use x̄ as the symbol for the sample mean. In math terms,
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
where n is the sample size and the x_i correspond to the observed values.
The mode of a set of data is the number with the highest frequency.
The median is the middle score. If we have an even number of events we take the average of the two middle values.
The variance and the closely related standard deviation are measures of how spread out a distribution is. In other words, they are measures of variability.
> x<-1:5
> mean(x)
> median(x)
> sd(x)
> var(x)
> min(x)
> max(x)
> summary(x)
Correlation
Correlation is used to test relationships between quantitative variables or categorical
variables. In other words, it’s a measure of how things are related. The study of how variables
are correlated is called correlation analysis.
Some examples of data that have a high correlation:
• Your caloric intake and your weight.
• Your eye color and your relatives’ eye colors.
• The amount of time you study and your GPA.
Some examples of data that have a low correlation (or none at all):
• The cost of a car wash and how long it takes to buy a soda inside the station.
Correlations are useful because if you can find out what relationship variables have, you can make predictions about future behavior.
A correlation coefficient is a way to put a value to the relationship. Correlation coefficients
have a value between -1 and 1. A “0” means there is no relationship between the
variables at all, while -1 or 1 means that there is a perfect negative or positive
correlation (negative or positive correlation here refers to the type of graph the
relationship will produce).
The Pearson correlation coefficient is computed as
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y}
where x̄ and ȳ are the means of x and y, and s_x and s_y are the standard deviations of x and y.
It can range between -1 and 1, with higher positive numbers meaning a closer relationship between the two variables, lower negative numbers meaning an inverse relationship, and numbers near zero meaning no relationship.
To compute the correlation between two columns of a numeric data frame df, or the full correlation matrix:
> cor(df[,1], df[,2])
> cor(df)
To visualize the correlations between variables use the pairs() function instead of plot():
> pairs(df)
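A small worked example with two short vectors:
> x <- c(1, 2, 3, 4, 5)
> y <- c(2, 4, 5, 4, 5)
> cor(x, y)
[1] 0.7745967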
Similar to correlation is covariance, which is like a variance between variables. Covariance is a measure of the degree to which two variables (for example, returns on two risky assets) move in tandem. A positive covariance means the values move together, while a negative covariance means they move inversely.
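Continuing the small example above, covariance relates to correlation via r = cov(x, y)/(s_x s_y):
> cov(x, y)
[1] 1.5
> cov(x, y) / (sd(x) * sd(y))
[1] 0.7745967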
The T Score:
The t score is a ratio of the difference between two group means to the variation within the groups. The larger the t score, the more difference there is between groups. The smaller the t score, the more similarity there is between groups. When you run a t-test, the bigger the t-value, the more likely it is that the results are repeatable.
Every t-value has a p-value to go with it. A p-value is the probability that the results from your sample data occurred by chance. P-values range from 0% to 100% and are usually written as a decimal; for example, a p-value of 5% is 0.05. Low p-values are good; they indicate your data did not occur by chance alone. For example, a p-value of .01 means there is only a 1% probability that the results from an experiment happened by chance. In most cases, a p-value of 0.05 (5%) is accepted as the threshold for statistical significance.
The R function t.test() can be used to perform both one- and two-sample t-tests on vectors of data:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95)
Here x is a numeric vector of data values and y is an optional numeric vector of data values. If y is excluded, the function performs a one-sample t-test on the data contained in x; if it is included, it performs a two-sample t-test using both x and y.
The option mu provides a number indicating the true value of the mean (or difference in
means if you are performing a two sample test) under the null hypothesis. The option
alternative is a character string specifying the alternative hypothesis, and must be one
of the following: "two.sided" (which is the default), "greater" or "less" depending on
whether the alternative hypothesis is that the mean is different from, greater than, or less than mu, respectively. For example, the following call:
> t.test(x, alternative="less", mu=10)
performs a one-sample t-test on the data contained in x where the null hypothesis is that μ = 10 and the alternative is that μ < 10.
The option paired indicates whether or not you want a paired t-test (TRUE = yes and
FALSE = no). If you leave this option out it defaults to FALSE.
The option var.equal is a logical variable indicating whether or not to assume the two
variances as being equal when performing a two-sample t-test. If TRUE then the pooled
variance is used to estimate the variance otherwise the Welch (or Satterthwaite)
approximation to the degrees of freedom is used. If you leave this option out it defaults
to FALSE.
Finally, the option conf.level determines the confidence level of the reported confidence interval for μ in the one-sample case and for μ1 − μ2 in the two-sample case.
A. One-sample t-tests
Let μ be the mean level of Salmonella in all batches of ice cream. Here the hypothesis of interest can be expressed as:
H0: μ = 0.3
Ha: μ > 0.3
Hence, we will need to include the options alternative="greater", mu=0.3. Below is the
relevant R-code:
> x = c(0.593, 0.142, 0.329, 0.691, 0.231, 0.793, 0.519, 0.392, 0.418)
> t.test(x, alternative="greater", mu=0.3)
data: x
t = 2.2051, df = 8, p-value = 0.02927
alternative hypothesis: true mean is greater than 0.3
From the output we see that the p-value = 0.029. Hence, there is moderately strong
evidence that the mean Salmonella level in the ice cream is above 0.3 MPN/g.
B. Two-sample t-tests
Ex. 6 subjects were given a drug (treatment group) and an additional 6 subjects a
placebo (control group). Their reaction time to a stimulus was measured (in ms). We
want to perform a two-sample t-test for comparing the means of the treatment and
control groups.
Let μ1 be the mean of the population taking medicine and μ2 the mean of the untreated population. Here the hypothesis of interest can be expressed as:
H0: μ1 − μ2 = 0
Ha: μ1 − μ2 < 0
Here we will need to include the data for the treatment group in x and the data for the
control group in y. We will also need to include the options alternative="less", mu=0.
Finally, we need to decide whether or not the standard deviations are the same in both
groups.
Below is the relevant R-code when assuming equal standard deviation:
> t.test(Control, Treat, alternative="less", var.equal=TRUE)
Below is the relevant R-code when not assuming equal standard deviation:
> t.test(Control, Treat, alternative="less")
Here the pooled t-test and the Welch t-test give roughly the same results (p-value = 0.00313 and 0.00339, respectively).
C. Paired t-tests
There are many experimental settings where each subject in the study is in both the
treatment and control group. For example, in a matched pairs design, subjects are
matched in pairs and different treatments are given to each subject in the pair. The
outcomes are thereafter compared pair-wise. Alternatively, one can measure each
subject twice, before and after a treatment. In either of these situations we can’t use
two-sample t-tests since the independence assumption is not valid. Instead we need to
use a paired t-test. This can be done using the option paired =TRUE.
Ex. A study was performed to test whether cars get better mileage on premium gas than
on regular gas. Each of 10 cars was first filled with either regular or premium gas,
decided by a coin toss, and the mileage for that tank was recorded. The mileage was
recorded again for the same cars using the other kind of gasoline. We use a paired t-test to determine whether cars get significantly better mileage with premium gas.
Below is the relevant R-code:
> reg = c(16, 20, 21, 22, 23, 22, 27, 25, 27, 28)
> prem = c(19, 22, 24, 24, 25, 25, 26, 26, 28, 32)
> t.test(prem,reg,alternative="greater", paired=TRUE)
Paired t-test
The results show that the t-statistic is equal to 4.47 and the p-value is 0.00075. Since
the p-value is very low, we reject the null hypothesis. There is strong evidence of a
mean increase in gas mileage between regular and premium gasoline.
ANOVA
A t-test is used to compare two groups; if we want to compare more than two groups we use ANOVA (ANalysis Of VAriance).
We are often interested in determining whether the means from more than two populations
or groups are equal or not. To test whether the difference in means is statistically significant
we can perform analysis of variance (ANOVA) using the R function aov(). If the ANOVA F-test
shows there is a significant difference in means between the groups we may want to perform
multiple comparisons between all pair-wise means to determine how they differ.
Ex. A drug company tested three formulations of a pain relief medicine for migraine
headache sufferers. For the experiment 27 volunteers were selected and 9 were randomly
assigned to one of three drug formulations. The subjects were instructed to take the drug
during their next migraine headache episode and to report their pain on a scale of 1 to 10
(10 being most pain).
Drug A 4 5 4 3 2 4 3 4 4
Drug B 6 8 4 5 4 6 5 8 6
Drug C 6 7 6 6 7 5 6 5 5
Ho: Equal means for all three drug groups
Ha: Not all three group means are equal
To make side-by-side boxplots of the variable pain grouped by the variable drug we must first read the data into the appropriate format:
> pain <- c(4,5,4,3,2,4,3,4,4, 6,8,4,5,4,6,5,8,6, 6,7,6,6,7,5,6,5,5)
> drug <- c(rep("A",9), rep("B",9), rep("C",9))
> migraine <- data.frame(pain, drug)
Note the command rep("A",9) constructs a list of nine A's in a row. The variable drug is therefore a list of length 27 consisting of nine A's followed by nine B's followed by nine C's.
If we print the data frame migraine we can see the format the data should be in to make side-by-side boxplots and perform ANOVA:
> migraine
   pain drug
1     4    A
2     5    A
3     4    A
4     3    A
5     2    A
6     4    A
…
25    6    C
26    5    C
27    5    C
We can now make the boxplots by typing:
> boxplot(pain ~ drug, data = migraine)
From the boxplots it appears that the mean pain for drug A is lower than that for drugs B and
C.
We then perform the ANOVA with the aov() function:
> results <- aov(pain ~ drug, data = migraine)
> summary(results)
            Df Sum Sq Mean Sq F value   Pr(>F)
drug         2  28.22  14.111   11.91 0.000256
Residuals   24  28.44   1.185
Studying the output of the ANOVA table above we see that the F-statistic is 11.91 with a p-value equal to 0.0003. We clearly reject the null hypothesis of equal means for all three drug groups.
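Finally, as noted earlier, when the F-test is significant we may want multiple comparisons between all pair-wise means. A minimal sketch using Tukey's honest significant differences on the fitted model (assuming the aov fit is stored in results as above):
> TukeyHSD(results)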