Rintroduction

Sensory & Marketing Research Co., Ltd.
Introduction
GV: Lê Minh Tâm

Website: http://tamleftc.googlepages.com
Lê Minh Tâm 1
Contents
1. R introduction
2. Installation and some simple applications
3. Reading data
4. Using χ2, T-test, anova, PCA in sensory data analysis
References
1. Nguyễn Văn Tuấn, Phân tích số liệu và tạo biểu đồ bằng R, NXB KHKT, 2007, 340 pp.
2. Michael O'Mahony, Sensory Evaluation of Food: Statistical Methods and Procedures, Marcel Dekker, New York, 1986, 487 pp.
3. Douglas C. Montgomery, George C. Runger, Applied Statistics and Probability for Engineers, John Wiley & Sons, 2003, 706 pp.
- www.ykhoanet.com/r (statistical techniques)
- www2.hcmut.edu.vn/~dzung/Rworkshop2006
- www.r-project.org (download the R software)
1. R language: Introduction
- Simple calculations, recreational mathematics, and matrix computation.
- Extended by specialized software for particular computational problems: the packages.
- R is software for statistical analysis and graphics.
2. Installation and some simple applications
- "Comprehensive R Archive Network" (CRAN): http://cran.R-project.org
- Windows installer: http://cran.r-project.org/bin/windows/base/R-2.6.1-win32.exe
2. Installation and some simple applications
- Setup file
- Icon on desktop
- Window screen
- Prompt: >
- Getting help: ?lm or help(lm)
R syntax
- object <- function(arguments)
  Example: reg <- lm(y ~ x)
- Operations
  x == 5     x is equal to 5
  x != 5     x is not equal to 5
  y < x      y is less than x
  x > y      x is greater than y
  z <= 7     z is less than or equal to 7
  p >= 1     p is greater than or equal to 1
  is.na(x)   Is x a missing value?
  A & B      A and B
  A | B      A or B
  !A         not A
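As a small illustration of the comparison and logical operators, the sketch below applies them to a made-up vector of scores. The vector x and its values are assumptions for illustration, not data from the slides. Comparisons work element by element and return logical values.

```r
# Hypothetical scores; NA marks a missing value
x <- c(3, 5, NA, 8, 5)

x == 5                      # element-wise equality; the NA element stays NA
is.na(x)                    # TRUE only for the missing element
keep <- !is.na(x) & x >= 5  # "not missing" AND "at least 5"
x[keep]                     # values that satisfy both conditions
n_match <- sum(x == 5 | x == 8, na.rm = TRUE)  # how many values equal 5 OR 8
n_match
```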
Application - R as a calculator
- Arithmetic calculations
> -27*12/21
[1] -15.42857
> sqrt(10)
[1] 3.162278
> log(10)
[1] 2.302585
> log10(2+3*pi)
[1] 1.057848
> exp(2.7689)
[1] 15.94109
> (25 - 5)^3
[1] 8000
> cos(pi)
[1] -1
- Permutation: 3!
> prod(3:1)
[1] 6
# 10.9.8.7.6.5.4
> prod(10:4)
[1] 604800
> prod(10:4)/prod(40:36)
[1] 0.007659481
> choose(5, 2)
[1] 10
> 1/choose(5, 2)
[1] 0.1
Application - R as a number generator
- Sequence: seq(from, to, by= )
- Generate a variable with numbers ranging from 1 to 12:
> x <- (1:12)
> x
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> seq(12)
[1] 1 2 3 4 5 6 7 8 9 10 11 12
> seq(4, 6, 0.25)
[1] 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00
- Repetition: rep(x, times, …)
> rep(10, 3)
[1] 10 10 10
> rep(c(1:4), 3)
[1] 1 2 3 4 1 2 3 4 1 2 3 4
> rep(c(1.2, 2.7, 4.8), 5)
[1] 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8 1.2 2.7 4.8
- Generating levels: gl(n, k, length = n*k)
> gl(2,4,8)
[1] 1 1 1 1 2 2 2 2
Levels: 1 2
> gl(2, 10, length=20)
[1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
Levels: 1 2
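The generators rep() and gl() are often combined to lay out an experimental design. The following is a minimal sketch assuming a hypothetical panel of 4 panelists who each score 3 products; the names product, panelist and design, and the labels P1 to P3, are illustrative only.

```r
# Hypothetical layout: 3 products, each scored by 4 panelists
product  <- gl(3, 4, labels = c("P1", "P2", "P3"))  # P1 x4, then P2 x4, then P3 x4
panelist <- rep(1:4, times = 3)                     # panelists 1..4 repeated for each product
design   <- data.frame(product, panelist)
design
table(design$product)   # each product appears 4 times
```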
Application - R as a probability calculator
Normal probability
$$P(a \le X \le b) = \int_a^b f(x)\,dx$$
$$\mathrm{pnorm}(a, \mathrm{mean}, \mathrm{sd}) = \int_{-\infty}^{a} f(x)\,dx = P(X \le a \mid \mathrm{mean}, \mathrm{sd})$$
Probability of a height less than or equal to 150 cm, given that the distribution has mean = 156 and sd = 4.6:
> pnorm(150, 156, 4.6)
[1] 0.0960575
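Because pnorm() returns the area to the left of a value, the probability of an interval is the difference of two calls. A short sketch with the same height distribution; the upper bound of 160 cm is an assumption chosen for illustration.

```r
# Same distribution as above: mean = 156 cm, sd = 4.6 cm
m <- 156
s <- 4.6
p_low     <- pnorm(150, m, s)   # P(X <= 150)
p_high    <- pnorm(160, m, s)   # P(X <= 160); 160 is a hypothetical bound
p_between <- p_high - p_low     # P(150 <= X <= 160)
p_above   <- 1 - p_high         # P(X > 160)
c(p_low, p_between, p_above)    # the three pieces add up to 1
```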
Application - R as a simulator
- In a population, 20% have a disease. Suppose we run 1000 studies, each selecting 20 people from the population, and in each study we record the number of people with the disease. Let this number be x. What is the distribution of the 1000 values of x?
x <- rbinom(1000, 20, 0.20)
hist(x)
[Figure: "Histogram of x"; x-axis: x (0 to 10), y-axis: Frequency (0 to 200)]
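One way to check the simulation is to compare the simulated proportions with the exact binomial probabilities from dbinom(). This is a sketch; the seed value is arbitrary and only makes the run reproducible.

```r
set.seed(1)                                     # arbitrary seed, for reproducibility only
x <- rbinom(1000, 20, 0.20)                     # 1000 studies of 20 people each
sim  <- table(factor(x, levels = 0:20)) / 1000  # simulated proportion for each count 0..20
theo <- dbinom(0:20, 20, 0.20)                  # exact binomial probabilities
# First rows (counts 0 to 8) side by side
round(cbind(simulated = as.numeric(sim), theoretical = theo)[1:9, ], 3)
mean(x)                                         # should be close to 20 * 0.20 = 4
```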
R as a simulator: Normal distribution
- The average height of Vietnamese women is 156 cm, with a standard deviation of 4.6 cm. If we randomly take 1000 women from this population, what is the distribution of their heights?
height <- rnorm(1000, mean=156, sd=4.6)
hist(height)
[Figure: "Histogram of height"; x-axis: height (145 to 170), y-axis: Frequency (0 to 150)]
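The simulated sample can be summarized and compared with the theoretical values. A minimal sketch with an arbitrary seed: the sample mean and standard deviation should land near 156 and 4.6, and the share of simulated heights at or below 150 cm should be near the value pnorm() gives.

```r
set.seed(2)                                  # arbitrary seed, for reproducibility only
height <- rnorm(1000, mean = 156, sd = 4.6)
c(mean = mean(height), sd = sd(height))      # close to 156 and 4.6
p_sim  <- mean(height <= 150)                # simulated P(height <= 150)
p_theo <- pnorm(150, 156, 4.6)               # theoretical P(height <= 150)
c(simulated = p_sim, theoretical = p_theo)
```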
R as a sampler
- We have 40 people (1, 2, 3, …, 40). If we randomly select 5 people from the group, who would be selected?
sample(1:40, 5)
[1] 32 26 6 18 9
sample(1:40, 5)
[1] 5 22 35 19 4
sample(1:40, 5)
[1] 24 26 12 6 22
sample(1:40, 5)
[1] 22 38 11 6 18
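Each call to sample() gives a different selection. When a draw has to be reproducible, for example to document which panelists were chosen, the random seed can be fixed first. A sketch; the seed 123 is an arbitrary choice.

```r
set.seed(123)                 # arbitrary seed chosen for illustration
first  <- sample(1:40, 5)
set.seed(123)                 # same seed again
second <- sample(1:40, 5)
identical(first, second)      # the two draws match
anyDuplicated(first) == 0     # nobody is selected twice (sampling without replacement)
```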
R syntax
- R is case sensitive: UPPERCASE and lowercase names are different
a <- 5
A <- 7
B <- a+A
- Variable names must NOT contain spaces
var a <- 5      # invalid: causes a syntax error
- But words in a name can be joined with a "."
var.a <- 5
var.b <- 10
var.c <- var.a + var.b
3. Reading data
age <- c(50,62,60,40,48,47,57,70,48,67)

bmi <- c(17,18,18,18,18,18,19,19,19,19)
tam <- data.frame(age,bmi)
tam
a <- c(1,2,3,4,5,6,7,8,9)
A <- matrix(a,nrow=3)
A
a <- c(1,2,3,4,5,6,7,8,9)
A <- matrix(a,nrow=3, byrow=TRUE)
A
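Once a data frame exists, a few base R functions help to inspect it. A short sketch using the same age and bmi vectors as above:

```r
age <- c(50,62,60,40,48,47,57,70,48,67)
bmi <- c(17,18,18,18,18,18,19,19,19,19)
tam <- data.frame(age, bmi)

dim(tam)                           # number of rows and columns
str(tam)                           # type of each column
tam$age[1:3]                       # a column is accessed with $
older <- tam[tam$age >= 60, ]      # rows for people aged 60 or more
older
colMeans(tam)                      # mean of each column
```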
4. Application in taking & presenting samples
sample(0:999,10,replace=FALSE)
[1] 667 926 888 511 475 889 404 184 713 770
williams(4)    # not a base R function; presumably williams() from the crossdes package
     [,1] [,2] [,3] [,4]
[1,]    1    2    4    3
[2,]    2    3    1    4
[3,]    3    4    2    1
[4,]    4    1    3    2
χ2
Step 1: read the data from ifg.xls
Step 2: analyze the data in the R environment
attach(tam)
names(tam)
table(sex,ethnicity)
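The slide stops at the cross-tabulation. The χ2 test itself can be run with chisq.test() from base R. The sketch below uses made-up counts, since the ifg.xls data are not reproduced here; the category labels and numbers are assumptions.

```r
# Hypothetical counts, NOT the ifg.xls data: rows = sex, columns = ethnicity
counts <- matrix(c(30, 20,
                   25, 25),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(sex = c("Female", "Male"),
                                 ethnicity = c("A", "B")))
res <- chisq.test(counts, correct = FALSE)  # Pearson chi-squared without continuity correction
res$expected     # expected counts under independence
res$statistic    # chi-squared statistic
res$parameter    # degrees of freedom
res$p.value
```

With the attached data from the slide, the equivalent call would presumably be chisq.test(table(sex, ethnicity)).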
One-sample t-test
Two-sample t-test
- Independent-samples t-test
- Paired t-test
Wilcoxon ???
One-sample t-test: Is the overall liking of R1 and R2 greater than 5?
No.  R1  R2     No.  R1  R2
 1    7   8     11    7   9
 2    8   9     12    7   5
 3    6   5     13    8   9
 4    8   9     14    9  10
 5    7   8     15    7   7
 6    7   9     16    7   9
 7    7   7     17    8   7
 8    6   7     18    7   9
 9    8   7     19    6   6
10    6   8     20    8   8
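r1 <- c(7,8,6,8,7,7,7,6,8,6,8,9,5,9,8,9,7,7,7,8)
t.test(r1,tb=5)
One Sample t-test
data: r1
t = 30.1721, df = 19, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
6.840134 7.859866
sample estimates:
mean of x
7.35
Note: t.test() has no argument named tb, so the value 5 is ignored and the output above tests the mean against 0. To test against 5, set the reference value with mu: t.test(r1, mu=5).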
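The question on the slide asks whether the mean liking exceeds 5. In t.test() the reference value is given through the mu argument, and a one-sided alternative through alternative. A minimal sketch using the same r1 vector; the choice of a one-sided test is an assumption based on the wording of the question.

```r
r1 <- c(7,8,6,8,7,7,7,6,8,6,8,9,5,9,8,9,7,7,7,8)
res <- t.test(r1, mu = 5, alternative = "greater")  # H0: mean = 5, H1: mean > 5
res$statistic                                       # t statistic
res$p.value                                         # one-sided p-value
# The same statistic by hand: (mean - mu) / (sd / sqrt(n))
t_manual <- (mean(r1) - 5) / (sd(r1) / sqrt(length(r1)))
t_manual
```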
T-test (paired = TRUE)
Example: liking of a product before and after improvement
No.  R1 (Before)  R2 (After)     No.  R1 (Before)  R2 (After)
 1        7           8          11        7           9
 2        8           9          12        7           5
 3        6           5          13        8           9
 4        8           9          14        9          10
 5        7           8          15        7           7
 6        7           9          16        7           9
 7        7           7          17        8           7
 8        6           7          18        7           9
 9        8           7          19        6           6
10        6           8          20        8           8
r1 <- c(7,8,6,8,7,7,7,6,8,6,8,9,5,9,8,9,7,7,7,8)
r2 <- c(8,9,5,9,8,9,7,7,7,8,9,5,9,10,7,9,7,9,6,8)
tam <- data.frame(r1,r2)
t.test(r1,r2, paired=TRUE)
Paired t-test
data: r1 and r2
t = -1.2289, df = 19, p-value = 0.2341
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.2163983 0.3163983
sample estimates:
mean of the differences
-0.45
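A paired t-test is the same as a one-sample t-test on the pairwise differences. The sketch below checks this with the r1 and r2 vectors from the slide; d is a helper name introduced here.

```r
r1 <- c(7,8,6,8,7,7,7,6,8,6,8,9,5,9,8,9,7,7,7,8)
r2 <- c(8,9,5,9,8,9,7,7,7,8,9,5,9,10,7,9,7,9,6,8)
d <- r1 - r2                               # one difference per pair
paired  <- t.test(r1, r2, paired = TRUE)
on_diff <- t.test(d, mu = 0)               # one-sample test on the differences
c(paired$statistic, on_diff$statistic)     # the two t values are identical
mean(d)                                    # mean of the differences
```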
T-test (independent samples t-test)
Example: placebo and stimulus experiment
No.  G1 (placebo)  G2 (stimulus)     No.  G1 (placebo)  G2 (stimulus)
 1        7             8            11        7             9
 2        8             9            12        7             5
 3        6             5            13        8             9
 4        8             9            14        9            10
 5        7             8            15        7             7
 6        7             9            16        7             9
 7        7             7            17        8             7
 8        6             7            18        7             9
 9        8             7            19        6             6
10        6             8            20        8             8
r1 <- c(7,8,6,8,7,7,7,6,8,6,8,9,5,9,8,9,7,7,7,8)
r2 <- c(8,9,5,9,8,9,7,7,7,8,9,5,9,10,7,9,7,9,6,8)
tam <- data.frame(r1,r2)
t.test(r1,r2)
Welch Two Sample t-test
data: r1 and r2
t = -1.1348, df = 35.845, p-value = 0.2640
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.2543229 0.3543229
sample estimates:
mean of x mean of y
7.35 7.80
Anova
Model S(A)
G1 G2 G3 G4 G5
9 7 8 4 7
8 9 5 3 8
6 6 6 6 7
8 6 7 5 6
10 6 3 4 9
x <- c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4,5,5,5,5,5)
y <- c(9,8,6,8,10,7,9,6,6,6,8,5,6,7,3,4,3,6,5,4,7,8,7,6,9)
x <- as.factor(x)
dat <- data.frame(x,y)
result <- aov(y~x, data=dat)
summary(result)
            Df Sum Sq Mean Sq F value   Pr(>F)
x            4  43.44   10.86  5.3235 0.004373 **
Residuals   20  40.80    2.04
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
plot(TukeyHSD(result))
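The sums of squares in the ANOVA table can be reproduced by hand, which shows where the F value comes from. A sketch with the same x and y; the names group_means, ss_between, ss_within and f_value are introduced here for illustration.

```r
x <- factor(rep(1:5, each = 5))
y <- c(9,8,6,8,10,7,9,6,6,6,8,5,6,7,3,4,3,6,5,4,7,8,7,6,9)

group_means <- tapply(y, x, mean)                    # mean of each of the 5 groups
ss_between  <- sum(5 * (group_means - mean(y))^2)    # 5 observations per group
ss_within   <- sum((y - group_means[x])^2)           # residual sum of squares
f_value     <- (ss_between / 4) / (ss_within / 20)   # df: 5 - 1 = 4 and 25 - 5 = 20
c(ss_between, ss_within, f_value)
```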
Anova
Model S(A*B)
              Recipe
Gender     R1   R2   R3
Male        8    7    5
            6    6    6
            9    7    4
Female      8    6    7
           10    8    3
            8    7    4
gender <- c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2)
recipe <- c(3,3,3,4,4,4,5,5,5,3,3,3,4,4,4,5,5,5)
gender <- as.factor(gender)
recipe <- as.factor(recipe)
y <- c(8,6,9,7,6,7,5,6,4,8,10,8,6,8,7,7,3,4)
dat <- data.frame(gender,recipe,y)
result <- aov(y~gender+recipe+gender:recipe, data=dat)
summary(result)
TukeyHSD(result)
plot(TukeyHSD(result))    # plot is lowercase: R is case sensitive
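In an R model formula, gender + recipe + gender:recipe can be written more briefly as gender * recipe. The sketch below confirms that both forms fit the same model and prints the cell means behind the interaction. The text labels for the factor levels are an assumption; the slide codes gender as 1/2 and recipe as 3/4/5.

```r
gender <- factor(rep(c("Male", "Female"), each = 9))
recipe <- factor(rep(rep(c("R1", "R2", "R3"), each = 3), times = 2))
y <- c(8,6,9,7,6,7,5,6,4,8,10,8,6,8,7,7,3,4)

fit_long  <- aov(y ~ gender + recipe + gender:recipe)
fit_short <- aov(y ~ gender * recipe)               # shorthand for the same model
all.equal(fitted(fit_long), fitted(fit_short))
cell_means <- tapply(y, list(gender, recipe), mean) # mean liking per gender x recipe cell
cell_means
```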
PCA
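The slides show only the PCA plots, not the commands behind them. A minimal sketch of a PCA with prcomp() from base R follows; the data are made up (6 hypothetical products scored on 4 hypothetical attributes), and prcomp() is a stand-in, since the tool used for the original plots is not stated.

```r
# Hypothetical sensory profile: 6 products x 4 attributes (made-up scores)
scores <- matrix(c(6, 3, 5, 2,
                   7, 2, 6, 3,
                   3, 6, 2, 7,
                   2, 7, 3, 6,
                   5, 5, 4, 4,
                   4, 4, 5, 5),
                 nrow = 6, byrow = TRUE,
                 dimnames = list(paste0("Product", 1:6),
                                 c("Sweet", "Sour", "Creamy", "Bitter")))
pca <- prcomp(scores, scale. = TRUE)        # standardize the attributes, then run PCA
explained <- pca$sdev^2 / sum(pca$sdev^2)   # share of variance per component
round(100 * explained, 1)
pca$x[, 1:2]          # product coordinates (individuals map)
pca$rotation[, 1:2]   # attribute loadings (variables map)
biplot(pca)           # products and attributes on one plot
```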
[Figure: Variables factor map (PCA); x-axis: Dimension 1 (32.4%), y-axis: Dimension 2 (28.89%), both from -1.0 to 1.0; variable labels as extracted: Hong Dau, Ngot, Chua.M Chua.V, Sanh, Vang, Dongnhat, Beo, Kem, Bamdinh, Chat, Bo, Nau.kem]
[Figure: Individuals factor map (PCA); x-axis: Dimension 1 (32.4%), y-axis: Dimension 2 (28.89%); product labels as extracted: Ancom, Izzi, Vinamilk, Daisy, Milky Dutch, Nuti]
Summary
- R is an interactive statistical language
- Extremely flexible and powerful
- Data manipulation and coding
- Can be used as a calculator, simulator, and sampler
- FREE!
Standby Slides
Chapter 1: Some basic statistical concepts
1. Measures of central tendency: arithmetic mean, median, mode and quartiles
2. Measures of variation: range, variance and standard deviation
3. The shape of the distribution

$$\text{arithmetic mean} = \frac{\text{sum of the values}}{\text{number of values}}$$
$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$
Example:
Day    1  2  3  4  5  6  7  8  9 10
Time  39 29 43 52 39 44 40 31 44 35
$$\bar{X} = \frac{396}{10} = 39.6$$

The median is the value such that 50% of the values are smaller and 50% of the values are larger.
$$\text{median} = \text{the } \frac{n+1}{2}\text{-th ranked value}$$
where n = sample size

Time   29 31 35 39 39 40 43 44 44 52
Ranks   1  2  3  4  5  6  7  8  9 10
With n = 10, (n + 1)/2 = 5.5, so the median lies halfway between the 5th and 6th ranked values: (39 + 40)/2.
Median = 39.5
Extreme values do not affect the median.

The mode is the value in a set of data that appears most frequently.
In the Time data there are two modes: 39 and 44.
Extreme values do not affect the mode either.
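The three measures of central tendency can be computed in R for the Time data. Base R has mean() and median(); it has no built-in function for the statistical mode (mode() reports the storage type of an object), so the sketch below derives the modes from a frequency table. The names counts and modes are introduced here.

```r
time <- c(39, 29, 43, 52, 39, 44, 40, 31, 44, 35)
mean(time)                         # arithmetic mean
median(time)                       # middle of the ranked values
counts <- table(time)              # frequency of each distinct value
modes  <- as.numeric(names(counts)[counts == max(counts)])
modes                              # all values tied for the highest frequency
```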

Q1, the first quartile, is the value such that 25% of the values are smaller and 75% are larger.
$$Q_1 = \text{the } \frac{n+1}{4}\text{-th ranked value}$$
Q3, the third quartile, is the value such that 75% of the values are smaller and 25% are larger.
$$Q_3 = \text{the } \frac{3(n+1)}{4}\text{-th ranked value}$$
where n = sample size
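For the Time data (n = 10) the quartile positions are 2.75 and 8.25, which are not whole ranks. The slides do not say how to handle fractional positions; the sketch below assumes linear interpolation between the neighbouring ranked values, which is also what quantile() does with type = 6. The helper ranked_value() is a hypothetical name.

```r
time <- c(39, 29, 43, 52, 39, 44, 40, 31, 44, 35)
n <- length(time)
ranked <- sort(time)

pos_q1 <- (n + 1) / 4        # position of Q1: 2.75
pos_q3 <- 3 * (n + 1) / 4    # position of Q3: 8.25

# Hypothetical helper: interpolate between the neighbouring ranked values
ranked_value <- function(v, pos) {
  lo <- floor(pos)
  hi <- ceiling(pos)
  v[lo] + (pos - lo) * (v[hi] - v[lo])
}
q1 <- ranked_value(ranked, pos_q1)
q3 <- ranked_value(ranked, pos_q3)
c(q1, q3)
quantile(time, c(0.25, 0.75), type = 6)   # type 6 uses the same (n + 1) positions
```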

2. Measures of variation: range, variance and standard deviation
Time 29 31 35 39 39 40 43 44 44 52
Range = largest value - smallest value
Range = 52 - 29 = 23
Although the range measures the total spread, it does not consider how the values are distributed around the mean.
2. Measures of variation: range, variance (s²) and standard deviation (s)
To compute s², the sample variance, do the following:
- Compute the difference between each value and the mean
- Square each difference
- Add the squared differences
- Divide this total by (n-1)
To compute s, take the square root of the variance.
Time (X)    δ = X_i − X̄    δ² = (X_i − X̄)²
   39          -0.6            0.36
   29         -10.6          112.36
   43           3.4           11.56
   52          12.4          153.76
   39          -0.6            0.36
   44           4.4           19.36
   40           0.4            0.16
   31          -8.6           73.96
   44           4.4           19.36
   35          -4.6           21.16
Mean = 39.6    Sum of differences = 0    Sum of squared differences = 412.4
Sample variance: $s^2 = \frac{412.4}{9} = 45.82$
Sample standard deviation: $s = \sqrt{45.82} = 6.77$
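The steps for the variance and standard deviation can be followed line by line in R and checked against the built-in var() and sd(). A sketch with the Time data; dev, ss, s2 and s are names introduced here.

```r
time <- c(39, 29, 43, 52, 39, 44, 40, 31, 44, 35)
dev <- time - mean(time)            # difference between each value and the mean
ss  <- sum(dev^2)                   # sum of the squared differences
s2  <- ss / (length(time) - 1)      # divide by n - 1
s   <- sqrt(s2)                     # square root of the variance
c(ss, s2, s)
c(var(time), sd(time))              # built-in functions give the same results
diff(range(time))                   # range = largest value - smallest value
```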
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}$$
For almost all data sets that have a single mode, most of the values lie within 3 standard deviations of the mean:
M ± 3·SD
[Figure: shapes of distributions: left-skewed, symmetrical, right-skewed, bimodal, multimodal]
Reading data from an Excel file
- How many columns (variables) and rows (observations) does this data set have? We use the command dim(arg), where arg is the name of the data set (dim is short for dimension). For example (R's output appears right after the command is entered):
> dim(chol)
[1] 50 8
- So we have 50 rows and 8 columns (variables). What are these variables called? We use the command names(arg), where arg is the name of the data set. For example:
> names(chol)
[1] "id" "sex" "age" "bmi" "hdl" "ldl" "tc" "tg"
- How many men and women are there in the variable sex? To answer this, we can use the command table(arg), where arg is the name of the variable. For example:
> table(sex)
sex
Nam  Nu
 21  28
- The output shows that this data set has 21 men (Nam) and 28 women (Nu).
Subsetting data
setwd("c:/works/r")
chol <- read.table("chol.txt", header=TRUE)
attach(chol)
nam <- subset(chol, sex=="Nam")
nu <- subset(chol, sex=="Nu")
old <- subset(chol, age>=60)
n60 <- subset(chol, age>=60 & sex=="Nam")
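Since chol.txt is not included here, the following sketch builds a small made-up data frame with three of the same column names to show what subset() returns. All values are invented for illustration.

```r
# Made-up stand-in for chol.txt (only three of its columns, invented values)
chol <- data.frame(id  = 1:6,
                   sex = c("Nam", "Nu", "Nam", "Nu", "Nu", "Nam"),
                   age = c(57, 64, 60, 45, 71, 38))

nam <- subset(chol, sex == "Nam")               # men only
old <- subset(chol, age >= 60)                  # aged 60 or more
n60 <- subset(chol, age >= 60 & sex == "Nam")   # men aged 60 or more
n60
nrow(old)
subset(chol, age >= 60, select = c(id, age))    # keep only some of the columns
```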
Sampling with replacement
- Suppose we want to sample 10 people from a group of 50, but each time we select one, we put the id back and select from the whole group again.
sample(1:50, 10, replace=TRUE)
[1] 31 44 6 8 47 50 10 16 29 23
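The difference between the two sampling modes can be seen in a short sketch; the seed is arbitrary. Without replacement an id can never repeat, while with replacement it may, and the sample can even be larger than the group.

```r
set.seed(42)                                   # arbitrary seed, for reproducibility only
with_rep <- sample(1:50, 10, replace = TRUE)   # the same id may appear more than once
no_rep   <- sample(1:50, 10)                   # default is replace = FALSE
anyDuplicated(no_rep) == 0                     # always TRUE without replacement
big <- sample(1:5, 10, replace = TRUE)         # with replacement, size may exceed the group
length(big)
```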
