
MINISTRY OF EDUCATION AND TRAINING

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY AND EDUCATION
FACULTY FOR HIGH QUALITY TRAINING
FACULTY OF CHEMICAL AND FOOD TECHNOLOGY

ESSAY
DETAILS ON USE OBJECTIVES, DATA ANALYSIS, AND RESULTS ASSESSMENT USING ANOVA, BOX-PLOT, PCA, AND CLUSTER ANALYSIS

Subject: Applied Mathematics in Food Technology


Instructor: NGUYEN THAI ANH, PhD.

Student: Vu Thi Thao Trang – 20116002

Nguyen Thi Thu Ha – 20116138

Khong Chon Thuc – 20116160

Bui Duc An – 20116131

Ho Chi Minh City, December 2022


TEAM LIST FOR THE TERM-END SUBJECT

SUBJECT APPLIED MATHEMATICS IN FOOD TECHNOLOGY

SEMESTER 1 2022-2023

1. Instructor: Nguyen Thai Anh, PhD.

2. Essay: Details on use objectives, data analysis, and results assessment using ANOVA, Box-plot, PCA, and Cluster analysis.

3. List of essay group members

Name of student          ID
Vu Thi Thao Trang        20116002
Nguyen Thi Thu Ha        20116138
Khong Chon Thuc          20116160
Bui Duc An               20116131


MARK
Points for: Content | Layout | Presentation | Sum

COMMENTS OF THE TEACHER

12th of December, 2022

Grading by the instructor

Nguyen Thai Anh, PhD.


TABLE OF CONTENTS

ANALYSIS OF VARIANCE (ANOVA)
1. Describe the intended usage
   1.1 Introduction
   1.2 Overview of Analysis of Variance (ANOVA)
   1.3 Terminology
   1.4 Target use based on each type of ANOVA
2. Data analysis and evaluation of the results
   2.1 Data analysis and evaluation of the results using formulas
   2.2 Using ANOVA in some scientific reports
WHISKER BOX PLOT
1. Describe the intended usage
   1.1 Introduction
   1.2 Overview of Whisker Box Plots
   1.3 Box plot reading techniques
   1.4 Example of a box plot chart
2. Data Analysis
3. Box plot evaluation of the outcomes
CLUSTER ANALYSIS
1. Describe the intended usage
2. Data Analysis
   2.1 General considerations
   2.2 The division of a cluster
   2.3 The calculation of the sums of squares
   2.4 Taxonomic considerations
   2.5 Research application
3. Evaluation of the outcomes
PRINCIPAL COMPONENT ANALYSIS (PCA)
1. Describe the intended usage
   1.1 Introduction
   1.2 Usage of PCA
   1.3 Target uses
2. Data Analysis
   2.1 Step by step
3. Evaluation of the output
REFERENCES
CONCLUSION

ANALYSIS OF VARIANCE (ANOVA)
1. Describe the intended usage
1.1 Introduction

ANOVA (Analysis of Variance) is currently the most widely used statistical method for evaluating hypotheses. It covers a wide range of applications and is adaptable enough to accommodate many experimental designs.

ANOVA, created by Ronald Fisher in 1918, is a crucial statistical tool for testing hypotheses by examining the variation between several data sets.

When adequate computer software is available, it is also chosen for its adaptability and accuracy.

1.2 Overview of Analysis of Variance (ANOVA)

The t- and z-test procedures developed in the 20th century were used for statistical
analysis up until 1918, when Ronald Fisher created the analysis of variance
approach.

The analysis of variance (ANOVA), also referred to as the Fisher analysis of variance, extends the t- and z-tests.

With the help of the statistical analysis approach known as ANOVA, apparent
aggregate variability within a data set is explained by separating systematic
factors from random factors.

Random factors are regarded as having no statistical significance, whereas systematic factors are considered statistically relevant.

The ANOVA test is used by analysts to evaluate the impact of independent factors
on the dependent variable in a regression analysis. ANOVA allows a researcher
to establish if the variability of the results is caused by the components in the study
or by chance.

1.3 Terminology
T-Test

A common statistical method for comparing two groups' means (averages) or the
difference between one group's mean and a standard value is the t-test, often
known as the t-statistic or occasionally the t-distribution. You can determine
whether differences are statistically significant (i.e., they weren't the result of
chance) by doing a t-test.

Statistically Significant

"Statistical significance helps determine the likelihood that a result is caused by


chance or by a factor of interest," Redman, Thomas.

Simply said, if your results are noteworthy, it is likely that they were not the result
of chance. However, as your experimental data was collected by chance and does
not match the criteria for science if you don't have a statistically significant result,
you will have to dismiss it.

Random effects and fixed effects

In the context of ANOVA and regression models, the phrases "random" and
"fixed" are used to describe a certain kind of statistical model. Researchers almost
never utilize random effects analyses; instead, they almost exclusively use fixed
effects regression or ANOVA.

In a fixed-effects ANOVA, assumptions are made regarding the independent variable and its error distribution. The simplest use of the idea is in an experimental design, where the researcher is typically concerned only with generalizing the findings to the experimental values used in the study.

For instance, a chemical study might use 15 mg, 20 mg, and 25 mg of an experimental drug. This is a circumstance in which a fixed-effects ANOVA would be appropriate: the extrapolation is to other studies or treatments that might use the same values of the chemical (i.e., 15 mg, 20 mg, and 25 mg).

A random effects model is utilized, nevertheless, if the researcher intends to draw
conclusions from data other than the specific values of the independent variable
used in the study.

Because such a generalization involves accounting for additional expected random variation in the independent variable, it is more of an inferential leap, which makes the random-effects model less powerful.

1.4 Target use based on each type of ANOVA

ANOVA aids in determining whether differences between groups of data are statistically significant, just like the t-test does. It functions by examining the levels of variance within each group, using samples drawn from each.

There are several ways to use ANOVA for your data analysis, ranging from the
straightforward one-way ANOVA to modifications for particular instances, like
the ranked ANOVA for non-categorical variables. Here is an overview of some
of the most popular ones.

One-way ANOVA

To ascertain whether there are any statistically significant differences between the
means of three or more independent (unrelated) groups, the one-way analysis of
variance (ANOVA) is utilized.

Specifically, it tests the null hypothesis

H0: µ1 = µ2 = µ3 = ⋯ = µk

where µ = group mean and k = number of groups. [1]

The alternative hypothesis (HA), which states that there are at least two group
means that are statistically significantly different from one another, is accepted if
the one-way ANOVA yields a statistically significant result.

At this point, it's crucial to understand that the one-way ANOVA is an omnibus
test statistic and can only show that at least two groups were statistically different

from one another, not which specific groups were statistically different from one
another.

Two-way ANOVA

A two-way ANOVA is an extension of the one-way ANOVA.

A statistical test called the two-way ANOVA is designed to ascertain how two
nominal predictor variables would affect a continuous result variable.

Two independent factors are tested for their effects on one dependent variable
using a two-way ANOVA. This test examines the relationship between the
independent factors and the actual outcome as well as their impact on the expected
result.

A factorial ANOVA extends this further by adding an even greater number of independent factors. A sketch of a two-way analysis follows.
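As an illustration only (not drawn from the cited sources), a two-way ANOVA with interaction can be run in Python with statsmodels; the factor names and values below are hypothetical:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical example: two nominal factors (temp, time) and one
# continuous response (yield_), three replicates per cell.
df = pd.DataFrame({
    "temp": ["low"] * 6 + ["high"] * 6,
    "time": (["short"] * 3 + ["long"] * 3) * 2,
    "yield_": [4.1, 3.9, 4.3, 5.0, 5.2, 4.8,
               5.9, 6.1, 5.8, 7.2, 7.0, 7.4],
})

# Main effects of each factor plus their interaction on the response.
model = ols("yield_ ~ C(temp) + C(time) + C(temp):C(time)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # ANOVA table with F and p per term
```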

Welch’s F Test ANOVA

Welch's ANOVA compares group means to test whether they are equal.

It serves as a substitute for the conventional ANOVA and can be applied even when the data do not conform to the homogeneity-of-variances assumption.

If you have normally distributed data that deviate from the assumption of homogeneity of variance, you should use Welch's test.

Welch's test has the highest power and the lowest Type I error rate for data that are normal, unequal in variance, and balanced (i.e., same-size samples). However, for normal, equal-variance data, whether balanced or unbalanced, the traditional ANOVA continues to yield the best results.
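A minimal sketch of Welch's ANOVA in Python follows; it assumes the third-party pingouin package (not mentioned in the source) is available, and the group labels and values are hypothetical:

```python
import pandas as pd
import pingouin as pg  # assumed available: pip install pingouin
from scipy import stats

# Hypothetical long-format data: one grouping column, one response column.
df = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "value": [4.9, 5.1, 5.0, 4.8, 5.2,
              5.9, 6.4, 5.5, 6.8, 5.2,
              7.1, 7.3, 6.9, 7.4, 7.0],
})

# Welch's ANOVA does not assume equal variances across groups.
print(pg.welch_anova(data=df, dv="value", between="group"))

# For comparison: the classic equal-variance one-way ANOVA.
print(stats.f_oneway(*[g["value"].to_numpy() for _, g in df.groupby("group")]))
```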

2. Data analysis and evaluation of the results


2.1 Data analysis and evaluation of the results using formulas [2]

An ANOVA uses the following null and alternative hypotheses:

− H0: All group means are equal.

− HA: At least one group mean is different from the rest.

The required sums of squares are:

SSTotal = ΣΣY² − G²/N, summed over all N observations

SSB = Σ(Ti²/ni) − G²/N, summed over all i groups

SSW = SSTotal − SSB

where:

• Y is the value of the jth observation in group i,

• G is the overall total (ΣY),

• N is the total number of observations (Σni),

• Ti is the total of group i (ΣYi),

• ni is the number of observations in group i,

• SSTotal is the total sum of squares,

• SSB is the sum of squares between groups,

• SSW is the sum of squares within groups, or residual sum of squares.

These values are then entered into the ANOVA table below, together with the degrees of freedom and the mean squares, which are obtained by dividing each sum of squares by its respective degrees of freedom.

Source of variation   Sum of squares   df      Mean square          F
Between groups        SSB              k − 1   MSB = SSB/(k − 1)    F = MSB/MSW
Within groups         SSW              N − k   MSW = SSW/(N − k)
Total                 SSTotal          N − 1

• k is the number of groups.

• N is the total number of observations (= kn when every group contains n observations).

• P is the proportion of the F-distribution that exceeds the observed value of F.

The F value in one-way ANOVA is a tool to help you answer the question: "Is the variation between the group means significantly greater than the variation within the groups?"

− The larger the F-statistic, the greater the variation between sample means
relative to the variation within the samples.
− Thus, the larger the F-statistic, the greater the evidence that there is a
difference between the group means.

Meanwhile, to determine whether the difference between group means is statistically significant, we can look at the p-value that corresponds to the F-statistic.

The p-value serves as an alternative to rejection points, providing the smallest level of significance at which the null hypothesis would be rejected. A smaller p-value means that there is stronger evidence in favor of the alternative hypothesis.

− A p-value of 0.05 or lower is generally considered statistically significant.
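As a sketch of how the formulas above translate into code (the three groups are hypothetical data, used only to exercise the computation), the sums of squares, F-statistic, and p-value can be computed directly and cross-checked against SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of four observations each.
groups = [np.array([6.2, 5.9, 6.8, 6.4]),
          np.array([7.1, 7.5, 6.9, 7.3]),
          np.array([5.8, 6.0, 5.5, 6.1])]

Y = np.concatenate(groups)
G, N, k = Y.sum(), Y.size, len(groups)  # overall total, N, number of groups

ss_total = (Y ** 2).sum() - G ** 2 / N                                 # SSTotal
ss_between = sum(g.sum() ** 2 / g.size for g in groups) - G ** 2 / N   # SSB
ss_within = ss_total - ss_between                                      # SSW

ms_between = ss_between / (k - 1)   # mean square = SS / df
ms_within = ss_within / (N - k)
F = ms_between / ms_within
p = stats.f.sf(F, k - 1, N - k)     # upper-tail area of the F-distribution

print(F, p)
print(stats.f_oneway(*groups))      # SciPy returns the same F and p
```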


2.2 Using ANOVA in some scientific reports
Statistics for food science V: ANOVA and multiple comparisons (Part B) [3]

Example 1

A process development team was trying to determine the best way to blanch a vegetable material so that its vitamin C would be retained. Batches of the material were treated according to four different specifications, and duplicate sub-samples from each replicate batch were taken to determine the ascorbic acid concentration. A preliminary summary analysis was conducted after the vitamin content was determined (Table III).

(1) Experimental objective: To establish whether one blanch method is superior to the others for vitamin C retention.

(2) Experimental design: Completely randomised design (CRD).

(3) Experimental units: 10kg pilot scale batches of vegetable.

(4) Factors: 1 (blanch conditions).

(5) Levels: 4 (the different conditions).

(6) Treatments: 4.

(7) Replicates: Duplicate sub-samples of 50g (randomly selected) from each of


three replicate batches per treatment.

(8) Response: Ascorbic acid content by chemical analysis (a ratio scale).

(9) Total no. of response measures: 12 (means of the duplicate sub-samples).

(10) Assumptions: as for parametric ANOVA (SFS Va); independent measures;


fixed effect model.

(11) Checks: Homogeneity of variance test.

The status of the data is adequate in terms of assumptions, etc. and the statistical
analysis proceeds on this basis:

(1) H0: the population means of the four treatments are equal

(2) H1: at least two of the population means differ

(3) One-tail/two-tail: one-tail

(4) α: 5 per cent; p = 0.05

(5) Analysis method: one-way ANOVA; multiple comparison by Duncan’s MR

(6) Tabulated F: 4.07 (df 3,8; p = 0.05, 1-tail)

(7) Calculated F: 2.42 (Table IV)

(8) Conclusion: H0 is retained

The calculated F ratio of the experiment is less than the critical (tabulated) F. A nonsignificant result is obtained, as confirmed by the p value being greater than the specified significance level of 0.05. Thus, the null hypothesis is retained and no further analysis is required.

The conclusion is that the conditions of blanching selected for the experiment do
not appear to differ in respect of retained vitamin C content.

If a comparison of all possible pairs of means were incorrectly performed before the ANOVA, the LSD and Duncan's tests would reveal a significant difference between treatments A and D – this would be an incorrect conclusion.
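The tabulated and calculated F values quoted above can be reproduced with SciPy (a sketch, not part of the original report):

```python
from scipy import stats

f_crit = stats.f.ppf(0.95, dfn=3, dfd=8)  # tabulated F (df 3,8; p = 0.05) ~ 4.07
f_calc = 2.42                             # calculated F from Table IV
print(f_calc < f_crit)                    # True: H0 is retained
print(stats.f.sf(f_calc, dfn=3, dfd=8))   # p-value, greater than 0.05
```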

Example 2

A dairy dessert sample's intensity of creaminess was graded on a 9-point scale by a newly hired sensory panel that had undergone training. Three different formulations of cream, emulsifier, and stabiliser were developed in small-scale production batches.

(1) Experimental objective: to establish the effect of 3 formulations on the intensity of sensory creaminess

(2) Experimental design: RCB

(3) Experimental units: 5 kg pilot scale batches of each formulation

(4) Factors: 1 (formulation)

(5) Levels: 3 (the different formulations)

(6) Blocks: 10 (the panellists)

(7) Treatments: 3

(8) Replicated measures: 1 assessment of 25g sub-sample of each treatment by


each of 10 panellists

(9) Response: intensity of sensory attribute (creaminess) on a 9-point interval


scale

(10) Total no. of response measures: 30

(11) Analysis method: two-way ANOVA

(12) Assumptions: as for parametric ANOVA and repeated measures; fixed effect
model

(13) Checks: Homogeneity of variance; normality test

(1) H0: the treatment effects are equal

(2) H1: at least 2 of the treatments differ in effect

(3) One-tail/two-tail: one-tail

(4) α: 5 per cent P = 0.05

(5) Analysis method: two-way ANOVA without interaction (no replication)

(6) Calculated F: Treatment effect = 5.33 (Table VII)

(7) Tabulated F: Treatment effect (df 2,18; P = 0.05, 1-tail) = 3.55

(8) Conclusion: H0 is rejected

As a significant F ratio is obtained, the null hypothesis is rejected – there is


evidence to suggest that the treatment means are from different populations. Thus
at least one pair of means differs significantly at the 5 per cent level.

It is now possible to perform multiple comparisons using tests such as Tukey's, Duncan's, or the LSD to determine which means differ.
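The decision rule for this example can also be checked numerically (a sketch using SciPy; the figures are those quoted above):

```python
from scipy import stats

f_crit = stats.f.ppf(0.95, dfn=2, dfd=18)  # tabulated F (df 2,18; p = 0.05) ~ 3.55
f_calc = 5.33                              # calculated treatment F from Table VII
print(f_calc > f_crit)                     # True: H0 is rejected
print(stats.f.sf(f_calc, dfn=2, dfd=18))   # p-value, below 0.05
```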

Effect of acidic food and drinks on surface hardness of enamel, dentine, and tooth-coloured filling materials [4]

This study sought to ascertain how acidic foods and beverages affected the surface
hardness of various substrates.

Methods: For ten cycles, specimens were alternately submerged for five seconds each in artificial saliva and in the foods or drinks. A paired t-test was used to compare the Vickers hardness at baseline and after immersion. A one-way ANOVA of the hardness difference between the groups was followed by a least significant difference (LSD) test.

The softening effect of Cola was higher than any other tested food or drink on
enamel, dentine, microfilled composite, and resin-modified glass ionomer
(p<0.05). The sports drink significantly reduced hardness of enamel more than the
drinking yogurt or Tom-yum soup (p<0.05).

There was no significant difference in hardness changes between orange juice, drinking yogurt, and Tom-yum soup for any substrate under the present test conditions (p > 0.05).

Hardness changes from the drinking yogurt and Tom-yum soup were minute
under the testing conditions used in this study.

Foodborne Pathogens Recovered from Ready-to-Eat Foods from Roadside
Cafeterias and Retail Outlets in Alice, Eastern Cape Province, South
Africa: Public Health Implications [5]

This study evaluated the microbiological quality of several prepared foods offered for sale in Alice, South Africa. A total of 252 samples, including vegetables, potatoes, rice, pies, and beef and chicken stew, underwent microbiological screening.

The API 20E, API 20NE, and API Listeria kits were used to identify the isolates,
and the data were analyzed using the one-way-ANOVA test.

The bacterial count of vegetables, rice, potatoes, beef and chicken stew was
statistically significant when compared with pies (p < 0.05). A similar comparison
was made for food from hygienic and unhygienic sources.

The results revealed a statistically significant difference in the bacterial load of beef stew and rice from hygienic and unhygienic cafeterias (p < 0.05).

However, no significant difference was observed for the vegetable, chicken and potato samples (p > 0.05) (Table 2).
Weight Gain Is Associated with Reduced Striatal Response to Palatable
Food [6]

Using repeated-measures functional magnetic resonance imaging, the study investigated whether overeating results in decreased striatal responsiveness to palatable food consumption.

Twenty-six young women who were overweight or obese participated. The sample comprised 77% European Americans, 5% Native Americans, 7% Asian/Pacific Islanders, 2% African Americans, and 9% people of mixed ethnic origin.

They compared the blood oxygenation level-dependent (BOLD) response during intake of a milkshake with that during consumption of a tasteless solution in order to pinpoint the brain areas stimulated by the ingestion of palatable food.

To compare activations within each participant for the contrast milkshake receipt – tasteless receipt, individual maps were created. After accounting for participant heterogeneity with random-effects models, between-group comparisons were carried out.

Parameter estimates were entered into a second-level 2 × 2 random-effects ANOVA (milkshake receipt – tasteless receipt) comparing the weight-gain group versus the weight-stable group, the weight-gain group versus the weight-loss group, or the weight-stable group versus the weight-loss group.

As hypothesized, the weight-gain group showed significantly less activation in
right caudate in response to milkshake intake at 6 months follow-up compared
with baseline relative to changes observed in weight-stable participants (12, -6,
24; Z= 3.44; FDR-corrected p=0.03; r=-0.35; 9, 0, 15; Z=2.96, FDR corrected
p=0.03, r=-0.26) (Fig. 3). The weight-loss group did not show significant changes
in activation in the caudate in response to milkshake intake compared with the
weight-gain or weight-stable groups (Fig. 3).

To illustrate the relation between the continuous measure of degree of weight gain and the magnitude of the reduction in striatal responsivity to palatable food, they regressed change in BMI against change in right caudate (12, -6, 24) activation for all participants in SPSS, controlling for baseline BMI and scan-time difference (Fig. 4).

They conducted an ANOVA testing the interaction between hemisphere, time, and group for the contrast between activation in response to receipt of the milkshake versus the tasteless solution. There was no significant interaction (F(1,18) = 0.91, p = 0.35). Thus, although the analyses revealed a significant time-by-group interaction in the right caudate but not in the left caudate, it cannot be concluded that the observed effect was significantly lateralized.

Influence of food-simulating solutions and surface finish on susceptibility to staining of aesthetic restorative materials [7]

The study analysed the extent of surface discoloration of resin-based composites (RBCs) and glass-ionomer cements (GICs) caused by immersion in various stains and food-simulating solutions (FSS).

Six tooth-coloured restorative materials were employed. Disk-shaped specimens were prepared and tested with either a matrix finish or after polishing with wet silicon carbide sheets up to 2000 grit. All samples were submerged in distilled water at 37 °C for one week, then in three different FSS (water, 10% ethanol, Crodamol GTCC), and finally in five stains (red wine, coffee, tea, soy sauce, and cola) for two weeks. For each stain, three samples of each material were evaluated. A spectrophotometer was used to assess the colour coefficients (CIE L* a* b*) following each treatment.

The data set was split into six groups, according to the materials used, and separate
analyses of variance (ANOVA) were carried out for each material, evaluating the
effect of surface finish, FSS and stain. Possible differences among FSS, stain,

surface finish and material were explored by a multifactorial analysis of variance
(MANOVA).

If the differences were not significant at the 1% level, interaction terms were
ignored and the main effects of each factor were fitted to each material. Pairwise
differences were then assessed using Tukey’s test for each material in the terms
of any factor (surface finish, FSS and stain) found to be significant at the 5% level
with respect to other factors.

The interaction between combinations of factors is shown in Fig. 2. There was no significant interaction between surface finish and FSS or between surface finish and stain (P > 0.01), and no significant interaction between FSS and stain or between FSS and material. There was a strong interaction (box A) between surface finish and material (P < 0.001), and (box B) between stain and material (P < 0.001).

WHISKER BOX PLOT
1. Describe the intended usage
1.1 Introduction

The scientific method requires an understanding of datasets. However, it is


difficult to determine the importance of data by focusing simply on their values.
By condensing the distribution through the use of a limited number of parameters,
descriptive statistics are an efficient and simple technique to extract the key
features of a dataset. For this reason, the median, mode, mean, variance, and
quantiles are frequently utilized. The basic objective of descriptive statistics is to
quickly characterize, using a condensed set of values, the features of the
underlying distribution of a dataset. These characteristics frequently provide data
insights that would be otherwise concealed. These data summaries also make it
easier to compare different datasets.

Tables, charts, and graphical plots are a few techniques for presenting summary statistics. Graphical plots are appealing because they concisely and pictorially represent a great deal of information, making it easy to quickly grasp and comprehend the data. Descriptive statistics may be displayed graphically in a variety of ways, so it would be unrealistic to include them all here. One of the most often used methods for summarizing data, the box plot, will be the subject of this study. Modifications that add more information to the plot will be addressed along with different approaches to creating the traditional box plot.

1.2 Overview of Whisker Box Plots [8]

Box-and-whisker plots were introduced by the eminent statistician John W. Tukey, who (with his colleagues and students) publicized them energetically in the 1970s for the purpose of visualizing batches of data. These displays have been found useful in numerous areas. The author has found that such displays can be effectively used for presenting diverse collections of data such as melting points of elements, heats of vaporization, ionization energies, covalent radii, bond energies, heats of solution, and applications of box plots in food technology.

Box plots fall into the realm of "exploratory data analysis," the objective of which is to obtain a feeling for how a data set as a whole behaves. Exploratory data analysis is numerical detective work which, it is hoped, uncovers the possibility of quantitative interrelationships among the data; the analysis of such relationships is the subject matter of confirmatory data analysis. Box plot displays force the recognition of interconnections between members of a batch; often such relationships are unexpected and surprising. The author believes that presentation of data in this form provides considerable motivation for the "explanations" that we traditionally give for chemical and physical trends. Such explanations are, in effect, models of the data, whereas a box plot is a low-information representation of the data itself. However, the box plot accurately reflects the actual data – something which an "explanation" or model may not completely do. Because high and low values stand out in these plots, they demand interpretation. Moreover, a better feeling for "typical" values of physical and chemical properties (and their orders of magnitude) is obtained with these displays.

1.3 Box plot reading techniques [10]

Boxplots are particularly useful in presenting data in graphical ways that facilitate
making comparisons, finding tendencies, and providing additional insights.

Readers are guided to identify through graphics (boxplots) key features present in
some foods associated with inorganic elements and molecules and visualized
through medians, quartiles, and outliers, which together describe the shape,
central tendency, and variability of a distribution.

The median is a measure that indicates the midpoint of the distribution.


Computation of the median depends on the number of data. If the data set has an
odd number of observations, then the median is the middle observation,
considering all data ranked. But if the number of data is even, in this case the
median is the mean of the two middle observations.

Quartiles are values that divide data into quarters, that is, groups containing 25%
of the samples. The lower quartile (Q1) is the value such that 25% (1/4) of the
data lie at or below it; the second quartile (Q2) is the median of the data, and the
upper quartile (Q3) is the value such that 75% (3/4) of the data lie at or below it.

The interquartile range (IQR) is the length of the box and envelops all of the data
between Q3 and Q1, that is, the middle 50% of the ranked samples. IQR is little
affected by the presence of outliers and is a measure that is very useful for
comparing two data sets.

Fences are the limits above and below the box (generally not visualized) that are
used to flag possible outliers. The upper fence is the upper limit computed as Q3
+ 1.5 × IQR. Lower fence is the lower limit computed as Q1 − 1.5 × IQR.

Whiskers are the lines that extend from Q1 and Q3, respectively, in the direction
of the minimum and maximum values of the data set. In this work, the whiskers
are extended to the data value just before the fences since this strategy is
implemented in the computational package used by the authors. Moreover,
sometimes the whiskers are represented ending in a small horizontal line. In this
work, all boxplots are drawn without this horizontal bar.

An outlier is a value beyond the fences. There are many reasons an extreme value may appear, such as an error while collecting data or making measurements.

However, we must be careful since extreme values may be correct, as in the
examples presented, and in this case they are indeed very different from the rest
of the data.

Boxplots are mainly constructed to give an overview of the distribution of a data


set. In the picture, we can see the correspondence between the boxplot shape and
the distribution.

Minitab is a software package widely used in educational and scientific research, with common statistical, plotting, and modeling functions. Minitab draws a very common form of boxplot, with the whiskers extending from Q1 to the minimum value before the lower fence and from Q3 to the maximum value before the upper fence.

1.4 Example of a box plot chart [9]

The steps to take while creating a box plot

• Computing the quartiles (Q1, Q2, and Q3);


• Computing the IQR;
• Drawing the box limited by Q1 and Q3;
• Drawing the line inside the box indicating the median;
• Computing the fences;
• Flagging outliers (if present);
• Drawing the whiskers.

The steps to take while interpreting box plots

• Examining the length of the box;


• Examining the length of the whiskers;
• Examining the position of the median;
• Examining the reasons for the presence of outliers (if present).

We should reinforce the idea that an outlier may result from an error while collecting data, or it may be an observation whose investigated parameter is indeed very different from the other values.

Here's an example of how to make a box plot using provided data.

1; 11.5; 6; 7.2; 4; 8; 9; 10; 6.8; 8.3; 2; 2; 10; 1


Ordered from smallest to largest:

1; 1; 2; 2; 4; 6; 6.8 ; 7.2 ; 8; 8.3; 9; 10; 10; 11.5

The median lies between the 7th value, 6.8, and the 8th value, 7.2. To find the median, add the two values together and divide by 2: (6.8 + 7.2)/2 = 7. The median is 7 (Q2 = 7). Half of the values are smaller than 7 and half of the values are larger than 7.

The lower half of the data is 1, 1, 2, 2, 4, 6, 6.8. The middle value of the lower half is 2. The number 2, which is part of the data, is the first quartile (Q1 = 2). One-fourth of the values are the same as or less than 2 and three-fourths of the values are more than 2.

The upper half of the data is 7.2, 8, 8.3, 9, 10, 10, 11.5. The middle value of the
upper half is 9. The number 9, which is part of the data, is the third quartile
(Q3=9). Three-fourths of the values are less than 9 and one-fourth of the values
are more than 9.

Interquartile Range: IQR = Q3 − Q1 = 9 – 2 = 7

Smallest value = 1

Largest value = 11.5

To construct a box plot, use a horizontal number line and a rectangular box. The
smallest and largest data values label the endpoints of the axis. The Q1 marks one
end of the box and the Q3 marks the other end of the box. The middle fifty percent
of the data fall inside the box. The "whiskers" extend from the ends of the box to
the smallest and largest data values. The box plot gives a good quick picture of
the data.

The two whiskers extend from the Q1 to the smallest value and from the Q3 to
the largest value. The median is shown with a dashed line.
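The same data can be plotted in Python with matplotlib (a sketch, not part of the cited text). Note that NumPy's default quartile interpolation differs slightly from the "median of each half" convention used above:

```python
import numpy as np
import matplotlib.pyplot as plt

data = [1, 11.5, 6, 7.2, 4, 8, 9, 10, 6.8, 8.3, 2, 2, 10, 1]

# The text's convention gives Q1 = 2 and Q3 = 9; NumPy's default linear
# interpolation gives Q1 = 2.5 and Q3 = 8.825 (the median, 7, agrees).
print(np.percentile(data, [25, 50, 75]))

fig, ax = plt.subplots()
ax.boxplot(data, vert=False, whis=1.5)  # whiskers/fences at 1.5 x IQR
ax.set_xlabel("value")
plt.show()
```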

2. Data Analysis
Relative Sweetness of Sugars and Sugar Alcohols with Respect to Sucrose
[10]

Boxplot for the relative sweetness of sugars and sugar alcohols with respect to sucrose.

Mono- and oligosaccharides and their corresponding sugar alcohols, with a few
exceptions, are sweet. Sucrose is distinguished from the other sugars by its
pleasant taste even at high concentrations and is the reference substance usually
chosen to compare sweeteners. Sugar alcohols occur in some fruits and are produced industrially as food ingredients. The relevance of some sugar alcohols as sweeteners lies in the fact that they are only slowly absorbed and can therefore be used in diabetic foods, possess a reduced physiological calorie value, and are noncariogenic.

A boxplot can show how sugars differ in intensity of sweetness. This figure ranks
sugars and sugar alcohols with respect to sucrose according to their relative
sweetness, whose distribution of values follows a slightly symmetric shape with
short whiskers. D-Galactose, whose relative sweetness is 63 (the median of the
data set), indicates the midpoint of the sweetness scale. It is possible to categorize
the sweeteners into three main classes in relation to the quartiles. A group of
sweeteners with high relative sweetness power (≥Q3) comprises D-fructose,
xylitol, sucrose, and invert sugar. All of them are sweeteners of importance in
food processing, and a boxplot points out this characteristic, that is, their high
power to sweeten. Xylitol is as sweet as sucrose, and a boxplot ranks it just above
sucrose. However, xylitol produces a cooling effect in the mouth when it
dissolves. This effect is used in some candies. For these reasons, xylitol has been
used as sugar substitute. A gap in the scale separates this group of important
sweeteners from the other compounds with less sweetness power. The substances
with intermediate sweetness power (Q1 < relative sweetness < Q3) include D-
glucose, D-mannitol, D-xylose, D-galactose, D-mannose, D-sorbitol, and
maltose. The third class of sweeteners comprises the substances with the lowest
relative sweetness (≤Q1), whose substances are galactitol, lactose, D-rhamnose,
and raffinose. In all three classes of sweeteners, we can find mono- and
oligosaccharides and sugar alcohols. No sample was classified as an outlier, either above or below the box.

In this example, the boxplot was employed with the ultimate objective of ranking substances based on how sweet they are. From D-fructose to raffinose, the sugars are ranked in decreasing order of relative sweetness. The advantages of boxplots, in comparison to simply seeking highs and lows and particular groupings in a table, as supported by Larsen, are more evident when dealing with many samples. A boxplot both ranks the samples and shows the distribution of values, which is one of the most useful aspects of the graphic.

Amino Acid Composition of Egg White and Egg Yolk [10]

Boxplots for the amino acid content present in egg white (A) and egg yolk (B)

Chicken eggs are one of nature’s perfect protein foods. Moreover, proteins of
animal origin, such as egg proteins, are widely used in fabricated foods. Figures A and B show the boxplots for the amino acid content present in egg white and yolk.
We clearly visualize that glutamine is by far the major constituent in both parts of
the egg. It is interesting to analyze this amino acid specifically. The upper fence
in the boxplot for egg white (dashed line above the box) lies at Q3 + 1.5 × IQR
(0.77 + 1.5 × 0.36 = 1.31). However, the score for glutamine lies at 1.52. Then
this amino acid is outside the fence and must be classified as an outlier. Now let’s
consider the boxplot for egg yolk. The upper fence lies at Q3 + 1.5 × IQR (1.18 +
1.5 ×0.65 = 2.16), but the score for glutamine lies at 1.95, within the fence. Then,
this amino acid is not classified as outlier, even though the amino acid content in
this case is larger than in the previous one. So being an outlier or not depends on
the data set. The fact that glutamine is an outlier in the egg white means that
glutamine is present in a much greater amount than the other amino acids.
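The fence arithmetic quoted above is easy to verify in code (a sketch; the quartile and glutamine values are those given in the text):

```python
def above_upper_fence(value, q3, iqr):
    """True if the value lies above the upper fence Q3 + 1.5 * IQR."""
    return value > q3 + 1.5 * iqr

# Egg white: fence = 0.77 + 1.5 * 0.36 = 1.31; glutamine = 1.52 -> outlier.
print(above_upper_fence(1.52, q3=0.77, iqr=0.36))  # True
# Egg yolk: fence = 1.18 + 1.5 * 0.65 = 2.16; glutamine = 1.95 -> not an outlier.
print(above_upper_fence(1.95, q3=1.18, iqr=0.65))  # False
```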

The boxplot for egg white is slightly negatively skewed, whereas for egg yolk the shape looks slightly positively skewed. Amino acids are present in slightly larger amounts in egg yolk than in egg white, as we can note by comparing the relative positions of the boxes, the whiskers, and the medians (0.64 g, egg white, and 0.83 g, egg yolk). Moreover, the orders in which they are ranked do not differ greatly between egg white and yolk.

The authors propose, with this example, a classification for amino acids according
to the content they present in egg white and egg yolk. In egg white (Figure A),
amino acids present in low content (≤ Q1) are histidine, tryptophan, glycine,
cysteine, and proline; intermediate content (between Q1 and Q3 + 1.5 ×IQR)
include valine, leucine, asparagine, phenylalanine, lysine, isoleucine, alanine,
serine, threonine, methionine, tyrosine, and arginine; high content (> Q3 + 1.5 ×
IQR) include only glutamine. In egg yolk (Figure B), amino acids present in low
content (≤ Q1) are tryptophan, methionine, histidine, and cysteine; intermediate
content (between Q1 and Q3) includes valine, lysine, isoleucine, arginine,
threonine, phenylalanine, tyrosine, proline, glycine, and alanine. High content (>
Q3) includes glutamine, leucine, serine, and asparagine.

Rapid diagnosis of Enterobacteriaceae in vegetable soups by a metal oxide


sensor based electronic nose [11]

The best sensor for E. hormaechei is TGS2611. The sensor box plot (sensor response as a function of the sample incubation time) is shown in Fig. 9a, which clearly shows the capability of this sensor to screen E. hormaechei contamination at 21 h: the box of such samples is significantly distant from the negative controls (NC). Fig. 9b shows an analogous box plot for the E. coli data: in this case the SD0610 sensor response values make it possible to detect E. coli contamination starting from 18 h, although the values are rather close to those of the NC samples, and to perfectly discriminate the samples at 21 h.

3. Box plot evaluation of the outcomes

Boxplots are graphics that are easy to construct and interpret. They are very useful in food chemistry because they give a good overview of the data, helping us to better understand their characteristics, especially, but not only, when dealing with a large data set. Boxplots were useful in making comparisons and ranking samples – for example, in how the contents of amino acids are distributed in egg white and yolk, or the relative sweetness of sugars and sugar alcohols with respect to sucrose. The advantages of applying boxplots to problems of chemical interest are that they make data analysis easier, visualize data in a different way and, most importantly, reveal hidden information.

CLUSTER ANALYSIS
1. Describe the intended usage

Cluster analysis was first discussed in the social sciences in the 1930s, but during that decade it was considered a "poor man's factor analysis". It was not until the late 1950s that the method gained a foothold in data analysis. The publication of Principles of Numerical Taxonomy by Sokal and Sneath (1963) was the main catalyst for this renewed interest. The availability of computers made cluster analysis feasible; prior to 1968, cluster analysis could only be performed efficiently on very small data sets.

Cluster analysis, also known as clustering, is the task of grouping a set of objects so that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a primary task of exploratory data analysis
and a common statistical data analysis technique used in a variety of fields such
as pattern recognition, image analysis, information retrieval, bioinformatics, data
compression, computer graphics, and machine learning.

Cluster analysis is not a specific algorithm, but rather the general problem to be
solved. It can be accomplished by a variety of algorithms that differ significantly
in their understanding of what constitutes a cluster and how to find them
efficiently. Clusters are commonly defined as groups with small distances
between cluster members, dense areas of data space, intervals, or specific
statistical distributions. As a result, clustering can be formulated as a multi-
objective optimization problem. The best clustering algorithm and parameter
settings (such as the distance function to use, a density threshold, or the number
of expected clusters) are determined by the individual data set and intended use
of the results. Cluster analysis is not a fully automated task, but rather an iterative
process of knowledge discovery or interactive multi-objective optimization
involving trial and error. It is frequently necessary to tweak data preprocessing
and model parameters until the desired properties are achieved.

Other terms with similar meanings to clustering include automatic classification, numerical taxonomy, botryology (from the Greek for "grape"), typological analysis, and community detection. The subtle differences frequently lie in the use of the results: in data mining, the resulting groups themselves are of interest, whereas in automatic classification the resulting discriminative power is of interest.

2. Data Analysis
2.1 General considerations

A number of the problems with which we are concerned require the relationships between points in space to be taken into account. In taxonomy, n species that have been scored for m characters each can be represented by n points in a space of m dimensions, and these points can reveal the natural groupings of the species. The same applies, for example, to patients who display the presence or absence of various symptoms.

The first goal of cluster analysis is to find groups of points in space. Naturally, a
method for presenting the results is necessary for a successful analysis, and we
have found the "tree" diagram useful. The advantage of this representation is that
it allows for the mapping of a multi-dimensional tree that spans the entire space.

The first bifurcation, beginning at the base of the tree, represents the division of the points into two clusters; as the two clusters themselves resolve into two further clusters, each branch thus formed will split again, and the process continues until the individual points are reached. The primary issue to be resolved is therefore how to divide a group of points into two segments in the most acceptable manner, since repeating this procedure provides an exhaustive analysis.

2.2 The division of a cluster [12]

It goes without saying that, for our purposes, the best division of a cluster will be one in which the two resulting clusters are as dense as possible, according to some criterion. We have found that the analysis of variance provides a criterion that is easy to use, meaningful, and useful for testing significance. It is known from the analysis of variance that when points on a line are divided into two groups, the sum of the squared distances of the points from their mean can be partitioned into two within-group sums of squares and a between-group sum of squares. Since all the quantities involved are squared distances, this holds for points in any number of dimensions as well: the squared distances can all be decomposed into squared distances along the Cartesian axes, so if the partition is possible along each axis, it applies to the points as a whole.

As a result, when points are divided into two clusters, the sum of squared distances from their mean can be divided into the sum of squared distances from the mean of the points in one cluster, the same sum for the other cluster, and the between-clusters sum of squares; this is merely a multidimensional one-classification analysis of variance. The between-clusters sum of squares is clearly the natural criterion for division: the best split will be the one in which this sum reaches its maximum and the within-clusters sum of squares reaches its minimum. Continued splitting according to this criterion produces a tree diagram, and the between-clusters sum of squares associated with each branching is a measure of the "importance" of that split. Additionally, since each cluster contains only one point at the conclusion of this procedure, there are no within-cluster sums of squares left, and the sums of squares associated with the branchings must add up exactly to the initial sum of squares: every bit of the original variation is accounted for.

Discriminant analysis employs this same criterion of maximising the between-clusters sum of squares (Fisher, 1936), but whereas we employ it to determine the natural clustering of objects from raw observations, discriminant analysis employs it to determine the linear function of the observations that best distinguishes between two a priori clusters.

2.3 The calculation of the sums of squares [13]

In practice, this method is nearly as straightforward as it is in theory, although n points can be divided into two clusters in 2^(n−1) − 1 ways and the sum of squares must be calculated in each case. In fact, it is much simpler to minimize the within-clusters sum of squares than to maximize the between-clusters sum of squares. Our approach is predicated on the fact that the sum of squares of n points along a line about their mean equals the sum of all squared distances between points taken two at a time (each distance being used once) divided by n; equivalently, the variance equals that sum divided by n². Since we are dealing only with squared distances, the result must hold for points in any number of dimensions. As a result, the half-matrix of squared pair-distances contains all of the information relevant to every split. This is reasonable because, in terms of the relative positions of the points, the half-matrix clearly contains all the information required to reconstruct the cluster; it lacks only information about the axes, which have been eliminated because they are no longer relevant. Such a matrix is given in Table 1, the numbers (which are in reality the squares of distances) being taken as integers for the sake of the example:

TABLE 1
THE HALF-MATRIX OF SQUARED DISTANCES (DERIVED FROM THE DATA OF TABLE 2)

Point    B     C     D     E     F
A        5    11    11    14    14
B             10     6    13    15
C                    6    17    21
D                         13    15
E                                6
Thus, the distance between A and B is √5 units. This matrix may be taken as an arbitrary numerical example for the moment; in fact it is derived from real data which will be mentioned below. The sum of the numbers is 177, so that the total sum of squares is 177/6, or 29½. Investigating the split ABC : DEF, the sum of squares for the first cluster is (5 + 11 + 10)/3, or 8⅔, and for the second cluster (13 + 15 + 6)/3, or 11⅓. The total within-clusters sum of squares is therefore 20. All the other thirty possible splits can be similarly treated, and the best one chosen, the between-clusters sum of squares measuring the 'importance' of the split. In this case the best split is ABCD : EF, with a within-clusters sum of squares of 15¼, leaving a between-clusters sum of squares of 14¼.
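This calculation is small enough to reproduce by brute force. The sketch below (Python, not part of the original paper) builds the half-matrix of Table 1, computes the within-clusters sum of squares for every possible split, and recovers the best split ABCD : EF:

```python
from itertools import combinations
import numpy as np

labels = list("ABCDEF")
# Squared distances from Table 1 (upper half-matrix).
pairs = {("A", "B"): 5, ("A", "C"): 11, ("A", "D"): 11, ("A", "E"): 14,
         ("A", "F"): 14, ("B", "C"): 10, ("B", "D"): 6, ("B", "E"): 13,
         ("B", "F"): 15, ("C", "D"): 6, ("C", "E"): 17, ("C", "F"): 21,
         ("D", "E"): 13, ("D", "F"): 15, ("E", "F"): 6}
d2 = np.zeros((6, 6))
for (a, b), v in pairs.items():
    i, j = labels.index(a), labels.index(b)
    d2[i, j] = d2[j, i] = v

def within_ss(idx):
    # Sum of squares of a cluster = (sum of pairwise squared distances) / size.
    idx = list(idx)
    if len(idx) < 2:
        return 0.0
    return sum(d2[i, j] for i, j in combinations(idx, 2)) / len(idx)

print(within_ss(range(6)))  # total sum of squares: 177 / 6 = 29.5

# Try every split (each split appears twice; the minimum is unaffected).
best = min((within_ss(c) + within_ss(set(range(6)) - set(c)), c)
           for k in range(1, 6) for c in combinations(range(6), k))
wss, cluster = best
print([labels[i] for i in cluster],      # ['A', 'B', 'C', 'D']
      wss,                               # within-clusters SS: 15.25
      within_ss(range(6)) - wss)         # between-clusters SS: 14.25
```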

2.4 Taxonomic considerations

In "taxonomic" situations, where a number of objects have been scored for the presence or absence of various characters (each given equal weight), all the coordinates of the points are either 0 or 1, making the calculations even simpler: the squared distance between two objects is simply the number of characters in which the two objects differ, since coordinate differences of 0 or 1 are unchanged by squaring. Because this measure of "distance" is already in use in taxonomy, our methods are particularly appropriate; naturally, they can also be used without difficulty with metric characters.

In fact, the preceding illustration was derived from the bacteriological data of Lysenko and Sneath (1959). The six organisms were Escherichia coli, Salmonella, Klebsiella, Hafnia, Proteus vulgaris, and Morganella, in that order. We have omitted the results of the tests "gas from glucose," "sucrose," "dulcitol," "inositol," "salicin," "raffinose," and "trehalose," which gave variable results in many cases; although we want to keep this example as straightforward as possible, information gleaned from such tests can in fact be used with ease. Table 2 displays the resulting table of scores, which served as the foundation for the half-matrix of squared distances.

TABLE 2
MATRIX OF SCORES FOR SIX SPECIES OF BACTERIA
A 11101111100011001000111
B 10101111101001011000111
C 01111111100000111101100
D 10100111101000111101101
E 10001010111110111100001
F 10000000001111000110001

2.5 Research application


Research application 1 [14]

Cluster analysis was used by Larson and Tanner (1974) to classify the lizard genus Sceloporus, with specimens from this genus as their chosen materials. Eighty attribute measurements were used to describe each specimen's skull, some of which are depicted in Figure 5.1. The majority of these characteristics were standardised for variations in skull size by being reported as ratios of dimensions. After cluster-analyzing the data matrix, Larson and Tanner divided the tree's collection of specimens into three clusters. On the grounds that these represented three previously undiscovered genera, they suggested that Sceloporus, until then regarded as a single genus, be replaced by three new taxa.

In a related study, Robins and Schnell (1971) cluster-analyzed 12 species from six sparrow genera. The attributes were 48 skeleton measurements, and the objects were the 12 sparrow species, so the data matrix had 12 × 48 = 576 values. Each species was represented not by a single specimen but by a sample of between five and ten skeletons, whose measurements were averaged. The researchers divided the tree from the cluster analysis into two clusters and maintained that there were actually just two genera, as opposed to the traditionally accepted six.

Herrin and Oliver (1974) grouped 117 tick specimens from different recognized
species within the same genus to create a categorization of ticks. They employed measurements of tick body sections as characteristics. When they divided the tree
into clusters, they discovered that each cluster contained examples of the same
recognized species. This reaffirmed the genus' traditional taxonomic
categorization. The lizard study divided an existing genus into three new ones
using cluster analysis, whereas the sparrow study combined six existing genera
into two new ones. The tick research confirmed the categorisation but made no
changes.

The power of cluster analysis increases when the objects being classified are uncommon in everyday life. You can hold and handle ticks, sparrows, and lizards; because they are so familiar, their traditional classifications are fairly obvious. However, taxonomists find cluster analysis an increasingly reliable technique in worlds of sizes beyond ordinary experience – the tiny world of bacteria, for instance.

Research application 2 [15]

Bacteria of the genus Bacillus are of interest to industry because they can produce useful chemicals such as detergents, meat-tenderizing enzymes, and antibiotics. Bonde (1975) conducted a cluster analysis on 460 Bacillus strains. He did not describe each strain by descriptive characteristics (attributes of size and shape, or physical and chemical properties). Instead, he gave each strain functional characteristics (attributes that revealed the effects of the strains acting on other substances) – for instance, how strains affected the production of pigments, the fermentation of sugars, and other processes. Following the cluster analysis, Bonde divided the 460 branches of the tree into seven clusters. Based on this, he recommended that the descriptions of several existing genera and species be amended.

Similar research was conducted by Tsukamura et al. (1979), who grouped 369 strains of bacteria from the genera Mycobacterium, Rhodococcus, Nocardia, and Corynebacterium. The 88 characteristics mixed descriptive and functional elements; a number of the functional qualities were the strains' tolerances to various substances. The strains divided in accordance with the existing categorization, which was based on conventional taxonomic techniques, thereby confirming it.

Recall from the previous research application that a functional attribute can be an object's effect on a material or an object's tolerance of a material. Descriptive attributes can be measured on the objects themselves, whereas functional attributes can be measured only when the objects are in contact with another material.

The microscopic universe of the cell is even more condensed than that of bacteria. Rather than being categorized by species or strains, cells are categorized by lines. Olsen et al. (1980), for instance, classified mammalian cells using cluster analysis. The resulting lines were in agreement with lines obtained using a conventional classification.

3. Evaluation of the outcomes

Hierarchical cluster analysis (HCA), as applied to data on the composition of foods in restaurants, was a useful guide for obtaining an overview and conducting a comparative analysis of the several types of food. Because it is easy to carry out with computers, for example with SPSS, it is recommended that HCA be employed even as an exploratory tool to aid intuition in the analysis of a data set.
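A minimal sketch of such an exploratory HCA in Python (using SciPy in place of SPSS; the dishes and nutrient values below are hypothetical placeholders, not the study's data):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Hypothetical composition per 100 g: energy (kcal), protein, lipids,
# carbohydrate, total fiber (g). Values are illustrative only.
labels = ["French fries", "white rice", "arugula", "lettuce", "beans"]
X = np.array([
    [312, 3.4, 15.0, 41.0, 3.1],
    [128, 2.5,  0.2, 28.1, 1.6],
    [ 25, 2.6,  0.7,  3.7, 1.6],
    [ 15, 1.4,  0.2,  2.9, 1.0],
    [ 76, 4.8,  0.5, 13.6, 8.5],
])

# Standardize so no single nutrient dominates the distances, then apply
# Ward's method, which merges clusters to minimise within-cluster variance.
Z = linkage((X - X.mean(axis=0)) / X.std(axis=0), method="ward")
print(fcluster(Z, t=3, criterion="maxclust"))  # cut the tree into 3 clusters

dendrogram(Z, labels=labels)  # the "tree" diagram discussed above
plt.show()
```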

French fries are the most caloric preparation, with considerable total fiber content.
White rice is a food that is rich in carbohydrates, with a high caloric value and
low fiber content. Of the vegetables analyzed, arugula has the highest protein and
total fiber content, whereas lettuce is the one with the smallest amount of these
nutrients. Beans constitute the chief source of dietary fiber and have a low caloric
value. The most caloric preparations are the French fries and the fried
zucchini Milanese, both from restaurant III. The type of processing used in the
preparation of the foods in the four restaurants may be responsible for the
variability of results between the establishments regarding lipid contents, the
caloric value of rice and beans, and the protein and lipid content of
‘‘feijoada’’ beans and meat stew.

PRINCIPAL COMPONENT ANALYSIS (PCA)
1. Describe the intended usage
1.1 Introduction

Principal component analysis (PCA) is a statistical technique that converts a
set of observations of possibly correlated variables into a set of values of
linearly uncorrelated variables called principal components. The number of
distinct principal components is the smaller of the number of original
variables and the number of observations minus one. The transformation is
defined so that the first principal component has the largest possible
variance, and each subsequent component has the highest variance possible
subject to being orthogonal to the preceding components. The result is an
uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of
the original variables. Karl Pearson (1901) developed PCA, defining it as the
process of locating "lines and planes of closest fit to systems of points in
space." Fisher and MacKenzie briefly highlighted PCA as superior to analysis of
variance for the modeling of response data; they also described the Nonlinear
Iterative Partial Least Squares (NIPALS) algorithm, which was subsequently
rediscovered by Wold. PCA was later improved by Hotelling to reach its current
state. [16]

Data reduction and categorization can both benefit from the principal component
analysis (PCA) method. The goal is to identify a new set of variables, smaller than
the original set of variables, that nevertheless captures the majority of the sample's
information while reducing the dimensionality of a data set (sample). [2]

When we talk about information, we're talking about the variation found in the
sample as shown by the correlations between the original variables. Principal
components (PCs), the new uncorrelated variables, are ranked according to how
much of the total information they each retain. [17]

1.2 Usage of PCA


When to use PCA?
− To uncover latent features driving the patterns in the data.
− For dimensionality reduction.
− To visualize high-dimensional data. [17]
Steps for PCA?
− Data standardization should come first.
− Construct the covariance (or correlation) matrix for the desired dimensions.
− Determine the eigenvectors and the corresponding eigenvalues, which represent
the magnitude of the variance along each eigenvector.
− Sort the eigenpairs in decreasing order of their eigenvalues and choose the
one with the highest eigenvalue; this is the first principal component, which
preserves the most information from the original data. [18]
PCA for dimensionality reduction
− PCA is also employed to reduce the number of dimensions, as the sketch after
this list shows.
− Arrange the eigenvectors in descending order based on the corresponding
eigenvalues.
− Plot the graph of cumulative eigenvalues.
− Eigenvectors that do not contribute significantly to the total of the
eigenvalues can be eliminated from the analysis. [17]
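A minimal numpy sketch of this recipe; the random data matrix X is an
illustrative assumption, not data from the essay.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))               # 100 observations, 5 variables

    Xz = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize first
    C = np.cov(Xz, rowvar=False)                # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)        # eigenpairs of a symmetric matrix

    order = np.argsort(eigvals)[::-1]           # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    cumulative = np.cumsum(eigvals) / eigvals.sum()
    print(cumulative)   # plot this curve to decide how many components to keep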

Figure: Plot of PCA and variance ratio

PCA for visualization

The sheer amount of data and the variables/features that describe it present a
barrier when attempting to solve any data-related problems in the modern world.
You require in-depth data investigation, such as determining the correlation
between the variables or comprehending the distribution of a few variables, to
solve problems where data is the key. Visualization can be difficult and nearly
impossible because there are so many different variables or dimensions along
which the data is spread.

Therefore, PCA can help you with that, since it projects the data into a lower
dimension, enabling you to inspect the data visually in a 2D or 3D space. [19]
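For illustration, a hedged sketch of such a 2-D projection; scikit-learn,
matplotlib, and the synthetic 10-dimensional data are our assumptions, not part
of the cited tutorial.

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))              # 200 samples, 10 features

    X2 = PCA(n_components=2).fit_transform(X)   # project onto the first 2 PCs
    plt.scatter(X2[:, 0], X2[:, 1], s=10)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.show()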

1.3 Target uses


− The goal of PCA is to find a new space (with fewer dimensions than the old
space) whose coordinate axes are constructed so that the variability of the
data along each axis is as large as possible.
− Using PCA, you may easily explore the data to understand its major variables
and detect outliers. PCA is a tool for identifying the main axes of variance
within a data set, and when used properly it is one of the most effective
techniques in the data analysis toolbox.

Figure 1: Example 2D data analysis

In this example, PCA is implemented to project one hundred 2-D data points,
X ∈ R^(2×100), onto a 1-D space. Figure 1 shows the elliptical distribution of
X with principal component directions u1 and u2. The principal directions are
extracted from the covariance matrix of the original data set using the SVD
method: V = [u1 u2] ∈ R^(2×2).

Figure 2

As shown in Figure 2, the data matrix X can be rotated to align the principal
axes with the x and y axes: X' = V^T X, where X' represents the rotated data
matrix. In Figures 3 and 4, the matrix X is projected onto the primary and the
secondary principal direction. The Euclidean distances between the original and
the projected 2-D points are computed and summed to quantify the reliability of
the data representation. The projection errors are 97.9172 for the primary axis
and 223.0955 for the secondary axis. This shows that the selection of the
proper eigenvector is important for representing higher-dimensional data
effectively in fewer dimensions while minimizing the loss of information. [20]
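The experiment can be sketched as follows; because the random data below are
not the data behind Figures 1 to 4, the printed errors will differ from 97.9172
and 223.0955.

    import numpy as np

    rng = np.random.default_rng(2)
    # 100 correlated 2-D points, shaped (2, 100) as in the text
    X = rng.multivariate_normal([0, 0], [[9, 4], [4, 3]], size=100).T

    C = np.cov(X)                # 2x2 covariance matrix
    V, S, _ = np.linalg.svd(C)   # columns of V are u1 and u2

    X_rot = V.T @ X              # X' in the text: principal axes aligned with x/y

    for k in (0, 1):             # primary axis, then secondary axis
        u = V[:, [k]]            # one principal direction, shape (2, 1)
        X_proj = u @ (u.T @ X)   # orthogonal projection of every point onto u
        err = np.linalg.norm(X - X_proj, axis=0).sum()
        print("summed projection error, axis", k + 1, ":", round(err, 4))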

Figure 3

Figure 4

2. Data Analysis
2.1 Step by steps

Principal component analysis can be broken down into five steps. We will go
through each step, providing logical explanations of what PCA is doing and
simplifying mathematical concepts such as standardization, covariance,
eigenvectors and eigenvalues, without focusing on how to compute them.

Step 1: Standardization

− This phase standardizes the range of the initial continuous variables so that
each of them contributes equally to the analysis.
− Standardization is crucial to complete before PCA, notably because PCA is
quite sensitive to the variances of the initial variables. If there are
significant disparities in the initial variable ranges, the variables with
larger ranges will dominate those with smaller ranges. For instance, a variable
with a range of 0 to 100 will dominate a variable with a range of 0 to 1,
resulting in biased findings. Converting the data to comparable scales prevents
this problem.
− Mathematically, this is done by subtracting the mean and dividing by the
standard deviation for each value of each variable:

    z = (value − mean) / standard deviation

− Once the standardization is done, all the variables are transformed to the
same scale.
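A minimal sketch of this step; numpy and the random data matrix X are
illustrative assumptions, not data from the essay.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=50.0, scale=10.0, size=(20, 3))   # 20 rows, 3 variables

    Xz = (X - X.mean(axis=0)) / X.std(axis=0)  # z = (value - mean) / std, per column
    print(Xz.mean(axis=0).round(6))            # ~0 for every variable
    print(Xz.std(axis=0).round(6))             # 1 for every variable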

Step 2: Covariance matrix computation

− The purpose of this stage is to determine the relationship, if any, between
the variables in the input data set, and how they vary from the mean with
respect to one another. Variables can occasionally be so highly correlated that
they contain redundant information, so we compute the covariance matrix in
order to identify these correlations.

− The covariance matrix is a p × p symmetric matrix (where p is the number of
dimensions) whose entries are the covariances of all possible pairs of the
initial variables. For example, for a 3-dimensional data set with variables x,
y, and z, the covariance matrix is a 3 × 3 matrix of this form:

    | Cov(x,x)  Cov(x,y)  Cov(x,z) |
    | Cov(y,x)  Cov(y,y)  Cov(y,z) |
    | Cov(z,x)  Cov(z,y)  Cov(z,z) |

− Since the covariance of a variable with itself is its variance
(Cov(a,a) = Var(a)), the main diagonal (top left to bottom right) actually
holds the variances of each initial variable. And since covariance is
commutative (Cov(a,b) = Cov(b,a)), the entries of the covariance matrix are
symmetric with respect to the main diagonal, which means that the upper and
lower triangular portions are equal.
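A minimal sketch of this step, continuing from the standardized data of Step 1
(numpy and the random data are illustrative assumptions).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))                 # variables x, y, z
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)    # standardized (Step 1)

    C = np.cov(Xz, rowvar=False)  # 3x3 matrix; the diagonal holds the variances
    print(C.round(3))
    print(np.allclose(C, C.T))    # True: Cov(a, b) = Cov(b, a), so C is symmetric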

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to
identify the principal components

− To identify the principal components of the data, we must compute the
eigenvectors and eigenvalues of the covariance matrix, two concepts from linear
algebra. Let us first define principal components before discussing these
notions.
− Principal components are new variables constructed as linear combinations or
mixtures of the initial variables. These combinations are made in such a way
that the new variables (i.e., the principal components) are uncorrelated and
most of the information contained in the initial variables is condensed or
squeezed into the first components. The idea is that 10-dimensional data gives
you 10 principal components, but PCA tries to put as much information as
possible in the first component, then as much of the remaining information as
possible in the second component, and so on, until you have something that
looks like a scree plot.
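A minimal sketch of this step (numpy and the random data are illustrative
assumptions); printing the variance share per component gives the numbers a
scree plot would display.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    C = np.cov(Xz, rowvar=False)

    eigvals, eigvecs = np.linalg.eigh(C)      # eigh is meant for symmetric matrices
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue (most variance) first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    print((eigvals / eigvals.sum()).round(3)) # share of total variance per component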

Step 4: Feature vector

− As we saw in the previous step, ranking the principal components in order of
importance requires computing the eigenvectors and sorting them by their
eigenvalues in decreasing order. In this step, we decide whether to keep all of
these components or to discard those with low eigenvalues, and we form a matrix
of the remaining eigenvectors that we call the "feature vector".
− The feature vector is therefore simply a matrix whose columns are the
eigenvectors of the components we decide to keep. This makes it the first step
towards dimensionality reduction: if we keep only p eigenvectors (components)
out of n, the final data set will have only p dimensions.
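A minimal sketch of forming the feature vector; the choice p = 2 and the random
data are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xz, rowvar=False))
    eigvecs = eigvecs[:, np.argsort(eigvals)[::-1]]   # most important first

    p = 2
    W = eigvecs[:, :p]    # feature vector: one column per kept component
    print(W.shape)        # (3, 2): n = 3 original dimensions, p = 2 kept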

Step 5: Recast the data along the principal components axes

− In the preceding steps, apart from standardization, you do not modify the
data; you merely select the principal components and form the feature vector,
so the input data set always remains in terms of the original axes (i.e., in
terms of the initial variables).
− The goal of this final step is to reorient the data from the original axes to
the axes represented by the principal components (hence the name principal
component analysis), using the feature vector formed from the eigenvectors of
the covariance matrix. This is achieved by multiplying the transpose of the
feature vector by the transpose of the original (standardized) data set. [21]
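A minimal sketch of the final recast, following the multiplication described
above (random illustrative data again).

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 3))
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xz, rowvar=False))
    W = eigvecs[:, np.argsort(eigvals)[::-1]][:, :2]   # feature vector (Step 4)

    final = W.T @ Xz.T   # FeatureVector^T times StandardizedData^T, shape (2, 20)
    print(final.shape)   # each column is one observation recast onto the PC axes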

Applying PCA to the sensory evaluation of rice milk

− PCA is applied to QDA data (attributes scored with attribute descriptor
points) to reduce the collection of dependent variables (attributes) to a
smaller set of underlying variables (called factors), based on the correlation
structure among the original variables. Data are collected from the sensory
panel after scoring according to the magnitude of the factors (QDA). The
collected data for the different factors are arranged in ascending or
descending order and processed with the statistical software XLSTAT Version
2014.1.01 and STATGRAPHIC Centurion 16.1. The data are then reduced by the
analysis, the independent and dependent variables are selected, and a two-axis
plot of the samples is obtained.

Table 1: Terms used to evaluate the sensory features of rice milk

Sensory group       Terms
Taste terms (04)    Sweet, Fatty, Bitter, Strange
Status terms (03)   Whey, Sedimentation, Smoothness
Aroma terms (05)    Rice, Coat milk, Soy bean, Animal, Strange
Colour terms (02)   White milk, Brown

Table 2: Appearance frequency of rice milk sensory attributes

Attribute         Frequency   Compacted factor
Sweet             9           X
Fatty             9           X
Bitter            1           -
Strange (taste)   0           -
Whey              6           X
Sedimentation     5           X
Smoothness        2           -
Rice              8           X
Coat milk         5           X
Soy bean          1           -
Animal            0           -
Strange (aroma)   0           -
White milk        5           X
Brown             2           -

Note: X = selected; - = not selected

− The frequency of occurrence of the attributes shows that the sensory
attributes of major interest for rice milk products include sweetness and
fattiness.

Table 3: Principal component analysis based on the 7 compacted sensory attributes

                  F1       F2       F3       F4       F5       F6       F7
Eigenvalue        3.826    2.464    0.353    0.255    0.061    0.032    0.009
Variability (%)   54.657   35.199   5.036    3.649    0.869    0.458    0.132
Cumulative (%)    54.657   89.855   94.891   98.540   99.410   99.868   100.000

Note: Fi is the i-th principal component; Variability (%) is the percentage of
variance; Cumulative (%) is the cumulative percentage of variance.

− The attributes of the rice milk products were evaluated according to the 7
shortened sensory attributes, and 9 rice milk samples were processed with the
principal component analysis method. The results of the analysis are shown in
Table 3. This analysis was carried out with the aim of determining the number
of principal components required to represent the data (comprising the 7
sensory attributes of rice milk). The scree plot shows the eigenvalues in
descending order of magnitude together with the cumulative percentage of
variance. In factor analysis or principal component analysis, the scree plot
helps the analyst visualize the relative importance of the components. The
retained components need to describe at least 80% of the cumulative percentage
of variance (Shi et al., 2002). In this case, components 1 and 2 each have an
eigenvalue greater than 1 and together account for a cumulative 89.855% of the
variance.
− The magnitudes of components 3 to 7 (F3 to F7) are very small compared with
components 1 and 2, so it is not necessary to use the components from the 3rd
onwards to represent the sensory data sets that were collected. [22]
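The selection rule can be checked directly against the eigenvalues in Table 3;
the small sketch below (numpy assumed) reproduces the cumulative percentages up
to rounding of the published eigenvalues.

    import numpy as np

    eigenvalues = np.array([3.826, 2.464, 0.353, 0.255, 0.061, 0.032, 0.009])
    variability = 100 * eigenvalues / eigenvalues.sum()
    cumulative = np.cumsum(variability)
    print(cumulative.round(3))   # 54.657, 89.857, ... close to Table 3

    # Keep the first components whose cumulative variance reaches at least 80%
    n_keep = int(np.searchsorted(cumulative, 80.0)) + 1
    print(n_keep)                # 2 components, matching the analysis above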
3. Evaluation of the output
− Principal component analysis (PCA) is an unsupervised machine learning
algorithm that plays a vital role in reducing the dimensions of the data in
building an appropriate machine learning model. It is a statistical process
that transforms the data containing correlated features into a set of
uncorrelated features with the help of orthogonal transformations.
Unsupervised machine learning is a self-learning approach that uses unlabelled
data to identify hidden patterns.
− PCA converts the data features from a high dimensional space into a low
dimensional space. PCA also acts as a feature extraction method since it
transforms the ‘n’ number of features into ‘m’ number of principal
components (PCs; m < n).
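A minimal sketch of this n-to-m reduction; scikit-learn and the random
six-feature data are our assumptions, not the cited study's setup.

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(3)
    X = rng.normal(size=(50, 6))            # 50 samples, n = 6 features

    pca = PCA(n_components=2)               # m = 2 principal components, m < n
    scores = pca.fit_transform(X)           # uncorrelated PC scores, shape (50, 2)
    print(pca.explained_variance_ratio_)    # variance captured by each PC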
Evaluating the results for soy milk through PCA
− The objective of this work was to use a multivariate statistical method, like
principal component analysis (PCA), along with quantitative descriptive
analysis (QDA) to analyze the variations of physical and sensory properties
of fermented foods after fermentation. PCA operation makes it possible to
distinguish the food samples and also to identify the most important
variables in a multivariate data matrix.
− PCA of the soy milk samples was performed by the following methods:
➢ Data were collected from the panellists after scoring on a hedonic rating
scale. The data for the various attributes stated above were arranged in
ascending or descending order and entered into the software SPSS 16 in data
view mode.

➢ The data were then reduced by the analysis, the independent and dependent
variables were selected, and a two-dimensional figure of the analysed samples
was obtained.

− From the results, the maximum variance obtained was found to be 98%, and from
this PCA score it can be concluded that this type of fermented food product is
acceptable for consumption. Two principal components, PC 1 and PC 2, were
extracted; they accounted for 57.6% and 11.6% of the variance respectively in
the six-variable system in the case of cow milk curd prepared from 2%, and for
57.2% and 12.5% of the variance respectively in the six-variable system in the
case of soy milk curd prepared from 2%. The maximum weighting was found for
PC 1 for both the cow milk and the soy milk curd, i.e., 57.6% and 57.2% of the
variation respectively. A one-way ANOVA test was also performed to determine
whether any significant difference occurred in the food samples during
fermentation and storage, based on differences in the mean values within the
same row. [23]

REFERENCES
[1]. One-way ANOVA. One-way ANOVA - An introduction to when you should
run this test and the test hypothesis | Laerd Statistics. (n.d.). Retrieved December
5, 2022, from https://statistics.laerd.com/statistical-guides/one-way-anova-
statistical-guide.php

[2]. Dransfield R.D., & Brightwell R. (n.d.). One-way fixed effects ANOVA
(Model I). Retrieved December 5, 2022, from
https://influentialpoints.com/Training/one-way_fixed_effects_anova-principles-
properties-assumptions.htm

[3]. Bower, J. A. (1998). Statistics for food science V: Anova and multiple
comparisons (Part B). Nutrition & Food Science, 98(1), 41–48.
https://doi.org/10.1108/00346659810196309

[4]. Wongkhantee, S., Patanapiradej, V., Maneenut, C., & Tantbirojn, D. (2006).
Effect of acidic food and drinks on surface hardness of enamel, dentine, and tooth-
coloured filling materials. Journal of Dentistry, 34(3), 214–220.
https://doi.org/10.1016/j.jdent.2005.06.003

[5]. Nyenje, M. E., Odjadjare, C. E., Tanih, N. F., Green, E., & Ndip, R. N. (2012).
Foodborne pathogens recovered from ready-to-eat foods from roadside cafeterias
and retail outlets in alice, eastern cape province, South Africa: Public health
implications. International Journal of Environmental Research and Public Health,
9 (8), 2608–2619. https://doi.org/10.3390/ijerph9082608

[6]. Stice, E., Yokum, S., Blum, K., & Bohon, C. (2010). Weight gain is associated
with reduced striatal response to palatable food. Journal of Neuroscience, 30(39),
13105–13109. https://doi.org/10.1523/JNEUROSCI.2105-10.2010

[7]. Bagheri, R., Burrow, M. F., & Tyas, M. (2005). Influence of food-simulating
solutions and surface finish on susceptibility to staining of aesthetic restorative
materials. Journal of Dentistry, 33(5), 389–398.
https://doi.org/10.1016/j.jdent.2004.10.018

51 | P a g e
[8]. Russell D. Larsen, Texas Tech University, Lubbock, TX 79409. Box-and-Whisker Plots.

[9]. Susan Dean, Barbara Illowsky, Ph.D, Descriptive Statistics: Box Plot.

[10]. Joao E. V. Ferreira, Ricardo M. Miranda, Antonio F. Figueiredo, Jardel P.
Barbosa, and Edykarlos M. Brasil. Box-and-Whisker Plots Applied to Food
Chemistry.

[11]. Emanuela Gobbi, Matteo Falasconi, Giulia Zambotti, Veronica Sberveglieri,
Andrea Pulvirenti, Giorgio Sberveglieri. Rapid diagnosis of Enterobacteriaceae
in vegetable soups by a metal oxide sensor based electronic nose.

[12]. Charles Romesburg, “Cluster analysis for researchers”.

[13]. Laura R.Peck, “Using cluster analysis in program Evaluation”

[14]. A. W. F. Edwards and L. L. Cavalli-Sforza, “A Method for Cluster Analysis”

[15]. Elizabeth Aparecida Ferraz da Silva Torres, Maria Lima Garbelotti, José
Machado Moita Neto. "The application of hierarchical clusters analysis to the
study of the composition of foods".

[16]. Daoliang Li, Shuangyin Liu. Water Quality Monitoring and Management.
Retrieved 2019, from https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/principal-component-analysis

[17]. Rohit Dwivedi. Introduction To Principal Component Analysis In Machine
Learning. Retrieved May 7, 2020, from https://www.analyticssteps.com/blogs/introduction-principal-component-analysis-machine-learning

[18]. Sartorius Data Analytics. What Is Principal Component Analysis (PCA) and
How It Is Used. Retrieved August 18, 2020, from https://www.sartorius.com/en/knowledge/science-snippets/what-is-principal-component-analysis-pca-and-how-it-is-used-507186
[19]. Aditya Sharma. Principal Component Analysis (PCA) in Python Tutorial.
Retrieved January 2020, from https://www.datacamp.com/tutorial/principal-component-analysis-in-python

[20]. Mireille Boutin. "ECE662: Statistical Pattern Recognition and Decision
Making Processes," Purdue University. Retrieved Spring 2014.

[21]. Zakaria Jaadi. A Step-by-Step Explanation of Principal Component Analysis
(PCA). Updated September 26, 2022, from https://builtin.com/data-science/step-step-explanation-principal-component-analysis

[22]. Nguyen Minh Thuy. Application of principal component analysis, logistic
regression and preference mapping in the sensory evaluation of rice milk
products. April 27, 2015.

[23]. Journal of Food Science and Technology. Published online February 18, 2011.

CONCLUSION
Anova, Boxplot, Cluster analysis, and PCA are all vital tools in the process of
assessing and displaying data sources in scientific research. Each instrument has
distinct advantages in data evaluation.
For our group topic, "Detailing usage objectives, analyzing data, and assessing
findings with ANOVA, Box plot, Cluster analysis, and PCA", we chose to study
how to use these analytic tools. As the foundation for this essay, we used the
information and resources that we received under the supervision of Mr. Nguyen
Thai Anh. Furthermore, we studied scientific publications and lecture videos
supplied by the teacher to obtain additional knowledge and research materials
for this essay.

However, the information we were able to gather while working on the above
material is certainly incomplete; we hope you will comment on the work to help
us improve it.

Finally, we would like to express our heartfelt gratitude to instructor Nguyen
Thai Anh for accompanying us and sharing knowledge not only within the subject
but also beyond it, allowing us to gain a deeper understanding of the world.
Thank you for transforming dry, theoretical knowledge into lectures with
particular and vivid examples that make the class engaging and leave us
motivated at the end of each course. Once again, we would like to express our
heartfelt gratitude, and we wish you good health as you continue to guide the
next generation of students.
