Professional Documents
Culture Documents
Zhang 2017
Zhang 2017
www.jcimjournal.com/jim
www.elsevier.com/locate/issn/20954964
Available also online at www.sciencedirect.com.
Copyright © 2017, Journal of Integrative Medicine Editorial Office.
E-edition published by Elsevier (Singapore) Pte Ltd. All rights reserved.
● Methodology
A data-driven method for syndrome type
identification and classification in traditional
Chinese medicine
Nevin Lianwen Zhang1, Chen Fu2, Teng Fei Liu1, Bao-xin Chen2, Kin Man Poon3, Pei Xian Chen1, Yun-
ling Zhang2
1. Department of Computer Science and Engineering, the Hong Kong University of Science and
Technology, Hong Kong, China
2. Department of Neurology, Dongfang Hospital, Beijing University of Chinese Medicine, Beijing 100078, China
3. Department of Mathematics and Information Technology, the Education University of Hong Kong, Hong
Kong, China
ABSTRACT
The efficacy of traditional Chinese medicine (TCM) treatments for Western medicine (WM) diseases
relies heavily on the proper classification of patients into TCM syndrome types. The authors developed
a data-driven method for solving the classification problem, where syndrome types were identified and
quantified based on statistical patterns detected in unlabeled symptom survey data. The new method is
a generalization of latent class analysis (LCA), which has been widely applied in WM research to solve
a similar problem, i.e., to identify subtypes of a patient population in the absence of a gold standard. A
well-known weakness of LCA is that it makes an unrealistically strong independence assumption. The
authors relaxed the assumption by first detecting symptom co-occurrence patterns from survey data and
used those statistical patterns instead of the symptoms as features for LCA. This new method consists
of six steps: data collection, symptom co-occurrence pattern discovery, statistical pattern interpretation,
syndrome identification, syndrome type identification and syndrome type classification. A software
package called Lantern has been developed to support the application of the method. The method was
illustrated using a data set on vascular mild cognitive impairment.
Keywords: medicine, Chinese traditional; syndrome; syndrome classification; latent tree analysis; symptom
co-occurrence patterns; patient clustering; stand syndrome differentiation
Citation: Zhang NL, Fu C, Liu TF, Chen BX, Poon KM, Chen PX, Zhang YL. A data-driven method for
syndrome type identification and classification in traditional Chinese medicine. J Integr Med. 2017; 15(2):
110–123.
http://dx.doi.org/10.1016/S2095-4964(17)60328-5
Received August 30, 2016; accepted November 7, 2016.
Correspondence: Prof. Nevin Lianwen Zhang; E-mail: lzhang@cse.ust.hk. Prof. Yun-ling Zhang; E-mail: yunlingzhang2004@163.com
types. The efficacy of TCM treatments depends heavily on throughout the paper.
whether the classification is done properly.[2,3]
The problem of TCM syndrome classification of a 2 Technical background
WM disease consists of four subproblems:[4–7] (1) What
TCM syndrome types exist among the patients with the This paper builds upon two data analysis methods,
disease? (2) What is the prevalence of each syndrome namely latent class analysis (LCA) and latent tree analysis
type? (3) What are the characteristics of each syndrome (LTA). They are based on probabilistic models that
type in terms of symptom occurrence probabilities? describe relationships among categorical variables. Some
(4) How do we determine to the syndrome type(s) of a of the variables are observed, while the others are latent,
patient, based on symptoms?
that is, unobserved. In this section, we explain LCA and
The syndrome classification problem is of fundamental
LTA in layman’s term so that medical researchers without
importance to TCM research and clinical practice.[2,3] As
will be seen in Section 9, this problem has so far not been a strong background in statistics and machine learning can
satisfactorily solved. In this paper we present a data-driven understand the key ideas.
method for its solution. The idea is to: (1) conduct a cross- 2.1 Latent tree models and latent class models
sectional survey of the patients with the WM disease The models that we use are called latent tree models
and collect information about symptoms of interest to (LTMs). An LTM describes the relationship among a set
TCM; (2) perform cluster analysis on the data and divide of variables at two levels. At the qualitative level, it is an
the patients into clusters based on symptom occurrence undirected tree where the observed variables are located
patterns; (3) match the patient’s clusters with TCM at the leaf nodes, whereas the latent variables are located
syndrome types; (4) use the statistical characteristics of at the internal nodes. At the quantitative level, it describes
the patient’s clusters to quantify the TCM syndrome types the relationship between each pair of neighboring
and to establish classification rules. variables using a conditional probability distribution.
We will first explain the data analysis methods that Figure 1(a) shows an example LTM taken from Xu et
this paper relies upon in Section 2. Then we will present al.[9] Qualitatively, it asserts that a student’s math grade
our method for solving TCM syndrome classification (MG) and science grade (SG) are influenced by his
in Sections 3 through 8. Related works and limitations analytical skill (AS); his English grade (EG) and history
will be discussed in Section 9 and conclusions drawn grade (HG) are influenced by his literacy skill (LS) and
in Section 10. A data set on vascular mild cognitive the two skills are correlated. Here, the grades are observed
impairment (VMCI) [8] will be used for illustration variables, while the skills are latent variables.
Figure 1 The concept of latent tree model and latent class model
The subfigure (a) and the tables illustrate the concept of latent tree models using an example that involves two latent variables (the skill variables)
and four observed variables (the grade variables). The tables show some of the probability parameters for the latent tree model. The subfigure (b)
illustrates the concept of latent class models where intelligence is the only latent variable.
AS: analytical skill; LS: literacy skill; MG: math grade.
For simplicity, a ssume all the variables have two not. Similarly, EG and HG are independent of each other
possible values “low” and “high”. The dependence conditioned on the latent variable LS, but MG and SG are
of MG on AS is quantitatively characterized by the not.
conditional distribution P (MG|AS), which is also Historically, LCMs precede LTMs. LTMs are introduced
included in Figure 1. It says that a student with high AS as a generalization of LCMs in Zhang[11], where they are
tends to have high MG and a student with low AS tends called hierarchical LCMs.
to have low MG. Similarly, the dependences of other 2.2 Latent class analysis
grade variables on the skill variables are quantitatively LCA refers to the analysis of data using LCMs. As
characterized by the distributions P(SG|AS), P(EG|LS) an example, consider a data set comprising the grades
and P(HG|LS), respectively. To save space, they are not obtained in the aforementioned four subjects by students
shown. from one school. To perform LCA on the data, we
To specify the quantitative relationships among the latent assume there is a latent variable Y that is related to the
variables, it is convenient to root the model at one of the grade variables as shown in Figure 2(a). The task is
latent variables and regard it as a directed model—a tree- to: (1) determine the number of possible values for Y,
structured Bayesian network.[10] If we use AS as the root, (2) determine the marginal distribution P( Y ) and the
then we need to provide, as given in Figure 1, the marginal distributions P(MG|Y), P(SG|Y), P(EG|Y) and P(HG|Y)
distribution of the root P(AS) and the distribution of the grade variables conditioned on Y. The first task is
P(LS|AS) of LS conditioned on its parent AS. If LS was known as model selection, while the second as parameter
chosen as the root instead, we would need to provide estimation.
P(LS) and P(AS|LS). The choice of root does not matter In statistics, the concept of likelihood measures how
because different choices give rise to equivalent directed well a model fits data. In LCA, probabilistic parameters
models.[11] are determined using the maximum likelihood estimate
LTMs with a single latent variable are called latent (MLE) principle.[12] The number of possible values for Y
class models (LCMs). Figure 1(b) shows an example is often determined using the Bayes Information Criterion
LCM, where “intelligence” is the sole latent variable. (BIC).[13] The BIC score is likelihood, plus a penalty term
Qualitatively, it asserts that a student’s grades in the four for model complexity. The use of BIC intuitively means
subjects are all influenced by his intelligence level. that we want a model that fits data well, but do not want it
Different models make different independence to be overly complex.
assumptions. In Figure 1(b), the four grade variables Note that there are two equivalent versions of BIC in
are assumed mutually independent, conditioned on the literature that differ by the sign of the score. In the
the latent variable “intelligence”. This is known as the version the BIC score takes positive values, lower values
local independence assumption. Another way to state are preferred. In the other version, where BIC scores take
the assumption is that the correlations among the grade negative values, higher values are preferred.
variables can be properly modeled using a single latent In practice LCA is used as a tool for clustering
variable. In this sense, we also call it the unidimensionality discrete data. [14] Each value of the latent variable Y
assumption. represents a probabilistic cluster of individuals; all the
In Figure 1(a), the four grade variables are not assumed values collectively give a partition of all individuals. To
to be unidimensional. The correlations among the determine the number of possible values for Y amounts
grade variables are modeled using two latent variables to determining the number of clusters, and to determine
AS and LS. MG and SG are independent of each other the probabilistic parameters amounts to determining the
conditioned on the latent variable AS, but EG and HG are statistical characteristics of the clusters.
Figure 2 Illustration of the results of latent class analysis and latent tree analysis
The subfigure (a) shows the model for latent class analysis. There is only one latent variable Y. The task is to determine the number of values for Y and
the probability parameters. The subfigure (b) shows a model that might be obtained from latent tree analysis, where it is necessary to determine the
number of latent variables and connections among them additionally.
Figure 3 Structure of the latent tree model obtained on the VMCI data
The variables labeled with English phrases are symptom variables, while the Y-variables are latent variables. The integer next to a latent variable is the
number of its possible values. VMCI: vascular mild cognitive impairment.
To identify the cluster of patients that corresponds to VMCI. The data set is comprised of 803 patients and
a TCM syndrome type, we first select a set of symptom 93 symptoms. Detailed information about the data
variables based on domain knowledge and then perform set and the data collection process are reported in Fu
cluster analysis to group the patients based on those et al. [8]
variables. However, we do not always use the symptom The first step of the method is to perform LTA on the
variables themselves as features for the cluster analysis. VMCI data. This is done using the EAST algorithm. The
If two or more symptom variables are from the same structure of the resultant model is shown in Figure 3.
latent aspect of data, we “combine” them into one latent The variables labeled with English phrases are symptom
feature, and use the latent feature instead of the symptom variables, which are from the data set. The Y-variables are
variables themselves. The objective here is to relax the latent variables, which are introduced during data analysis.
local independence assumption. The integer next to a latent variable is the number of its
The following is an illustration of this approach: possible values. For example, Y01 has 2 possible values,
Figure 4 shows the model that we use to identify the whereas Y15 has 3.
cluster of patients that correspond to the symptom type The widths of the edges indicate strengths of
phlegm-dampness. According to domain knowledge, the correlations between neighboring variables. We see that
symptoms insomnia and dreamfulness are both related to
Y01 is strongly correlated with the symptom variables
phlegm-dampness, and hence are included in the model.
sallow complexion, asthenia of defecation, and dry stool
According to the results of LTA (Figure 3), they are from
the same latent aspect of data, and hence are not used as or constipation, and weakly related to clear profuse
features directly. Instead, they are “combined” into one urination. Similarly, Y08 is strongly correlated with the
latent feature and the latent feature is connected to the latent variables Y09 and Y12, and weakly related to Y04
clustering variable Z. Given Z, the two symptom variables and Y13.
are not mutually independent, and hence the local The latent variables reveal symptom co-occurrence
independence assumption is relaxed. patterns and symptom mutual-exclusion patterns hidden
The model in Figure 4 is called a joint clustering model in the data. Each of those statistical patterns can be
because it jointly considers information from several intuitively understood as the manifestation of a latent
aspects when performing cluster analysis on the patients. aspect of the data. In the rest of this section, we examine a
The BIC score of joint clustering model is –6 326. In few of the patterns.
contrast, the BIC score of the best LCM for the same Each of the latent variables represents a probabilistic
symptom variables is –6 390. So, a better model fit is partition of the patients in the data set. For example,
achieved by relaxing the local independence assumption. Y06 has two possible values and hence partitions
the patients into two clusters, which are denoted as
Y06 = s0 and Y06 = s1 respectively. Table 1 shows
the information about the partition compiled using
the “mod el interpretation” function of the Lant ern
software.
probabilities of the variables in the two clusters. The Table 4 Partition of patients given by the latent variable Y25
larger the mutual information, the larger the difference
Y25 = s0 Y25 = s1 Mutual
in occurrence probabilities, and the more important the Symptom
(0.64) (0.36) information
variable is for distinguishing between the two clusters.
Among the two symptoms in Table 1, thick tongue fur is Insomnia 0.16 0.78 0.20
more important because its occurrence probabilities in
the two clusters differ more than those of greasy tongue Dreamfulness 0.23 0.83 0.18
fur. Flushed face 0.10 0.03 0.01
We see that the two clusters consist of 79% and 21%
of the patients respectively. The symptoms occur with In general, a latent variable reveals either that a group
higher probabilities in the cluster Y06 = s1, and with of symptoms tend to co-occur, or that two subgroups of
lower probabilities in the cluster Y06 = s0. This indicates symptoms tend to be mutual exclusive, with symptoms
that the two symptoms are positively correlated. When in each group tending to co-occur. An examination of all
one symptom occurs, the other also tends to occur. In the latent variables in Figure 3 has found that Y02, Y05,
other words, the two symptoms tend to co-occur. In this Y09, Y10, Y12, Y14, Y15 and Y25 fall into the second
sense, Y06 reveals the probabilistic co-occurrence of the category, whereas the other latent variables fall into the
two symptoms. This is the statistical meaning of Y06. first category. The details are presented in Fu et al.[8]
Table 2 shows the information about that partition given
by the latent variable Y20. It reveals the probabilistic co- 5 Statistical pattern interpretation
occurrence of the three symptoms fat tongue, tongue with
ecchymosis, and tooth-marked tongue.
The objective is to identify patient clusters, based on
Table 3 shows the information about the partition given the VMCI data and domain knowledge that correspond
by the latent variable Y12. In Figure 3, Y12 is directly to TCM syndrome types, and to quantify the syndrome
connected to the two symptom variables slippery pulse types, using the statistical characteristics of the patient
and thin pulse. In the cluster Y12 = s0, slippery pulse clusters. We need to answer three questions to begin with:
occurs with high probability and thin pulse does not What are the target syndrome types? What information in
occur at all. In the cluster Y12 = s1, on the other hand, the data should we use for each target syndrome type? Is
slippery pulse occurs with low probability and thin pulse the information sufficient?
occurs with high probability. This indicates that the two One way to find the answers for these questions is to
symptoms are negatively correlated. When one symptom examine the symptoms in the data one by one, and, for
occurs, the other tends not to occur. In this sense, Y12 each symptom, list all the syndrome types that can lead
reveals the probabilistic mutual exclusion of slippery to the occurrence of the symptom according to domain
pulse and thin pulse. knowledge. In this way, a list of syndrome types is
Table 4 shows the information about the partition associated with each symptom. All the syndrome types
given by yet another latent variable Y25. It reveals that appear in the lists are the target syndrome types. For
the probabilistic mutual exclusion of two groups of each target syndrome type, all symptoms in the data set
symptoms: (1) insomnia and dreamfulness, and (2) flushed that it can cause to occur constitute the information we
face. Moreover, it indicates that the two symptoms in the should use for the syndrome type. The information is
first group tend to co-occur. sufficient if all key manifestations of the syndrome type
are covered.
Examining the symptoms one by one is tedious. We
Table 2 Partition of patients given by the latent variable Y20
suggest examining them in groups instead. As seen in
Y20 = s0 Y20 = s1 Mutual the previous section, LTA has discovered symptom co-
Symptom
(0.91) (0.09) information occurrence patterns and thereby divided the symptoms
into groups. We examine those statistical patterns one by
Fat tongue 0.05 0.58 0.08
one. For each pattern, we identify the syndrome types
Tongue with ecchymosis 0.02 0.30 0.04 that, according to domain knowledge, can lead to the co-
occurrence of the symptoms in the pattern. This step is
Tooth-marked tongue 0.08 0.46 0.04 therefore called Statistical Pattern Interpretation.
We begin with Y06, which reveals the probabilistic co-
Table 3 Partition of patients given by the latent variable Y12 occurrence of thick tongue fur and greasy tongue fur. To
determine the TCM connotation of the statistical pattern,
Y12 = s0 Y12 = s1 Mutual
Symptom
(0.43) (0.57) information
we ask this question: What TCM syndrome type or types
can bring about the co-occurrence of the two symptoms?
Slippery pulse 0.85 0.16 0.26 According to domain knowledge (e.g., Yang et al.[25]), the
answer is phlegm-dampness. Hence, phlegm-dampness is
Thin pulse 0.00 0.57 0.24
the interpretation for the statistical pattern.
Y20 reveals the probabilistic co-occurrence of the three the syndrome types and whether the information is
symptoms fat tongue, tongue with ecchymosis, and tooth- sufficient.
marked tongue. According to domain knowledge, the first One of the potential target syndrome types identified in
and the third symptoms can be caused by either phlegm- the previous section is phlegm-dampness. According to
dampness or Qi deficiency, and the second symptom can the discussions, the following symptoms should be used
be caused by blood stasis. Therefore, the co-occurrence of when quantifying the syndrome type: Y06 (thick tongue
the three symptoms can be due to either the co-occurrence fur and greasy tongue fur), Y20 (fat tongue and tooth-
of phlegm-dampness and blood stasis, or the co- marked tongue), Y12 (slippery pulse), and Y25 (insomnia
occurrence of Qi deficiency and blood stasis. We regard and dreamfulness).
phlegm-dampness and Qi deficiency as (two alternative) In the research of Fu et al., [8] all latent variables in
Figure 3 are examined and interpreted. The interpretations
primary interpretations of the statistical pattern, because
suggest that the following information should also be
they explain the leading symptom fat tongue. In contrast,
included when quantifying phlegm-dampness: Y11 (sticky
blood stasis is regarded as a secondary interpretation of or greasy feel in mouth), Y21 (dizziness, head feels as
the pattern. if swathed, distending headache, nausea or vomiting),
If a latent variable reveals the mutual exclusion of Y03 (dizzy headache), Y14 (thirst desire no drinks), Y16
two groups of symptoms, the two groups are interpreted (urinary incontinence), and Y19 (expectoration).
separately. For example, Y12 reveals the probabilistic In total, there are 16 symptoms that concern various
mutual exclusion of the two symptoms slippery pulse manifestations of phlegm-dampness. Together, they cover
and thin pulse. Slippery pulse can be explained by all major aspects of the impacts of phlegm-dampness.
phlegm-dampness, while thin pulse can be explained It is therefore concluded that phlegm-dampness is well
by either Qi deficiency or blood deficiency. Y25 reveals supported by the data and it can be a target syndrome type
the probabilistic mutual exclusion of two groups of for quantification.
symptoms: (1) insomnia and dreamfulness, and (2) flushed
face. The first group can be explained by any of several
7 Syndrome quantification
syndrome types, namely Yin deficiency, fire-heat, blood
deficiency, Qi deficiency, and phlegm-dampness. The
second group can be explained by either fire-heat or Yin To quantify phlegm-dampness, we perform cluster
deficiency. analysis on the patients based on the 16 symptom variables
Note that the interpretation of a co-occurrence pattern and thereby identify a patient cluster that corresponds
is about identifying an appropriate explanation for to the syndrome type. The size of the patient cluster is
the statistical pattern. It is not about determining the then regarded as a measurement of the prevalence of
syndrome type of a patient given that the pattern is phlegm-dampness, and occurrence probabilities of the
present. To interpret the co-occurrence of insomnia and 16 symptoms in the cluster are used as the definition of
dreamfulness, for instance, the correct question to ask phlegm-dampness.
is: What syndrome types can lead to the occurrence of The cluster analysis is carried out using the model
the two symptoms? There might be multiple possible shown in Figure 4. The latent variable Z, at the top,
answers for such a question as indicated above. The represents the patient clusters to be identified. The features
wrong question is: what is the syndrome type of a patient are from various latent aspects of the data. If there is only
if he has the two symptoms? This question cannot be one symptom variable from an aspect, it is connected to Z
answered because it is not possible to determine the directly. If there are multiple symptom variables from an
syndrome type of a patient only based on those two aspect, they are connected to Z via an intermediate latent
symptoms. variable.
Statistical pattern interpretation requires domain The analysis divides the patients into clusters by jointly
knowledge and thus expert judgment. It is possible that considering information from various aspects of data.
different experts interpret the same pattern in different Hence it is called joint clustering, and Z is called the joint
ways. If that happens, the issue can be resolved through clustering variable. The computational task is to determine
discussions among TCM researchers, for example, the number of possible values for Z and the probability
by following the Delphi method.[26] Statistical pattern parameters. This is done using Lantern through a
interpretation is where TCM researchers need to invest the procedure similar to standard LCA.
most effort when using the LTA method. As pointed out in Section 3, the use of the joint
clustering model in Figure 4, instead of the LCM with
6 Target syndrome type identification all the symptom variables as features, relaxes the local
independence assumption. The BIC score of the joint
The statistical pattern interpretation process gives rise to clustering model is –6 326. In contrast, the BIC score of
a list of syndrome types. Each of them is a potential target the best LCM is –6 390. So, the joint clustering model fits
for syndrome quantification. The next step is to determine the data better than the best LCM.
what symptoms in the data set should be used to quantify The result of the joint clustering is a partition with three
presence and absence of the symptoms. When the total becomes a classification rule and it is an approximation to
score exceeds a threshold, the patient is classified into the the rule given by inequality (1).
class Z = s. Table 6 shows the classification rule for phlegm-
We start by rewriting inequality (1) into the following dampness derived from the joint clustering model in
equivalent form using Bayes rule: Figure 4. We see that the score for greasy tongue fur is
P(Z = s)P(X1,…, Xn|Z = s) > P(Z = ~s)P(X1,…, Xn|Z = ~s) 7.1, the score for slippery pulse is 2.1, and so on. The
To obtain a score-based classification rule, we assume threshold is 4.2.
that the symptom variables are mutually independent
given Z = s or Z = ~s. Strictly speaking, the assumption is
Table 6 Classification rule for phlegm-dampness (Z = s12)
not true. In Figure 4, for instance, the two variables greasy
tongue fur and thick tongue fur are not independent of Symptom Score Threshold Accuracy
each other given Z. Hence approximations are introduced
Greasy tongue fur 7.10
here and their impacts will be assessed later. The
assumption allows us to further rewrite the inequality as Slippery pulse 2.10
follows:
P(Z = s)P(X1|Z = s)… P(Xn|Z = s) > P(Z = ~s) Sticky or greasy feel in mouth 2.80
P(X1|Z = ~s)… P( Xn|Z = ~s) Thick tongue fur 1.50
By taking the logarithm of both sides and re-arranging
the terms, we get: Dizzy headache 1.80
Tooth-marked tongue 1.00
Fat tongue 1.00
(2) Urinary incontinence 0.60 3.7 0.967
By subtracting a constant from both sides of inequality Dizziness 0.40 3.9 0.965
(2), we get:
Thirst desire no drinks 0.60 4.0 0.966
Head feels as if swathed 0.40 4.1 0.966
Expectoration 0.30 4.2 0.969
Two technical remarks are in order. First, the choice Nausea or vomiting 0.40 4.2 0.969
of base for the logarithm does not matter. We use
Distending headache 0.20 4.2 0.969
base 2. Second, the probability values might be 0
sometimes. We deal with this issue by smoothing. The Insomnia 0.02 4.2 0.969
formula for calculating the conditional distribution
Dreamfulness 0.02 4.2 0.969
. Smoothing means we
The shaded part can be deleted if one wants to simplify the rule.
data.[2,6,28] In such a study, the patients of a WM disease are quantify the syndrome types and establish classification
surveyed and information about symptom occurrence is rules. This new approach can also be called Unsupervised
collected. The patients are examined by TCM physicians Quantification. Although domain knowledge is required
and their syndrome types are determined. In other words, when interpreting the statistical patterns discovered, the
the data have syndrome class labels and hence the data approach places TCM syndrome classification on the
are labeled. Statistics are then calculated based on the basis of stronger evidence than Panel Standardization and
labels to determine syndrome prevalence and symptom Supervised Quantification. The evidence might not be as
occurrence probabilities for each syndrome type. strong as what one would get from Syndrome Essence
Classification rules are established using statistical and Study, but the latter has not been successful so far. So, the
machine learning techniques such as regression, neural LTA method represents the best of what can be done for
networks and support vector machines.[7] The conclusions the time being.
One limitation of the LTA method is that it can currently
are summaries of the behaviors of the TCM physicians
deal with only dichotomous data, and hence is not able
who participate in the studies. Hence we call such work
to take the severity of symptoms into consideration.
Supervised Quantification. This is mainly for the sake of simplicity. While it is
Syndrome Essence Study would provide the strongest possible and would be interesting to extend the method to
evidence for TCM syndrome classification, but it has not polytomous data, doing so would substantially complicate
been successful. Panel Standardization and Supervised the methodology, software and the results. The literature
Quantification are based on weak evidence, and the results on LCA has mostly been focused on dichotomous data
depend heavily on the experts who participate in the for the same reason.[19–22] Since the results are meant to be
research work. guidelines for physicians to use as references, rather than
In the literature, cluster analysis is used more rigid standards for physicians to follow meticulously, it is
commonly to group symptom variables than to group desirable to keep them as simple as possible.
patients. [29] The symptom variable clusters obtained
are interpreted as syndrome types. This is problematic 10 Conclusions
because symptom variables being strongly correlated
with each other (and hence grouped together) does not How to properly classify a population of patients
necessarily imply that the symptoms tend to co-occur. into TCM syndrome types is a problem of fundamental
Mutual exclusion also implies strong correlation. In importance to TCM research and clinic practice. The
addition, each symptom is placed in only one cluster problem has not been satisfactorily solved before. A novel
and hence is related to only one syndrome type. In TCM data-driven method for solving the problem is presented
theory, on the other hand, a symptom can be caused by in this paper. It is called the LTA method and consists of
multiple syndromes. six steps:
Factor analysis is another unsupervised learning (1) Data collection: conduct a cross-sectional survey of
method that has been used to quantify syndrome types the population and collect data about the symptoms and
in terms of symptoms.[30] Factor analysis assumes that signs of interest to TCM.
observed variables (symptoms) are linear combinations (2) Statistical pattern discovery: analyze data using
of latent variables (syndrome types). The coefficients LTMs to reveal probabilistic symptom co-occurrence/
are called factor loadings. Researchers typically report mutual-exclusion patterns hidden in the data.
factor loadings for latent variables and interpret the latent (3) Statistical pattern interpretation: interpret the
patterns to determine their TCM connotations.
variables based on observed variables with high factor
(4) Syndrome identification: based on the statistical
loadings. Factor analysis uses continuous latent variables,
patterns and their interpretations, identify a list of
and hence it does not give patient clusters as the LTA syndrome types that are well supported by the data.
method does. In addition, the factors are usually assumed (5) Syndrome quantification: partition the population
to be mutually independent, while TCM syndrome types into patient clusters by performing joint clustering, match
are correlated. the patient clusters with syndrome types, and quantify the
9.2 Strengths and limitations syndrome types using population statistics of the patient
In this paper, we have presented a novel approach to clusters.
TCM syndrome classification, namely the LTA method. (6) Syndrome classification: derive a classification rule
The method starts with data on symptom occurrence. for each syndrome type from the results of syndrome
There are no judgments about syndrome types and hence quantification.
the data are unlabeled. The data are analyzed using Among the six steps, steps 2, 5 and 6 are carried out
LTMs to identify statistical symptom co-occurrence using computers, while steps 1, 3 and 4 are done by TCM
patterns, and the patterns are used to identify patient researchers.
clusters that correspond to syndrome types. The statistical Preliminary ideas behind the LTA method have appeared
characteristics of the patient clusters are then used to in the literature in the past decade. [31–38] Significant
advances are presented in this paper regarding the general 2014; 2014: 418206.
framework, statistical pattern interpretation, syndrome 7 Liu GP, Li GZ, Wang YL, Wang YQ. Modelling of inquiry
identification, syndrome quantification and syndrome diagnosis for coronary heart disease in traditional Chinese
classification. A software package, called Lantern, has medicine by using multi-label learning. BMC Complement
been developed to facilitate the use of the method. A Altern Med. 2010; 10: 37.
complete case study is reported in an accompanying 8 Fu C, Zhang NL, Chen BX, Chen ZR, Jin XL, Guo RJ,
Chen ZG, Zhang YL. Identification and classification of
paper.[8] Researchers should be able to use the method in
TCM syndrome types among patients with vascular mild
their own work after reading the two papers. cognitive impairment using latent tree analysis. [2016-08-
26]. http://arxiv.org/abs/1601.06923.
11 Funding 9 Xu ZX, Zhang NL, Wang YQ, Liu GP, Xu J, Liu TF,
and Liu AH. Statistical validation of traditional Chinese
Research on this article was supported by Hong Kong medicine syndrome postulates in the context of patients
Research Grants Council under grants No. 16202515 and with cardiovascular disease. J Altern Complement Med.
16212516, Guangzhou HKUST Fok Ying Tung Research 2013; 19(10): 799–804.
Institute, China Ministry of Science and Technology 10 Pearl J. Probabilistic reasoning in intelligent systems:
TCM Special Research Projects Program under grants networks of plausible inference. San Mateo, California:
No. 200807011, No. 201007002 and No. 201407001-8, Morgan Kaufmann Publishers. 1988.
11 Zhang NL. Hierarchical latent class models for cluster
Beijing Science and Technology Program under grant
analysis. J Mach Learn Res. 2004; 5: 697–723.
No. Z111107056811040, Beijing New Medical Discipline
12 Aldrich J. R.A. Fisher and the making of maximum
Development Program under grant No. XK100270569, likelihood 1912–1922. Statistical Sci. 1997; 12(3): 162–
and Beijing University of Chinese Medicine under grant 176.
No. 2011-CXTD-23. 13 Schwarz G. Estimating the dimension of model. Ann Statist.
1978; 6(2): 461–464.
12 Acknowledgements 14 Bartholomew DJ, Knott M. Latent variable models and
factor analysis. 2nd ed. London: Arnold. 1999.
We thank Xuezhong Zhou, Zhaoxia Xu, Hua Zhang, 15 Mourad R, Sinoquet C, Zhang NL, Liu TF, Leray P. A
Ping Liu and Vivian Tam for useful discussions and thank survey on latent tree models and applications. J Artif Intell
Jerry Wing Fai Yeung for valuable comments on earlier Res. 2013; 47: 157–203.
16 Liu TF, Zhang NL, Chen PX, Liu AH, Poon LKM, Wang Y.
versions of this paper.
Greedy learning of latent tree models for multidimensional
clustering. Mach Learn. 2013; 98(1): 301–330.
13 Conflict of interests 17 Chen T, Zhang NL, Liu TF, Wang Y, Poon LKM. Model-
based multidimensional clustering of categorical data. Artif
The authors declare that they have no competing Intell. 2011; 176(1): 2246–2269.
interests, financial or non-financial. 18 Zhang NL, Poon LKM, Liu TF. Lantern software. [2016-
01-18]. http://www.cse.ust.hk/faculty/lzhang/tcm/.
19 van Smeden M, Naaktgeboren CA, Reitsma JB, Moons KG,
REFERENCES de Groot JA. Latent class models in diagnostic studies when
there is no reference standard—a systematic review. Am J
1 Normile, D. The new face of traditional Chinese medicine. Epidemiol. 2014; 179(4): 423–431.
Science. 2003; 299(5604): 188–190. 20 van Loo HM, de Jonge P, Romeijn JW, Kessler RC,
2 Zhu WF. Differentiation of syndrome pattern elements. Schoevers RA. Data-driven subtypes of major depressive
Beijing: People’s Medical Publishing House. 2008: 6–12. disorder: a systematic review. BMC Med. 2012; 10: 156.
Chinese. 21 Wasmus A, Kindel P, Mattussek S, Raspe HH. Activity
3 Zhao GQ. Perplexity and outlet of TCM syndrome essence’s and severity of rheumatoid arthritis in Hannover/FRG and
study. Yi Xue Yu Zhe Xue. 1999; 20(11): 47–50. Chinese. in one regional referral center. Scand J Rheumatol Suppl.
4 China State Administration of Traditional Chinese 1989; 79: 33–44.
Medicine. Diagnostic and therapeutic effect evaluation 22 Sullivan PF, Smith W, Buchwald D. Latent class analysis
criteria of diseases and syndromes in traditional Chinese of symptoms associated with chronic fatigue syndrome and
medicine. Nanjing: Nanjing University Press. 1994. fibromyalgia. Psychol Med. 2002; 32(5): 881–888.
Chinese. 23 Yang X, Chongsuvivatwong V, Lerkiatbundit S, Ye J,
5 China State Bureau of Technical Supervision. National Ouyang X, Yang E, Sriplung H. Identifying the Zheng in
standards on clinic terminology of traditional Chinese psoriatic patients based on latent class analysis of traditional
medical diagnosis and treatment–—Syndromes, GB/T Chinese medicine symptoms and signs. Chin Med. 2014;
16751.2—1997. Beijing: China Standards Press. 1997. 9(1): 1.
Chinese. 24 Vermunt JK, Magidson J. Latent class cluster analysis. In:
6 Wang J, Xiong X, Liu W. Traditional Chinese medicine Hagenaars JA, McCutcheon AL. Advances in latent class
syndromes for essential hypertension: a literature analysis analysis. Cambridge: Cambridge University Press. 2002:
of 13,272 patients. Evid Based Complement Alternat Med. 89–106.
25 Yang WY, Meng FY, Jiang YN, Hu H, Guo J, Fu YL. Med. 2008; 14(5): 583–587.
Diagnostics of traditional Chinese medicine. Beijing: 33 Zhao Y, Zhang NL, Wang T, Wang Q. Discovering symptom
Xueyuan Press. 1998. Chinese. co-occurrence patterns from 604 cases of depressive patient
26 Linstone HA, Turoff M. The Delphi method: techniques and data using latent tree models. J Altern Complement Med.
applications. Vol. 29. Boston: Addison-Wesley Publishing 2014; 20(4): 265–271.
Company, Inc. 1975. 34 Zhang NL, Yuan SH. Implicit structure model and research
27 Zhang NL, Yuan SH, Wang TF, Zhao Y, Wang Y, Liu on syndrome differentiation of Chinese medicine (I): Basic
TF, Wang QG. Latent structure analysis and syndrome thought and analytic tool of implicit structure. Beijing
differentiation for the integration of traditional Chinese Zhong Yi Yao Da Xue Xue Bao. 2006; 29(6): 365–369.
medicine and Western medicine (I): basic principle. Shi Jie Chinese.
Ke Xue Ji Shu Zhong Yi Yao Xian Dai Hua. 2011; 13(3): 35 Zhang NL, Yuan SH, Chen T, Wang Y. Implicit structure
498–502. Chinese with abstract in English. model and research on syndrome differentiation of Chinese
28 Zhang GZ, Wang P, Wang JS, Jiang CY, Deng BW, Li P, medicine (II): data analysis of kidney deficiency. Beijing
Zhao YM, Liu WL, Zhai X, Chen WW, Zeng L, Zhou DM, Zhong Yi Yao Da Xue Xue Bao. 2008; 31(9): 548–587.
Sun LY, Li YY. Study on the distribution and development Chinese.
rules of TCM syndromes of 2 651 psoriasis vulgaris cases. 36 Zhang L, Yuan SH, Chen T, Wang Y. Implicit structure
Zhong Yi Za Zhi. 2008; 49(10): 894–896. Chinese with model and research on syndrome differentiation of Chinese
abstract in English. medicine (III): syndrome differentiation from model and
29 Zhang NL, Zhou XZ, Chen T, He LY, Liu BY. The syndrome differentiation by experts. Beijing Zhong Yi Yao
interpretation variable clustering results in the context of Da Xue Xue Bao. 2008; 31(10): 659–663. Chinese.
TCM syndrome research. Zhongguo Zhong Yi Yao Xin Xi Za 37 Wang TF, Zhang NL, Zhao Y, Wang Y, Wu XY, Yuan
Zhi. 2007; 14(7): 102–103. Chinese. SH, Wang ZY, Du CF, Yu CG, Chen T, Poon KM, Wang
30 Xue J, Wang Y, Han R. Factor analysis on the distribution QG. Latent structure models and their applications in
of Chinese medicine syndromes in patients with TCM syndrome research. Beijing Zhong Yi Yao Da Xue
hyperlipidemia in Xinjiang region. Zhongguo Zhong Xi Xue Bao. 2009; 32(8): 519–526. Chinese with abstract in
Yi Jie He Za Zhi. 2010; 30(11): 1169–1172. Chinese with English.
abstract in English. 38 Zhang NL, Xu ZX, Wang YQ, Liu TF, Liu GP, Li FF,
31 Zhang NL, Yuan SH, Chen T, Wang Y. Latent tree models Yan HX, Guo R. Latent structure analysis and syndrome
and diagnosis in traditional Chinese medicine. Artif Intell differentiation for the integration of traditional Chinese
Med. 2008; 42: 229–245. medicine and Western medicine (II): joint clustering. Shi
32 Zhang NL, Yuan S, Chen T, Wang Y. Statistical validation of Jie Ke Xue Ji Shu Zhong Yi Yao Xian Dai Hua. 2012, 14(2):
traditional Chinese medicine theories. J Altern Complement 1422–1427. Chinese.
Submission Guide
Journal of Integrative Medicine (JIM) is an international, types include reviews, systematic reviews and meta-analyses,
peer-reviewed, PubMed-indexed journal, publishing papers randomized controlled and pragmatic trials, translational and
on all aspects of integrative medicine, such as acupuncture patient-centered effectiveness outcome studies, case series and
and traditional Chinese medicine, Ayurvedic medicine, herbal reports, clinical trial protocols, preclinical and basic science
medicine, homeopathy, nutrition, chiropractic, mind-body studies, papers on methodology and CAM history or education,
medicine, Taichi, Qigong, meditation, and any other modalities editorials, global views, commentaries, short communications,
of complementary and alternative medicine (CAM). Article book reviews, conference proceedings, and letters to the editor.
● No submission and page charges ● Quick decision and online first publication
For information on manuscript preparation and submission, please visit JIM website. Send your postal address by e-mail to
jcim@163.com, we will send you a complimentary print issue upon receipt.