Inferring Functional Groups from Microbial Gene Catalogue with Probabilistic Topic Models

Xin Chen1, TingTing He2, Xiaohua Hu1, Yuan An1, Xindong Wu3

of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA 2Dept. of Computer Science at Central China Normal University, Wuhan, China 3Department of Computer Science, University of Vermont, Burlington, VT, USA


Backgrounds: Genomics
• Genomics refers to the analysis of genomes. A genome can be thought of as the complete set of DNA sequences that codes for the hereditary material that is passed on from generation to generation. These DNA sequences include all of the genes (the functional and physical unit of heredity passed from parent to offspring) and transcripts (the RNA copies that are the initial step in decoding the genetic information) included within the genome. Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.


Backgrounds: GenBank and NCBI
• In recent years, GenBank and NCBI have grown rapidly with the advancement of gene sequencing technology.


Backgrounds: annotating algorithms
• With the growth of GenBank and NCBI, many annotation algorithms have been developed to match genomic sequences to GenBank/NCBI standard references and attach meta-information to the sequences.


Backgrounds: meta-information
• The annotated meta-information involves hierarchical data such as NCBI Taxonomy and Gene Ontology.

Challenges: Metagenomics
• With fast-advancing sequencing techniques, large amounts of sequenced genomes and meta-genomes from uncultured microbial samples (microbe) have become available.
• The goal of metagenomics is to study the genome-wide gene-expression data from uncultured environment samples (like the ocean, soil and human body) and understand the underlying biological processes.

Research Questions
What are the major research questions of our study?
• We use our data mining framework to investigate the following questions:
1) Given a large number of genome fragments from a microbial sample, what genomes are there?
• Answering this question requires mapping the meta-genomic reads to taxonomic units (usually by homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis).
2) What are the major functions of these genomes?
• The answers to this question involve annotating the major functional units (such as signal transduction, metabolic capacity and gene regulation) on the genome level (a.k.a. functional analysis).
• Our research objective:
• We aim to develop a new method that is able to analyze the genome-level composition of DNA sequences, in order to characterize a set of common genomic features shared by the same species and tell their functional roles.

Related topics in this presentation:
• Structural annotation and protein encoding regions
• Homology-based functional analysis
• Topic Models
…

Structural annotation and protein encoding regions
• Structural annotation
– Annotating the regions of known open reading frames (ORFs), non-coding genes (rRNA, tRNA, miRNA), promoters and UTRs in the DNA sequences.

Structural annotation and protein encoding regions (continued)
• NCBI standard reference sequences have detailed structural annotations of both non-protein encoding regions (such as tRNA) and protein encoding regions (CDS), as well as the corresponding gene names (if applicable). The GenBank accession number of each reference sequence is available on each NCBI online query.

Related topics in this presentation:
• Structural annotation and protein encoding regions
• Homology-based functional analysis
• Topic Models
…

Functional analysis - overview
• Functional analysis
– Uncover the major gene functions related to the genomic sequences.
– Requires explaining the biochemical activity (a.k.a. molecular function) of the gene product, and identifying the biological process to which the gene or gene product contributes (including information about the enzyme, pathway and metabolic capabilities related to the gene).

Homology-based functional analysis (Richter and Huson, 2009)
• A homology-based approach has recently been introduced to achieve functional annotation for metagenomic reads (Richter and Huson, 2009).
• The framework begins with a homology-based BLASTX algorithm to match the metagenomic fragments against the reference sequences in the NCBI database. The BLASTX hits associate fragments with related protein IDs and gene names.
• After that, the Gene Ontology (GO) database is used to map the associated gene names to the corresponding GO terms, which provides an overview of gene functions and products for the metagenomic fragments.

Homology-based functional analysis (Richter and Huson, 2009)
[Figure: GO terms obtained from database identifier mapping (Richter and Huson, 2009)]

Limitations with Homology-based Functional Analysis Methods
1. Homology-based approaches rely heavily on the result of local sequence alignment (such as BLAST and BLASTX) to the known open reading frames (ORFs).
– The BLAST-like local alignment may either return hundreds of hits, or return no hits, depending on the E-value threshold used. In the former case, it usually lacks a proper tie-breaker to further reduce the hits, which makes the functional annotation somewhat ambiguous (with hundreds of probable explanations), as there is no priority for the annotated GO terms. In the latter case, the current methods are unable to provide any functional annotation.
2. The homology-based functional annotation methods do not provide any insight about the "major" functional capabilities of genomes (like which gene functions are more commonly shared by strains from the same species).

Related topics in this presentation:
• Structural annotation and protein encoding regions
• Homology-based functional analysis
• Topic Models
…

Topic Modeling - Intuitive
• Intuitive
– Assume the data we see is generated by some parameterized random process.
– Learn the parameters that best explain the data.
– Use the model to predict (infer) new data, based on data seen so far.

Example document:
Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a stepwise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.

Keywords: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel

Notations
• Word
– Basic unit.
– Item from a vocabulary indexed by {1, ..., V}.
• Document
– Sequence of N words, denoted by w = (w1, w2, ..., wN).
• Collection
– A total of D documents, denoted by C = {w1, w2, ..., wD}.
• Topic
– Denoted by z; the total number is K.
– Each topic has its unique word distribution p(w|z).

Background & Existing Techniques of Generative Latent Topic Models
• The Naïve Bayesian model
Word-topic decision:
\[ z^{*} = \arg\max_{z} p(z \mid w) \propto p(z)\, p(w \mid z) \]
in which p(z) is the prior probability of topic z and p(w|z) is the likelihood of word w given topic z.
• The probabilistic latent semantic indexing (PLSI) model (Hofmann, 2001)
Assumption: each document has a mixture of k topics.
Fitting the model involves estimating the topic-specific word distributions p(w_i|z_k) and the document-specific topic distributions p(z_k|d_j) from the corpus via maximum likelihood estimation (MLE).
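As a minimal sketch of the Naïve Bayesian word-topic decision, the snippet below picks the topic that maximizes p(z) p(w|z) in log space. The topic names and probability values are invented for illustration and are not taken from the study.

```python
import math

def most_likely_topic(word, prior, word_given_topic):
    """Return argmax_z p(z) * p(word | z), evaluated in log space."""
    best_topic, best_score = None, -math.inf
    for z, p_z in prior.items():
        p_w = word_given_topic[z].get(word, 0.0)
        if p_z <= 0.0 or p_w <= 0.0:
            continue  # a zero-probability topic can never win
        score = math.log(p_z) + math.log(p_w)
        if score > best_score:
            best_topic, best_score = z, score
    return best_topic

# Hypothetical two-topic model over three functional elements.
prior = {"metabolism": 0.6, "signaling": 0.4}
word_given_topic = {
    "metabolism": {"map00230": 0.5, "COG0642": 0.1, "COG0438": 0.4},
    "signaling": {"map00230": 0.1, "COG0642": 0.8, "COG0438": 0.1},
}
print(most_likely_topic("COG0642", prior, word_given_topic))
```

Each word is decided independently here, which is exactly what PLSI and LDA relax by letting a document mix several topics.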

Latent Dirichlet Allocation (LDA) Model (Blei, 2003)
• In the PLSI model, the topic mixture probabilities p(z_k|d_j) for documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated. Thus it is not scalable.
• The LDA model treats the probability of latent topics for each document p(z|d) and the conditional probability of words for each latent topic p(w|z) as latent random variables, which are subject to change when a new document comes.
\[ \theta_d \sim \mathrm{Dir}(\alpha), \qquad p(z_j \mid d) \sim \mathrm{Multi}(\theta_d) \]
\[ \phi_j \sim \mathrm{Dir}(\beta), \qquad p(w_i \mid z) \sim \mathrm{Multi}(\phi) \]
\[ p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto \frac{\beta + n_{-i,j}^{(w_i)}}{W\beta + n_{-i,j}^{(\cdot)}} \cdot \frac{\alpha + n_{-i,j}^{(d)}}{T\alpha + n_{-i}^{(d)}} \]
in which W is the vocabulary size and T is the number of topics.

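LDA Model Estimation - Gibbs Sampling Monte Carlo process (Griffiths, 2004)
Probability of a topic being assigned to a word given other observations:
\[ p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \cdot p(z = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \]
Word term:
\[ p(w_i \mid z = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(w_i \mid z = j, \phi_j)\, p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\phi_j = \frac{\beta + n_{-i,j}^{(w_i)}}{W\beta + n_{-i,j}^{(\cdot)}} \]
\[ p(w_i \mid z = j, \phi_j) = \phi_j^{(w_i)}, \qquad p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi_j) \cdot p(\phi_j) \]
in which \( p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi_j) \sim \mathrm{Multi}(\phi_j) \) and \( p(\phi_j) \sim \mathrm{Dir}(\beta) \). It follows that
\[ p(\phi_j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\beta + n_{-i,j}^{(w)}) \]
Topic term:
\[ p(z = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(z = j \mid \theta_d) \cdot p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\theta_d = \frac{\alpha + n_{-i,j}^{(d)}}{T\alpha + n_{-i}^{(d)}} \]
\[ p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d) \cdot p(\theta_d) \]
Since \( p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta_d) \sim \mathrm{Multi}(\theta_d) \) and \( p(\theta_d) \sim \mathrm{Dir}(\alpha) \), we have
\[ p(\theta_d \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\alpha + n_{-i,j}^{(d)}) \]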

Monte Carlo process
• Given the word-topic posterior probability, the Monte Carlo process becomes really straightforward. It is similar to throwing dice (given the probability of each facet to appear) to determine the assignment of topics to each word for the next round.
Given the probability for each word, \( p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \) for j = 1...K, draw a new topic assignment for each word.
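The dice-throwing step can be sketched as a collapsed Gibbs sampler in the style of Griffiths' formulation. This is an illustrative pure-Python sketch, not the authors' implementation: the function name, the hyperparameter defaults and the toy documents are assumptions, and the background topic of the proposed model is not included.

```python
import random

def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # word-topic counts
    n_dt = [[0] * n_topics for _ in docs]               # document-topic counts
    n_t = [0] * n_topics                                # tokens per topic
    z = []                                              # topic of each token
    for d, doc in enumerate(docs):                      # random initialisation
        z_d = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_d.append(t)
            n_wt[w][t] += 1
            n_dt[d][t] += 1
            n_t[t] += 1
        z.append(z_d)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current token: these are the "-i" counts.
                n_wt[w][t] -= 1
                n_dt[d][t] -= 1
                n_t[t] -= 1
                # Unnormalised posterior for each topic j. The document
                # denominator (T*alpha + n_d) is identical for all j, so it
                # is dropped before sampling.
                weights = [
                    (n_wt[w][j] + beta) / (n_t[j] + vocab_size * beta)
                    * (n_dt[d][j] + alpha)
                    for j in range(n_topics)
                ]
                # "Throw the dice" to pick the new topic.
                t = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = t
                n_wt[w][t] += 1
                n_dt[d][t] += 1
                n_t[t] += 1
    return z, n_wt, n_dt

# Toy corpus: word ids 0-1 and 3-4 tend to co-occur; id 2 is shared.
docs = [[0, 1, 0, 1, 2], [3, 4, 3, 4, 2], [0, 1, 1, 0], [4, 3, 3, 4]]
z, n_wt, n_dt = gibbs_lda(docs, n_topics=2, vocab_size=5)
```

After sampling, the count tables give smoothed estimates of p(w|z) and p(z|d), for example (n_wt[w][j] + beta) / (n_t[j] + W * beta).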

Statistical relationships of words and topics

An example of topic assignment to words

Experiments

Experiment: Inferring Functional Groups from Microbial Gene Catalogue with Topic Models
• In our experiment, based on the functional elements derived from the non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by probabilistic topic modeling.
• Probabilistic topic modeling is a Bayesian method that is able to extract useful topical information from unlabeled data. When used to study microbial samples, the functional elements (including taxonomic levels, indicators of gene orthologous groups and KEGG pathway mappings) bear an analogy with 'words'.
• Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample, which may be further used to study the genotype-phenotype connection of human disease.

Experimental Data Collection
• In our experiment, we conduct a probabilistic topic modeling experiment to identify functional groups from the human gut microbial community data generated by [Qin, et al. 2010], which is openly accessible via http://gutmeta.genomics.org.cn/
• The human gut microbial samples from [Qin, et al. 2010] belong to both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). Specifically, the IBD patients are from two different groups, one group with Crohn's disease (CD), and the other group with ulcerative colitis (UC). In total, there are 85 healthy samples, 15 UC samples and 12 CD samples.

Experimental Data Collection (continued)
• According to [Qin, et al. 2010], the Illumina GA reads from human gut microbial samples are first assembled into longer contigs. After that, the Glimmer program was used to predict protein-encoding sequences (CDs) from the assembled contigs. The predicted CDs sequences were then aligned to each other to form a non-redundant CDs catalog (a.k.a. minimal gut genome). The non-redundant CDs catalog consists of 3,299,822 non-redundant CDs sequences with an average length of 704 bp.
• Example catalog record:
CDs_id:           MH0001
Name:             GL0006996_MH0001_[Lack_3'-end]_[mRNA]_locus=scaffold96_9:1:1206:
Length:           1206
COG/KO:           COG4799 K01966
Pathway mapping:  map00280, map00640
Taxonomic level:  species Eubacterium eligens

Experimental Data Collection (continued)
• In our experiment, three types of functional elements are derived from the non-redundant CDs catalog, i.e. the NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators.
• Given a non-redundant CDs sequence, its NCBI taxonomic level is obtained by carrying out BLASTP alignment against the NCBI NR database. The taxonomic level of each non-redundant CDs sequence is determined by the lowest common ancestor (LCA)-based algorithm. The taxonomic abundance data for each sample can be computed by counting the indicators of NCBI taxonomic levels.
• The assignments of gene orthologous indicators and KEGG pathway indicators are achieved by BLASTP alignment of the amino-acid sequences from predicted CDs to the eggNOG database and KEGG database.
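The LCA idea can be illustrated with a short sketch: given the taxonomic lineages of all BLASTP hits for one CDs sequence, the assigned level is the deepest rank on which every hit agrees. The lineage representation (root-to-leaf lists of rank/name pairs) and the function name are assumptions made for this example; the actual LCA-based tool used in the study may weight or filter hits differently.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest (rank, name) pair shared by all lineages, or None.

    Each lineage is a root-to-leaf list of (rank, name) tuples.
    """
    if not lineages:
        return None
    lca = None
    # zip stops at the shortest lineage, so ranks are compared level by level.
    for level in zip(*lineages):
        if all(node == level[0] for node in level):
            lca = level[0]
        else:
            break
    return lca

# Hypothetical hit lineages for a single CDs sequence.
hits = [
    [("phylum", "Firmicutes"), ("class", "Clostridia"), ("genus", "Clostridium")],
    [("phylum", "Firmicutes"), ("class", "Clostridia"), ("genus", "Eubacterium")],
    [("phylum", "Firmicutes"), ("class", "Bacilli"), ("genus", "Bacillus")],
]
print(lowest_common_ancestor(hits))      # all three hits
print(lowest_common_ancestor(hits[:2]))  # only the two Clostridia hits
```

Sequences whose hits disagree at a shallow rank are therefore assigned a coarse level such as phylum, which is why indicators at several ranks appear in the vocabulary.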

Experimental Data Collection (continued)
Examples of functional elements:

NCBI Taxonomic Levels
  Genus    Clostridium
  Genus    Bacteroides
  Phylum   Firmicutes
  Class    Clostridia
  Genus    Bacillus

Orthologous Group Indicators
  COG0463 : Glycosyltransferases involved in cell wall biogenesis
  COG0642 : Signal transduction histidine kinase
  COG1132 : "ABC-type multidrug transport system, ATPase and permease components"
  COG0438 : Glycosyltransferase

KEGG Pathway Indicators
  map00230 : Metabolism_Nucleotide Metabolism_Purine metabolism
  map00240 : Metabolism_Nucleotide Metabolism_Pyrimidine metabolism
  map00350 : Metabolism_Amino Acid Metabolism_Tyrosine metabolism

• The union of unique functional elements jointly defines a fixed word vocabulary. In total, there are 647,136 NCBI taxonomic level indicators, with a vocabulary size of 748; 1,293,764 gene orthologous group indicators, with a vocabulary size of 4667; and 953,493 KEGG pathway indicators, with a vocabulary size of 237.
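To make the word analogy concrete, the sketch below treats each sample as a document and each functional element as a word, then builds the fixed vocabulary and the word-id sequences a topic model would consume. The element lists attached to the two sample IDs are invented for illustration; only the ID and indicator formats follow the slides.

```python
from collections import Counter

def build_corpus(sample_elements):
    """Map each sample's functional elements to integer word ids.

    Returns the sorted vocabulary and a dict: sample id -> list of word ids.
    """
    vocab = sorted({e for elems in sample_elements.values() for e in elems})
    index = {element: i for i, element in enumerate(vocab)}
    docs = {s: [index[e] for e in elems] for s, elems in sample_elements.items()}
    return vocab, docs

# Hypothetical element lists; real samples hold many thousands of indicators.
sample_elements = {
    "MH0001": ["genus_Bacteroides", "COG0463", "map00230", "COG0463"],
    "V1.CD-1": ["genus_Clostridium", "COG0642", "map00230"],
}
vocab, docs = build_corpus(sample_elements)
abundance = {s: Counter(ids) for s, ids in docs.items()}  # per-sample counts
```

Counting the ids per sample, as in `abundance`, corresponds to the abundance data described for the taxonomic indicators.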

Groups of functional elements in microbial community
Given the non-redundant CDs catalog and the derived functional elements, we are interested in identifying the frequent co-occurrence patterns of functional elements (a.k.a. functional groups).

Generative process of proposed model
• Commonly shared functional elements across samples may suggest functional similarity and biological relevance among samples. To cover such information, a genome-wide background distribution of functional elements needs to be estimated, which leads to the introduction of the background topic z0 in topic modeling.

Illustration of the background topic of gene OGs indicators
Background Topic - Indicator of Gene OGs

Gene OGs Indicator   Descriptions                                                                                    Probability
COG0463              Glycosyltransferases involved in cell wall biogenesis                                           0.00813
COG0642              Signal transduction histidine kinase                                                            0.00708
COG0582              Integrase                                                                                       0.00698
COG1132              "ABC-type multidrug transport system, ATPase and permease components"                           0.00689
COG0438              Glycosyltransferase                                                                             0.00664
COG0745              Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain   0.00644
COG1396              Predicted transcriptional regulators                                                            0.00595
COG0577              ABC-type antimicrobial peptide transport system, permease component                             0.00594
COG2207              AraC-type DNA-binding domain-containing proteins                                                0.00389
COG3250              Beta-galactosidase/beta-glucuronidase                                                           0.00344

Illustration of the background topic of KEGG Pathway Indicators
Background Topic - KEGG Pathway Indicator

Pathway Map ID   Descriptions                                                                    Probability
map00230         Metabolism_Nucleotide Metabolism_Purine metabolism                              0.0333
map00051         Metabolism_Carbohydrate Metabolism_Fructose and mannose metabolism              0.0264
map00500         Metabolism_Carbohydrate Metabolism_Starch and sucrose metabolism                0.0260
map00240         Metabolism_Nucleotide Metabolism_Pyrimidine metabolism                          0.0222
map00350         Metabolism_Amino Acid Metabolism_Tyrosine metabolism                            0.0221
map00260         Metabolism_Amino Acid Metabolism_"Glycine, serine and threonine metabolism"     0.0220
map00010         Metabolism_Carbohydrate Metabolism_Glycolysis / Gluconeogenesis                 0.0190
map00620         Metabolism_Carbohydrate Metabolism_Pyruvate metabolism                          0.0176
map00251         Metabolism_Amino Acid Metabolism_Glutamate metabolism                           0.0169
map00550         Metabolism_Glycan Biosynthesis and Metabolism_Peptidoglycan biosynthesis        0.0168

Uncovered latent topics with respect to NCBI taxonomic indicators
Illustration of the most relevant latent topics with respect to different taxa

Taxon                       1st Topic ID   2nd Topic ID   3rd Topic ID
family_Enterobacteriaceae   Topic 48       Topic 121      Topic 31
genus_Clostridium           Topic 50       Topic 153      Topic 95
genus_Bacteroides           Topic 156      Topic 77       Topic 52
phylum_Bacteroidetes        Topic 132      Topic 165      Topic 67
phylum_Firmicutes           Topic 0        Topic 99       Topic 193

MI scores for phylum_Firmicutes: 0.01256 (Topic 0), 0.00550 (Topic 99), 0.00212 (Topic 193).
Remaining MI scores shown on the slide (their row alignment is not recoverable): 0.03030, 0.02476, 0.02018, 0.01661, 0.01628, 0.01001, 0.00915, 0.00765, 0.00476, 0.00279, 0.00260, 0.00257.

Discoveries: For each taxon, latent topics are sorted with respect to the mutual information score (MI score). The MI serves as a relevance measurement between taxa and latent topics. It shows that phylum Firmicutes is most relevant to the background topic (Topic 0). Similarly, genus Clostridium is most relevant to Topics 50, 153 and 95, and genus Bacteroides is most relevant to Topics 156, 77 and 52.

Uncovered latent topics with respect to NCBI taxonomic indicators
Illustration of top-ranked latent topics with respect to different microbial samples

Rank   MH0001      O2.UC-1     V1.CD-1
1      Topic 0     Topic 0     Topic 0
2      Topic 124   Topic 95    Topic 61
3      Topic 181   Topic 143   Topic 12
4      Topic 159   Topic 83    Topic 115
5      Topic 86    Topic 65    Topic 52
6      Topic 72    Topic 139   Topic 32
7      Topic 19    Topic 59    Topic 50
…      …           …           …

p(topic|sample) for Topic 0: 0.475 (MH0001), 0.363 (O2.UC-1), 0.286 (V1.CD-1).
p(topic|sample) for the last listed topic: 0.017 (MH0001, Topic 19), 0.033 (O2.UC-1, Topic 59), 0.036 (V1.CD-1, Topic 50).
Remaining p(topic|sample) values shown on the slide (their row alignment is not recoverable): 0.124, 0.116, 0.116, 0.103, 0.101, 0.062, 0.059, 0.056, 0.050, 0.048, 0.040, 0.037, 0.034, 0.027, 0.018.

Discoveries: The probability of Topic 0 in healthy and UC samples (0.475 in MH0001 and 0.363 in O2.UC-1) is much higher than that in CD samples (0.286 in V1.CD-1). This suggests that for CD samples, the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. The prevalence of Topic 95 and Topic 52 in samples O2.UC-1 and V1.CD-1 may indicate the existence and possibly high abundance of genus Clostridium and genus Bacteroides, correspondingly.

Uncovered latent topics with respect to NCBI taxonomic indicators

Summary of Discoveries
• Our discoveries from the results are supported by recent findings in fecal microbiota studies of inflammatory bowel disease (IBD) patients [Gerber, 2007], [Manichanh C et al. 2006], [Walker A, et al. 2011], [Harry S, et al. 2006].
• It has been reported that there is a significant reduction in the proportion of bacteria belonging to phylum Firmicutes in CD samples, which is consistent with our results. This can be explained by the fact that mucosal microbial diversity is reduced in IBDs, particularly in CD, which is associated with bacterial invasion of the mucosa.
• In UC, the inflammation is typically more superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.

Conclusions
• Based on the functional elements derived from the non-redundant CDs catalogue, we have shown that the configuration of functional groups encoded in the gene-expression data of meta-genome samples can be inferred by applying probabilistic topic modeling to functional elements derived from the non-redundant CDs catalogue.
• The latent topics estimated from human gut microbial samples are supported by recent discoveries in fecal microbiota studies, which demonstrates the effectiveness of the proposed method.

Future work
• In the proposed model, the number of functional groups has to be specified in advance, or iteratively tuned by criteria such as log-likelihood and perplexity.
• In future work, we propose to use nonparametric hierarchical Bayesian models (such as the HDP model) to handle the uncertainty in the number of functional groups, which provide the flexibility of modeling microbial sequences with unknown functional group numbers.

Questions?

Backup Slides

Mutual Information
After estimating the topic model and assigning a latent topic to each functional element, the relevance between latent topics and functional element indicators (i.e. NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators) can be obtained by calculating the mutual information (MI) between functional element indicators and obtained latent topics, based on the final latent topic assignments to functional elements.
\[ MI(R_g, Z_t) = \sum_{R_g, Z_t \in \{0, 1\}} p(R_g, Z_t) \log \frac{p(R_g, Z_t)}{p(R_g)\, p(Z_t)} \]
in which R_g and Z_t are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair (R_g, Z_t) indicates whether a latent topic has been assigned to a specific functional element.
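A minimal sketch of this MI computation from the final topic assignments is given below. It assumes the joint distribution of the two binary indicators is estimated from a 2x2 table of token counts; the function name, the count arguments and the use of natural logarithms are choices made for this example.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in nats) between two binary indicators from a 2x2 count table.

    n11: tokens that are element g and assigned to topic t
    n10: tokens that are element g, assigned to another topic
    n01: tokens of other elements, assigned to topic t
    n00: tokens of other elements, assigned to another topic
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for joint, row_total, col_total in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 := 0)
            mi += (joint / n) * math.log(joint * n / (row_total * col_total))
    return mi

print(mutual_information(10, 10, 10, 10))  # independent indicators
print(mutual_information(50, 0, 0, 50))    # perfectly coupled indicators
```

Sorting topics by this score for a fixed taxon yields rankings of the kind reported in the taxon-topic table.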

Likelihood Comparison
\[ p(\mathbf{w} \mid \mathbf{z}) = \prod_{t=1}^{T} \left[ \int_{\phi_{z_t}} p(\mathbf{w} \mid z_t, \phi_{z_t})\, p(\phi_{z_t} \mid z_t)\, d\phi_{z_t} \right] \]
\[ = \left[ \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right]^{T} \cdot \prod_{t=1}^{T} \frac{\prod_{w_i} \Gamma(n_t^{(w_i)} + \beta)}{\Gamma(n_t^{(\cdot)} + W\beta)} \cdot \frac{\Gamma(W\eta)}{\Gamma(\eta)^{W}} \cdot \frac{\prod_{w_i} \Gamma(n_0^{(w_i)} + \eta)}{\Gamma(n_0^{(\cdot)} + W\eta)} \]

Likelihood Comparison (continued)
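The likelihood expression is easier to evaluate in log space with log-gamma functions. The sketch below assumes that η is the Dirichlet prior of the background topic (index 0) and β the prior of the T regular topics, as the n_0 and n_t counts suggest; the function and argument names are illustrative.

```python
from math import lgamma

def log_likelihood(topic_counts, background_counts, beta, eta):
    """log p(w | z) for T regular topics plus one background topic.

    topic_counts[t][w] is n_t^(w); background_counts[w] is n_0^(w).
    """
    W = len(background_counts)   # vocabulary size
    T = len(topic_counts)        # number of regular topics
    # Normalising constants of the T Dirichlet(beta) priors.
    ll = T * (lgamma(W * beta) - W * lgamma(beta))
    for counts in topic_counts:
        ll += sum(lgamma(n + beta) for n in counts)
        ll -= lgamma(sum(counts) + W * beta)
    # Background topic with its own Dirichlet(eta) prior.
    ll += lgamma(W * eta) - W * lgamma(eta)
    ll += sum(lgamma(n + eta) for n in background_counts)
    ll -= lgamma(sum(background_counts) + W * eta)
    return ll

# One topic, two words, a single observed token, uniform priors:
# the marginal probability of that token is 1/2.
print(log_likelihood([[1, 0]], [0, 0], beta=1.0, eta=1.0))
```

Comparing this value across different numbers of topics is one way to tune T, as noted under future work.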

Perplexity Comparison
The perplexity is calculated for held-out testing data, using parameters inferred from the trained topic model. Thus, it is the inverse predicted model likelihood of data in held-out testing data. In practice, a smaller perplexity value indicates better model fitting. In our experiment, we use a 50% subset of the functional elements as training data and the other 50% as testing data. On constructing the two subsets, we ensure that functional elements from the same sample are equally split to both subsets.
\[ \mathrm{perplexity}(D_{test}) = \exp\left[ \frac{-\sum_{j=1}^{D_{test}} \log p(\mathbf{w}_j)}{\sum_{j=1}^{D_{test}} N_j} \right] \]
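The perplexity formula translates directly into code once the per-document log-likelihoods log p(w_j) of the held-out data are available. How those log-likelihoods are obtained from the trained model is outside this sketch; the uniform model in the usage lines is only a sanity check.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-sum_j log p(w_j) / sum_j N_j) over held-out documents."""
    return math.exp(-sum(doc_log_likelihoods) / sum(doc_lengths))

# Sanity check: a model that assigns every word probability 1/4 should have
# a perplexity equal to the vocabulary size, 4.
lengths = [3, 5]
log_liks = [n * math.log(0.25) for n in lengths]
print(perplexity(log_liks, lengths))
```

Perplexity can be read as the effective number of equally likely words the model is choosing among, which is why lower is better.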

Perplexity Comparison (continued)

Dirichlet Process (DP) as a Non-Parametric Mixture Model
The Dirichlet Process (DP) is defined as a distribution over a random probability measure G0 ~ DP(γ, H), in which γ is a concentration parameter and H is a base measure defined on a sample space Θ. By its definition, for any finite measurable partition of Θ, {A1, …, Ar}:
\[ (G_0(A_1), \ldots, G_0(A_r)) \sim \mathrm{Dirichlet}(\gamma H(A_1), \ldots, \gamma H(A_r)) \]
The Dirichlet Process can also be constructed by the stick-breaking construction as follows:
\[ G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta(\theta_k), \qquad \beta_k = \alpha_k \prod_{i=1}^{k-1} (1 - \alpha_i), \qquad \alpha_k \sim \mathrm{Beta}(1, \gamma) \]
in which the weights of mixture components β = {β_k} (k = 1, …, ∞) are also referred to as β ~ GEM(γ).
[Figures: Dirichlet process by its definition; Dirichlet process constructed by stick-breaking construction. Data sample x_i is drawn from a base distribution with associated parameters θ_k.]
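The stick-breaking construction can be sketched with a finite truncation: break off a Beta(1, γ) fraction of the remaining stick at each step. The truncation level and function name are assumptions of this example; the true process has infinitely many components.

```python
import random

def stick_breaking(gamma, n_atoms, seed=0):
    """Draw the first n_atoms weights of beta ~ GEM(gamma).

    Returns the weights and the stick mass left over after truncation.
    """
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(n_atoms):
        a = rng.betavariate(1.0, gamma)  # alpha_k ~ Beta(1, gamma)
        weights.append(a * remaining)    # beta_k = alpha_k * prod_{i<k}(1 - alpha_i)
        remaining *= 1.0 - a
    return weights, remaining

weights, leftover = stick_breaking(gamma=2.0, n_atoms=10)
print(weights, leftover)
```

A larger γ spreads the mass over more components, which is how the DP lets the data decide the effective number of functional groups.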

Hierarchical Dirichlet Process (HDP)
The Hierarchical Dirichlet Process (HDP) considers G0 ~ DP(γ, H) as a global probability measure across the corpora and defines a set of child random probability measures Gj ~ DP(α0, G0) for each document j, which leads to different document-level distributions over semantic mixture components:
\[ (G_j(A_1), \ldots, G_j(A_r)) \sim \mathrm{Dirichlet}(\alpha_0 G_0(A_1), \ldots, \alpha_0 G_0(A_r)) \]
Each Gj can also be constructed by the stick-breaking construction as:
\[ G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta(\theta_k) \]
in which π_j = {π_jk} (k = 1, …, ∞) specifies the weights of mixture component indicator k.
Substituting the stick-breaking constructions of G0 and Gj, it shows that:
\[ \left( \sum_{k \in K_1} \pi_{jk}, \ldots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dirichlet}\left( \alpha_0 \sum_{k \in K_1} \beta_k, \ldots, \alpha_0 \sum_{k \in K_r} \beta_k \right) \]
Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, it follows that:
\[ \pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1} (1 - \pi'_{jl}), \qquad \pi'_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k,\ \alpha_0 \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right) \]
It then follows that π_j ~ DP(α0, β).
[Figure: Stick-breaking construction of hierarchical Dirichlet process]
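The document-level weights π_j can be sketched the same way, given truncated global weights β. This example assumes β leaves some stick mass unassigned (its entries sum to less than 1), so that both Beta parameters stay positive; the function name and the sample β values are illustrative.

```python
import random

def hdp_document_weights(beta, alpha0, seed=0):
    """Draw truncated document-level weights pi_j given global weights beta.

    Uses pi'_jk ~ Beta(alpha0 * beta_k, alpha0 * (1 - sum_{l<=k} beta_l)).
    """
    rng = random.Random(seed)
    pi, remaining, tail = [], 1.0, 1.0
    for b in beta:
        tail -= b  # tail = 1 - sum_{l<=k} beta_l, must stay positive
        p = rng.betavariate(alpha0 * b, alpha0 * tail)
        pi.append(p * remaining)  # pi_jk = pi'_jk * prod_{l<k}(1 - pi'_jl)
        remaining *= 1.0 - p
    return pi, remaining

# Hypothetical truncated global weights; 0.125 of the stick is left over.
beta = [0.5, 0.25, 0.125]
pi_j, leftover_j = hdp_document_weights(beta, alpha0=5.0)
print(pi_j, leftover_j)
```

A larger α0 keeps each document's weights closer to the shared global weights β, while a small α0 lets documents concentrate on a few components.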