
Semantic Recommendation by a Modular Recurrent Neural Network
Joern David
Institute of Computer Science / I1
Technical University Munich
D-85748 Garching, Germany
email: david@in.tum.de

Keywords: Recurrent neural networks, text mining, information retrieval, sequence prediction, dynamic systems, semantic recommendation.

Abstract— The central problem addressed by this paper is to learn and exploit the intrinsic associations between knowledge repositories such as document trees, which semantically belong to one knowledge domain but are not explicitly linked together. This class of information retrieval (IR) and inference problems is characterized by a lack of explicit associations between the independent knowledge repositories. To comprehensively understand the knowledge in a domain it is important to associate knowledge fragments which have been created independently.
Thus we focus on the hidden semantic relationships between correlated nodes that occur in the involved repositories. We conduct a combined approach of machine learning, formal grammars and data mining methods to provide semantic recommendation functionality upon existing graph-based knowledge bases. The recommendation is a prediction of a sequence of nodes based on the recent user behaviour on the knowledge graph. Domain experts are observed while they navigate through these graphs and thereby implicitly associate nodes which have a high cohesion (since they altogether describe the steps of a certain business process, for example). These associations, together with some automatically discovered association rules, are learned by a Modular Recurrent Neural Network (MRNN) in a supervised training process. This recurrent network model represents a dynamic system that has a novel modular composition and adapts at runtime to the structure of the variably long training patterns. In turn, new knowledge in the form of associations between the knowledge bases can be inferred from the internal network state, such that a domain beginner can proactively be recommended a sequence of relevant nodes, based on his automatically recognized navigation history.
I. INTRODUCTION

Research in artificial intelligence has always dealt with symbolic knowledge representations on the one side and with subsymbolic or connectionistic representations on the other side. Symbolic or rule-based techniques like expert systems benefit from their strict systematics and high interpretability. Symbols are arbitrary, but can be composed to meaningful symbolic structures like formal grammars according to the Chomsky hierarchy or rules in an expert system.
The connectionistic paradigm has the powerful concept of artificial neural networks at its disposal, which are universally applicable to every problem that can meaningfully be represented by a set of vectors {~v1, . . . , ~vn}, ~vi ∈ Rd.
We believe that a more powerful form of artificial intelligence can be achieved by combining the symbolic with the connectionistic paradigm. Thus we want to construct connectionistic models (e.g. neural networks) that behave as symbol processors and which autonomously develop an understanding of their environment. This implies that symbols have to be anchored in the connectionistic representation: a symbol is nothing without its creator.
On this basis an intelligent hybrid system of both paradigms can evolve. This article describes such an approach in the concrete domain of graph-based knowledge structures such as document trees that are implicitly correlated to each other. The behaviour of experts is observed in order to obtain association rules on the knowledge bases, which are used for supervised training of the recurrent neural network.
A main benefit of this connectionistic machine learning unit is its inherent generalization capability. Meaningful association rules that were not explicitly learned can also be inferred from the internal representation space of the novel recurrent neural network MRNN. Thus new knowledge in the form of associations between the knowledge bases can be derived from the internal network state. This knowledge should be provided to the system actor proactively as a recommendation of a sequence of target nodes in the respective knowledge base.
II. RELATED WORK

Since our semantic recommendation approach is composed of concepts of dynamic systems, formal grammars and text mining, a brief excerpt concerning relevant research findings in these areas is outlined.

A. Context prediction
[MRF04] states: "The future states q̄t+1, q̄t+2, . . . , q̄t+m for m discrete time steps are predicted recursively by some arbitrarily complex process q̄t+i = p⟨qt, q̄t+1, . . . , q̄t+i−1⟩;" This is exactly what we propose in this article, while the "complex process" is realized by a modular recurrent neural network that computes the future state q̄t+i based on all previous states qt−k, . . . , qt, q̄t+1, . . . , q̄t+i−1. According to [MRF04], this prediction process enables proactivity in general. In a proactive system, the system output at time t can also depend on the predicted future states q̄t+1, . . . , q̄t+i−1. This implies that the system states qi represent a context and the complex process p is an abstract state machine. "Thus interpreting the context changes as a state trajectory allows to forecast the future development of the trajectory, and therefore to predict the anticipated context."
This interpretation exactly describes the goal of our MRNN prediction model with its internal state layer, which represents the mentioned state trajectory.
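The recursive prediction scheme of [MRF04] can be made concrete with a short sketch: starting from the observed states, each newly predicted state is appended to the trajectory and becomes input to the next prediction step. The following Java fragment is only an illustration under our own simplifying assumptions; the class and method names (ContextPredictor, predictNext) are hypothetical, and the simple decaying average stands in for the arbitrarily complex process p.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of recursive context prediction: each predicted state
 *  is computed from all previous (observed and predicted) states and is
 *  fed back as input to the next prediction step. */
public class ContextPredictor {

    /** Hypothetical stand-in for the "arbitrarily complex process" p:
     *  here simply a decaying average over the known state trajectory. */
    static double predictNext(List<Double> trajectory) {
        double weightedSum = 0.0, weightTotal = 0.0, weight = 1.0;
        for (int i = trajectory.size() - 1; i >= 0; i--) {
            weightedSum += weight * trajectory.get(i);
            weightTotal += weight;
            weight *= 0.5; // older states contribute less
        }
        return weightedSum / weightTotal;
    }

    public static void main(String[] args) {
        // Observed states q_{t-k}, ..., q_t (toy one-dimensional example).
        List<Double> states = new ArrayList<>(List.of(0.2, 0.4, 0.7, 0.9));
        int m = 5; // prediction horizon
        for (int i = 1; i <= m; i++) {
            double predicted = predictNext(states);
            states.add(predicted); // feedback: the predicted state becomes input
            System.out.printf("predicted state t+%d = %.4f%n", i, predicted);
        }
    }
}
```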

B. Formal grammars incorporated by neural networks
Formal languages that are generated by formal grammars can be learned by artificial neural networks. Recurrent neural networks (RNNs) have already been applied to internalize the explicit rules of formal grammars, such that they were capable of classifying or predicting words ω ∈ L of the generated language L (cp. [Smi03]). These words are sequences of terminal symbols of the underlying alphabet Σ. So different types of RNNs like Elman Simple Recurrent Networks (SRN, e.g. [JMS]) or Jordan nets were utilized to classify the words in terms of the membership ω ∈ L or ω ∉ L (the word problem) [Cal03]. By contrast, the prediction task is to continue the symbol sequence ω̌ = a1 . . . ak, ai ∈ Σ for a given partial word ω̌, ω = ω̌ω̂, such that the predicted sequence of subsequent symbols ω̂ = ak+1 . . . an conforms to the rules of the grammar.
Many languages that are processed by neural networks originate from the regular grammars (Chomsky type 3), which can be represented by a Finite State Machine (FSM).
C. Information Retrieval
[MS05] proposes a Neural Network Information Retrieval (IR) system that consists of three layers, namely a query, a keyword and a document layer. This system should enable information retrieval from text documents in the Slovak language. Here the specialty is the use of neural networks for the "transition of the information" between each of these layers. The goal is to associate actor queries with a set of keywords, which is itself associated with the actual document set, while the associations between the layers are established by two neural networks. Like the paper at hand, [MS05] also wants to exploit the memorization and generalization capabilities of artificial neural networks in order to associate significant terms (keywords) with user queries on the one hand and with their occurrences in documents on the other hand. However, the focus of this IR system is not on machine learning.

Fig. 1. Three layer information retrieval concept enabled by a common feed-forward neural network architecture.

III. MODELING NAVIGATION SEQUENCES BY A FORMAL GRAMMAR

In order to associate sequences of nodes in knowledge bases we want to model navigation operations on these nodes by a formal grammar.
Def. Components of a formal grammar
• Σ is the alphabet of terminal symbols (lowercase letters).
• V is the set of variables or non-terminal symbols, V ∩ Σ = ∅ (capital letters).
• P is the set of production rules, which is a finite subset: P ⊂ (V ∪ Σ)+ × (V ∪ Σ)∗.
• ∗ is the Kleene star, which stands for an arbitrary number 0..n of single symbol occurrences (+ stands for 1..n).
Now a formal grammar is a 4-tuple G = (V, Σ, P, S), where S ∈ V is the start symbol.
It is proven that neural networks with one hidden layer and a non-linear activation function (e.g. logistic function, tanh) are at least as powerful and expressive as the general Turing machine.¹
¹ The power of neural networks could even exceed the power of Turing machines, due to the massive parallelism of information processing (MPP) by the interconnected neurons.
So context-sensitive rules like αBγ → αβγ, B ∈ V, α, β, γ ∈ (V ∪ Σ)∗ can theoretically be recognized by neural networks with the appropriate learning algorithm. The expressiveness of context-sensitive grammars is very high, thus in practice it is very difficult to conduct context-sensitive language learning. Nevertheless, also very complex grammars are expressible by RNNs. For our purpose it is sufficient to train a recurrent neural network in a way that it behaves like a Finite State Machine (FSM).
An FSM can only generate or respectively accept the regular languages, according to the Chomsky hierarchy [Sch01]. These are generated by the grammars of type three in this hierarchy, whose regular production rules – which are exactly equivalent to the regular expressions – are the least expressive ones, but sufficient to describe all relevant actor traces in the recommendation system.
But even with the restrictions of regular grammars we can leverage the context of an actor trace in the form of the previously visited nodes a b1 b2 . . . bj c, which altogether represent the navigation history for the generic knowledge tree depicted in figure 2. In that case, the target sequence is represented by the symbol sequence d1 . . . dk e f g. As a consequence, the path through the knowledge tree generates a word ω that is composed of a sequence of terminal symbols, and ω ∈ L(G) ∈ L3 is derived from the following regular grammar G:
G = { S → aB1,   B1 → b1B2,   . . . ,   Bj → bjC,   C → cD1,
      D1 → d1D2,   . . . ,   Dk → dkE,   E → eF,   F → fG,   G → g }

Fig. 2. Sequence of navigation operations represented by a word of a formal language L(G) – according to the regular grammar G given above. The nodes of the knowledge tree represent the terminal symbols (lower case letters) of the underlying alphabet Σ associated with L. A single actor trace is depicted as a sequence of arrows.

Therefore L(G) as a regular language is an instance of the class of type 3 languages L3. The capital letters are variables that are not part of the actual knowledge tree.
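To make the grammar above concrete, the following sketch encodes its productions as right-linear rewriting rules and derives one admissible actor trace. It is only an illustration under our own assumptions (history length j = 2, target length k = 2); the class name TraceGrammar and the production table are hypothetical and not part of the prototype described later. The final check also illustrates the equivalence of regular productions and regular expressions mentioned in this section.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: the regular grammar G from above as right-linear rewriting rules.
 *  Non-terminal -> [emitted terminal, successor non-terminal or null]. */
public class TraceGrammar {

    public static void main(String[] args) {
        // Instance of G with history length j = 2 and target length k = 2 (assumed).
        Map<String, String[]> productions = new LinkedHashMap<>();
        productions.put("S",  new String[]{"a",  "B1"});
        productions.put("B1", new String[]{"b1", "B2"});
        productions.put("B2", new String[]{"b2", "C"});
        productions.put("C",  new String[]{"c",  "D1"});
        productions.put("D1", new String[]{"d1", "D2"});
        productions.put("D2", new String[]{"d2", "E"});
        productions.put("E",  new String[]{"e",  "F"});
        productions.put("F",  new String[]{"f",  "G"});
        productions.put("G",  new String[]{"g",  null});

        // Derive the word of this grammar instance, starting from S.
        StringBuilder word = new StringBuilder();
        String nonTerminal = "S";
        while (nonTerminal != null) {
            String[] rule = productions.get(nonTerminal);
            word.append(rule[0]).append(" ");
            nonTerminal = rule[1];
        }
        String trace = word.toString().trim();
        System.out.println("Actor trace: " + trace);   // a b1 b2 c d1 d2 e f g

        // Regular grammars are equivalent to regular expressions: the same
        // language (for arbitrary j, k) can be checked with a single pattern.
        boolean accepted = trace.matches("a( b\\d+)+ c( d\\d+)+ e f g");
        System.out.println("Word in L(G): " + accepted);
    }
}
```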
In contrast to the common Hidden Markov Model (e.g. [Rab89]), we want to take the history into account, and thus it is not expedient to build the prediction only on the current state.
This is illustrated by the knowledge tree shown in figure 2, where the previously visited nodes resulting in the navigation history a . . . c are decisive for the further proceeding of the actor in the knowledge base. The recommendation system must finally support this kind of context recognition, because an actor should be guided to the subsequent nodes e f g if his recently observed behaviour matches an already learned rule or can be inferred from it. In terms of the given regular grammar this means that the processed terminal symbol c only matches the rule C → cD1 when it occurs subsequent to the history a b1 . . . bj – which can be considered as the search or navigation context.
In this section we have shown that the task of incorporating arbitrary navigation sequences, which are implicitly generated by domain experts, can be considered as learning an unknown regular grammar. This means in turn that all learned rules can be extracted afterwards by methods similar to those of association rule mining. Like this, a general understanding of the linked knowledge bases can be derived in the form of symbolic association rules.

IV. SOLUTION APPROACH

A. The Modular Recurrent Neural Network (MRNN)

The main task of the proposed neural network model is to recognize context-dependent navigation operations on the artifact tree and automatically predict the most appropriate target sequence of nodes that is relevant to the current topic or task.
1) Theoretical foundation of the network model: Figure 3 illustrates the topology and the information flow of the proposed recurrent neural network. ~yt+1, . . . , ~yt+m represent the associated sequence of target nodes, which are predicted by the neural net to be associated most strongly with the original sequence of nodes in the artifact tree. This means that the prediction is computed on the basis of the recent navigation history ~xt−k, . . . , ~xt according to the observed actor behaviour.
An important characteristic of the recurrent design is the inherent temporal memory of the neural network. So the internal state layer ~st−k, . . . , ~st can store a temporal sequence of actor operations in the training phase. Afterwards, in the operational application of the network, it can fully recover a similar sequence of navigation operations in the form of another actor trace and is then able to give a prediction of the most likely associated node sequence in the target tree.
Thus the proposed MRNN can learn the temporal structure of a set of symbol sequences, which is achieved by the temporally unfolded network topology. The temporal dynamics of the history and target sequences are explicitly modeled in the hidden state layer. The neuron formations of that layer serve both as hidden and as context units, because ~st−1 provides the context for the recursive computation of the hidden state ~st. Thus these neurons are called hidden and context records, and one record per time step corresponds to one node of a knowledge repository such as the artifact or target tree.

Fig. 3. Schematic topology of the proposed modular recurrent network called MRNN. Internal state transition ~st → ~st+1; A, B and C are weight matrices. ~xt is the external input symbol at time t, ~yt+1 is the correspondingly predicted output symbol. The depicted block arrow direction stands for a forward information passage.
B. Dynamical modular network structure
In order to process arbitrarily long node sequences ~xt−k, . . . , ~xt, ~yt+1, . . . , ~yt+m with a neural network, we introduce a novel modular configuration of a recurrent network that adapts to the number of history and sequence nodes at runtime.
This dynamical modular structure is founded on the following recurrent propagation model (forward information flow).

Prior definitions
• A ∈ R^{h×d}, B ∈ R^{h×h} and C ∈ R^{d×h} are weight matrices.
• d is the dimension of the feature space.
• h = dim(~si), ∀i, is an arbitrary number of hidden neurons per state module, or in other words the dimensionality of the state layer. h is independent from d.

~st = f(B~st−1 + A~xt)    (1)
~yt = f(C~st−1)    (2)

As introduced above, the ~si stand for the internal state at each discrete time step. This state layer is the backbone for learning from the navigation history and for predicting the recommended target nodes. The crucial recurrent equation 1 combines an external input ~xt and the previous state ~st−1 into the subsequent state ~st, which indirectly depends on all foregoing external inputs ~xt−k, . . . , ~xt−1 and internal states ~st−k, . . . , ~st−1. f : R → [cl, ch] is a function with the following properties.
1) f is monotonically increasing.
2) lim_{x→−∞} f(x) = cl, lim_{x→∞} f(x) = ch, ch > cl.
Often f is chosen as the sigmoidal function f(x) = 1 / (1 + exp(−x)).
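A compact sketch of the forward propagation defined by equations 1 and 2 is given below. It is not the prototype implementation mentioned later, but a minimal illustration under our own assumptions (class name MrnnForward, toy dimensions h = 3, d = 2, random weights); it only shows how the shared matrices A, B, C and the sigmoid f are applied per time step.

```java
import java.util.Random;

/** Minimal sketch of the MRNN forward pass (equations 1 and 2):
 *  s_t = f(B s_{t-1} + A x_t),  y_t = f(C s_{t-1}). */
public class MrnnForward {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    static double[] multiply(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++)
            for (int j = 0; j < v.length; j++)
                out[i] += m[i][j] * v[j];
        return out;
    }

    static double[] add(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] + b[i];
        return out;
    }

    static double[] activate(double[] v) {
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) out[i] = sigmoid(v[i]);
        return out;
    }

    static double[][] randomMatrix(int rows, int cols, Random rnd) {
        double[][] m = new double[rows][cols];
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                m[i][j] = rnd.nextDouble() - 0.5; // small weights, |w| < 1
        return m;
    }

    public static void main(String[] args) {
        int h = 3, d = 2;                        // toy dimensions (assumed)
        Random rnd = new Random(42);
        double[][] A = randomMatrix(h, d, rnd);  // shared across all time steps
        double[][] B = randomMatrix(h, h, rnd);
        double[][] C = randomMatrix(d, h, rnd);

        double[][] inputs = { {1, 0}, {0, 1}, {1, 1} }; // x_{t-k}, ..., x_t
        double[] s = new double[h];                      // initial state
        for (double[] x : inputs) {
            double[] y = activate(multiply(C, s));       // eq. (2): output from previous state
            s = activate(add(multiply(B, s), multiply(A, x))); // eq. (1): state update
            System.out.printf("y = [%.3f, %.3f]%n", y[0], y[1]);
        }
    }
}
```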
The specialty of the suggested model is its modular composition as depicted in figure 3. Thus the MRNN has to be trained with a modified backpropagation through time (BPTT) algorithm, while the training process itself is not the subject of this article (but will be treated in detail in succeeding papers). However, there is a crucial feature of the modular network structure:
Independent of the length k of the history sequence or the length m of the target sequence, the weight matrices A, B and C are adjusted in the training process, because they are reused in every time step t. This implies that there are exactly three matrix instances for the training of all variably long association rules. During one training epoch² the network adapts at runtime to the individual structure of the respective node sequence with its history and target part – fed into the neural network after preprocessing and transformation into a training pattern.
² A single cycle through all training patterns.
1) Prediction of target sequences: The hidden temporal representation by the MRNN's state layer theoretically enables infinitely long prediction sequences, since the MRNN possesses a continuous internal state layer [GSW05]. Thereby the current state (~st)t∈N is repeatedly propagated through the matrix B, even if no external inputs ~xt are available anymore. The prediction horizon, that means the number of reliably predictable target nodes m, can normally be recognized by means of the output behaviour ~yt+m, which attains a steady state for m > mmax and m → ∞:
Theorem:
~yt −→ ~yψ   for m > mmax, m → ∞    (3)
A = const.  ⇒  ~st −→ ~sψ   for m > mmax, m → ∞    (4)
Since the internal weights satisfy B[i][j] < 1, ∀i, j – due to the chosen sigmoidal activation function with ch = 1 – the mapping B(. . . (B(B~st)) . . .) is a so-called contraction, which converges to a fixpoint ψ for m → ∞.
This result was also observed in practice during intensive work with the functional prototype implementation (Java) of the proposed semantic recommendation system (coming up in a following paper with the working title "Semantic Recommendation for the Software Engineering Process").
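The closed-loop prediction described above can be sketched as follows: once the external inputs are exhausted, the state is propagated through B alone and the output is read out until it no longer changes noticeably. This is again only an illustrative fragment under our own assumptions (class name ClosedLoopPrediction, toy weights, a simple convergence threshold); it mirrors the fixpoint behaviour of equations 3 and 4 rather than the actual prototype.

```java
/** Sketch of closed-loop prediction: after the last external input, the state
 *  is repeatedly propagated through B only, until the output reaches a steady
 *  state (fixpoint), which bounds the usable prediction horizon m_max. */
public class ClosedLoopPrediction {
    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    static double[] propagate(double[][] m, double[] v) {
        double[] out = new double[m.length];
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < v.length; j++) out[i] += m[i][j] * v[j];
            out[i] = sigmoid(out[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy weights with |B[i][j]| < 1 (assumed), so the iteration contracts.
        double[][] B = { {0.3, -0.2}, {0.1, 0.4} };   // state-to-state
        double[][] C = { {0.5,  0.2}, {-0.3, 0.6} };  // state-to-output
        double[] s = { 0.8, 0.1 };                    // last state driven by real inputs

        double[] previousY = propagate(C, s);
        for (int m = 1; m <= 50; m++) {
            s = propagate(B, s);                      // no external input any more
            double[] y = propagate(C, s);
            double change = Math.abs(y[0] - previousY[0]) + Math.abs(y[1] - previousY[1]);
            System.out.printf("m=%d  y=[%.4f, %.4f]  change=%.6f%n", m, y[0], y[1], change);
            if (change < 1e-4) {                      // steady state: fixpoint reached
                System.out.println("Output converged; reliable horizon m_max is about " + m);
                break;
            }
            previousY = y;
        }
    }
}
```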
V. APPLICATION TO A TEXTUAL KNOWLEDGE BASE

As a concrete application of the recommendation functionality, a knowledge base like a document tree should be considered. Therefore each document node has to be transformed into a meaningful and machine-processable representation.

A. Node representation model
For each knowledge node a feature vector ~vi ∈ V ⊆ Rd is to be computed in a way that it represents its content in a numerically processable form. In the case of textual content we apply well-known text mining methodologies (such as stemming) for computing a numerical document representation.
We give a procedure by means of the term frequency – inverse document frequency, which computes a characteristic vector for each document in the knowledge base (KB) by rating the relevance of all existing keywords.
Def. Term and document ratios

df(ti) = |docs(ti)| / |KB|    (5)
tf(ti, doc) = n(ti, doc) / Σ_{t∈doc} n(t, doc)    (6)
tfidf(ti, doc) = ( n(ti, doc) / Σ_{t∈doc} n(t, doc) ) · ( |KB| / |docs(ti)| )    (7)

n(t, doc) counts the number of occurrences of a term t in a single document doc. The document frequency df(ti) counts the number of documents that contain a term ti, related to the number n of existing documents in a certain repository KB = {doc1, . . . , docn}. tf(ti, doc) gives the relative term frequency of a term ti within a single document doc. An exemplary ranking table is given in table I. The idea of tfidf is to assign a higher relevance to terms that occur often within a single document. At the same time, terms are rated higher that appear more seldom in the entire repository KB. The notion behind the tfidf is that globally frequent terms are not valuable for differentiating between the documents doc1, . . . , docn, because they are contained in most of the documents. On the other side, locally frequent terms characterize a single document very well.³
³ tfidf plus document length normalization.

TABLE I
Ranking table with term frequency–inverse document frequency for each term ti in a document doc.
term    tfidf(doc)
t14     0.41
t35     0.37
t27     0.26
...     ...
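The ratios of equations 5–7 and the ranking of table I can be reproduced with a few lines of code. The sketch below uses a toy corpus and hypothetical class and method names (TfIdfRanking, tfidf); it follows the unlogarithmized tfidf variant given above, not the standard log-weighted formulation.

```java
import java.util.*;

/** Sketch of the term and document ratios (equations 5-7) on a toy corpus. */
public class TfIdfRanking {

    /** Relative term frequency tf(t, doc) = n(t, doc) / sum over all terms of doc. */
    static double tf(String term, List<String> doc) {
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** Document frequency df(t) = |docs(t)| / |KB|  (equation 5). */
    static double df(String term, List<List<String>> kb) {
        long docsWithTerm = kb.stream().filter(d -> d.contains(term)).count();
        return (double) docsWithTerm / kb.size();
    }

    /** tfidf(t, doc) = tf(t, doc) * |KB| / |docs(t)| = tf / df  (equation 7, no logarithm). */
    static double tfidf(String term, List<String> doc, List<List<String>> kb) {
        double documentFrequency = df(term, kb);
        return documentFrequency == 0 ? 0 : tf(term, doc) / documentFrequency;
    }

    public static void main(String[] args) {
        List<List<String>> kb = List.of(
            List.of("process", "model", "artifact", "process"),
            List.of("model", "neural", "network"),
            List.of("artifact", "tree", "process"));
        List<String> doc = kb.get(0);

        // Rank all terms of the document by their tfidf value (cp. table I).
        Map<String, Double> ranking = new TreeMap<>();
        for (String term : new HashSet<>(doc)) ranking.put(term, tfidf(term, doc, kb));
        ranking.entrySet().stream()
               .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
               .forEach(e -> System.out.printf("%-10s %.3f%n", e.getKey(), e.getValue()));
    }
}
```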
Based on the introduced term and document frequency measures, a fixed number d of most significant terms has to be selected. Thus a global term significance measure is required that rates terms over all documents in KB.
The d selected terms now serve for computing the feature vector, which describes a document by the globally most significant terms and computes their local value in the respective document doc by tfidf.

∀doc ∈ KB:  ~vi = ( tfidf(trank 1, doc), tfidf(trank 2, doc), . . . , tfidf(trank d, doc) )ᵀ    (8)

This corresponds to a modified bag-of-words approach with d terms, where the terms are locally rated via tfidf and w.r.t. a single document. Only with this global feature vector model do the semantics of each document become approachable by the MRNN. Here global means that the structure (trank 1, . . . , trank d), and therefore the key terms describing the content of each document, are equal; only the component-wise tfidf() rating distinguishes the document instances.
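A brief sketch of the vector construction of equation 8: given tfidf scores per document, the globally most significant d terms define one shared component order, and every document is mapped onto that order. The ranking criterion used here (tfidf summed over the repository) and the names FeatureVectors and globalRanking are our own assumptions; the paper leaves the concrete global significance measure open.

```java
import java.util.*;

/** Sketch of the global feature vector model (equation 8): a shared term order
 *  (t_rank1, ..., t_rankd) and per-document tfidf ratings as components. */
public class FeatureVectors {

    /** Assumed global significance measure: tfidf summed over all documents. */
    static List<String> globalRanking(Map<String, Map<String, Double>> tfidfByDoc, int d) {
        Map<String, Double> globalScore = new HashMap<>();
        for (Map<String, Double> docScores : tfidfByDoc.values())
            docScores.forEach((term, score) -> globalScore.merge(term, score, Double::sum));
        return globalScore.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .limit(d)
                .map(Map.Entry::getKey)
                .toList();
    }

    public static void main(String[] args) {
        // Precomputed tfidf scores per document (toy values, cp. table I).
        Map<String, Map<String, Double>> tfidfByDoc = Map.of(
            "doc1", Map.of("process", 0.41, "artifact", 0.37, "model", 0.10),
            "doc2", Map.of("neural", 0.52, "network", 0.35, "model", 0.12),
            "doc3", Map.of("artifact", 0.44, "tree", 0.30, "process", 0.15));

        List<String> rankedTerms = globalRanking(tfidfByDoc, 4);  // d = 4 (assumed)
        System.out.println("shared component order: " + rankedTerms);

        // Every document uses the same component order; only the ratings differ.
        for (var entry : tfidfByDoc.entrySet()) {
            double[] v = new double[rankedTerms.size()];
            for (int i = 0; i < v.length; i++)
                v[i] = entry.getValue().getOrDefault(rankedTerms.get(i), 0.0);
            System.out.println(entry.getKey() + " -> " + Arrays.toString(v));
        }
    }
}
```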
In order to ensure a sufficient performance of the neural network processing, the complexity of the document representation should be constrained via d to a reasonable magnitude. Furthermore, dimension reduction techniques like Principal Component Analysis (PCA) can be leveraged to optimize the document representation regarding its semantics and the efficiency of neural processing.
The suggested method for constructing a meaningful document vector model will be utilized in the generic 2-layered representation of knowledge nodes that is presented in the following section.

B. 2-layer representation of knowledge tree nodes
Knowing that the knowledge tree nodes can be represented in a vector space V ⊆ Rd, the following 2-layered mapping will serve for the identification and addressing of the nodes that participate in an association between artifact and target tree. The 2-layered mapping is illustrated by figure 4. The dichotomy of the symbolic and the latent semantic representation enables a fault-tolerant and fuzzy linking of both knowledge bases via the recurrent neural network. The latter operates only on the node representation layer and is decoupled from the unique identifier scheme of the symbolic layer. Therefore the neural network is exclusively bound to the vector space layer on the side of the artifact repository as well as on the side of the target repository.
This second layer ensures that similar or neighboring instances of the knowledge tree nodes, such as text documents, are also neighboring regarding their content representation in the vector space Rd.

Fig. 4. Illustration of the 2-layered mapping concept for knowledge node representation and association.
C. Pretraining of the recommendation system
In order to provide a basic recommendation functionality on a certain knowledge domain – assumed to be represented by a graph or tree structure consisting of text documents – the recommendation system automatically discovers a set of fundamental similarity rules. These rules are association rules of history and target length 1 and are determined by a simple n² similarity search over all document nodes in KB = {doc1, . . . , docn}. As similarity measure, the standard Euclidean norm on Rd or the so-called cosine distance can be used.

cos-distance:  d(~a, ~b) = ⟨~a, ~b⟩ / ( ‖~a‖ · ‖~b‖ )    (9)

⟨·, ·⟩ is the standard scalar product.
Document index pairs (i, j) with cond: d(doci, docj) ≤ t, t ∈ R, are considered as semantically similar and lead to binary association rules
doci ⇒ docj.
The set of document pairs that meet the condition cond is transformed into training patterns and pretrained by the MRNN.
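The n² pretraining pass can be sketched as a pairwise comparison of the document feature vectors. The snippet below evaluates the cosine measure of equation 9 and emits the binary rules doci ⇒ docj; the class name SimilarityRules, the threshold value and the toy vectors are our own assumptions, and we interpret the threshold condition as "sufficiently similar" (the text states it as d(doci, docj) ≤ t).

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the pretraining rule discovery: an n^2 pairwise comparison of the
 *  document vectors with the cosine measure of equation 9, emitting binary
 *  association rules doc_i => doc_j for sufficiently similar pairs. */
public class SimilarityRules {

    /** Cosine measure <a, b> / (||a|| * ||b||), cp. equation 9. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy tfidf feature vectors ~v_i of the document nodes (assumed values).
        double[][] docs = {
            {0.41, 0.37, 0.00, 0.10},
            {0.39, 0.35, 0.05, 0.12},
            {0.02, 0.01, 0.52, 0.35}
        };
        double threshold = 0.9; // assumed: pairs at least this similar become rules

        List<String> rules = new ArrayList<>();
        for (int i = 0; i < docs.length; i++)
            for (int j = 0; j < docs.length; j++)
                if (i != j && cosine(docs[i], docs[j]) >= threshold)
                    rules.add("doc" + (i + 1) + " => doc" + (j + 1));

        // These binary rules (history and target length 1) would be converted
        // into training patterns for the MRNN pretraining.
        rules.forEach(System.out::println);
    }
}
```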
VI. CONCLUSION

We have introduced a generic semantic recommendation functionality based on a novel recurrent neural network – the Modular Recurrent Neural Network MRNN. The MRNN enables learning of variably long association rules on graph-based knowledge bases.

ACKNOWLEDGMENT
The authors would like to thank...

REFERENCES
[Cal03] Robert Callan. Neuronale Netze im Klartext. Pearson Studium, ISBN 3-8273-7071-X, 2003.
[GSW05] Faustino J. Gomez, Jürgen Schmidhuber, and Daan Wierstra. Modeling Systems with Internal State using Evolino. In Proc. of the 2005 Conference on Genetic and Evolutionary Computation (GECCO), Washington, D.C., pages 1795–1802. ACM Press, New York, NY, USA, 2005.
[JMS] Fergal W. Jones, I. P. L. McLaren, and Rainer Spiegel. The Prediction-Irrelevance Problem in Grammar Learning. University of Cambridge, Department of Experimental Psychology, Downing Site, Cambridge, CB2 3EB, UK.
[Kra91] Klaus Peter Kratzer. Neuronale Netze. Carl Hanser Verlag, München/Wien, 1991.
[MRF04] Rene Mayrhofer, Harald Radi, and Alois Ferscha. Recognizing and Predicting Context by Learning from User Behavior. Institut für Pervasive Computing, Johannes Kepler Universität Linz, 2004.
[MS05] Igor Mokriš and Lenka Skovajsová. Neural Network Model of System for Information Retrieval from Text Documents in Slovak Language. Acta Electrotechnica et Informatica, Vol. 5, No. 3, 2005.
[Rab89] Lawrence R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. 1989.
[Sch01] Uwe Schöning. Theoretische Informatik – kurz gefasst. Spektrum Akademischer Verlag, 2001.
[SLD96] I. Syu, S. D. Lang, and N. Deo. A neural network model for information retrieval using latent semantic indexing. In ICNN 96, The 1996 IEEE International Conference on Neural Networks, pages 1318–1323, vol. 2, 1996.
[Smi03] Andrew Smith. Grammar Inference Using Recurrent Neural Networks. Department of Computer Science, University of California, San Diego, La Jolla, CA 92037, 2003.
