# HMMs are used to analyse and predict time series

A time series is a sequence of data points, typically measured at successive instants spaced at uniform time intervals. A senone is the smallest speech unit for which we can compute HMM emission probabilities. A Markov process can be thought of as 'memoryless': loosely speaking, a process satisfies the Markov property if one can make predictions for the future of the process based solely on its present state just as well as one could knowing the process's full history.

## Recognition process

The common way to recognize speech is the following: we take the waveform, split it into utterances at silences, then try to recognize what is being said in each utterance. To do that, we want to take all possible combinations of words and try to match them with the audio. We choose the best matching combination. There are a few important things in this match.

First of all, there is the concept of features. Since the number of parameters is large, we try to optimize it. Features are numbers calculated from speech, usually by dividing the speech into frames. Then, for each frame, typically 10 milliseconds long, we extract 39 numbers that represent the speech. That is called a feature vector. The way to generate these numbers is a subject of active investigation, but in the simple case they are derived from the spectrum.

Second, there is the concept of the model. A model describes some mathematical object that gathers common attributes of the spoken word. In practice, the audio model of a senone is a Gaussian mixture of its three states; to put it simply, it is the most probable feature vector. The concept of the model raises the following issues: how well the model fits practice, whether the model can be made better given its internal problems, and how adaptive the model is to changed conditions.

Third, there is the matching process itself. Since comparing all feature vectors with all models would take longer than the universe has existed, the search is usually optimized with many tricks.
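The framing step described above can be illustrated with a minimal sketch. It assumes a 16 kHz sample rate and non-overlapping 10 ms frames, and it computes a single log-energy value per frame as a stand-in for the 39-number feature vector; the function names are hypothetical and a real front end would use spectrum-derived features.

```python
import math

def split_into_frames(samples, sample_rate=16000, frame_ms=10):
    """Split a waveform (list of floats) into consecutive, non-overlapping frames."""
    frame_len = sample_rate * frame_ms // 1000   # 160 samples at 16 kHz
    n_frames = len(samples) // frame_len         # an incomplete tail is dropped
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def log_energy(frame, floor=1e-10):
    """One toy feature: log of the mean squared amplitude of the frame."""
    energy = sum(x * x for x in frame) / len(frame)
    return math.log(max(energy, floor))          # floor avoids log(0) on silence

# One second of a synthetic 440 Hz tone standing in for recorded speech.
sample_rate = 16000
waveform = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = split_into_frames(waveform, sample_rate)
# One (here 1-dimensional) feature vector per frame.
features = [[log_energy(frame)] for frame in frames]
print(len(frames), len(frames[0]))
```

One second of audio yields 100 frames of 160 samples each under these assumptions. Practical front ends often use overlapping windows, which this sketch leaves out for brevity.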
During the search, at any point we maintain the best matching variants and extend them as time goes on, producing the best matching variants for the next frame.

## Models

According to the speech structure, three models are used in speech recognition to do the match.

An acoustic model contains acoustic properties for each senone. There are context-independent models that contain properties (the most probable feature vectors for each phone) and context-dependent ones (built from senones with context).

A phonetic dictionary contains a mapping from words to phones. This mapping is not very effective; for example, only two to three pronunciation variants are noted in it. It is practical enough most of the time, however. The dictionary is not the only way to map words to phones: the mapping could also be done with some complex function learned with a machine-learning algorithm.

A language model is used to restrict the word search. It defines which word could follow previously recognized words (remember that matching is a sequential process) and helps to significantly restrict the matching process by stripping out words that are not probable. The most common language models are n-gram language models, which contain statistics of word sequences, and finite state language models, which define speech sequences by finite state automata, sometimes with weights. To reach a good accuracy rate, your language model must be very successful at restricting the search space. This means it should be very good at predicting the next word. A language model usually restricts the vocabulary considered to the words it contains. That is an issue for name recognition. To deal with this, a language model can contain smaller chunks such as subwords or even phones. Note that the search space restriction in this case is usually worse, and the corresponding recognition accuracy is lower than with a word-based language model.
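To make the idea of an n-gram language model concrete, here is a minimal bigram sketch. The tiny corpus, the `<s>` start marker, and the function names are assumptions made for illustration; real language models are trained on far more text and use smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count how often each word follows each preceding word."""
    follow_counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()   # <s> marks the sentence start
        for prev, nxt in zip(words, words[1:]):
            follow_counts[prev][nxt] += 1
    return follow_counts

def next_word_probabilities(model, prev):
    """Maximum-likelihood estimate of P(next | prev); unseen words get no entry."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}

# Hypothetical training sentences.
corpus = ["call john smith", "call mary", "call john now", "text mary"]
model = train_bigram_model(corpus)

# Only words seen after "call" remain candidates, which restricts the search.
probs = next_word_probabilities(model, "call")
print(probs)
```

Words that never followed "call" in the corpus get no probability at all here, which shows both the search-space restriction and the vocabulary limitation mentioned above.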

Those three entities are combined together in an engine to recognize speech. If you are going to apply your engine to some other language, you need to get such structures in place. For many languages there are acoustic models, phonetic dictionaries and even large vocabulary language models available for download.

## Overview of an HMM-based Speech Recognition System

Sphinx-4 is an HMM-based speech recognizer. HMM stands for Hidden Markov Model, which is a type of statistical model. In HMM-based speech recognizers, each unit of sound (usually called a phoneme) is represented by a statistical model that represents the distribution of all the evidence (data) for that phoneme. This is called the acoustic model for that phoneme. When creating an acoustic model, the speech signals are first transformed into a sequence of vectors that represent certain characteristics of the signal, and the parameters of the acoustic model are then estimated using these vectors (usually called features). This process is called training the acoustic models.

During speech recognition, features are derived from the incoming speech (we will use "speech" to mean the same thing as "audio") in the same way as in the training process. The component of the recognizer that generates these features is called the front end. These live features are scored against the acoustic model. The score obtained indicates how likely it is that a particular set of features (extracted from live audio) belongs to the phoneme of the corresponding acoustic model.

The process of speech recognition is to find the best possible sequence of words (or units) that will fit the given input speech. It is a search problem, and in the case of HMM-based recognizers, a graph search problem. The graph represents all possible sequences of phonemes in the entire language of the task under consideration. The graph is typically composed of the HMMs of sound units concatenated in a guided manner, as specified by the grammar of the task. As an example, let's look at a simple search graph that decodes the words "one" and "two". It is composed of the HMMs of the sound units of the words "one" and "two".

Constructing this graph requires knowledge from various sources. It requires a dictionary, which maps the word "one" to the phonemes W, AX and N, and the word "two" to T and OO. It requires the acoustic model to obtain the HMMs for the phonemes W, AX, N, T and OO. In Sphinx-4, the task of constructing this search graph is done by the linguist.

Usually, the search graph also has information about how likely certain words are to occur. This information is supplied by the language model. Suppose that, in our example, the probability of someone saying "one" (e.g., 0.8) is much higher than that of saying "two" (0.2). Then, in this graph, the probability of the transition between the entry node and the first node of the HMM for W will be 0.8, while the probability of the transition between the entry node and the first node of the HMM for T will be 0.2. The path to "one" will consequently have a higher score.
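The "one"/"two" example can be sketched in a few lines. The dictionary and the 0.8/0.2 word probabilities come from the text; everything else is an assumption for illustration: each phoneme is collapsed to a single hypothetical log-likelihood instead of a multi-state HMM, and the function names are not Sphinx-4 APIs.

```python
import math

# Dictionary and language-model probabilities taken from the example in the text.
DICTIONARY = {"one": ["W", "AX", "N"], "two": ["T", "OO"]}
WORD_PROBABILITY = {"one": 0.8, "two": 0.2}

def path_log_score(word, phoneme_log_likelihood):
    """Log score of the path for `word`: entry transition plus acoustic scores."""
    score = math.log(WORD_PROBABILITY[word])       # entry node -> first phoneme HMM
    for phoneme in DICTIONARY[word]:
        score += phoneme_log_likelihood[phoneme]   # hypothetical acoustic score
    return score

def best_word(phoneme_log_likelihood):
    """Return the word whose path has the highest log score."""
    return max(DICTIONARY, key=lambda w: path_log_score(w, phoneme_log_likelihood))

# Hypothetical acoustic evidence that favours neither word:
# the language model alone decides.
neutral = {p: 0.0 for phones in DICTIONARY.values() for p in phones}
print(best_word(neutral))

# Hypothetical evidence that matches T and OO well and W, AX, N poorly:
# the acoustic scores outweigh the language-model preference.
sounds_like_two = {"W": -5.0, "AX": -5.0, "N": -5.0, "T": -0.1, "OO": -0.1}
print(best_word(sounds_like_two))
```

Working in log probabilities turns the product of transition and acoustic probabilities along a path into a sum, which is a common choice to avoid numerical underflow on long paths.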

Once this graph is constructed, the sequence of parameterized speech signals (i.e., the features) is matched against different paths through the graph to find the best fit. The best fit is usually the least cost or highest scoring path, depending on the implementation. In Sphinx-4, the task of searching through the graph for the best path is done by the search manager.

Many of the nodes in the graph have self transitions. This can lead to a very large number of possible paths through the graph. As a result, finding the best possible path can take a very long time. The purpose of the pruner is to reduce the number of possible paths during the search, using heuristics like pruning away the lowest scoring paths.

As we described earlier, the input speech signal is transformed into a sequence of feature vectors. After the last feature vector is decoded, we look at all the paths that have reached the final exit node (the red node in the graph). The path with the highest score is the best fit, and a result taking all the words of that path is returned.

## Sphinx-4 Architecture and Main Components

In this section, we describe the main components of Sphinx-4, and how they work together during the recognition process. First of all, let's look at the architecture diagram of Sphinx-4. It contains almost all the concepts (the words in red) that were introduced in the previous section. There are a few additional concepts in the diagram, which we will explain promptly.

When the recognizer starts up, it constructs the front end (which generates features from speech), the decoder, and the linguist (which generates the search graph) according to the configuration specified by the user. These components will in turn construct their own subcomponents. For example, the linguist will construct the acoustic model, the dictionary, and the language model. It will use the knowledge from these three components to construct a search graph that is appropriate for the task. The decoder will construct the search manager, which in turn constructs the scorer, the pruner, and the active list.

Most of these components are interfaces. The search manager, linguist, acoustic model, dictionary, language model, active list, scorer, pruner, and search graph are all Java interfaces. There can be different implementations of these interfaces. For example, there are two different implementations of the search manager. Then, how does the system know which implementation to use? It is specified by the user via the configuration file, an XML-based file that is loaded by the configuration manager. In this configuration file, the user can ...
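The interplay of scorer, pruner and active list described above can be illustrated with a small Viterbi search that prunes low-scoring paths after every frame. This is a sketch under stated assumptions, not the Sphinx-4 implementation: the two-state model, the emission table and the `beam` parameter are hypothetical, and the code is written in Python rather than Java for brevity.

```python
import math

def viterbi_beam(observations, states, start_logp, trans_logp, emit_logp, beam=5.0):
    """Viterbi search that, from the second frame on, keeps only paths whose
    log score is within `beam` of the best one. Returns (log score, path)."""
    # Active list: state -> (log score, state path so far) for the current frame.
    active = {}
    for s in states:
        if s in start_logp:
            active[s] = (start_logp[s] + emit_logp(s, observations[0]), [s])
    for obs in observations[1:]:
        candidates = {}
        for prev, (score, path) in active.items():
            for nxt, t in trans_logp.get(prev, {}).items():
                new_score = score + t + emit_logp(nxt, obs)   # scoring step
                # Keep only the best path arriving at each state.
                if nxt not in candidates or new_score > candidates[nxt][0]:
                    candidates[nxt] = (new_score, path + [nxt])
        best = max(score for score, _ in candidates.values())
        # Pruning step: drop every path that falls too far behind the best one.
        active = {s: v for s, v in candidates.items() if v[0] >= best - beam}
    return max(active.values(), key=lambda v: v[0])

# Hypothetical left-to-right model with self transitions on both states.
STATES = ["A", "B"]
START = {"A": 0.0}
TRANS = {"A": {"A": math.log(0.5), "B": math.log(0.5)}, "B": {"B": 0.0}}
EMIT = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.1, "y": 0.9}}

def emit(state, obs):
    """Log probability of observing `obs` in `state` (toy lookup table)."""
    return math.log(EMIT[state][obs])

score, path = viterbi_beam(["x", "x", "y", "y"], STATES, START, TRANS, emit)
print(path)
```

A narrower beam discards more paths per frame and so runs faster, at the risk of pruning the path that would have turned out best; in this toy model the result is the same with `beam=1.0`.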