Ph# 051-9019724 Ext-2724 The Markov decision process (MDP) is a model of predicting outcomes. The model attempts to predict an outcome given only information provided by the current state. At each step during the process, the decision maker may choose to take an action available in the current state, resulting in the model moving to the next step and offering the decision maker a reward. Q-learning finds an optimal policy for maximizing the expected value of the total reward over any and all successive steps, starting from the current state to the goal state. Q-learning can identify an optimal action-selection policy for any given Finite MDP Exploitation Exploration