Oleksiy Mnyshenko


PROPOSAL REFINEMENT
Understanding Rule Learning Dynamics as an ALGORITHM

PRIOR
Given a finite action space A, a prior θ is chosen such that θ(a) ≥ 0 for every a ∈ A and Σ θ(a) = 1. If the prior places positive probability on only a subset A′ ⊂ A, then some positive probability has to be placed on the remaining actions A \ A′ so that these actions are not initially excluded. Therefore let a total probability mass ε be spread over A \ A′, where ε is suitably small. Since the original probabilities already summed to one, the probabilities in the prior θ need to be readjusted so that Σ θ(a) = 1. Also note that the value of the utility function u(a) is available for the actions in A′ [most likely the utilities have to be consistent with the probabilities placed on each a ∈ A′]. Given the above ramifications, several questions arise.
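A minimal sketch of this prior construction in Python (the number of actions, the initially supported set A′, the probabilities on it, the value of ε, and the uniform spread are all illustrative assumptions, not taken from the proposal):

```python
import numpy as np

def build_prior(n_actions, supported, base_probs, eps=0.05):
    """Place the original probabilities on the supported actions, spread a small
    total mass eps uniformly over the remaining actions, then renormalize."""
    theta = np.zeros(n_actions)
    theta[supported] = base_probs                      # actions with positive prior probability
    unsupported = np.setdiff1d(np.arange(n_actions), supported)
    theta[unsupported] = eps / len(unsupported)        # no action is initially excluded
    return theta / theta.sum()                         # readjust so the prior sums to one

# Example: 10 actions, original prior supported on actions 2 and 7.
theta = build_prior(10, supported=[2, 7], base_probs=[0.6, 0.4], eps=0.05)
```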

_____________________________________________________________________________________
QUESTIONS:
1. How to choose ε?

2. Is it sufficient to distribute the probability mass ε uniformly over the actions not included in the original prior?

3. Is it possible to spread ε in a smoother manner, to prevent our prior from looking like a field of one-point peaks over a uniform distribution on the rest of the action space? [We may apply the intuition that a greater portion of ε should be allocated to actions that are closer to the actions with positive probability in the original prior; a sketch follows these questions.]

4. Is the following observation correct? In order to perform an update of the potential function in accordance with the equation v(a, t+1) = φ·v(a, t) + r(a, t; a_t), we need some value for the utility of any action over which we spread ε, so that reinforcement can be performed when such an action is chosen: r(a_t, t; a_t) = u(a_t, t).
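A hedged sketch of the smoother allocation raised in question 3: weight the ε mass by a Gaussian kernel of the distance to the nearest originally supported action, so the added mass decays smoothly instead of being flat. The bandwidth h and the one-dimensional action grid are assumptions made only for illustration:

```python
import numpy as np

def build_smooth_prior(actions, supported_idx, base_probs, eps=0.05, h=0.1):
    """Allocate the mass eps over unsupported actions in proportion to a Gaussian
    kernel of their distance to the closest originally supported action."""
    theta = np.zeros(len(actions))
    theta[supported_idx] = base_probs
    unsupported = np.setdiff1d(np.arange(len(actions)), supported_idx)
    # distance from each unsupported action to its nearest supported action
    dists = np.min(np.abs(actions[unsupported, None] - actions[supported_idx][None, :]), axis=1)
    weights = np.exp(-0.5 * (dists / h) ** 2)          # closer actions receive a larger share of eps
    theta[unsupported] = eps * weights / weights.sum()
    return theta / theta.sum()

actions = np.linspace(0.0, 1.0, 11)                    # illustrative one-dimensional action grid
theta = build_smooth_prior(actions, supported_idx=[2, 7], base_probs=[0.6, 0.4])
```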

UPDATING potential function and TRANSITION PROBABILITIES

Once the first action a_t is chosen from the prior, the potential function is updated. Definition: the similarity function is a normalized Gaussian kernel with bandwidth b. Updating therefore proceeds in the following way: ∀ a ∈ A, v(a, t+1) = φ·v(a, t) + r(a, t; a_t), where the reinforcement r(a, t; a_t) is the similarity-weighted utility of the chosen action. We update the potential function in the “neighborhood” of a_t; note that the definition of the neighborhood depends on the choice of similarity function, and since we will be using normalized Gaussian kernels the value of the potential function is updated for every action in the action space A. Given the new updated values, (t+1) is now the current period, so we replace it by t, and the next action is chosen from the probability distribution induced by the current values of the potential function. Thus we can choose the next action. We have just completed ONE ITERATION of the algorithm!

_____________________________________________________________________________________
QUESTIONS:
1. What are the strategies for choosing the parameter b (bandwidth)? Is it a static parameter, or can it be dynamically adjusted based on data generated by the algorithm? Extra reasoning: maybe keeping b static is fine, because on average actions with known utilities will be chosen, and therefore updated values will come from these actions. The domains within which updating occurs can be restricted by using pyramidal Gaussians.
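A sketch of one iteration under stated assumptions: the similarity function is a normalized Gaussian kernel with bandwidth b, the reinforcement is the similarity-weighted utility of the chosen action, and, since the proposal leaves the choice formula unstated, next-action probabilities are assumed proportional to the (positive) potential values. The decay parameter phi, the utility function, and the action grid are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kernel(actions, a_t, b):
    """Normalized Gaussian similarity centered on the chosen action a_t (bandwidth b)."""
    k = np.exp(-0.5 * ((actions - a_t) / b) ** 2)
    return k / k.sum()

def one_iteration(actions, v, utility, b=0.05, phi=0.9):
    """Draw an action from the current distribution, reinforce its neighborhood,
    and return the updated potentials: v(a, t+1) = phi * v(a, t) + r(a, t; a_t)."""
    probs = v / v.sum()                                # assumed: choice probability proportional to potential
    i_t = rng.choice(len(actions), p=probs)            # a_t drawn from the current distribution on A
    a_t = actions[i_t]
    r = gaussian_kernel(actions, a_t, b) * utility(a_t)  # reinforcement spread over the neighborhood of a_t
    return phi * v + r, a_t

actions = np.linspace(0.0, 1.0, 101)
v = np.full(len(actions), 0.1)                         # initial potentials consistent with a near-uniform prior
for _ in range(500):
    v, a_t = one_iteration(actions, v, utility=lambda a: 1.0 - (a - 0.7) ** 2)
```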

MARKOV PROCESS

The process of choosing an action from the current probability distribution, based on the current values of the potential function, and then updating the potential function to arrive at a new probability distribution on the action space A, can be modeled as a Markov process. Every state is characterized by the current values of the potential function and the action around which the last update occurred. [Question: if we were to run the computation, how would we record the value of the utility function for our action space?] Transition probabilities: P(a_i → a_j), the probability of transition from the current state, in which the last update occurred around a_i, to a new state in which the update of the potential function occurs around a_j. Note that the above Markov chain is inhomogeneous, since the transition probabilities depend on the last action that was used to update the potential function on A. Since initially the prior is such that θ(a) > 0 for every a ∈ A, there is always a positive probability of randomly choosing any action at any t. Therefore the whole state space consists of one recurrent class.

The above process may get stuck in a local maximum. This can be avoided by using a perturbed Markov process such that:
1. with probability (1 − δ) the next state is chosen in accordance with the transition probabilities, and
2. with probability δ a mistake is made, such that the next action around which the update will be performed is chosen randomly from a uniform distribution over the action space,
meaning that the above Markov process is irreducible. Irreducibility of the state space has the potential to yield interesting results related to convergence.
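A hedged sketch of the perturbed transition rule above, with the mistake probability δ written as delta; the numerical values and the example transition probabilities are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def perturbed_choice(transition_probs, delta=0.05):
    """With probability 1 - delta follow the chain's transition probabilities;
    with probability delta make a 'mistake' and pick uniformly over the action
    space, which keeps every state reachable (irreducibility)."""
    n = len(transition_probs)
    if rng.random() < delta:
        return int(rng.integers(n))                    # uniform tremble over A
    return int(rng.choice(n, p=transition_probs))      # regular transition of the chain

probs = np.array([0.7, 0.2, 0.1])                      # illustrative transition probabilities
next_action = perturbed_choice(probs)
```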

CONVERGENCE

Our goal is to demonstrate that the statement below, which holds for the extreme case of counterfactual thinking, is also true under the reinforcement learning algorithm outlined above. Formally, regardless of the recently undertaken action, the distribution over actions converges to one under which the utility of the undertaken action is maximized. In other words, we want to demonstrate that the action undertaken in the limit is the utility-maximizing one. This assumption may appear a nuisance, since if such counterfactual thinking were possible the agent would choose the action with maximum utility the first time he has to decide. The question of whether the agent knows all of his available actions may arise as well.

_____________________________________________________________________________________
QUESTIONS:
1. Aiding convergence through specifying the similarity function in a way that allows the kernel to have smaller variance where the utility function is sensitive to small variations in actions, and greater bandwidth where changes in payoff across a neighborhood of an action are small.
2. Need to investigate other ways of specifying the similarity function.
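One illustrative way to make question 1 concrete: tie a local bandwidth to the magnitude of the payoff's local slope, so the kernel is narrow where utility is sensitive to small changes in the action and wide where the payoff surface is flat. The function names, the bounds b_min and b_max, and the gradient-based rule are assumptions, not part of the proposal:

```python
import numpy as np

def adaptive_bandwidths(actions, utilities, b_min=0.02, b_max=0.3):
    """Smaller bandwidth where |du/da| is large (utility sensitive to the action),
    larger bandwidth where the payoff surface is locally flat."""
    slope = np.abs(np.gradient(utilities, actions))    # local sensitivity of the payoff
    sensitivity = slope / (slope.max() + 1e-12)        # rescale to [0, 1]
    return b_max - (b_max - b_min) * sensitivity       # high sensitivity -> small bandwidth

actions = np.linspace(0.0, 1.0, 101)
b = adaptive_bandwidths(actions, utilities=np.sin(6 * actions))
```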

READINGS

Topics:

Mathematics:
1. Infinite and inhomogeneous Markov chains.
2. Kernel density estimation / constructing a similarity function that aids convergence.
3. Bayesian convergence theorem.

Economics:
1. Law of effect.
2. Learning behavior (books or articles that will help me put such highly focused research in context with other overarching topics).
3. Josef Hofbauer and Karl Sigmund, The Theory of Evolution and Dynamical Systems (Cambridge University Press).

Research (methodology):
Becker, Tricks of the Trade: How to Think about Your Research While You're Doing It (University of Chicago Press, 1998).
Booth, Colomb, and Williams, The Craft of Research (University of Chicago Press, 2003).
Turabian, A Manual for Writers of Research Papers, Theses, and Dissertations, Seventh Edition: Chicago Style for Students and Researchers (Chicago: University of Chicago Press, 2007). Get the 7th edition!
Zerubavel, The Clockwork Muse: A Practical Guide to Writing Theses, Dissertations, and Books (Harvard University Press, 1999).
