Notes on Chapter 21: Reinforcement Learning

Reinforcement Learning

reinforcement learning is a very different approach to learning than supervised and unsupervised learning

a reinforcement learner is able to perform actions in an environment, and gets rewards or penalties for its actions

the goal of a reinforcement learner is to maximize the rewards it gets

in some complex domains, reinforcement learning is the only practical way to learn, because labeled training examples can be hard to find

reinforcement learning is usually modeled as a Markov decision process (MDP)

you can think of an MDP as a graph, where the nodes are states of the world, and the edges are actions that go between states

if the agent is in state s and does action a, then P(s,a,s’) is the probability that the agent ends up in state s’

this allows for the possibility that an action might fail to do what the agent intended

Reward(s,a,s’) is the immediate reward the agent receives after going from s to s’ by action a
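
as a rough illustration of P(s,a,s’) and Reward(s,a,s’), here is a minimal Python sketch of a tiny MDP stored as a table. The states ("hall", "exit"), the actions ("forward", "wait"), and all the numbers are made up for this example, and the `transitions` layout and the helper names `P`, `reward`, and `step` are just one possible way to organize it:

```python
import random

# Hypothetical two-state MDP (all names and numbers are assumptions).
# transitions[state][action] is a list of (next_state, probability, reward).
transitions = {
    "hall": {
        # "forward" usually reaches the exit, but sometimes fails
        "forward": [("exit", 0.8, 10.0), ("hall", 0.2, -1.0)],
        "wait": [("hall", 1.0, -1.0)],
    },
    "exit": {},  # terminal state: no actions available
}

def P(s, a, s_next):
    """Probability of ending up in s_next after doing action a in state s."""
    return sum(p for (t, p, _) in transitions[s][a] if t == s_next)

def reward(s, a, s_next):
    """Immediate reward for going from s to s_next by action a."""
    for t, _, r in transitions[s][a]:
        if t == s_next:
            return r
    raise ValueError("transition not possible")

def step(s, a, rng=random):
    """Sample one outcome of doing action a in state s."""
    outcomes = transitions[s][a]
    weights = [p for (_, p, _) in outcomes]
    s_next, _, r = rng.choices(outcomes, weights=weights, k=1)[0]
    return s_next, r
```

calling `step("hall", "forward")` returns either `("exit", 10.0)` or `("hall", -1.0)`, which is the "action might fail" case described above.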

the agent chooses what action to do using a policy

a learning agent updates its policy after getting reward feedback
the exact structure of a policy depends upon the implementation of the learner
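
one possible structure, shown here only as a sketch, is a table of estimated values for each (state, action) pair, with a policy that picks the best-looking action. The update rule below is Q-learning, a standard technique that these notes do not otherwise cover; it is swapped in as an example of "updating after reward feedback". The names `q`, `policy`, and `update`, and the parameter values `alpha=0.5` and `gamma=0.9`, are assumptions for the example:

```python
from collections import defaultdict

# q[(state, action)] is the agent's current estimate of how good it is to
# do `action` in `state`; unseen pairs start at 0.0.
q = defaultdict(float)

def policy(state, actions):
    """Greedy policy: choose the action with the highest current estimate."""
    return max(actions, key=lambda a: q[(state, a)])

def update(state, action, reward, next_state, next_actions,
           alpha=0.5, gamma=0.9):
    """Q-learning update: nudge the estimate toward the reward just received
    plus the discounted value of the best action in the next state."""
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
```

for instance, after `update("s", "right", 1.0, "t", [])` the estimate for ("s", "right") moves from 0.0 to 0.5, so `policy("s", ["left", "right"])` now prefers "right".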

see Figure 21.1 (p. 832) for a good example of how to think about these ideas in terms of a maze-solving problem

reinforcement learners learn by interacting with their environment, and so they usually need to do some exploration, i.e. try out actions that are not necessarily the best ones in an effort to learn more about the environment (and thus get a higher overall reward from later actions)

for example, suppose you are going to eat out at a restaurant, and you could go to either a familiar restaurant with food you know you like, or visit a just-opened restaurant you’ve never been to before
- going to the familiar restaurant will likely get you a high reward, because you like their food
- but discovering new food can be good, especially if you end up liking it, and so sometimes it is worthwhile taking a chance and visiting the new restaurant
- the balance between exploration and exploitation has been formalized into multi-armed bandit problems
  - imagine a slot machine with two different levers (A and B) you can pull
  - each lever wins with some unknown probability
  - if you have $100, and it costs $1 per pull, then what should be your strategy for pulling levers if you want to maximize the number of wins?
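
one simple strategy for the two-lever problem above is epsilon-greedy: mostly pull the lever that has won most often so far, but with a small probability pull a random lever instead. The sketch below simulates it; it is not claimed to be the best strategy. The win probabilities 0.3 and 0.6 are made up and are hidden from the strategy itself (only the simulator uses them), and `play_bandit`, `epsilon`, and `seed` are names chosen for this example:

```python
import random

def play_bandit(win_probs, budget=100, epsilon=0.1, seed=0):
    """Spend `budget` pulls using an epsilon-greedy strategy; return
    (total wins, number of pulls per lever)."""
    rng = random.Random(seed)
    n = len(win_probs)
    pulls = [0] * n
    wins = [0] * n
    for _ in range(budget):
        untried = [i for i in range(n) if pulls[i] == 0]
        if untried:
            lever = untried[0]            # try each lever at least once
        elif rng.random() < epsilon:
            lever = rng.randrange(n)      # explore: pick a random lever
        else:
            # exploit: pick the lever with the best observed win rate
            lever = max(range(n), key=lambda i: wins[i] / pulls[i])
        pulls[lever] += 1
        # the simulator, not the strategy, looks at the true probability
        if rng.random() < win_probs[lever]:
            wins[lever] += 1
    return sum(wins), pulls

# $100 at $1 per pull, with hypothetical win probabilities for A and B
total_wins, pulls = play_bandit([0.3, 0.6])
```

setting `epsilon=0` gives pure exploitation after the first two pulls, and `epsilon=1` gives pure exploration, so comparing a few values of `epsilon` is an easy way to see the trade-off.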

one application of reinforcement learning that has had some big successes is game playing

in 1992, TD-Gammon became a world-class backgammon playing program using a reinforcement learning technique called temporal difference learning
more recently, AlphaZero has used reinforcement learning to play chess, shogi, and Go at world champion levels

Other Kinds of Learning

some AI researchers are interested in directly using knowledge in learning

one way to do that is to frame learning as a logic problem, and to consider how logical predicates might be learned from examples

one interesting technique for doing this is inductive logic programming (ILP), where the learning agent is given positive and negative examples of a logical predicate, and learns a logic program that describes it

logic programs can be run as regular computer programs, and so ILP can be thought of as learning programs

see chapter 19 of the textbook if you are interested in more information on this approach

it is still research-oriented, and has not achieved the same success as example-based learning