Monte Carlo Methods
Introduction
Monte Carlo (MC) methods are a class of computational algorithms that rely on repeated random sampling to obtain numerical results. In the context of reinforcement learning, Monte Carlo methods are used for learning optimal policies by experiencing complete episodes of interaction with the environment. These methods do not require a complete model of the environment, making them suitable for problems where the transition probabilities and reward functions are unknown.
Key characteristics of Monte Carlo methods in RL:
Model-free: They learn directly from experience without requiring a model of the environment.
Episodic: They require complete episodes to update value functions and policies.
High variance: Due to the reliance on random sampling, they can have high variance compared to dynamic programming methods.
Simple to understand and implement: They are relatively straightforward to implement compared to other RL algorithms.
Monte Carlo Prediction
Monte Carlo prediction involves estimating the value function for a given policy by averaging the returns observed after visiting a state. The return is the total discounted reward from time step
:
Where:
is the reward received at time step
.
is the discount factor (0 ≤
≤ 1).
is the terminal time step.
The value function is estimated as the average return starting from state
:
There are two main approaches for estimating the value function:
First-visit MC: Averages the returns only for the first time each state is visited in an episode.
Every-visit MC: Averages the returns for every time the state is visited in an episode.
Monte Carlo Control
Monte Carlo control aims to find the optimal policy by iteratively improving the policy based on the estimated action values. The main steps are:
Policy evaluation: Estimate
for the current policy using Monte Carlo prediction.
Policy improvement: Improve the policy by acting greedily with respect to the estimated action values.
To ensure exploration, we can use -greedy exploration, where with probability
, a random action is selected, and with probability
, the greedy action is selected.
import random
def epsilon_greedy(state, Q, epsilon, possible_actions):
if random.random() < epsilon:
return random.choice(possible_actions)
else:
return max(possible_actions, key=lambda a: Q[(state, a)])
Note
Monte Carlo control can be implemented using either on-policy or off-policy methods.
On-Policy Monte Carlo Control
On-policy methods improve the policy that is used to collect the data. An example of on-policy Monte Carlo control is policy iteration with -greedy exploration.
Off-Policy Monte Carlo Control
Off-policy methods improve a policy that is different from the policy used to collect the data. This allows for learning from demonstration or using a more exploratory policy for data collection. An example of off-policy Monte Carlo control is importance sampling.
Importance Sampling
Importance sampling is a technique used to estimate the expected value of a function under one distribution using samples from another distribution. In off-policy Monte Carlo, importance sampling is used to correct for the difference between the behavior policy (used to generate the data) and the target policy (being evaluated and improved).
Let be the target policy and
be the behavior policy. The goal is to estimate the expected return under
using samples generated by
.
The importance sampling ratio at time is given by:
Where:
is the target policy.
is the behavior policy.
To derive the importance sampling update, we want to estimate using samples from
. We can write:
In practice, we estimate this expectation by averaging the returns weighted by the importance sampling ratio:
Where and
are the importance sampling ratio and return for the
-th episode, respectively, and
is the number of times state
has been visited.
Incremental Implementation
To efficiently compute the average return, an incremental update can be used:
Where:
is the number of times state
has been visited.
Conclusion
Monte Carlo methods provide a powerful approach for learning optimal policies in reinforcement learning, especially when a model of the environment is not available. By experiencing complete episodes and using techniques like importance sampling and incremental updates, Monte Carlo methods can effectively estimate value functions and improve policies.