- Extensions of Monte Carlo tree search (MCTS) in this tutorial
- instead of a tabular Q function, use an NN-approximated Q function for broader applicability (it can take continuous states); a sketch follows this list
- instead of planning from the root all at once and extracting the learned policy, modify the code to plan online
- the advantage is that we don’t need to visit every state and learn the whole picture; we can quickly perform suboptimal actions that still get things done
- allow a pre-trained Q to guide the tree search
- analysis of performance in a simple grid world
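- A minimal sketch of an NN-approximated Q function (PyTorch is my assumption here; the actual tutorial code may use a different framework or architecture): it maps a (possibly continuous) state vector to one Q value per action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a (possibly continuous) state vector to one Q value per action."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # A single forward pass returns Q(s, a) for every action.
        return self.net(state)
```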
Online planning with MCTS
- MCTS builds up a tree (previous notes)
- state-node layers and action-node layers alternate
- procedure (a skeleton of one iteration is sketched below)
- select a state node that is not fully expanded
- expand the selected node by trying a (random) action to get the next state node s’
- simulate from s’ until a terminal state
- backpropagate the return
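- A skeleton of one iteration of this procedure (illustrative only; the node and environment interfaces here are assumptions, not the tutorial’s exact API):

```python
import random

def mcts_iteration(root, env, gamma=0.95):
    # 1. Selection: walk down to a state node that is not fully expanded.
    node = root
    while node.is_fully_expanded() and not env.is_terminal(node.state):
        node = node.best_child()
    # 2. Expansion: try an untried action (randomly) to get the next state node.
    if not env.is_terminal(node.state):
        action = random.choice(node.untried_actions())
        next_state, reward = env.step_from(node.state, action)
        node = node.add_child(action, next_state, reward)
    # 3. Simulation: roll out from s' with random actions until a terminal state.
    ret, discount, state = 0.0, 1.0, node.state
    while not env.is_terminal(state):
        state, reward = env.step_from(state, env.random_action())
        ret += discount * reward
        discount *= gamma
    # 4. Backpropagation: push the return back up to the root.
    while node is not None:
        node.update(ret)
        node = node.parent
```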
- To do online planning and make the search more efficient, I modify the following steps (sketched after this list):
- expand with bandit algorithms instead of random actions
- simulate only up to a limited depth
- it is inefficient to always simulate to the end for every node
- to calculate the accumulated reward, use the approximated Q for the rest of the episode
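- A sketch of the two modified pieces (UCB1 is my choice of bandit rule for illustration; `q_fn` stands for any callable returning the approximated Q values of a state, and the node/env interfaces are the same assumptions as above):

```python
import math

def ucb1_action(node, c=1.4):
    # Bandit-style expansion: balance the current value estimate against an
    # exploration bonus instead of picking an action uniformly at random.
    def score(action):
        child = node.children.get(action)
        if child is None or child.visits == 0:
            return float("inf")  # always try untried actions first
        return child.mean_value() + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.actions(), key=score)

def truncated_rollout(env, q_fn, state, sim_depth, gamma=0.95):
    # Simulate only `sim_depth` steps; instead of rolling out to termination,
    # bootstrap the rest of the episode with the approximated Q.
    ret, discount = 0.0, 1.0
    for _ in range(sim_depth):
        if env.is_terminal(state):
            return ret
        state, reward = env.step_from(state, env.random_action())
        ret += discount * reward
        discount *= gamma
    return ret + discount * max(q_fn(state))  # bootstrap: best Q value of the last state
```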
Online vs. offline planning in terms of time and success rate
- test case: grid world (a rough sketch follows this list)
- start at the bottom-left corner (0, 0)
- goal: reach (3, 2)
- actions: move in the four directions
- performance is measured over 5 trials
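- A rough sketch of such a grid world; the grid size, the trap location, and the reward values are my assumptions beyond what is stated above:

```python
WIDTH, HEIGHT = 4, 3                        # assumed size; just large enough to contain (3, 2)
START, GOAL, TRAP = (0, 0), (3, 2), (3, 1)  # trap location is an assumption
ACTIONS = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    dx, dy = ACTIONS[action]
    # Moving into a wall keeps the agent in place (position is clamped to the grid).
    x = min(max(state[0] + dx, 0), WIDTH - 1)
    y = min(max(state[1] + dy, 0), HEIGHT - 1)
    next_state = (x, y)
    # Sparse reward: only the goal (and, assumed, the trap) give non-zero reward.
    reward = 1.0 if next_state == GOAL else (-1.0 if next_state == TRAP else 0.0)
    return next_state, reward

def is_terminal(state):
    return state in (GOAL, TRAP)
```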
Plan offline (baseline)
- I call the original code “offline” because
- the agent is placed at the initial state without any prior knowledge (of the world or a pre-trained Q); it has to
- build the tree
- by searching from the root over and over again
- and update the Q function
- until the predefined timeout
- the policy is extracted only at the end (the loop is sketched below)
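- In code, the offline loop looks roughly like this (`make_root_node`, `update_q_from_tree`, and `extract_greedy_policy` are placeholder helpers, not the tutorial’s actual functions):

```python
import time

def plan_offline(env, q_net, start_state, timeout=2.5):
    root = make_root_node(start_state)     # placeholder helper
    t0 = time.time()
    while time.time() - t0 < timeout:
        mcts_iteration(root, env)          # search from the root again and again
        update_q_from_tree(q_net, root)    # placeholder: fit the NN Q to the tree's returns
    # The policy only becomes available (and usable) after the timeout.
    return extract_greedy_policy(q_net)    # placeholder helper
```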
- timeout: 2.5 s
- success: 1 out of 5 trials
- time spent (from building the tree to updating Q), in seconds:
- time spent 2.5152747631073
- time spent 2.6165850162506104
- time spent 2.5032870769500732
- time spent 2.503243923187256
- time spent 2.5086100101470947
- one failing example
- the first few layers of the tree
- the extracted policy can't lead the agent to the goal


- timeout: 5 s
- success: 5 out of 5 trials
- time spent, in seconds:
- time spent 5.0320940017700195
- time spent 5.042095184326172
- time spent 5.0121259689331055
- time spent 5.012312173843384
- time spent 5.088823080062866
- one successful example
- the first few layers of the tree
- the extracted policy leads the agent to the goal


- This approach takes longer to extract a consistent policy because
- it requires visiting all the states many times to train the NN-approximated Q function
- the variance is high because the training is Monte Carlo style (the targets are full returns; sketched below)
- another disadvantage: the agent cannot act until planning finishes, and the extracted policy is only useful once it is (near-)perfect
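- A sketch of what the Monte Carlo-style update amounts to (my reading of the notes; PyTorch and the `encode` featurizer are assumptions): each Q value is regressed onto the full return of an episode, which is where the high variance comes from.

```python
def mc_q_update(q_net, optimizer, episode, gamma=0.95):
    # episode: list of (state, action, reward) tuples from one full rollout
    ret = 0.0
    for state, action, reward in reversed(episode):
        ret = reward + gamma * ret              # full Monte Carlo return from (s, a)
        pred = q_net(encode(state))[action]     # encode(): assumed state featurizer
        loss = (pred - ret) ** 2                # regress Q(s, a) onto the noisy return
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```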
Plan online
- the analysis follows most of the setup from the last section; to accommodate the online fashion, the trials are performed with these variables (the loop is sketched after this list)
- tree limit: only build the tree up to a fixed depth, e.g. depth=3 (otherwise it is just offline planning)
- simulation depth: instead of a full Monte Carlo simulation, look ahead only this many steps
- pre-trained Q
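- A sketch of the online loop (the per-step search budget and the helper names are assumptions): replan from the current state with a shallow tree, act immediately on the current estimate, and repeat.

```python
import time

def plan_online(env, q_net, start_state, tree_depth=3, sim_depth=None,
                budget_per_step=0.1, max_steps=200):
    state, t0, trajectory = start_state, time.time(), [start_state]
    for _ in range(max_steps):
        if env.is_terminal(state):
            break
        root = make_root_node(state)            # placeholder helper
        deadline = time.time() + budget_per_step
        while time.time() < deadline:
            # Placeholder for one search iteration with the modifications above:
            # the tree is only grown to `tree_depth`, rollouts are truncated at
            # `sim_depth` and bootstrapped with q_net.
            search_iteration(root, env, q_net, tree_depth, sim_depth)
        action = root.best_action()             # act on the current, possibly suboptimal, estimate
        state, _ = env.step_from(state, action)
        trajectory.append(state)
    return trajectory, time.time() - t0         # states visited and total wall-clock time
```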
- online planning can give a trajectory like the following (each arrow is the action taken, followed by the resulting state)
- (0, 0) ►(1, 0) ►(2, 0) ▼(2, 0) ▼(2, 0) ◄(1, 0) ◄(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ▼(0, 0) ◄(0, 0) ◄(0, 0) ◄(0, 0) ◄(0, 0) ◄(0, 0) ►(1, 0) ◄(0, 0) ◄(0, 0) ►(1, 0) ◄(0, 0) ►(1, 0) ▲(1, 0) ◄(0, 0) ▲(0, 1) ▲(0, 2) ◄(0, 2) ◄(0, 2) ◄(0, 2) ▼(0, 1) ▲(0, 2) ▼(0, 1) ►(0, 1) ▲(0, 2) ◄(0, 2) ►(1, 2) ►(2, 2) ►goal
- what the agent does is not optimal, but planning online leads the agent to the goal earlier than the offline method, as shown below
- in real-world applications, we can constrain the robot’s motions (e.g. not performing unsafe actions such as moving toward a wall)
Plan online with tree limit
- depth = 3
- success: 5 out of 5 trials, average time 3.04 s
- time spent 0.5703649520874023
- time spent 3.430715799331665
- time spent 3.38307785987854
- time spent 5.041029930114746
- time spent 2.761559247970581
- depth = 5
- success: 5 out of 5 trials, average time 2.04 s
- time spent 0.2971813678741455
- time spent 0.7031919956207275
- time spent 7.24423885345459
- time spent 0.44618701934814453
- time spent 1.539870262145996
- conclusion
- planning online gets the agent to the goal earlier (2.04 s / 3.04 s) than the offline method (5 s)
- picking the right tree limit is crucial to balance performance against time cost
- in the worst case it can take longer than the offline method, especially for this simple grid example
- this is because the agent acts online and can keep revisiting the same few states when the Q function has not converged or is not approximated fast enough
Plan online with tree limit (depth=5) + simulation depth
- bootstrap with simulation depth=10
- success: 5 out of 5 trials, average time 1.42 s
- time spent 1.0405361652374268
- time spent 0.5547537803649902
- time spent 0.6425988674163818
- time spent 2.4945077896118164
- time spent 2.3596439361572266
- → compared to the full Monte Carlo rollout approach (2.04 s), adding a simulation-depth limit speeds up planning (1.42 s)
- bootstrap with simulation depth=5
- success: 3 out of 5 trials, average time 3.34 s
- time spent 0.5151748657226562
- time spent 1.3147289752960205
- time spent 1.042971134185791
- time spent 4.154265880584717 (fail: end in trap)
- time spent 9.689126014709473 (fail: end in trap)
- → if the simulation depth is too small, the agent takes longer and sometimes even fails, likely because short rollouts rarely reach the sparse reward in this environment, so the bootstrapped values carry little signal
- what if we use a pre-trained Q?
Plan online with tree limit (depth=5) + simulation depth (=5) + pre-trained Q
- methods: start the online planner from a pre-trained Q instead of learning it from scratch (one possible setup is sketched below)
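- The notes do not spell the method out, so here is one plausible reading (an assumption on my part): load a Q network trained on earlier episodes and use it both for the truncated-rollout bootstrap and to seed new nodes’ value estimates, so the bandit rule does not start blind.

```python
import torch

# QNetwork is the sketch from earlier; the checkpoint path and the `encode`
# featurizer are hypothetical.
pretrained = QNetwork(state_dim=2, n_actions=4)             # (x, y) state, 4 moves
pretrained.load_state_dict(torch.load("q_pretrained.pt"))   # hypothetical checkpoint
pretrained.eval()

def seed_node_value(node, q_net=pretrained):
    # Initialize a fresh node with the network's estimate instead of zero.
    with torch.no_grad():
        node.total_value = float(q_net(encode(node.state)).max())
        node.visits = 1
```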
- result
- perfect pre-trained Q
- success: 5 out of 5 trials, average time 0.90 s
- time spent 0.4416477680206299
- time spent 0.6683788299560547
- time spent 1.6460108757019043
- time spent 0.713360071182251
- time spent 1.0494542121887207
- pre-trained Q trained for only 5 episodes
- success: 5 out of 5 trials, average time 0.77 s
- time spent 0.5686788558959961
- time spent 0.54811692237854
- time spent 1.0325219631195068
- time spent 0.2939751148223877
- time spent 1.4327030181884766
- using pre-trained models, we can get the agent to the goal with faster online planning
- pre-training costs extra work and time, but here the results show that
- the pre-trained model does NOT need to be perfect in terms of the extracted policy and approximated values
- the pre-training does NOT need good experiences; any kind of experience can help later planning
- in one trial, the agent ended in the trap in all five pre-training episodes; nevertheless, online planning still got the agent to the goal faster than without any prior experience