Real-World Humanoid Locomotion
with Reinforcement Learning

Authors contributed equally and are listed in alphabetical order

University of California, Berkeley

Humanoid robots that can autonomously operate in diverse environments have the potential to help address labor shortages in factories, assist the elderly at home, and colonize new planets. While classical controllers for humanoid robots have shown impressive results in a number of settings, they are challenging to generalize and adapt to new environments. Here, we present a fully learning-based approach for real-world humanoid locomotion. Our controller is a causal Transformer that takes the history of proprioceptive observations and actions as input and predicts the next action. We hypothesize that the observation-action history contains useful information about the world that a powerful Transformer model can use to adapt its behavior in context, without updating its weights. We train our model with large-scale model-free reinforcement learning on an ensemble of randomized environments in simulation and deploy it to the real world zero-shot. Our controller can walk over various outdoor terrains, is robust to external disturbances, and can adapt in context.

Learning Humanoid Locomotion

We present a learning-based approach for humanoid locomotion. Our controller is a Transformer that predicts future actions autoregressively from the history of past observations and actions. We hypothesize that the history of observations and actions implicitly encodes information about the world that a powerful Transformer model can use to adapt its behavior dynamically at test time. For example, the model can use the history of desired vs. actual states to figure out how to adjust its actions to better achieve future states. This can be seen as a form of in-context learning—changing model behavior without updating the model parameters.

Our controller is a causal Transformer trained by autoregressive prediction of future actions from the observation-action history.
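As an illustrative sketch of this idea, the toy code below interleaves observation and action embeddings into a single token sequence and applies one causally masked attention layer, reading the next action off the final token. All names, dimensions, and weights here are placeholders chosen for the example, not the actual model architecture:

```python
import numpy as np

def causal_attention(x):
    # x: (T, d) token sequence. Each token may attend only to itself and
    # earlier tokens, enforced by masking future positions with -inf.
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[future] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def predict_action(obs_history, act_history, W_obs, W_act, W_head):
    # obs_history: t observation vectors; act_history: t-1 past actions.
    # Interleave them as (o_1, a_1, o_2, a_2, ..., o_t), attend causally,
    # and decode the next action from the last (current-observation) token.
    tokens = [W_obs @ o for o in obs_history]
    for i, a in enumerate(act_history):
        tokens.insert(2 * i + 1, W_act @ a)
    h = causal_attention(np.stack(tokens))
    return W_head @ h[-1]
```

Because the final token attends to the entire observation-action history, the predicted action can depend on how past actions actually played out, which is the mechanism behind the in-context adaptation discussed below.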

Massively Parallel Training in Simulation

Our model is trained with large-scale model-free reinforcement learning (RL) on an ensemble of randomized environments in simulation. We leverage fast GPU simulation powered by IsaacGym and parallelize training across multiple GPUs and thousands of environments. Thanks to this, we are able to collect a large number of samples for training (on the order of 10 billion in about a day).
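The sketch below is a minimal 1-D stand-in for this setup: every environment copy receives its own randomized physics parameters, and all copies advance in one batched update, which is what makes thousands of parallel environments cheap on a GPU. The class, parameter ranges, and dynamics are purely illustrative, not the paper's simulator:

```python
import numpy as np

class VecRandomizedEnv:
    """Toy vectorized environment: each of num_envs copies gets its own
    randomized parameters (here, ground friction and a mass scale), mimicking
    an ensemble of randomized simulation environments."""

    def __init__(self, num_envs, seed=0):
        rng = np.random.default_rng(seed)
        self.num_envs = num_envs
        self.friction = rng.uniform(0.4, 1.1, size=num_envs)
        self.mass_scale = rng.uniform(0.8, 1.2, size=num_envs)
        self.state = np.zeros(num_envs)

    def step(self, actions):
        # All environments advance together in one vectorized update.
        self.state += self.friction * actions / self.mass_scale
        reward = -np.abs(self.state - 1.0)  # track a target "velocity" of 1
        return self.state.copy(), reward
```

With a few thousand such copies stepping in lockstep, every simulator step yields thousands of training samples, which is how sample counts on the order of billions per day become feasible.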

We train our policies on various terrain types, including planes, rough planes, and smooth slopes. Our robots execute a variety of randomly sampled walking commands such as walking forward, sideways, turning, or a combination thereof.
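Such command randomization can be sketched as sampling a desired forward velocity, lateral velocity, and turning rate per environment; the ranges below are made up for illustration and are not the paper's actual values:

```python
import numpy as np

def sample_commands(num_envs, rng):
    # One (vx, vy, wz) command per environment. Ranges are illustrative.
    vx = rng.uniform(-0.5, 1.0, num_envs)   # forward velocity (m/s)
    vy = rng.uniform(-0.3, 0.3, num_envs)   # sideways velocity (m/s)
    wz = rng.uniform(-0.5, 0.5, num_envs)   # yaw rate (rad/s)
    return np.stack([vx, vy, wz], axis=1)
```

Because the three components are sampled independently, combinations such as walking forward while turning arise naturally during training.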

Real-World Deployment

We find that our policies trained entirely in simulation are able to transfer to the real world zero-shot. We deploy our controller to a number of outdoor environments. These include plazas, walkways, sidewalks, running tracks, and grass fields. The terrains vary considerably in terms of material properties, like concrete, rubber, and grass, as well as conditions, like dry or wet.

Walking in different outdoor environments

Omnidirectional Walking

Our controller is able to accurately follow a range of velocity commands to perform omnidirectional locomotion, including walking forward, backward, and turning.

Walking backward and turning

Dynamic Arm Swing

We find that our approach leads to emergent human-like dynamic arm swing behaviors in coordination with leg movements, i.e., a contralateral relationship between the arms and the legs.

Walking with a dynamic arm swing

In-Context Adaptation

We study the ability of our controller to recover from foot-trapping, which occurs when one of the robot's legs hits a discrete step obstacle (left). Note that steps or other forms of discrete obstacles were not seen during training. This setting is relevant since our robot is blind and may find itself in such situations during deployment. We find that our controller is still able to detect and react to foot-trapping events based on the history of observations and actions. Specifically, after hitting the step with its leg, the robot will attempt to lift its legs higher and faster on subsequent attempts.

We command the robot to walk forward over a terrain with three sections: flat, downward slope, and flat again (right). We observe that our controller adapts its behavior based on the terrain, changing its gait from natural walking on the flat section, to small steps on the downward slope, to natural walking on the flat section again. This behavior is emergent and was not pre-specified during training.

Recovery from foot-trapping (left) and gait change based on terrain type (right)

External Disturbances

Finally, we test the robustness of our policies to sudden external forces. These experiments include pushing the robot with a wooden stick (left) and throwing a large yoga ball at the robot (right). We find that our controller is able to stabilize the robot in both of these scenarios.

Robustness to external disturbances


Citation

@article{radosavovic2023humanoid,
  title={Real-World Humanoid Locomotion with Reinforcement Learning},
  author={Ilija Radosavovic and Tete Xiao and Bike Zhang and Trevor Darrell and Jitendra Malik and Koushil Sreenath},
}