NIPS 2017: Learning to Run

Reinforcement learning environments with musculoskeletal models


Completed · 2154 Submissions · 631 Participants · 88626 Views

How are the hopper models working?

Posted by PickleRick almost 2 years ago

If you look at the current top submission by user jyu, instead of running the agent looks more as if it is hopping like a kangaroo. How is that achieved? Are there any constraints to put in the training model / reward function that lead to such agents? My guess is that learning the forces to apply for only one of the legs and replicating them for the other leg would lead to such models.
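As a rough illustration of that guess, a "mirror one leg" policy might look like the sketch below. It assumes the environment's 18-dim action vector lists the 9 muscle excitations of one leg followed by the same 9 muscles of the other leg; that layout, and the `mirrored_action` helper, are assumptions for illustration and should be checked against the osim-rl documentation.

```python
import numpy as np

def mirrored_action(half_action):
    """Expand a 9-dim per-leg action into a full 18-dim symmetric action.

    Assumes the environment orders the action as [leg A muscles, leg B muscles];
    both legs then receive identical excitations, which tends toward a hopping gait.
    """
    half_action = np.asarray(half_action, dtype=np.float64)
    assert half_action.shape == (9,)
    full = np.concatenate([half_action, half_action])
    return np.clip(full, 0.0, 1.0)  # muscle excitations live in [0, 1]

# Example: the policy only outputs 9 numbers; both legs get the same command.
print(mirrored_action(np.random.uniform(0.0, 1.0, size=9)))
```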

Posted by spMohanty almost 2 years ago

My experience has been that there is a pretty interesting correlation between the “type” of policy learnt by the agent, and the approach used for policy search. I have a strong hunch @jyu used TRPO, as I get similar hopper policies using TRPO and no “reward-engineering” at all.
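To make the "reward-engineering" idea concrete, here is a hypothetical sketch of the kind of shaping one could try in order to discourage hopping: a wrapper that penalizes near-identical excitations for the two legs. The gym-style reset/step interface, the 9+9 action layout, and the `AntiHopRewardWrapper` name are all assumptions for illustration, not part of the official osim-rl API.

```python
import numpy as np

class AntiHopRewardWrapper:
    """Wraps an environment and subtracts a penalty when both legs are
    driven with nearly identical muscle excitations (i.e. a hopping gait)."""

    def __init__(self, env, penalty_scale=0.05):
        self.env = env
        self.penalty_scale = penalty_scale

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        left, right = np.split(np.asarray(action, dtype=np.float64), 2)
        # symmetry is close to 1 when the two halves of the action match.
        symmetry = 1.0 - np.mean(np.abs(left - right))
        shaped_reward = reward - self.penalty_scale * symmetry
        return obs, shaped_reward, done, info
```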

Posted by QinYongliang almost 2 years ago

TRPO is a policy-search method. The search is done via a line search rather than random exploration. Therefore, since the environment has a pretty stable initial condition (the agent always starts with both legs aligned), it tends to produce policies like those shown in jyu's and spMohanty's submissions.
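For concreteness, the "line search" here refers to the backtracking step in TRPO-style updates: the candidate step from the conjugate-gradient solve is shrunk geometrically until it improves the surrogate objective while keeping the mean KL divergence inside the trust region. A minimal sketch (with placeholder `surrogate` and `mean_kl` callables standing in for the real estimators) could look like:

```python
import numpy as np

def trpo_line_search(theta, full_step, surrogate, mean_kl,
                     max_kl=0.01, backtrack_ratio=0.8, max_backtracks=15):
    """Shrink the proposed step until it improves the surrogate objective
    while keeping the KL divergence within the trust region; otherwise
    fall back to the old parameters."""
    base = surrogate(theta)
    for i in range(max_backtracks):
        candidate = theta + (backtrack_ratio ** i) * full_step
        if surrogate(candidate) > base and mean_kl(theta, candidate) <= max_kl:
            return candidate
    return theta

# Toy check with a quadratic surrogate and a squared-distance "KL".
theta0 = np.zeros(3)
step = np.array([0.5, 0.5, 0.5])  # pretend this came from conjugate gradient
new_theta = trpo_line_search(
    theta0, step,
    surrogate=lambda th: -np.sum((th - 0.1) ** 2),
    mean_kl=lambda a, b: float(np.sum((a - b) ** 2)),
)
print(new_theta)
```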

Posted by QinYongliang almost 2 years ago

TRPO guarantees monotonic improvement across training iterations; it is less related to human reinforcement learning and more to numerical optimization. As of now, we still can't be sure that a hopping policy will eventually beat all those running policies.
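For reference, the monotonic-improvement guarantee comes from the lower bound in the TRPO paper (Schulman et al., 2015): maximizing the surrogate minus a KL penalty at every update ensures the true expected return never decreases.

```latex
\eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi}) \;-\; C\, D_{\mathrm{KL}}^{\max}(\pi, \tilde{\pi}),
\qquad C = \frac{4\epsilon\gamma}{(1-\gamma)^{2}},
\quad \epsilon = \max_{s,a}\,\lvert A_{\pi}(s,a) \rvert
```

Here \(\eta\) is the expected discounted return, \(L_{\pi}\) is the local surrogate built from the old policy's advantages \(A_{\pi}\), and \(\gamma\) is the discount factor.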

There’s one detail I’ve mentioned before on GitHub: since the environment keeps the agent from tilting sideways by restricting the body to the x-y plane, it might actually be better for the agent to hop than to run. Compared to the DoFs of DeepMind’s walking agents, our competition’s model actually has only a subset of them.