
NIPS 2017: Learning to Run

Reinforcement learning environments with musculoskeletal models


Completed · 2154 submissions · 630 participants · 92467 views

Tutorial: Getting the skeleton to walk

Posted by QinYongliang over 2 years ago

36 days to go. If you still can't get your skeleton to walk, this might help.

1) To walk successfully, you train a neural network that maps states to actions so as to maximize the expected future reward. If you have already learned a little bit about deep reinforcement learning, you know what I'm talking about. You can use an out-of-the-box DRL algorithm for this task, such as DDPG.

Problem is, the observation vector (41 dimensions) is just an OBSERVATION, not a STATE. A state should be Markovian, meaning that from a state you must be able to predict what happens next, or estimate what reward you will get, given certain actions; that is basically what your learning algorithm, say DDPG, is trying to do.

The observation vector is not Markovian. For example, since the velocity of the agent's head is not in the observation vector, the agent cannot tell which way its head is currently leaning, and therefore cannot choose the right muscle actions to balance itself. As a result, the agent will walk poorly.

To make the observation vector Markovian, you can either:

a) calculate velocities of all the body parts, v = dx/dt = (x_now - x_last)/0.01 (0.01 s is the simulation timestep), add them to your observation vector, and feed that to your learning algorithm (a sketch of this is given right after this list); or

b) concatenate two consecutive observations into one and use that as your observation vector, then expect the neural network to do the dx/dt calculation for you.
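A minimal sketch of option a), assuming the NIPS 2017 osim-rl RunEnv interface and its 0.01 s timestep; the wrapper class name and details are my own, not part of osim-rl:

    import numpy as np
    from osim.env import RunEnv  # NIPS 2017 osim-rl environment

    class VelocityAugmentedEnv(object):
        """Wrap RunEnv and append finite-difference velocities to the observation."""
        def __init__(self, visualize=False, dt=0.01):
            self.env = RunEnv(visualize=visualize)
            self.dt = dt
            self.last_obs = None

        def _augment(self, obs):
            obs = np.asarray(obs, dtype=np.float32)
            if self.last_obs is None:
                vel = np.zeros_like(obs)               # no history on the very first frame
            else:
                vel = (obs - self.last_obs) / self.dt  # v = dx/dt = (x_now - x_last)/0.01
            self.last_obs = obs
            return np.concatenate([obs, vel])          # 41 + 41 = 82 dimensions

        def reset(self):
            self.last_obs = None
            return self._augment(self.env.reset())

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            return self._augment(obs), reward, done, info

You would then hand this wrapped environment to your agent instead of the raw RunEnv (and adjust the network's input size accordingly).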

Simply choose option a), set your DDPG algorithm's gamma to 0.995 (an effective horizon of roughly 1/(1 - 0.995) = 200 steps, i.e. about 2 seconds of future reward at 100 fps) and run your agent for ~10000 episodes; that should result in a not-too-bad score.

2) When running at 100 fps you don't necessarily have to sample at 100 fps. At 100 fps your experience replay memory fills up very quickly, and because the observations are taken at 100 fps they are highly correlated with each other. Sampling experiences at a lower rate, say 50 fps or 33 fps, will improve the agent's ability to learn, since your memory then contains experiences with higher variety and less correlation.

3) In DDPG, for example, the agent performs one training update per action, which at 100 fps means 100 updates per second of simulation. You might be interested in training more often than that, e.g. two updates per environment step ("training at 200 fps"), as sketched below.
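As a rough illustration, here is a sketch of an episode loop that does two gradient updates per environment step; the agent's act/remember/train_once methods are hypothetical names, not any particular library's API:

    def run_episode(env, agent, updates_per_step=2):
        """Run one episode, doing `updates_per_step` gradient updates per env step,
        i.e. "training at 200 fps" while the simulation still runs at 100 fps."""
        obs = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = agent.act(obs)                         # pick muscle activations
            next_obs, reward, done, info = env.step(action)
            agent.remember(obs, action, reward, next_obs, done)
            for _ in range(updates_per_step):               # 2 updates per step instead of 1
                agent.train_once()                          # one minibatch gradient update
            obs = next_obs
            total_reward += reward
        return total_reward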

4) When designing the network topology, keep in mind that:

Your objective is to maximize an "expected future reward" by learning from past experience. Since that past experience changes over time, your learning objective, from the neural network's perspective, also changes over time. Your network must therefore suit this adaptive process: it has to stay "malleable", or "retrainable". In other words, if your network cannot be retrained for another purpose once it has been trained for one purpose, it should not be used here.

It is thus recommended to use a shallow and wide topology rather than a deep and narrow one. Don't waste time on residual connections. Don't use tanh or sigmoid units (except at the output layer).

For rectifiers, I personally use LeakyReLU, because it doesn't kill gradients the way plain ReLU can.
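For example, a shallow and wide actor network in Keras, with LeakyReLU hidden units and a sigmoid only at the output (the 18 muscle excitations live in [0, 1]); the layer sizes here are just a guess for illustration:

    from keras.models import Sequential
    from keras.layers import Dense, LeakyReLU

    def build_actor(obs_dim=82, act_dim=18):
        """Shallow, wide actor: two hidden layers, LeakyReLU activations,
        sigmoid only at the output layer."""
        model = Sequential()
        model.add(Dense(256, input_shape=(obs_dim,)))
        model.add(LeakyReLU(alpha=0.2))
        model.add(Dense(256))
        model.add(LeakyReLU(alpha=0.2))
        model.add(Dense(act_dim, activation='sigmoid'))
        return model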

5) You will soon realize that you simply can't run the environment for 10000 episodes, because:

a) Memory leaks will destroy your environment. Solution: start RunEnv() in a separate process and talk to that process (a minimal sketch follows this list); after ~100 episodes, kill the process and start a new one. Don't start RunEnv() in a separate thread: the RunEnv() calls would share the same OpenSim backend instance, so you would get no protection against the memory leaks.

b) The environment runs at roughly 1 to 3 fps, so training takes forever. Solution: start 16 RunEnv() processes as described above and run them in parallel, sampling actions and rewards in parallel. Put a threading.Lock on the agent's memory to prevent conflicts. This requires some knowledge of operating-system threads, processes and locks; I suggest the corresponding UC Berkeley course on YouTube.

c) You don't have 16 cores. Solution: start the RunEnv() processes on multiple AWS instances and train against them via RPC. This requires some knowledge of network communication; Udacity, Coursera or YouTube is the place to go. Python users should try the Pyro4 library.
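A minimal sketch of the process-isolation idea from a), using Python's multiprocessing; the command protocol is made up for illustration (a fuller example is linked a few posts below):

    from multiprocessing import Process, Pipe

    def env_worker(conn):
        """Runs in a child process, so OpenSim's memory leaks die with the process."""
        from osim.env import RunEnv      # import inside the child process
        env = RunEnv(visualize=False)
        while True:
            cmd, data = conn.recv()
            if cmd == 'reset':
                conn.send(env.reset())
            elif cmd == 'step':
                conn.send(env.step(data))    # (obs, reward, done, info)
            elif cmd == 'close':
                conn.close()
                break

    class RemoteEnv(object):
        """Parent-side handle; call recycle() every ~100 episodes to kill the leaky process."""
        def __init__(self):
            self._start()

        def _start(self):
            self.conn, child_conn = Pipe()
            self.proc = Process(target=env_worker, args=(child_conn,))
            self.proc.start()

        def reset(self):
            self.conn.send(('reset', None))
            return self.conn.recv()

        def step(self, action):
            self.conn.send(('step', list(action)))
            return self.conn.recv()

        def recycle(self):
            self.conn.send(('close', None))
            self.proc.join()
            self._start()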

I'm currently in China, so I bought 16 servers on Aliyun (the Chinese equivalent of AWS). This way I managed to train my agent to an average score of 16.9 points.

Good luck, everyone.


Posted by M.Lehmann  over 2 years ago |  Quote

Thanks for sharing!

Posted by QinYongliang  over 2 years ago |  Quote

6) And of course, noise. In DRL there are two families of algorithms, deterministic and probabilistic, designed to deal with deterministic and probabilistic environments respectively. If the space of actions an agent can take consists of a discrete set of choices, and the agent has to pick one of them at every step, then the environment is probabilistic.

For a probabilistic environment, you can train a neural network to fit the distribution of reward over the choices, given the state; then you can sample actions according to that distribution, given a new state. DQN is one such algorithm, and it's one of the easiest DRL algorithms to understand. Deep Policy Gradient is even simpler than DQN, but DQN works much better thanks to its cleverer action-sampling method.

But this "learning to run" task is deterministic and continuous, meaning that instead of making discrete choices you have to output a list of real values as your action at every step. You can still fit a distribution of reward over a continuum of choices, for example with a mixture of Gaussians, and you can of course sample from such a distribution. Doing so results in the CDQN algorithm, i.e. continuous-space DQN.

While CDQN is cool, it's not the simplest option, and it requires extensive knowledge of probability; the simplest deterministic algorithm is DDPG, or Deep Deterministic Policy Gradient.

In DQN we randomly sample the choices from a multinoulli distribution as a means of exploration, because you have to make mistakes first, then learn from them. In a deterministic algorithm like DDPG, exploration is achieved by adding noise to the action values. Details are in the DDPG paper.

So if you choose DDPG, the noise is a very important hyperparameter. The more noise we add, the more different situations our agent will run into, but the agent will also fail more often. When learning stops making progress, it is very likely that your exploration noise is too low. Turn it up a bit and see whether things improve.
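A sketch of exploration noise added to the action values, assuming actions in [0, 1]; the DDPG paper uses an Ornstein-Uhlenbeck process, and sigma is the knob to turn up when learning stalls:

    import numpy as np

    class OUNoise(object):
        """Ornstein-Uhlenbeck process (zero mean); sigma controls the noise scale."""
        def __init__(self, size, theta=0.15, sigma=0.2, dt=0.01):
            self.size, self.theta, self.sigma, self.dt = size, theta, sigma, dt
            self.reset()

        def reset(self):
            self.state = np.zeros(self.size)

        def sample(self):
            dx = (-self.theta * self.state * self.dt
                  + self.sigma * np.sqrt(self.dt) * np.random.randn(self.size))
            self.state = self.state + dx
            return self.state

    # usage: add noise to the actor's output and clip to the valid excitation range
    noise = OUNoise(size=18, sigma=0.3)
    actor_output = np.full(18, 0.5)   # stand-in for the policy network's prediction
    action = np.clip(actor_output + noise.sample(), 0.0, 1.0)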


Posted by QinYongliang  over 2 years ago |  Quote

7) If you didn't get the AWS credit because you didn't join the competition early enough, and you are also not willing to pay for cloud services, there is still hope for you. I just stopped all my servers on Aliyun (they are expensive). At home I have two PCs, one with an i7-3610QM and one with a Pentium E5300, plus an i5-based Retina MacBook Pro. That is 8+2+4 cores in total, so running in parallel they are only slightly slower than my 16-server (one core each) rig. So if you really want to walk like a human, find every piece of hardware around you that is capable of running osim-rl and connect them together. Most people here should have no difficulty getting themselves 2 or 3 PCs (just ask friends or parents).

Posted by QinYongliang  over 2 years ago |  Quote

8) Windows users might complain that osim-rl doesn't work with Python 3 on Windows yet.

I spent a great deal of time trying to make osim-rl runnable on Windows with Python 3. It requires Visual C++ compilation and (possibly) changes to the source code, so please don't waste your time like I did.

Instead, you should start the environment in Python 2.7, preferably in a conda virtual environment, then write your code in Python 3 and use RPC to talk to the environment rather than running both in the same process.
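A rough sketch of that split using Pyro4; the class name, object id and port are my own choices, and observations are converted to plain lists so the default serializer can handle them:

    # env_server.py -- run with the Python 2.7 conda environment that has osim-rl
    import Pyro4
    from osim.env import RunEnv

    @Pyro4.expose
    class EnvService(object):
        def __init__(self):
            self.env = RunEnv(visualize=False)
        def reset(self):
            return list(self.env.reset())
        def step(self, action):
            obs, reward, done, info = self.env.step(list(action))
            return list(obs), float(reward), bool(done), {}

    daemon = Pyro4.Daemon(host="0.0.0.0", port=9090)
    print(daemon.register(EnvService(), objectId="env"))
    daemon.requestLoop()

And on the Python 3 side (possibly another machine; replace 127.0.0.1 accordingly):

    import Pyro4

    env = Pyro4.Proxy("PYRO:env@127.0.0.1:9090")
    obs = env.reset()
    obs, reward, done, info = env.step([0.0] * 18)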

Posted by PickleRick  over 2 years ago |  Quote

Thanks for the detailed tutorial. Can you please expand on how something like a DDPG agent (from keras-rl) can be used with multiple parallel RunEnv()s?

Posted by QinYongliang  over 2 years ago |  Quote

> Thanks for the detailed tutorial. Can you please expand on how something like a DDPG agent (from keras-rl) can be used with multiple parallel RunEnv()s?

All those agents have a 'play' method (it may go by another name) that takes an environment, calls env.reset(), and then steps through the episode with env.step(). It also samples observations, stores them in the agent's memory, and performs training updates on the neural network.

You can create multiple threads and ask each of them to 1) obtain an environment, for example by starting one in a separate process, and 2) call the agent’s ‘play’ method with that environment as a parameter.

This method is quite straightforward if you coded DDPG yourself in TF, Theano or Torch instead of using an existing package like keras-rl. I believe most people here rely on keras-rl. From an engineering perspective, a black box equals failure, so make your choices carefully.
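For completeness, a sketch of that threading pattern; agent.play() stands for whatever per-episode loop your agent exposes, and the agent's replay memory is assumed to be guarded by a threading.Lock internally:

    import threading

    def train_parallel(agent, make_env, n_workers=16, episodes_per_worker=100):
        """Each thread gets its own environment (e.g. a RunEnv in its own process,
        or an RPC proxy to a remote machine) and drives the shared agent through it."""
        def worker():
            env = make_env()
            for _ in range(episodes_per_worker):
                agent.play(env)   # reset/step through one episode, storing experience

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()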


Posted by QinYongliang  over 2 years ago |  Quote

For multiprocessing, I wrote a detailed example here:

https://github.com/stanfordnmbl/osim-rl/issues/58

You should also read every issue about osim-rl on GitHub to avoid all kinds of problems.

Posted by Shubhi  over 2 years ago |  Quote

Thanks for sharing. "When running at 100 fps you don't necessarily sample at 100 fps" -> How do I change the sampling from 100 fps to 50 fps? I am using example.py and don't see any place where this is being set.

Posted by QinYongliang  over 2 years ago |  Quote

> Thanks for sharing. "When running at 100 fps you don't necessarily sample at 100 fps" -> How do I change the sampling from 100 fps to 50 fps? I am using example.py and don't see any place where this is being set.

You should not change the environment itself, but you can wrap it with your own code :)
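For example, a simple wrapper that repeats each action for a few simulation frames, so your agent effectively samples at 50 fps (skip=2) or 33 fps (skip=3); the wrapper name and details are mine:

    class FrameSkipEnv(object):
        """Repeat each action `skip` times and sum the rewards, so the agent
        sees and stores transitions at 100/skip fps instead of 100 fps."""
        def __init__(self, env, skip=2):
            self.env = env
            self.skip = skip

        def reset(self):
            return self.env.reset()

        def step(self, action):
            total_reward, obs, done, info = 0.0, None, False, {}
            for _ in range(self.skip):
                obs, reward, done, info = self.env.step(action)
                total_reward += reward
                if done:
                    break
            return obs, total_reward, done, info

    # usage: env = FrameSkipEnv(RunEnv(visualize=False), skip=2)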

Posted by haipeng_chen  about 2 years ago |  Quote

Hello, Qin. I have a question: where should I add the velocities of all the body parts to the observation vector? I changed get_observation in run.py, but it didn't seem to work; the observation_space in example.py is still 41.