Deep Reinforcement Learning with Hugging Face: Part 1
Introduction to this series
The purpose of this series of articles is to reflect on my own learning and, hopefully, to help other people who, like me, are looking to get further into the wonderful world of reinforcement learning.
Apart from this somewhat utilitarian reason, I am also genuinely interested in reinforcement learning - it has always been the first thing I think of when I think about AI. This is the ‘AI’ of games, whether it be teaching an agent to play Pong, Atari games or Doom. But what, after all, is reinforcement learning, and how do we train an agent? Join me in this multipart series where I attempt to find out.
What is Reinforcement Learning?
Think of a time when you have tried to learn how to do or use something. For me, the classic example is trying to figure out how to use some new piece of technology. How do we proceed to learn how to use it, assuming we don’t have (or don’t want 😄) to read the manual?
If you are anything like me, the first thing you will try is simply to do something - anything! - and see what happens. Then, assuming the thing is not now broken, we observe the result, remember what we did and what happened, and try something else. We proceed like this until (hopefully) we understand how to use the new thing.
Summarising that, it might look something like this:

- An Agent makes an Action.
- This brings about a new State of the system.
- There is some Reward if this action brought the system closer to a desired state.
- The agent learns from this action and acts again.
These four steps are repeated iteratively until some threshold is reached where we can say the agent knows how to act to maximise the reward.
Reinforcement learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback.
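The loop above can be sketched in a few lines of Python. Everything here is a toy invented for illustration - the one-dimensional environment, the reward and the agent’s simple action-value estimates have nothing to do with the course code - but it shows the act → observe → reward → learn cycle in miniature:

```python
import random

# Toy environment: the agent starts at position 0 and wants to reach GOAL.
# Actions: -1 (move left) or +1 (move right). The reward is 1.0 whenever an
# action brings the agent closer to the goal, and 0.0 otherwise.
GOAL = 5

def step(state, action):
    """Apply an action, return the new state and a reward."""
    new_state = max(0, min(GOAL, state + action))
    reward = 1.0 if new_state > state else 0.0
    return new_state, reward

# A very simple agent: keep a running value estimate for each action and
# usually pick whichever action has earned the most reward so far.
action_values = {-1: 0.0, +1: 0.0}

random.seed(0)
for episode in range(50):
    state = 0
    for _ in range(20):                        # cap the episode length
        if random.random() < 0.2:              # explore sometimes...
            action = random.choice([-1, +1])
        else:                                  # ...otherwise exploit
            action = max(action_values, key=action_values.get)
        state, reward = step(state, action)    # act, observe state + reward
        # Learn: nudge this action's value toward the reward received
        action_values[action] += 0.1 * (reward - action_values[action])
        if state == GOAL:                      # episode done, reset
            break

print(action_values)  # moving right should end up valued higher than left
```

After a handful of episodes the agent has learned, purely from trial, error and reward, that “move right” is the action worth taking - which is the whole idea of the framework in miniature.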
What is the Hugging Face Deep Reinforcement Learning Course?
The Hugging Face Deep RL Course is written by Thomas Simonini, Omar Sanseviero and Sayak Paul. The goal is to provide a hands-on course that allows people to get started immediately and begin training agents. What I particularly like about the course is that many of the units come with Google Colab notebooks, so you can get started straight away, and with quite generous access to a T4 GPU you don’t have to rely on your own local GPU resources (or lack thereof).

The course is made up of 8 units, and completing 80% of them earns you a certificate of completion. If you push trained agents to your Hugging Face account for all units, and meet the requirements on the model scoring metrics, you can earn a certificate of excellence. Given Hugging Face’s standing in LLMs, learning agents and other deep learning work, being able to earn a certificate for free, using free GPU compute, is truly a wonderful thing. Okay, let’s jump into what I learned in Unit 1.
What I learned in intro and Unit 1
Kind of in keeping with the spirit of other quite good deep learning resources, such as the FastAI course, Unit 1 of the Intro to Deep RL jumps straight in at the deep end. After a rudimentary introduction to the topic, users train a reinforcement learning agent using an approach called PPO (Proximal Policy Optimization), which after a very short amount of training lets you generate a short video like the one at the start of this article, showing an agent navigating an environment to safely land a spaceship in the designated area.

What is so powerful to me about this approach is that all you are doing is defining a series of rewards for an agent and then basically saying ‘maximise the rewards’. The details of how, and which approaches to try, are left to the agent itself. Think about this for a moment - the agent is given an objective but is not prescribed a set of ‘ideal’ or ‘best’ approaches to achieve it. It is simply left to experiment in the environment, with rewards for landing in the fastest, most accurate way possible. If the agent hits the side rails, the game resets and the agent learns not to hit the rails. If it takes too long, the game ends and the environment resets; the agent learns to be fast.

At the end of training you push the trained agent to your Hugging Face account, which saves the hyperparameters you used to train the model, the model itself, metrics about the model and, of course, the video of the spacecraft landing. Very cool.
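To give a flavour of what “defining a series of rewards” might look like, here is a toy scoring function in the spirit of the lander task. The function, the numbers and the thresholds are all invented for this example - the real LunarLander environment computes its own, more sophisticated reward - but it captures the idea: crashes are punished, safe landings are rewarded, and faster landings are rewarded more:

```python
# Toy reward design in the spirit of the lander task. All values here are
# made up for illustration; the actual environment's reward is different.

def lander_reward(landed_in_zone, hit_rails, steps_taken, max_steps=500):
    """Score one episode: reward safe landings, punish crashes and slowness."""
    if hit_rails:
        return -100.0                  # crash: strongly negative, episode resets
    if not landed_in_zone:
        return 0.0                     # episode ended without landing
    speed_bonus = 50.0 * (1 - steps_taken / max_steps)
    return 100.0 + speed_bonus         # safe landing: faster earns more

# A quick landing beats a slow one, and both beat crashing into the rails:
fast = lander_reward(landed_in_zone=True, hit_rails=False, steps_taken=100)
slow = lander_reward(landed_in_zone=True, hit_rails=False, steps_taken=400)
crash = lander_reward(landed_in_zone=False, hit_rails=True, steps_taken=50)
print(fast > slow > crash)  # True
```

An agent maximising this score has every incentive to land safely and quickly, without ever being told how to fire its thrusters - that part it must discover for itself.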
Up Next…
In Unit 2 we will dig further into the underlying theory behind reinforcement learning and go through some of the background needed to really understand what we achieved in Unit 1. Join me as we continue learning about reinforcement learning in the next unit!