{"id":58,"date":"2015-12-19T19:40:04","date_gmt":"2015-12-19T17:40:04","guid":{"rendered":"https:\/\/demo.cocobasic.com\/colorius-wp\/?p=58"},"modified":"2023-01-01T17:34:38","modified_gmt":"2023-01-01T15:34:38","slug":"globular-star-cluster-radio-scope-great-turbulent-clouds","status":"publish","type":"post","link":"https:\/\/nail.cs.ut.ee\/index.php\/2015\/12\/19\/globular-star-cluster-radio-scope-great-turbulent-clouds\/","title":{"rendered":"Demystifying deep reinforcement learning"},"content":{"rendered":"<p style=\"font-weight: 400;\">This is the part 1 of my series on deep reinforcement learning. See part 2 \u201c<a style=\"font-style: inherit; font-weight: inherit;\" href=\"http:\/\/neuro.cs.ut.ee\/deep-reinforcement-learning-with-neon\/\">Deep Reinforcement Learning with Neon<\/a>\u201d for an actual implementation with Neon deep learning toolkit.<\/p>\n<p style=\"font-weight: 400;\">Today, exactly two years ago, a small company in London called DeepMind uploaded their pioneering paper \u201c<a style=\"font-style: inherit; font-weight: inherit;\" href=\"http:\/\/arxiv.org\/abs\/1312.5602\">Playing Atari with Deep Reinforcement Learning<\/a>\u201d to Arxiv. In this paper they demonstrated how a computer learned to play Atari 2600 video games by observing just the screen pixels and receiving a reward when the game score increased. The result was remarkable, because the games and the goals in every game were very different and designed to be challenging for humans. 
The same model architecture, without any change, was used to learn seven different games, and in three of them the algorithm performed even better than a human!<\/p>\n<p style=\"font-weight: 400;\">It has been hailed since then as the first step towards\u00a0<a style=\"font-style: inherit; font-weight: inherit;\" href=\"https:\/\/en.wikipedia.org\/wiki\/Artificial_general_intelligence\">general artificial intelligence<\/a>\u00a0\u2013 an AI that can survive in a variety of environments, instead of being confined to strict realms such as playing chess. No wonder\u00a0<a style=\"font-style: inherit; font-weight: inherit;\" href=\"http:\/\/techcrunch.com\/2014\/01\/26\/google-deepmind\/\">DeepMind was immediately bought by Google<\/a>\u00a0and has been at the forefront of deep learning research ever since. In February 2015 their paper \u201c<a style=\"font-style: inherit; font-weight: inherit;\" href=\"http:\/\/www.nature.com\/articles\/nature14236\">Human-level control through deep reinforcement learning<\/a>\u201d was featured on the cover of Nature, one of the most prestigious journals in science. In this paper they applied the same model to 49 different games and achieved superhuman performance in half of them.<\/p>\n<p style=\"font-weight: 400;\">Still, while deep models for supervised and unsupervised learning have seen widespread adoption in the community, deep reinforcement learning has remained a bit of a mystery. In this blog post I will try to demystify this technique and explain the rationale behind it. 
The intended audience is someone who already has a background in machine learning, and possibly in neural networks, but hasn\u2019t had time to delve into reinforcement learning yet.<\/p>\n<p style=\"font-weight: 400;\">The roadmap ahead:<\/p>\n<ol style=\"font-weight: 400;\">\n<li style=\"font-style: inherit; font-weight: inherit;\"><b><strong style=\"font-style: inherit;\">What are the main challenges in reinforcement learning?<\/strong><\/b>\u00a0We will cover the credit assignment problem and the exploration-exploitation dilemma here.<\/li>\n<li style=\"font-style: inherit; font-weight: inherit;\"><b><strong style=\"font-style: inherit;\">How to formalize reinforcement learning in mathematical terms?<\/strong><\/b>\u00a0We will define the Markov Decision Process and use it for reasoning about reinforcement learning.<\/li>\n<li style=\"font-style: inherit; font-weight: inherit;\"><b><strong style=\"font-style: inherit;\">How do we form long-term strategies?<\/strong><\/b>\u00a0We define the \u201cdiscounted future reward\u201d, which forms the basis for the algorithms in the next sections.<\/li>\n<li style=\"font-style: inherit; font-weight: inherit;\"><b><strong style=\"font-style: inherit;\">How can we estimate or approximate the future reward?<\/strong><\/b>\u00a0A simple table-based Q-learning algorithm is defined and explained here.<\/li>\n<li style=\"font-style: inherit; font-weight: inherit;\"><b><strong style=\"font-style: inherit;\">What if our state space is too big?<\/strong><\/b>\u00a0Here we see how the Q-table can be replaced with a (deep) neural network.<\/li>\n<li style=\"font-style: inherit; font-weight: inherit;\"><b><strong style=\"font-style: inherit;\">What do we need to make it actually work?<\/strong><\/b>\u00a0We will discuss the experience replay technique, which stabilizes learning with neural networks.<\/li>\n<li style=\"font-style: inherit; font-weight: inherit;\"><b><strong style=\"font-style: inherit;\">Are we done 
yet?<\/strong><\/b>\u00a0Finally, we will consider some simple solutions to the exploration-exploitation problem.<\/li>\n<\/ol>\n<h2>Reinforcement Learning<\/h2>\n<p dir=\"ltr\">Consider the game Breakout. In this game you control a paddle at the bottom of the screen and have to bounce the ball back to clear all the bricks in the upper half of the screen. Each time you hit a brick, it disappears and your score increases \u2013 you get a reward.<\/p>\n\n\n<figure class=\"wp-block-image size-full is-style-default\"><img loading=\"lazy\" decoding=\"async\" width=\"836\" height=\"260\" src=\"https:\/\/nail.cs.ut.ee\/wp-content\/uploads\/2023\/01\/breakout_deepmind.png\" alt=\"Figure 1: Atari Breakout game. Image credit: DeepMind.\" class=\"wp-image-588\" srcset=\"https:\/\/nail.cs.ut.ee\/wp-content\/uploads\/2023\/01\/breakout_deepmind.png 836w, https:\/\/nail.cs.ut.ee\/wp-content\/uploads\/2023\/01\/breakout_deepmind-300x93.png 300w, https:\/\/nail.cs.ut.ee\/wp-content\/uploads\/2023\/01\/breakout_deepmind-768x239.png 768w\" sizes=\"auto, (max-width: 836px) 100vw, 836px\" \/><figcaption class=\"wp-element-caption\">Figure 1:\u00a0Atari Breakout game. Image credit: DeepMind.<\/figcaption><\/figure>\n\n\n<p><\/p>\n<p>Suppose you want to teach a neural network to play this game. The input to your network would be screen images, and the output would be one of three actions: left, right or fire (to launch the ball). It would make sense to treat it as a classification problem \u2013 for each game screen you have to decide whether you should move left, right or press fire. Sounds straightforward? Sure, but then you need training examples, and lots of them. Of course you could go and record game sessions using expert players, but that\u2019s not really how we learn. We don\u2019t need somebody to tell us a million times which move to choose at each screen. 
We just need occasional feedback that we did the right thing and can then figure out everything else ourselves.<\/p>\n<p>This is the task\u00a0<strong>reinforcement learning<\/strong>\u00a0tries to solve. Reinforcement learning lies somewhere in between supervised and unsupervised learning. Whereas in supervised learning one has a target label for each training example and in unsupervised learning one has no labels at all, in reinforcement learning one has\u00a0sparse\u00a0and\u00a0time-delayed\u00a0labels \u2013 the rewards. Based only on those rewards the agent has to learn to behave in the environment.<\/p>\n<p>While the idea is quite intuitive, in practice there are numerous challenges. For example, when you hit a brick and score a reward in the Breakout game, it often has nothing to do with the actions (paddle movements) you took just before getting the reward. All the hard work was already done when you positioned the paddle correctly and bounced the ball back. This is called the\u00a0<strong>credit assignment problem<\/strong>\u00a0\u2013 i.e., which of the preceding actions were responsible for getting the reward, and to what extent.<\/p>\n<p>Once you have figured out a strategy to collect a certain number of rewards, should you stick with it or experiment with something that could result in even bigger rewards? In the above Breakout game a simple strategy is to move to the left edge and wait there. When launched, the ball tends to fly left more often than right, and you will easily score about 10 points before you die. Will you be satisfied with this or do you want more? This is called the\u00a0<strong>explore-exploit dilemma<\/strong>\u00a0\u2013 should you exploit the known working strategy or explore other, possibly better strategies?<\/p>\n<p> <\/p>\n<p>Reinforcement learning is an important model of how we (and all animals in general) learn. Praise from our parents, grades in school, salary at work \u2013 these are all examples of rewards. 
Credit assignment problems and exploration-exploitation dilemmas come up every day both in business and in relationships. That\u2019s why it is important to study this problem, and games form a wonderful sandbox for trying out new approaches.<\/p>\n<p> <\/p>\n<h2>Markov Decision Process<\/h2>\n<p>Now the question is: how do you formalize a reinforcement learning problem so that you can reason about it? The most common method is to represent it as a Markov decision process.<\/p>\n<p> <\/p>\n<p>Suppose you are an\u00a0<strong>agent<\/strong>, situated in an\u00a0<strong>environment<\/strong>\u00a0(e.g. the Breakout game). The environment is in a certain\u00a0<strong>state<\/strong>\u00a0(e.g. the location of the paddle, the location and direction of the ball, the existence of every brick and so on). The agent can perform certain\u00a0<strong>actions<\/strong>\u00a0in the environment (e.g. move the paddle to the left or to the right). These actions sometimes result in a\u00a0<strong>reward<\/strong>\u00a0(e.g. an increase in score). Actions transform the environment and lead to a new state, where the agent can perform another action, and so on. The rules for how you choose those actions are called the\u00a0<strong>policy<\/strong>. The environment in general is stochastic, which means the next state may be somewhat random (e.g. when you lose a ball and launch a new one, it goes in a random direction).<\/p>\n<p>&#8230; <a href=\"https:\/\/neuro.cs.ut.ee\/demystifying-deep-reinforcement-learning\/\">continue reading<\/a><\/p>\n<p><\/p>","protected":false},"excerpt":{"rendered":"<p>This is part 1 of my series on deep reinforcement learning. 
See part 2 \u201cDeep Reinforcement Learning with Neon\u201d &#8230;<\/p>\n","protected":false},"author":2,"featured_media":591,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[36],"tags":[27,33,30],"class_list":["post-58","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-educational","tag-deep-learning","tag-deepmind","tag-reinforcement-learning"],"_links":{"self":[{"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/posts\/58","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/comments?post=58"}],"version-history":[{"count":1,"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/posts\/58\/revisions"}],"predecessor-version":[{"id":594,"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/posts\/58\/revisions\/594"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/media\/591"}],"wp:attachment":[{"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/media?parent=58"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/categories?post=58"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/nail.cs.ut.ee\/index.php\/wp-json\/wp\/v2\/tags?post=58"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}