The Connection Between the Multi-armed Bandit Problem and Reinforcement Learning


Excerpted from "Reinforcement Learning: An Introduction", 2nd edition (2016 draft), Chapter 2

 https://webdocs.cs.ualberta.ca/~sutton/book/bookdraft2016sep.pdf

 

The book's introduction leads into Chapter 2 as follows:

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been intensively studied by mathematicians for many decades (see chapter 2). For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.

One of the challenges of reinforcement learning, absent from other kinds of learning, is handling the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent tends to prefer actions that it has tried in the past and found effective in producing reward. But to discover such actions, it must try actions it has not selected before. In other words, the agent must on the one hand exploit the knowledge it already has in order to obtain reward, and on the other hand actively explore so that it can make better action selections in the future. The dilemma is that pursuing either exploration or exploitation exclusively leads to failure at the task. The agent should therefore try a variety of actions while progressively favoring those that currently appear best. On a stochastic task, each action must be tried many times to obtain a reliable estimate of its expected reward. The exploration-exploitation dilemma has been studied intensively by mathematicians for decades (see Chapter 2). For now, we simply note that the problem of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.
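As a concrete, purely illustrative sketch of this balance, the ε-greedy rule that Chapter 2 develops explores with a small probability ε and otherwise exploits the action with the highest current value estimate. The function name and the default ε below are my own choices, not the book's:

```python
import random

def epsilon_greedy(value_estimates, epsilon=0.1):
    """Choose an action index from the current value estimates.

    With probability epsilon, explore (pick an action uniformly at random);
    otherwise exploit (pick an action with the highest estimate, breaking
    ties at random).
    """
    if random.random() < epsilon:
        return random.randrange(len(value_estimates))
    best = max(value_estimates)
    return random.choice([a for a, q in enumerate(value_estimates) if q == best])
```

With ε = 0 the agent only exploits and can lock onto a suboptimal action forever; with ε = 1 it never uses what it has learned. The interesting behavior lies in between.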

 

Chapter 2 itself opens as follows:

The most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions. This is what creates the need for active exploration, for an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how good the action taken is, but not whether it is the best or the worst action possible. Purely instructive feedback, on the other hand, indicates the correct action to take, independently of the action actually taken. This kind of feedback is the basis of supervised learning, which includes large parts of pattern classification, artificial neural networks, and system identification. In their pure forms, these two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of the action taken. There are also interesting intermediate cases in which evaluation and instruction blend together.

Reinforcement learning differs from other kinds of learning in that its training information evaluates the actions taken (the reward obtained by taking them) rather than instructing by giving the correct actions. This creates the need for active exploration, an explicit trial-and-error search for good behavior. Purely evaluative feedback indicates how good the action taken was, but not whether it was the best or the worst action possible. Purely instructive feedback, on the other hand, indicates only the correct action to take, independently of the action actually taken; this kind of feedback is the basis of supervised learning. The two kinds of feedback are quite distinct: evaluative feedback depends entirely on the action taken, whereas instructive feedback is independent of it. There are also interesting intermediate cases in which the two blend together.

In this chapter we study the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. This nonassociative setting is the one in which most prior work involving evaluative feedback has been done, and it avoids much of the complexity of the full reinforcement learning problem. Studying this case will enable us to see most clearly how evaluative feedback differs from, and yet can be combined with, instructive feedback.

This chapter studies the evaluative aspect of reinforcement learning in a simplified setting, one that does not involve learning to act in more than one situation. Most prior work on evaluative feedback has been done in this nonassociative setting, which avoids much of the complexity of the full reinforcement learning problem. Studying this case helps us see how evaluative feedback differs from, and yet can be combined with, instructive feedback.

The particular nonassociative, evaluative feedback problem that we explore is a simple version of the k-armed bandit problem. We can use this problem to introduce a number of basic learning methods which we extend in later chapters to apply to the full reinforcement learning problem. At the end of this chapter, we take a step closer to the full reinforcement learning problem by discussing what happens when the bandit problem becomes associative, that is, when actions are taken in more than one situation. 

The particular nonassociative, evaluative feedback problem we explore is a simple version of the k-armed bandit problem. This problem is used to introduce a number of basic learning methods that later chapters extend to the full reinforcement learning problem. At the end of the chapter, the bandit problem is extended to the associative case, in which actions are taken in more than one situation, bringing us a step closer to the full reinforcement learning problem.
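To make the learning loop concrete, here is a minimal, self-contained simulation of a stationary k-armed bandit with sample-average value estimates and ε-greedy action selection, in the spirit of the simple action-value methods Chapter 2 introduces. The Gaussian reward setup loosely mirrors the book's 10-armed testbed, but the specific parameters (k = 10, 1000 steps, ε = 0.1) and names are arbitrary choices for illustration:

```python
import random

def run_bandit(k=10, steps=1000, epsilon=0.1):
    """Simulate one run of a stationary k-armed bandit with an
    epsilon-greedy agent using incremental sample-average estimates."""
    # True expected reward of each arm (unknown to the agent).
    true_values = [random.gauss(0.0, 1.0) for _ in range(k)]
    estimates = [0.0] * k  # Q(a): estimated value of each arm
    counts = [0] * k       # N(a): number of times each arm was pulled
    total_reward = 0.0

    for _ in range(steps):
        # Explore with probability epsilon, otherwise exploit.
        if random.random() < epsilon:
            action = random.randrange(k)
        else:
            action = max(range(k), key=lambda a: estimates[a])
        # Observe a noisy reward centered on the chosen arm's true value.
        reward = random.gauss(true_values[action], 1.0)
        total_reward += reward
        # Incremental sample-average update: Q <- Q + (R - Q) / N.
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]

    return total_reward / steps

if __name__ == "__main__":
    print("average reward per step:", run_bandit())
```

Averaging such runs over many randomly generated bandit problems is roughly how the book compares greedy and ε-greedy behavior: a larger ε finds the best arm sooner but keeps wasting a fraction of pulls on clearly inferior arms.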

 

Summary:

The multi-armed bandit problem (also called the k-armed bandit problem) is not the full reinforcement learning problem but a simplified version of it. The book therefore uses the bandit problem as a lead-in to reinforcement learning, and many concepts in reinforcement learning are extensions of concepts introduced here.

