I've been reading about recommender systems recently and came across several basic implementation ideas for Bandit algorithms. Since I couldn't find good code implementations online, I wrote up the three classic algorithms from the articles below myself.
I won't go into the details of the algorithms here; you can refer to these blog posts:
https://blog.csdn.net/z1185196212/article/details/53374194
https://blog.csdn.net/dengxing1234/article/details/73188731
The e_greedy algorithm:
With probability 1-epsilon, exploit the arm with the highest estimated reward; with probability epsilon, pick an arm uniformly at random.
import numpy as np

T = 1000   # number of rounds
N = 10     # number of arms
true_award = np.random.uniform(0, 1, N)  # true payout probability of each arm
estimated_award = np.zeros(N)            # running mean reward per arm
item_count = np.zeros(N)                 # times each arm has been pulled
epsilon = 0.1

def e_greedy():
    explore = np.random.binomial(n=1, p=epsilon)
    if explore:
        # explore: pick an arm uniformly at random
        item = np.random.choice(N)
    else:
        # exploit: pick the arm with the highest estimated mean reward
        item = np.argmax(estimated_award)
    award = np.random.binomial(n=1, p=true_award[item])
    return item, award

total_award = 0
for t in range(T):
    item, award = e_greedy()
    total_award += award
    # incremental update of the running mean for the chosen arm
    estimated_award[item] = (item_count[item] * estimated_award[item] + award) / (item_count[item] + 1)
    item_count[item] += 1

print(true_award)
print(estimated_award)
print(total_award)
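A common refinement, not covered in the posts above, is to decay epsilon over time so the policy explores heavily at first and exploits later. Below is a minimal sketch reusing the same setup; the schedule epsilon_t = 1/sqrt(t+1) is just an illustrative assumption, not the only choice.

import numpy as np

T, N = 1000, 10
true_award = np.random.uniform(0, 1, N)
estimated_award = np.zeros(N)   # running mean reward per arm
item_count = np.zeros(N)

total_award = 0
for t in range(T):
    epsilon = 1.0 / np.sqrt(t + 1)          # decaying exploration rate (assumed schedule)
    if np.random.rand() < epsilon:
        item = np.random.choice(N)          # explore: random arm
    else:
        item = np.argmax(estimated_award)   # exploit: best arm so far
    award = np.random.binomial(n=1, p=true_award[item])
    estimated_award[item] = (item_count[item] * estimated_award[item] + award) / (item_count[item] + 1)
    item_count[item] += 1
    total_award += award

print(total_award)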
The Thompson Sampling algorithm:
For each arm, draw a random number from Beta(win[arm]+1, lose[arm]+1) (the +1 corresponds to a uniform prior), and pull the arm with the largest draw this round.
import numpy as np

T = 1000   # number of rounds
N = 10     # number of arms
true_award = np.random.uniform(0, 1, N)  # true payout probability of each arm
win = np.zeros(N)    # successes observed per arm
lose = np.zeros(N)   # failures observed per arm
estimated_award = np.zeros(N)

def Thompson_sampling():
    # sample from each arm's Beta posterior (uniform Beta(1, 1) prior)
    arm_prob = [np.random.beta(win[i] + 1, lose[i] + 1) for i in range(N)]
    item = np.argmax(arm_prob)
    reward = np.random.binomial(n=1, p=true_award[item])
    return item, reward

total_reward = 0
for t in range(T):
    item, reward = Thompson_sampling()
    if reward == 1:
        win[item] += 1
    else:
        lose[item] += 1
    total_reward += reward

for i in range(N):
    if win[i] + lose[i] > 0:   # guard against arms that were never pulled
        estimated_award[i] = win[i] / (win[i] + lose[i])

print(true_award)
print(estimated_award)
print(total_reward)
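One nice property of the +1 prior: an arm that has never been pulled has win = lose = 0, so it samples from Beta(1, 1), i.e. uniform on [0, 1], and still gets a fair chance of being explored. If you want to see how confident the posterior is about each arm, a Monte-Carlo credible interval is easy to compute; the helper below is my own sketch, not part of the original code.

import numpy as np

def beta_credible_interval(wins, losses, n_samples=10000, level=0.95):
    # Monte-Carlo credible interval for the Beta(wins+1, losses+1) posterior
    samples = np.random.beta(wins + 1, losses + 1, size=n_samples)
    lo = np.percentile(samples, 100 * (1 - level) / 2)
    hi = np.percentile(samples, 100 * (1 + level) / 2)
    return lo, hi

# example: an arm pulled 30 times with 20 wins
print(beta_credible_interval(20, 10))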
The UCB algorithm:
Keep refining the estimate as observations accumulate: use the observed probability p' plus a confidence radius delta as an upper confidence bound on the true probability p, and pull the arm with the largest bound. In the code below, delta = sqrt(2 ln t / n_item), where n_item is how many times that arm has been chosen so far.
import numpy as np

T = 1000   # number of rounds
N = 10     # number of arms
true_award = np.random.uniform(low=0, high=1, size=N)  # true payout probability of each arm
estimated_award = np.zeros(N)  # running mean reward per arm
choose_count = np.zeros(N)     # times each arm has been pulled
total_award = 0

def cal_delta(t, item):
    # confidence radius; unvisited arms get a fixed bonus so they are tried first
    if choose_count[item] == 0:
        return 1
    return np.sqrt(2 * np.log(t) / choose_count[item])

def UCB(t, N):
    upper_bound_probs = [estimated_award[item] + cal_delta(t, item) for item in range(N)]
    item = np.argmax(upper_bound_probs)
    reward = np.random.binomial(n=1, p=true_award[item])
    return item, reward

for t in range(1, T + 1):
    item, reward = UCB(t, N)
    total_award += reward
    # incremental update of the running mean for the chosen arm
    estimated_award[item] = (choose_count[item] * estimated_award[item] + reward) / (choose_count[item] + 1)
    choose_count[item] += 1

print(true_award)
print(estimated_award)
print(total_award)
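Since all three strategies share the same interface (pick an arm, observe a Bernoulli reward), they are easy to compare on one shared true_award vector. Here is a rough harness for that; the function names are my own, and on any single run the ranking can vary, though UCB and Thompson Sampling tend to accumulate more reward than fixed-epsilon greedy.

import numpy as np

def run_epsilon_greedy(true_p, T=1000, epsilon=0.1):
    N = len(true_p)
    est, cnt, total = np.zeros(N), np.zeros(N), 0
    for t in range(T):
        item = np.random.choice(N) if np.random.rand() < epsilon else np.argmax(est)
        r = np.random.binomial(1, true_p[item])
        est[item] = (cnt[item] * est[item] + r) / (cnt[item] + 1)
        cnt[item] += 1
        total += r
    return total

def run_thompson(true_p, T=1000):
    N = len(true_p)
    win, lose, total = np.zeros(N), np.zeros(N), 0
    for t in range(T):
        item = np.argmax(np.random.beta(win + 1, lose + 1))
        r = np.random.binomial(1, true_p[item])
        win[item] += r
        lose[item] += 1 - r
        total += r
    return total

def run_ucb(true_p, T=1000):
    N = len(true_p)
    est, cnt, total = np.zeros(N), np.zeros(N), 0
    for t in range(1, T + 1):
        # unvisited arms get bonus 1; np.maximum avoids division by zero
        bonus = np.where(cnt > 0, np.sqrt(2 * np.log(t) / np.maximum(cnt, 1)), 1.0)
        item = np.argmax(est + bonus)
        r = np.random.binomial(1, true_p[item])
        est[item] = (cnt[item] * est[item] + r) / (cnt[item] + 1)
        cnt[item] += 1
        total += r
    return total

true_p = np.random.uniform(0, 1, 10)
print("best possible ~", 1000 * true_p.max())
print("e-greedy :", run_epsilon_greedy(true_p))
print("thompson :", run_thompson(true_p))
print("ucb      :", run_ucb(true_p))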