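Recently I ran into a problem at work: jobs running on our cluster sometimes fail to finish, or fail to start properly. When that happens, the whole batch stays stuck in the pending state and never completes, which blocks the jobs queued behind it.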
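This problem had plagued our project for about half a year, and we never came up with a good solution. The main reason is that a job's status is visible only in the browser; it cannot be obtained from backend logs or by querying a database. In the browser, if a job's run time and status have not changed for a long time, we can treat it as a "zombie" job and kill it manually.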
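The manual rule above can be written as a small pure function. This is only a sketch: it assumes each look at the page has already been reduced to a dict mapping job name to the displayed run time in seconds (0 when no time is shown), and it narrows "no change" to "run time stays at zero", which is the check the script later in this post relies on. The names `find_zombie_jobs` and `snapshots` are made up for this illustration.

```python
def find_zombie_jobs(snapshots):
    """Return jobs that show a run time of 0 in every snapshot.

    snapshots: list of dicts {job_name: run_time_in_seconds}, taken some
    time apart (a hypothetical data shape, not the page's real format).
    """
    if not snapshots:
        return []
    # Jobs stuck at 0 in the first snapshot are the initial candidates.
    candidates = {name for name, secs in snapshots[0].items() if secs == 0}
    # A candidate survives only if it is still listed and still at 0 later.
    for snap in snapshots[1:]:
        candidates &= {name for name, secs in snap.items() if secs == 0}
    return sorted(candidates)


snapshots = [
    {"job_a": 0, "job_b": 12, "job_c": 0},
    {"job_a": 0, "job_b": 72, "job_c": 5},
    {"job_a": 0, "job_b": 132},
]
print(find_zombie_jobs(snapshots))  # ['job_a']
```

Requiring a job to look stuck in several observations, not just one, avoids killing a job that simply had not started counting yet when the page was first read.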
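After the Spring Festival I read some articles online about web crawlers. They mentioned a kind of crawler that simulates browser behavior (logging in, clicking, and so on) to fetch a page's data, then scrapes the page and extracts the useful information. That got me thinking: our project interacts with the browser in only a handful of ways, so the same approach could take care of the "zombie" jobs. After about a week of research and another week of on-and-off coding, the problem was finally solved. I am writing down the main ideas and the key technical difficulties here, partly to reinforce my own memory and partly to help anyone who needs it. Because the task is narrow and the implementation was rushed, the code just gets the job done: it is not optimized and does not pay much attention to coding conventions or design patterns. I will worry about those when a bigger problem comes along.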
Technical points:
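(1) The Python package selenium. With this package you can drive a browser: open one (Chrome, Firefox, etc.), log in to a site that requires authentication (enter a username and password), click a particular icon, and so on. Here are two links about selenium: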
https://www.baidu.com/link?url=tTeJRPOMKX8noXyTa2YPgpaD6vVlGQ2-RVAfwRg4Yvm&wd=&eqid=acd0879a0043c2e9000000045741cd39
http://www.cnblogs.com/fnng/archive/2013/05/29/3106515.html
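(2) PhantomJS, driven through selenium. It is a headless browser: think of it as a browser running in the background with no visible window, while otherwise behaving much like an ordinary browser (it can take screenshots, click icons, scrape page content, and so on). I used it to stand in for a real browser because a regular browser cannot be installed on our server, which can only run programs that work in a terminal.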
http://phantomjs.org/
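(3) XPath. This is where I spent the most time, for two main reasons. First, locating elements is unreliable: the page refreshes once per second, so an element obtained one second may be gone the next. Second, there are too many similar elements and the hierarchy is too complex; searching with ordinary relative paths can match elements I do not want, which made locating elements slow and laborious. Below are two introductions to XPath that are quite practical, especially for web crawling (I plan to write about crawlers separately later):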
http://www.cnblogs.com/fdszlzl/archive/2009/06/02/1494836.html
http://www.ruanyifeng.com/blog/2009/07/xpath_path_expressions.html
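To illustrate the second difficulty (a loose relative path matching unwanted elements), here is a small sketch using Python's standard-library `xml.etree.ElementTree`, which supports a limited subset of XPath. The markup and the `class` attributes are invented for the example; the real page may not offer such stable attributes to anchor on, which would explain falling back to long absolute paths.

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup: two tables with the same row structure.
PAGE = """
<body>
  <div>
    <table class="nodes">
      <tr><td>node-1</td><td>idle</td></tr>
    </table>
  </div>
  <div>
    <table class="jobs">
      <tr><td>job_alpha</td><td>00:00:00</td></tr>
      <tr><td>job_beta</td><td>00:12:30</td></tr>
    </table>
  </div>
</body>
""".strip()

root = ET.fromstring(PAGE)

# Loose relative path: the first cell of every row in every table,
# including the table we are not interested in.
loose = [td.text for td in root.findall(".//tr/td[1]")]

# Anchored path: an attribute predicate narrows the search to one table.
anchored = [td.text for td in root.findall(".//table[@class='jobs']/tr/td[1]")]

print(loose)     # ['node-1', 'job_alpha', 'job_beta']
print(anchored)  # ['job_alpha', 'job_beta']
```

In Selenium the expression is evaluated by the browser's full XPath engine, so richer predicates are available there, but the idea is the same: anchor on something distinctive near the target instead of matching on structure alone.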
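Below is the core code. For project privacy reasons, some sensitive content has been replaced with *******. If you have any questions, feel free to leave me a comment.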
'''
command:

python KillJobs.py -url=172.20.9.42:1100 -screenShotPath=*****

'''

# Written against the older Selenium API (webdriver.PhantomJS and the
# find_element(s)_by_* helpers); newer Selenium 4 releases removed both.
from selenium import webdriver
import re
import time
import argparse
import os


mailReceiver = [
    "xxxxxxxxxxxxxxxx@xx"
]

# Names of jobs seen with a zero duration in each of three samples.
ZOMBIE_JOB_LIST = {"list1": [], "list2": [], "list3": []}

# Absolute XPath of the <tbody> holding the job rows. In the row patterns,
# "Order" is a placeholder that gets replaced by a 1-based row index.
JOB_TABLE_BODY = "//body/div[2]/div[2]/div/div[4]/div/div[3]/div/div[4]/div/div[2]/div/div[2]/div/div/div/div[3]/table[2]/tbody/tr[1]/td/fieldset/table/tbody/tr/td/table/tbody/tr/td/table/tbody"
JOBS_NAME_PATTERN_0 = JOB_TABLE_BODY + "/tr/td[1]"
JOBS_NAME_PATTERN = JOB_TABLE_BODY + "/tr[Order]/td[1]"
JOBS_KILL_PATTERN = JOB_TABLE_BODY + "/tr[Order]/td[3]/div/button"
JOBS_DURATION_PATTERN = JOB_TABLE_BODY + "/tr[Order]/td[5]"


def get_mail_receiver():
    # Space-separated recipients, with a leading space so the result can be
    # appended directly to the mail command.
    receiver = ' '
    for recv in mailReceiver:
        receiver = receiver + recv + ' '

    return receiver


def kill_zombie_jobs(screenShotPath, url):
    browser = webdriver.PhantomJS()  # Get local session of PhantomJS
    # browser = webdriver.Firefox()  # Get local session of Firefox
    browser.set_window_size(2500, 2000)

    targetUrl = "http://%s/#JOBS" % url
    print("url: %s" % targetUrl)

    job_to_be_kill_indicate = 0

    browser.get(targetUrl)  # Load page
    userName = browser.find_elements_by_class_name("gwt-TextBox")
    password = browser.find_elements_by_class_name("gwt-PasswordTextBox")
    submitButton = browser.find_elements_by_class_name("gwt-Button")

    if len(userName) == 0 or len(password) == 0 or len(submitButton) == 0:
        print("error in open url: %s" % targetUrl)
        browser.quit()
        return job_to_be_kill_indicate

    userName[0].send_keys("root")
    password[0].send_keys("changeit")
    time.sleep(1)
    submitButton[0].click()

    time.sleep(2)

    screen_shot_name = screenShotPath + "/Before_kill_jobs_screen_shot.png"
    browser.save_screenshot(screen_shot_name)

    # Sample the job table three times, one minute apart.
    for i in range(1, 4):
        tmp_list = "list" + str(i)
        job_name_elements_list = browser.find_elements_by_xpath(JOBS_NAME_PATTERN_0)
        job_length = len(job_name_elements_list)

        for index in range(1, job_length + 1):
            job_name_pattern = JOBS_NAME_PATTERN.replace("Order", str(index))
            job_duration_pattern = JOBS_DURATION_PATTERN.replace("Order", str(index))
            job_name = get_element_name(browser, job_name_pattern)
            job_duration_time = get_duration_time(get_element_name(browser, job_duration_pattern))

            if len(job_name) > 10 and job_duration_time == 0:
                ZOMBIE_JOB_LIST[tmp_list].append(job_name)
        time.sleep(60)

    zombie_job_list = get_zombie_job_list(ZOMBIE_JOB_LIST)
    print("\n ---------To be killed job list: {}".format(zombie_job_list))
    len1 = len(zombie_job_list)
    print("\n ---------To be killed job list length: {}".format(len1))
    kill_jobs_in_list(browser, zombie_job_list)
    print("\n ---------After killed job list: {}".format(zombie_job_list))
    len2 = len(zombie_job_list)
    print("\n ---------After killed job list length: {}".format(len2))

    time.sleep(2)
    screen_shot_name = screenShotPath + "/After_kill_jobs_screen_shot.png"
    browser.save_screenshot(screen_shot_name)
    browser.quit()

    # Killed jobs are removed from the list, so a shorter list means
    # at least one job was killed.
    if len2 < len1:
        job_to_be_kill_indicate = 1
    return job_to_be_kill_indicate


def get_element_name(browser, element_pattern):
    # Return the element's text, or "" if it cannot be found (for example
    # because the page refreshed between lookups).
    element_name = ""
    try:
        element = browser.find_element_by_xpath(element_pattern)
        element_name = element.text
    except Exception:
        print("element not exist any more!!!!!")
        element_name = ""

    return element_name


def get_duration_time(timeStr):
    # Convert "HH:MM:SS" to seconds; anything else counts as 0.
    if timeStr is None or timeStr == "":
        return 0
    if re.match(r"\d{2}:\d{2}:\d{2}", timeStr) is None:
        return 0

    timeSec = int(timeStr[0:2]) * 3600 + int(timeStr[3:5]) * 60 + int(timeStr[6:8])

    return timeSec


def send_kill_jobs_mail(mailer, screenShotPath, url, indicator):
    # jobs screen before and after kill
    mailTitle = "Jobs_on_%s_Hanging" % url
    screenShotFile1 = screenShotPath + "/Before_kill_jobs_screen_shot.png"
    screenShotFile2 = screenShotPath + "/After_kill_jobs_screen_shot.png"
    logFile = screenShotPath + "/nodes_hanging.log"
    command = 'mail -a ' + screenShotFile1 + ' -a ' + screenShotFile2 + ' -s ' + mailTitle + mailer + ' < ' + logFile
    print("command: %s" % command)
    os.system(command)

    return 0


def get_zombie_job_list(job_name_list):
    # A job is a zombie only if it appears in all three samples.
    print("list1: {}".format(job_name_list['list1']))
    print("list2: {}".format(job_name_list['list2']))
    print("list3: {}".format(job_name_list['list3']))
    job_list = []
    if (not job_name_list) or (not job_name_list['list1']) or (not job_name_list['list2']) or (not job_name_list['list3']):
        return job_list

    for job in job_name_list['list1']:
        if (job in job_name_list['list2']) and (job in job_name_list['list3']):
            job_list.append(job)

    return job_list


def kill_jobs_in_list(browser, zombie_job_list):
    if (not browser) or (not zombie_job_list):
        return 0

    # Make several passes, because rows can move or vanish between refreshes.
    for i in range(0, 5):
        job_name_elements_list = browser.find_elements_by_xpath(JOBS_NAME_PATTERN_0)
        job_length = len(job_name_elements_list)

        for index in range(1, job_length + 1):
            job_name_pattern = JOBS_NAME_PATTERN.replace("Order", str(index))
            job_kill_pattern = JOBS_KILL_PATTERN.replace("Order", str(index))
            job_duration_pattern = JOBS_DURATION_PATTERN.replace("Order", str(index))
            job_name = get_element_name(browser, job_name_pattern)
            job_duration_time = get_duration_time(get_element_name(browser, job_duration_pattern))

            if (job_name in zombie_job_list) and (job_duration_time == 0):
                print("This job should be killed: %s" % job_name)
                try:
                    kill_button_element = browser.find_element_by_xpath(job_kill_pattern)
                    kill_button_element.click()
                    confirm_kill_button_pattern = "//table/tbody/tr/td/table/tbody/tr/td[1]/button"

                    confirm_kill_button_element = browser.find_element_by_xpath(confirm_kill_button_pattern)
                    if confirm_kill_button_element.text == "Yes":
                        print("press button: %s" % confirm_kill_button_element.text)
                        confirm_kill_button_element.click()
                        time.sleep(1)
                        zombie_job_list.remove(job_name)
                except Exception:
                    print("Confirm Yes Button does not exist any more!!!!!")
        time.sleep(0.5)


def monitor():
    # kill exist PhantomJS
    command = "killall phantomjs"
    print("kill all existing phantomjs: %s" % command)
    os.system(command)

    parser = argparse.ArgumentParser()
    parser.add_argument('-url', action='store', dest='url', help='data url', required=True)
    parser.add_argument('-screenShotPath', action='store', dest='screenShotPath', help='the screen shot path', required=True)
    results = parser.parse_args()

    print("DataRush URL = %s" % results.url)
    url = results.url
    print("Screen Shot Path = %s" % results.screenShotPath)
    screenShotPath = results.screenShotPath

    mailer = get_mail_receiver()

    print("START: Monitor DataRush Starting..............................")

    job_killed_indicate = kill_zombie_jobs(screenShotPath, url)

    if job_killed_indicate == 1:
        print("zombie jobs has been killed!!!!!!!")
        send_kill_jobs_mail(mailer, screenShotPath, url, 1)

    print("End: Monitor Finished.....................................")


if __name__ == '__main__':
    monitor()
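Because the page re-renders about once per second, a lookup such as `get_element_name` can fail simply because it ran in the middle of a refresh. One possible mitigation, sketched here and not part of the original script, is to retry the whole locate-and-read step a few times. `retry_lookup`, `attempts`, `delay`, and `default` are names invented for this sketch; in real use the `lookup` callable would wrap something like `browser.find_element_by_xpath(pattern).text`.

```python
import time


def retry_lookup(lookup, attempts=3, delay=0.2, default=""):
    """Call lookup() until it succeeds or the attempts run out.

    lookup must re-locate the element on every call; holding on to an
    element across page refreshes is what produces stale references.
    """
    for attempt in range(attempts):
        try:
            return lookup()
        except Exception:
            # The broad catch mirrors the original script. With Selenium one
            # would normally catch NoSuchElementException and
            # StaleElementReferenceException specifically.
            if attempt < attempts - 1:
                time.sleep(delay)
    return default


# Stand-in for a flaky page: fails twice, then returns a value.
calls = {"count": 0}


def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise LookupError("element not attached yet")
    return "job_alpha"


print(retry_lookup(flaky_read, delay=0))  # job_alpha
```

Returning a default instead of raising keeps the same contract as `get_element_name`, which yields an empty string when the element is missing, so the helper could be dropped in without changing the callers.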
