While doing data cleaning today, I needed to update a database field. I figured that updating the database with multiple processes might make the program run faster, so I spawned 64 processes.
The result was that every other program's updates became extremely slow. The cause, it turned out, was that the database had 64 UPDATE statements executing concurrently, which dragged down the speed of all inserts, deletes, updates, and queries against it.
A hard-earned lesson: from now on, avoid using many processes to update the database. Even if you do want to parallelize SQL updates, open only a few processes; the speedup is already quite noticeable at that point.
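The fix boils down to capping the pool size so the database only ever sees a handful of concurrent UPDATEs. A minimal sketch of the pattern, using a hypothetical `update_row` worker to stand in for the real UPDATE call:

```python
import multiprocessing


def update_row(item):
    # Stand-in for the real per-row UPDATE; here it just doubles the value
    # so the example is self-contained and verifiable.
    return item * 2


if __name__ == "__main__":
    rows = list(range(10))
    # A small pool (e.g. 4 workers) keeps the number of concurrent SQL
    # statements low; Pool(64) is what caused the slowdown.
    pool = multiprocessing.Pool(4)
    results = pool.map(update_row, rows)
    pool.close()
    pool.join()
    print(results)
```

With 4 workers instead of 64, the database lock/IO contention stays bounded while you still get most of the parallel speedup.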
Pasting the code below for future reference:
#-*-coding:utf-8-*-
# common.contest is the project's shared helper module; it is assumed to
# provide re, multiprocessing, update_data1 and select_data.
from common.contest import *
import time


def spider(item):
    print "cleaning url:", item['item_url']
    item_url = item['item_url']
    item_lotnum1 = item['item_lotnum']
    item_sold = item['item_sold']
    artron_session_url = item['artron_session_url']
    artfoxlive_session_url = item['artfoxlive_session_url']
    print item_lotnum1
    print item_sold

    # Strip up to three leading zeros from the lot number; the @@@ markers
    # anchor the regex to the whole string.
    try:
        item_lotnum2 = "@@@" + item_lotnum1 + "@@@"
        item_lotnum = re.findall('@@@000(.*?)@@@', item_lotnum2)[0]
    except:
        try:
            item_lotnum2 = "@@@" + item_lotnum1 + "@@@"
            item_lotnum = re.findall('@@@00(.*?)@@@', item_lotnum2)[0]
        except:
            try:
                item_lotnum2 = "@@@" + item_lotnum1 + "@@@"
                item_lotnum = re.findall('@@@0(.*?)@@@', item_lotnum2)[0]
            except:
                item_lotnum = item_lotnum1

    item_sold_cur_spider = ""
    if '流拍' in item_sold:        # passed (unsold)
        item_sold = -2
        item_sold_cur_spider = -2
    elif '撤拍' in item_sold:      # withdrawn
        item_sold = -3
        item_sold_cur_spider = -3
    elif '落槌價' in item_sold:    # hammer price
        item_sold1 = str(item_sold).replace('落槌價', '').replace(':', '') \
            .replace(',', '').replace(':', '').replace(' ', '').replace(' ', '')
        item_sold = re.findall('\d+', item_sold1)[0]                # numeric price
        item_sold_cur_spider = re.findall('[^\d]+', item_sold1)[0]  # currency text
    else:
        pass

    print item_sold
    print item_sold_cur_spider
    print artron_session_url
    print artfoxlive_session_url

    item_lotnum = item_lotnum.replace('@', '')
    print item_lotnum

    sql = 'update spider_yachang_2017_2_update_sold_price ' \
          'set item_sold_price_spider2 = %s, item_sold_cur_spider2 = %s ' \
          'where session_url = %s and item_lotnum = %s'
    data = (str(item_sold), str(item_sold_cur_spider),
            str(artron_session_url), str(item_lotnum))
    update_data1(sql, data=data)


if __name__ == "__main__":
    time1 = time.time()
    sql = """ SELECT * FROM oversea_artfoxlive_2017_2_detail_info """
    resultList = select_data(sql)
    print len(resultList)

    # 64 workers is what caused the slowdown described above -- keep this small.
    pool = multiprocessing.Pool(64)
    for item in resultList:
        # print "current position:", resultList.index(item)
        # spider(item)
        pool.apply_async(spider, (item,))
    pool.close()
    pool.join()