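Python address standardization by administrative division: the addresses business managers submit are a mess, and the Amap (高德) API is impressively capable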


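Requirement: for business inspection purposes, a structured address such as "XX省XX市XX區XXX號" must be mapped onto the administrative-division hierarchy used by the National Bureau of Statistics: province, city, district (county), town (subdistrict), village.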

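Possible approaches:

1. Write a custom text parser. This is fairly complex and many cases would not be covered, so it is set aside for now. If parsing did work, it would be the fastest option.

2. Use a crawler: search Baidu for "百度百科" plus the business address and analyse the address information on the first result page. The pages can differ a lot, which makes the analysis difficult, but the advantage is that crawling is not limited by a quota.

3. Rely on the Amap API (geocoding and reverse geocoding), https://lbs.amap.com/api/webservice/guide/api/georegeo. Individual developers get a free quota of 300,000 calls per day, which is enough for ordinary use, and it is fast.

Given the current business volume, approach 3 was chosen.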

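Preparation:

Dependencies: requests, lxml, pandas

1. Read the Amap API parameter documentation. The conclusion: geocode the address text to get longitude and latitude, then reverse-geocode those coordinates to get the province, city, district (county) and town (subdistrict). Special case: when an address is very irregular, a default search city has to be prepended.

2. Crawl the statistical division codes and urban-rural classification codes (統計用區划和城鄉划分代碼) at http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2019/index.html and store them as a table (read later from data/stats.csv). This is needed mainly because Amap's reverse-geocoding API does not go down to the village level; if it did, there would be no need to crawl the National Bureau of Statistics data. In the last step, the villages listed under the matched subdistrict are compared against the organisation's address to find the final level.

3. Learn XPath parsing with the lxml library, because the Amap API responses requested here are in XML.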
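To make the XPath step concrete, here is a minimal sketch that parses a hand-written XML snippet. The snippet is an assumption: it only mimics the element paths used in this post (response/count, response/geocodes/geocode/city and .../location) and is not a real Amap response. The helper name pick_location is hypothetical. The standard-library xml.etree.ElementTree is used so the sketch runs without extra packages; with lxml the same lookups would be written as XPath expressions ending in /text().

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT a real Amap response; the coordinates are made up.
SAMPLE_XML = """<response>
  <status>1</status>
  <count>2</count>
  <geocodes>
    <geocode><city>廣州市</city><location>113.000000,23.000000</location></geocode>
    <geocode><city>惠州市</city><location>114.000000,23.100000</location></geocode>
  </geocodes>
</response>"""


def pick_location(xml_text, default_city):
    """Return the location of the candidate in default_city,
    else the location of the first candidate, else ''."""
    root = ET.fromstring(xml_text)
    geocodes = root.findall("./geocodes/geocode")
    if not geocodes:
        return ""
    chosen = geocodes[0]
    for geocode in geocodes:
        # city and location are read from the same node, so they cannot get misaligned
        if geocode.findtext("city", default="") == default_city:
            chosen = geocode
            break
    return chosen.findtext("location", default="")


print(pick_location(SAMPLE_XML, "惠州市"))
```

Reading both fields from the same geocode node avoids pairing a city with the wrong coordinates when one candidate has an empty city element.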

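Implementation:

1. Open the Excel file with pandas. Be sure to pass dtype=object so the data keeps its original form; otherwise text that looks numeric is loaded as numbers.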

import pandas as pd
import requests
from lxml import etree

file_name = 'data/address2test.xls'
df = pd.read_excel(file_name, dtype=object)
city_bk = '惠州市'  # default city, prepended when an address cannot be geocoded
# Build the requests
req_geo_url = ''
req_geo_s = 'https://restapi.amap.com/v3/geocode/geo?address='
req_geo_e = '&output=XML&key=2a8d3af7ce489cb7e219d7df54d92678'
req_regeo_url = ''
req_regeo_s = 'https://restapi.amap.com/v3/geocode/regeo?output=xml&location='
req_regeo_e = '&key=2a8d3af7ce489cb7e219d7df54d92678&radius=1000&extensions=all'
headers = {
            'User-Agent':'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.1; WOW64; Trident/7.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET4.0C; .NET4.0E; Media Center PC 6.0)',
}
list_err_url = []  # URLs whose request failed

# Reorder the columns: the first column is taken as the address and the result columns are appended;
# columns that already exist are kept. reindex has no inplace parameter, so assign its return value.
# Columns: result flag, standard address, country, province, city, county code, county,
# town code, town, street, village address.
new_columns = [df.columns[0]] + ['執行結果','標准地址','國家','省份','城市','縣區代碼','縣區','鄉鎮代碼','鄉鎮','街道','鄉村地址']
df = df.reindex(columns=new_columns)
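The request URLs above are built by plain string concatenation, which can break when an address contains characters such as & or #. As a hedged alternative, the sketch below builds the same two URLs with urllib.parse.urlencode. The function names build_geo_url and build_regeo_url and the placeholder key "YOUR_KEY" are assumptions for illustration; the endpoints and parameter names are the ones already used in this post.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

GEO_ENDPOINT = "https://restapi.amap.com/v3/geocode/geo"
REGEO_ENDPOINT = "https://restapi.amap.com/v3/geocode/regeo"


def build_geo_url(address, key):
    """Geocoding URL (address -> coordinates), XML output."""
    params = {"address": address, "output": "XML", "key": key}
    return GEO_ENDPOINT + "?" + urlencode(params)


def build_regeo_url(location, key):
    """Reverse-geocoding URL ('lng,lat' -> address components), XML output."""
    params = {"output": "xml", "location": location, "key": key,
              "radius": 1000, "extensions": "all"}
    return REGEO_ENDPOINT + "?" + urlencode(params)


# An address with characters that would otherwise be read as URL syntax
url = build_geo_url("惠州市XX路1號A&B棟#2", "YOUR_KEY")
print(url)
# Round trip: the query string decodes back to the original address
print(parse_qs(urlsplit(url).query)["address"][0])
```

Passing a params dictionary to requests.get would achieve the same escaping; the explicit version is shown only so the resulting URL can still be printed and logged as in the post.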

2. Iterate over every row. df.apply is used here to build the Amap API request with requests and run it row by row.

# Helper: first text node matched by an XPath expression, or '' when nothing matches
def first_text(node, path):
    found = node.xpath(path)
    return found[0] if found else ''

# Request function (it has to be defined before df.apply uses it below)
def append_address(x):
    result = 1
    address = str(x.iloc[0])
    url = req_geo_s + address + req_geo_e
    print('row:', str(x.name), 'address:', address, 'url:', url)
    # Initialise the results
    location = formatted_address = country = province = city = district = ''
    adcode = township = towncode = streetNumber_street = streetNumber_number = ''
    try:
        resp = requests.get(url, timeout=5, headers=headers)  # set the timeout and the HTTP headers
        xml = etree.XML(resp.content)
        count = xml.xpath('/response/count/text()')[0]
        if int(count) == 0:
            # No result means the address is very irregular. Such records are usually
            # local business, so retry with the default city prepended.
            url = req_geo_s + city_bk + address + req_geo_e
            resp = requests.get(url, timeout=5, headers=headers)
            xml = etree.XML(resp.content)
        # Read city and location from the same geocode node so the two stay aligned
        geocodes = xml.xpath('/response/geocodes/geocode')
        location = first_text(geocodes[0], 'location/text()')
        # With several candidates, prefer the one in the default city
        for geocode in geocodes:
            if first_text(geocode, 'city/text()') == city_bk:
                location = first_text(geocode, 'location/text()')
    except Exception as e:
        print('req_geo_e error message:', str(e), 'error url:', url)
        list_err_url.append(url)
        result = 0
        location = ''
    # If geocoding succeeded, continue with reverse geocoding
    if location != '' and result != 0:
        url = req_regeo_s + location + req_regeo_e
        try:
            resp = requests.get(url, timeout=5, headers=headers)  # set the timeout and the HTTP headers
            xml = etree.XML(resp.content)
            # Reverse-geocoding fields; a missing field stays an empty string
            comp = '/response/regeocode/addressComponent/'
            formatted_address = first_text(xml, '/response/regeocode/formatted_address/text()')
            country = first_text(xml, comp + 'country/text()')
            province = first_text(xml, comp + 'province/text()')
            city = first_text(xml, comp + 'city/text()')
            district = first_text(xml, comp + 'district/text()')
            adcode = first_text(xml, comp + 'adcode/text()')
            township = first_text(xml, comp + 'township/text()')
            towncode = first_text(xml, comp + 'towncode/text()')
            streetNumber_street = first_text(xml, comp + 'streetNumber/street/text()')
            streetNumber_number = first_text(xml, comp + 'streetNumber/number/text()')
        except Exception as e:
            print('location error message:', str(e), 'error url:', url)
            result = 0
            list_err_url.append(url)
    # Return the results as a tuple
    return (result, formatted_address, country, province, city, adcode, district,
            towncode, township, streetNumber_street + streetNumber_number)

# Only process the rows that have not succeeded yet
df_sel = df['執行結果'] != 1
result_columns = ['執行結果','標准地址','國家','省份','城市','縣區代碼','縣區','鄉鎮代碼','鄉鎮','街道']
if df_sel.any():
    results = df[df_sel].apply(append_address, axis=1)  # one tuple per row
    for i, col in enumerate(result_columns):
        df.loc[df_sel, col] = [r[i] for r in results]

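3. At this point four levels of the address are available and the last level still has to be filled in. First, from the crawled National Bureau of Statistics data, build a two-level dictionary of lists of the form {area code (first 6 digits): {town code (digits 7-9): [village]}}.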
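As a small illustration of that structure, the sketch below splits 12-digit statistical codes (6 digits for the county-level area, 3 for the town, 3 for the village) and groups a few rows. The rows and the helper name split_stat_code are made up for the example and are not taken from the crawled data.

```python
def split_stat_code(stat_code):
    """Split a 12-digit code into (county part, town part, village part)."""
    code = str(stat_code).strip().replace("'", "")
    return code[:6], code[6:9], code[9:12]


# Made-up sample rows: (statCode, village name)
rows = [
    ("110101001001", "甲社區居委會"),
    ("110101001002", "乙社區居委會"),
    ("110101002001", "丙村委會"),
]

d_state = {}
for stat_code, village in rows:
    county, town, _ = split_stat_code(stat_code)
    # setdefault creates the inner dict or list the first time a key is seen
    d_state.setdefault(county, {}).setdefault(town, []).append(village)

print(d_state)
```

Looking up the villages of one town is then a plain two-step access such as d_state["110101"]["001"].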

# Read the administrative divisions and keep only the village-level (level 5) rows
sdf = pd.read_csv('data/stats.csv', dtype=object)
sdf.drop(sdf[sdf['statType'] != 'village'].index, inplace=True)
sdf.drop(columns=['statName', 'statProvince','statCity','statCounty','statTown','statVillageType'],inplace=True)

# Build the administrative-division dictionary
d_state = {}
for code, village in zip(sdf['statCode'], sdf['statVillage']):
    # Split the code
    statCode = str(code).strip().replace("'", "")
    city = statCode[:6]   # area code, first 6 digits
    town = statCode[6:9]  # town code, digits 7-9
    # deal_village (not shown in this post) returns
    # (full village name, short name used for matching, identifier)
    village_deal = deal_village(str(village))
    d_state.setdefault(city, {}).setdefault(town, []).append(village_deal)
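The deal_village helper is not shown in this post. The sketch below is only a guess at what such a function could look like, based on the comment that it yields a full name, a short name for matching and an identifier. The suffix list and the choice to return the stripped suffix as the identifier are assumptions, and should be adapted to the names that actually appear in the crawled data.

```python
# Assumed committee suffixes, longer ones first so they are stripped in preference
VILLAGE_SUFFIXES = ("社區居民委員會", "社區居委會", "村民委員會", "居委會", "村委會", "社區")


def deal_village(full_name):
    """Return (full name, short name used for matching, stripped suffix).

    Hypothetical stand-in for the helper used in the post.
    """
    name = full_name.strip()
    for suffix in VILLAGE_SUFFIXES:
        # Keep at least one character of the name after stripping
        if name.endswith(suffix) and len(name) > len(suffix):
            return name, name[: -len(suffix)], suffix
    # No known suffix: match on the full name
    return name, name, ""


print(deal_village("下角社區居委會"))
print(deal_village("某某農場"))
```

Very short names produced this way (one character) are likely to match too many addresses, so a real implementation would probably need a minimum length or extra checks.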

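4. Iterate over the standardised addresses again and match each village's short name against the detailed address; if a match is found, return it and fill in the last level.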
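The matching step is not listed in the post, so the following is a minimal sketch of one way to do it. It assumes that the towncode returned by reverse geocoding is a 12-digit code whose digits 7-9 line up with the statistical town code, and that each dictionary entry is a (full name, short name, identifier) tuple. The function name match_village and the sample data are hypothetical.

```python
def match_village(address, adcode, towncode, d_state):
    """Return the entry whose short name occurs in address, or None.

    When several short names occur, the longest one wins.
    """
    town_key = str(towncode)[6:9]
    candidates = d_state.get(str(adcode), {}).get(town_key, [])
    best = None
    for entry in candidates:
        short = entry[1]
        if short and short in address:
            if best is None or len(short) > len(best[1]):
                best = entry
    return best


# Made-up sample in the shape {area code: {town code: [(full, short, identifier)]}}
sample_state = {
    "441302": {
        "003": [("甲社區居委會", "甲", "社區居委會"), ("甲乙村委會", "甲乙", "村委會")],
    }
}
hit = match_village("XX市XX區甲乙路8號", "441302", "441302003000", sample_state)
print(hit[0] if hit else "no match")
```

Preferring the longest short name reduces false hits from very short names, but a plain substring test remains a rough heuristic.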

 

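Summary

1. Amap API success rate: of the 20,000-plus addresses processed so far, only 28 could not be recognised, and 5,000 needed the default city added before they could be found. Overall the result is good.

2. The final village-level completion uses only a simple short-name match, and the result is mediocre. Possible improvements: use a crawler to find the nearest community or village committee, or look for a website that offers this lookup and crawl it.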

