# Python: recognizing personal names in a string, splitting a run of pinyin, and extracting pinyin finals

Reposted article (see the original), 2021-11-25 17:51, 1262 views, category: Other

## I. Recognizing personal names or specific nouns in a string

### 1. Install the Python SDK

~~~markdown
Install with: pip install baidu-aip
~~~

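### 2. Get the APP ID

~~~markdown
To use this API you need credentials from Baidu AI Cloud (the APP ID, API KEY, and SECRET KEY shown in the figure below).
After logging in, create an application in the Baidu AI Cloud console; the API is then called with that application's credentials, as shown below. Console URL:
    https://console.bce.baidu.com/ai/#/ai/nlp/overview/index
~~~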

![20191024184444895](assets/20191024184444895.png)

### 3. Calling the API from code

~~~python
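import time

from aip import AipNlp  # installed by the baidu-aip package


def get_chinese_name(text):
    """
    Return the first personal name found in a Chinese string.

    :param text: Chinese string
    :return: the personal name, or '' if none is found
    """
    # APP ID, API key (AK) and secret key (SK) obtained in the previous step
    APP_ID = 'your APP ID'
    API_KEY = 'your AK'
    SECRET_KEY = 'your SK'

    client = AipNlp(APP_ID, API_KEY, SECRET_KEY)

    # Drop characters that GBK cannot encode ('ignore'); without this the call raises an error.
    text = str(text.encode('gbk', 'ignore'), encoding='gbk')

    # Space out requests: the free tier is limited to 2 QPS (a higher quota can be purchased).
    time.sleep(1)

    # Run lexical analysis once and reuse the result for printing and parsing
    result = client.lexer(text)
    print(result)

    for i in result.get('items', []):
        # Return the first item tagged as a personal name
        if i['ne'] == 'PER':
            return i['item']

    return ''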
~~~

Let's test it on a sample string:

~~~python
text = "这是一段测试文本,我的中文名是媛媛"
print(get_chinese_name(text))
~~~

Output:

~~~python
 {'log_id': 375282685928253176, 'text': '这是一段测试文本,我的中文名是媛媛', 'items': [{'loc_details': [], 'byte_offset': 0, 'uri': '', 'pos': 'r', 'ne': '', 'item': '这', 'basic_words': ['这'], 'byte_length': 2, 'formal': ''}, {'loc_details': [], 'byte_offset': 2, 'uri': '', 'pos': 'v', 'ne': '', 'item': '是', 'basic_words': ['是'], 'byte_length': 2, 'formal': ''}, {'loc_details': [], 'byte_offset': 4, 'uri': '', 'pos': 'm', 'ne': '', 'item': '一段', 'basic_words': ['一', '段'], 'byte_length': 4, 'formal': ''}, {'loc_details': [], 'byte_offset': 8, 'uri': '', 'pos': 'vn', 'ne': '', 'item': '测试', 'basic_words': ['测试'], 'byte_length': 4, 'formal': ''}, {'loc_details': [], 'byte_offset': 12, 'uri': '', 'pos': 'n', 'ne': '', 'item': '文本', 'basic_words': ['文本'], 'byte_length': 4, 'formal': ''}, {'loc_details': [], 'byte_offset': 16, 'uri': '', 'pos': 'w', 'ne': '', 'item': ',', 'basic_words': [','], 'byte_length': 1, 'formal': ''}, {'loc_details': [], 'byte_offset': 17, 'uri': '', 'pos': 'r', 'ne': '', 'item': '我', 'basic_words': ['我'], 'byte_length': 2, 'formal': ''}, {'loc_details': [], 'byte_offset': 19, 'uri': '', 'pos': 'u', 'ne': '', 'item': '的', 'basic_words': ['的'], 'byte_length': 2, 'formal': ''}, {'loc_details': [], 'byte_offset': 21, 'uri': '', 'pos': 'n', 'ne': '', 'item': '中文名', 'basic_words': ['中文', '名'], 'byte_length': 6, 'formal': ''}, {'loc_details': [], 'byte_offset': 27, 'uri': '', 'pos': 'v', 'ne': '', 'item': '是', 'basic_words': ['是'], 'byte_length': 2, 'formal': ''}, {'loc_details': [], 'byte_offset': 29, 'uri': '', 'pos': '', 'ne': 'PER', 'item': '媛媛', 'basic_words': ['媛媛'], 'byte_length': 4, 'formal': ''}]}
媛媛 
~~~
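The function above sleeps a full second before every request to stay under the free tier's limit of 2 QPS. As a rough sketch only, the hypothetical `Throttle` class below (not part of the Baidu SDK) waits just long enough to keep calls at least `min_interval` seconds apart. The `clock` and `sleep` parameters are assumptions added so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Block so that successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# A limit of 2 QPS means at most one request every 0.5 seconds.
throttle = Throttle(min_interval=0.5)
# throttle.wait() would be called right before each client.lexer(...) request.
```

With this in place, a single request pays no delay at all, and a burst of requests is spaced at half-second intervals instead of one second each.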

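### 4. Response fields

~~~markdown
The fields of the API response need a short explanation. For readability, the output from the previous step is pretty-printed at the end of this section. The two fields that matter here are:

pos : part of speech, assigned by the POS-tagging algorithm.
ne : named-entity type. In the example, the ne field of “媛媛” is PER, which means a personal name.
~~~

#### **Part-of-speech abbreviations**

![Part-of-speech abbreviations](assets/词性缩略说明.png)

#### **Named-entity abbreviations**

![Named-entity abbreviations](assets/专名识别缩略词含义.png)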

~~~python
{
    'log_id': 375282685928253176,
    'text': '这是一段测试文本,我的中文名是媛媛',
    'items': [{
        'loc_details': [],
        'byte_offset': 0,
        'uri': '',
        'pos': 'r',
        'ne': '',
        'item': '这',
        'basic_words': ['这'],
        'byte_length': 2,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 2,
        'uri': '',
        'pos': 'v',
        'ne': '',
        'item': '是',
        'basic_words': ['是'],
        'byte_length': 2,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 4,
        'uri': '',
        'pos': 'm',
        'ne': '',
        'item': '一段',
        'basic_words': ['一', '段'],
        'byte_length': 4,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 8,
        'uri': '',
        'pos': 'vn',
        'ne': '',
        'item': '测试',
        'basic_words': ['测试'],
        'byte_length': 4,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 12,
        'uri': '',
        'pos': 'n',
        'ne': '',
        'item': '文本',
        'basic_words': ['文本'],
        'byte_length': 4,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 16,
        'uri': '',
        'pos': 'w',
        'ne': '',
        'item': ',',
        'basic_words': [','],
        'byte_length': 1,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 17,
        'uri': '',
        'pos': 'r',
        'ne': '',
        'item': '我',
        'basic_words': ['我'],
        'byte_length': 2,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 19,
        'uri': '',
        'pos': 'u',
        'ne': '',
        'item': '的',
        'basic_words': ['的'],
        'byte_length': 2,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 21,
        'uri': '',
        'pos': 'n',
        'ne': '',
        'item': '中文名',
        'basic_words': ['中文', '名'],
        'byte_length': 6,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 27,
        'uri': '',
        'pos': 'v',
        'ne': '',
        'item': '是',
        'basic_words': ['是'],
        'byte_length': 2,
        'formal': ''
    }, {
        'loc_details': [],
        'byte_offset': 29,
        'uri': '',
        'pos': '',
        'ne': 'PER',
        'item': '媛媛',
        'basic_words': ['媛媛'],
        'byte_length': 4,
        'formal': ''
    }]
}
~~~
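`get_chinese_name` returns only the first personal name it finds. As a minimal sketch, assuming a response shaped like the sample printed above, the hypothetical helper `entities_by_type` (not part of the Baidu SDK) collects every item with a given `ne` type. The second half checks that `byte_offset` and `byte_length` in this sample line up with GBK byte positions; that is an observation from this one sample, not something the article states.

```python
def entities_by_type(response, ne_type='PER'):
    """Return the text of every item whose 'ne' field equals ne_type."""
    return [item['item'] for item in response.get('items', [])
            if item.get('ne') == ne_type]


# Trimmed copy of the sample response (only the fields used here).
sample = {
    'text': '这是一段测试文本,我的中文名是媛媛',
    'items': [
        {'item': '文本', 'pos': 'n', 'ne': '', 'byte_offset': 12, 'byte_length': 4},
        {'item': '媛媛', 'pos': '', 'ne': 'PER', 'byte_offset': 29, 'byte_length': 4},
    ],
}

print(entities_by_type(sample))  # ['媛媛']

# In this sample the offsets count GBK bytes, not characters.
raw = sample['text'].encode('gbk')
last = sample['items'][-1]
start, end = last['byte_offset'], last['byte_offset'] + last['byte_length']
print(raw[start:end].decode('gbk'))  # 媛媛
```

Using `.get('items', [])` means a response without an `items` key yields an empty list rather than a `KeyError`.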

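## II. Splitting a run of pinyin into individual syllables

~~~markdown
Suppose we want to split the string "zhoujielun" into the form "zhou-jie-lun". We can use backward matching: as in the figure below, pointer A moves toward B, and at each position the suffix s[A:] is looked up in a pinyin table (which you can download yourself). As soon as s[A:] is in the table, that syllable is matched and removed from the end of the string. Repeating this on what remains splits the whole string into syllables.
~~~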

![20191025125215453](assets/20191025125215453.png)



The code below shows the algorithm:

~~~python
def pinyin_word(string):
    '''
    Split a run of pinyin into individual syllables.
    :param string: the string to match
    :return: list of matched syllables, last syllable first
    '''
    max_len = 6   # the longest pinyin syllable has 6 letters (e.g. "zhuang")
    string = string.lower()
    result = []

    # Load the local pinyin table, one syllable per line
    with open('pinyin.txt', 'r', encoding='utf-8') as fi:
        pinyin_lib = {line.strip() for line in fi if line.strip()}

    # Backward matching: repeatedly strip the longest suffix found in the table
    while string:
        for i in range(min(max_len, len(string)), 0, -1):
            s = string[-i:]
            # Is this suffix in the pinyin table?
            if s in pinyin_lib:
                result.append(s)
                string = string[:-i]
                break
        else:
            # No suffix matched, so stop
            break
    return result


print(pinyin_word('zhoujielun'))

# Output: ['lun', 'jie', 'zhou']
~~~
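The function above depends on a local `pinyin.txt` and returns the syllables last-first. As a self-contained sketch of the same backward-matching idea, the hypothetical `split_pinyin` below takes the table as an argument and returns the syllables in reading order, which makes it easy to produce the "zhou-jie-lun" form. The tiny `SYLLABLES` set is an assumption for illustration, not a real pinyin table.

```python
def split_pinyin(text, syllables, max_len=6):
    """Split `text` into syllables by backward maximum matching.

    Returns the syllables in reading order, or None if some part of the
    string cannot be matched against `syllables`.
    """
    text = text.lower()
    result = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            tail = text[-size:]
            if tail in syllables:
                result.append(tail)
                text = text[:-size]
                break
        else:
            return None  # no suffix of the remaining text is a known syllable
    result.reverse()  # matches were collected from the end of the string
    return result


# Tiny illustrative table; a real one would list every valid syllable.
SYLLABLES = {'zhou', 'jie', 'lun', 'wo', 'ai', 'bian', 'cheng'}

parts = split_pinyin('zhoujielun', SYLLABLES)
print('-'.join(parts))  # zhou-jie-lun
```

Because the longest suffix always wins, an ambiguous string is resolved toward longer syllables; with a full table, "xian" would come back as one syllable rather than "xi" plus "an".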

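## III. Extracting pinyin finals

The approach is the same as in the previous section: build a table of finals and match backward against it.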

~~~python
def pinyin_word(string):
    '''
    Match a pinyin string backward against a lookup table.
    With a table of finals in place of the syllable table, the matches are finals.
    :param string: the string to match
    :return: list of matches, last match first
    '''
    max_len = 6   # longest entry in the table (6 letters for full syllables)
    string = string.lower()
    result = []

    # Load the lookup table, one entry per line.
    # The original reads pinyin.txt here; for finals, point this at a finals table instead.
    with open('pinyin.txt', 'r', encoding='utf-8') as fi:
        pinyin_lib = {line.strip() for line in fi if line.strip()}

    # Backward matching: repeatedly strip the longest suffix found in the table
    while string:
        for i in range(min(max_len, len(string)), 0, -1):
            s = string[-i:]
            # Is this suffix in the table?
            if s in pinyin_lib:
                result.append(s)
                string = string[:-i]
                break
        else:
            # No suffix matched (for example, only an initial is left), so stop
            break
    return result
~~~
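To make the finals case concrete, here is a minimal sketch for a single syllable. The `FINALS` set and the function name `extract_final` are assumptions for illustration (the article does not give its finals table), with `v` standing in for `ü` as is common in typed pinyin.

```python
# Illustrative finals table (assumed for this sketch; not from the article).
FINALS = {
    'a', 'o', 'e', 'i', 'u', 'v',
    'ai', 'ei', 'ui', 'ao', 'ou', 'iu', 'ie', 've', 'er',
    'an', 'en', 'in', 'un', 'vn',
    'ang', 'eng', 'ing', 'ong',
    'ia', 'ua', 'uo', 'ue', 'iao', 'uai', 'ian', 'uan',
    'iang', 'uang', 'iong',
}


def extract_final(syllable, finals=FINALS, max_len=4):
    """Return the longest suffix of `syllable` found in `finals`, or ''."""
    syllable = syllable.lower()
    for size in range(min(max_len, len(syllable)), 0, -1):
        tail = syllable[-size:]
        if tail in finals:
            return tail
    return ''


for s in ['zhou', 'jie', 'lun']:
    print(s, extract_final(s))
# zhou ou
# jie ie
# lun un
```

Matching from the end and preferring the longest suffix is what keeps "zhuang" from being reduced to "ang" when "uang" is also in the table.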

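## IV. A few handy snippets

### 1. Reading an xlsx file

~~~python
import xlrd  # note: xlrd 2.0 and later read only .xls; .xlsx needs xlrd 1.2 or another library such as openpyxl

data = xlrd.open_workbook('data.xlsx')
table = data.sheet_by_index(0)  # select the sheet by index
nrows = table.nrows  # number of rows
ncol = table.ncols  # number of columns
rowvalue = table.row_values(0)    # values in row 0
colvalue = table.col_values(0)    # values in column 0
print(rowvalue, colvalue)
~~~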

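### 2. Checking for a Chinese string

~~~python
import re

string = '媛媛abc'  # sample input (the original snippet leaves `string` undefined)
Pattern = re.compile('[\u4e00-\u9fa5]+')  # compiled regex for one or more Chinese characters
match = Pattern.match(string)  # match() anchors at the start of the string
# If the string starts with Chinese characters, print that leading run
if match:
    zh_name = match.group()
    print(zh_name)
~~~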
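The snippet above uses `match()`, which only looks at the start of the string, so it succeeds on any string that merely begins with Chinese characters. As a small sketch of two related checks (the helper names are hypothetical), `fullmatch()` requires the whole string to be Chinese, and `findall()` pulls out every Chinese run wherever it occurs.

```python
import re

CHINESE_RUN = re.compile('[\u4e00-\u9fa5]+')


def is_all_chinese(s):
    """True only if every character of s is in the range U+4E00..U+9FA5."""
    return CHINESE_RUN.fullmatch(s) is not None


def chinese_runs(s):
    """Return every maximal run of Chinese characters found anywhere in s."""
    return CHINESE_RUN.findall(s)


print(is_all_chinese('媛媛'))          # True
print(is_all_chinese('媛媛abc'))       # False
print(chinese_runs('abc媛媛123编程'))  # ['媛媛', '编程']
```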

### 3. Converting Chinese characters to pinyin

~~~python
# Convert Chinese characters to pinyin without tone marks
from pypinyin import lazy_pinyin

new_name = '我爱编程'
new_name = ''.join(lazy_pinyin(new_name))
print(new_name)

# Output: woaibiancheng
~~~