NLP (14): A Homemade Sequence Labeling Platform

Reposted article. 2019-08-09 00:10. Tags: NLP, sequence labeling

Background

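Named entity recognition (NER) comes up often in everyday NLP work. The entity types most commonly recognized are person names, place names, and organization names, but we frequently need to recognize other entities as well, such as times or brand names. Entity recognition is usually handled with a sequence labeling algorithm, which places requirements on the format of the annotated text. A good sequence labeling platform is therefore essential: it greatly reduces the annotation workload and speeds up iteration on the algorithm.
This post introduces one piece of my own work: a homemade sequence labeling platform. We use time recognition as the running example. Take the following passage: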

按計划，2019年8月10日，榮耀智慧屏將在華為開發者大會上正式亮相，在8月6日，榮耀官微表示該產品的預約量已破十萬台，8月7日下午，榮耀總裁趙明又在微博上造勢率先打出差異化牌，智慧屏沒有開關機廣告，並表態以后也不會有，消費者體驗至上，營銷一波接一波，可謂來勢洶洶。

We need to label three time expressions in this passage (2019年8月10日, 8月6日, and 8月7日下午) and produce the corresponding tag sequence.
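To make the target format concrete, here is a minimal sketch of the BIO scheme for the first time expression in the passage. The function name `bio_tags` and the `(start, length)` span format are assumptions made for this illustration, not part of the platform.

```python
def bio_tags(text, spans, label="TIME"):
    """Return one tag per character; spans are (start, length) pairs (hypothetical format)."""
    tags = ["O"] * len(text)
    for start, length in spans:
        if start < 0 or length < 1 or start + length > len(text):
            raise ValueError("span out of range: %r" % ((start, length),))
        tags[start] = "B-" + label              # first character of the entity
        for i in range(start + 1, start + length):
            tags[i] = "I-" + label              # remaining characters of the entity
    return tags


text = "按計划，2019年8月10日，榮耀智慧屏"
phrase = "2019年8月10日"
start = text.find(phrase)
tags = bio_tags(text, [(start, len(phrase))])

# Print one "character<TAB>tag" pair per line
for char, tag in zip(text, tags):
    print(char, tag, sep="\t")
```

Characters outside any time expression keep the tag O, the first character of a time expression gets B-TIME, and the remaining characters get I-TIME.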
The rest of this post describes the work in detail.

The Sequence Labeling Platform

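Development time was short and my abilities are limited, so the platform is not yet feature-complete. I hope it can serve as a starting point for better tools.
The project structure is as follows:

The templates directory holds the front-end resources, and time_index.html is the platform's user interface. time_output is the directory where files are saved once the entities have been labeled. time_server.py is the server-side routing code, written with tornado. utils.py contains a function that returns the largest number used as a txt file name in a given directory.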

The complete code of utils.py is as follows:

# -*- coding: utf-8 -*-
# time: 2019-03-14
# place: Xinbeiqiao, Beijing

import os

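# Return the largest number used as a txt file name in `path` (e.g. 12 for "12.txt")
def get_max_num(path):
    files = os.listdir(path)
    # Only consider files named "<number>.txt"; skip anything else (e.g. hidden files)
    numbers = [int(name[:-4]) for name in files
               if name.endswith('.txt') and name[:-4].isdecimal()]
    # With no numbered files yet, numbering starts from 0
    return max(numbers) if numbers else 0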
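As a quick illustration of how this helper drives file naming, the sketch below runs a stand-in with the same behavior for numbered .txt files against a temporary directory. The directory and the file numbers are made up for the example.

```python
import os
import tempfile


def get_max_num(path):
    # Stand-in for the utils.py helper: largest "<number>.txt" name, or 0 if none
    files = os.listdir(path)
    if files:
        return max(int(name.replace('.txt', '')) for name in files)
    return 0


with tempfile.TemporaryDirectory() as out_dir:
    # Empty directory: the first saved file would be 1.txt
    first = get_max_num(out_dir) + 1
    # Pretend three annotations were saved earlier (hypothetical numbers)
    for n in (1, 2, 10):
        open(os.path.join(out_dir, '%d.txt' % n), 'w').close()
    # File names are compared as integers, so 10 ranks above 2
    next_num = get_max_num(out_dir) + 1

print(first, next_num)
```

Because the names are converted to integers before taking the maximum, the next file after 1.txt, 2.txt, and 10.txt is 11.txt rather than 3.txt.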

The complete code of time_server.py is as follows:

# -*- coding: utf-8 -*-
# time: 2019-08-08
# place: Xinbeiqiao, Beijing

import os.path
import tornado.httpserver
import tornado.ioloop
import tornado.options
import tornado.web
from tornado.options import define, options
from utils import get_max_num

# Listen on port 9005 by default
define("port", default=9005, help="run on the given port", type=int)

# Handler for GET requests
class QueryHandler(tornado.web.RequestHandler):
    # Render an empty labeling page
    def get(self):
        self.render('time_index.html', data = ['', []])

# Handler for POST requests
class PostHandler(tornado.web.RequestHandler):
    # Build the tag sequence, show it, and save it
    def post(self):

        # Form fields sent by the front end: event (corpus), time, index
        event = self.get_argument('event')
        times = self.get_arguments('time')
        indices = self.get_arguments('index')
        print(event)
        print(times)
        print(indices)

        # Build the tag sequence: every character starts as 'O'
        tags = ['O'] * len(event)

        for time, index in zip(times, indices):
            index = int(index)
            tags[index] = 'B-TIME'
            for i in range(1, len(time)):
                tags[index+i] = 'I-TIME'

        data = [event, tags]

        # Show the tag sequence on the page
        self.render('time_index.html', data=data)

        # Save as a txt file, one "character<TAB>tag" pair per line
        dir_path = './time_output'
        file_path = os.path.join(dir_path, '%s.txt' % (get_max_num(dir_path)+1))
        with open(file_path, 'w', encoding='utf-8') as f:
            for char, tag in zip(event, tags):
                f.write(char+'\t'+tag+'\n')


# Entry point
def main():
    # Parse command-line options (e.g. --port)
    tornado.options.parse_command_line()
    # Define the application
    app = tornado.web.Application(
            handlers=[(r'/query', QueryHandler),
                      (r'/result', PostHandler)
                      ], # URL routing
            template_path=os.path.join(os.path.dirname(__file__), "templates") # template directory
          )
    http_server = tornado.httpserver.HTTPServer(app)
    http_server.listen(options.port)
    tornado.ioloop.IOLoop.current().start()

if __name__ == '__main__':
    main()

The time_index.html file is as follows:

<!DOCTYPE html>
<html>
<head>
	<meta charset="utf-8">
	<title>時間抽取標注平台</title>
	<link rel="stylesheet" href="https://cdn.staticfile.org/twitter-bootstrap/3.3.7/css/bootstrap.min.css">
    <script src="https://cdn.bootcss.com/jquery/3.4.1/jquery.min.js"></script>
	<script src="https://cdn.staticfile.org/twitter-bootstrap/3.3.7/js/bootstrap.min.js"></script>
	<style>
        mark {
            background-color:#00ff90; font-weight:bold;
        }
		p{text-indent:2em;}
    </style>
    <script>
        var click_cnt = 0;

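        // Double-clicking the i-th select fills it with every start index of the entered time string
        function select_click(i){
        	var content = document.getElementById('event').value;
        	var time = document.getElementById('time_'+i.toString()).value;
        	var select = document.getElementById('index_'+i.toString());
        	// Clear options left by earlier double-clicks so no index is listed twice
        	select.innerHTML = '';
        	// Nothing to search for if the time box is empty
        	if(time.length == 0){ return; }

        	for(var j=0; j<=content.length-time.length; j++){
        		if(content.substr(j, time.length) == time){
        			var option = document.createElement("option");
        			option.value = j;
        			option.innerHTML = j;
        			select.appendChild(option);
        		}
        	}
        }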

		// Add a new text input and select box each time the add_time button is clicked
        $(document).ready(function(){

            $("#add_time").click(function(){
                 click_cnt = click_cnt + 1;
                 var input_id = new String('time_'+click_cnt.toString());
                 var index_id = new String('index_'+click_cnt.toString());
                 var content = "<input type='text' id=" + input_id + " class='form-control' style='width:306px;' name='time' /> \
                 				&emsp;&emsp;&emsp; <select class='form-control' name='index' id="+ index_id + " style='width:120px;' \
                 				ondblclick='select_click("+click_cnt.toString()+")'></select>";
                 $(content).appendTo($("#time_column"));
            });

        });

	</script>
</head>
<body>

<center>
    <br><br><br>
<form class="form-horizontal" role="form" method="post" action="/result" style="width:600px">
	<div class="form-group">
		<label for="event" class="col-sm-2 control-label">輸入語料</label>
		<div class="col-sm-10">
			<textarea type="text" class="form-control" id="event" style="width:490px; height:200px" name="event"></textarea>
		</div>
	</div>
	<div class="form-inline" style="text-align:left;">
		<label for="time_0" class="col-sm-2 control-label">時間</label>
		<div class="col-sm-10" id="time_column">
			<input type="text" class="form-control" id="time_0" style="width:306px;" name="time" />
            &emsp;&emsp;&emsp;
            <select class="form-control" id="index_0" name="index" style="width:120px;" ondblclick="select_click(0)"></select>
		</div>
	</div>
	<div class="form-group">
		<div class="col-sm-offset-2 col-sm-10">
            <br>
            <button type="button" class="btn btn-default" id="add_time">添加時間</button>
			<button type="submit" class="btn btn-success">顯示標簽</button>
			<a href="/query"><button type="button" class="btn btn-danger">返回</button></a>
            <button type="reset" class="btn btn-warning">重置</button>
		</div>
	</div>

</form>
	<br>
	<div style="width:600px">
		<p> 原文：{{data[0]}} </p>
		<table class="table table-striped">
		{% for char, tag in zip(data[0], data[1]) %}
			<tr>
				<td>{{char}} </td>
				<td>{{tag}} </td>
			</tr>
		{%end%}
		</table>
	</div>
</center>

</body>
</html>

Using the Platform

After starting time_server.py, open http://localhost:9005/query in a browser. The following page is displayed:

In the 輸入語料 (Input Corpus) box, we enter the corpus:

8月8日是“全民健身日”，推出重磅微視頻《我們要贏的，是自己》。

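In the 時間 (Time) input box, enter a time expression that appears in the corpus, then double-click the drop-down list on the same row to list the positions at which that expression starts in the corpus. The same time expression sometimes occurs more than once in a corpus, in which case we pick the start position we want from the drop-down list.
Clicking the 添加時間 (Add Time) button adds another row, which lets us label several time expressions in the same corpus. A simple labeling example looks like this: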
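The lookup behind the drop-down list can be sketched in Python as follows. This is an illustration of the idea rather than the platform's own code (which does the search in JavaScript); the function name `find_starts` and the second sample string are made up for the example.

```python
def find_starts(text, phrase):
    """Return every index at which `phrase` starts in `text` (overlapping matches included)."""
    if not phrase:
        return []
    starts = []
    pos = text.find(phrase)
    while pos != -1:
        starts.append(pos)
        # Resume one character later so overlapping occurrences are not missed
        pos = text.find(phrase, pos + 1)
    return starts


corpus = "8月8日是“全民健身日”，推出重磅微視頻《我們要贏的，是自己》。"
single = find_starts(corpus, "8月8日")

# Hypothetical corpus in which the same time expression appears twice
repeated = find_starts("8月6日預約，8月6日發布", "8月6日")

print(single, repeated)
```

When a phrase occurs more than once, every start index is listed and the annotator chooses the one that applies. One caveat: Python indexes strings by code point while browser JavaScript indexes by UTF-16 code unit, so the two agree only for text without characters outside the Basic Multilingual Plane.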

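Clicking the 顯示標簽 (Show Tags) button displays the tag sequence produced by our labeling and also saves it as a txt file in the time_output directory. The tag sequence shown on the web page looks like this: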

We can also inspect the saved txt file:

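Clicking the 返回 (Back) button lets us start the next round of labeling. The example above was a simple one; a slightly more complex labeling is shown below: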

The tag sequence it produces (in part) is as follows:

按	O
計	O
划	O
，	O
2	B-TIME
0	I-TIME
1	I-TIME
9	I-TIME
年	I-TIME
8	I-TIME
月	I-TIME
1	I-TIME
0	I-TIME
日	I-TIME
，	O
榮	O
耀	O
智	O
慧	O
屏	O
將	O
在	O
華	O
為	O
開	O
發	O
者	O
大	O
會	O
上	O
正	O
式	O
亮	O
相	O
，	O
在	O
8	B-TIME
月	I-TIME
6	I-TIME
日	I-TIME
，	O
榮	O
耀	O
官	O
微	O
表	O
示	O
該	O
產	O
品	O
......
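A saved file in this format can be read back to recover the labeled time expressions. The sketch below is one possible way to do that, assuming every line is a single character, a tab, and a tag; the function name `extract_entities` and the short sample are made up for the example.

```python
def extract_entities(lines, label="TIME"):
    """Collect (start, text) pairs from "character<TAB>tag" lines."""
    entities = []
    current = None  # [start index, text so far] of the entity being read
    for i, line in enumerate(lines):
        char, tag = line.rstrip("\n").rsplit("\t", 1)
        if tag == "B-" + label:
            # A new entity begins; close the previous one if it is still open
            if current is not None:
                entities.append((current[0], current[1]))
            current = [i, char]
        elif tag == "I-" + label and current is not None:
            current[1] += char
        else:
            # Any other tag ends the entity being read
            if current is not None:
                entities.append((current[0], current[1]))
            current = None
    if current is not None:
        entities.append((current[0], current[1]))
    return entities


# A few lines in the same format as the saved txt file
sample_lines = ["在\tO\n", "8\tB-TIME\n", "月\tI-TIME\n",
                "6\tI-TIME\n", "日\tI-TIME\n", "，\tO\n"]
entities = extract_entities(sample_lines)
print(entities)
```

Since the file holds one character per line, the line index of each B-TIME tag is also the start position of that time expression in the original corpus.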

Summary

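This platform is only an annotation tool for the early stage of sequence labeling work and does not involve any specific algorithm. It will continue to be opened up over time; if you have good suggestions, feel free to leave a comment.
The project has been uploaded to GitHub: https://github.com/percent4/entity_tagging_platform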

Note: you are welcome to follow my WeChat public account, Python爬蟲與算法 (WeChat ID: easy_web_scrape).
