Python text editing with re.sub: read text, remove specified characters, and save

Reposted article. 2019-10-10 16:56. Tags: python / text editing / re.sub / replacement

Suppose we have the following task.
We have a text file whose content looks like this:

ws0012cs3d4 這，里。3是.一!?些a文 Z本...

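The text contains Chinese and English punctuation, Latin letters, digits, Chinese characters, spaces, and so on. We need to read it line by line, keep the leading label (ws0012cs3d4) unchanged, filter the text after it down to Chinese characters only, and then join the label and the filtered text back together, in the following form: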

ws0012cs3d4 這里是一些文本

The result is saved to a new file.
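The per-line step described above can be sketched in isolation. This is a minimal sketch for Python 3, not the script's own method: it keeps only Chinese characters with a whitelist rather than listing the characters to remove. The names `clean_line` and `NON_CHINESE` are illustrative, and treating "Chinese text" as the CJK Unified Ideographs range U+4E00 to U+9FFF is an assumption.

```python
# -*- coding: utf-8 -*-
import re

# Assumption: "Chinese text" means the CJK Unified Ideographs block (U+4E00..U+9FFF).
# The negated class matches every character outside that block.
NON_CHINESE = re.compile(r'[^\u4e00-\u9fff]')


def clean_line(line):
    """Keep the leading label and strip all non-Chinese characters from the rest."""
    # Split once, at the first space: label on the left, free text on the right.
    label, text = line.strip().split(' ', 1)
    return label + ' ' + NON_CHINESE.sub('', text)


result = clean_line('ws0012cs3d4 這，里。3是.一!?些a文 Z本...')
# result == 'ws0012cs3d4 這里是一些文本'
```

A whitelist of this kind removes symbols that nobody thought to list, at the cost of also dropping any rarer characters that fall outside the chosen range.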

The code is as follows:

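# -*- coding: utf-8 -*-
'''
Read a txt file, remove all numbers, symbols, tabs and prosody marks from
each line, and save the result in a new txt file that can be used in mtts.
There are two functions here, each doing a different job; use whichever you need.
'''

from __future__ import unicode_literals
import io
import re


def remove_lines(txtfile):
    '''
    Keep every other line, starting with the first (0-based even indices),
    and drop the rest. The result is written to newfile.txt.
    '''
    # io.open with an explicit encoding behaves the same on Python 2 and 3;
    # the input file is assumed to be UTF-8
    with io.open(txtfile, encoding='utf-8') as reader, \
            io.open('newfile.txt', 'w', encoding='utf-8') as writer:
        for index, line in enumerate(reader):
            if index % 2 == 0:
                writer.write(line)
    return 'newfile.txt'


def _txt_preprocess(txtfile):
    with io.open(txtfile, encoding='utf-8') as reader, \
            io.open('newfile2.txt', 'w', encoding='utf-8') as writer:
        txtlines = [x.strip() for x in reader.readlines()]
        for line in txtlines:
            if not line:    # skip blank lines, which have no label to split off
                continue
            num, txt = line.split(' ', 1)    # split the line at the first space only
            # [] lists every character to filter out; the last two ranges are
            # full-width lowercase and uppercase Latin letters
            txt = re.sub('[,.!?，。、：；？！… “ ”# 0-9 a-z A-Z ａ-ｚＡ-Ｚ]', '', txt)
            space = ' '
            changeline = '\n'
            tmp = num + space + txt + changeline    # reassemble label and text
            writer.write(tmp)


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser(
        description="Strip punctuation, digits and Latin letters from the text part of each line.")
    parser.add_argument(
        "txtfile",
        help=
        "Full path to a txt file in which each line contains num and txt (separated by a white space)"
    )
    args = parser.parse_args()

    _txt_preprocess(args.txtfile)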
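The `remove_lines` helper is defined but never called in the script, so its effect is easy to misread. Because `enumerate` counts from 0, the test `index % 2 == 0` keeps the 1st, 3rd, 5th, ... lines and drops the ones in between. The sketch below shows the same selection on an in-memory list; `keep_even_indexed` and the sample data are hypothetical names used only for illustration.

```python
def keep_even_indexed(lines):
    """Return the lines at 0-based indices 0, 2, 4, ... (the 1st, 3rd, 5th lines)."""
    return [line for index, line in enumerate(lines) if index % 2 == 0]


sample = ['line 1\n', 'line 2\n', 'line 3\n', 'line 4\n', 'line 5\n']
kept = keep_even_indexed(sample)
# kept == ['line 1\n', 'line 3\n', 'line 5\n']

# For a list already in memory, slicing with a step of 2 selects the same lines.
assert kept == sample[::2]
```

The `enumerate` form in the script has the advantage of streaming the file line by line, whereas slicing requires the whole file to be loaded first.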
