Python: removing non-Chinese characters and blank lines from a text file


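This method removes every non-Chinese character from a text file and then deletes the blank lines that are left behind.

First, the task is done in two steps.

The text file to process: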

當時這個

3
00:00:04,02 --> 00:00:05,13
有喝多的武將

4
00:00:05,18 --> 00:00:07,05
一看許姬很漂亮

5
00:00:07,12 --> 00:00:09,06
就欲行非禮就拽上去

6
00:00:09,09 --> 00:00:10,21
這許姬手也挺快

7
00:00:11,03 --> 00:00:12,14
黑咕隆能看不見呢

8
00:00:12,14 --> 00:00:13,08
順手誇

9
00:00:13,13 --> 00:00:17,05
把這武將頭盔頂上那鷹帶給摘下來了

10
00:00:17,12 --> 00:00:19,03
哎就是頭盔上綁着帶了
Before processing (處理前.txt)

The following code removes the non-Chinese characters from the file:

import re

def del_no_china(infile, outfile):
    # Keep only the characters in the range U+4E00..U+9FA5 on each line.
    with open(infile, 'r', encoding="utf-8") as infopen, \
         open(outfile, 'w', encoding="utf-8") as outfopen:
        for line in infopen:
            k = re.findall('[\u4e00-\u9fa5]', line)
            s = ''.join(k)
            # A line with no Chinese characters becomes an empty line.
            outfopen.write(s + "\n")

del_no_china("處理前.txt", "處理中.txt")

Running the code above produces the following result:

當時這個



有喝多的武將



一看許姬很漂亮



就欲行非禮就拽上去



這許姬手也挺快



黑咕隆能看不見呢



順手誇



把這武將頭盔頂上那鷹帶給摘下來了



哎就是頭盔上綁着帶了
Intermediate result (處理中.txt)
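
Three empty lines appear between subtitles in this intermediate file. The index line, the timestamp line, and the original blank separator each contain no character in the range U+4E00 to U+9FA5, so each one is written out as an empty line. As a small check of that behaviour, the sketch below applies the same character class to single strings. The name keep_han is a hypothetical helper and is not part of the original script. Note that this class drops punctuation, Latin letters, and digits too, as well as any ideograph outside the range (for example, those in the CJK extension blocks).

```python
import re

# Same character class as in del_no_china: U+4E00 through U+9FA5.
HAN = re.compile('[\u4e00-\u9fa5]')

def keep_han(text):
    """Return only the characters of `text` that HAN matches (hypothetical helper)."""
    return ''.join(HAN.findall(text))

# A timestamp line contains no matching character, so nothing is left.
print(repr(keep_han("00:00:04,02 --> 00:00:05,13\n")))
# A subtitle line keeps its Chinese characters and loses the newline.
print(repr(keep_han("有喝多的武將\n")))
```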

The following code removes the blank lines from the text above:

def delblankline(infile, outfile):
    # Copy only the lines that contain at least one non-whitespace character.
    with open(infile, 'r', encoding="utf-8") as infopen, \
         open(outfile, 'w', encoding="utf-8") as outfopen:
        for line in infopen:
            # split() returns an empty (falsy) list for a blank line.
            if line.split():
                outfopen.write(line)

delblankline("處理中.txt", "處理后.txt")

Running the code above produces the following result:

當時這個
有喝多的武將
一看許姬很漂亮
就欲行非禮就拽上去
這許姬手也挺快
黑咕隆能看不見呢
順手誇
把這武將頭盔頂上那鷹帶給摘下來了
哎就是頭盔上綁着帶了
After processing (處理后.txt)
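
delblankline depends on one detail: str.split() called without arguments returns an empty list for a line that holds only whitespace, and an empty list is falsy. Here is a minimal sketch of that test on an in-memory list. The names drop_blank_lines and sample are made up for illustration.

```python
def drop_blank_lines(lines):
    """Return the lines that contain at least one non-whitespace character."""
    # str.split() with no arguments returns [] for an empty or
    # whitespace-only string, and an empty list is falsy.
    return [line for line in lines if line.split()]

# Illustrative input: two subtitle lines separated by blank-looking lines.
sample = ["當時這個\n", "\n", "   \n", "\t\n", "有喝多的武將\n"]
print(drop_blank_lines(sample))
```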

 

The two steps combined into one script:

import re

def del_no_china(infile, outfile):
    # Step 1: remove the non-Chinese characters from every line.
    with open(infile, 'r', encoding="utf-8") as infopen, \
         open(outfile, 'w', encoding="utf-8") as outfopen:
        for line in infopen:
            k = re.findall('[\u4e00-\u9fa5]', line)
            s = ''.join(k)
            outfopen.write(s + '\n')  # one output line per input line

def delblankline(infile, outfile):
    # Step 2: remove the blank lines left behind by step 1.
    with open(infile, 'r', encoding="utf-8") as infopen, \
         open(outfile, 'w', encoding="utf-8") as outfopen:
        for line in infopen:
            if line.split():
                outfopen.write(line)

del_no_china("處理前.txt", "處理中.txt")
delblankline("處理中.txt", "處理后.txt")

The final result is the same as before.
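
The intermediate file is only needed because the filtering and the blank-line removal are separate functions. As a possible variation, not the method used above, the sketch below does both in a single pass and skips a line when nothing is left after filtering. The function name keep_chinese_lines is hypothetical. It assumes the same UTF-8 input and the same U+4E00 to U+9FA5 character class.

```python
import re

# Same character class as in the two-step version.
HAN = re.compile('[\u4e00-\u9fa5]')

def keep_chinese_lines(infile, outfile):
    """Write the Chinese characters of each input line, skipping empty results."""
    with open(infile, 'r', encoding="utf-8") as src, \
         open(outfile, 'w', encoding="utf-8") as dst:
        for line in src:
            s = ''.join(HAN.findall(line))
            if s:  # nothing left means the line had no Chinese characters
                dst.write(s + "\n")
```

Calling keep_chinese_lines("處理前.txt", "處理后.txt") should then give the same output file as the two-step script, without writing 處理中.txt.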

 

