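The subtitle archive downloaded from an online subtitle library is shown in the figure below:
Next, extract it into the current directory with 7z. Part of the resulting file list is shown below (228 subtitle files in total):
Open one of the files to check its format: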
The files are ANSI-encoded, so they must first be batch-converted to UTF-8. For one way to do this, see: https://www.cnblogs.com/abc789/p/12148402.html
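As an alternative to the approach in the linked post, the re-encoding step can also be sketched in Python. This is only a minimal sketch under stated assumptions: it assumes "ANSI" here means the GBK code page (typical on Simplified Chinese Windows; Big5 files would need src_encoding='big5'), and the function name convert_dir_to_utf8 and its parameters are hypothetical, not taken from the linked post.

```python
import os


def convert_dir_to_utf8(src_dir, dst_dir, src_encoding='gbk'):
    """Re-encode every .srt file in src_dir as UTF-8 and write it to dst_dir.

    src_encoding is an assumption: 'gbk' suits Simplified Chinese Windows,
    'big5' would suit Traditional Chinese files.
    """
    os.makedirs(dst_dir, exist_ok=True)
    converted = []
    for name in sorted(os.listdir(src_dir)):
        if not name.lower().endswith('.srt'):
            continue
        # errors='replace' turns undecodable bytes into U+FFFD instead of
        # aborting the whole batch on one bad file.
        with open(os.path.join(src_dir, name), 'r',
                  encoding=src_encoding, errors='replace') as f:
            text = f.read()
        with open(os.path.join(dst_dir, name), 'w', encoding='utf-8') as f:
            f.write(text)
        converted.append(name)
    return converted
```

Writing to a separate output directory keeps the original files untouched, so the conversion can be rerun with a different src_encoding if the text comes out garbled.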
Then use the following Python script to batch-convert the files to JSON.
# -*- coding: utf-8 -*-
# Written for Python 3.
import json
import os


def proSrtFile(path, srtFile, pathDest):
    """Convert one bilingual .srt file into a JSON file of eng/chn pairs."""
    result = []
    is_eng = False  # True when the next text line is the English line
    is_chn = False  # True when the next text line is the Chinese line
    eng = ''
    chn = ''

    # utf-8-sig also accepts files that start with a byte order mark (BOM).
    with open(os.path.join(path, srtFile), 'r', encoding='utf-8-sig') as f:
        lines = f.readlines()

    for line in lines:
        line = line.strip()
        # Normalize punctuation: drop ellipses, swap double quotes for single
        # quotes, and put a space after each punctuation mark.
        line = line.replace('...', ' ')
        line = line.replace('"', "'")
        for mark in '.,?:;!':
            line = line.replace(mark, mark + ' ')
        line = line.replace('. . ', '. ')

        # Skip the timestamp line.
        if '-->' in line:
            continue
        line = line.replace('-', '')

        if line == '':
            # A blank line ends the cue. Output nothing if either the
            # Chinese or the English line is empty.
            if eng != '' and chn != '':
                result.append({'eng': eng, 'chn': chn})
            is_eng = False
            is_chn = False
            eng = ''
            chn = ''
            continue

        # A purely numeric line is the cue index: start a new entry.
        if line.isdigit():
            is_eng = True
            is_chn = False
            eng = ''
            chn = ''
            continue
        if is_eng:
            eng = line
            is_eng = False
            is_chn = True
            continue
        if is_chn:
            chn = line
            is_chn = False
            continue

    # Flush the last cue in case the file does not end with a blank line.
    if eng != '' and chn != '':
        result.append({'eng': eng, 'chn': chn})

    # Write one JSON object per line. json.dumps handles escaping, and the
    # ',\n' join avoids a trailing comma before the closing bracket.
    body = ',\n'.join(json.dumps(item, ensure_ascii=False) for item in result)
    outPath = os.path.join(pathDest, srtFile[0:14] + '.json')
    with open(outPath, 'w', encoding='utf-8') as f:
        f.write('[\n' + body + '\n]')


# Directory of the original subtitle files, already converted to UTF-8.
pathSrc = "D:\\data\\corona_projects\\LearnEnglishSentences_json\\data\\friends\\001\\"
# Directory that receives the converted JSON files.
pathDest = "D:\\data\\corona_projects\\LearnEnglishSentences_json\\eng"

# Create the output directory if it does not exist yet.
os.makedirs(pathDest, exist_ok=True)

indexList = []
for root, dirs, files in os.walk(pathSrc):
    for fn in files:
        if fn.endswith('.srt'):
            print(fn)
            # The first 14 characters of the file name serve as the title.
            indexList.append('{"title":"' + fn[0:14] + '","file":"' + fn[0:14] + '.json"},')
            proSrtFile(root, fn, pathDest)

# Temporary index file: one entry per converted subtitle file.
indexPath = os.path.join(pathDest, 'x_temp_index.xml')
with open(indexPath, 'w', encoding='utf-8') as f:
    f.write('\n'.join(indexList))

print('--------------------------------------------')
print(pathDest)
print(indexPath)

Before running the script, change the pathSrc and pathDest assignments near the bottom: pathSrc is the directory holding the original subtitle files that have already been converted to UTF-8, and pathDest is the directory where the converted JSON files are written.
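Once the script has run, the output directory can be sanity-checked. The sketch below is not part of the original script: check_json_dir is a hypothetical helper that tries to parse every .json file in a directory, then reports how many entries each file holds and which files fail to parse.

```python
import json
import os


def check_json_dir(json_dir):
    """Try to parse every .json file in json_dir.

    Returns (counts, invalid): counts maps each parsable file name to its
    number of entries, invalid lists the files that could not be parsed.
    """
    counts = {}
    invalid = []
    for name in sorted(os.listdir(json_dir)):
        if not name.endswith('.json'):
            continue
        try:
            with open(os.path.join(json_dir, name), 'r', encoding='utf-8') as f:
                data = json.load(f)
        except ValueError:
            # Covers both malformed JSON and bytes that are not valid UTF-8.
            invalid.append(name)
            continue
        counts[name] = len(data)
    return counts, invalid
```

Calling check_json_dir(pathDest) after a conversion run gives a quick view of whether any file came out empty or malformed, for example because of a trailing comma before the closing bracket.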
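One of the converted JSON files is shown in the figure below:
The main transformations applied to the source files are: the English text is stored under eng and the Chinese text under chn, each English sentence and its Chinese translation are merged into a single line, and the result is saved in JSON format.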
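To make this mapping concrete, here is a simplified sketch of the pairing rule applied to an in-memory string. The sample cue text and the helper name pair_lines are made up for illustration. It assumes each cue is laid out as index, timestamp, English line, Chinese line, then a blank line, which is the layout the script above expects, and it leaves out the punctuation-spacing replacements.

```python
import json


def pair_lines(srt_text):
    """Pair the English and Chinese line of each cue in a bilingual SRT string."""
    pairs = []
    for block in srt_text.strip().split('\n\n'):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Drop the numeric cue index and the timestamp line.
        text = [ln for ln in lines if not ln.isdigit() and '-->' not in ln]
        # Keep the cue only when both an English and a Chinese line exist.
        if len(text) >= 2:
            pairs.append({'eng': text[0], 'chn': text[1]})
    return pairs


# Hypothetical sample input, not taken from the real subtitle files.
sample = """1
00:00:01,000 --> 00:00:02,500
How are you?
你好嗎?

2
00:00:03,000 --> 00:00:04,000
Fine, thanks.
我很好,謝謝。
"""

print(json.dumps(pair_lines(sample), ensure_ascii=False, indent=1))
```

Each cue becomes one object with an eng key and a chn key, and a cue missing either language is skipped, mirroring the check in the script.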