Abstract
Open-text semantic parsers are designed to interpret any statement in natural language by inferring a corresponding meaning representation (MR – a formal representation of its sense).
Unfortunately, large scale systems cannot be easily machine-learned due to a lack of directly supervised data.
We propose a method that learns to assign MRs to a wide range of text (using a dictionary of more than 70,000 words mapped to more than 40,000 entities) thanks to a training scheme that combines learning from knowledge bases (e.g. WordNet) with learning from raw text.
The model jointly learns representations of words, entities and MRs via a multi-task training process operating on these diverse sources of data.
Hence, the system ends up providing methods for knowledge acquisition and word sense disambiguation within the context of semantic parsing in a single elegant framework.
Experiments on these various tasks indicate the promise of the approach.
1 Introduction
Text classification is a classic topic for natural language processing, in which one needs to assign predefined categories to free-text documents.
The range of text classification research goes from designing the best features to choosing the best possible machine learning classifiers.
To date, almost all techniques of text classification are based on words, in which simple statistics of some ordered word combinations (such as n-grams) usually perform the best [12].
On the other hand, many researchers have found convolutional networks (ConvNets) [17] [18] are useful in extracting information from raw signals, ranging from computer vision applications to speech recognition and others.
In particular, time-delay networks used in the early days of deep learning research are essentially convolutional networks that model sequential data [1] [31].
In this article we explore treating text as a kind of raw signal at character level, and applying temporal (one-dimensional) ConvNets to it.
For this article we only used a classification task as a way to exemplify ConvNets' ability to understand texts.
Historically we know that ConvNets usually require large-scale datasets to work, therefore we also build several of them.
An extensive set of comparisons is offered with traditional models and other deep learning models.
Applying convolutional networks to text classification or natural language processing at large was explored in the literature.
It has been shown that ConvNets can be directly applied to distributed [6] [16] or discrete [13] embedding of words, without any knowledge on the syntactic or semantic structures of a language.
These approaches have been proven to be competitive to traditional models.
There are also related works that use character-level features for language processing.
These include using character-level n-grams with linear classifiers [15], and incorporating character-level features into ConvNets [28] [29].
In particular, these ConvNet approaches use words as a basis, in which character-level features extracted at word [28] or word n-gram [29] level form a distributed representation.
Improvements for part-of-speech tagging and information retrieval were observed.
This article is the first to apply ConvNets only on characters.
We show that when trained on large-scale datasets, deep ConvNets do not require the knowledge of words, in addition to the conclusion from previous research that ConvNets do not require the knowledge about the syntactic or semantic structure of a language.
This simplification of engineering could be crucial for a single system that can work for different languages, since characters always constitute a necessary construct regardless of whether segmentation into words is possible.
Working on only characters also has the advantage that abnormal character combinations such as misspellings and emoticons may be naturally learnt.
An early version of this work entitled “Text Understanding from Scratch” was posted in Feb 2015 as arXiv:1502.01710. The present paper has considerably more experimental results and a rewritten introduction.
2 Character-level Convolutional Networks
In this section, we introduce the design of character-level ConvNets for text classification. The design is modular, where the gradients are obtained by back-propagation [27] to perform optimization.
2.1 Key Modules
The main component is the temporal convolutional module, which simply computes a 1-D convolution.
Suppose we have a discrete input function g(x) ∈ [1, l] → ℝ and a discrete kernel function f(x) ∈ [1, k] → ℝ.
The convolution h(y) between f(x) and g(x) with stride d is defined as
$$h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c)$$
where c = k - d + 1 is an offset constant.
Just as in traditional convolutional networks in vision, the module is parameterized by a set of such kernel functions f_ij(x) (i = 1, 2, ..., m and j = 1, 2, ..., n) which we call weights, on a set of inputs g_i(x) and outputs h_j(y).
We call each g_i (or h_j) an input (or output) feature, and m (or n) the input (or output) feature size.
The output h_j is obtained by a sum over i of the convolutions between g_i and f_ij.
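As a concrete illustration, the following is a minimal NumPy sketch of this temporal convolution module; the function and variable names are ours, not the paper's, and the output length (l - k)/d + 1 is one natural choice consistent with the definition above.

    import numpy as np

    def temporal_conv(g, f, d):
        """1-D convolution h(y) = sum_{x=1..k} f(x) * g(y*d - x + c), with c = k - d + 1."""
        l, k = len(g), len(f)
        c = k - d + 1
        n_out = (l - k) // d + 1                       # assumed output length
        return np.array([
            sum(f[x - 1] * g[y * d - x + c - 1] for x in range(1, k + 1))
            for y in range(1, n_out + 1)               # 1-based indices as in the formula
        ])

    def temporal_conv_module(gs, fs, d):
        """Output feature h_j is the sum over i of the convolutions of g_i with f_ij."""
        m, n = len(gs), len(fs[0])
        return [sum(temporal_conv(gs[i], fs[i][j], d) for i in range(m)) for j in range(n)]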
One key module that helped us to train deeper models is temporal max-pooling.
It is the 1-D version of the max-pooling module used in computer vision [2].
Given a discrete input function g(x) ∈ [1, l] → ℝ, the max-pooling function h(y) of g(x), with pooling size k and stride d, is defined as
$$h(y) = \max_{x=1}^{k} g(y \cdot d - x + c)$$
where c = k - d + 1 is an offset constant.
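A corresponding sketch of temporal max-pooling, under the same indexing assumptions as the convolution example above (names are ours):

    import numpy as np

    def temporal_max_pool(g, k, d):
        """1-D max-pooling h(y) = max_{x=1..k} g(y*d - x + c), with c = k - d + 1."""
        l = len(g)
        c = k - d + 1
        n_out = (l - k) // d + 1                       # assumed output length
        return np.array([
            max(g[y * d - x + c - 1] for x in range(1, k + 1))
            for y in range(1, n_out + 1)
        ])

    # Non-overlapping pooling (the pooling layers described later all use d == k),
    # e.g. temporal_max_pool(g, k=3, d=3) keeps the maximum of each consecutive triple.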
This very pooling module enabled us to train ConvNets deeper than 6 layers, where all others fail.
The analysis by [3] might shed some light on this.
The non-linearity used in our model is the rectifier or thresholding function h(x) = max(0, x), which makes our convolutional layers similar to rectified linear units (ReLUs) [24].
The algorithm used is stochastic gradient descent (SGD) with a minibatch of size 128, using momentum [26] [30] 0.9 and initial step size 0.01, which is halved every 3 epochs for 10 times.
Each epoch takes a fixed number of random training samples uniformly sampled across classes.
This number will later be detailed for each dataset separately.
The implementation is done using Torch 7 [4].
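The implementation itself is in Torch 7; purely as an illustration of the schedule just described, here is a small Python sketch (function names and parameter handling are our assumptions):

    def learning_rate(epoch, base_lr=0.01, halve_every=3, max_halvings=10):
        """Initial step size 0.01, halved every 3 epochs, for at most 10 halvings."""
        return base_lr * 0.5 ** min(epoch // halve_every, max_halvings)

    def sgd_momentum_step(params, grads, velocities, lr, momentum=0.9):
        """One SGD update with momentum 0.9 over lists of NumPy parameter arrays."""
        for p, g, v in zip(params, grads, velocities):
            v *= momentum
            v -= lr * g
            p += v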
2.2 Character quantization
Our models accept a sequence of encoded characters as input.
The encoding is done by prescribing an alphabet of size m for the input language, and then quantizing each character using 1-of-m encoding (or “one-hot” encoding).
Then, the sequence of characters is transformed to a sequence of such m-sized vectors with fixed length l_0. Any character exceeding length l_0 is ignored, and any characters that are not in the alphabet, including blank characters, are quantized as all-zero vectors.
The character quantization order is backward so that the latest reading on characters is always placed near the beginning of the output, making it easy for fully connected layers to associate weights with the latest reading.
The alphabet used in all of our models consists of 70 characters, including 26 English letters, 10 digits, 33 other characters and the newline character. The non-space characters are:
Later we also compare with models that use a different alphabet in which we distinguish between upper-case and lower-case letters.
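A minimal sketch of this quantization step, assuming a placeholder alphabet and one plausible reading of the truncation rule (the exact 70-character alphabet is not reproduced here):

    import numpy as np

    # Illustrative alphabet only; the actual models use 70 characters
    # (26 letters, 10 digits, 33 other characters and the newline character).
    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%&*~+=<>()[]{}\n"
    CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

    def quantize(text, l0=1014):
        """1-of-m encoding of a string as an (m, l0) matrix.

        Characters beyond position l0 are ignored, characters outside the alphabet
        (including blanks) become all-zero columns, and the text is read backward so
        the latest characters sit near the beginning of the output."""
        m = len(ALPHABET)
        x = np.zeros((m, l0), dtype=np.float32)
        for column, ch in enumerate(reversed(text[:l0])):
            row = CHAR_TO_INDEX.get(ch)
            if row is not None:
                x[row, column] = 1.0
        return x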
2.3 Model Design
We designed 2 ConvNets - one large and one small.
They are both 9 layers deep with 6 convolutional layers and 3 fully-connected layers.
Figure 1 gives an illustration.

The input has a number of features equal to 70 due to our character quantization method, and the input feature length is 1014.
It seems that 1014 characters could already capture most of the texts of interest.
We also insert 2 dropout [10] modules in between the 3 fully-connected layers to regularize.
They have a dropout probability of 0.5.
Table 1 lists the configurations for convolutional layers, and table 2 lists the configurations for fully-connected (linear) layers.
Table 1: Convolutional layers used in our experiments.
The convolutional layers have stride 1 and pooling layers are all non-overlapping ones, so we omit the description of their strides.

We initialize the weights using a Gaussian distribution.
The mean and standard deviation used for initializing the large model are (0, 0.02) and for the small model (0, 0.05).
Table 2: Fully-connected layers used in our experiments.
The number of output units for the last layer is determined by the problem.
For example, for a 10-class classification problem it will be 10.
For different problems the input lengths may be different (for example, in our case l_0 = 1014), and so are the frame lengths.
From our model design, it is easy to know that, given the input length, the output frame length after the last convolutional layer (but before any of the fully-connected layers) is l_6 = (l_0 - 96)/27.
This number multiplied with the frame size at layer 6 will give the input dimension the first fully-connected layer accepts.
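The relation l_6 = (l_0 - 96)/27 can be checked mechanically. The sketch below assumes kernel sizes 7, 7, 3, 3, 3, 3 with non-overlapping pooling of size 3 after layers 1, 2 and 6, and a frame size of 1024 at layer 6 for the large model; these numbers are our reading of Tables 1 and 2, which are not reproduced above.

    def conv_output_length(l0,
                           kernels=(7, 7, 3, 3, 3, 3),          # assumed kernel sizes
                           pools=(3, 3, None, None, None, 3)):  # assumed pooling positions
        """Frame length after the six convolutional layers (stride 1, valid convolution)."""
        l = l0
        for k, p in zip(kernels, pools):
            l = l - k + 1              # convolution with stride 1
            if p:
                l //= p                # non-overlapping max-pooling of size p
        return l

    l6 = conv_output_length(1014)      # 34 == (1014 - 96) // 27
    first_fc_input = l6 * 1024         # input dimension of the first fully-connected layer (large model)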
2.4 Data Augmentation using Thesaurus
Many researchers have found that appropriate data augmentation techniques are useful for controlling generalization error for deep learning models.
These techniques usually work well when we could find appropriate invariance properties that the model should possess.
In terms of texts, it is not reasonable to augment the data using signal transformations as done in image or speech recognition, because the exact order of characters may form rigorous syntactic and semantic meaning.
Therefore, the best way to do data augmentation would have been using human rephrases of sentences, but this is unrealistic and expensive due to the large volume of samples in our datasets.
As a result, the most natural choice in data augmentation for us is to replace words or phrases with their synonyms.
We experimented with data augmentation by using an English thesaurus, which is obtained from the mytheas component used in the LibreOffice project.
That thesaurus in turn was obtained from WordNet [7], where every synonym to a word or phrase is ranked by the semantic closeness to the most frequently seen meaning.
To decide on how many words to replace, we extract all replaceable words from the given text and randomly choose r of them to be replaced.
The probability of number r is determined by a geometric distribution with parameter p in which P[r] ~ p^r.
The index s of the synonym chosen given a word is also determined by another geometric distribution in which P[s] ~ q^s.
This way, the probability of a synonym chosen becomes smaller when it moves distant from the most frequently seen meaning.
We will report the results using this new data augmentation technique with p = 0.5 and q = 0.5.
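A sketch of this augmentation procedure, assuming the thesaurus is represented as a dictionary from a word to its synonyms ordered by semantic closeness (the sampling helpers are ours):

    import random

    def sample_geometric(p):
        """Sample r = 0, 1, 2, ... with P[r] proportional to p**r."""
        r = 0
        while random.random() < p:
            r += 1
        return r

    def augment(words, thesaurus, p=0.5, q=0.5):
        """Replace a randomly chosen number of replaceable words with synonyms."""
        replaceable = [i for i, w in enumerate(words) if thesaurus.get(w)]
        r = min(sample_geometric(p), len(replaceable))       # how many words to replace
        out = list(words)
        for i in random.sample(replaceable, r):
            synonyms = thesaurus[words[i]]
            s = min(sample_geometric(q), len(synonyms) - 1)  # closer meanings are more likely
            out[i] = synonyms[s]
        return out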
3 Comparison Models
To offer fair comparisons to competitive models, we conducted a series of experiments with both traditional and deep learning methods.
We tried our best to choose models that can provide comparable and competitive results, and the results are reported faithfully without any model selection.
3.1 Traditional Methods
We refer to traditional methods as those that use a hand-crafted feature extractor and a linear classifier.
The classifier used is a multinomial logistic regression in all these models.
Bag-of-words and its TFIDF.
For each dataset, the bag-of-words model is constructed by selecting the 50,000 most frequent words from the training subset.
For the normal bag-of-words, we use the counts of each word as the features.
For the TFIDF (term-frequency inverse-document-frequency) [14] version, we use the counts as the term-frequency.
The inverse document frequency is the logarithm of the division between the total number of samples and the number of samples containing the word in the training subset.
The features are normalized by dividing by the largest feature value.
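A minimal sketch of these bag-of-words and TFIDF features, assuming simple whitespace tokenization (the helper names are ours):

    import math
    from collections import Counter

    def vocabulary(train_docs, size=50000):
        """The most frequent words in the training subset."""
        counts = Counter(w for doc in train_docs for w in doc.split())
        return [w for w, _ in counts.most_common(size)]

    def inverse_document_frequency(train_docs, vocab):
        """log(total samples / samples containing the word), over the training subset."""
        n = len(train_docs)
        doc_sets = [set(doc.split()) for doc in train_docs]
        return {w: math.log(n / max(1, sum(w in s for s in doc_sets))) for w in vocab}

    def bow_features(doc, vocab, idf=None):
        """Counts (or TFIDF) features, normalized by the largest feature value."""
        counts = Counter(doc.split())
        feats = [counts[w] * (idf[w] if idf else 1.0) for w in vocab]
        peak = max(feats) if feats and max(feats) > 0 else 1.0
        return [f / peak for f in feats]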
Bag-of-ngrams and its TFIDF.
The bag-of-ngrams models are constructed by selecting the 500,000 most frequent n-grams (up to 5-grams) from the training subset for each dataset.
The feature values are computed the same way as in the bag-of-words model.
Bag-of-means on word embedding.
We also have an experimental model that uses k-means on word2vec [23] learnt from the training subset of each dataset, and then uses these learnt means as representatives of the clustered words.
We take into consideration all the words that appeared more than 5 times in the training subset.
The dimension of the embedding is 300.
The bag-of-means features are computed the same way as in the bag-of-words model.
The number of means is 5000.
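A sketch of this bag-of-means pipeline using gensim and scikit-learn (the choice of libraries and any parameters beyond those stated above are our assumptions; gensim >= 4 is assumed):

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans

    def bag_of_means_clusters(tokenized_train_docs, dim=300, n_means=5000):
        """word2vec on the training subset, then k-means over the learnt word vectors."""
        w2v = Word2Vec(tokenized_train_docs, vector_size=dim,
                       min_count=6)                     # keep words appearing more than 5 times
        vocab = list(w2v.wv.key_to_index)
        kmeans = KMeans(n_clusters=n_means).fit(np.stack([w2v.wv[w] for w in vocab]))
        return dict(zip(vocab, kmeans.labels_)), n_means

    def bag_of_means_features(tokens, word_to_cluster, n_means):
        """Counts of cluster ids, normalized by the largest value (as for bag-of-words)."""
        x = np.zeros(n_means)
        for t in tokens:
            if t in word_to_cluster:
                x[word_to_cluster[t]] += 1
        peak = x.max()
        return x / peak if peak > 0 else x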
3.2 Deep Learning Methods
Recently deep learning methods have started to be applied to text classification.
We choose two simple and representative models for comparison, in which one is a word-based ConvNet and the other a simple long-short term memory (LSTM) [11] recurrent neural network model.
Word-based ConvNets.
Among the large number of recent works on word-based ConvNets for text classification, one of the differences is the choice of using pretrained or end-to-end learned word representations.
We offer comparisons with both using the pretrained word2vec [23] embedding [16] and using lookup tables [5].
The embedding size is 300 in both cases, in the same way as our bag-of-means model.
To ensure fair comparison, the models for each case are of the same size as our character-level ConvNets, in terms of both the number of layers and each layer's output size.
Experiments using a thesaurus for data augmentation are also conducted.
Long-short term memory.
We also offer a comparison with a recurrent neural network model, namely long-short term memory (LSTM) [11].
The LSTM model used in our case is word-based, using pretrained word2vec embedding of size 300 as in previous models.
The model is formed by taking mean of the outputs of all LSTM cells to form a feature vector, and then using multinomial logistic regression on this feature vector.
The output dimension is 512.
The variant of LSTM we used is the common “vanilla” architecture [8] [9].
We also used gradient clipping [25] in which the gradient norm is limited to 5.
Figure 2 gives an illustration.
Figure 2: long-short term memory
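A PyTorch sketch of this comparison model as described above (the class and argument names are ours; the paper's own implementation is in Torch 7):

    import torch
    import torch.nn as nn

    class MeanLSTMClassifier(nn.Module):
        """Word-based LSTM: mean of all cell outputs, then multinomial logistic regression."""
        def __init__(self, embed_dim=300, hidden=512, n_classes=4):
            super().__init__()
            self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
            self.classifier = nn.Linear(hidden, n_classes)

        def forward(self, word_vectors):          # (batch, time, 300) pretrained word2vec vectors
            outputs, _ = self.lstm(word_vectors)
            feature = outputs.mean(dim=1)         # mean of LSTM outputs, dimension 512
            return self.classifier(feature)

    # During training, the gradient norm is clipped to 5 before each optimizer step:
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)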
3.3 Choice of Alphabet
For the alphabet of English, one apparent choice is whether to distinguish between upper-case and lower-case letters.
We report experiments on this choice and observed that it usually (but not always) gives worse results when such distinction is made.
One possible explanation might be that semantics do not change with different letter cases, therefore there is a benefit of regularization.
4 Large-scale Datasets and Results
Previous research on ConvNets in different areas has shown that they usually work well with large-scale datasets, especially when the model takes in low-level raw features like characters in our case.
However, most open datasets for text classification are quite small, and large-scale datasets are split with a significantly smaller training set than testing [21].
Therefore, instead of confusing our community more by using them, we built several large-scale datasets for our experiments, ranging from hundreds of thousands to several millions of samples.
Table 3 is a summary.
Table 3: Statistics of our large-scale datasets.
Epoch size is the number of minibatches in one epoch.

AG's news corpus.
We obtained the AG's corpus of news articles on the web.
It contains 496,835 categorized news articles from more than 2000 news sources.
We choose the 4 largest classes from this corpus to construct our dataset, using only the title and description fields.
The number of training samples for each class is 30,000 and testing 1,900.
Sogou news corpus.
This dataset is a combination of the SogouCA and SogouCS news corpora [32], containing in total 2,909,551 news articles in various topic channels.
We then labeled each piece of news using its URL, by manually classifying their domain names.
This gives us a large corpus of news articles labeled with their categories.
There are a large number of categories, but most of them contain only a few articles.
We choose 5 categories – “sports”, “finance”, “entertainment”, “automobile” and “technology”.
The number of training samples selected for each class is 90,000 and testing 12,000.
Although this is a dataset in Chinese, we used the pypinyin package combined with the jieba Chinese segmentation system to produce Pinyin - a phonetic romanization of Chinese.
The models for English can then be applied to this dataset without change.
The fields used are title and content.
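A sketch of this romanization step using the jieba and pypinyin packages mentioned above (the helper name, spacing and any tone handling are our assumptions):

    import jieba                       # Chinese word segmentation
    from pypinyin import lazy_pinyin   # Chinese-to-Pinyin conversion

    def to_pinyin(text):
        """Segment Chinese text, then romanize each word, so that the English
        character-level models can be applied to the result without change."""
        return " ".join("".join(lazy_pinyin(word)) for word in jieba.cut(text))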
Table 4: Testing errors of all the models.
Numbers are in percentage.
“Lg” stands for “large” and “Sm” stands for “small”.
“w2v” is an abbreviation for “word2vec”, and “Lk” for “lookup table”.
“Th” stands for thesaurus.
ConvNets labeled “Full” are those that distinguish between lower and upper letters.

DBPedia ontology dataset.
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia [19].
The DBpedia ontology dataset is constructed by picking 14 nonoverlapping classes from DBpedia 2014.
From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples.
The fields we used for this dataset contain title and abstract of each Wikipedia article.
Yelp reviews.
This dataset contains 1,569,264 samples that have review texts.
Two classification tasks are constructed from this dataset - one predicting the full number of stars the user has given, and the other predicting a polarity label by considering stars 1 and 2 as negative, and 3 and 4 as positive.
The full dataset has 130,000 training samples and 10,000 testing samples for each star, and the polarity dataset has 280,000 training samples and 19,000 test samples for each polarity.
Yahoo! Answers dataset.
We obtained the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset through the Yahoo! Webscope program.
The corpus contains 4,483,032 questions and their answers.
We constructed a topic classification dataset from this corpus using the 10 largest main categories.
Each class contains 140,000 training samples and 5,000 testing samples.
The fields we used include question title, question content and best answer.
Amazon reviews.
We obtained an Amazon review dataset from the Stanford Network Analysis Project (SNAP), which spans 18 years with 34,686,770 reviews from 6,643,669 users on 2,441,053 products [22].
Similarly to the Yelp review dataset, we also constructed 2 datasets - one full score prediction and another polarity prediction.
The full dataset contains 600,000 training samples and 130,000 testing samples in each class, whereas the polarity dataset contains 1,800,000 training samples and 200,000 testing samples in each polarity sentiment.
The fields used are review title and review content.
Table 4 lists all the testing errors we obtained from these datasets for all the applicable models.
Note that since we do not have a Chinese thesaurus, the Sogou News dataset does not have any results using thesaurus augmentation.
We labeled the best result in blue and the worst result in red.
5 Discussion

Figure 3: Relative errors with comparison models
To understand the results in table 4 further, we offer some empirical analysis in this section.
To facilitate our analysis, we present the relative errors in figure 3 with respect to comparison models.
Each of these plots is computed by taking the difference between errors of the comparison model and our character-level ConvNet model, then dividing by the comparison model error.
All ConvNets in the figure are the large models with thesaurus augmentation.
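In other words, the plotted quantity is simply the following (a one-line sketch; the example numbers are our own illustration):

    def relative_error(comparison_error, convnet_error):
        """Relative error in figure 3: positive values favor the character-level ConvNet."""
        return (comparison_error - convnet_error) / comparison_error

    # e.g. relative_error(0.10, 0.08) is about 0.2, i.e. the ConvNet error is 20% lower.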
Character-level ConvNet is an effective method.
The most important conclusion from our experiments is that character-level ConvNets could work for text classification without the need for words.
This is a strong indication that language could also be thought of as a signal no different from any other kind.
Figure 4 shows 12 random first-layer patches learnt by one of our character-level ConvNets for the DBPedia dataset.

Figure 4: First layer weights. For each patch, height is the kernel size and width the alphabet size
Dataset size forms a dichotomy between traditional and ConvNets models.
The most obvious trend coming from all the plots in figure 3 is that the larger datasets tend to perform better.
Traditional methods like n-grams TFIDF remain strong candidates for datasets of size up to several hundreds of thousands, and only until the dataset goes to the scale of several millions do we observe that character-level ConvNets start to do better.
ConvNets may work well for user-generated data.
User-generated data vary in the degree of how well the texts are curated.
For example, in our million-scale datasets, Amazon reviews tend to be raw user-inputs, whereas users might be extra careful in their writings on Yahoo! Answers.
Plots comparing word-based deep models (figures 3c, 3d and 3e) show that character-level ConvNets work better for less curated user-generated texts.
This property suggests that ConvNets may have better applicability to real-world scenarios.
However, further analysis is needed to validate the hypothesis that ConvNets are truly good at identifying exotic character combinations such as misspellings and emoticons, as our experiments alone do not show any explicit evidence.
Choice of alphabet makes a difference.
Figure 3f shows that changing the alphabet by distinguishing between uppercase and lowercase letters could make a difference.
For million-scale datasets, it seems that not making such distinction usually works better.
One possible explanation is that there is a regularization effect, but this is to be validated.
Semantics of tasks may not matter.
Our datasets consist of two kinds of tasks: sentiment analysis (Yelp and Amazon reviews) and topic classification (all others).
This dichotomy in task semantics does not seem to play a role in deciding which method is better.
Bag-of-means is a misuse of word2vec [20].
One of the most obvious facts one could observe from table 4 and figure 3a is that the bag-of-means model performs worse in every case.
Comparing with traditional models, this suggests such a simple use of a distributed word representation may not give us an advantage for text classification.
However, our experiments do not speak for any other language processing tasks or use of word2vec in any other way.
There is no free lunch.
Our experiments once again verify that there is not a single machine learning model that can work for all kinds of datasets.
The factors discussed in this section could all play a role in deciding which method is the best for some specific application.
6 Conclusion and Outlook
This article offers an empirical study on character-level convolutional networks for text classification.
We compared with a large number of traditional and deep learning models using several large-scale datasets.
On one hand, analysis shows that character-level ConvNet is an effective method.
On the other hand, how well our model performs in comparisons depends on many factors, such as dataset size, whether the texts are curated, and the choice of alphabet.
In the future, we hope to apply character-level ConvNets for a broader range of language processing tasks, especially when structured outputs are needed.
Acknowledgement
We gratefully acknowledge the support of NVIDIA Corporation with the donation of 2 Tesla K40 GPUs used for this research.
We gratefully acknowledge the support of Amazon.com Inc for an AWS in Education Research grant used for this research.
References
[1] L. Bottou, F. Fogelman Soulie, P. Blanchet, and J. Lienard. Experiments with time delay networks and dynamic time warping for speaker independent isolated digit recognition. In Proceedings of EuroSpeech 89, volume 2, pages 537–540, Paris, France, 1989.
[2] Y.-L. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2559–2566. IEEE, 2010.
[3] Y.-L. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 111–118, 2010.
[4] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011.
[5] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, Nov. 2011.
[6] C. dos Santos and M. Gatti. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pages 69–78, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics.
[7] C. Fellbaum. Wordnet and wordnets. In K. Brown, editor, Encyclopedia of Language and Linguistics, pages 665–670, Oxford, 2005. Elsevier.
[8] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.
[9] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey. CoRR, abs/1503.04069, 2015.
[10] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[11] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov. 1997.
[12] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the 10th European Conference on Machine Learning, pages 137–142. Springer-Verlag, 1998.
[13] R. Johnson and T. Zhang. Effective use of word order for text categorization with convolutional neural networks. CoRR, abs/1412.1058, 2014.
[14] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21, 1972.
[15] I. Kanaris, K. Kanaris, I. Houvardas, and E. Stamatatos. Words versus character n-grams for anti-spam filtering. International Journal on Artificial Intelligence Tools, 16(06):1047–1067, 2007.
[16] Y. Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics.
[17] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, Winter 1989.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
[19] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 2014.
[20] G. Lev, B. Klein, and L. Wolf. In defense of word embedding for generic text representation. In C. Biemann, S. Handschuh, A. Freitas, F. Meziane, and E. Métais, editors, Natural Language Processing and Information Systems, volume 9103 of Lecture Notes in Computer Science, pages 35–50. Springer International Publishing, 2015.
[21] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361–397, 2004.
[22] J. McAuley and J. Leskovec. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys ’13, pages 165–172, New York, NY, USA, 2013. ACM.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. 2013.
[24] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[25] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In ICML 2013, volume 28 of JMLR Proceedings, pages 1310–1318. JMLR.org, 2013.
[26] B. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.
[27] D. Rumelhart, G. Hintont, and R. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
[28] C. D. Santos and B. Zadrozny. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 1818–1826, 2014.
[29] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil. A latent semantic model with convolutional-pooling structure for information retrieval. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, pages 101–110. ACM, 2014.
[30] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In S. Dasgupta and D. Mcallester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.
[31] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano, and K. J. Lang. Phoneme recognition using time-delay neural networks. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(3):328–339, 1989.
[32] C. Wang, M. Zhang, S. Ma, and L. Ru. Automatic online news issue construction in web environment. In Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pages 457–466, New York, NY, USA, 2008. ACM.
