Causal Corpus: Statistics on Event Causality Corpora


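This article surveys how datasets in the causal relation extraction field are annotated and whether they are openly available. Besides corpora annotated specifically for causality, some similar corpora are included as well, so that readers can pick a corpus that fits their own goal.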

Field Overview

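A causal relation is usually annotated as a (cause, effect, signal) triple: cause and effect denote the cause event and the effect event, and signal is the linguistic trigger word of the causal construction, for example because, so, or thus.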
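As a rough illustration of the triple described above, the sketch below stores one annotation in a small Python data class. The class name `CausalRelation`, its field names, the example sentence, and the choice of `None` for a missing signal are assumptions made for illustration; none of the corpora listed here prescribes this layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CausalRelation:
    """One annotated (cause, effect, signal) triple. Field names are illustrative."""
    cause: str             # text span of the cause event
    effect: str            # text span of the effect event
    signal: Optional[str]  # trigger word such as "because"; None if no trigger is present (assumption)


# Made-up example sentence, not taken from any corpus.
sentence = "The flight was cancelled because a storm hit the airport."
relation = CausalRelation(
    cause="a storm hit the airport",
    effect="The flight was cancelled",
    signal="because",
)
print(relation)
```

Keeping the signal as a separate optional field is one way to represent both explicit and implicit causality in the same structure.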

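Note that causal corpora differ widely in how they define causality and how they define events. As a result, there is still no large-scale unified corpus that supports open-domain research in this field. How to arrive at a good definition is itself a focus of academic discussion.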

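Causal event corpora usually serve as the basis for tasks such as causal event extraction and causal inference, and allow event chains to be analyzed with rule-based, machine learning, and deep learning methods.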
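To give a feel for the rule-based end of that spectrum, here is a minimal sketch that matches a single explicit pattern, "<effect> because <cause>". The pattern, the function name `extract_because`, and the example sentences are hypothetical; real systems cover many more signals and constructions, and this is not the method of any paper cited here.

```python
import re
from typing import Optional, Tuple

# Hypothetical rule: "<effect> because <cause>", with optional final punctuation.
# Only this one explicit signal is covered.
BECAUSE_PATTERN = re.compile(
    r"^(?P<effect>.+?)\s+because\s+(?P<cause>.+?)[.!?]?$",
    re.IGNORECASE,
)


def extract_because(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Return (cause, effect, signal) if the sentence matches the rule, else None."""
    match = BECAUSE_PATTERN.match(sentence.strip())
    if match is None:
        return None
    return match.group("cause"), match.group("effect"), "because"


print(extract_because("The road is wet because it rained."))
print(extract_because("It rained."))
```

A rule like this finds only explicit causality with a known trigger, which is one reason annotated corpora matter for the learning-based methods.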

Sampling Strategy

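The corpora were collected by first retrieving a seed set of papers from Google Scholar using domain keywords (such as causal, relation, causality), then repeatedly expanding the set of relevant documents by following the citation links between papers. The result is the collection of corpora relevant to the field.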

arXiv preprints are not included for now; only published papers are counted.
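The expansion step above can be sketched as a breadth-first traversal of a citation graph. Everything concrete here is an assumption for illustration: the toy graph `CITATIONS`, the paper ids, the `ARXIV_ONLY` set, and the choice to still follow citation links out of excluded preprints. The seed set is assumed to be given already; the keyword search on Google Scholar is not modeled.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical citation graph: paper id -> ids of papers linked to it by citation.
CITATIONS: Dict[str, List[str]] = {
    "seed_a": ["paper_b", "paper_c"],
    "paper_b": ["paper_d"],
    "paper_c": ["arxiv_e"],
    "paper_d": [],
    "arxiv_e": ["paper_f"],
    "paper_f": [],
}
# Hypothetical set of papers that exist only as preprints.
ARXIV_ONLY: Set[str] = {"arxiv_e"}


def snowball(seeds: List[str], citations: Dict[str, List[str]], excluded: Set[str]) -> List[str]:
    """Breadth-first expansion from the seed papers along citation links.

    Excluded papers are not collected, but their links are still followed
    (an assumption of this sketch).
    """
    seen: Set[str] = set(seeds)
    queue = deque(seeds)
    collected: List[str] = []
    while queue:
        paper = queue.popleft()
        if paper not in excluded:
            collected.append(paper)
        for neighbor in citations.get(paper, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return collected


print(snowball(["seed_a"], CITATIONS, ARXIV_ONLY))
```

The `seen` set keeps the traversal from revisiting a paper when citation links form cycles.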

Statistical Analysis

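| Name | Year | Size (number of causal relations) | Availability | Notes |
| --- | --- | --- | --- | --- |
| SemEval-2007 task 4 | 2007 | 210 | | ~ |
| The Penn Discourse Treebank 2.0 | 2008 | ~ | | Causality is not annotated specifically; it is recorded as a subclass of the contingency relationship. Explicit causality only, and the trigger words are incomplete, so causality cannot be fully expressed and many cases are left unmarked. BECauSE Corpus 2.0 is more complete by comparison. |
| Bethard et al., 2008 | 2008 | - | The link in the paper is dead | Annotated a small corpus with binary causal labels for events connected by 'and'. |
| SemEval-2010 task 8 | 2010 | 1,331 | | Each sentence is annotated with only one pair of causal events, even when other causal events are present. Entities are not annotated in full; only the head is marked. |
| Richer Event Descriptions | 2014 | 1,147 | | Enriches the annotation of the THYME clinical corpus: adds event coreference annotation, annotates event relations between adjacent sentences, and splits causality into 'PRECONDITION' and 'CAUSE'. |
| Causal-TimeBank | 2014 | 298 | | Proposes a broader-coverage linguistic approach to enrich the TimeML corpus with causal relations and trigger words. Events must be events already annotated in TimeML, and annotation is based on linguistic features. The guideline is not precise enough and relies more on subjective notions. |
| The Chinese Discourse TreeBank | 2015 | 261 | | One of only two Chinese corpora found. |
| CaTeRS | 2016 | about 700 | | 320 stories, 1,600 sentences, 2,708 events, 2,715 relations, 13 types. Entities are not annotated in full; only the head is marked. It does not annotate real-world causality, but the causal conclusions that human reasoning can draw within a story. Focuses on script and narrative structure learning. |
| AltLex | 2016 | 9,190 | | Uses the PDTB and Wikipedia corpora with distant supervision to build a causal annotation set automatically. At the end of the paper the authors note that they did not validate the annotation quality in detail; the set only serves as a component of a classifier to improve final performance. |
| BECauSE Corpus 2.0 | 2017 | 1,803 | | Explicit causality. High consistency with other annotation schemes and complete coverage of linguistic causal constructions. Other relations are annotated in parallel, so the same event pair may carry several relations, and the overlap between relations is discussed. The best corpus found so far. |
| Event StoryLine Corpus | 2017 | 5,519 PLOT_LINK | | This corpus annotates stories. The PLOT_LINK annotation expresses explanatory relations, that is, information that helps the reader understand the narrative structure of a story. The result is very similar to causality, but the starting point differs. The relation aims to make the coherence or logical connection of events in a (news) story explicit. It is a loose causal or temporal relation between events, in which the mention of one event explains or justifies the occurrence of another. |
| HIT-CDTB | ? | 2,138 (explicit) + 1,526 (implicit) | | HIT discourse relation corpus. Unverified. |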

A detailed analysis of each corpus has not been fully written up yet; readers who need it can contact me by email.

References

  1. Girju R, Nakov P, Nastase V, et al. Semeval-2007 task 04: Classification of semantic relations between nominals[C]//Proceedings of the 4th International Workshop on Semantic Evaluations. Association for Computational Linguistics, 2007: 13-18.
  2. Prasad R, Dinesh N, Lee A, et al. The Penn Discourse TreeBank 2.0[C]//LREC. 2008.
  3. Bethard S, Corvey W J, Klingenstein S, et al. Building a Corpus of Temporal-Causal Structure[C]//LREC. 2008.
  4. Hendrickx I, Kim S N, Kozareva Z, et al. Semeval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals[C]//Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions. Association for Computational Linguistics, 2009: 94-99.
  5. O’Gorman T, Wright-Bettner K, Palmer M. Richer Event Description: Integrating event coreference with temporal, causal and bridging annotation[C]//Proceedings of the 2nd Workshop on Computing News Storylines (CNS 2016). 2016: 47-56.
  6. Mirza P, Sprugnoli R, Tonelli S, et al. Annotating causality in the TempEval-3 corpus[C]//EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL). Association for Computational Linguistics, 2014: 10-19.
  7. Zhou Y, Xue N. The Chinese Discourse TreeBank: a Chinese corpus annotated with discourse relations[J]. Language Resources and Evaluation, 2015, 49(2): 397-431.
  8. Mostafazadeh N, Grealish A, Chambers N, et al. CaTeRS: Causal and temporal relation scheme for semantic annotation of event structures[C]//Proceedings of the Fourth Workshop on Events. 2016: 51-61.
  9. Hidey C, McKeown K. Identifying causal relations using parallel Wikipedia articles[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016: 1424-1433.
  10. Dunietz J, Levin L, Carbonell J. The BECauSE corpus 2.0: Annotating causality and overlapping relations[C]//Proceedings of the 11th Linguistic Annotation Workshop. 2017: 95-104.
  11. Caselli T, Vossen P. The event storyline corpus: A new benchmark for causal and temporal relation extraction[C]//Proceedings of the Events and Stories in the News Workshop. 2017: 77-86.
  12. T. N. de Silva, X. Zhibo, Z. Rui, M. Kezhi, Causal relation identification using convolutional neural networks and knowledge based features, World Academy of Science, Engineering and Technology, International Journal of Computer, Electrical, Automation, Control and Information Engineering 11 (6) (2017) 697–702.
  13. C. Kruengkrai, K. Torisawa, C. Hashimoto, J. Kloetzer, J. Oh, M. Tanaka, Improving event causality recognition with multiple background knowledge sources using multi-column convolutional neural networks, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA., 2017, pp. 3466–3473.
  14. J. Dunietz, J. G. Carbonell, L. S. Levin, DeepCx: A transition-based approach for shallow semantic parsing with complex constructional triggers, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, 2018, pp. 1691–1701.

License

This article was compiled by ArrogantL and is released under the CC BY-NC-SA 3.0 license. For any questions, please email arrogant262@gmail.com

Please follow the Markdown: License and the licenses of the other references when using, modifying, and distributing this work.

