A purely academic article on table recognition:
http://hrb-br.com/5007404/20190321A0B99Y00.html
https://github.com/doc-analysis/TableBank
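In this work, researchers from Beihang University and Microsoft Research Asia jointly created TableBank, a new image-based dataset for table detection and recognition, built by applying weak supervision to Word and LaTeX documents found online. The dataset contains 417K high-quality labeled tables. Using it, the authors built several strong baselines with state-of-the-art deep neural network models, to help more research apply deep learning methods to table detection and recognition tasks. TableBank is now open source.
This post gives a download address for the dataset; anyone researching table recognition may find this link more convenient. (The link was obtained from the official source. The official download is slow, so I am sharing the address where I saved my own copy.)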
Link: **********************************************
Extraction code: ****
--------------------------------------------------------------------------------------------------------
Because some of the data has copyright issues and should not be released, we filtered the entire dataset and excluded the affected items. We also retrained all the baseline models on the revised dataset and listed the results on the leaderboard website.
Leaderboard: https://doc-analysis.github.io/
If you use the corpus in published work, please cite it:
@article{li2019tablebank,
title={TableBank: Table Benchmark for Image-based Table Detection and Recognition},
author={Li, Minghao and Cui, Lei and Huang, Shaohan and Wei, Furu and Zhou, Ming and Li, Zhoujun},
journal={arXiv preprint arXiv:1903.01949},
year={2019}
}
-----------------------------------------------------------------------------------
Related Resources
- [Gilani et al., 2017] A. Gilani, S. R. Qasim, I. Malik, and F. Shafait. Table detection using deep learning. In Proc. of ICDAR 2017, volume 01, pages 771–776, Nov 2017.