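An OCR text-recognition service deployed on Linux needs the Tesseract environment installed. The information online is scattered, and the Dockerfiles people publish for different Linux distributions behave so inconsistently that it is hard to tell which ones actually work. What follows is the result of a day or two of experimenting. It deploys the environment reliably, and I hope it is useful to others. The process has three stages: (1) build a base image with the Tesseract environment installed; (2) upload the tessdata language packs to the server, where Tesseract reads them during recognition; (3) build the application image, mount the tessdata directory at /usr/local/share/tessdata, and set the container's TESSDATA_PREFIX environment variable.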
1. Prepare the Dockerfile for the base image. It needs two resource files, tesseract-4.1.1.tar.gz and leptonica-1.80.0.tar.gz, placed in the same directory as the Dockerfile:
https://github.com/tesseract-ocr/tesseract/releases/tag/4.1.1
http://www.leptonica.org/source/leptonica-1.80.0.tar.gz
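As a convenience, both archives can be fetched into the build directory with a small script. This is only a sketch under a few assumptions: curl is available on the build host, GitHub's archive URL for the 4.1.1 tag serves the source tarball (the article itself only links the release page), and the function name fetch_sources is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: download both source archives into DIR (default: the
# current directory) under the file names the Dockerfile's ADD lines expect.
fetch_sources() {
    dir=${1:-.}
    # Assumed GitHub archive URL for the 4.1.1 tag (not given in the article).
    curl -fL -o "$dir/tesseract-4.1.1.tar.gz" \
        "https://github.com/tesseract-ocr/tesseract/archive/4.1.1.tar.gz" || return 1
    # Leptonica URL as listed above.
    curl -fL -o "$dir/leptonica-1.80.0.tar.gz" \
        "http://www.leptonica.org/source/leptonica-1.80.0.tar.gz" || return 1
}
```

Calling fetch_sources with no argument from the directory that holds the Dockerfile should leave both tarballs where the build expects them.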
FROM mamohr/centos-java

LABEL AUTHOR="siman(214382122@qq.com)" VERSION="1.0.0" BUILD_DATE="2020-09-01" \
      RESOURCES="https://github.com/tesseract-ocr/tesseract http://www.leptonica.org/index.html https://github.com/tesseract-ocr/tessdata" \
      DESCRIPTION="This image integrated and edited the running environment of tesseract-4.1.1 and leptonica-1.80.0, \
and made it based on CentOS system. Based on this basic image, you can run your own tess4j jar application"

# Environment variables (tesseract)
ENV LD_LIBRARY_PATH="/usr/local/lib" \
    LIBLEPT_HEADERSDIR="/usr/local/include" \
    PKG_CONFIG_PATH="/usr/local/lib/pkgconfig"

# Install the tesseract environment: ADD unpacks the local tarballs into /,
# then leptonica is built first because tesseract's configure step looks for it
ADD tesseract-4.1.1.tar.gz /
ADD leptonica-1.80.0.tar.gz /
RUN yum -y install file automake libicu-devel pango-devel cairo-devel libjpeg-devel libpng-devel libtiff-devel zlib-devel libtool gcc-c++ make \
    && cd /leptonica-1.80.0 && ./configure && make && make install \
    && cd /tesseract-4.1.1 && ./autogen.sh && ./configure && make && make install \
    && rm -rf /leptonica-1.80.0 /tesseract-4.1.1

# Time zone
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
RUN echo 'Asia/Shanghai' >/etc/timezone
2. Build the base image
docker build -t tess/centos-java:v1.0 .
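Before building anything on top of the base image, it may be worth confirming that the tesseract binary actually made it in. The sketch below assumes the binary is on the container's PATH (make install normally puts it in /usr/local/bin); the function name check_base_image is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run `tesseract --version` in a throwaway container.
# --entrypoint replaces whatever entrypoint the parent image defines, and
# --rm removes the container once the command exits.
check_base_image() {
    image=${1:-tess/centos-java:v1.0}
    docker run --rm --entrypoint tesseract "$image" --version
}
```

If the build succeeded, check_base_image should print the tesseract and leptonica versions; a "not found" error suggests the compile step failed or the install path is not on PATH.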
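3. Install the tessdata package. Download the language packs and upload them to a directory on the server; the run command in step 5 mounts /usr/tessdata from the host.
Link: https://pan.baidu.com/s/1XAvPkTdUXuFq-q2InDREhQ Extraction code: 6vjp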
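A missing language file usually only shows up as a recognition error at runtime, so a quick check of the host directory before starting the container can save a round trip. The following is a minimal sketch: the function name check_tessdata is made up, and it assumes each language is stored as LANG.traineddata directly inside the directory, which is the layout tesseract expects.

```shell
#!/bin/sh
# Hypothetical helper: check_tessdata DIR LANG...
# Prints one line per missing LANG.traineddata file and returns 1 if any
# file is missing, 0 if all of them are present.
check_tessdata() {
    dir=$1
    shift
    missing=0
    for lang in "$@"; do
        if [ ! -f "$dir/$lang.traineddata" ]; then
            echo "missing: $dir/$lang.traineddata"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, check_tessdata /usr/tessdata eng chi_sim checks the English and Simplified Chinese models; those two names are only examples, so substitute whichever languages the application requests.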
4. Build your own springboot-ocr service image and set the TESSDATA_PREFIX environment variable
FROM tess/centos-java:v1.0

LABEL AUTHOR="siman(214382122@qq.com)" VERSION="1.0.0" BUILD_DATE="2020-09-01"

VOLUME /tmp
ADD simm-framework-test-1.0.jar app.jar
EXPOSE 8080

# Tell tesseract where the mounted language packs live
ENV TESSDATA_PREFIX="/usr/local/share/tessdata"

# Entry point
ENTRYPOINT ["java","-jar","/app.jar"]
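5. Start the container and mount the tessdata directory. The command assumes the image from step 4 was built with the tag ocr-api:v1.0 (docker build -t ocr-api:v1.0 .) and that the language packs are in /usr/tessdata on the host.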
docker run -it -v /usr/tessdata:/usr/local/share/tessdata -p 8080:8080 --name="ocr-api" ocr-api:v1.0
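Once the container is up, one way to confirm that the mount and the environment variable line up is to query tesseract from inside it. This is a sketch rather than part of the original procedure: the function name show_container_langs is hypothetical, and it assumes the container is still running under the name ocr-api and is reachable from a second terminal.

```shell
#!/bin/sh
# Hypothetical helper: show what tesseract sees inside a running container.
show_container_langs() {
    name=${1:-ocr-api}
    # Value of TESSDATA_PREFIX as seen by processes in the container.
    docker exec "$name" printenv TESSDATA_PREFIX || return 1
    # Languages tesseract finds in its data directory.
    docker exec "$name" tesseract --list-langs
}
```

If the second command lists no languages, the host directory is probably empty or mounted at a different path than TESSDATA_PREFIX points to.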