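While researching an OCR project, I came across an open-source tool, gosseract, whose recognition results are quite good.
I set up the environment step by step, starting by installing tesseract (a dependency of gosseract) on macOS: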
brew install tesseract
$ tesseract -v
tesseract 4.1.3
leptonica-1.82.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.1 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE
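The first installation went smoothly.
As business requirements grew, we needed to train languages, which requires the training tools, so I decided to uninstall and reinstall: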
$ brew install --with-training-tools tesseract
Usage: brew install [options] formula|cask [...]
Install a formula or cask. Additional options specific to a formula may be appended to the command.
...
Error: invalid option: --with-training-tools
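Homebrew rejects the option: this way of installing the training tools is no longer supported. So I built from source instead:
Install dependencies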
# Packages which are always needed.
brew install automake autoconf libtool
brew install pkgconfig
brew install icu4c
brew install leptonica
# Packages required for training tools.
brew install pango
# Optional packages for extra features.
brew install libarchive
# Optional package for builds using g++.
brew install gcc
Download and extract
https://github.com/tesseract-ocr/tesseract/releases
Build and install
cd tesseract-5.1.0
./autogen.sh
mkdir build
cd build
# Optionally add CXX=g++-8 to the configure command if you really want to use a different compiler.
../configure PKG_CONFIG_PATH=/usr/local/opt/icu4c/lib/pkgconfig:/usr/local/opt/libarchive/lib/pkgconfig:/usr/local/opt/libffi/lib/pkgconfig
make -j
# Optionally install Tesseract.
sudo make install
# Optionally build and install training tools.
make training
sudo make training-install
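The hard-coded /usr/local/opt paths in the configure line match Homebrew on Intel Macs; on Apple Silicon, Homebrew lives under /opt/homebrew instead. As a minimal sketch, the same PKG_CONFIG_PATH could be assembled from the prefixes that `brew --prefix <formula>` reports. The helper name `pkgconfig_path` is hypothetical and not part of Homebrew or tesseract.

```shell
# Hypothetical helper: join "<prefix>/lib/pkgconfig" for every prefix given,
# separated by ':' (the format PKG_CONFIG_PATH expects).
pkgconfig_path() {
    result=""
    for prefix in "$@"; do
        # ${result:+$result:} adds the ':' separator only after the first entry.
        result="${result:+$result:}$prefix/lib/pkgconfig"
    done
    printf '%s\n' "$result"
}

# Example use (requires Homebrew, so it is left as a comment here):
#   ../configure PKG_CONFIG_PATH="$(pkgconfig_path \
#       "$(brew --prefix icu4c)" \
#       "$(brew --prefix libarchive)" \
#       "$(brew --prefix libffi)")"
```

With the Intel prefixes used above, the helper produces the same string as the hand-written configure line.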
Problem:
After the installation, building the project failed:
2022/03/31 15:32:10 ERROR ▶ 0004 Failed to build the application: # ocr
/usr/local/go/pkg/tool/darwin_amd64/link: running clang++ failed: exit status 1
Undefined symbols for architecture x86_64:
"tesseract::TessBaseAPI::Init(char const*, char const*, tesseract::OcrEngineMode, char**, int, GenericVector<STRING> const*, GenericVector<STRING> const*, bool)", referenced from:
Init(void*, char*, char*) in 000023.o
_Init in 000023.o
_GetDataPath in 000023.o
"tesseract::TessBaseAPI::Recognize(ETEXT_DESC*)", referenced from:
_GetBoundingBoxesVerbose in 000023.o
_GetBoundingBoxes in 000023.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
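From the error text alone it was not obvious that this was a version problem. After several rounds of uninstalling and reinstalling, I found that the installed tesseract (5.1.0) was too new for the project; after reinstalling 4.1.3, the service built normally.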
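Since the linker error gives no hint about versions, a quick check of the installed major version before building can save time. The following is only a sketch: `tesseract_major` and `check_tesseract_major` are hypothetical helper names, and the expected major version (4) is taken from what worked for this project.

```shell
# Hypothetical helper: extract the major version from the first line of
# the text printed by `tesseract -v` (e.g. "tesseract 4.1.3" -> "4").
tesseract_major() {
    printf '%s\n' "$1" | sed -n '1s/^tesseract \([0-9][0-9]*\)\..*/\1/p'
}

# Hypothetical guard: succeed only if the major version matches.
#   $1: expected major version, $2: output of `tesseract -v`
check_tesseract_major() {
    expected="$1"
    actual="$(tesseract_major "$2")"
    if [ "$actual" = "$expected" ]; then
        echo "ok: tesseract major version $actual"
    else
        echo "mismatch: expected major $expected, found '$actual'" >&2
        return 1
    fi
}

# Example use (requires tesseract on PATH, so it is left as a comment):
#   check_tesseract_major 4 "$(tesseract -v 2>&1)" || exit 1
```

The `2>&1` in the example is there in case a given build prints its version banner to stderr rather than stdout.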
To uninstall, either delete the installed files manually or, for a Homebrew install, run:
brew uninstall tesseract
However, reinstalling tesseract afterwards ran into a series of problems, as shown below:
$ brew install tesseract==4.1.3
Warning: No available formula with the name "tesseract==4.1.3". Did you mean tesseract?
==> Searching for similarly named formulae...
This similarly named formula was found:
tesseract
To install it, run:
brew install tesseract
==> Searching for a previously deleted formula (in the last month)...
Error: No previously deleted formula found.
==> Searching taps on GitHub...
Error: No formulae found in taps.
liumeng@liumengdeMacBook-Pro Pictures % brew install tesseract
==> Downloading https://ghcr.io/v2/homebrew/core/tesseract/manifests/4.1.3
Already downloaded: /Users/liumeng/Library/Caches/Homebrew/downloads/9597a8ae2cb676cd25c79cf252f4eb8759b9cf3d472c57f7c764e086c5f8f6e2--tesseract-4.1.3.bottle_manifest.json
==> Downloading https://ghcr.io/v2/homebrew/core/tesseract/blobs/sha256:1b67091dce98b42c6c561981a01738fe01c19ac69a1dc4de6d8e43fe885177f0
Already downloaded: /Users/liumeng/Library/Caches/Homebrew/downloads/cf8d3fbb1aea1cc629c6873a25b11d732c90ff23bfa4c44ba23d0ce5c24e907a--tesseract--4.1.3.big_sur.bottle.tar.gz
==> Pouring tesseract--4.1.3.big_sur.bottle.tar.gz
Error: The `brew link` step did not complete successfully
The formula built, but is not symlinked into /usr/local
Could not symlink include/tesseract/apitypes.h
/usr/local/include/tesseract is not writable.
You can try again using:
brew link tesseract
==> Caveats
This formula contains only the "eng", "osd", and "snum" language data files.
If you need any other supported languages, run `brew install tesseract-lang`.
==> Summary
🍺 /usr/local/Cellar/tesseract/4.1.3: 65 files, 29.7MB
According to the error message, the next step is to link the formula:
$ brew link tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink include/tesseract/apitypes.h
/usr/local/include/tesseract is not writable.
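The blocking directory appears to be a leftover from the earlier source build (installed with sudo, so Homebrew cannot write to it), so it has to be deleted first: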
$ sudo rm -rf /usr/local/include/tesseract
Then link again:
$ brew link tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink share/tessdata/configs/alto
Target /usr/local/share/tessdata/configs/alto
already exists. You may want to remove it:
rm '/usr/local/share/tessdata/configs/alto'
To force the link and overwrite all conflicting files:
brew link --overwrite tesseract
To list all files that would be deleted:
brew link --overwrite --dry-run tesseract
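Homebrew offers three options here: remove the conflicting file, force the link with --overwrite, or list what would be deleted with --overwrite --dry-run.
I proceeded as follows: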
$ sudo rm -rf /usr/local/share/tessdata/configs/alto
$ brew link --overwrite --dry-run tesseract
Would remove:
/usr/local/share/tessdata/configs/ambigs.train
...
/usr/local/lib/libtesseract.dylib -> /usr/local/lib/libtesseract.5.dylib
/usr/local/lib/pkgconfig/tesseract.pc
liumeng@liumengdeMacBook-Pro Pictures % tesseract -v
zsh: command not found: tesseract
liumeng@liumengdeMacBook-Pro Pictures % brew install tesseract
Updating Homebrew...
==> Auto-updated Homebrew!
Updated 1 tap (homebrew/cask).
==> Updated Casks
Updated 7 casks.
Warning: tesseract 4.1.3 is already installed, it's just not linked.
To link this version, run:
brew link tesseract
$ brew link --overwrite tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink share/tessdata/configs/alto
/usr/local/share/tessdata/configs is not writable.
Keep deleting the directories that block the link:
$ sudo rm -rf /usr/local/share/tessdata/configs
$ brew link --overwrite tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
Error: Could not symlink share/tessdata/tessconfigs/batch
/usr/local/share/tessdata/tessconfigs is not writable.
$ sudo rm -rf /usr/local/share/tessdata/tessconfigs
$ brew link --overwrite tesseract
Linking /usr/local/Cellar/tesseract/4.1.3...
12 symlinks created.
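The link, delete, relink cycle above repeats once per blocking directory. As a rough sketch, a small helper can pull the blocking path out of the error text so it can be inspected before anything is deleted. The name `blocking_path` is hypothetical, and the sketch assumes the error keeps the "<path> is not writable." wording seen in the logs above.

```shell
# Hypothetical helper: print the path that the link error reports as
# not writable; prints nothing if the text contains no such line.
#   $1: combined output of `brew link tesseract`
blocking_path() {
    printf '%s\n' "$1" | sed -n 's|^[[:space:]]*\(/.*\) is not writable\.$|\1|p'
}

# Example use (requires Homebrew; prints the path only, deletes nothing):
#   blocking_path "$(brew link tesseract 2>&1)"
```

The helper deliberately stops at printing the path: checking what the directory contains before running `sudo rm -rf` on it remains a manual step.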
Verify:
$ tesseract -v
tesseract 4.1.3
leptonica-1.82.0
libgif 5.2.1 : libjpeg 9d : libpng 1.6.37 : libtiff 4.3.0 : zlib 1.2.11 : libwebp 1.2.1 : libopenjp2 2.4.0
Found AVX2
Found AVX
Found FMA
Found SSE
The project now builds normally. Done!