Lighter processing for OCR activities

Merged Jérome Perrin requested to merge fix/tesseract-lighter-activities into master May 20, 2021

When running OCR, we sometimes have issues because processing is "too heavy":

  • it uses 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert pdf -> png -> tiff before sending to tesseract. Modern Ghostscript supports running tesseract directly, so we use that when it is available.
  • it uses 300% of CPU. Fixed by setting OMP_THREAD_LIMIT when running tesseract. This only applies when running OCR on images; OCR embedded in Ghostscript does not seem to need it.
  • ... and it often crashes, so it gets restarted. This is fixed by updating tesseract.
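
For illustration, here is a minimal sketch of the two ideas above. It is not the code from this merge request: the helper names (`build_tesseract_env`, `parse_gs_devices`, `run_tesseract`) and the thread limit of 1 are assumptions, and the parser assumes that `gs -h` prints an indented device list after an `Available devices:` line.

```python
import os
import subprocess


def build_tesseract_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    The value 1 is an assumption for this sketch; it caps the number of
    OpenMP threads tesseract may use.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def run_tesseract(image_path, output_base, thread_limit=1):
    """Run the tesseract CLI on an image with a capped thread count."""
    return subprocess.run(
        ["tesseract", image_path, output_base],
        env=build_tesseract_env(thread_limit),
        check=True,
    )


def parse_gs_devices(help_text):
    """Extract device names from the text printed by `gs -h`.

    Assumes the device list is the indented block that follows the
    "Available devices:" line.
    """
    devices = set()
    in_devices = False
    for line in help_text.splitlines():
        if line.startswith("Available devices:"):
            in_devices = True
            continue
        if in_devices:
            if not line.startswith(" "):
                break  # end of the indented device block
            devices.update(line.split())
    return devices


def ghostscript_supports_ocr():
    """Return True if the local gs binary lists an `ocr` device."""
    try:
        result = subprocess.run(
            ["gs", "-h"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        return False
    return "ocr" in parse_gs_devices(result.stdout)
```

A caller could check `ghostscript_supports_ocr()` first and fall back to `run_tesseract` on a converted image only when the OCR device is missing, which mirrors the "use it if it's available" behaviour described above.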

Updates of ghostscript and tesseract are part of https://lab.nexedi.com/nexedi/slapos/merge_requests/985

Edited May 31, 2021 by Jérome Perrin
Source branch: fix/tesseract-lighter-activities