nexedi / erp5 / Merge requests / !1420

Merged
Created May 20, 2021 by Jérome Perrin (@jerome, Owner) · 3 of 3 tasks completed

Lighter processing for OCR activities


When running OCR, we sometimes run into problems because the processing is too heavy:

  • It uses 2 or 3 GB of disk space for a one-page PDF created by erp5_document_scanner, because we convert PDF -> PNG -> TIFF before sending the result to tesseract. Modern Ghostscript can run tesseract directly, so we use that when it is available.
  • It uses 300% of CPU. This is fixed by setting OMP_THREAD_LIMIT when running tesseract. It only applies when running OCR on images; the OCR embedded in Ghostscript does not seem to need it.
  • ... and it often crashes and has to be restarted. This is fixed by updating tesseract.
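
The first two points can be illustrated with a minimal sketch. This is not the code from this merge request: the function names are hypothetical, the thread limit of 1 is an assumed value, and detecting the `ocr` device by parsing `gs -h` output is only one possible way to check that Ghostscript was built with OCR support. It assumes the `tesseract` and `gs` executables are on `PATH` and that the tesseract language data is installed.

```python
import os
import subprocess


def tesseract_env(thread_limit=1, base_env=None):
    """Return a copy of the environment with OMP_THREAD_LIMIT set.

    OMP_THREAD_LIMIT caps the number of OpenMP threads, which keeps
    tesseract from spreading across several cores.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def has_ocr_device(gs_help_text):
    """Return True if the output of `gs -h` lists the `ocr` device."""
    return "ocr" in gs_help_text.split()


def ocr_image(image_data, lang="eng"):
    """OCR image bytes by piping them through the tesseract CLI."""
    result = subprocess.run(
        ["tesseract", "stdin", "stdout", "-l", lang],
        input=image_data,
        env=tesseract_env(),  # assumed limit of one thread
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")


def ocr_pdf(pdf_path, lang="eng"):
    """OCR a PDF with Ghostscript's `ocr` device.

    No intermediate PNG or TIFF files are written; the recognised
    text is read from Ghostscript's standard output.
    """
    result = subprocess.run(
        [
            "gs", "-q", "-dBATCH", "-dNOPAUSE",
            "-sDEVICE=ocr",
            "-sOCRLanguage=" + lang,
            "-sOutputFile=-",
            pdf_path,
        ],
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("utf-8")
```

A caller could run `gs -h` once, pass its output to `has_ocr_device`, and fall back to converting pages to images and calling `ocr_image` when the device is missing.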

Updates of ghostscript and tesseract are part of https://lab.nexedi.com/nexedi/slapos/merge_requests/985

Edited May 31, 2021 by Jérome Perrin
Source branch: fix/tesseract-lighter-activities