Hi,
I'm trying to run Subtitle Edit on Ubuntu 20.04 in Docker.
MONO_LOG_LEVEL="debug" MONO_LOG_MASK="all" xvfb-run -a mono /app/se/SubtitleEdit.exe /convert "Some subtitle file.eng.sup" subrip /FixCommonErrors >log
It just gets stuck at "OCR... : 0%".
I get this error:
Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/8afea31f-94ef-4daf-a4db-4ab4527be8aa.png" "/tmp/bec771bf-b841-4f21-8b27-9e7038e7624b" -l -psm 6 hocr]
It doesn't pass a language after -l (e.g. -l eng) in the tesseract command, so tesseract takes -psm as the language name and exits with error status 1.
Running tesseract manually with -l but no language gives the same failure:
$ tesseract phototest-rotated-R.png output -l -psm 6 hocr
Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/-psm.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language '-psm'
Tesseract couldn't load any languages!
Could not initialize tesseract.
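Judging from the "Failed loading language '-psm'" line, the value after -l is empty, so the next token gets consumed as the language. A minimal shell sketch of a guard against that; the helper name ocr_lang and the eng fallback are assumptions for illustration, not what Subtitle Edit actually does:

```shell
# Sketch: never emit "-l" without a value.
# "ocr_lang" is a hypothetical helper; the "eng" fallback is an assumption.
ocr_lang() {
  lang=$1
  if [ -z "$lang" ]; then
    lang=eng   # assumed default when no language was detected
  fi
  printf '%s\n' "$lang"
}
```

With that, something like tesseract in.png out -l "$(ocr_lang "$detected")" ... would always get a language, where $detected stands for whatever the caller found (possibly empty).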
Any ideas how to solve it?
Thanks for the info. I've tried to fix it here: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.18/SubtitleEditBeta.zip
Does that work?
(I'm on Windows at the moment, so it's not tested on Linux.)
Thanks, it sets English correctly now, but does that commit force English if the language isn't detected?
Is it possible to add an optional parameter to /convert, so I can specify other languages manually, like swe?
Also, there was another error:
Mono: process_create: Exec prog [/usr/bin/tesseract] args ["/tmp/6235c55d-0871-47aa-86ab-3fcf0634aff8.png" "/tmp/353002fe-9a6c-45bb-97c0-79e0461b28cb" -l eng -psm 6 hocr]
Error, unknown command line argument '-psm'
The psm parameter should have two dashes: --psm.
Beta updated: https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.18/SubtitleEditBeta.zip
Should now use Tesseract 4 style parameters on Linux.
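For anyone wrapping tesseract themselves, a rough sketch of picking the flag by version. It assumes the first line of tesseract --version looks like "tesseract 4.1.1" and that 3.x releases still accept the single-dash -psm; both helper names are hypothetical:

```shell
# Sketch: choose the page-segmentation flag from the Tesseract major version.
# "tesseract_major" and "psm_flag" are hypothetical helper names.

# Extract the major version from a version line such as "tesseract 4.1.1".
# Assumes the "tesseract <major>.<minor>..." layout.
tesseract_major() {
  first_line=$1
  version=${first_line#tesseract }   # drop the leading program name
  printf '%s\n' "${version%%.*}"     # keep everything before the first dot
}

# Tesseract 4 rejects -psm (see the error above); 3.x is assumed to accept it.
psm_flag() {
  if [ "$1" -ge 4 ]; then
    printf '%s\n' '--psm'
  else
    printf '%s\n' '-psm'
  fi
}
```

The version line could come from tesseract --version 2>&1 | head -n 1, since some versions print it to stderr.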
About the language, try /ocrdb:<ocr db/dictionary>, so that would be something like /ocrdb:eng or /ocrdb:swe.
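If the files are named like the one in the original command ("Some subtitle file.eng.sup"), the /ocrdb value could be derived from the file name. A small sketch under that assumption; lang_from_name and convert_sup are hypothetical helpers, and the "<name>.<lang>.sup" convention is assumed rather than anything Subtitle Edit requires:

```shell
# Hypothetical helper: pull the language code out of names like
# "Some subtitle file.eng.sup" (assumes a "<name>.<lang>.sup" convention).
# No validation: a name without a language part yields the bare file name.
lang_from_name() {
  base=${1%.sup}               # strip the .sup extension
  printf '%s\n' "${base##*.}"  # keep what follows the last remaining dot
}

# Convert one .sup file, passing the derived language via /ocrdb.
# The /app/se path is the one used earlier in this thread.
convert_sup() {
  xvfb-run -a mono /app/se/SubtitleEdit.exe /convert "$1" subrip \
    "/ocrdb:$(lang_from_name "$1")"
}
```

Calling convert_sup "Some subtitle file.swe.sup" would then run the conversion with /ocrdb:swe.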
Hi,
It works now, thanks, and /ocrdb:swe works as well.
$ /usr/bin/tesseract phototest-rotated-R.png output -l eng --psm 6 --oem 0 hocr
Error: Tesseract (legacy) engine requested, but components are not present in /usr/share/tesseract-ocr/4.00/tessdata/eng.traineddata!!
Failed loading language 'eng'
Tesseract couldn't load any languages!
Could not initialize tesseract.
I got the error above when using the dictionaries included with Ubuntu, so I had to fetch a legacy-compatible eng.traineddata from the Tesseract GitHub repo (in case anyone else sees this error).
OK, default engine mode is now "3" (Default, based on what is available)
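For reference, a small sketch of the --oem values as listed by tesseract --help-oem in Tesseract 4. The helper name oem_description is hypothetical. Mode 0 is the one that fails above when the traineddata file has no legacy components:

```shell
# Sketch: map a Tesseract 4 --oem value to its meaning.
# "oem_description" is a hypothetical helper name.
oem_description() {
  case $1 in
    0) echo 'Legacy engine only' ;;
    1) echo 'Neural nets LSTM engine only' ;;
    2) echo 'Legacy + LSTM engines' ;;
    3) echo 'Default, based on what is available' ;;
    *) echo "unknown engine mode: $1" >&2; return 1 ;;
  esac
}
```

With mode 3, tesseract picks whichever engine the installed traineddata supports, so the stock Ubuntu files should work without swapping in the legacy ones.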
https://github.com/SubtitleEdit/subtitleedit/releases/download/3.5.18/SubtitleEditBeta.zip