Medusa: Subtitle languages not always recognised when given as ISO 639-2 three-character codes

Created on 15 Sep 2020 · 36 Comments · Source: pymedusa/Medusa

Describe the bug
Most episodes these days have embedded subtitles, which I extract as .srt files, but the language codes on these are usually ISO 639-2 three-character codes.
For English this works perfectly.
Files: [screenshot]

Medusa: [screenshot]

But as you can see, the Dutch subtitle is not recognized, while files that are named with the two-character language code are picked up.
Files: [screenshot]

Medusa: [screenshot]

Medusa (please complete the following information):

Medusa Info: | Branch: master Commit: b352bb6924afcdfafce176a540d53ce405ca1312 Version: 0.4.3 Database: 44.16
-- | --
Python Version: | 3.7.6 (tags/v3.7.6:43364a7ae0, Dec 19 2019, 00:42:30) [MSC v.1916 64 bit (AMD64)]
SSL Version: | OpenSSL 1.1.1d 10 Sep 2019
OS: | Windows-10-10.0.17763-SP0
Locale: | nl_NL.cp1252

All 36 comments

ISO 639-2 defines two different codes for the Dutch language.
The formal code is "nld".
And "dut" is just a synonym, which is not recognized.
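To make the two-codes point concrete, here is a minimal Python sketch that maps both ISO 639-2 variants (the bibliographic code such as "dut" and the terminology code such as "nld") to the same two-letter ISO 639-1 code. The table is a small illustrative subset and the function name `to_two_letter` is made up for this example; it is not how Medusa or its libraries resolve language codes.

```python
# Illustrative subset only; a complete table would cover every language.
# Both the bibliographic (B) and the terminology (T) ISO 639-2 code of a
# language map to the same ISO 639-1 two-letter code.
ISO_639_2_TO_1 = {
    "dut": "nl",  # Dutch, bibliographic code
    "nld": "nl",  # Dutch, terminology code
    "eng": "en",  # English (B and T codes are identical)
    "ger": "de",  # German, bibliographic code
    "deu": "de",  # German, terminology code
    "fre": "fr",  # French, bibliographic code
    "fra": "fr",  # French, terminology code
}


def to_two_letter(code):
    """Return the two-letter code for a three-letter code, or None if unknown."""
    return ISO_639_2_TO_1.get(code.strip().lower())
```

With this table, `to_two_letter("dut")` and `to_two_letter("nld")` both give "nl", while an unknown code gives `None`.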

I understand, but to my knowledge all embedded subs use the dut format.

To be precise, these are not embedded. An embedded subtitle is one that is stored inside the container, so within the .mkv or .mp4.

Sorry for the confusion, in my case they are the embedded subs but I extract and process them with Subtitle Edit before Medusa imports the episode.
So they end up as external but the naming comes from the embedded subs.

Hope I make sense 😀

No, embedded subs are only a stream.
The name is created by the software you use to extract it.
Why do you extract it?
That seems useless to me.

Reason for doing so is that I parse them with https://github.com/SubtitleEdit/subtitleedit to remove all Hearing Impaired entries

So if I understand correctly, the goal is not to extract the subs but to remove the Hearing Impaired subs from the source?
Then removing them should be sufficient without extracting all subs, or am I missing something?

And the standard for subtitle filename extensions is not the three-letter ISO code (e.g. .eng and .dut) but the two-letter ISO code (e.g. .nl and .en).
I don't know Subtitle Edit, but maybe you can change it to the two-letter code.

If not, I can make a small Python script for you that extracts the subtitles with a two-letter code as extension and deletes the Hearing Impaired ones.

My flow is as follows.

  1. Download finishes
  2. Script is kicked off
  3. Files are copied to a staging area and unpacked
  4. I then run https://github.com/willforde/mkvstrip to strip out all embedded subs that are not EN or NL (I hate 200 different subs in a file 😄)
  5. I run Subtitle Edit on the MKV, which extracts the embedded subs still present and fixes them by removing Hearing Impaired entries, fixing common errors, etc.
  6. Subtitle Edit will indeed extract using the 3-letter language code.
  7. When all is finished I call Medusa API to start the import.

My goal is to have Medusa recognize the Dutch subs when importing.
So if you have a Python script that will convert 3-letter language codes to 2-letter language codes, or can have Medusa understand dut, I'm a happy camper 😄

Not only does Medusa not support the 3-letter code, all media players also expect the two-letter code as the extension of subtitle files.

Still not sure why you don't keep the subs in the .mkv.
I can make you a script that does this:

Input: Video file with all kinds of subs in there
Output: Video file with only Dutch and English subs in there (without Hearing Impaired subs)

or:

Input: Video file with all kinds of subs in there
Output: Dutch and English subtitle files with two-letter extension, skipping the Hearing Impaired subs.

or:
Input: Video file with all kinds of subs in there
Output: Dutch and English subtitle files with two-letter extension, skipping the Hearing Impaired subs, and a new video file without subtitles.

or:
A script that renames subtitle files with a 3-letter code to subtitle files with a two-letter code.

The last is the simplest option to make; you can even use a .bat script to do that.
Something like:

rename *dut.srt *nl.srt
rename *eng.srt *en.srt

Medusa is not going to support that. We use libs that parse the language code, so we would need to make exceptions in both Python libs and JS libs.

Batch doesn't like that 😉, I've tried.
I end up with
test.dut.nl.srt
I extract them because I want to edit the subs and strip out unwanted HI and other things like song lyrics etc.

You must not put a . in front of the dut.
The dot (.) is greedy.

So this does not work:
rename *.dut.srt *.nl.srt

but this works:
rename *dut.srt *nl.srt

For me with Windows 10 rename *dut.srt *nl.srt will give me
test.dut.srtnl.srt

OK try this command:

rename ???????????????????????????????????????????????.dut.srt ???????????????????????????????????????????????.nl.srt

Make sure to use enough question marks (?) to catch even the longest name.
Too many question marks do no harm; with too few you miss files with longer names.

This is some high-tech scripting and it works 😃

Not high tech, just working around Microsoft's stupid rename wildcard implementation.

To comment on

Not only does Medusa not support the 3-letter code, all media players also expect the two-letter code as the extension of subtitle files.

Kodi works perfectly with the .dut.srt files but I'll add in your rename command to fix it.
Thanks for your help.

@BenjV It did work in my test but it fails in production; it looks to be because there are multiple dots in the filename.
It appears that Batch is very picky: https://superuser.com/questions/475874/how-does-the-windows-rename-command-interpret-wildcards

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>dir
 Volume in drive C has no label.
 Volume Serial Number is D88D-6860

 Directory of C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES

09-10-2020  09:30    <DIR>          .
09-10-2020  09:30    <DIR>          ..
09-10-2020  09:30    <DIR>          Sample
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.#4.eng.srt
09-10-2020  09:30            61.729 the.boys.s02e08.1080p.web.h264-cakes.dut.srt
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.eng.srt
09-10-2020  07:12     4.325.977.430 the.boys.s02e08.1080p.web.h264-cakes.mkv
09-10-2020  07:12               254 the.boys.s02e08.1080p.web.h264-cakes.nfo
09-10-2020  07:12             2.922 the.boys.s02e08.1080p.web.h264-cakes.srr
               6 File(s)  4.326.184.059 bytes
               3 Dir(s)  4.795.503.951.872 bytes free

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>ren ???????????????????????????????????????????????.dut.srt ???????????????????????????????????????????????.nl.srt
The system cannot find the file specified.

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>dir
 Volume in drive C has no label.
 Volume Serial Number is D88D-6860

 Directory of C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES

09-10-2020  09:30    <DIR>          .
09-10-2020  09:30    <DIR>          ..
09-10-2020  09:30    <DIR>          Sample
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.#4.eng.srt
09-10-2020  09:30            61.729 the.boys.s02e08.1080p.web.h264-cakes.dut.srt
09-10-2020  09:30            70.862 the.boys.s02e08.1080p.web.h264-cakes.eng.srt
09-10-2020  07:12     4.325.977.430 the.boys.s02e08.1080p.web.h264-cakes.mkv
09-10-2020  07:12               254 the.boys.s02e08.1080p.web.h264-cakes.nfo
09-10-2020  07:12             2.922 the.boys.s02e08.1080p.web.h264-cakes.srr
               6 File(s)  4.326.184.059 bytes
               3 Dir(s)  4.795.627.683.840 bytes free

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>

OK, I can make a small python script that does the renaming for you.
How do you want it to function?

  1. Rename all files in the current directory.
  2. Rename all files and get the directory via a commandline parameter
  3. Rename a specific file via a parameter on the commandline
  4. Something else
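As a rough idea of what option 2 could look like, here is a minimal Python sketch that renames `*.dut.srt` and `*.eng.srt` style files in a given directory to their two-letter equivalents. The function name `rename_subtitles`, the `CODES` table and the `dry_run` switch are made up for this example; this is not the script that was offered in the thread. It only looks at the last dot-separated part before `.srt`, so names with many dots (such as the release names above) are handled.

```python
from pathlib import Path

# Hypothetical mapping for this example; extend it with the codes you need.
CODES = {"dut": "nl", "nld": "nl", "eng": "en"}


def rename_subtitles(directory, codes=CODES, dry_run=False):
    """Rename '<name>.<3-letter>.srt' to '<name>.<2-letter>.srt'.

    Returns a list of (old_name, new_name) pairs for the files that matched.
    """
    renamed = []
    for path in sorted(Path(directory).glob("*.srt")):
        # Split 'show.s01e01.dut' into ('show.s01e01', '.', 'dut').
        base, sep, code = path.stem.rpartition(".")
        new_code = codes.get(code.lower())
        if not sep or new_code is None:
            continue  # no language part, or a code we do not map
        target = path.with_name(f"{base}.{new_code}{path.suffix}")
        if target.exists():
            continue  # never overwrite an existing subtitle
        if not dry_run:
            path.rename(target)
        renamed.append((path.name, target.name))
    return renamed
```

Calling it with `dry_run=True` first only reports what would be renamed, similar to the preview output of a bulk rename tool.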

Thank you very much for that offer but I figured it out by using Bulk Rename CLI
https://www.bulkrenameutility.co.uk/Download.php#DownloadBulkRenameCommand

C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES>%brc64% /DIR:"C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES" /PATTERN:"*.srt" /REPLACECI:.dut:.nl /REPLACECI:.eng:.en


Processing Folder C:\TEMP\Torrent\PROCD\TV\The.Boys.S02E08.1080p.WEB.H264-CAKES\
Filename the.boys.s02e08.1080p.web.h264-cakes.#4.eng.srt would be renamed to the.boys.s02e08.1080p.web.h264-cakes.#4.en.srt
Filename the.boys.s02e08.1080p.web.h264-cakes.dut.srt would be renamed to the.boys.s02e08.1080p.web.h264-cakes.nl.srt
Filename the.boys.s02e08.1080p.web.h264-cakes.eng.srt would be renamed to the.boys.s02e08.1080p.web.h264-cakes.en.srt

Ok, glad to be of a little assistance.

Really appreciate the offer!

@BenjV since you offered, would you be willing to take a look at https://github.com/jobrien2001/mkvstrip ?
The python script uses mkvmerge to remove unwanted subtitle and audio languages that might be part of the mkv but the script is crashing more and more and the original author does not respond.

It seems to be related to character encoding in the subtitle names (I think).
Here are 2 JSON outputs of files that crash or do not work.
1.txt
2.txt

And errors? Or trace back?

On the 1.json file it just does nothing; even with debug turned on in the script it simply stops.

Some of the errors I managed to "fix" by changing line 223 to

        process = subprocess.Popen(command, stdout=subprocess.PIPE, universal_newlines=True, encoding="utf8", errors='ignore')
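The idea behind that change can be sketched on its own: read the tool's output as bytes and decode it explicitly as UTF-8, instead of relying on the console code page (cp1252 on this system). The helper name `run_json` is made up for this example, and it assumes the command prints JSON on stdout, as the `json_data["tracks"]` lines in the trace further down suggest for mkvstrip's mkvmerge call. It uses `errors="replace"` rather than `errors="ignore"` so that damaged characters stay visible instead of silently disappearing.

```python
import json
import subprocess


def run_json(command):
    """Run a command and parse its stdout as JSON, tolerating bad bytes."""
    completed = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    # Decode explicitly as UTF-8 rather than with the locale's code page;
    # bytes that are not valid UTF-8 become U+FFFD instead of raising.
    text = completed.stdout.decode("utf-8", errors="replace")
    return json.loads(text)
```

A track name such as "España" then survives the round trip regardless of the Windows locale, as long as the tool itself writes UTF-8.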

For the second mkv

C:\TEMP\Torrent\PROCD>"C:\Python37\python.exe" %mkvstrip% -b %MKVMergeLocation% -v -l eng,dut -s eng,dut -r Forced C:\TEMP\Torrent\PROCD\TV\1
Searching for MKV files to process.
Warning: This may take some time...
Checking C:\TEMP\Torrent\PROCD\TV\1\Tehran.S01E01.Emergency.Landing.in.Tehran.1080p.ATVP.WEB-DL.DDP5.1.H.264-NTb.mkv

C:\TEMP\Torrent\PROCD>

Did a trace with Python (first time for everything 😄) and it seems that on the first file none of the subtitle languages are recognised.

 --- modulename: mkvstrip, funcname: __init__
mkvstrip.py(204):         self.lang = track_data["properties"].get("language", "und")
mkvstrip.py(205):         self.codec = track_data["codec"]
mkvstrip.py(206):         self.type = track_data["type"]
mkvstrip.py(207):         self.id = track_data["id"]
mkvstrip.py(208):         self.name = track_data["properties"].get("track_name")
mkvstrip.py(209):         self.forced = track_data["properties"].get("forced_track")
mkvstrip.py(243):             track_map[track_obj.type].append(track_obj)
mkvstrip.py(241):         for track_data in json_data["tracks"]:
mkvstrip.py(242):             track_obj = Track(track_data)

I can write a python script for you that uses ffmpeg to extract the subtitles from the video.
Not that complicated at all.

Something like:
Input: Videofile
Output: Videofile without subs and Dutch + English sub files
And of course skipping the Hearing Impaired subs.

Or I could mux those subs also into the output video.

What I want is to remove all embedded audio streams and subtitles that do not match the languages I set (for me EN and NL).
I keep the HI track, since for some shows only the HI track has the full English subtitle; the normal English track might only cover the Spanish-speaking parts, Narcos for instance.

I strip out the HI and other crap with SubtitleEdit, so I always end up with a clean English and Dutch subtitle

Input: Videofile
Embedded Audio: eng, dut, ger
Embedded Subs: eng, dut, ger

Output: Videofile
Embedded Audio: eng, dut
Embedded Subs: (eng, dut) or (none)
Extracted SRT: eng, dut

Ok, I can do that.
Do you want to keep the original input file, for example renamed with an .old extension, or shall I delete it?

Delete it

I think I know why the python script does nothing on the Tehran episode.
The only Audio Track is Hebrew so it will skip removing the subs.

C:\TEMP\Torrent\PROCD>"C:\Python37\python.exe" %mkvstrip% -b %MKVMergeLocation% -v -l eng,dut -s eng,dut -r Forced -t C:\TEMP\Torrent\PROCD\TV\1
Searching for MKV files to process.
Warning: This may take some time...
Checking C:\TEMP\Torrent\PROCD\TV\1\Tehran.S01E01.Emergency.Landing.in.Tehran.1080p.ATVP.WEB-DL.DDP5.1.H.264-NTb.mkv
REMOVE:  Track #1: heb - E-AC-3 - Name:None - Forced:False
REMOVE:  Track #2: heb - SubRip/SRT - Name:Forced - Forced:True
REMOVE:  Track #3: ara - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #4: bul - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #5: chi - SubRip/SRT - Name:Simplified Mandarin - Forced:False
REMOVE:  Track #6: chi - SubRip/SRT - Name:Traditional Mandarin - Forced:False
REMOVE:  Track #7: cze - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #8: dan - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #9: ger - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #10: gre - SubRip/SRT - Name:None - Forced:False
KEEP:  Track #11: eng - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #12: spa - SubRip/SRT - Name:Latin America - Forced:False
REMOVE:  Track #13: spa - SubRip/SRT - Name:Spain - Forced:False
REMOVE:  Track #14: est - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #15: fin - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #16: fre - SubRip/SRT - Name:Canada - Forced:False
REMOVE:  Track #17: fre - SubRip/SRT - Name:France - Forced:False
REMOVE:  Track #18: heb - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #19: heb - SubRip/SRT - Name:SDH - Forced:False
REMOVE:  Track #20: hin - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #21: hun - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #22: ind - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #23: ita - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #24: jpn - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #25: kor - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #26: lit - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #27: lav - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #28: may - SubRip/SRT - Name:None - Forced:False
KEEP:  Track #29: dut - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #30: nor - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #31: pol - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #32: por - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #33: por - SubRip/SRT - Name:Brazil - Forced:False
REMOVE:  Track #34: rus - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #35: slo - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #36: slv - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #37: swe - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #38: tam - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #39: tel - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #40: tha - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #41: tur - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #42: ukr - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #43: vie - SubRip/SRT - Name:None - Forced:False
REMOVE:  Track #44: chi - SubRip/SRT - Name:Cantonese - Forced:False

This is an example where the Forced EN subtitles are only for the non English parts and I actually need the SDH ones
https://partnerhelp.netflixstudios.com/hc/en-us/articles/224198488-What-is-a-Forced-Narrative-Subtitle-

Remuxing: The.Boys.S02E08.What.I.Know.2160p.AMZN.WEBRip.DDP5.1.x265-NTb.mkv
Title: None
============================
Retaining subtitle track(s):
    Track #3: eng - SubRip/SRT - Name:SDH - Forced:False
    Track #20: dut - SubRip/SRT - Name:None - Forced:False
Removing subtitle track(s):
    Track #2: eng - SubRip/SRT - Name:Forced - Forced:True
    Track #4: ara - SubRip/SRT - Name:None - Forced:False
    Track #5: dan - SubRip/SRT - Name:None - Forced:False
    Track #6: ger - SubRip/SRT - Name:None - Forced:False
    Track #7: spa - SubRip/SRT - Name:Latinoamérica - Forced:False
    Track #8: spa - SubRip/SRT - Name:España - Forced:False
    Track #9: fin - SubRip/SRT - Name:None - Forced:False
    Track #10: fil - SubRip/SRT - Name:None - Forced:False
    Track #11: fre - SubRip/SRT - Name:None - Forced:False
    Track #12: heb - SubRip/SRT - Name:None - Forced:False
    Track #13: hin - SubRip/SRT - Name:None - Forced:False
    Track #14: ind - SubRip/SRT - Name:None - Forced:False
    Track #15: ita - SubRip/SRT - Name:None - Forced:False
    Track #16: jpn - SubRip/SRT - Name:None - Forced:False
    Track #17: kor - SubRip/SRT - Name:None - Forced:False
    Track #18: may - SubRip/SRT - Name:None - Forced:False
    Track #19: nor - SubRip/SRT - Name:Norsk Bokmål - Forced:False
    Track #21: pol - SubRip/SRT - Name:None - Forced:False
    Track #22: por - SubRip/SRT - Name:Brasil - Forced:False
    Track #23: por - SubRip/SRT - Name:Portugal - Forced:False
    Track #24: rus - SubRip/SRT - Name:None - Forced:False
    Track #25: swe - SubRip/SRT - Name:None - Forced:False
    Track #26: tam - SubRip/SRT - Name:None - Forced:False
    Track #27: tel - SubRip/SRT - Name:None - Forced:False
    Track #28: tha - SubRip/SRT - Name:None - Forced:False
    Track #29: tur - SubRip/SRT - Name:None - Forced:False

OK, I will extract Dutch, German and English subtitles.
If there are no normal English subs then I will extract the SDH subtitles.
Extract nothing if none of the above is present.
Create a video file with a video stream, an English, Dutch or German audio stream, and no subtitle stream in the container.

Be aware that the stream identifiers are set by the creator of the video file, and that they are sometimes sloppy or just use names other than the ISO identifiers.

Also, that example you gave is very strange: it has an English sub that covers just a small part of the movie, and an SDH English sub for the whole movie, but that SDH is actually a normal subtitle.
There is no way a script can anticipate such strange configurations.

I don't want German 😀
That is indeed what is sometimes a bit irritating.
That is why I throw away the subs that have the name Forced, as 99% of the time that is only the English translation of foreign speech.
When using the SDH I can strip out all the HI and other stuff with Subtitle Edit and have two clean and perfect subtitles to my liking 😀.

Forced subtitles are used for situations where you watch a video without subtitles in, for example, English and somebody speaks a few lines in French.
For that French part they use a forced English subtitle, so only that part is subtitled.

Correct, that is why I strip out the Forced sub, we like to have English subs for all the spoken parts.
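The selection rule described here (keep subtitle tracks in the wanted languages, drop anything flagged or named Forced, keep SDH) can be sketched in a few lines of Python. The field names `properties`, `language`, `track_name`, `forced_track`, `type` and `id` are taken from the mkvstrip trace earlier in the thread; the function name `select_subtitles` and the track type value `"subtitles"` are assumptions made for this example, and this is not mkvstrip's own code.

```python
def select_subtitles(tracks, wanted=("eng", "dut")):
    """Return the ids of subtitle tracks to keep.

    A track is kept when its language is wanted and it is neither flagged
    as forced nor named 'Forced'. SDH tracks are kept on purpose, so they
    can be cleaned up afterwards.
    """
    keep = []
    for track in tracks:
        if track.get("type") != "subtitles":  # assumed type value
            continue
        props = track.get("properties", {})
        lang = props.get("language", "und")
        name = props.get("track_name") or ""
        forced = bool(props.get("forced_track")) or name.lower() == "forced"
        if lang in wanted and not forced:
            keep.append(track["id"])
    return keep
```

Applied to the The Boys listing above, this keeps tracks 3 (eng, SDH) and 20 (dut) and drops track 2 (eng, Forced), which matches the "Retaining subtitle track(s)" output.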

Request to be added to Babelfish: https://github.com/Diaoul/babelfish
