Universalmediaserver: Filenames with non-ASCII characters cannot be parsed on OS X 10

Created on 13 Aug 2014  ·  32 Comments  ·  Source: UniversalMediaServer/UniversalMediaServer

Files whose names contain non-ASCII characters cannot be parsed: the log shows "The file {} could not be parsed. It will be hidden", where {} is the pathname. There is nothing special about the video file itself; if I rename it using only ASCII characters, UMS shows and plays it normally.

This was tested on OS X 10.9 using UMS 4.0.2 (Java 8).

confirmed


The same problem remains on OS X 10.11.1 with UMS 5.3.0 (Java 8u66).

If a filename contains a non-ASCII character (such as Chinese), the log shows "could not be parsed. It will be hidden", and the file does not appear in the list.

This problem is caused by a bug in MediaInfo. They won't "prioritize" it until someone pays them to do so.

A possible workaround is to set mediainfo = false in the renderer configuration for the affected renderer(s).

Was about to log a new bug report when I came across this one. Seems this is still a problem.

On OS X you can type the Æ character by using Shift + Option + '

Variant 1: /Volumes/Media/PathWithÆCharacter/Movie.mkv

  • Clients will display the PathWithÆCharacter folder.
  • Clients will not display any files within the folder.
  • UMS Logs tab contains the message:

time INFO The file /Volumes/Media/PathWithÆCharacter/Movie.mkv could not be parsed. It will be hidden

Variant 2: /Volumes/Media/PathWithoutCharacter/MovieWithÆCharacter.mkv

  • The PathWithoutCharacter folder will appear in clients, but the MovieWithÆCharacter file will not.
  • UMS Logs tab contains the message:

time INFO The file /Volumes/Media/PathWithoutCharacter/MovieWithÆCharacter.mkv could not be parsed. It will be hidden

Renaming both folders and files to remove Æ characters allows them to behave normally. Setting mediainfo = false is not a suitable workaround as it breaks client configs that depend on it.

This was tested on:

  • OS X 10.10.5 Yosemite using UMS v5.4.0 (Java 8)
  • OS X 10.10.5 Yosemite using UMS v6.0.0 (Java 8)

@antman2 As explained in the post above yours, the bug is not in UMS but in MediaInfo, which UMS uses. The people behind MediaInfo won't fix it until someone pays them. The bug report and further information are available here.

The only thing you can do on the UMS side is to stop using MediaInfo. Whether UMS uses MediaInfo depends on the renderer profile, so you need to edit your renderer configuration file to disable it. There are disadvantages to this, though: MediaInfo is used to get information about the media files so that UMS can make the right decisions about how to handle each file. Without that information, some files probably won't play.

@Nadahar: As you're no doubt aware UMS 6.0.0 upgraded MediaInfo to 0.7.81. I was just letting you know that this upgrade didn't fix this issue, even though other MediaInfo-related issues might now be resolved.

@antman2 If you look at the MediaInfo bug report I've linked to it's not started, and thus it's not fixed in 0.7.81. It probably won't be in any future versions either unless someone decides to pay the MediaInfo author to do it.

I personally don't agree with running open source projects this way, especially when it comes to bugs (a feature request would be more understandable), but we're as powerless as anyone to influence the MediaInfo author. He does accept patches though, so anyone that knows C++ and can fix the bug can submit a patch. The choice given is simply: Fix it yourself or pay someone to do it.

@Nadahar: Yep, I totally agree that it seems like a crummy way to handle what smells like a bug report. I understand, however, that people have a life away from software development and that sometimes life changes, so they just don't have the time to devote to their favourite projects after hours any more.

Looking at the source is already on my to-do list, although I haven't looked at C/C++ in probably 30 years (I used to maintain an ISP's CircleMUD a long, long time ago).

@antman2 You know, maintaining a project for all minority OSes is sometimes difficult. I am speaking about MediaInfo, not about UMS, where we try to do our best. Until MediaInfo fixes it, my recommendation is to avoid such strange characters in folder and file names and stick to standard ASCII characters, or to follow @Nadahar's suggestion.

@valib I don't know what the bug is, but I wouldn't consider Unicode support in OS X to be a "rare case". In both Norwegian and Danish, Æ, Ø and Å are as common as A, B and C.

I consider the bug to be a fundamental one, and since Unicode is supported on Windows it must be a pretty simple one to fix as well.

@Nadahar Thanks for the explanation about that character; I saw it for the first time :-)
So it is similar to some characters in Czech, and in that case I wonder why it works on Windows and not on OS X. Anyway, I agree with you that it is a fundamental bug.

Well, OK, I've had a good look through the libmediainfo 0.7.82 (AllInclusive) source and I don't see anything wrong with it, despite the bug report.

The problem is that the JNA bindings in net.pms.dlna.MediaInfo aren't doing the right thing for OS X. I'm guessing that they were generated on a Windows machine in the presence of a _UNICODE define. So, a couple of updates to make things work...

In MediaInfo.java, update MediaInfoDLL_Internal.getFunctionName to:

    @Override
    public String getFunctionName(NativeLibrary lib, Method method) {
        // e.g. MediaInfo_New(), MediaInfo_Open() ...
        String result = "MediaInfo_" + method.getName();
        // Compare string contents with equals(); == only compares references
        if ("OpenA".equals(method.getName())) {
            result = "MediaInfoA_Open";
        }
        return result;
    }

In MediaInfo.java, add an OpenA method to complement MediaInfoDLL_Internal's Open method:

    // File
    int Open(Pointer Handle, WString file);
    int OpenA(Pointer Handle, String file);

In MediaInfo.java, replace the current Open method with:

    // File
    /**
     * Open a file and collect information about it (technical information and tags).
     *
     * @param File_Name full name of the file to open
     * @return 1 if file was opened, 0 if file was not opened
     */
    public int Open(String File_Name) {
        if (Platform.isMac()) {
            return MediaInfoDLL_Internal.INSTANCE.OpenA(Handle, File_Name);
        } else {
            return MediaInfoDLL_Internal.INSTANCE.Open(Handle, new WString(File_Name));
        }
    }

I don't think you'll need to make similar changes to the other functions with WString/String parameters, as those take MediaInfo parameter names, which AFAIK are all ASCII and shouldn't pose a problem. But if you want to, you can use nm to enumerate the available functions and implement them as per MediaInfoDLL.h:

nm -g libmediainfo.dylib | grep _MediaInfo_
nm -g libmediainfo.dylib | grep _MediaInfoA_

Cutting out all the guff in MediaInfoLib/Source/Example/HowToUse_Dll.JNA.java and testing against Universal Media Server.app/Contents/Resource/libmediainfo.dylib I was able to open and gather information on filenames such as the following:

ByFileName ("Aeon Flux (2005)/Aeon Flux (2005).mkv"); //plain old ASCII
ByFileName ("Æon Flux (2005)/Æon Flux (2005).mkv");
ByFileName ("\u00C6on Flux (2005)/\u00C6on Flux (2005).mkv");
ByFileName ("Ghost In The Shell - Stand Alone Complex [攻殻機動隊]/01 run rabbit junk (高橋ひでゆき).mp3"); //Japanese

Now Java and C++ aren't my usual strong points, so you might want to consult some higher powers to confirm that this is the/a right thing to do, but I hope this helps.

@antman2 This is really great work. I'll look into it and see if I "understand" the changes. If not, I will ask you here.

Using MediaInfoA_Open means using the legacy version. See https://mediaarea.net/cs/MediaInfo/Support/SDK/Quick_Start But if it helps, why not. It definitely needs testing.

@antman2 I don't quite understand, from what I can see you fall back to ANSI. Why is that necessary for OS X? That will make it codepage dependent, and that's not something we'd really want. Also, I'm not sure about this, but I think the bug is the same for Linux.

Is Unicode somehow different on OS X, or is the problem the unicode "encoding" (UTF-8, UTF-16, UTF-32)?

@Nadahar: Try not to think about code pages and your life will be better. :) My reading of the libmediainfo code is that they do what all good library authors do and subscribe to the UTF-8 Everywhere mantra.

The MediaInfo_Char parameters are wchar_t when _UNICODE is defined; otherwise they are char. Internally libmediainfo uses libzen's Ztring class to pass string data around: both wchar_t and char input are converted to Ztring as soon as possible, and char input is assumed to be UTF-8.

I just want to ignore JNA's WString class. I haven't written unit tests to prove it but it behaves like it breaks UTF-8 Strings by blindly padding them out with 0x00 bytes instead of doing a real UTF-8 to UTF-16 conversion. In other words I think a UTF-8 Æ gets converted from 0xC3 0x86 to 0x00C3 0x0086 instead of to 0x00C6. Can't really blame it here as there's no way to know what encoding a String may be using. Maybe it does the RightThing(tm) on the Windows JRE but I don't have one of those to test with.
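To make the suspected failure mode concrete, here is a minimal Java sketch that contrasts a real UTF-8 decode of Æ with the "blind padding" described above. The class and helper names are made up for illustration, and the sketch does not reproduce what JNA's WString actually does; it only shows the two outcomes being compared.

```java
import java.nio.charset.StandardCharsets;

public class PaddingVsDecoding {
    /** Hypothetical helper: widens each byte to one UTF-16 code unit without decoding. */
    public static String padBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    /** Formats every UTF-16 code unit of s as 0xNNNN, separated by spaces. */
    public static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // UTF-8 encoding of Æ (U+00C6) is the two bytes 0xC3 0x86
        byte[] utf8 = "\u00C6".getBytes(StandardCharsets.UTF_8);

        // Proper decode: one code unit
        System.out.println("decoded: " + hex(new String(utf8, StandardCharsets.UTF_8))); // decoded: 0x00C6
        // Byte-wise padding: two unrelated code units
        System.out.println("padded:  " + hex(padBytes(utf8)));                           // padded:  0x00C3 0x0086
    }
}
```

If the native side received the padded form, it would look for a file whose name starts with two different characters, which would be consistent with the "could not be parsed" message.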

I didn't think about Linux, but it wouldn't surprise me if it were broken there as well. If that's the case, then this version of MediaInfo.Open() should work there as well:

    // File
    /**
     * Open a file and collect information about it (technical information and tags).
     *
     * @param File_Name full name of the file to open
     * @return 1 if file was opened, 0 if file was not opened
     */
    public int Open(String File_Name) {
        if (!Platform.isWindows()) {
            return MediaInfoDLL_Internal.INSTANCE.OpenA(Handle, File_Name);
        } else {
            return MediaInfoDLL_Internal.INSTANCE.Open(Handle, new WString(File_Name));
        }
    }
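The decision logic of the proposed patch can be tried without JNA or the native library. The sketch below models only the two choices involved: which Java binding method to call on a given OS, and which exported symbol that method maps to. The class and method names are hypothetical, and `os.name` matching stands in for JNA's `Platform` checks.

```java
import java.util.Locale;

public class OpenBindingSketch {
    /** Maps a Java binding method name to the exported symbol, as in the patched getFunctionName. */
    public static String nativeName(String javaMethodName) {
        if ("OpenA".equals(javaMethodName)) {
            return "MediaInfoA_Open";
        }
        return "MediaInfo_" + javaMethodName;
    }

    /** Picks the binding method: the wide-string Open on Windows, OpenA everywhere else. */
    public static String openMethodFor(String osName) {
        boolean windows = osName.toLowerCase(Locale.ROOT).startsWith("windows");
        return windows ? "Open" : "OpenA";
    }

    public static void main(String[] args) {
        // Typical values of the os.name system property
        for (String os : new String[] {"Mac OS X", "Linux", "Windows 10"}) {
            String method = openMethodFor(os);
            System.out.println(os + " -> " + method + " -> " + nativeName(method));
        }
    }
}
```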

@antman2 I try to avoid code pages and use unicode where possible, but not thinking about it isn't what I consider a good idea :wink:

Java String is UTF-16 internally as far as I know, and WString is just a simple wrapper for String. Nothing is UTF-8 internally in Java; you have to explicitly convert to a UTF-8 byte array or "stream" (which is essentially the same, bytes) to get UTF-8. Therefore I don't understand how passing String to the ANSI call can work. It's somewhat unclear to me at what stage and how the UTF-16 String is converted to bytes, and whether the result is UTF-8 or Latin1. A problem here is that Æ is also 0x00C6 in Latin1/8859-1. Given that you use Nordic characters, your non-Unicode codepage is probably 8859-1, and thus there's no way to tell if UTF-8 or simple ANSI is actually being used.

In your test I see that you use Chinese characters, but was that done calling the ANSI version (MediaInfoA_Open)?
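The reason CJK test names matter here can be shown in plain Java: Æ can be encoded in both ISO-8859-1 and UTF-8, so a test with Æ alone cannot rule out a Latin-1 code path, whereas CJK characters have no Latin-1 mapping at all. This is a small sketch with made-up class and method names; it says nothing about which encoding JNA picks when passing a String to native code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingProbe {
    /** True if s survives an encode/decode round trip in the given charset. */
    public static boolean roundTrips(String s, Charset cs) {
        return new String(s.getBytes(cs), cs).equals(s);
    }

    public static void main(String[] args) {
        String ae = "\u00C6on Flux";   // starts with Æ (U+00C6)
        String cjk = "\u653B\u6BBB";   // two CJK ideographs

        // Æ is one byte in Latin-1 but two bytes in UTF-8
        System.out.println(ae.getBytes(StandardCharsets.ISO_8859_1).length); // 8
        System.out.println(ae.getBytes(StandardCharsets.UTF_8).length);      // 9

        // Both charsets can represent Æ, so it cannot discriminate between them
        System.out.println(roundTrips(ae, StandardCharsets.ISO_8859_1));  // true
        System.out.println(roundTrips(ae, StandardCharsets.UTF_8));       // true

        // CJK characters are replaced by '?' in Latin-1, so only UTF-8 round-trips
        System.out.println(roundTrips(cjk, StandardCharsets.ISO_8859_1)); // false
        System.out.println(roundTrips(cjk, StandardCharsets.UTF_8));      // true
    }
}
```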

I'm sorry to confirm this bug (probably) also occurs on Linux (Ubuntu 14.04 with an ext4 FS) on the latest stable 6.1.0.
A lot of entries never showed up in the DLNA list until I wrote a small Python script to remove all the non-ASCII characters from filenames. But since this is a destructive one-way operation and the filenames were valid, I'm not too happy with that solution :(
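For anyone who wants to see which files are affected before renaming anything, a read-only check is enough. The following is a minimal sketch, not part of UMS; the class name and the choice to test only the last path element are assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class NonAsciiNames {
    /** True if every character in the name is in the 7-bit ASCII range. */
    public static boolean isAscii(String name) {
        return name.chars().allMatch(c -> c < 0x80);
    }

    public static void main(String[] args) throws IOException {
        // Walk the given folder (default: current directory) and print
        // every entry whose own name contains a non-ASCII character.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        try (Stream<Path> paths = Files.walk(root)) {
            paths.filter(p -> p.getFileName() != null)
                 .filter(p -> !isAscii(p.getFileName().toString()))
                 .forEach(System.out::println);
        }
    }
}
```

Nothing is modified on disk, so the listing can be reviewed first and the original names kept.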

Is there a way you guys can work around the media info bug? Kinda crappy way to handle bug reports by this guy - especially for something as basic as filename handling.

I wonder why PS3 Media Server did not encounter this issue.

I apologize if this is not appropriate, but if it can help solve the problem without a long wait for a MediaInfo bug fix, it's a small price to pay.
https://github.com/sbraz/pymediainfo/issues/22
https://github.com/sbraz/pymediainfo/commit/ec3b541cd2e766fa94756a0da8d103ca1ffc48a4
https://github.com/sbraz/pymediainfo/commit/56d407397053522e1822d06b98a8470754ed29c9
This person takes a different approach:
https://www.filebot.net/forums/viewtopic.php?f=8&t=766

@Sami32 Great work Sami, both approaches are promising, but I'm not sure which one is best. The first one is a "hack", but if it works... The second has another benefit: as far as I remember there are issues with MediaInfo and network shares as well. I don't think MediaInfo will read files on a network share (on Windows?). If UMS reads the file and passes the bytes, that problem would be gone. The problem with that approach is large files (movies). I'm guessing that MediaInfo only reads some of the file, and/or seeks in the file while parsing formats where the headers don't provide all the necessary information. That possibility is lost, and UMS would have to read several gigabytes and pass them on. I don't know if there's a way to pass just "parts of a file" that will work reliably, and how should UMS figure out what parts to send..?
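One way to picture "passing parts of a file" is a loop in which the caller reads a chunk, hands it to the parser, and lets the parser say where it wants to read next. The sketch below is purely illustrative: `ChunkParser` is a hypothetical interface invented here, not a MediaInfo or UMS API, and the fake parser only simulates a format that needs the start and the end of a file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ChunkedFeed {
    /** Hypothetical parser contract (not a real library API). */
    public interface ChunkParser {
        /** Consumes len bytes read at offset; returns the next offset to read, or -1 when done. */
        long feed(byte[] buf, int len, long offset);
    }

    /** Feeds chunks to the parser, seeking wherever it asks; returns the total bytes read. */
    public static long feedFile(Path file, ChunkParser parser, int chunkSize) throws IOException {
        long total = 0;
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[chunkSize];
            long length = raf.length();
            long offset = 0;
            while (offset >= 0 && offset < length) {
                int n = (int) Math.min(chunkSize, length - offset);
                raf.seek(offset);
                raf.readFully(buf, 0, n);
                total += n;
                offset = parser.feed(buf, n, offset);
            }
        }
        return total;
    }

    /** Demo: a 1 MiB file and a fake parser that wants the first and last 4 KiB only. */
    public static long demo() {
        try {
            Path tmp = Files.createTempFile("chunk-demo", ".bin");
            try {
                final long size = 1L << 20;
                Files.write(tmp, new byte[(int) size]);
                ChunkParser fake = (buf, len, offset) -> offset == 0 ? size - 4096 : -1;
                return feedFile(tmp, fake, 4096);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("bytes read: " + demo()); // bytes read: 8192
    }
}
```

In this arrangement the file is opened by Java, so the filename never crosses into native code, and only the requested regions are read rather than the whole movie. Whether a real parser can drive such a loop reliably is exactly the open question raised above.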

@Thanks :-)
As I don't have a Mac I cannot test it myself, but if you can remember which version of MediaInfo was able to deal with Unicode, I could check the source code and see whether a patch could easily be submitted.

@Sami32 As far as I know this has never worked for MediaInfo, the bug is still unresolved. MediaInfo deals with Unicode on Windows, but not for Linux and OS X when the lib is called. MediaInfo as an application has no problems with it, the problem is probably just that MediaInfo doesn't convert UTF-16/32 to UTF-8 before sending it to the OS for Linux and OS X.

@Nadahar As mentioned in #466, it works fine with Java 6. It could be similar to @Rednoa's NFD trick, since Java 6 processes Unicode differently.
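The thread does not spell out what the NFD trick from #466 involves. Assuming it refers to Unicode normalization form D (canonical decomposition), which is the form HFS+ on OS X uses for stored filenames, the following sketch shows what that normalization does to two of the characters discussed here. The class and helper names are made up for illustration.

```java
import java.text.Normalizer;

public class NormalizationForms {
    /** Lists the code points of s as U+NNNN, separated by spaces. */
    public static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    /** Converts s to canonical decomposed form (NFD). */
    public static String nfd(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // Å decomposes into A plus a combining ring above
        System.out.println(codePoints(nfd("\u00C5"))); // U+0041 U+030A
        // Æ has no canonical decomposition and stays as it is
        System.out.println(codePoints(nfd("\u00C6"))); // U+00C6
    }
}
```

Since Æ is unchanged by NFD, normalization on its own would not obviously explain the Æ cases reported earlier in the thread, though it could matter for characters such as Å.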

Just an aside: since MediaInfo won't parse some formats, e.g. DSF, I've introduced a fallback to ffmpeg in #987. My life is much better now ;)

@Sami32 I'd say that's an oversimplification 😛

There might be a way to fix it using C/C++, but I have no idea how or where to call these functions from Java. The problem is that they aren't part of the MediaInfo library, but are probably some standard C calls. They aren't available to us directly, and I'm not sure how to figure out what to do to call them. The "caller" is actually the JNA native binary, so we probably have to get that to call the function, which is something we can't do unless there's already a way to do that in JNA (which there might be).

@nadahar Yes, I do love simplification :stuck_out_tongue_winking_eye:
Why not ask the users to change their locale :wink:

@Sami32 How? Both Linux and OS X have UTF-8 as the "default" character set as far as I know, and so does UMS. This obviously isn't that simple.

@Nadahar Oversimplification? It seems that you liked the information given in the _MediaInfo_ thread...

@Sami32 I did 👍 It's just that you said "Here a way to fix it". That wasn't true, but it gave me the information I needed 😄

@Nadahar Yes, you're right, my English is too weak; I should have written "Here is the information needed to fix it". Sorry.

Anyway, I'm happy when I can help ;-)
Good job by the way :+1:
