Beets: Formatting function to ASCIIfy punctuation only

Created on 12 May 2017 · 8 Comments · Source: beetbox/beets

I've been using beets for a couple of years now and I love it. There's a minor annoyance for me that I've noticed since the beginning and have more or less ignored, but I thought I'd finally ask if there's anything I can do about it. Apologies if I've missed an existing solution in the config guide or setup.

Problem

When beets imported my library it Unicode-ified a lot of previously plaintext ASCII tags & filenames. For example, "El-P - I'll Sleep When You're Dead" becomes "El‐P - I’ll Sleep When You’re Dead" (in both files & tags.)

These look almost the same, but the punctuation is Unicode-ified:

The dash which was U+002D (ASCII "hyphen-minus") is now U+2010 ("hyphen")
The apostrophe U+0027 is now a right single quotation mark U+2019.

This isn't beets' doing: if I download the JSON results for the MusicBrainz link, these Unicode characters are already used there.

Similar things apply to other punctuation marks; this is just a good example as it has two of them. :)
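
To check exactly which characters a tag contains, Python's standard unicodedata module can report each code point and its official name. This is only a small inspection sketch; the helper name describe_non_ascii is made up for illustration.

```python
import unicodedata


def describe_non_ascii(text):
    """Return (char, 'U+XXXX', name) for every non-ASCII character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
        if ord(ch) > 127
    ]


# The album title with the two Unicode punctuation marks discussed above.
title = "El\u2010P - I\u2019ll Sleep When You\u2019re Dead"
for _char, codepoint, name in describe_non_ascii(title):
    print(codepoint, name)
```

Running the same helper on the plain ASCII spelling returns an empty list, which makes it a quick way to spot tags that were changed on import.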

The annoyance is:

Not all players can render UTF-8 tags properly (Kodi on Android seems to struggle, seems related to #1893)
Some (most?) players will not return results containing the different glyph in the tag if you type a simple punctuation character in the search field, i.e. typing "hyphen-minus" on the keyboard will not match "hyphen". (I use quodlibet and it treats these as different.)
I use Linux and the command line renders the UTF-8 characters fine, but I have the same "gotcha" when I go to type the glyphs.
MusicBrainz doesn't seem to be entirely consistent in how it applies these. For example, I have some tags "El-P" and some tags "El‐P" (ASCII hyphen-minus vs. Unicode hyphen).
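
The search mismatch comes from the two spellings being genuinely different strings. Standard Unicode normalization does not fold them together either, since U+2010 and U+2019 have no decomposition mapping. A minimal sketch using only the standard library:

```python
import unicodedata

ascii_name = "El-P"          # U+002D HYPHEN-MINUS
unicode_name = "El\u2010P"   # U+2010 HYPHEN

# A plain comparison, which is roughly what a naive search does, fails.
print(ascii_name == unicode_name)

# NFKC compatibility normalization leaves U+2010 untouched, so it does not help.
normalized = unicodedata.normalize("NFKC", unicode_name)
print(normalized == ascii_name)
```

This is why an explicit replacement table (or something like unidecode) is needed rather than a normalization pass.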

I know that I can fix this for files by enabling "asciify", and it looks like this was dealt with for the Lyrics plugin in #270. However, as well as Latin-character albums, I also have a bunch with names in non-Latin script, so I actually want Unicode for things which I can't effectively represent in ASCII.

I guess my dream feature would be a "sanitise punctuation" option where these almost-the-same-as-an-ASCII-character punctuation glyphs get swapped for their ASCII versions in both tags and filenames, but anything else gets left as UTF-8.
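
As a rough illustration of what such an option could do, here is a sketch built on str.translate. The mapping and the name sanitise_punctuation are hypothetical and not part of beets; the table only covers a handful of common look-alikes and would need extending.

```python
# Hypothetical table of look-alike punctuation and its ASCII stand-in.
PUNCTUATION_MAP = str.maketrans({
    "\u2010": "-",    # hyphen
    "\u2011": "-",    # non-breaking hyphen
    "\u2013": "-",    # en dash
    "\u2014": "-",    # em dash
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2026": "...",  # horizontal ellipsis
})


def sanitise_punctuation(text):
    """Swap look-alike punctuation for ASCII; leave every other character alone."""
    return text.translate(PUNCTUATION_MAP)


print(sanitise_punctuation("El\u2010P - I\u2019ll Sleep When You\u2019re Dead"))
```

Because only the listed code points are touched, names in non-Latin scripts pass through unchanged.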

I understand that this is a lot more to do with the design of Unicode than the design of beets (and that some people actually care about the distinction between hyphen-minus and hyphen, I just don't care in this case!)

I'd be happy to look into writing a patch for a feature like the above, if that's potentially acceptable. The approach discussed in #270 for lyrics (i.e. find-replace) seems applicable.

Setup

OS: Linux
Python version: 3.6.1
beets version: 1.4.3
Turning off plugins made problem go away (yes/no):

feature

Source

projectgus

All 8 comments

Hi! Thanks for the discussion—this is a fairly frequent question, but it's not usually as clearly elaborated as it is here.

It sounds like there are two separate issues:

Just ASCIIfying a pre-defined set of punctuation, like “ to ". You might imagine defining a cousin to %asciify{} called %asciify_punct{} or something.
Applying these changes to tags, not just files. This is more or less the domain of the longstanding request in #488 for a way to apply our powerful templating system to actually modify metadata, including doing that automatically on import.
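
For the first item, one way this could be prototyped outside of beets core is a small plugin that registers a template function through the template_funcs dictionary that beets plugins expose. The sketch below is an assumption about how that might look, not an existing feature: the plugin class, the function asciify_punct, and the short mapping are all hypothetical.

```python
try:
    from beets.plugins import BeetsPlugin
except ImportError:
    # Lets the helper below be tried out without beets installed.
    BeetsPlugin = object

# Hypothetical, deliberately small table of look-alike punctuation.
_PUNCT = str.maketrans({
    "\u2010": "-", "\u2013": "-",
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def asciify_punct(text):
    """Template function: replace look-alike punctuation with ASCII."""
    return text.translate(_PUNCT)


class AsciifyPunctPlugin(BeetsPlugin):
    def __init__(self):
        super().__init__()
        # Makes %asciify_punct{...} available in path format templates.
        self.template_funcs["asciify_punct"] = asciify_punct
```

With such a plugin enabled, a path format could use something like %asciify_punct{$album}. As with %asciify{}, this would only affect paths, not tags.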

Does that sound like an accurate synopsis?

sampsyo on 12 May 2017

As a stopgap, you may be interested in the "replace" section of config.yaml. It works solely on paths, not tags. The backslash may not be needed; I adapted this from my own config, which uses many odd escape characters.

replace:
    '[‐]': '-'
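
The beets documentation describes "replace" as a set of regular expression/replacement pairs applied to filenames. The sketch below imitates that behaviour with re.sub to show what a slightly larger set of rules would do to a path component; REPLACEMENTS and apply_replacements are hypothetical names, not beets code.

```python
import re

# Hypothetical rules: regular expression -> replacement text.
REPLACEMENTS = {
    "[\u2010\u2011\u2013]": "-",   # hyphen, non-breaking hyphen, en dash
    "[\u2018\u2019]": "'",         # single quotation marks
}


def apply_replacements(path_component, replacements=REPLACEMENTS):
    """Apply each regex/replacement pair in turn to one path component."""
    for pattern, replacement in replacements.items():
        path_component = re.sub(pattern, replacement, path_component)
    return path_component


print(apply_replacements("El\u2010P - I\u2019ll Sleep When You\u2019re Dead"))
```

Inside a character class the backslash in the original pattern is redundant, which is why it is left out here.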

Sampsyo's summary is great. The first item looks like the way to go, especially with asciify_punct. I'm not a beets contributor / maintainer, so my opinion isn't as important as that of the people who dig into the code and make it work. In the long term, #488 would also be awesome, but if it were easy it would probably be done already.

RollingStar on 14 May 2017

Hi @sampsyo & @RollingStar ,

Thanks for the great synopsis @sampsyo and the suggestion @RollingStar .

I think the synopsis is accurate, inasmuch as those two changes would solve this for me perfectly. I hadn't seen #488, thanks for the heads-up.

projectgus on 14 May 2017

Cool. I'm marking this as a feature request for the first part: a version of "asciify" that only affects punctuation.

sampsyo on 14 May 2017

👍2

Any news for this? I'd like to see it affecting tags as well, as Last.FM seems to not auto-correct U+2019 to U+0027 and vice-versa.

RodrigoLeiteF on 22 Jul 2018

👍1

Apologies for bumping this issue, but it would really be great to have this working as the previous comment suggests.

Thanks for the great tool!

imiric on 4 Jun 2019

@imiric I have the same desire, and have a hacky fix that works for my purposes. I have a local version of the beets repo that I have patched with these changes:

--- a/beets/autotag/__init__.py
+++ b/beets/autotag/__init__.py
@@ -26,6 +26,9 @@ from .hooks import AlbumInfo, TrackInfo, AlbumMatch, TrackMatch  # noqa
 from .match import tag_item, tag_album, Proposal  # noqa
 from .match import Recommendation  # noqa

+from unidecode import unidecode
+
 # Global logger.
 log = logging.getLogger('beets')

@@ -35,10 +38,12 @@ log = logging.getLogger('beets')
 def apply_item_metadata(item, track_info):
     """Set an item's metadata from its matched TrackInfo object.
     """
-    item.artist = track_info.artist
+    item.artist = unidecode(track_info.artist)
     item.artist_sort = track_info.artist_sort
     item.artist_credit = track_info.artist_credit
-    item.title = track_info.title
+    item.title = unidecode(track_info.title)
     item.mb_trackid = track_info.track_id
     if track_info.artist_id:
         item.mb_artistid = track_info.artist_id
@@ -62,14 +67,16 @@ def apply_metadata(album_info, mapping):
     """Set the items' metadata to match an AlbumInfo object using a
     mapping from Items to TrackInfo objects.
     """
     for item, track_info in mapping.items():
         # Album, artist, track count.
         if track_info.artist:
-            item.artist = track_info.artist
+            item.artist = unidecode(track_info.artist)
         else:
-            item.artist = album_info.artist
-        item.albumartist = album_info.artist
-        item.album = album_info.album
+            item.artist = unidecode(album_info.artist)
+        item.albumartist = unidecode(album_info.artist)
+        item.album = unidecode(album_info.album)

         # Artist sort and credit names.
         item.artist_sort = track_info.artist_sort or album_info.artist_sort
@@ -102,7 +109,7 @@ def apply_metadata(album_info, mapping):
                     item[suffix] = value

         # Title.
-        item.title = track_info.title
+        item.title = unidecode(track_info.title)

This ensures things like dashes, quotes, etc. are simplified to ASCII.

lee-reinhardt on 4 Jun 2019

👍2

The post above was a great starting point for me. My copy is calling a little utility function to only decode the punctuation:

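from unidecode import unidecode


def pundecode(text):
    """ASCIIfy every non-alphabetic character; leave letters untouched."""
    # str.isalpha() is True for letters in any script, so non-Latin names
    # survive while punctuation such as U+2019 is transliterated.
    return "".join(
        char if char.isalpha() else unidecode(char)
        for char in text
    )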
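
One thing to be aware of: str.isalpha() is a fairly broad filter. Combining accents and non-Latin digits are not alphabetic, so they would also be handed to unidecode. If only punctuation should be touched, one possible alternative is to test the Unicode general category instead. This is a sketch using only the standard library; is_punctuation is a hypothetical helper.

```python
import unicodedata


def is_punctuation(char):
    """True if the character's Unicode general category is one of the P* classes."""
    return unicodedata.category(char).startswith("P")


# Right single quote, hyphen, combining acute accent, Arabic-Indic digit, e-acute.
for char in ["\u2019", "\u2010", "\u0301", "\u0663", "\u00e9"]:
    print(f"U+{ord(char):04X} isalpha={char.isalpha()} punctuation={is_punctuation(char)}")
```

Swapping the character.isalpha() test for "not is_punctuation(character)" would confine the transliteration to punctuation, at the cost of leaving symbols and other non-letters alone.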