Beets: Importer duplicate detection: configurable field set

Created on 6 Dec 2014  路  8Comments  路  Source: beetbox/beets

Currently, beets uses only album and artist. I'd like to use year too. Maybe make it configurable?

(this discussion started in #1131)

feature mediumsize

Most helpful comment

I was looking into adding this but it turns out it might run a lot deeper than I initially thought. The importer uses [beets.importer.ImportTask#chosen_ident] (or [beets.importer.SingletonImportTask#chosen_ident]) to get the artist and album name (or title of track if single item) of the item being imported. For SingletonImportTask this isn't really a problem, as it uses the Item itself which has a year field.

The main issue seems to be with modifying how [beets.importer.ImportTask#chosen_ident] works, as it uses class instance variables which store the current artist and album for the import.

Both cur_artist and cur_album are populated in [beets.importer.ImportTask#lookup_candidates], which gets them from [beets.autotag.media.tag_album]. This in turn gets them from [beets.autotag.media.current_metadata] which does provide the year in it's return value.

Now we could just fetch cur_year from likelies (the return value of [beets.autotag.media.current_metadata]) and return it alongside the others, however this seems like it might end up being a bit of a hack. The reason I'm saying this is that we end up using the cur_artist and cur_album to look up the album within [beets.autotag.media.tag_album], using (what I believe to be) external sources such as MusicBrainz and plugins. If we don't use cur_year when looking up and just return it then it feels like a bit of a hack to me.

We can't even easily use cur_year when looking up the album, as it uses the candidates hook (see here) which in turn dispatches to plugins, meaning any changes made will be breaking.

I may have completely missed something simple here, but from my preliminary look it doesn't seem like this will be an easy thing to add 馃槩.

What do you think @sampsyo?

All 8 comments

Thanks for filing this. To summarize the other discussion, you wanted to add year so that albums with the same name could be considered non-duplicate.

I have a similar issue where a single has the same title as an album for the same artist, in most cases the year will be the same too so it's great to see that you've widened the solution to configure fields without limitation, allows me to use albumtype, which will be ideal for this scenario.

Any chance to see this implemented?

BTW, when importing in --singletons mode it seems that the duplicate detection code uses ony author and title, and this bug gets quite annoying. e.g. for jazz music it's common to have many version of the same song by the same artist.

I don't mean this to sound dismissive, but you can make the chances much higher by doing some work on it yourself! The current status is that we think this is a good idea and important; it's just a matter of finding someone with the time to implement it.

I was looking into adding this but it turns out it might run a lot deeper than I initially thought. The importer uses [beets.importer.ImportTask#chosen_ident] (or [beets.importer.SingletonImportTask#chosen_ident]) to get the artist and album name (or title of track if single item) of the item being imported. For SingletonImportTask this isn't really a problem, as it uses the Item itself which has a year field.

The main issue seems to be with modifying how [beets.importer.ImportTask#chosen_ident] works, as it uses class instance variables which store the current artist and album for the import.

Both cur_artist and cur_album are populated in [beets.importer.ImportTask#lookup_candidates], which gets them from [beets.autotag.media.tag_album]. This in turn gets them from [beets.autotag.media.current_metadata] which does provide the year in it's return value.

Now we could just fetch cur_year from likelies (the return value of [beets.autotag.media.current_metadata]) and return it alongside the others, however this seems like it might end up being a bit of a hack. The reason I'm saying this is that we end up using the cur_artist and cur_album to look up the album within [beets.autotag.media.tag_album], using (what I believe to be) external sources such as MusicBrainz and plugins. If we don't use cur_year when looking up and just return it then it feels like a bit of a hack to me.

We can't even easily use cur_year when looking up the album, as it uses the candidates hook (see here) which in turn dispatches to plugins, meaning any changes made will be breaking.

I may have completely missed something simple here, but from my preliminary look it doesn't seem like this will be an easy thing to add 馃槩.

What do you think @sampsyo?

Hi, @jackwilsdon! It's great to hear from you.

You're right; this is a bit complicated. Here's the fundamental problem, I think: find_duplicates needs to have a notion of the "artist" and "album" for the newly-imported music. When the metadata actually comes from MusicBrainz or another metadata source, this isn't too hard鈥攚e can just read the information off of the AlbumInfo or TrackInfo object. This is reflected in the second branch of the if in both implementations of chosen_ident.

But when we have an as-is import, where the information comes from the current metadata on the files, things are more complicated. How do we get the "current artist" for a collection of tracks? We can only guess it by taking the majority. That's why there's the coupling with current_metadata, which is also used to determine the initial search criteria.

I think the right approach is probably to make current_metadata (and likelies) turn up more fields. Instead of just having a fixed pair cur_artist and cur_album, we could have a flexible dict full of all the fields we could possibly want to use for duplicate detection. Does that make sense?

Hey @sampsyo, sorry for pretty much disappearing off the face of the earth recently! 馃槅

Does current_metadata not already turn up multiple fields? From the code it looks like it supports these fields:

```python
fields = ['artist', 'album', 'albumartist', 'year', 'disctotal', 'mb_albumid', 'label', 'catalognum', 'country', 'media', 'albumdisambig']
````

Regarding your last point; would tag_album accept a list of fields to use when tagging? How would this work with the candidates hook?

Well I'll be damned; I totally forgot that current_metadata already did that! Of course, we use this for metadata matching. Good point!

Currently, tag_album returns cur_artist and cur_album, which come from current_metadata, along with the Proposal object. I think my proposal amounts to returning the entire dictionary of current metadata from that function. Then, in lookup_candidates in beets.importer, where we currently have:

self.cur_artist = artist
self.cur_album = album

we'll instead have:

```py
self.cur_metadata = cur_metadata
````

to save that for later. Finally, chosen_ident can then use any fields from this to generate the identifier for each album.

Does that make any sense?

Was this page helpful?
0 / 5 - 0 ratings

Related issues

Stunner picture Stunner  路  3Comments

hashhar picture hashhar  路  3Comments

cgtobi picture cgtobi  路  3Comments

robot3498712 picture robot3498712  路  3Comments

ctrueden picture ctrueden  路  3Comments