Openlibrary: Delete 58,000 bad author names (CD catalog IDs)

Created on 10 Aug 2012 · 20 Comments · Source: internetarchive/openlibrary

I noticed a pattern of badly malformed author names, caused by either spam or import problems. The names all seem to be associated with CDs. Some examples:

Count Cddssa 27022 Basie
Jimmy Cddelt 24980 Buffett
Steve Cdduny 27009 Lacy (with url: http://openlibrary.org/authors/OL3407397A/Steve_Cdduny_27009_Lacy)

To get some sense of how pervasive the issue is, I searched for "cdm*" and got 6,743 hits.
Spot checks show that the records were created on April 30, 2008 by an anonymous import.

Data Data Cleanup authors openlibrary-client

mikemaehr

Most helpful comment

Update: I had been actively working on this issue and fixing up the data, but left it to focus on Solr issues that were basically blocking progress.

After thinking about it some more, I want to say that _all bulk data cleanup tasks are currently blocked by lack of timely Solr updates_

I had been plugging away at this and at other Unicode cleanup and resolutions, but there was no way for anyone, including myself, to see the effect on the site, since search results were not updating consistently. The only way was to wait a month and grep the data dumps to see the technical improvement, which wouldn't be noticed by users of the site, who would still see all the bad data in search results.

I downloaded the latest author dump and re-ran @tfmorris's regex above -

egrep -Ec " Cd[a-zA-Z0-9]{2,4} " ol_dump_authors_2018-02-28.txt

RESULT for Feb 2018: _140_ !!! (not a typo: one hundred and forty, down from ~55K last time we had reliable data)

So this is basically done, but even I had no idea until this morning, even though it was my own fixes from weeks ago that did it.

My recent PR #861 fixes a major author indexing bug where all author deletes were not being actioned by Solr because the keys in the delete request were the wrong format. This directly relates to this issue and explains why none of these authors were ever removed from search. I think there are potentially more undiscovered issues that need fixing with Solr.

In the spirit of being clear, actionable, and bold for 2018 Q2, I say that until we have that confidence in up-to-date Solr indexes, all bulk data clean-up tasks are blocked.

I've been doing many of these cleanups recently with the ol-client, and it's a bit disheartening to see no real effect on the site. A one-month feedback time on _any_ evidence of a result is not a workable process. It was great to develop and test ol-client, but to actually see results, Solr needs attention, which we have already prioritised for Q2.

@mekarpeles , tagging you for comment / official support :wink:

hornc on 14 Mar 2018

👍2

All 20 comments

It seems the first and last words in the "name" are the real name. And there are several of these for Steve Lacy alone...
As they indeed appear to be (mostly?) CDs, it shouldn't be extremely hard to find and delete(?) these items and their authors.
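
If the observation above holds (real name = first and last words, with a "Cd" label code and a numeric ID in between), recovering the probable real name could be sketched roughly like this. The helper name `strip_cd_infix` and the exact infix pattern are assumptions made for illustration, based only on the three examples in the report; this is not a tested cleanup rule.

```python
import re

# Assumed shape of the bad infix: a "Cd"-prefixed label code (2-4 more
# alphanumeric characters) followed by a numeric catalog ID, sitting
# between the real first and last name.
CD_INFIX = re.compile(r"\s+Cd[a-zA-Z0-9]{2,4}\s+\d+\s+")


def strip_cd_infix(name: str) -> str:
    """Replace the CD catalog infix, if present, with a single space."""
    return CD_INFIX.sub(" ", name).strip()


if __name__ == "__main__":
    # The three examples from the original report.
    for bad in (
        "Count Cddssa 27022 Basie",
        "Jimmy Cddelt 24980 Buffett",
        "Steve Cdduny 27009 Lacy",
    ):
        print(bad, "->", strip_cd_infix(bad))
```

Names without the infix come back unchanged, so this only helps to spot which real author a bad record probably belongs to. It says nothing about whether the record should be deleted.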

bencomp on 17 Aug 2012

I agree for the most part with bencomp. I do think it raises some questions about what is actually a book/work and is problematic in cases related to CDs and music (e.g., music play-along books w/cd, books with exclusively music notation). My guess is this might have been discussed, perhaps some guidelines have even already been established?

mikemaehr on 18 Aug 2012

I referred to this issue on the OL-discuss mailing list: http://www.mail-archive.com/[email protected]/msg00728.html

@tfmorris commented and included a link to the GoodReads page about what classifies as book and what doesn't. I found that pretty inspirational. There are no real rules on OL, although in 2010, the (now retired) project leader @george08 hinted that music CDs do not belong in the OL catalogue: https://bugs.launchpad.net/openlibrary/+bug/609801

bencomp on 21 Aug 2012

Names of this form can also occur with real print authors' audio books, so some care needs to be taken if we're going to preserve the audio books. The bad names still exist in the database as of 2016.

tfmorris on 30 Jan 2016

The CD names are abbreviated forms of music publishers. Collecting some here to help with removal:

[x] cdravn: https://openlibrary.org/publishers/RAVEN_RECORDS 36 works
[ ] cdkhi: https://openlibrary.org/publishers/KOCH_ENTERTAINMENT 1000+
[ ] Cdcdis: https://openlibrary.org/publishers/CAROLINE_DISTRIBUTION 1000+
[ ] Cddemn: DEMON/PHANTOM_SOUND
[x] Cdsorr: https://openlibrary.org/publishers/SOAR 22 works
[x] Cdajaz: https://openlibrary.org/publishers/ACID_JAZZ 11 works
[ ] Cdfmx: https://openlibrary.org/publishers/FREMEAUX Also publishes books!
[x] Cdtro: https://openlibrary.org/publishers/TROJAN 72 works
[x] Cdaucl: https://openlibrary.org/publishers/AURA_CLASSICS 17 works
[x] Cddssa: https://openlibrary.org/publishers/DIRECT_SOURCE_MUSIC 265 works
[x] Cddelt : https://openlibrary.org/publishers/DELTA_MUSIC 150 works
[x] https://openlibrary.org/publishers/CPO_RECORDS CPO RECORDS 108 works

Author Merge showing many CDs:
https://openlibrary.org/recentchanges/2010/10/21/merge-authors/41698673

Other items to remove:

comment to add:
removing non-book Audio CD

hornc on 27 Feb 2017

Another bad author pattern for electronic accessories:

[x] https://openlibrary.org/search?q=author%3Acarton+qty

hornc on 28 Feb 2017

Many of these show
"April 30, 2008 Created by an anonymous user Inital record created, from an amazon.com record."
Following the link for Amazon's record details gets you a 404: "Sorry, we couldn't find that page."

LeadSongDog on 28 Feb 2017

Another CD publisher:

[x] Cdcata : https://openlibrary.org/publishers/CAMERATA

LeadSongDog on 28 Feb 2017

Thanks @LeadSongDog. Can I be assigned to this issue? I'm currently working on scripting the finding and removal of these sorts of works. I can't self-assign.

hornc on 1 Mar 2017

Thanks for stepping up, @hornc, but I'm not able to do the assignments myself. Much of GitHub's functionality is limited for me by a locked-down browser. ;-( Perhaps @tfmorris would do the honors?

LeadSongDog on 1 Mar 2017

@LeadSongDog @hornc I'm only a "Contributor" while @hornc is a "Collaborator", so he should have more powers than I do. If he can't do it, I'd suggest pinging @mekarpeles or one of the other "in" crowd to get it done.

tfmorris on 27 Apr 2017

Could someone with the appropriate powers add "CD" to the title somewhere please so that we know what it refers to?

I did a quick search and there are 58565 author records which match this regex:

grep -E " Cd[a-zA-Z0-9]{2,4} "

I think they're all candidates for deletion. Some of the "authors" have multiple works (all Audio CDs), but mostly it's one work plus one edition, so this would be a bulk delete of a little less than 200K records.
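
For anyone who wants the matching keys rather than just a count, here is a minimal Python sketch of the same check. It assumes each dump line is tab-separated with the JSON record in the last column, and that the record has `key` and `name` fields. The function name `find_cd_authors` is made up for illustration, and the revision and timestamp in the sample lines are placeholders, not real data. Unlike the grep above, it tests only the `name` field rather than the whole line.

```python
import json
import re
from typing import Iterable, Iterator, Tuple

# Same pattern as the grep: " Cd" plus 2-4 alphanumerics, space-delimited.
CD_NAME = re.compile(r" Cd[a-zA-Z0-9]{2,4} ")


def find_cd_authors(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (key, name) for author records whose name matches CD_NAME.

    Assumes the JSON record is the last tab-separated column of each line.
    """
    for line in lines:
        record = json.loads(line.rstrip("\n").split("\t")[-1])
        name = record.get("name", "")
        if CD_NAME.search(name):
            yield record.get("key", ""), name


# Illustrative lines only: the first key/name pair is from this issue,
# the second record and all revision/timestamp values are placeholders.
sample = [
    '/type/author\t/authors/OL3407397A\t1\t2008-04-30T00:00:00\t'
    '{"key": "/authors/OL3407397A", "name": "Steve Cdduny 27009 Lacy"}',
    '/type/author\t/authors/OL0000000A\t1\t2008-04-30T00:00:00\t'
    '{"key": "/authors/OL0000000A", "name": "Jane Example"}',
]

if __name__ == "__main__":
    for key, name in find_cd_authors(sample):
        print(key, name)
```

To run it against a real dump, pass an open file object instead of `sample`. The keys it yields are only candidates and would still need the audio book check mentioned earlier in the thread.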

As an aside, can we ratify George's decision that music CDs are not appropriate for OpenLibrary?

tfmorris on 31 Dec 2017

@hornc given this is a data related issue, I'm for moving this over to https://github.com/internetarchive/openlibrary-client

mekarpeles on 8 Jan 2018

I have just deleted 3415 authors of orphaned audio CDs falling into this pattern (after making sure the editions linked to them were removed too).

hornc on 29 Jan 2018

How can we rephrase this issue as something we can fix, rather than something we can/should clean up? Let's please open with a clear call to action.

cc: @hornc

mekarpeles on 13 Mar 2018

@mekarpeles did you actually move the issue to the other repository? I can't find it there.

This is indeed a data issue, consisting of a one-time cleanup of bad data and making sure somehow that more bad data like this doesn't get in. Only looking at the latter part doesn't fix the problem.

How about prepending the issue title with "Clean up" to make it actionable?

bencomp on 14 Mar 2018

@mekarpeles The call to action is:

Delete ~55K bad author records with names matching the regex grep -E " Cd[a-zA-Z0-9]{2,4} ", plus the related music CDs (i.e., 58565 less the 3415 that @hornc already deleted).

tfmorris on 14 Mar 2018

👍1

Illustration of how this data fix is not noticeable on the site:
https://openlibrary.org/search/authors?q=Cd

6K results. Most are records targeted by this issue, and of those I can't find one that hasn't already been deleted. It still looks like a data problem, but it is a Solr problem.

hornc on 14 Mar 2018

https://github.com/internetarchive/openlibrary/issues/152#issuecomment-373170438

In the spirit of being clear, actionable, and bold for 2018 Q2, I say that until we have that confidence in up-to-date Solr indexes, all bulk data clean-up tasks are blocked.

I think this is spot on, also cc'ing in @bfalling so we can make this a priority for Q2.

Also, an enormous round of applause and thanks to @hornc for championing this, for providing an update on progress, and for laying out some clear steps we can move forward on, as well as to @tfmorris for giving us the regex, @bencomp for making sure we moved the issue to openlibrary-client, and @mikemaehr for opening this originally.

And for the community's patience :)

Again, to reiterate, @bfalling: we should prioritize, with clear goals, a path to unblock the community (re: search, merging, and data cleanup) from Solr.

mekarpeles on 14 Mar 2018
