Noticed a pattern of badly corrupted author names, caused by either spam or import problems. The names all seem to be associated with CDs. Some examples include:
Count Cddssa 27022 Basie
Jimmy Cddelt 24980 Buffett
Steve Cdduny 27009 Lacy (with url: http://openlibrary.org/authors/OL3407397A/Steve_Cdduny_27009_Lacy)
To get some sense of the pervasiveness of the issue I did a search for "cdm*" and 6,743 hits were returned.
Spot checks show that the records were created on April 30, 2008 by an anonymous import.
It seems the first and last words in the "name" are the real name. And there are several of these for Steve Lacy alone...
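A minimal sketch of that observation, assuming every bad name has the shape "<first> Cd<2-4 letters or digits> <digits> <last>" as in the three examples above. The function name `recover_name` is hypothetical, and the assumption that a run of digits always follows the "Cd" token comes only from those examples:

```python
import re

# Assumed noise shape, taken from the examples in this thread:
# a "Cd..." token followed by a numeric token, e.g. " Cddssa 27022 ".
CD_NOISE = re.compile(r" Cd[a-zA-Z0-9]{2,4} \d+ ")

def recover_name(bad_name):
    """Return the name with the CD noise removed, or None if no noise is found."""
    cleaned, replacements = CD_NOISE.subn(" ", bad_name, count=1)
    return cleaned if replacements else None

for name in ["Count Cddssa 27022 Basie", "Steve Cdduny 27009 Lacy", "Steve Lacy"]:
    print(repr(name), "->", repr(recover_name(name)))
```

A recovered name is only a hint for finding the matching real author record; it does not by itself prove the record is a music CD.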
As they indeed appear to be (mostly?) CDs, it shouldn't be extremely hard to find and delete(?) these items and their authors.
I agree for the most part with bencomp. I do think it raises some questions about what is actually a book/work and is problematic in cases related to CDs and music (e.g., music play-along books w/cd, books with exclusively music notation). My guess is this might have been discussed, perhaps some guidelines have even already been established?
I referred to this issue on the OL-discuss mailing list: http://www.mail-archive.com/[email protected]/msg00728.html
@tfmorris commented and included a link to the GoodReads page about what classifies as book and what doesn't. I found that pretty inspirational. There are no real rules on OL, although in 2010, the (now retired) project leader @george08 hinted that music CDs do not belong in the OL catalogue: https://bugs.launchpad.net/openlibrary/+bug/609801
Names of this form can also occur with real print authors' audio books, so some care needs to be taken if we're going to preserve the audio books. They still exist in the database as of 2016.
The CD names are abbreviated forms of music publishers' names. Collecting some here to help with removal:
Author Merge showing many CDs:
https://openlibrary.org/recentchanges/2010/10/21/merge-authors/41698673
Other items to remove:
comment to add:
removing non-book Audio CD
Another bad author pattern for electronic accessories:
Many of these show
"April 30, 2008 Created by an anonymous user Inital record created, from an amazon.com record."
Following the link for Amazon's record details gets you a 404: "Sorry, we couldn't find that page."
Another CD publisher:
Thanks @LeadSongDog. Can I be assigned to this issue? I'm currently working on scripting the finding and removal of these sorts of works. I can't self-assign.
Thanks for stepping up, @hornc but I'm not able to do the assignments myself. Much of github's functionality is limited for me by a locked-down browser. ;-( Perhaps @tfmorris would do the honors?
@LeadSongDog @hornc I'm only a "Contributor" while @hornc is a "Collaborator" so should have more powers than I, but if he can't do it, I'd suggest pinging @mekarpeles or one of the other "in" crowd to get what you need done accomplished.
Could someone with the appropriate powers add "CD" to the title somewhere please so that we know what it refers to?
I did a quick search and there are 58565 author records which match this regex:
grep -E " Cd[a-zA-Z0-9]{2,4} "
I think they're all candidates for deletion. Some of the "authors" have multiple works (all Audio CDs), but mostly it's one work plus one edition, so this would be a bulk delete of a little less than 200K records.
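For anyone who prefers to check names outside of grep, here is a rough Python equivalent of the regex above. The helper name `cd_candidates` and the last sample entry are made up for illustration. Note that the pattern requires a space on both sides of the token, so a name that starts with "Cd" is not matched:

```python
import re

# The regex quoted above: " Cd", then 2-4 letters or digits, then a space.
CD_PATTERN = re.compile(r" Cd[a-zA-Z0-9]{2,4} ")

def cd_candidates(names):
    """Return the names containing the CD-style token, in input order."""
    return [name for name in names if CD_PATTERN.search(name)]

sample = [
    "Count Cddssa 27022 Basie",
    "Jimmy Cddelt 24980 Buffett",
    "Steve Lacy",        # clean name, no token
    "Cdbaby Records",    # hypothetical: no space before "Cd", so no match
]
print(cd_candidates(sample))
```

As noted earlier in the thread, names of this form can also belong to audio books by real print authors, so a match is a candidate for review rather than an automatic delete.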
As an aside, can we ratify George's decision that music CDs are not appropriate for OpenLibrary?
@hornc given this is a data related issue, I'm for moving this over to https://github.com/internetarchive/openlibrary-client
I have just deleted 3415 authors of orphaned audio CDs falling into this pattern (after making sure the editions linked to them were removed too).
How can we rephrase this issue as something we can fix, rather than something we can/should clean up? Let's please open w/ a clear call to action.
cc: @hornc
@mekarpeles did you actually move the issue to the other repository? I can't find it there.
This is indeed a data issue, consisting of a one-time cleanup of bad data and making sure somehow that more bad data like this doesn't get in. Only looking at the latter part doesn't fix the problem.
How about prepending the issue title with "Clean up" to make it actionable?
@mekarpeles The call to action is:
Delete ~55K bad author records with names matching the regex grep -E " Cd[a-zA-Z0-9]{2,4} " and the related music CDs (i.e., 58565 less the 3415 that @hornc already deleted, which leaves 55150)
Update: I had been actively working on this issue and fixing up the data, but left it to focus on Solr issues that were basically blocking progress.
After thinking about it some more, I want to say that _all bulk data cleanup tasks are currently blocked by lack of timely Solr updates_
I had been plugging away at this, and at other Unicode clean-up and resolutions, but there was no way for anyone, including myself, to see the effect on the site, since search results were not updating consistently. The only way was to wait a month and grep the data dumps to see the technical improvement, which wouldn't be noticed by users of the site, who would still see all the bad data in results.
I downloaded the latest author dump and re-ran @tfmorris's regex above -
egrep -Ec " Cd[a-zA-Z0-9]{2,4} " ol_dump_authors_2018-02-28.txt
RESULT for Feb 2018: _140_ !!! (not a typo: one hundred and forty, down from ~55K last time we had reliable data)
so this is basically done, but even I had no idea until this morning, even though it was my fixes from weeks ago.
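The same count can be sketched in Python if you want to match on the author's name only, rather than on the whole dump line as grep does. This assumes the dump is tab-separated with the JSON record in the last column and the name stored under a "name" key; that layout is an assumption here, so check it against the dump you actually download. `count_cd_authors` is a hypothetical helper name:

```python
import json
import re

CD_PATTERN = re.compile(r" Cd[a-zA-Z0-9]{2,4} ")

def count_cd_authors(lines):
    """Count records whose "name" field contains the CD-style token.

    Assumes each line is tab-separated with a JSON object in the last column.
    """
    count = 0
    for line in lines:
        # Split off at most four leading columns; the remainder is the JSON.
        record = json.loads(line.rstrip("\n").split("\t", 4)[-1])
        if CD_PATTERN.search(record.get("name", "")):
            count += 1
    return count
```

Usage would look something like `with open("ol_dump_authors_2018-02-28.txt") as f: print(count_cd_authors(f))`. Because it ignores every field except the name, its count may differ slightly from the grep count.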
My recent PR #861 fixes a major author indexing bug where all author deletes were not being actioned by Solr because the keys in the delete request were the wrong format. This directly relates to this issue and explains why none of these authors were ever removed from search. I think there are potentially more undiscovered issues that need fixing with Solr.
In the spirit of being clear, actionable, and bold for 2018 Q2, I say that until we have that confidence in up-to-date Solr indexes, all bulk data clean-up tasks are blocked.
I've been doing many of these in the recent past with the ol-client, and it's a bit disheartening to see no real effect on the site. A one-month feedback time on _any_ evidence of a result is not a workable process. It was great to develop and test ol-client, but to actually see results, Solr needs attention, which we have already prioritised for Q2.
@mekarpeles , tagging you for comment / official support :wink:
Illustration of how this data fix is not noticeable on the site:
https://openlibrary.org/search/authors?q=Cd
6K results, most of them records targeted by this issue. Of those, I can't find one that hasn't already been deleted. It still looks like a data problem, but it is a Solr problem.
https://github.com/internetarchive/openlibrary/issues/152#issuecomment-373170438
> In the spirit of being clear, actionable, and bold for 2018 Q2, I say that until we have that confidence in up-to-date Solr indexes, all bulk data clean-up tasks are blocked.
I think this is spot on, also cc'ing in @bfalling so we can make this a priority for Q2.
Also, an enormous round of applause and thanks to @hornc for championing this, for providing an update on progress, and for laying out some clear steps we can move forward on, as well as to @tfmorris for giving us the regex, @bencomp for making sure we moved the issue to openlibrary-client, and @mikemaehr for opening this originally.
And for the community's patience :)
Again, to reiterate, @bfalling: we should prioritize, with clear goals, a path to unblock the community (re: search, merging, and data cleanup) from Solr.