Beets: Fix lyrics scraping on genius.com

Created on 13 Aug 2020 · 6 Comments · Source: beetbox/beets

Problem

The lyrics plugin cannot get the lyric text on some pages of genius.com.
Example:
https://genius.com/Ed-sheeran-nothing-on-you-lyrics
The plugin expects this:

<div class="lyrics">

When running

$ beet -vv lyrics

the above page will result in:

Traceback (most recent call last):
[...]
File "/usr/lib/python3.7/site-packages/beetsplug/lyrics.py", line 375, in lyrics_from_song_api_path
    lyrics = html.find("div", class_="lyrics").get_text()
AttributeError: 'NoneType' object has no attribute 'get_text'

The reason is that genius.com seems to have several pages where, instead of a single div, the
text is stored in a list of divs whose class name starts with Lyrics__Container:

<div class="Lyrics__Container-sc-1ynbvzw-2 jgQsqn">...</div>
<div class="Lyrics__Container-sc-1ynbvzw-2 jgQsqn">...</div>
<div class="Lyrics__Container-sc-1ynbvzw-2 jgQsqn">...</div>
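To illustrate the two page layouts without touching the plugin, here is a minimal sketch that uses only the Python standard library. It is not the plugin's code: the plugin uses BeautifulSoup, and the names LyricsExtractor and extract_lyrics are hypothetical helpers made up for this example. It assumes the lyrics live either in a div with class "lyrics" or in one or more divs with a class starting with "Lyrics__Container", as described above.

```python
from html.parser import HTMLParser


class LyricsExtractor(HTMLParser):
    """Collect text from lyric divs (hypothetical helper, not part of beets)."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # div nesting depth inside a matching div, 0 = outside
        self.blocks = []  # one list of text fragments per matching div

    @staticmethod
    def _is_lyric_div(attrs):
        # Assumption: either the old "lyrics" class or the newer
        # "Lyrics__Container..." class prefix marks a lyric div.
        classes = (dict(attrs).get("class") or "").split()
        return any(
            c == "lyrics" or c.startswith("Lyrics__Container") for c in classes
        )

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div inside a lyric div
            elif tag == "br":
                self.blocks[-1].append("\n")  # keep line breaks
        elif tag == "div" and self._is_lyric_div(attrs):
            self.depth = 1
            self.blocks.append([])

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.blocks[-1].append(data)


def extract_lyrics(page_html):
    """Return the lyric text, or None if no lyric div is found."""
    parser = LyricsExtractor()
    parser.feed(page_html)
    parser.close()
    if not parser.blocks:
        return None
    return "\n\n".join("".join(block) for block in parser.blocks).strip()


old_layout = '<div class="lyrics"><p>Line one<br>Line two</p></div>'
new_layout = (
    '<div class="Lyrics__Container-sc-1ynbvzw-2 jgQsqn">Verse one</div>'
    '<div class="Lyrics__Container-sc-1ynbvzw-2 jgQsqn">Chorus</div>'
)
print(extract_lyrics(old_layout))
print(extract_lyrics(new_layout))
```

The point of the sketch is the None return for pages with no matching div, so that the caller can skip the page instead of hitting the AttributeError shown in the traceback.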

Setup

  • OS: linux
  • Python version: 3.7.3
  • beets version: 1.4.9
  • Turning off plugins made problem go away (yes/no): yes (lyrics plugin)

To reproduce, enable the lyrics plugin, set genius as the only lyrics source, and try to
scrape a fairly large music collection.

lyrics:
    bing_lang_from: []
    auto: yes
    sources: genius
    bing_client_secret: REDACTED
    bing_lang_to:
    google_API_key: REDACTED
    google_engine_ID: REDACTED
    genius_api_key: REDACTED
    fallback:
    force: no
    local: no
directory: REDACTED/music
ignore_hidden: yes
asciify_paths: yes

import:
    move: no
    write: yes
    incremental: no
    resume: no

plugins: lyrics

Here is a patch that worked for me:

--- lyrics.py.orig      2020-08-12 20:10:01.000000000 +0200
+++ lyrics.py   2020-08-12 20:10:01.000000000 +0200
@@ -370,11 +370,21 @@
         # Remove script tags that they put in the middle of the lyrics.
         [h.extract() for h in html('script')]

-        # At least Genius is nice and has a tag called 'lyrics'!
-        # Updated css where the lyrics are based in HTML.
-        lyrics = html.find("div", class_="lyrics").get_text()
-
-        return lyrics
+        # Genius has the lyrics either in multiple divs with class attributes
+        # beginning with "Lyrics__Container", or in a single div with class
+        # attribute "lyrics"
+        lyric_tag = html.find("div", class_="lyrics")
+        if lyric_tag is None:
+            class_matcher = re.compile("^Lyrics__Container")
+            lyric_tags = html.find_all("div", class_=class_matcher)
+            if not lyric_tags:
+                self._log.debug(u'Genius page {0} has no lyric tags', page_url)
+                return None
+            lyrics = u'\n\n'.join(tag.get_text() for tag in lyric_tags)
+        else:
+            lyrics = lyric_tag.get_text()
+        # remove leading and trailing whitespace
+        return lyrics.strip()

     def fetch(self, artist, title):
         search_url = self.base_url + "/search"

Labels: bug, needinfo

All 6 comments

Seems great! Would you mind transforming this patch into a pull request?

Hey, could I pick this issue up, and leverage the patch @wummel provided in order to have a sure shot solution?

That would be awesome!

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

I think it is still not solved.

I believe that's true. If anybody has the bandwidth to take the above patch and open a quick PR with it, we can get the process of fixing things started!
