When Dataverse harvests from some non-Dataverse sources (e.g. ICPSR, DataCite), clicking on the dataset link doesn't take users to the source's dataset page.
You can see an example in this dataverse on Harvard Dataverse, where a set of metadata records from DataCite was harvested. Clicking on the dataset title link (https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.6141/tw-srda-af010014-1) takes you to DataCite's 404 Not Found page (https://oai.datacite.org/dataset.xhtml?persistentId=doi:10.6141/tw-srda-af010014-1). Clicking on the DOI link (https://doi.org/10.6141/tw-srda-af010014-1) in the citation box takes you to the source's dataset page.
(On Harvard Dataverse's "ICPSR Harvested Dataverse", clicking on the dataset titles correctly takes you to the sources' dataset pages. I'm not sure how those dataset records were collected into that ICPSR Harvested Dataverse. But they're out of sync with the records that ICPSR makes available over OAI-PMH, and this bug is preventing Harvard Dataverse from updating the datasets it's harvested from ICPSR.)
Could we investigate what's going on?
Here are two GitHub issues that might be related:
https://github.com/IQSS/dataverse/issues/4831
https://github.com/IQSS/dataverse/issues/4707
Stumbled upon this issue again today. We are up to 2,470 datasets that result in 404s when you click on the dataset title link in the search card.
The resulting 404 URL is formatted with the assumption that there is a "dataset.xhtml" page on the DataCite site.
https://oai.datacite.org/dataset.xhtml?persistentId=doi:10.6141/TW-SRDA-E89101-1
Looking at the code, there is return remoteArchiveUrl; logic, but it appears it isn't being properly applied.
search-include-fragment.xhtml
<!--DATASET CARDS-->
<div class="datasetResult clearfix" jsf:rendered="#{result.type == 'datasets'}">
    <div class="card-title-icon-block">
        ...
        <a href="#{!SearchIncludeFragment.rootDv and !result.isInTree ? result.datasetUrl : widgetWrapper.wrapURL(result.datasetUrl)}" target="#{(!SearchIncludeFragment.rootDv and !result.isInTree and widgetWrapper.widgetView) or result.harvested ? '_blank' : ''}">
            <h:outputText value="#{result.title}" style="padding:4px 0;" rendered="#{result.titleHighlightSnippet == null}"/>
        ...
SolrSearchResult.java
public String getDatasetUrl() {
    String failSafeUrl = "/dataset.xhtml?id=" + entityId + "&versionId=" + datasetVersionId;
    if (identifier != null) {
        /**
         * Unfortunately, colons in the globalId (doi:10...) are converted
         * to %3A (doi%3A10...). To prevent this we switched many JSF tags
         * to a plain "a" tag with an href as suggested at
         * http://stackoverflow.com/questions/24733959/houtputlink-value-escaped
         */
        String badString = "null";
        if (!identifier.contains(badString)) {
            if (entity != null && entity instanceof Dataset) {
                if (this.isHarvested() && ((Dataset) entity).getHarvestedFrom() != null) {
                    String remoteArchiveUrl = ((Dataset) entity).getRemoteArchiveURL();
                    if (remoteArchiveUrl != null) {
                        return remoteArchiveUrl;
                    }
                    return null;
                }
            }
            if (isDraftState()) {
                return "/dataset.xhtml?persistentId=" + identifier + "&version=DRAFT";
            }
            return "/dataset.xhtml?persistentId=" + identifier;
        } else {
            logger.info("Dataset identifier/globalId contains \"" + badString + "\" perhaps due to https://github.com/IQSS/dataverse/issues/1147 . Fix data in database and reindex. Returning failsafe URL: " + failSafeUrl);
            return failSafeUrl;
        }
    } else {
        logger.info("Dataset identifier/globalId was null. Returning failsafe URL: " + failSafeUrl);
        return failSafeUrl;
    }
}
@landreev at standup I mentioned that I was chatting with @jggautier about this issue this morning.
One thing we noticed is that when I tried to set up a harvesting client from https://oai.datacite.org/oai it was taking FOREVER after I clicked "Next".
Actually, the same thing happens when I click "Next" under "Edit Harvesting Client" like this. It just spins and spins:

But then Julian clued me in to the fact that https://oai.datacite.org/oai has over 2000 sets. In order to know how long I should wait (10 minutes or so?) I hacked in a counter like this:
$ git diff
diff --git a/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java b/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
index e4642fe0a..bd805bef2 100644
--- a/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
+++ b/src/main/java/edu/harvard/iq/dataverse/harvest/client/oai/OaiHandler.java
@@ -157,7 +157,10 @@ public class OaiHandler implements Serializable {
 List<String> sets = new ArrayList<>();
+int count = 0;
 while ( setIter.hasNext()) {
+    count++;
+    System.out.println("on set " + count);
     Set set = setIter.next();
     String setSpec = set.getSpec();
     /*
Obviously, the code above is a hack but I guess I'd suggest adding in some more logging (logger.fine, probably) if you feel like it, while you're in this code.
Also, I can easily reproduce the bug with the client above. Here are all the parameters I used:
This should be addressed through a configuration change (not a code change, hopefully) of the harvesting client, and then verified.
This seems to be the case. I'm helping resolve a different harvesting issue and I noticed that when setting up a harvesting client, the last of 4 steps is to choose the "Archive Type." When I set up harvesting of non-Dataverse repositories, I left the Archive Type as Dataverse v4+, and it seems like I was expected to choose "Generic OAI resource (DC)".

ICPSR datasets are in the middle of being harvested into this dataverse on Demo Dataverse, and the dataset title links I've clicked take me to the records on ICPSR's website. (I chose "Generic OAI resource (DC)" since I didn't know why there was an option specifically for ICPSR.)
I think that needing to choose "Generic OAI resource (DC)", in step 4, also implies that in step 2, I was expected to choose oai_dc as the metadata format.
We should identify what clients we expect to change in Harvard Dataverse
The non-Dataverse repositories that Harvard Dataverse should re-harvest are:
We should add/verify docs
The docs (http://guides.dataverse.org/en/latest/admin/harvestclients.html) don't mention each step. I'm not sure if there's a need for them to.
In that screenshot, the modal window describes the importance of choosing the right Archive Type (I think I just overlooked it since I don't normally set up harvesting from non-Dataverse repositories):

Maybe not making Dataverse v4.x the default would force the user to think about the Archive Type.
@jggautier, kudos for the comment with screenshots, proposing UX/UI improvements to the create harvesting client workflow in order to avoid this issue going forward.
It should be easy enough to change the dropdown menu in Step 4 to be "Select...", forcing the user to make a selection. Are you also suggesting that we could combine Step 2 and Step 4 because of the relation of Metadata Format and Archive Type fields?
Thanks for reading this so quickly and so closely!
Are you also suggesting that we could combine Step 2 and Step 4 because of the relation of Metadata Format and Archive Type fields?
Not really. A Dataverse repository that wants to harvest from a non-Dataverse repository might want to choose a metadata format that's richer than Dublin Core. E.g. for harvesting ICPSR, I'm testing harvesting using DDI 2.5. I'm wondering what was meant by DC in the option "Generic OAI resource (DC)", and if DC should be removed.
I'd like to clarify, and redefine, if needed, the scope of this issue. It was originally opened to reconfigure any existing harvesting clients to make the redirect links work. But it sounds like we are talking about changing the configuration dialogs. (it is of course confusing in its current form).
To summarize:
- Is it important that the default is "Dataverse v4+"? Should there be no default (or default is "Select...") so that the user is forced to make a selection?
Yes, probably.
- Why does "Generic OAI resource (DC)" include that "(DC)", which I take to mean Dublin Core, and can the "(DC)" be removed? It's possible to harvest from non-Dataverse repositories using metadata formats other than Dublin Core.
The only other harvesting format we (theoretically) recognize from a non-Dataverse OAI archive is DDI; in practice, it's extremely unlikely that we'll be able to parse a DDI that's produced by anything other than a Dataverse. That may have been the rationale - ?
- Fixing the dataset title links for SRDA datasets is blocked by the bug that @pdurbin reported, where the large number of sets in DataCite's OAI-PMH feed might somehow be preventing Dataverse from re-harvesting records in the SRDA set (GESIS.SRDA). Should this be its own GitHub issue?
If I'm reading @pdurbin's report correctly, this issue - a very long list of sets - should be making configuring a new client (or reconfiguring an existing one) very slow, or impossible. I don't think it should affect harvesting from an already configured client though. (during a harvesting run we never issue a "list sets" command). So if this archive cannot be harvested, it's probably something else.
Just to clarify, it is not necessary to re-harvest a remote archive, for the "archive type" change to take effect.
The redirect urls are generated in real time; so the change takes effect immediately.
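To illustrate the point about real-time redirects, here is a minimal standalone sketch (not the actual Dataverse code, which lives in Dataset.getRemoteArchiveURL(); the style names "dataverse" and "default" mirror the harveststyle values used in this thread) of how the archive type drives the URL that the search card link resolves to:

```java
// Illustrative sketch only: shows why harveststyle matters for redirects.
// With the "Dataverse v4+" style, the redirect is built by appending a
// Dataverse landing-page path to the archive URL, which 404s on
// non-Dataverse sites like oai.datacite.org. This class is a standalone
// model, not the real implementation.
public class HarvestStyleRedirectSketch {

    static String redirectFor(String harvestStyle, String archiveUrl, String globalId) {
        if ("dataverse".equals(harvestStyle)) {
            // assumes the remote archive is a Dataverse with a /dataset.xhtml page
            return archiveUrl + "/dataset.xhtml?persistentId=" + globalId;
        }
        // "default" (generic OAI archive): fall back to the archive URL itself
        return archiveUrl;
    }

    public static void main(String[] args) {
        System.out.println(redirectFor("dataverse", "https://oai.datacite.org", "doi:10.6141/TW-SRDA-E89101-1"));
        System.out.println(redirectFor("default", "https://oai.datacite.org", "doi:10.6141/TW-SRDA-E89101-1"));
    }
}
```

Since the style is read when each link is rendered, flipping harveststyle in the database changes the redirect for all already-harvested records without a re-harvest.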
For the SRDA archive, it does appear to be impossible to make the change through the UI (because of the bug with the set lists described above). But it is possible to do it in the database directly:
UPDATE harvestingclient SET harveststyle='default' WHERE name='srda';
This doesn't really fix it for the archive though; there's no 404 anymore - which is a step up - but the redirect is now showing a bland/generic OAI page on their side. That is because we can't really deduce the remote URL from what they are giving us in the DC metadata (for example: view-source:https://oai.datacite.org/oai?verb=GetRecord&identifier=doi:10.6141/tw-srda-aa000001-1&metadataPrefix=oai_dc).
What we should be doing instead is redirecting to the doi: resolver (for the dataset above - https://doi.org/10.6141/TW-SRDA-AA000001-1).
But for this we'll need a code change (will make a PR shortly).
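A minimal sketch of the proposed fallback, assuming the identifier format (doi:10.xxxx/...) and the https://doi.org/ resolver shown earlier in this thread; the class and method names are illustrative and not taken from the actual PR:

```java
// Sketch: when the DC metadata gives us no usable landing page, redirect
// to the DOI resolver instead of the remote archive's generic OAI page.
// Standalone model only; names are hypothetical, not the PR's code.
public class DoiResolverRedirectSketch {

    static String resolverUrlFor(String globalId) {
        if (globalId != null && globalId.startsWith("doi:")) {
            // strip the "doi:" protocol prefix and hand the rest to doi.org
            return "https://doi.org/" + globalId.substring("doi:".length());
        }
        return null; // not a DOI; caller keeps whatever fallback it already had
    }

    public static void main(String[] args) {
        System.out.println(resolverUrlFor("doi:10.6141/TW-SRDA-AA000001-1"));
    }
}
```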
If I'm reading @pdurbin's report correctly, this issue - a very long list of sets - should be making configuring a new client (or reconfiguring an existing one) very slow, or impossible.
It's not impossible, I just had to wait 10 minutes or so. I forget exactly how long. Not a great user experience, obviously. 😄 In practice, I put in some logging so I could watch server.log and not get frustrated by not knowing how long I'd have to wait. When it got to 1800 of 2000 or whatever I knew I was getting close to the end. 😄
So at minimum I'd suggest a logger.fine line that a sysadmin can bump up in the case of a long list of sets. Basically, a cleaned-up version of the hack I mentioned at https://github.com/IQSS/dataverse/issues/4964#issuecomment-584312561 😄
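A cleaned-up version of that hack might look like the following standalone sketch (the class and method names are illustrative; in Dataverse the loop lives in OaiHandler, and the logger would be that class's existing one):

```java
import java.util.Iterator;
import java.util.List;
import java.util.logging.Logger;

// Sketch: count sets while logging progress at FINE, so a sysadmin can
// raise the log level and watch server.log when an OAI registry has a
// very long set list (DataCite serves over 2,000 sets).
public class SetProgressSketch {

    private static final Logger logger = Logger.getLogger(SetProgressSketch.class.getName());

    static int countSets(Iterator<String> setIter) {
        int count = 0;
        while (setIter.hasNext()) {
            count++;
            logger.fine("on set " + count); // silent unless FINE is enabled
            setIter.next();
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSets(List.of("GESIS.SRDA", "CERN.ZENODO", "TIB.PANGAEA").iterator()));
    }
}
```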
Made a PR.
Checked in a simple/crude solution for the "too many sets" issue. It definitely is an improvement over the current situation (which is, you cannot set up harvesting from datacite.org/cannot edit any already created clients harvesting from datacite.org). I cannot justify spending any more time on this issue - it's already a bit outside the original scope; plus I'm not aware of any other OAI archive with the same problem.