Dataverse: Improving Dataverse's Schema.org JSON-LD schema to enable author names display in Google Dataset Search's

Created on 7 Sep 2018 · 15 Comments · Source: IQSS/dataverse

Author names are missing from the dataset records that Google Dataset Search indexed from Dataverse.

The JSON-LD schema for authors should be updated to use "@type": "Person" or another appropriate type, to differentiate between organizations as authors and individuals as authors.

All 15 comments

@chewsw thanks for opening this issue. Here's the thread from the dataverse-users mailing list: https://groups.google.com/d/msg/dataverse-community/TlQPNI3Ip2E/srLf29aSBAAJ

Originally we had "@type": "Person" in the JSON-LD output (in development, before release) but in Dataverse it's possible to have organizations as authors ("Gallup Organization", "Geological Survey (U.S.)", etc.) so we took it out. Please see discussion in these two places:

Maybe outlining some more details would help with estimation:

  • This will be required metadata that the author can change from the UI or API, indicating in some way whether the author names she enters are people or organizations
  • We need a plan for how installations can add author types to existing datasets

I'm starting to think that Google Dataset Search prefers the creator property (as opposed to the author property that Dataverse uses). For every dataset landing page I've found in Google Dataset Search where Google shows the author, the creator property is used instead of the author property.

Unless someone finds something different, I'd propose that instead of the author property Dataverse uses the creator property, which can use the same sub-properties:

"creator": [
    {
      "affiliation": "affiliation",
      "@type": "Person",
      "name": "Lname, Fname"
    },
    {
      "@type": "Organization",
      "name": "Org name"
    }
]

(Google doesn't like affiliation when the @type organization is used with the author or creator properties.)
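As a rough illustration of the proposal above, here is a minimal Python sketch that builds a "creator" list with an explicit @type and leaves affiliation off organizations. The input record fields ("name", "kind", "affiliation") and the function name are hypothetical; this is not how Dataverse's exporter is written.

```python
import json


def build_creators(authors):
    """Build a schema.org "creator" list from simple author records.

    Each record is a dict with "name", "kind" ("person" or "organization")
    and an optional "affiliation". These field names are assumptions made
    for this sketch only.
    """
    creators = []
    for author in authors:
        if author["kind"] == "organization":
            # Affiliation is left out for organizations, per the note above.
            creators.append({"@type": "Organization", "name": author["name"]})
        else:
            entry = {"@type": "Person", "name": author["name"]}
            if author.get("affiliation"):
                entry["affiliation"] = author["affiliation"]
            creators.append(entry)
    return creators


example = build_creators([
    {"name": "Lname, Fname", "kind": "person", "affiliation": "affiliation"},
    {"name": "Org name", "kind": "organization"},
])
print(json.dumps({"creator": example}, indent=2))
```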

Also, Google's Structured Data Testing Tool is no longer showing errors when author or creator types are missing (and it defaults to "Thing" instead of Person or Organization), although I still agree that Dataverse's schema.org metadata should say whether dataset authors are people or organizations.

Today during sprint planning @jggautier explained his hunch on how switching from author to creator might help. This was while discussing #4371.

I asked for clarification in the structured data section of Google's webmaster forum.

"Creator" seems like the more commonly used property, but DataCite is using "author" with an @type. On this dataset on Google Dataset Search, authors are displayed even though the "author" property is used, so maybe Google does want to see the specified @type.

Hi Julian,

Thanks very much for following up on this. Changing the "author" property
to "creator" seems like a good idea. Let's see how Google will respond to
this!

On Sat, Oct 6, 2018 at 12:47 AM Julian Gautier notifications@github.com
wrote:

I asked for clarification
https://productforums.google.com/forum/#!topic/webmasters/Ix1PXcY9IHc
in the structured data section of Google's webmaster forum.


Author names are showing up on some but not all Google Dataset Search pages for datasets in Dataverse repositories, like this page for a dataset from the Texas Data Repository (TDR), and this page for a dataset from Harvard Dataverse. But those pages also say metadata is coming from DataCite, which publishes its own schema.org metadata and uses only the "author" property, but includes its guessed @type (e.g. this schema.org metadata from DataCite for that TDR dataset). From what I can tell so far, every Google Dataset Search page for datasets from a Dataverse repository includes author names only when the "dataset provided by" includes DataCite.

Harvard Dataverse upgraded to Dataverse 4.10.1 two days ago (Jan 8), which includes adding the "creator" property to the schema.org metadata. Once Google starts indexing more recently published datasets, we can see if authors are displayed on Google Dataset Search pages (especially when Google isn't also using DataCite's schema.org metadata).

Just an update: Datasets published in Harvard Dataverse after Jan 7, with the updated Schema.org metadata, are showing up in Google Dataset Search without the author names (like this one and this one). I think we can rule out any preference for "author" versus "creator" elements.

@jggautier bummer. Does that mean we should try adding "@type": "Person"? As indicated above, the Dataverse UI/API would need to allow dataset authors to choose between a person and an organization.

That or we could do what @mfenner wrote in #2243 that DataCite does, which is basically guess (with >90% accuracy).
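To make the "guess" option concrete, here is a deliberately naive Python sketch. It is not DataCite's method and not the code from any Dataverse pull request; the keyword list, the comma rule, and the function name are all assumptions chosen for illustration.

```python
import re

# Hypothetical keyword list; a real implementation would need a much larger one.
ORG_KEYWORDS = {
    "organization", "university", "institute", "survey",
    "center", "centre", "department", "agency", "bureau",
}


def guess_creator_type(name):
    """Guess a schema.org @type for an author string.

    Returns "Organization", "Person", or None when the heuristic cannot decide.
    """
    words = set(re.findall(r"[a-z]+", name.lower()))
    if words & ORG_KEYWORDS:
        return "Organization"
    # Assumption: personal names are entered as "Lname, Fname".
    if "," in name:
        return "Person"
    return None


for candidate in ("Gallup Organization", "Geological Survey (U.S.)", "Lname, Fname"):
    print(candidate, "->", guess_creator_type(candidate))
```

Returning None for undecided names keeps the exporter free to omit @type rather than publish a wrong one.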

@pdurbin and @jggautier thank you for looking into this problem. I thought I would add my findings here in case they help with the changes in the next version.

What I discovered using the Structured Data Testing Tool is that for NTU datasets the author fields show as Thing. If I set "@type": "Person" for one of the authors, the tool doesn't show any error. I think we must include "@type": "Person" for all the authors in the ld+json script. Then the Google dataset results page will show the author names under Person.

image

In an example from GBIF (the Global Biodiversity Information Facility), the dataset record on the Google dataset results page displays author names, and its metadata uses "@type": "Person".

image

Please refer to the screenshot below
image

I think we need to add "@type": "Person" in Dataverse to show the author name in Google dataset page.
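A quick way to spot the "Thing" problem without the testing tool is to look for author or creator entries that lack an @type. The Python sketch below does that on a JSON-LD string; the function name and the sample metadata are made up for illustration.

```python
import json


def untyped_agents(jsonld_text):
    """Return (property, name) pairs for author/creator entries lacking @type."""
    data = json.loads(jsonld_text)
    missing = []
    for prop in ("author", "creator"):
        value = data.get(prop, [])
        if isinstance(value, dict):
            # A single object is allowed in place of a list.
            value = [value]
        for agent in value:
            if isinstance(agent, dict) and "@type" not in agent:
                missing.append((prop, agent.get("name")))
    return missing


sample = json.dumps({
    "@context": "http://schema.org",
    "@type": "Dataset",
    "author": [{"name": "Lname, Fname"}],
    "creator": [{"@type": "Person", "name": "Lname, Fname"}],
})
print(untyped_agents(sample))
```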

@Venki18 hi! Yes, the "every person and organization is a Thing" problem is well known to us, unfortunately.

I've been hoping we can use some new code added by @fcadili in pull request #4664 to pass in a string that could either be a person or an organization, and the code will tell us which it is.

I haven't studied the code yet but here's a test he wrote that shows the code figuring out if a string is for an organization or a person, for example:

https://github.com/IQSS/dataverse/blob/v4.14/src/test/java/edu/harvard/iq/dataverse/export/OrganizationsTest.java

Screenshot from 2019-06-05 06-23-00

@pdurbin thank you for the quick reply. May I ask how the export for TermsOfUse works? We have been using CC-BY-NC instead of CC0 and we have changed the necessary text in the Bundle.properties file, but we are still using the CC0 code as it is. So when you export to ld+json format, it is exported as CC0. For all our datasets with custom waiver terms, the code exports what is entered in the additional text box. For the default CC0 it is taken as it is.
Is there any way to show CC-BY-NC?

@Venki18 I'm not sure but let me at least give you and @Thanh-Thanh and others some pointers to the code:

https://github.com/IQSS/dataverse/blob/v4.14/src/main/java/edu/harvard/iq/dataverse/DatasetVersion.java#L1777

It looks like if CC0 isn't specified the code will put in the free form text the user entered as an alternative to CC0.
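The behavior described above can be sketched roughly as follows. This is only a Python paraphrase of that description, not a translation of DatasetVersion.java; the function name, parameters, and output shape are assumptions.

```python
CC0_URL = "https://creativecommons.org/publicdomain/zero/1.0/"


def license_for_export(uses_cc0, terms_of_use_text):
    """Pick the license value for the JSON-LD export.

    uses_cc0: True when the dataset keeps the built-in CC0 option.
    terms_of_use_text: free-form text entered as an alternative to CC0.
    """
    if uses_cc0:
        # The built-in option is exported as CC0 regardless of how the
        # label was reworded in the UI text.
        return {"text": "CC0", "url": CC0_URL}
    return {"text": terms_of_use_text}


print(license_for_export(True, ""))
print(license_for_export(False, "CC-BY-NC"))
```

Under these assumptions, an installation that relabels the CC0 option in Bundle.properties would still export CC0, which matches what @Venki18 reported.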

This is somewhat off topic for this issue, of course, but I hope this helps! :smile: Please feel free to create as many issues as we need!

@Venki18 also, if you're interested @rigelk and I are talking about Schema.org JSON-LD, especially in relation to ActivityPub (#5883) in chat. You can catch up on the conversation at http://irclog.iq.harvard.edu/dataverse/2019-06-05
