Dataverse: Make dataverse pages more discoverable by search engines

Created on 6 Mar 2019 · 12 Comments · Source: IQSS/dataverse

Changed the title of the issue to indicate that the goal is to make individual dataverses more discoverable in Google and other search engines.
Embedding structured metadata into the dataverse page is not necessarily the best way to achieve that.
It appears more practical to focus on improving the crawl rules (specifically, discouraging the bots from crawling the facets and paginated search results on dataverse pages), in combination with a sitemap that points the bots directly to all the datasets and dataverses.

(end update)

We already go to admirable lengths embedding some structured metadata (DC, schema.org) into our dataset pages, making individual datasets more discoverable.
Our dataverse pages would benefit from similar, easily indexable metadata.

landreev

Most helpful comment

@jggautier @mheppler
Here's the promised update:
Please disregard most of the info I entered earlier... because it's all lies! Seriously, most of it is no longer relevant.
1) Improved crawler rules have been/are being addressed in #5639;
2) "Empty Google search cards" (when Google finds a dataverse, but shows an empty card with the text "No information is available"): whatever was causing this, it was not a problem with the page itself. It was probably the result of a failed crawl. Once a dataverse page is successfully recrawled and reindexed, it no longer shows as empty on Google. For the dataverse in the example above (note that the dataverse doesn't have any user-entered description!), this is what the search result looks like now:
[Screenshot: Screen Shot 2019-03-11 at 11 06 42 AM]

So, since there's no description, it's just showing whatever is at the top of the page. Which happens to be the number of datasets in the dataverse plus part of the description of the first dataset.

This card looks OK to me (definitely not as bad as the "no information is available..." before). But I guess the remaining question is: is there anything at all that we can do to make it any better or more useful if the owner of the dataverse hasn't provided any description? One thing that was suggested (by Mike): maybe we could extract a summary of the data in the dataverse from the facets on the page, since they list all the subjects, authors, categories, etc.

So the way it would work, we could embed a "DC.description" metadata fragment into the html of the page, similarly to what we do on the dataset page. If the dataverse has a description, we use that to populate it. If not, we generate some description on the fly: "This dataverse contains datasets on the subjects of ... by the authors ..." (for example - ?)
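The fallback logic described above could be sketched roughly as follows. This is only an illustration in Python, not the application's actual code: the function names, the `max_items` cutoff, and the assumption that subjects and authors arrive as plain lists taken from the facets are all hypothetical.

```python
from html import escape


def dataverse_description(description, subjects, authors, max_items=3):
    """Return the owner's description, or a generated summary if it is empty.

    `subjects` and `authors` are assumed to be lists of strings taken from
    the facets on the dataverse page (hypothetical inputs).
    """
    if description and description.strip():
        return description.strip()

    parts = []
    if subjects:
        parts.append("on the subjects of " + ", ".join(subjects[:max_items]))
    if authors:
        parts.append("by the authors " + ", ".join(authors[:max_items]))
    if not parts:
        # Nothing to summarize: fall back to a generic sentence.
        return "This dataverse contains datasets."
    return "This dataverse contains datasets " + " ".join(parts) + "."


def dc_description_meta(text):
    """Render the text as a DC.description meta tag, escaping HTML characters."""
    return '<meta name="DC.description" content="{}" />'.format(
        escape(text, quote=True)
    )
```

For example, `dc_description_meta(dataverse_description("", ["Law"], ["Jane Smith"]))` would produce a tag whose content is generated from the facets, while a dataverse with a user-entered description would simply have that description embedded.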

Also, an alternative is not to bother with any of this - and instead encourage the dataverse owners to enter meaningful description text, to make their data more discoverable.

landreev on 15 Mar 2019

👍2

All 12 comments

As of now, many of our dataverse pages appear in google search results like this:
[Screenshot: screen shot 2019-03-06 at 1 52 10 pm]

landreev on 6 Mar 2019

This problem (empty Google search record) appears to be more common for dataverse URLs of the "/dataverse/NAME" format than for the "/dataverse.xhtml?alias=NAME" one:
[Screenshot: screen shot 2019-03-06 at 1 55 07 pm]

This may or may not be related to #3130 (trailing slash in the dataverse URL resulting in a 404). But even if this were the case, adding structured metadata to the dataverse page would still be very useful, and would make the process of getting indexed in the search engines more efficient.

landreev on 6 Mar 2019

Huh. I'm surprised that the description of the dataverse isn't indexed. Ever since pull request #4879 was merged (a fix for #4468 which has a screenshot from Google search results), dataverses look much better when you link to them in Slack.

@landreev any thoughts on file landing pages? Would helping Google index them better be worth investigating, perhaps in a separate issue?

pdurbin on 6 Mar 2019

As I said, this may simply be the result of that trailing slash issue.
File pages could be something to investigate separately in the future, yes. (as of now, we are telling the bots to stay away from file pages completely).

landreev on 6 Mar 2019

Hmm, the "worldfish" dataverse has an empty record with the "alias=" URL:
[Screenshot: screen shot 2019-03-06 at 2 11 28 pm]

Then of course this may be a search result cached from before #4879 was merged. (I'm seeing that the bot has finally re-crawled this dataverse in the last few hours; so hopefully the updated entry will start appearing in searches shortly)

landreev on 6 Mar 2019

It might be worth having only one URL resolve to a dataverse page. If I'm remembering correctly, at least some search engines recommend that if there are multiple URLs with the same content, one of them should be marked as canonical with a rel=canonical link tag. Having /dataverse.xhtml?alias=foo redirect to /dataverse/foo (and removing generated links to /dataverse.xhtml?alias=foo) might help with this.
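As a rough sketch of that idea, the following Python maps both URL forms to the single /dataverse/NAME form and renders a canonical link tag for it. The function names and the `base_url` parameter are hypothetical, and this is not how the application itself handles URLs; the same mapping could equally be used to compute a redirect target.

```python
from urllib.parse import parse_qs, quote, urlsplit


def canonical_dataverse_path(url):
    """Map either dataverse URL form to /dataverse/NAME.

    Returns None if the URL does not look like a dataverse page.
    Search arguments and a trailing slash are dropped.
    """
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if path.endswith("/dataverse.xhtml"):
        # "/dataverse.xhtml?alias=NAME" form: the name is in the query string.
        alias = parse_qs(parts.query).get("alias", [""])[0]
    elif path.startswith("/dataverse/"):
        # "/dataverse/NAME" form: the name is the rest of the path.
        alias = path[len("/dataverse/"):]
    else:
        return None
    if not alias or "/" in alias:
        return None
    return "/dataverse/" + quote(alias, safe="")


def canonical_link_tag(base_url, url):
    """Render a rel=canonical link tag, or None for non-dataverse URLs."""
    path = canonical_dataverse_path(url)
    if path is None:
        return None
    return '<link rel="canonical" href="{}{}" />'.format(
        base_url.rstrip("/"), path
    )
```

With this mapping, /dataverse.xhtml?alias=foo, /dataverse/foo/ and /dataverse/foo?page=2 would all point at /dataverse/foo.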

pameyer on 6 Mar 2019

It may be a good idea to exclude the "dataverse.xhtml?alias=..." format from crawling, via robots. And to completely exclude ALL the forms of the dataverse page URLs except for the canonical "/dataverse/name", without any extra (search) arguments. As of now, we allow/encourage the bot to crawl through all the facets, and through the paginated search results. This is inefficient, and does not result in anything useful being indexed.
So the solution should be to only allow one form of the dataverse page, and one form of the dataset page. And only expose them to the bots via sitemap, without relying on crawling at all.
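To illustrate the sitemap half of this, here is a minimal Python sketch that lists one URL per dataverse and one per dataset in the standard sitemap XML format. The function name and inputs are hypothetical, and the dataset URL form used here (/dataset.xhtml?persistentId=...) is an assumption made for the example; this is not the implementation from #5639.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, dataverse_aliases, dataset_pids):
    """Return sitemap XML with exactly one URL per dataverse and dataset.

    Only the canonical /dataverse/NAME form is listed for dataverses.
    The dataset URL form is an assumption made for this sketch.
    """
    base = base_url.rstrip("/")
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for alias in dataverse_aliases:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = "{}/dataverse/{}".format(base, alias)
    for pid in dataset_pids:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = "{}/dataset.xhtml?persistentId={}".format(base, pid)
    return ET.tostring(urlset, encoding="unicode")
```

Because every page is listed explicitly, the bots would not need to discover datasets by walking facets or paginated search results, which is what the crawl rules would then be free to disallow.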

landreev on 6 Mar 2019

I would be remiss in my obligations as issue author if I didn't point out _Dataset - PrettyFaces URL Format #2486_ fitting in both the "dataverse URL forward slash forwarding" and the "dataverse content indexing" story. Especially if we are making changes to block /dataverse.xhtml?alias=... indexing.

It would make more sense to me, and maybe even to a search engine robot, if we had a URL formatting structure that matched the dataverse > dataset > file hierarchy of our app. Something like:

- /dataverse/example
    - /dataverse/example/datasets/doi:10.0000/DVN/XXXXXX
    - /dataverse/example/datasets/doi:10.0000/DVN/YYYYYY
    - /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ

Maybe this is a bigger ask than I realize, but there is value in improving what we have now. I would very much like to improve the navigation experience of our app. The format of our URLs is a big part of this. Another part, which might be a conversation for another day, is the use of editMode for these pages. I was reminded of this when I found that the Add Dataset page was indexed in Google.

[Screenshot: screen shot 2019-03-06 at 5 31 20 pm]

mheppler on 6 Mar 2019

👍1

@mheppler I think improving the format of the URLs would be a great idea; /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ , /dataset/doi:10.0000/DVN/ZZZZZZ or dataset/doi/10.0000/DVN/ZZZZZZ seem like better matches to the conceptual model. I'm not sure about the implementation complexity, but my impression is that it's enough that it should have its own issue.

pameyer on 6 Mar 2019

@landreev is going to update this issue and discuss with @jggautier and @mheppler.

djbrooke on 13 Mar 2019

Also, the issue of both the "dataverse.xhtml?alias=<name>" and "/dataverse/<name>" pages being indexed for the same dataverse has been addressed by the improved robots + sitemap approach in #5639.

landreev on 15 Mar 2019
