Changed the title of the issue to indicate that the goal is to make individual dataverses more discoverable in Google and other search engines.
Embedding structured metadata into the dataverse page is not necessarily the best way to achieve that.
It appears more practical to focus on improving the crawl rules (specifically, discouraging the bots from crawling the facets and paginated search results on dataverse pages), in combination with using a sitemap to point the bots to all the datasets and dataverses directly.
(end update)
We already go to admirable lengths embedding some structured metadata (DC, schema.org) into our dataset pages, making individual datasets more discoverable.
It would benefit our dataverse pages to have similarly easily indexable metadata as well.
As of now, many of our dataverse pages appear in google search results like this:

This problem (an empty Google search record) appears to be more common for dataverse URLs of the "/dataverse/NAME" format than for the "/dataverse.xhtml?alias=NAME" one:

This may, or may not be related to #3130 (trailing slash in the dataverse URL resulting in a 404). But even if this were the case, adding structured metadata to the dataverse page would still be very useful, and would make the process of getting indexed in the search engines more efficient.
Huh. I'm surprised that the description of the dataverse isn't indexed. Ever since pull request #4879 was merged (a fix for #4468 which has a screenshot from Google search results), dataverses look much better when you link to them in Slack.
@landreev any thoughts on file landing pages? Would helping Google index them better be worth investigating, perhaps in a separate issue?
As I said, this may simply be the result of that trailing slash issue.
File pages could be something to investigate separately in the future, yes. (as of now, we are telling the bots to stay away from file pages completely).
Hmm, the "worldfish" dataverse has an empty record with the "alias=" URL:

Then of course this may be a search result cached from before #4879 was merged. (I'm seeing that the bot has finally re-crawled this dataverse in the last few hours; so hopefully the updated entry will start appearing in searches shortly)
It might be worth having only one URL resolve to a dataverse page. If I'm remembering correctly, at least some search engines recommend that if there are multiple URLs with the same content, the duplicates should point to the preferred one with a rel=canonical link tag. Having /dataverse.xhtml?alias=foo redirect to /dataverse/foo (and removing generated links to /dataverse.xhtml?alias=foo) might help with this.
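To make the canonical-URL idea concrete, here is a minimal Python sketch of the mapping, not anything taken from the Dataverse codebase. The function names and the `demo.example.org` host are made up for illustration, and it assumes that `/dataverse/NAME` is the form we want to keep.

```python
from html import escape
from urllib.parse import parse_qs, urlsplit, urlunsplit


def canonical_dataverse_url(url: str) -> str:
    """Map either dataverse URL form to the bare /dataverse/NAME form.

    URLs that are not dataverse pages are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.path == "/dataverse.xhtml":
        aliases = parse_qs(parts.query).get("alias")
        if not aliases:
            return url  # no alias parameter, nothing to canonicalize
        alias = aliases[0]
    elif parts.path.startswith("/dataverse/"):
        # Drop a trailing slash; give up on anything with extra path segments.
        alias = parts.path[len("/dataverse/"):].strip("/")
        if not alias or "/" in alias:
            return url
    else:
        return url
    # Search arguments (facets, pagination) and fragments are dropped on purpose.
    return urlunsplit((parts.scheme, parts.netloc, "/dataverse/" + alias, "", ""))


def canonical_link_tag(url: str) -> str:
    """Render the <link rel="canonical"> element for the page head."""
    href = escape(canonical_dataverse_url(url), quote=True)
    return '<link rel="canonical" href="%s"/>' % href


print(canonical_link_tag("https://demo.example.org/dataverse.xhtml?alias=worldfish"))
```

The same function could back a server-side redirect from the "alias=" form, so that both approaches agree on what the canonical URL is.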
It may be a good idea to exclude the "dataverse.xhtml?alias=..." format from crawling, via robots.txt, and to completely exclude ALL forms of the dataverse page URLs except for the canonical "/dataverse/name", without any extra (search) arguments. As of now, we allow/encourage the bot to crawl through all the facets and through the paginated search results. This is inefficient, and does not result in anything useful being indexed.
So the solution should be to only allow one form of the dataverse page, and one form of the dataset page. And only expose them to the bots via sitemap, without relying on crawling at all.
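As a rough illustration of the sitemap half of this, the sketch below builds a sitemap listing exactly one URL per dataverse and one per dataset. It is only a sketch under assumptions: the `build_sitemap` name, the example host and the choice of `/dataset.xhtml?persistentId=...` as the single dataset URL form are mine, not a description of how the application generates its sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, dataverse_aliases, dataset_pids):
    """Return sitemap XML with one canonical URL per dataverse and dataset."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for alias in dataverse_aliases:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = f"{base_url}/dataverse/{alias}"
    for pid in dataset_pids:
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        # Assumed dataset URL form; swap in whichever single form is chosen.
        loc.text = f"{base_url}/dataset.xhtml?persistentId={pid}"
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap(
    "https://demo.example.org",
    ["example"],
    ["doi:10.0000/DVN/XXXXXX", "doi:10.0000/DVN/YYYYYY"],
))
```

With a sitemap like this in place, robots.txt can disallow everything that carries search arguments, since the bots no longer need to crawl to find the pages.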
I would be remiss in my obligations as issue author if I didn't point out _Dataset - PrettyFaces URL Format #2486_ fitting in both the "dataverse URL forward slash forwarding" and the "dataverse content indexing" story. Especially if we are making changes to block /dataverse.xhtml?alias=... indexing.
It would make more sense to me, and maybe even to a search engine robot, if we had a URL formatting structure that matched the dataverse > dataset > file hierarchy of our app. Something like:
- /dataverse/example
- /dataverse/example/datasets/doi:10.0000/DVN/XXXXXX
- /dataverse/example/datasets/doi:10.0000/DVN/YYYYYY
- /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ
Maybe this is a bigger ask than I realize, but there is value to improving what we have now. I would very much like to improve the navigation experience of our app. The format of our URLs is a big part of this. Another part, which might be a conversation for another day, is the use of editMode for these pages. I was reminded of this when I found that the Add Dataset page was indexed in Google.

@mheppler I think improving the format of the URLs would be a great idea; /dataverse/example/datasets/doi:10.0000/DVN/ZZZZZZ, /dataset/doi:10.0000/DVN/ZZZZZZ or dataset/doi/10.0000/DVN/ZZZZZZ seem like better matches to the conceptual model. I'm not sure about the implementation complexity, but my impression is that it's enough that it should have its own issue.
@landreev is going to update this issue and discuss with @jggautier and @mheppler.
@jggautier @mheppler
Here's the promised update:
Please disregard most of the info I entered earlier... because it's all lies! Seriously, most of it is no longer relevant.
1) Improved crawler rules have been/are being addressed in #5639;
2) "empty google search cards" (when google finds a dataverse, but shows an empty card with the text "No information is available"): Whatever was causing this, it was not a problem with the page itself. Probably a result of a failed crawl. Once a dataverse page is successfully recrawled and reindexed, it no longer shows as empty on Google. For the dataverse in the example above (note that the dataverse doesn't have any user-entered description!) this is what the search result is looking like now:

So, since there's no description, it's just showing whatever is at the top of the page. Which happens to be the number of datasets in the dataverse plus part of the description of the first dataset.
This card looks ok to me (definitely not as bad as the "no information is available..." before). But I guess the remaining question is - is there anything at all that we can do to make it any better/any more useful, if the owner of the dataverse hasn't provided any description? One thing that was suggested (by Mike), maybe we could extract some summary of the data in the dataverse from the facets on the page - since they list all the subjects/authors/categories, etc.?
So the way it would work, we could embed a "DC.description" metadata fragment into the html of the page, similarly to what we do on the dataset page. If the dataverse has a description, we use that to populate it. If not, we generate some description on the fly: "This dataverse contains datasets on the subjects of ... by the authors ..." (for example - ?)
Also, an alternative is not to bother with any of this - and instead encourage the dataverse owners to enter meaningful description text, to make their data more discoverable.
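The fallback described in this comment (use the owner's description if there is one, otherwise generate a sentence from the facets) could be sketched roughly as follows. This is illustrative Python only: the function names, the `max_items` cutoff and the sample subject and author values are assumptions, and the real page would do this in its own templating layer.

```python
from html import escape


def dataverse_description(description, subjects, authors, max_items=3):
    """Use the owner's description if present, else build one from facet values."""
    if description and description.strip():
        return description.strip()
    parts = []
    if subjects:
        parts.append("on the subjects of " + ", ".join(subjects[:max_items]))
    if authors:
        parts.append("by the authors " + ", ".join(authors[:max_items]))
    if not parts:
        return "This dataverse contains datasets."
    return "This dataverse contains datasets " + " ".join(parts) + "."


def dc_description_meta(description, subjects, authors):
    """Render the DC.description meta tag, escaping the text for an attribute."""
    text = dataverse_description(description, subjects, authors)
    return '<meta name="DC.description" content="%s"/>' % escape(text, quote=True)


# Sample facet values below are made up for illustration.
print(dc_description_meta(None, ["Agricultural Sciences", "Social Sciences"], ["Jane Doe"]))
```

Capping the number of subjects and authors keeps the generated text short enough to be useful as a search result snippet.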