Dataverse: Usernames are case sensitive

Created on 17 Jan 2017  路  19Comments  路  Source: IQSS/dataverse

@dlmurphy asked the following at https://github.com/IQSS/dataverse/pull/3539#issuecomment-270491499

Under Create Account it mentions that the Username field supports characters "a-Z". This implies that uppercase letters are allowed and that Dataverse distinguishes between lowercase and uppercase letters in usernames. Are both of these implications accurate? Would bob11 and BOB11 be separate usernames?

I posted my answer at https://github.com/IQSS/dataverse/pull/3539#issuecomment-270658359

I tested bob11 vs. BOB11 and it's apparently allowed. There's always a "dataverseAdmin" account and I was able to create a second account as "DATAVERSEADMIN". I'll let you think about if this is a bug of a feature but I will say that we disallow this for the "alias" of a small d dataverse.

We discussed this a bit more today an I'm fairly sure that case-insensitive usernames is a usability issue. I can imagine a poor user saying "When you assign me permission to your dataset please be sure to choose @jharvard and not @JHARVARD because the upper-case version isn't me!"

1445 is highly related.

Account & User Info Medium Bug

All 19 comments

Thanks for opening this issue, Phil.

Another concern is the potential for abuse. Example: someone with the username MichelleObama (with two lowercase L's) could be spoofed by someone with the username MicheIIeObama (with two uppercase i's). It's also possible that this sort of confusion could arise by accident as well.

Yeah, we should probably fix this at some point but it hasn't been a priority. Let's make a new issue when we decide to pick this up.

This issue still exists in Dataverse. Also Julian and I noticed that the search field on the Dataverse permissions page is case sensitive. Eg: searching for "bob" will not find the "Bob11" account.

Just looking at usernames that begin with the letter 'a', I came across 16 instances of usernames that are identical except for capitalization differences.

We should display our regular message about the username already being taken when someone tries to create a username with different capitalization.

screen shot 2019-02-13 at 10 51 04 pm

I mentioned in standup that it would be great if JPA supported a case insensitive uniqueness constraint so I just opened https://github.com/eclipse-ee4j/jpa-api/issues/209

I also indicated that our previous fix was for #2598 to disallow people from choosing the same dataverse "alias" but with different cases ("mra" vs "MRA" or whatever). So I assume we'll apply a similar fix. Here's the fix we added back then:

CREATE UNIQUE INDEX dataverse_alias_unique_idx on dataverse (LOWER(alias));

https://github.com/IQSS/dataverse/blob/v4.10.1/scripts/database/reference_data.sql#L31

See also https://stackoverflow.com/questions/25743191/how-to-add-a-case-insensitive-jpa-unique-constraint

For existing accounts Gustavo is running queries to identify accounts with usernames that have case insensitive matches. Danny is deleting inactive accounts and communicating with owners of active accounts on an ad hoc basis and combining accounts or changing user names as needed to eliminate duplicates.

The goal of this ticket is that there are no additional accounts created with the same case insensitive user name as an existing account.

It has been proposed that we save all new usernames in all lower case regardless of how they are entered. On login a user may enter a mixed case user name and it will be compared as an all lower case with existing entries. (The effect on the UX is that the user may see the all lowercase username when they edit their account information.) Is this worth mentioning on account creation or on the edit account page?

Once all of the duplicate removal has been completed should we go back and update the unaffected accounts so that all user names are entirely lower case? Should the users be notified and how?

It's fine if account names are saved and displayed as lowercase. No need to notify folks in the UI or via email.

Thanks @sekmiller and @scolapasta for talking this through with me a few minutes ago.

I had assumed that we'd roll out this fix to prevent new duplicate-username accounts from being created upon signup (my previous comment) and then resolve the duplicate accounts on our own schedule, using the new endpoints in #5514 and #5515, but you suggested that it would be better to resolve at the DB level.

This means that we'll need to deploy a release with the fixes for #5514 and #5515, installations would need to clean up any conflicting user names, and then the next release could include this fix. I'm OK with this from a Harvard Dataverse perspective, but it would involve some communication overhead with other installations. Any installations that did not clean up duplicate accounts would get a constraint violation or something during the upgrade process. Anyway, we should discuss when we get a chance to catch up next.

Moving back to not started and removing from 4.11 until we determine the preferred approach.

The 4.12 release that will allow the cleanup is out (including endpoints for #5514 and #5515), so we can now go ahead with the DB approach and include it in a future release.

@sekmiller found a bug and suggested a solution (thanks!) I implemented the fix in 392d26ea1 and I'm moving this back to code review.

I'll edit this comment in a bit but here's what I wrote down for scope in the "talk after" after standup:

In scope:

  • release note
  • login with DATAVERSEADMIN (Done SEK - 5/1)
  • login with [email protected] (Already had that functionality)
  • role assignments (Already had that functionality)

Out of scope:

  • logging in with email addresses that have Chinese characters
  • renaming dataverseAdmin to DATAVERSEADMIN

discussed with @djbrooke, since "renaming dataverseAdmin to DATAVERSEADMIN" was a one line change, we went ahead and added it to this issue / PR

Do we need to add a "lower case" constraint on username on BuiltInUsers as well?

No, I think we should drop the "username" column from the "builtinuser" table. When a builtin user logs in via email, the "authenticate" method in BuiltinAuthenticationProvider already uses a getAuthenticatedUserByEmail method look up the user by email addresses from the "authenticateduser" table. We could do something similar to look up the user by "useridentifier" from the "authenticateduser" table. This would be a good follow on to the redunant columns we dropped from the "builtinuser" table in pull request #4993 for #4565.

@pdurbin I agree that would be fine (in fact we could even consider moving password to authenticateduser or authenticateduserlookup?) but with that come some validation challenges, #5579. We could handle this idea as part of that issue and out of scope here.

If so, I still don't think we need to add the constraint to builtin user since you wouldn't be able to add a case insensitive duplicate (it would fail the check of authenticated user identifier uniqueness).

@dlmurphy, the gist of the release notes message is:

  • In this release we introduce a database constraint that no longer allows usernames that are the same but with different capitalization, so you may need to do some cleanup before upgrading
  • To find if you have any usernames that need cleaning up run the duplicate user name query (it's in the useful sql queries doc I believe)
  • Once you identify the user names that need cleaning up, you should use either merge endpoint (if it鈥檚 the same person) or the rename endpoint (if they are different people)
  • After the cleanup you can upgrade

(happy to discuss in person if it's easier!)

5811 has been merged and will be included in the next release. After that release is out on Github, we can then merge this issue and include it in the following release (since the cleanup is a dependency). I'll sit on this one for now.

Was this page helpful?
0 / 5 - 0 ratings