Dataverse: Swift storage driver no longer working with MOC (because of StorageIdentifier changes)

Created on 10 Apr 2019 · 10 comments · Source: IQSS/dataverse

We haven't been using or testing the Swift storage driver in the last 6 months.
We have just renewed our Massachusetts Open Cloud accounts and found that the driver is no longer working with their swift nodes.
This is unrelated to, and on top of, the authentication issues discussed in #5750; it happens after successful authentication.

The storage driver appears to have been accidentally broken last fall, when we introduced the concept of using the legacy global id for the storage folder; then it looks like it was fixed by outside contributors - but in a way that worked for them while leaving it broken for us. I'm not 100% sure whether this is specific to MOC, or to the fact that we are using DOIs and not handles... But I'm leaving more relevant details below.

All 10 comments

@carlosmcgregor Hello,
I'm looking at the code changes in SwiftAccessIO; it looks like it was repeatedly broken last fall. But yours are the latest fixes related to how the storage identifiers for DataFiles are created there...
We hadn't been using Swift for a few months here at Harvard; we have just tested it now and realized it's no longer working for us with the local OpenStack cloud that we used in the past.

I'm assuming it is working for you (correct?). (or are you running a local fork of the code, that's different from what's currently in the develop branch?)

Did you have a specific reason to use that "swiftPseudoFolderPathSeparator", set to "/", to create multiple-level swift "folders"? Originally, we went to the trouble of purposefully replacing all the slashes in the global id with "-" (although, to be honest, I can no longer remember why - ??)
So the idea was that for a DataFile in the dataset with the global id "doi:10.5072/FK2/LMEAIE", the swift "folder" would be 1 level deep - "doi-10-5072-FK2-LMEAIE" - with no intermediate subfolders. And the storageidentifier for the datafile would be something like "swift://endpoint1:doi-10-5072-FK2-LMEAIE:16a08c084c4-4087881655ae".
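
For illustration, here's a minimal sketch of that original flattening scheme in plain Java - the class and helper names are made up, this is not the actual SwiftAccessIO code:

```java
public class SwiftFolderNameSketch {

    // Flatten "doi:10.5072/FK2/LMEAIE" into "doi-10-5072-FK2-LMEAIE" by
    // replacing every ":", "." and "/" in the global id with a dash.
    static String flattenGlobalId(String globalId) {
        return globalId.replace(":", "-")
                       .replace(".", "-")
                       .replace("/", "-");
    }

    public static void main(String[] args) {
        String folder = flattenGlobalId("doi:10.5072/FK2/LMEAIE");
        // storageidentifier layout: swift://<endpoint>:<folder>:<file name>
        System.out.println("swift://endpoint1:" + folder + ":16a08c084c4-4087881655ae");
        // -> swift://endpoint1:doi-10-5072-FK2-LMEAIE:16a08c084c4-4087881655ae
    }
}
```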

What I'm seeing now is that the storageidentifier looks like
"swift://endpoint1:doi-10-5072:FK2/LMEAIE/16a08c084c4-4087881655ae"... aside from the unescaped slashes, the token-separator ":" is now between "doi-10-5072" and "FK2/LMEAIE/16a08c084c4-4087881655ae" (??)
What actually happens is that when I try to upload a file, it gets successfully stored on swift - in the subfolder "doi-10-5072/FK2/LMEAIE/". But then Dataverse cannot find that file on swift - so I get a 404 trying to download the file through Dataverse (though it can be downloaded from swift directly).
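
To spell out what I mean by the separator being in the wrong place, here's a toy illustration of how a ":"-delimited identifier like that gets taken apart - just an illustration of the layout, not the actual SwiftAccessIO parsing code:

```java
public class StorageIdentifierParseSketch {
    public static void main(String[] args) {
        // The broken identifier from above, split on the ":" token separator:
        String id = "swift://endpoint1:doi-10-5072:FK2/LMEAIE/16a08c084c4-4087881655ae";
        String[] parts = id.substring("swift://".length()).split(":");
        String endpoint = parts[0];  // "endpoint1"
        String container = parts[1]; // "doi-10-5072"
        String object = parts[2];    // "FK2/LMEAIE/16a08c084c4-4087881655ae"
        // If the upload path and the lookup path are computed differently from
        // these pieces, the object lands in one place and the GET looks in
        // another - hence a 404 on download even though the bytes are in swift.
        System.out.println(endpoint + " | " + container + " | " + object);
    }
}
```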

OK, so I'm utterly confused here... Is it really working for you, with the storageidentifiers like the above?

Also, once again - do you have a specific reason to want to have these multiple-level pseudo-folders on the swift side? I.e., I could fix it by just making it the way it was before... But then I don't want to break things for you again!

OK, well, it certainly serves us right - for not having tested this stuff in such a long time, through all that hacking that was done in that class file.

Hello @landreev ! I have only done two pull requests re: Swift driver:

The changes to swiftPseudoFolderPathSeparator predate my initial pull request. Looking back at my first bug fix, it seems the commit that changed this caused the bug: JOSS was unable to send the proper request, since Swift expects a specific pseudo-folder structure in its requests.

However, I do understand why this change was made to swiftPseudoFolderPathSeparator. Given the structure you showed me, having folders 1 level deep, e.g. doi-10-5072-FK2-LMEAIE, would mean that a Swift container has to be created every time a dataset is generated. From an administration point of view this would be problematic, as Swift policies are set at the container level.

Arguably, we could have a single pseudo-folder inside the main container by joining doi-10-5072/FK2 into doi-10-5072-FK2, if that seems easier to manage. The reason we have CONTAINER/FK2/FOLDER/FILE now is that it was simpler. The two layouts are sketched below.
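
A minimal sketch of the two layouts, using JOSS (the Java Swift client the driver is built on); the credentials, endpoint, and wrapper class are placeholders, not the actual driver code:

```java
import org.javaswift.joss.client.factory.AccountFactory;
import org.javaswift.joss.model.Account;
import org.javaswift.joss.model.Container;

public class SwiftLayoutSketch {
    public static void main(String[] args) {
        // Placeholder credentials - not a real endpoint.
        Account account = new AccountFactory()
                .setUsername("user")
                .setPassword("secret")
                .setAuthUrl("https://swift.example.org/auth/v1.0")
                .createAccount();
        byte[] fileBytes = "example".getBytes();

        // Layout 1: one container per dataset (the original, flattened scheme).
        // Policies/ACLs can be set per dataset, but every new dataset costs a
        // container creation:
        Container perDataset = account.getContainer("doi-10-5072-FK2-LMEAIE");
        if (!perDataset.exists()) {
            perDataset.create();
        }
        perDataset.getObject("16a08c084c4-4087881655ae").uploadObject(fileBytes);

        // Layout 2: a single shared container with pseudo-folders (the current
        // scheme). Swift has no real directories - slashes in the object name
        // act as pseudo-folders - so no per-dataset container is needed:
        Container shared = account.getContainer("doi-10-5072");
        shared.getObject("FK2/LMEAIE/16a08c084c4-4087881655ae").uploadObject(fileBytes);
    }
}
```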

The latest version, available since 4.11 I believe, is working well with all transactions (upload, download, delete). If I recall correctly, we experienced the same problem you are seeing now, but I believe it had to do with a misconfiguration of temporary URLs. I will ask my colleagues tomorrow and let you know what the problem was!

@carlosmcgregor Thank you for the quick reply!
So, am I reading it wrong? NetBeans is telling me that the "swiftPseudoFolderPathSeparator" line was introduced by you, on Nov. 28, in b1cf60f... - no?
But regardless of who added it (sorry for the confusion), I think I understand what you're saying about creating top-level folders [edit: containers] for every new dataset. As I mentioned earlier, I personally can't remember the rationale behind flattening the entire folder structure to one level... so I'll ask the other developers here tomorrow whether they remember what it was, and whether it's still needed. Hopefully it's no longer necessary... For the record, our S3 driver (which we are using actively, in production here) doesn't flatten anything and creates multiple-level pseudo-folders.

But that still doesn't explain the mystery of why the storage driver in its current form is working for you, for both upload and download, and not working for us (upload working, download not working).

Could you please send an example of one of your real, working storageidentifiers from the database? - I just want to make sure we really are using the same code at this point.
Thanks!

@carlosmcgregor Thanks for the tip about the temporary URLs - I didn't even realize we were generating them for all swift lookup requests; I just saw it in the code... I'm still not seeing anything there that has any reason to be failing quietly... but I will look into it some more tomorrow.
And yes, please let us know if you uncover any info, about the problem with the temp. URLs you were having. Thank you again!
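
For reference, this is how a standard OpenStack Swift temporary URL gets signed - an HMAC-SHA1 over the method, expiry, and object path, using the account's temp-URL key. This is a sketch of the protocol with placeholder values, not our driver code; but if the key Dataverse signs with doesn't match the key set on the Swift account, the signed GETs will fail even though the objects are there:

```java
import java.nio.charset.StandardCharsets;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class SwiftTempUrlSketch {
    public static void main(String[] args) throws Exception {
        String method = "GET";
        long expires = System.currentTimeMillis() / 1000 + 3600; // valid for 1 hour
        String path = "/v1/AUTH_account/doi-10-5072/FK2/LMEAIE/16a08c084c4-4087881655ae";
        String tempUrlKey = "secret-key-from-swift-properties"; // placeholder

        // Swift's temp URL signature: HMAC-SHA1 over "METHOD\nexpires\npath".
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(tempUrlKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
        byte[] raw = mac.doFinal((method + "\n" + expires + "\n" + path)
                .getBytes(StandardCharsets.UTF_8));

        // Hex-encode the signature and append it as query parameters:
        StringBuilder sig = new StringBuilder();
        for (byte b : raw) {
            sig.append(String.format("%02x", b));
        }
        System.out.println("https://swift.example.org" + path
                + "?temp_url_sig=" + sig + "&temp_url_expires=" + expires);
    }
}
```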

@landreev My apologies! I got it wrong. I did introduce swiftPseudoFolderPathSeparator, but it was necessary, as a '/' was already present in the storageidentifier.

From my description of the issue:

Manually changing the dvObject storageidentifier from e.g. swift://endpoint1:doi-10-1234-DM2/SZJXID:1673798809a-21f68803badb to swift://endpoint1:doi-10-1234-DM2:SZJXID/1673798809a-21f68803badb led to a successful run of deleteAllAuxObjects().

I am unsure why there was a slash in doi-10-1234-DM2/SZJXID, but the cleaner solution was doi-10-1234/DM2/SZJXID. Looking at it now, doi-10-1234-DM2/SZJXID resembles the doi-10-1234-DM2-SZJXID that you had before. So the culprit might be https://github.com/IQSS/dataverse/commit/b0f75fdddd598b9d7eb8f32a51ba9c9422b639d9, as it changes owner.getIdentifier() to owner.getIdentifierForFileStorage() (the latter of which contains the DM2/SZJXID from our example). I am unfamiliar with those two functions and, at the time I wrote the bug fix, was unaware of the container/pseudo-folder approach that Dataverse was taking.

@landreev I found this in my notes:

swift.properties was misconfigured. Swift server was set to create temporary URLs. This functionality required Dataverse to have a hash key. Hash key was added to swift.properties, along with a timeout setting in domain.xml.

I will get the storageidentifier examples tomorrow!

@carlosmcgregor Thank you! This makes sense. The slash must have "suddenly" appeared between "DM2" and "SZJXID" when the DOI "shoulder" (the "DM2" part) was moved from being considered part of the namespace to being part of the dataset identifier. And because of that it escaped being escaped (sorry, you know what I'm trying to say), since the code was replacing slashes with dashes in the namespace only...
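
In other words - and this is just an illustration, not the actual getIdentifier()/getIdentifierForFileStorage() code:

```java
public class ShoulderMoveSketch {
    public static void main(String[] args) {
        String authority = "10.1234", shoulder = "DM2", identifier = "SZJXID";

        // Before: the shoulder counted as part of the namespace, and the
        // namespace had all of its ":", "." and "/" replaced with dashes:
        String oldNamespace = "doi:" + authority + "/" + shoulder;
        String oldFolder = oldNamespace.replace(":", "-").replace(".", "-").replace("/", "-")
                + "-" + identifier;
        System.out.println(oldFolder); // doi-10-1234-DM2-SZJXID

        // After: the shoulder moved into the dataset identifier, which was NOT
        // slash-escaped, so a raw "/" leaked into the storageidentifier:
        String newNamespace = "doi:" + authority;
        String newIdentifier = shoulder + "/" + identifier;
        String newFolder = newNamespace.replace(":", "-").replace(".", "-").replace("/", "-")
                + "-" + newIdentifier;
        System.out.println(newFolder); // doi-10-1234-DM2/SZJXID
    }
}
```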

@carlosmcgregor OK, on our end, the download issue has miraculously fixed itself overnight. This is

  • good news
  • also weird, because nothing was done on the Dataverse side. Something on the side of the cloud that we're using?

Sorry again for accusing you of breaking things. But I'm glad I got in touch with you, because now I know that you, and maybe other swift users, don't want datasets to live in separate containers. But I keep thinking there must have been some reason the structure was originally flattened like that; i.e., there was likely a reason each dataset was given its own container for the purposes of that early collaboration (to be able to grant permissions on individual datasets outside of Dataverse?). That collaboration is currently dormant, but I'm wondering if there's a chance it becomes an issue again.
I'm inclined to leave it as is for now. But make a note that we'll want to make it configurable in the future, if such a need arises.

@landreev I am glad things ended up working on your end (a bit of _Deus ex machina_ never hurts)!

No worries about the accusation :) I would have been concerned too if the landscape I worked on changed so drastically - and in a rather sideways move - between a couple of releases. I appreciate you looking into what made the "/" seep its way into storageidentifier.

I fully understand now your rationale for wanting to have each dataset as a separate container. Using an Access Control List would allow for file collaboration outside Dataverse. That's a rather nifty idea, for sure!

~~Please let me know if you would like me to fix the issue with the "/" that "escaped being escaped", so that we can go back to container-level datasets. On our end, we are experimenting with the Swift driver, so no production data would be affected. If this were to become an issue down the line, we might have to write a script that rearranges the Postgres entries _and_ the Swift paths. That sounds rather tortuous, to be honest, particularly because of the latter.~~

I forgot to consider other members of the community who might be using Swift with the current version. Please let me know if there is any interest in the community in reverting to the container-level dataset format!

@carlosmcgregor We don't want to make any changes to this right away. As for

I fully understand now your rationale for wanting to have each dataset as a separate container ...

to be precise, that's our best guess/recollection of what the rationale was; and to the best of everybody's knowledge here, no current users are expecting the containers to be created like this.

So we'll mention it in the "potential improvements" issue; as something to add in the future, if the need arises.

And it looks like it could be achieved by just replacing the hard-coded "pseudoFolderSeparator" (and maybe renaming it "secondarySeparator" - or something like that) with a configurable JVM option.
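
Something along these lines, perhaps - a minimal sketch; the JVM option name "dataverse.files.swift-folder-separator" is hypothetical, not an existing Dataverse setting:

```java
public class SeparatorConfigSketch {
    private static final String DEFAULT_SECONDARY_SEPARATOR = "/";

    // Falls back to the current hard-coded "/" when the option isn't set;
    // setting it to "-" would restore the old flattened, one-container-
    // per-dataset layout.
    static String getSecondarySeparator() {
        String configured = System.getProperty("dataverse.files.swift-folder-separator");
        return (configured != null && !configured.isEmpty())
                ? configured : DEFAULT_SECONDARY_SEPARATOR;
    }

    public static void main(String[] args) {
        System.out.println(getSecondarySeparator());
    }
}
```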
