Arctos: PLEASE READ: TACC Uploads

Created on 4 Jun 2021 · 19 Comments · Source: ArctosDB/arctos

Summary

Please let me know ASAP if you've bulk loaded files to TACC and haven't found them on https://web.corral.tacc.utexas.edu.

This only pertains to the SCP upload process (https://handbook.arctosdb.org/how_to/How-to-Upload-Media-to-TACC.html#large-batch-uploads). Uploads made through the Arctos UI are not involved in this.

Gritty Details

The webserver at TACC is being rebuilt, and I've inherited the legacy "processing" data which should have been uploaded. I've pulled a directory listing and attempted to identify and remove things that I can confirm as complete or irrelevant. I think the remainder mostly involves files which have been uploaded to TACC and processed to the web, but which have not been used as Media.

I have attached the list of objects that I can't confirm as Media, and which look like they might have been intended to be.

Anything that has not yet been made available on Corral must be identified so the upload process can be completed. The attached list is not necessarily complete, just what I think is most likely to be problematic. Please let me know ASAP if you've bulk loaded files to TACC and cannot find them on https://web.corral.tacc.utexas.edu, whether they're included here or not.

I don't quite know how to proceed from here. We need to get this cleaned up if I'm to have any hope of knowing what's where, I'd rather not unnecessarily pay for storage, and I definitely don't want to delete anything that's not been processed onto the web for some reason. I don't have a firm timeline, but I think it's safe to assume that anything which doesn't get identified and moved on to the next stage will be lost at some point.

Please reach out to anyone who might have data involved in this.

Going Forward

We should consider policies for uploads to the shared Arctos directory which are not used in Media; I don't think it makes sense to pay for that storage, but I'm not sure what the options might be. Perhaps there's less of that than I currently suspect, but I stumbled over apparently-orphaned files several times in this process.

The "guidelines" at https://handbook.arctosdb.org/documentation/media.html#media-uri should be solidified; there are files which contain spaces and ? and such that cause problems with scripting. I strongly suggest that the first update to the documentation be "folders which contain files which contain characters other than A-Z, a-z, 0-9, _, and - are not eligible for scripting" and that this be treated as policy.
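
A minimal sketch of how the character rule proposed above could be checked against a directory listing before any scripting is attempted. The function names are hypothetical, the second sample path is made up, and allowing a `.` (so that extensions such as `.CR2` pass) is an assumption that the proposal itself does not state.

```python
import re
from collections import defaultdict

# Proposed policy characters: letters, digits, "_" and "-".
# Assumption: "." is also tolerated so that file extensions pass the check.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_script_safe(filename: str) -> bool:
    """Return True when every character in the file name is allowed."""
    return ALLOWED_NAME.fullmatch(filename) is not None


def ineligible_folders(paths):
    """Map each folder to the offending file names it contains.

    A folder is listed if at least one of its files breaks the rule,
    following the "folders which contain files..." wording of the proposal.
    """
    bad = defaultdict(list)
    for path in paths:
        folder, _, name = path.rpartition("/")
        if not is_script_safe(name):
            bad[folder].append(name)
    return dict(bad)


if __name__ == "__main__":
    listing = [
        "utep/Herb_Aus_Fam_AandB/UTEP_Herb_41049.CR2",
        "example/folder/bad name?.jpg",  # hypothetical path for illustration
    ]
    for folder, names in ineligible_folders(listing).items():
        print(folder, names)
```

Reporting whole folders rather than single files matches the suggested policy: one bad file name makes the containing folder ineligible until it is renamed.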

Upload guidelines will likely need to be revised; policies or procedures may need to be developed or clarified (so we can avoid ever doing this again).


Help!

temp_media_proc_weird.csv.zip (https://github.com/ArctosDB/arctos/files/6594774/temp_media_proc_weird.csv.zip)

Administrative Function-Media Help wanted Priority-Critical Service-related

All 19 comments

Hi Dusty,

I took a look at the .csv file and it looks like the name of the collection
is the first part of the file name? Do I have that right?

cheers,
Beth

--
Elizabeth Wommack, PhD
Curator and Collections Manager of Vertebrates
University of Wyoming Museum of Vertebrates
Berry Biodiversity Conservation Center
University of Wyoming,
Laramie, WY 82071
UWYMV Collection Use Policy
http://www.uwymv.org/index.php/download_file/view/43/143/

No, ish. I think CHAS uses cas, early stuff may be in uam, etc. - there's some correlation, but the paths are fundamentally whatever someone felt like creating mixed in with a little bureaucracy.

Booo - I thought that might be a quick way to find things, or to at least try and figure out how to ping people with this help request.

@ccicero - I saw a number of MVZ birds or eggs in there.
@cjconroy - I saw some MVZ mammals in there.

Those UTEP ones are me and I'll get them by tomorrow.

Except when I try https://web.corral.tacc.utexas.edu/utep/Herb_Aus_Fam_AandB/UTEP_Herb_41049.CR2 I get 404. What am I missing?

Also, it seems like these are duplicates or something? https://arctos.database.museum/guid/UTEP:Herb:82809 has the media attached.

What am I missing?

Probably the date in some form in the URL, but the real answer is whatever you worked out with Chris when he loaded them; there's no rhyme or reason to the final URL, that's why I have questions.

these are duplicates or something

If they are - and they all should be - then we're done and they can get deleted. (And they're the originals - the copy should be on the webserver - but same thing.)
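
A minimal sketch of how "is it already on the webserver?" could be checked in bulk, using only the Python standard library. Because the final URL follows no fixed pattern, the caller has to supply the candidate URLs; nothing here derives them from the upload path. The function names are hypothetical, and a 200 response only shows that something is served at that URL, not that it matches the original file.

```python
import urllib.error
import urllib.request


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 when nothing is served at that URL


def split_by_availability(urls, status_of=http_status):
    """Split urls into (found, missing) lists based on a 200 response.

    status_of can be swapped out, e.g. for a cached lookup or for testing.
    """
    found, missing = [], []
    for url in urls:
        (found if status_of(url) == 200 else missing).append(url)
    return found, missing
```

Anything that ends up in the `missing` list would still need a human to confirm whether the URL guess was wrong or the file was never processed. Network failures other than HTTP error responses are deliberately left to raise.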

I started looking through this as I suspected MVZ has a fair amount.
cas (CHAS) 7686
mvz 2674
nhmu 37644
oldNotUsed20180228.zip 2
reportTemplates20180828.zip 2
uam 11108
ucm 146
utep 156
uwbm 2
I added this to this week's agenda to make sure the relevant collections
can review.
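
For anyone who wants to redo a tally like the one above, here is a rough sketch that counts listing rows by their first path component. It assumes the CSV has a column holding the file path; the column name `path` and the sample rows are hypothetical, since the layout of the attached file is not described in this thread.

```python
import csv
import io
from collections import Counter


def count_by_top_level(rows, path_column="path"):
    """Count rows by the first component of the path in path_column."""
    counts = Counter()
    for row in rows:
        path = row[path_column].strip().lstrip("/")
        if path:  # skip rows with an empty path
            counts[path.split("/", 1)[0]] += 1
    return counts


# Hypothetical in-memory sample; a real run would pass csv.DictReader
# an open handle on the unzipped listing instead.
sample = io.StringIO("path\nmvz/birds/a.jpg\nmvz/eggs/b.jpg\nucm/c.jpg\n")
for name, total in sorted(count_by_top_level(csv.DictReader(sample)).items()):
    print(name, total)
```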

I agree we need to have a plan here for the entire media process and SOPs for various steps. Let's start with talking to TACC/CJ once their hardware work is complete, as he indicated we can have a few Arctos managers, and then we can build out needs/specifications.
I suspect a large bottleneck is linking the media record to the object on TACC's web archive. I can think of a bunch of ways to make it better, but the main function is making the linking of the object with the media record as automated as possible-- not bulkloading csv's -- that's the bottleneck!

For now, I will check on MVZ's -- thanks!

linking of the object with the media record as automated as possible-- not bulkloading csv's -- that's the bottleneck!

Agree. The two-step process is cumbersome and it is very easy to lose track of what has been related and what hasn't. I have no ideas on how to make it easier - but I'll sleep on it for a while.
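
One low-tech way to keep track of what has and hasn't been related is a set difference between the URLs of processed files and the URIs already used in Media. This is only a sketch under assumptions: both inputs are hypothetical exports (one URL or URI per entry), and a file counts as linked only when the strings match exactly.

```python
def unlinked_files(uploaded_urls, media_uris):
    """Return uploaded URLs that no Media record points to, sorted.

    Matching is by exact string after trimming whitespace, so any
    difference in case or encoding will show up as "unlinked".
    """
    linked = {uri.strip() for uri in media_uris}
    uploaded = {url.strip() for url in uploaded_urls}
    return sorted(uploaded - linked)
```

The result is the list of files that still need a Media record (or that can be questioned as orphans), which is the part that is easy to lose track of in the two-step process.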

start with talking to TACC/CJ o

The current plan is that I'll "own" the webserver - not sure what else TACC could do, and I should be able to give access to anyone else from that.

plan

Yes, this isn't going to work without some kind of procedure.

bulkloading csv

I'm certainly up for better ideas, but I can't think of any even theoretically-viable alternatives.

Uh oh. UCM media listed in the file are not duplicates and are not linked in Arctos, and I can confirm that I bulkloaded the relationships recently.

I know that there are a bunch of MVZ bird/egg images still pending on Corral. I just haven't had time to link them, and don't recall if they've been ingested to the web server. One issue is that when files get ingested, they remain on Corral and need to be manually moved/deleted. It would be great if that could happen automatically.

I won't be able to look at these for a while, as I'm on vacation until the 22nd.

Yes, I see a couple folders under UWBM that were uploaded with the intention to create media. Those folders, and all the contents I expect, are still visible using the link you sent. We postponed the creation of media once that bulkloading CSV tool broke last April (I haven't picked it up again to try to figure it out).

@ebraker do you have an example? (Sounds like maybe that's more the media bulkloader than the upload, or somewhere in the middle, or ??)

I've got an archive marked to save, I can (after Chris waves his magic wand around a bit more) just move the mvz (@ccicero) and ucm (unless Chris finds some time and I hear back from you @ebraker) into that and we can deal with this later - but please remind me, I don't want to re-create the current situation.

@jebrad if your stuff is available from somewhere on https://web.corral.tacc.utexas.edu then it's safe for this, and someone can probably help you with whatever you've already loaded (in a different Issue please - I think there's one somewhere....).

Does anyone have a contact at nhmu? They've got 1.4TB of data that I can't find elsewhere.

Does anyone have a contact at nhmu?

Do they even have collections in Arctos? If so, what are they?

I think it's Utah, but I'm just looking at folder names - it could be something that doesn't use any of those letters anywhere else.....

@kderieg322079 is at Utah.

Yes, we have collections in Arctos; collection code is UMNH. The 1.4TB of data is field note scans and our digitization manager, Alyson Wilkins @awilkins007, knows the specifics. I've asked her to join this group so she can jump in and provide more info.

Update: we have all of that 1.4TB of data in our DAMS, so it can be cleaned out. I believe @awilkins007 intends to clear it, but I'm not sure. Either way, we have duplicates of all of it elsewhere. Thanks for bringing it to my attention.

Yes, I intended to clear it out earlier but got pulled onto other projects by other staff. We have all that other media stored and managed elsewhere so all that data can be cleared out.
