Documentation: Do we need the "Externally Referenced Content" type in CLAW/Fedora 4?

Created on 22 Mar 2017 · 13 Comments · Source: Islandora/documentation

Currently the Fedora 4 reference implementation and the Fedora API [1] [2] only cover one way to handle binary data that is not stored inside Fedora.

To add to the confusion, the _External Content_ in Fedora 4 is akin to the _Redirect Referenced Content_ in Fedora 3. This means that if your binary is not stored inside Fedora, a request for it returns a 3XX response that redirects you to the alternate location.

Below is from the Fedora 3 Digital Object Model wiki page.

  • Externally Referenced Content - the content is stored outside the repository and the digital object XML maintains a URL that can be dereferenced by the repository to retrieve the content from a remote location. While the datastream content is stored outside of the Fedora repository, at runtime, when an access request for this type of datastream is made, the Fedora repository will use this URL to get the content from its remote location, and the Fedora repository will mediate access to the content. This means that behind the scenes, Fedora will grab the content and stream it out to the client requesting the content as if it were served up directly by Fedora. This is a good way to create digital objects that point to distributed content, but still have the repository in charge of serving it up.
  • Redirect Referenced Content - the content is stored outside the repository and the digital object XML maintains a URL that is used to redirect the client when an access request is made. The content is not streamed through the repository. This is beneficial when you want a digital object to have a Datastream that is stored and served by some external service, and you want the repository to get out of the way when it comes time to serve the content up. A good example is when you want a Datastream to be content that is stored and served by a streaming media server. In such a case, you would want to pass control to the media server to actually stream the content to a client (e.g., video streaming), rather than have Fedora in the middle re-streaming the content out.

Remember Fedora 4's External Content is like the "Redirect Referenced Content" type above. This means that Fedora is not involved in retrieving or presenting that content.
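To make the difference between the two types concrete, here is a minimal sketch of how a repository could answer a GET in each case. It is not Fedora's implementation: `serve_external`, its `mode` values, the `fetch` callable and the choice of 307 as the 3XX status are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple

# (status code, response headers, response body)
Response = Tuple[int, Dict[str, str], bytes]


def serve_external(mode: str, url: str, fetch: Callable[[str], bytes]) -> Response:
    """Answer a GET for a binary that lives at `url`, outside the repository.

    `fetch` is a hypothetical callable that retrieves the remote bytes.
    """
    if mode == "redirect":
        # Redirect Referenced Content: the repository steps aside and the
        # client follows the Location header to the remote server.
        return 307, {"Location": url}, b""
    if mode == "proxy":
        # Externally Referenced Content: the repository retrieves the bytes
        # itself and serves them as if they were stored locally.
        body = fetch(url)
        return 200, {"Content-Length": str(len(body))}, body
    raise ValueError(f"unknown mode: {mode!r}")
```

In the `proxy` branch the client only ever talks to the repository, which is why authorization and response headers can stay under the repository's control there.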

The Islandora community (specifically me) has a need to keep some or all binaries outside Fedora, to provide some flexibility.

However, this raises some questions, such as how we handle authorization on calls to the end resource.

Does the Islandora community have a need for the "Externally Referenced Content" type?

Reasons expressed for this are:

  1. Handling Authorization
  2. Keeping HTTP responses consistent
  3. ... add your use case below.
Labels: fcrepo, use case

Continuation of reasons expressed:
...

  1. Lifecycle management of binary descriptions relative to binary resource
  2. Fixity (with caution)

Yes, it seems like a path to CDN support.

Here's a relevant conversation I had with Wim Leers, lead developer of Drupal's CDN module, concerning Fedora Commons and CDN:
https://www.drupal.org/node/2758739

In addition to the video streaming use case, which is crucial because pseudo-streaming isn't robust enough, it is highly desirable to serve up practically every datastream via CDN. The result will be a higher performance experience for public patrons, and great savings on the Internet bandwidth bill.

Our main reason for using "Externally Referenced Content" exclusively for all primary datastreams (OBJs) in a repository with about 100TB of data is to have more control over the directory structure, which is important to us for defining rules for an HSM storage system and because we are providing local read access to the same resources via Samba shares. We do however still want to provide access (with access control) directly via Islandora as well, so the "Redirect Referenced Content" type does not suit our use case.

Our use case at METRO is similar. We want to be able (we are moving to that now) to have our own long-term storage strategy, which involves:

  • Multiple Terabytes and growing
  • Our own selective checksumming strategy
  • Our own selective backup strategies
  • Our own workflow (move infrequently accessed binaries to Glacier or similar cheap storage)
  • Heterogeneous storage solutions (local, remote, cloud) and even file naming

All this because we have been developing a rule system that lets us build a backend storage strategy based on size limits, content models, datastream names and even stub datastream creation. It makes our digital preservation "inclined" initiatives easier and more sustainable (for Islandora/Fedora 3.8) and lets us escape some limiting aspects of Akubra. We would like to reuse these assets when moving to a repository based on the Fedora 4 API specification.

And all that under heterogeneous storage providers/technologies.
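A rule system of that kind could be sketched roughly as follows. This is not METRO's code: `StorageRule`, `pick_backend`, the backend names and the example rules are all hypothetical, and the sketch only covers the size, content model and datastream name criteria mentioned above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class StorageRule:
    """One rule: every criterion that is set must match for the rule to apply."""
    backend: str
    content_model: Optional[str] = None  # None means "any content model"
    dsid: Optional[str] = None           # None means "any datastream name"
    min_size: int = 0                    # minimum size in bytes

    def matches(self, content_model: str, dsid: str, size: int) -> bool:
        return (
            (self.content_model is None or self.content_model == content_model)
            and (self.dsid is None or self.dsid == dsid)
            and size >= self.min_size
        )


def pick_backend(rules: Sequence[StorageRule], content_model: str, dsid: str,
                 size: int, default: str = "local") -> str:
    """Return the backend of the first matching rule, or `default`."""
    for rule in rules:
        if rule.matches(content_model, dsid, size):
            return rule.backend
    return default


# Hypothetical rules: large OBJ datastreams go to an archive backend,
# anything belonging to a video content model goes to an object store.
EXAMPLE_RULES = [
    StorageRule(backend="archive", dsid="OBJ", min_size=1_000_000_000),
    StorageRule(backend="object-store", content_model="video"),
]
```

Rules are evaluated in order, so the more specific ones belong at the top of the list.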

In summary: we would like to have a common, consistent (whatever that means) REST/API experience when hitting a non-RDF source asset, wherever it is stored. Sadly, the redirect approach is not consistent (headers, etc.) and not flexible enough (no control over some server-managed properties and no WebACL) for us.

The LYRASIS use case is very similar to what has been posted here already, but we would like to be able to use tiered storage so that:

  • Infrequently used preservation masters end up in glacier or s3 infrequent access
  • Infrequently used datastream content ends up in some middle tier (s3 or similar)
  • Frequently accessed datastreams end up in a fast tier of storage (or CDN etc.)

I see _Externally Referenced Content_ as a method to achieve this while still being able to keep data in Fedora.
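As a rough sketch of the three tiers listed above, assuming "infrequently used" means "not accessed for some number of days" (the 90-day threshold, the tier names and `pick_tier` itself are hypothetical):

```python
from datetime import datetime, timedelta


def pick_tier(is_preservation_master: bool, last_access: datetime, now: datetime,
              infrequent_after: timedelta = timedelta(days=90)) -> str:
    """Choose a storage tier from the resource's role and how long it has been idle."""
    if now - last_access >= infrequent_after:
        # Infrequently used: masters go to the cheapest tier (e.g. Glacier or
        # S3 infrequent access), other datastreams to a middle tier.
        return "archive" if is_preservation_master else "middle"
    # Frequently accessed content stays on fast storage (or a CDN).
    return "fast"
```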

As of https://github.com/Islandora-CLAW/Crayfish/commit/0663c54a43b3c97dbf1aa1492bbe184025de6cdc we are using fcrepo 5's external content capabilities to provide redirects to binary resources stored in the Drupal file system or elsewhere (AWS, DigitalOcean, Rackspace, etc.) using Flysystem.
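For readers unfamiliar with that capability: as far as I understand the Fedora API specification, external content is declared with a `Link` header on the request that creates the binary. The sketch below only builds that header value; the `rel` URI, the `handling` values and the `type` parameter reflect my reading of the spec and should be checked against the version you run, and `external_content_link` is a hypothetical helper, not part of Crayfish or Fedora.

```python
# rel value for external content, per my reading of the Fedora API specification
EXTERNAL_CONTENT_REL = "http://fedora.info/definitions/fcrepo#ExternalContent"
# handling modes I believe the specification defines (assumption)
HANDLING_MODES = ("redirect", "proxy", "copy")


def external_content_link(url: str, mime_type: str, handling: str = "redirect") -> str:
    """Build a Link header value that points a binary at externally stored content."""
    if handling not in HANDLING_MODES:
        raise ValueError(f"unsupported handling: {handling!r}")
    return (
        f'<{url}>; rel="{EXTERNAL_CONTENT_REL}"; '
        f'handling="{handling}"; type="{mime_type}"'
    )
```

The resulting string would be sent as the `Link` header of the PUT or POST that creates the binary resource.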

@dannylamb does documentation on how to use this exist?

@mjordan In Islandora or in Fedora? Because I think this is just used as part of the Flysystem code now.

In Islandora. I haven't spun up a new VM in a while so sorry for not looking first, but for a given object, if I wanted to point to externally hosted content, how would I do that in the node/media edit GUI?

Oh! @dannylamb is just using the externally referenced content to reference the stuff in Drupal from Fedora. This is not to have Drupal look elsewhere for its content.

So you would need to be able to access your content using Flysystem and then it would work.

OK, sorry, I misunderstood. So this is not an end-user feature, is that correct? In other words, as someone creating an object, I can't point to an external URL for any of the media.

I don't think so. You could set up a Flysystem adapter to somewhere and put content there, but I'm not sure if Drupal allows you to reference external content.

OK, thanks for the explanation.
