Google-cloud-go: storage.Reader should implement io.ReadSeeker rather than loading everything immediatly

Created on 28 Aug 2018  路  5Comments  路  Source: googleapis/google-cloud-go

On the storage package, building a *storage.Reader means start loading the file in memory (storage/reader.go:130).

I was looking for binding a *storage.Reader to the http.ServeContent method to optimize seeking for file parts thanks to the If-Range requests.

This means storage.Reader should implement the io.ReadSeeker interface, and delegate the file loading to the Read() method. A Seek() method should also be implemented.

I know the If-Range requests can be handled as-is with the storage.NewRangeReader() function, but I think it would be prettier to have io.ReadSeeker implemented.

I'll try submitting a PR for this 馃檪

storage feature request

All 5 comments

Hi @bLuka - if you intend on submitting a code change, please do so via gerrit. We unfortunately don't have github-gerrit two-way support yet :( A description of how to do so can be found at our CONTRIBUTING.md - let me know if you have any questions about that though!

I'm sorry, I'm just getting around to looking at the motivation for the CL. I'm not sure modifying the storage client is the right way to do this.

First, it involves significant changes to storage.Reader, as the CL shows. The current state of the CL is just the start. It would need unit and integration tests, and then there's the issue of delaying the request until the first read, which will further complicate the reader (and change its semantics a bit, since NewRangeReader will not return an error if there's a problem).

Second, I think providing Seek is misleading, since you're not really seeking through the stream, you're making a new request.

Third, it won't be a great fit: io.ReadSeeker can't handle all the information in the ranged request. For example, if the request asks for 10 bytes at offset 17, http.ServeContent will seek to 17. Your implementation will then issue a ranged request for the _entire_ content at offset 17, which could result in unnecessary traffic (and expense).

Fourth, I think you could easily implement io.ReadSeeker on top of the storage client. Here's a sketch:
```
type storageReadSeeker struct {
ctx context.Context
obj *storage.ObjectHandle
r *storage.Reader
offset int64 // initial offset
pos int64 // current position (like 'seen' in storage.Reader)
}

func (s *storageReadSeeker) Read(buf []byte) (int, error) {
if s.r == nil {
// Note: the -1 for length is necessary because we don't have all the information
// from the request that http.ServeContent received.
s.r, err = s.obj.NewRangeReader(s.ctx, s.offset, -1)
// check err
}
n, err := s.r.Read(buf)
// set pos
// check err
}

func (s *storageReadSeeker) Seek(offset int64, whence int) (int64, error) {
var newOffset int64
switch whence {
//...
}
s.r.Close()
s.r = nil
s.offset = newOffset
// Next call to Read will re-open at the right place.
// ...
}
```

Thanks for your answer!

I understand your highlights, and I need further knowledge on HTTP's range requests and Go architecture. I'd be glad if I could ask you some questions?

  1. I had the same assumption at first (when I opened this issue), thinking all the file would be downloaded right after the HTTP request.
    However that would need a large internal (at kernel level?) buffer if you don't consume it right now, and I remembered a write syscall may block on a socket until a read through it.

    So I'm wondering wether an HTTP body would be consumed if nobody is reading it or not?
    I did my whole CL's design after supposing that was not the case, that may had be a mistake 馃槃

  2. I'm often going into the code I'm using to make it include everything I need to use it as smoothly as I can. By doing so, I'm creating huge modules I'm using with the fewest lines of code.

    So, if I understand your advice correctly, a better way here would be to wrap the storage module in an io.ReadSeeker structure I define on the top of the module?
    What benefices it makes from doing so?
    Should I make it in another module I'll export later, or directly in the one where I need it?

  1. I didn't mean to say that the whole file would be downloaded. You're right that an HTTP Body won't be consumed if no one reads it. But we can assume that the data is buffered, so even if you only want the first ten bytes, perhaps 8K or 64K are delivered. That's still a lot more bandwidth and money than necessary.

  2. The advantage to writing code that uses the client (or any package) rather than adding the code to the package is that the package stays small, with only the features that most people need. For the clients in this repo, we generally try to make them fairly light wrappers around the underlying API. We often add functionality to expose new parts of the API, but we don't generally add convenience functions, especially if they can be implemented on top of the client.

I think the io.ReadSeeker wrapper could certainly be its own package, but if you only need it in one place, you can just put it in that package. You can always break it out to its own package later if you find another use for it.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

3cham picture 3cham  路  3Comments

rileykarson picture rileykarson  路  4Comments

dragan-cikic-shortcut picture dragan-cikic-shortcut  路  3Comments

junghoahnsc picture junghoahnsc  路  4Comments

philippgille picture philippgille  路  3Comments