Etcher: Improve the "validation failed" error message wording

Created on 30 Sep 2016 · 46 comments · Source: balena-io/etcher

The current validation message seems to imply to users either that the write failed, or that Etcher damaged the card. We need to make clearer what it is that we are trying to convey. Perhaps a better message would be something like:

"The write has been completed successfully but Etcher detected potential corruption issues in reading the image back from the drive. It's possible that the drive is corrupted, please consider writing the image to a different one."

Labels: gui, all

All 46 comments

It would also be helpful to provide more detail on the verification error. For example, what failed -- was it a comparison between expected and actual data that failed, or an inconsistent result over multiple reads? And where was it on the drive? It would be nice if the raw sector information were reported at least, but if you were able to load the drive image further to get a partition or even specific file that would be great.

I've had verification issues after flashing drives on multiple occasions, but without extra information there's not much that can be done.

It would also be helpful to provide more detail on the verification error. For example, what failed -- was it a comparison between expected and actual data that failed, or an inconsistent result over multiple reads? And where was it on the drive? It would be nice if the raw sector information were reported at least, but if you were able to load the drive image further to get a partition or even specific file that would be great.

Sadly, validation is currently implemented by just comparing checksums, so more detailed information about what exactly failed is not possible at the moment, but definitely something to consider.

I'd love to have sector specific information, and also a percentage of the data that failed, since 50% failure is very different from 0.01% failure.

@jviotti, I guess if we really got hardcore about it we could save more checksums for smaller parts of the image and compare piece by piece, giving us more detail on exactly what failed, right? And I guess if the chunk is small enough and we have the image available (or does this conflict with archives?) we could compare byte by byte and find the precise point. But yeah, this is for later.

Or maybe @petrosagg has some crazy idea about an efficient encoding that would automatically tell us where the failure was :P

Sadly, validation is currently implemented by just comparing checksums, so more detailed information about what exactly failed is not possible at the moment, but definitely something to consider.

Ah, I was scared of that. If you _were_ able to do it on a sector-by-sector basis, it would also let you show a "failed" message much earlier in the process because you wouldn't necessarily need to wait for the whole drive to be scanned. Obviously you would still want to verify the whole thing to provide more information, but you could at least tell the user what happened and let them abort more quickly.

I'd love to have sector specific information, and also a percentage of the data that failed, since 50% failure is very different from 0.01% failure.

Yeah, that would be great. In general I have just continued using a card even if validation failed for it, because if the device doesn't operate quite right I just re-flash it :laughing: I see people complain about Etcher's validation errors fairly often, and honestly I haven't had the time or patience to look into whether it's an issue with Etcher or an issue with the image/card/reader.

Many smaller checksums could work. Alternatively, because of the way checksum calculation over streaming works, we could retrieve the checksum for all the bytes that were consumed so far and save various "checkpoints" we could use for comparison purposes.
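
To illustrate the checkpoint idea, here's a minimal sketch in Python (hashlib is used only because it makes the streaming API obvious; the function name, chunk size and checkpoint interval are made up). The key point is that taking a digest at a checkpoint doesn't reset the hash state:

import hashlib

CHUNK = 1024 * 1024        # hypothetical 1 MiB read size
CHECKPOINT_EVERY = 64      # hypothetical: record a checkpoint every 64 chunks

def checksum_with_checkpoints(image_path):
    """Stream the image once, recording the cumulative digest at intervals."""
    h = hashlib.md5()
    checkpoints = []       # list of (bytes_consumed, hexdigest) pairs
    consumed = 0
    chunks = 0
    with open(image_path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            h.update(chunk)
            consumed += len(chunk)
            chunks += 1
            if chunks % CHECKPOINT_EVERY == 0:
                # hexdigest() does not reset the hash state, so we can keep feeding data
                checkpoints.append((consumed, h.hexdigest()))
    # final entry doubles as the whole-image checksum
    checkpoints.append((consumed, h.hexdigest()))
    return checkpoints

Validation would then run the same loop over the drive and stop at the first checkpoint that disagrees.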

The checkpoints approach sounds much more implementation-friendly and should accomplish the same result, given that validation should always be a linear process as far as I can tell.

Hmm, so if you're storing multiple checksum-checkpoints-along-the-way, does that mean it'd be better to change it from:

  1. Write the whole image
  2. Read the whole image, verifying the "rolling checksum" at various checkpoints

to:

  1. Write a chunk of the image
  2. Read that chunk of the image, calculating the rolling checksum
  3. Write the next chunk of the image
  4. Read that chunk of the image, calculating the rolling checksum

That way if the user had a 'bad' card they'd get notified earlier, rather than having to wait for the whole image to be written first (and I guess you could give them the options of 'Ignore verification error' or 'Abort writing'). Only possible downside is that flipping back and forth between writing and reading _may_ be slower than just writing all in one go. Would need some experimentation...
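Roughly what I mean, as a sketch (Python for brevity; the chunk size is arbitrary, and on a real device you'd need platform-specific tricks to make sure the read-back actually comes from the medium rather than the OS page cache - os.fsync alone may not be enough):

import os

CHUNK = 8 * 1024 * 1024   # hypothetical 8 MiB chunk size

def write_and_verify(image_path, device_path):
    with open(image_path, 'rb') as img, open(device_path, 'r+b', buffering=0) as dev:
        offset = 0
        while True:
            chunk = img.read(CHUNK)
            if not chunk:
                break
            dev.seek(offset)
            dev.write(chunk)
            os.fsync(dev.fileno())         # push the chunk out before reading it back
            dev.seek(offset)
            readback = dev.read(len(chunk))
            if readback != chunk:
                # earliest possible failure report; the UI could offer
                # "Abort" or "Ignore and continue" at this point
                raise IOError('verification failed near byte offset %d' % offset)
            offset += len(chunk)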

I think @jviotti wanted to keep the processes in one piece for performance reasons, though I could imagine that there could exist natural "stopping points" where the switching penalty is small or non-existent.

I also had a quick chat with @petrosagg who wasn't quite convinced the process couldn't be multiplexed.

In terms of UX, I can envision a pretty awesome three-coloured progress bar, with colour 1 meaning "unwritten", colour 2 meaning "written, not validated", and colour 3 meaning "written and validated". In fact this would be an improvement over our current approach whether we multiplex or not.

I'd need to measure this to be able to provide a sane answer. The possibilities are:

  • Retrieve checksums of the current data after X amount of blocks. This may mean that the first data that passed through the stream must be taken into consideration each time we want to calculate a checkpoint, which may slow things down, unless the checksum calculator stream provides good memoization support.
  • Create a custom transform stream that calculates a checksum for every block, as suggested by @lurch, but I worry that calculating per chunk will impose a noticeable overhead. Currently, calculating the checksum is one of the most CPU-intensive tasks in the whole backend, but maybe we can calculate it for every X chunks instead, or find a way to do this asynchronously in web workers or something along those lines.

@jviotti Are you sure about checksums having an overhead? On my system I get these numbers on a single core:

  • CRC: 280MB/s
  • sha256sum: 341MB/s
  • sha512sum: 507MB/s
  • md5sum: 520MB/s
  • sha1sum: 660MB/s

You can't possibly be handling data at a faster rate than this. Are you using node's crypto API that calls out to openssl?
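
For what it's worth, something along these lines is an easy way to reproduce that kind of single-core measurement (Python's hashlib/zlib here, purely for illustration; absolute numbers will obviously vary per machine):

import hashlib
import time
import zlib

data = b'\x00' * (256 * 1024 * 1024)   # 256 MiB of zeros; random data hashes at a similar rate

def throughput(name, fn):
    start = time.time()
    fn(data)
    rate = len(data) / (time.time() - start) / 1e6
    print('%-8s %6.0f MB/s' % (name, rate))

throughput('crc32', zlib.crc32)
for algo in ('md5', 'sha1', 'sha256', 'sha512'):
    throughput(algo, lambda d, a=algo: hashlib.new(a, d).digest())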

Yeah, I'd always thought that calculating file checksums tended to be IO-bound rather than CPU-bound?

One problem I realised with switching between writing and reading after every 'chunk' is that you'd need to insert a 'sync' between writing and reading, to ensure that you read the data back from the disk rather than from the in-memory buffer, and multiple 'sync's could indeed slow down the entire process.
Probably still worth actually measuring though? ;)

And WRT calculating the 'rolling checksum' at various points: there's no need for memoization; you can just keep adding data to the checksum function, and it'll return the checksum of all the data seen so far. See e.g. https://docs.python.org/2/library/hashlib.html where you can keep calling update with additional data, and then simply get the digest at various points. Like this:

Python 2.7.6 (default, Jun 22 2015, 17:58:13) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import hashlib
>>> m = hashlib.md5()
>>> m.update("Nobody inspects")
>>> m.hexdigest()
'3ef729ccf0cc56079ca546d58083dc12'
>>> m.update(" the spammish repetition")
>>> m.hexdigest()
'bb649c83dd1ea5c9d9dec9a18df0ffe9'
>>> m2 = hashlib.md5()
>>> m2.update("Nobody inspects the spammish repetition")
>>> m2.hexdigest()
'bb649c83dd1ea5c9d9dec9a18df0ffe9'

Unfortunately it seems like https://www.npmjs.com/package/md5 doesn't offer the same API.

Why would you need to calculate the "rolling" version to include previous data? You already have checked the stuff up until the current "block", so checking it again is unnecessary. Even if you needed to, you could just prepend the previous hash string to the new data before calculating the checksum.

I guess one advantage of a "rolling" checksum (not sure I've explained it very well) is that you can choose arbitrary-size chunks to verify, and you always have a valid checksum at each point; whereas if you only checksummed individual chunks, you'd be forced to verify each chunk individually.
e.g. if you take a 'rolling checksum' every 1 MB, if you wanted to you could instead only verify the checksums every 10 MB (or for just the whole file), and they'd still be valid. Kinda hard to explain in words...

So, using a code example calculating a 'rolling checksum' with a chunk-size of 4 characters:

>>> m = hashlib.md5()
>>> m.update('1111')
>>> m.hexdigest()
'b59c67bf196a4758191e42f76670ceba'
>>> m.update('2222')
>>> m.hexdigest()
'821f3157e1a3456bfe1a000a1adf0862'
>>> m.update('3333')
>>> m.hexdigest()
'95ebebaa68041c9faae8673241eaae1b'
>>> m.update('4444')
>>> m.hexdigest()
'7b8802b5aa06e55f70c9d8711213364b'
>>> m.update('5555')
>>> m.hexdigest()
'fa06fbd962ca90ba3a574486c17b8d00'
>>> m.update('6666')
>>> m.hexdigest()
'7525aecccfaf9f40d53950df929256e6'

Now when we want to verify the data, we could obviously do it in 1-chunk steps as above, or we could do it in 2-chunk steps:

>>> m2 = hashlib.md5()
>>> m2.update('11112222')
>>> m2.hexdigest()
'821f3157e1a3456bfe1a000a1adf0862'
>>> m2.update('33334444')
>>> m2.hexdigest()
'7b8802b5aa06e55f70c9d8711213364b'
>>> m2.update('55556666')
>>> m2.hexdigest()
'7525aecccfaf9f40d53950df929256e6'

...or 3-chunk steps:

>>> m3 = hashlib.md5()
>>> m3.update('111122223333')
>>> m3.hexdigest()
'95ebebaa68041c9faae8673241eaae1b'
>>> m3.update('444455556666')
>>> m3.hexdigest()
'7525aecccfaf9f40d53950df929256e6'

...or even in a single 6-chunk step:

>>> m6 = hashlib.md5()
>>> m6.update('111122223333444455556666')
>>> m6.hexdigest()
'7525aecccfaf9f40d53950df929256e6'

...and at each point we always get the correct checksums :-)

EDIT: In case it's not obvious, each time we call update it _doesn't_ recompute the hash of all the previous data it'd seen, it only 'adds' the checksum of the new data to the 'rolling total' (which is what we retrieve with hexdigest).

The point makes sense to me, but is the option to decide later useful? I would assume we know what we want to do at the time we're sampling the checksums, no?

Yeah... I still don't understand :laughing: Wouldn't you know the granularity with which you want to verify the drive _before_ you start?

I didn't mean to suggest you _should_ change the granularity with which you verify the drive (unless you wanted to automatically vary it depending on the read-speed of the drive, to keep the GUI responsive), I was just giving an example of a 'possible' advantage of 'rolling checksums' as opposed to individually checksumming every chunk. AFAIK there'd be no difference in 'cost' of either rolling _or_ separate checksums, but a (very small) practical advantage of a rolling checksum is that the value of the final rolling checksum will also match the checksum of the whole image, as generated by a standalone tool such as md5sum.

Hmmm, looking back at the earlier messages in the thread, I guess the disadvantage of a rolling checksum is that once it's failed, you can't tell if any subsequent chunks are okay. Whereas with individual checksums, you can tell which chunks are 'good' and which are 'bad'. Apologies for dragging this thread OffTopic ;-)

As far as I've seen, CPU usage goes up a lot when computing the CRC32 checksum, reaching almost 90% on my MacBook Pro when doing multiple writes (like 8 at the same time). We could mitigate this in part by only calculating the checksum once during the writing phase (instead of re-calculating in each child process), but we can't get away with it during the validation phase.

Anyway, it may be the CRC32 implementation we're using (it's a third-party module), but I believe MD5 is much less intensive (and we can use the Node.js crypto module directly), and we're actually looking to replace CRC32 with it (https://github.com/resin-io/etcher/issues/643).

We could mitigate this in part by only calculating the checksum once during the writing phase (instead of re-calculating in each child process)

Doh, I keep forgetting that for disk images that aren't extended archives, Etcher has to calculate the hashes itself, rather than simply reading the list of already-computed checksums from the metadata :blush:

@jviotti why not use sha1sum? It seems to be the fastest. The reason CRC is currently slow is that it's implemented in JS[1] whereas sha1sum is implemented in C.

Regarding multiplexing of reads/writes, we can try using the O_DIRECT and O_SYNC flags when opening the disk. But also, even assuming that there is a cost to switching from reading to writing, you can amortise the overhead at the expense of RAM. So basically you can keep the last 50MB in memory and instead of doing write 1 block, sync, read 1 block you could do write 50MB, sync, read 50MB. The more memory you sacrifice, the fewer syncs you'll do. That said, if we assume the slowness of sync calls is due to the speed of the medium, then it doesn't matter that they're slow, because you'll have to write X amount of data anyway. Either you'll do a lot of small and relatively fast syncs, or one big and slow one at the end.

@lurch what you're proposing regarding "rolling hashes" (which I'll call length extension here, to avoid confusion with actual rolling hashes, which are a bit different) doesn't have a big advantage regarding memory usage, but I think it's the right approach. All the common hash functions (MD5, SHA1, SHA2) are based on the Merkle–Damgård construction. This construction allows length extension by design[2].

So say we have an image with N blocks; memoisation of single-block hashes would require storing the following in memory:

hashes = [
    hash(blocks[0..1]),
    hash(blocks[1..2]),
    ...
    hash(blocks[N-1..N])
]

Calculating an overall hash while keeping checkpoints would require storing the following in memory:

hashes = [
    hash(blocks[0..1]),
    hash(blocks[0..2]),
    ...
    hash(blocks[0..N])
]

So both cases need N * sizeof(hash) bytes of memory, and both cases need the same CPU, because hash(blocks[0..M]) can be rewritten as length_extend(hash(blocks[0..M-1]), blocks[M-1..M]), assuming it's one of the MD5 or SHA families. This means that each step always processes a constant sizeof(block) amount of data.

Practically I think the second approach will also be faster since there won't be a setup cost for each block (no mallocs etc) and it will also end up with an overall hash which could be useful for other purposes.

[1] https://github.com/brianloveswords/buffer-crc32
[2] https://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction#Security_characteristics

Practically I think the second approach will also be faster since there won't be a setup cost for each block (no mallocs etc) and it will also end up with an overall hash which could be useful for other purposes.

I don't follow. Why do you need to store all the hashes in memory? Why not just keep them one at a time? And how does the "length extension" method make a difference?

If you want Etcher to be able to report the specific sectors/blocks which didn't match, you can't do the "rolling hashes" thing: if you do that, all comparisons after one corrupt one will also appear corrupt, because the ones from the live disk would include the hashed data from previous checks.

I don't follow. Why do you need to store all the hashes in memory? Why not just keep them one at a time?

If you do 2 passes, one that writes the whole image and a second that validates the image, then the second needs to have access to all the checkpoints so that it can accurately pinpoint where the error occurred. This is true for both methods.

I also realised that if we don't do 2 full passes but instead we do either write 1 block, sync, read 1 block or write 50MB, sync, read 50MB, there is no point messing with hashes at all. We can just keep a window of data in memory and do a byte-by-byte comparison.
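
A sketch of what that byte-by-byte comparison could report, assuming we keep the just-written window in memory and read the same region back after the sync (the 512-byte sector size and helper name are assumptions, for reporting purposes only):

SECTOR = 512  # assumed sector size, for reporting only

def first_bad_sector(expected, actual, base_offset=0):
    """Compare a written window against the read-back and return the absolute
    index of the first mismatching sector, or None if the window is fine."""
    if expected == actual:
        return None
    for i in range(0, len(expected), SECTOR):
        if expected[i:i + SECTOR] != actual[i:i + SECTOR]:
            return (base_offset + i) // SECTOR
    return None

So the loop becomes: write window, sync, read the window back, call first_bad_sector(), and bail out (or warn) as soon as it returns something.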

And how does the "length extension" method make a difference?

As mentioned, there is no difference in RAM or CPU usage between the two methods but I expect length extension to be faster because you won't have to initialise/allocate a new Hash object every time.

If you want Etcher to be able to report the specific sectors/blocks which didn't match,

I think the requirement can be relaxed and have etcher report the first corrupt sector and then bail out. If what we're after is full reporting even after a corrupt sector then yes, the length extension method doesn't work.

I'm still not sure I understand, but I could easily just be too tired.

I think the requirement can be relaxed and have etcher report the first corrupt sector and then bail out. If what we're after is full reporting even after a corrupt sector then yes, the length extension method doesn't work.

For me, I think it makes a big difference whether it's only one corrupt sector or whether the whole thing is defective. If it's just a small section, I would probably (and have in the past) ignore it anyway if I don't have a new card handy, because the system will probably run fine for the most part, which is all I may need for testing things. If the whole thing is broken, there isn't any point in continuing without a new card.

My main motivation here is to get more information on these failures. I often have people complaining of Etcher failing at the validation stage, and I would like to know whether it seems more like a validation bug or real corruption (I suspect the former, but I really have no way of knowing). This use case isn't a main target, though. I don't actually care how it is implemented internally as long as it either helps me solve that problem from my end or helps users solve it themselves.

why not use sha1sum? It seems to be the fastest.

I wonder if that's only true on certain CPU architectures? I always thought that MD5 was faster than SHA1, but I think some CPUs added special hardware-cryptography instructions?
(did some searching and found http://stackoverflow.com/questions/2722943/is-calculating-an-md5-hash-less-cpu-intensive-than-sha-family-functions and http://stackoverflow.com/questions/20692386/are-there-in-x86-any-instructions-to-accelerate-sha-sha1-2-256-512-encoding )

That said, if we assume the slowness of sync calls is due to the speed of the medium, then it doesn't matter that they're slow, because you'll have to write X amount of data anyway. Either you'll do a lot of small and relatively fast syncs, or one big and slow one at the end.

True. I wonder if it depends what kind of buffering / write-strategy the internal disk controller uses though (the FTL on SD cards, etc.) ? Again, only way to be sure is via experimentation.

I also realised that if we don't do 2 full passes but instead we do either write 1 block, sync, read 1 block or write 50MB, sync, read 50MB, there is no point messing with hashes at all. We can just keep a window of data in memory and do a byte-by-byte comparison.

Yeah, I guess on modern PCs there's no harm in allocating two 50 MB memory buffers, but in the "olden days" you'd have done "write 50MB in 1MB chunks calculating the cumulative hash as you go, sync, read 50MB in 1MB chunks calculating the cumulative hash as you go" needing only a single 1MB buffer.

I agree that Etcher being able to detect which blocks failed would probably be useful, and it ties in nicely with @alexandrosm's three-coloured progress bar, which we could extend to 4 colours: unwritten, written but not validated, written and validated good, written and validated bad. And if we do do that, I still think the user should be given the option to "Abort" or "Ignore and Continue" when the first error is detected.

I expect length extension to be faster because you won't have to initialise/allocate a new Hash object every time.

I suspect that any differences in hash-speed (assuming you're using a C implementation rather than pure JS) will probably still be dwarfed by the slowness of actually writing or reading data to/from disk?

@jviotti why not use sha1sum?

The problem is that users still want to see a checksum on completion, and md5 is by far the most popular one in the image writing world. This also means that even if we calculate checksums per block, we still have to calculate a checksum for the whole thing as well to present to the user. This can be separated though, so we calculate an MD5 for the whole thing, and use something like sha1sum in a per-block fashion.

I think the requirement can be relaxed and have etcher report the first corrupt sector and then bail out. If what we're after is full reporting even after a corrupt sector then yes, the length extension method doesn't work.

That's no different to the current situation, where we can tell if writing failed or not, but can't point out specific blocks.

If we decide to calculate a sha1sum per block, this means we need to store ~40 bytes per block in an object to be able to do the comparison later on. This is not a big deal, but if we mix the validation and writing into a single step, we can get away with only storing the checksums we immediately need, although in that case it's probably easier to just do a byte-by-byte comparison.

Is there any benefit in having writing and validation as two separate steps? If we can't think of any, switching to a single pass seems like the way to go, since it simplifies a lot of things.

Is there any benefit in having writing and validation as two separate steps? If we can't think of any, switching to a single pass seems like the way to go, since it simplifies a lot of things.

I can't think of much. Maybe if someone doesn't care about validation, they wouldn't want the write operation being slowed down by the validation, but that can be solved with a "perform validation" checkbox in the settings menu.

Yeah, the only issue is some presumed drop in performance, but I also think we should take the hit and do the right thing.

The problem is that users still want to see a checksum on completion, and md5 is by far the most popular one in the image writing world.

As an anecdote, Raspberry Pi downloads use SHA1[1].

Full image checksums aren't very useful for answering the question "Is the image correctly written to disk?". You can do a byte-by-byte comparison or use any of the previous techniques mentioned in this thread. The full image checksum is useful to verify that the original file downloaded was not corrupted. So it doesn't have to be tied to the writing process; you can spawn a thread in the background to do the checksum.

Is there any benefit in having writing and validation as two separate steps? If we can't think of any, switching to a single pass seems like the way to go, since it simplifies a lot of things.

To me two separate steps sound simpler implementation-wise. And the more I think about it the more sense the byte-by-byte approach makes. Why doesn't etcher just re-read the file from disk and compare the sectors byte-by-byte?

[1] https://www.raspberrypi.org/downloads/raspbian/

For one, we're keeping the implementation compatible with a "download straight to the card" future. Which means streaming-type reasoning for most things.

Dare I say we have gone completely off topic? Maybe a new issue is in order?

For one, we're keeping the implementation compatible with a "download straight to the card" future. Which means streaming-type reasoning for most things.

You can always write a temp file with the download (even gzipping it) on disk that will get cleaned up at the end.

Dare I say we have gone completely off topic? Maybe a new issue is in order?

We're definitely not off topic, since the specific implementation of validation is tied to the behaviour of the "validation failed" message. Maybe we can change the title of this issue to better describe the thread, but the messages in the thread have been consistent on the subject.

Well, what I mean is that my original suggestion was to improve the wording.

I _love_ the conversation and insights shared here, but it comes down to a much deeper reconsideration of the whole writing/validating process, and I want to make sure we don't hold off improving the wording until we've retooled the entire algorithm.

It would be nice to avoid needing X GB of temporary space if we can, no?

Sparing some storage for the few minutes of writing (and only in the case of a network stream) could be justified, especially if it is compressed.

There is also an important advantage. If Etcher is writing X number of cards in parallel, during validation it will be handling X * max_read_speed amounts of data. Storing the temp file allows for virtually zero-overhead checking (versus calculating X checksums in parallel), making the IO subsystem the only bottleneck and Etcher a very good CPU citizen.

You can still do the amortising technique though, but one order of magnitude bigger. Write 1GB, sync, validate 1GB.

Replying to a few comments at once...

The problem is that users still want to see a checksum on completion, and md5 is by far the most popular one in the image writing world.

Unless the user is using an extended archive, we have no way of knowing whether they want the checksum displayed as MD5 or SHA1. So if (as I suspect) the hashing time is minor compared to the actual disk reading/writing time, we could probably calculate both at the same time, and display them both to the user at the end? (or maybe have it selectable in the options, defaulting to MD5)
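
Since both hashes see exactly the same stream, computing them in one pass is trivial; a rough sketch (Python, with a hypothetical helper name and chunk size):

import hashlib

def md5_and_sha1(stream, chunk_size=1024 * 1024):
    """One pass over the data, two digests out; the second hash adds little
    compared to the disk I/O feeding it."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
        sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()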

The full image checksum is useful to verify that the original file downloaded was not corrupted.

Yeah, there's obviously a difference between the download (.zip) checksum, and the image (.img) checksum.
Hmmm, perhaps if Etcher is given a 'normal' zipfile (i.e. not an extended archive) containing e.g. some-disk-image.img, it could also look for each variant of some-disk-image.img.md5, some-disk-image.img.md5sum, some-disk-image.md5, some-disk-image.md5sum, some-disk-image.img.sha1, some-disk-image.img.sha1sum, some-disk-image.sha1, some-disk-image.sha1sum also being in the zipfile, and use the contents of that as the image checksum?
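
Something like this, as a rough sketch of the lookup (Python's zipfile used for illustration; the function name and suffix list are just the variants above):

import os.path
import zipfile

SIDECAR_SUFFIXES = ('.md5', '.md5sum', '.sha1', '.sha1sum')

def find_sidecar_checksum(zip_path, image_name):
    """Return the hex digest from e.g. some-disk-image.img.md5 or
    some-disk-image.md5 if such a file exists inside the zip, else None."""
    base, _ = os.path.splitext(image_name)
    candidates = {image_name + s for s in SIDECAR_SUFFIXES} | {base + s for s in SIDECAR_SUFFIXES}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if os.path.basename(name) in candidates:
                # md5sum/sha1sum files put the hex digest in the first field
                return zf.read(name).decode('ascii', 'replace').split()[0]
    return None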

To me two separate steps sound simpler implementation-wise.

Unless (as I pointed out earlier) we do step-by-step checksumming, and allow the user to bail out on the first checksum error, instead of having to wait until the entire image is written.

Sparing some storage for the few minutes of writing (and only in the case of a network stream) could be justified, especially if it is compressed.

IMHO using a temporary cache file should be optional, for people with low diskspace. OTOH, downloading the entire image to a local cache, allows you to checksum the downloaded image (in case of any download errors) before you start writing it. Swings & roundabouts ;-)

So... coming back to this old issue again. I've just done some benchmarking on my laptop (i5-5200U CPU @ 2.20GHz) and (assuming the JS hashing algorithms aren't written _incredibly_ inefficiently), it seems I was right in my assumption that the time to do the actual hash calculations will be vastly outweighed by everything else:

$ dd if=/dev/urandom of=random_1GB_file.img bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 101.898 s, 10.5 MB/s
$ time crc32 random_1GB_file.img 
e5866893

real    0m3.742s
user    0m1.348s
sys 0m0.380s
$ time crc32 random_1GB_file.img 
e5866893

real    0m3.213s
user    0m1.368s
sys 0m0.360s
$ time md5sum random_1GB_file.img 
17f552abc2960e656478add76ef50fc5  random_1GB_file.img

real    0m3.572s
user    0m2.176s
sys 0m0.352s
$ time md5sum random_1GB_file.img 
17f552abc2960e656478add76ef50fc5  random_1GB_file.img

real    0m3.043s
user    0m2.204s
sys 0m0.376s
$ time sha1sum random_1GB_file.img 
8bcf866254b168a606c0b33cec322e402f567a5e  random_1GB_file.img

real    0m6.808s
user    0m3.636s
sys 0m0.244s
$ time sha1sum random_1GB_file.img 
8bcf866254b168a606c0b33cec322e402f567a5e  random_1GB_file.img

real    0m6.135s
user    0m3.416s
sys 0m0.384s
$ time sha256sum random_1GB_file.img 
a6673312b856d36906424d903e849e14ef7c736e3c9e8874e74c4a28e893ad67  random_1GB_file.img

real    0m7.336s
user    0m6.388s
sys 0m0.360s
$ time sha256sum random_1GB_file.img 
a6673312b856d36906424d903e849e14ef7c736e3c9e8874e74c4a28e893ad67  random_1GB_file.img

real    0m7.068s
user    0m6.372s
sys 0m0.332s

I've sent a PR fixing the original concern of this issue (the validation error message). There were some cool ideas mentioned here, however I believe that taking the effort to say to the user how much of the flash failed is not critical and probably not worth it, since as a user, even if it fails 1%, I wouldn't trust the drive enough to use it.

since as a user, even if it fails 1%, I wouldn't trust the drive enough to use it.

Oh? Is it possible that it might be only a 'temporary' failure, and attempting to write the image a second time might work? Or do you think any errors mean the whole card is unusable?

Or do you think any errors mean the whole card is unusable?

It might be fine in most cases, or the error might be too subtle to be noticed (or the file-system is able to cope with it at runtime), however my point is that telling the user exactly which blocks failed, etc. is not very useful from their point of view.

I'd argue that the feature isn't necessarily useful until you are attempting to figure out what _does_ work. Knowing that a particular block failed isn't particularly useful in itself, but if you determine that different images fail on the same block for a given drive, that tells you a lot -- and gives you a logical next step for finding a configuration that does work. Likewise, if it fails at the same point regardless of the drive used, you'd start to suspect the image file and/or Etcher itself.

I wasn't suggesting that if e.g. only 0.5% of a drive failed validation it'd be worth trying to boot the drive anyway. (I was only referring to percentages with my previous comment, not necessarily knowing _which_ blocks failed - which is more of an advanced-user feature)
I was more suggesting that if 0.5% of a drive failed to flash, it might indicate only a temporary writing failure, and so maybe attempting to rewrite the same image to the same drive _may_ result in 100% validation on a second attempt. Whereas if 50% of the drive failed verification, it's more likely that the whole drive should be thrown in the bin.

But of course that's very much a judgement call, and the safest option is to simply recommend trying a different drive, as we do now.

Let's bring @jhermsmeier into the loop (though we should probably open a new issue for this). He's re-architecting our drive flashing engine, so it would be nice for him to keep this in mind, and see if it's feasible to get a percentage of failed blocks somehow (then we can figure out what to do with the data).

@jviotti @lurch @WasabiFan I made a new issue to track this: https://github.com/resin-io/etcher/issues/1074

I continue to get errors on any SD card running Etcher on a Mac. From what I can tell, the search feature in OSX is writing information to the card as well, and I think the CRC checks are detecting those writes and determining that the write failed.

Is there any workaround for this?

@cdeerinck Please create a new issue, describing your problem in detail, and someone from the Etcher team will look into it :)
