Node: Tracking: openSSL with asm on arm64

Created on 26 Oct 2018 · 59 Comments · Source: nodejs/node

  • Version: master
  • Platform: CI arm64
  • Subsystem: test,crypto


https://github.com/nodejs/node/blob/master/test/parallel/test-https-client-get-url.js

Looks nasty; hopefully it is just a flake.

https://ci.nodejs.org/job/node-test-commit-arm/19511/nodes=centos7-arm64-gcc6/testReport/junit/(root)/test/parallel_test_https_client_get_url/

Error Message
fail (1)

Stacktrace
(node:57030) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
events.js:167
      throw er; // Unhandled 'error' event
      ^

Error: 4396790469872:error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac:../deps/openssl/openssl/ssl/record/ssl3_record.c:469:

Emitted 'error' event at:
    at TLSSocket.socketErrorListener (_http_client.js:399:9)
    at TLSSocket.emit (events.js:182:13)
    at TLSSocket._emitTLSError (_tls_wrap.js:600:10)
    at TLSWrap.onerror (_tls_wrap.js:268:11)

/CC @nodejs/testing @nodejs/crypto

crypto openssl tls

All 59 comments

@tniessen @bnoordhuis "bad record mac"? That shouldn't happen, right? (Especially intermittently.)

test.parallel/test-tls-pfx-authorizationerror

```
Error Message
fail (1)
Stacktrace
events.js:167
throw er; // Unhandled 'error' event
^

Error: 281472993497088:error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac:../deps/openssl/openssl/ssl/record/ssl3_record.c:469:

Emitted 'error' event at:
at TLSSocket._emitTLSError (_tls_wrap.js:600:10)
at TLSWrap.onerror (_tls_wrap.js:268:11)
```

test.parallel/test-https-client-checkServerIdentity

Error Message
fail (1)
Stacktrace
/home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/test/parallel/test-https-client-checkServerIdentity.js:71
    throw err;
    ^

Error: 281472838037504:error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac:../deps/openssl/openssl/ssl/record/ssl3_record.c:469:

ping @nodejs/platform-arm @nodejs/crypto

Any recommendation on how to work around this?

If it only happens on arm64, it might be worthwhile to turn off assembly for a while on that architecture (./configure --openssl-no-asm) and see if the problem goes away.

> If it only happens on arm64, it might be worthwhile to turn off assembly for a while

Done, https://github.com/nodejs/build/issues/1556.
I'll try to remember to report the outcome...

I've never tested Node's OpenSSL asm on centos-arm64.

@refack Can I log in to the machine and test-build openssl-1.1.0 and Node?

@shigeki does:
https://ci.nodejs.org/job/node-test-commit-arm/nodes=centos7-arm64-gcc6/19655/consoleFull (✔️)

    12:24:34 python ./configure --verbose  --openssl-no-asm

enough?

Haven't seen this fail on centos7-arm64-gcc6 since, but now it shows on ubuntu1604-arm64:

https://ci.nodejs.org/job/node-test-commit-arm/19612/nodes=ubuntu1604-arm64/console
test.parallel/test-https-eof-for-eom

Error Message
fail (1)

Stacktrace
1) Making Request
2) Server got request
3) Client got response headers.
events.js:167
      throw er; // Unhandled 'error' event
      ^

Error: 281473409806336:error:1408F119:SSL routines:ssl3_get_record:decryption failed or bad record mac:../deps/openssl/openssl/ssl/record/ssl3_record.c:469:

Emitted 'error' event at:
    at TLSSocket.socketErrorListener (_http_client.js:399:9)
    at TLSSocket.emit (events.js:182:13)
    at TLSSocket._emitTLSError (_tls_wrap.js:600:10)
    at TLSWrap.onerror (_tls_wrap.js:268:11)

@refack if you're keeping track of this, can you also record which arm64 machine it's happening on? I think this is the same as the error we had previously, and it ended up occurring mainly (or only?) on one of the machines we have. There was a whole thread where I roped in openssl folks and packet.net folks, but it was never fully resolved. IIRC it kind of died down. I'm sure the thread is still active somewhere, nodejs/build maybe.

ftr test-packetnet-ubuntu1604-arm64-1 had a _ton_ of node zombies on it. I bet this is related (don't know which way it's related). It may be that this machine is dodgy. If we get enough evidence of this one being a culprit we can get packet.net to get into it and figure out what the problem might be, likely hardware IMO.

> ftr test-packetnet-ubuntu1604-arm64-1 had a _ton_ of node zombies on it.

I had a feeling that I'd seen the `decryption failed or bad record mac` mostly on test-packetnet-centos7-arm64-1, but the records point to more reports from test-packetnet-ubuntu1604-arm64-1, with some from test-packetnet-ubuntu1604-arm64-1

For the time being I added --openssl-no-asm for *centos7-arm64* (https://github.com/nodejs/build/issues/1556)
Per: https://github.com/nodejs/node/issues/23913#issuecomment-434637662

Last 48 hours:

test-packetnet-ubuntu1604-arm64-2 - https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu1604-arm64/19741/testReport/junit/(root)/test/parallel_test_tls_fast_writing/

test-packetnet-ubuntu1604-arm64-1 https://ci.nodejs.org/job/node-test-commit-arm/nodes=ubuntu1604-arm64/19556/testReport/junit/(root)/test/parallel_test_tls_pfx_authorizationerror/

(I'm hesitant to add --openssl-no-asm to these since they're the only coverage we have for OpenSSL asm on ARM64)

yeah, let's not turn asm off totally for arm64 since we're shipping arm64 binaries without this option at the moment and it's not something we want left uncovered!

Does it happen on all arm64 machines or on a subset? I see only the ones mentioned in this issue on ci.nodejs.org so I'm guessing it's all of them?

If that's the case, the prudent thing to do is turn off assembly in release binaries for a while until we get to the bottom of this. Better slow than broken.

> Does it happen on all arm64 machines or on a subset?

Initially I saw it only on one specific machine, but since then it seems to be happening on all 4 arm64 public CI workers (2 x centos7 + 2 x ubuntu1604).

@refack I need to check and compare the asm binaries between the ones built in Node.js and in bare openssl-1.1.0. This requires me to log in to the server and work on it, as I did for the plinux/smartos issues in OpenSSL-1.1.1. If that is not allowed, please download https://www.openssl.org/source/openssl-1.1.0i.tar.gz, build it, and give me its tar.gz together with a built Node.js tar.gz.

Run natively on test-packetnet-ubuntu1604-arm64-2:

```
./config
make
```

openssl-1.1.0i.built.tar.xz.zip

@rvagg if you have a pointer to specific dodgy Packet hardware, please let me know. We're getting some more inventory of newer (faster) gear and if you have a flaky test it would be ideal to check on other hardware too.

I know this sort of thing can be slow-going, but anything new to report on this?

I can make PoC code to reproduce this issue on the CI machine, but it still needs further investigation.

Is there a minimum set of code that reproduces this problem, short of setting up the whole CI environment?

> I know this sort of thing can be slow-going, but anything new to report on this?

The flaky part of this was resolved with https://github.com/nodejs/node/pull/24270, so this could be "downgraded" to a tracking issue.

"resolved" is not the right word; turning off ASM is far from optimal, and we need to get back to a state where ASM is back on for ARM64.

@vielmetti I think @shigeki is only referring to having it reproducible on the packet.net machine he has access to for testing it out, so it probably wouldn't need a large setup. @shigeki do you have a minimal enough case that you could document something, so that @vielmetti could try elsewhere since he has access to a wide array of ARM64 hardware?

> "resolved" is not the right word

Ack. The only thing that was resolved was the flakiness (in the tests, and presumably in production). Hence we didn't close this issue, and IIUC @shigeki is working on a resolution: https://github.com/nodejs/build/issues/1567

Yes, I am now bisecting to identify the issue using my PoC code. Please wait a moment.

Updated: I've been testing on the new AWS EC2 a1.4xlarge arm64 servers on Ubuntu 16 and 18 for several hours, but no issues have been found yet. I'm going to ask for a new packet.net server to see if the issue is specific to the hardware.

What precise version of OpenSSL is flaking? I suspect that the problem is upstream, and if there's some variance in arm64 hardware versions I'd like to bring this to more people's attention.

As far as my testing goes, this is not an upstream OpenSSL issue. The asm binaries are exactly the same between bare OpenSSL and Node.
Also, I cannot reproduce the issue with bare OpenSSL on the arm64 packet.net server.
Moreover, the issue does not occur in Node v10. This issue is specific to Node v11, even though the OpenSSL sources are identical.

The cores on the AWS a1.* ("Graviton") are appreciably faster than the individual cores in the Packet c1.large.arm ("ThunderX"). Is it possible that there's a timing issue in here somewhere?

Does the asm code use vector (what used to be called NEON) instructions?

I've pointed some additional people at this issue.

I'm sure some timing issues exist. The packet.net server has 96 cores, and the failure rate seems to depend on the number of parallel processes and their duration, but I found the failure occurred even with 2 parallel processes.
I ran tests on a1.4xlarge since it has 16 CPUs, to see if the issue occurs with many cores.

@shigeki Can you also give this a try under Red Hat on the a1.* instances on AWS? They use a 64K page size which will potentially shake out other issues, or maybe this same issue.

Refreshing this issue to determine if it's still open, and if so, to inquire as to what's necessary to make progress. @shigeki @refack @nodejs/platform-arm

I ran tests on the current master HEAD over the last 24 hours and found no issues.
The issue occurred with V8 6.9 and OpenSSL 1.1.0, but it seems to be resolved with V8 7.1 and OpenSSL 1.1.1 at the current master HEAD.

Is there any chance to see if the CIs work well with asm support enabled on arm64?

Thanks @shigeki - if someone has the CI infrastructure running, a link to the dashboard would let me review the logs.

@refack it took 4 hr 26 min but the Ubuntu stress test came back green. https://ci.nodejs.org/job/node-stress-single-test/2154/nodes=ubuntu1604-arm64/

CentOS took 2 hr 12 min and also came back green.
https://ci.nodejs.org/job/node-stress-single-test/nodes=centos7-arm64-gcc6/2154/

@shigeki are we confident that this is solved in the latest master? I see we have several tests coming back green, but I don't know if that is enough.

I see a warning at the top

    python ./configure --verbose
    WARNING: C++ compiler too old, need g++ 6.3.0 or clang++ 8.0.0 (CXX=ccache g++)

Is this related, @refack ?

No, that warning was added in preparation for the next version of node due next month.
P.S. AFAICT from the logs, this happened once, for build 23201 at 7:08 AM UTC on 2019-03-26.

  • The same binary is used for all tests
  • 50 tests are run in parallel (50 procs from the same binary)
  • It seemed to work up till test 707
  • Once it started failing it persisted till the end of the run
  • Next job run two hour later, all ok
  • Same code was tested on other workers and passed - https://ci.nodejs.org/job/node-test-commit-arm/23201/

@refack - do you have access to any of the system logs for this machine? I'm wondering if there are contemporary console logs that would hint at temporary hardware failures.

On at least one similarly equipped system we were seeing flakes as a result of thermal issues under heavy sustained load, which ended up being chased back to a problem with a fan.

How is storage configured here (are you using directly attached storage, or is it running off attached network block storage)? I want to also rule out drive failures.

Alternatively (and perhaps easier on everyone) if there are any indications that this is a hardware flake at all we should look at swapping out the hardware.

A candidate for hardware problems: https://github.com/nodejs/node/issues/25028, same machine shigeki was testing openssl on. Not confirmed, could be something else, worth watching though.

Is there an update on this issue?

I have some flexibility now (that I didn't have then) in swapping out hardware if it turns out that a particular piece of gear is having issues. Let me know if that would be appealing, now or in the future.

@sam-github and @Trott correct me if I'm wrong but I don't think we've seen any systematic OpenSSL failures on ARM in a while now, either because of OpenSSL upgrades, the way we're using it, or something else! In fact I don't think we've had to do a whole lot of ARM64-specific work in a while now, quite stable. This is good news I think but we'll keep you informed @vielmetti, thanks for being a resource as always.

> @sam-github and @Trott correct me if I'm wrong but I don't think we've seen any systematic OpenSSL failures on ARM in a while now, either because of OpenSSL upgrades, the way we're using it, or something else! In fact I don't think we've had to do a whole lot of ARM64-specific work in a while now, quite stable. This is good news I think but we'll keep you informed @vielmetti, thanks for being a resource as always.

We haven't seen any failures but we turned off asm support for arm64 in https://github.com/nodejs/node/pull/24270 and haven't reenabled it: https://github.com/nodejs/node/blob/5c61c5d152aedfe992fd42b3d51823b16a547b21/common.gypi#L81-L86
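For reference, the linked common.gypi lines are the arm64 kill switch. The fragment is roughly of the following shape (the comment wording here is paraphrased, not the verbatim file contents):

```
    ['target_arch=="arm64"', {
      # Pending resolution of
      # https://github.com/nodejs/node/issues/23913
      'openssl_no_asm': 1,
    }],
```

Re-enabling asm would mean removing this condition so the default asm build takes effect again for arm64.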

Thanks @richardlau @rvagg - it's good that the system is stable, so that we can rule out infrastructure issues. However of course I'd always prefer not to see a performance regression.

Would it be worthwhile to test a new PR to turn asm support back on? And if you get any flakiness, I'm happy to look upstream for some more specialized vendor support and resources to take a look at what's going on.

@vielmetti Definitely worthwhile, https://github.com/nodejs/node/issues/23913#issuecomment-499475913 points to what would need changing, thanks for looking at this.

Thanks. The smallest PR I could imagine is in #28180, and the goal I see is to identify any flakiness associated with that one particular commit.

I came across this today in the Jenkins setup, associated with the opening of this issue, invoked for arm64 centos:

    # temporary measure to evaluate https://github.com/nodejs/node/issues/23913
    export CONFIG_FLAGS="$CONFIG_FLAGS --openssl-no-asm"

Runs in here: https://ci.nodejs.org/job/node-test-commit-arm

So I guess even with it re-enabled in #28180, we've still been compiling without asm.

@sam-github @vielmetti should I just yank it out and see what happens?

Yes

done, we shall see if anything shows up

@nodejs/build @nodejs/crypto ... does this issue need to remain open?

Have just confirmed the config entry is removed from CI (it was commented out but is now removed), so with #28180 in place and no reported problems since I think we can call this done!
