Kibana version: 7.3.0 BC5
Elasticsearch version: 7.3.0 BC5
Server OS version: CentOS 7
Browser version: N/A
Browser OS version: N/A
Original install method (e.g. download page, yum, from source, etc.): rpm default dist
Describe the bug: Two tests in the integration-test repo each create an interval watch that emails a Kibana dashboard report, one as a PDF and one as a PNG. On 7.3.0 BC4 and earlier both tests pass, but on 7.3.0 BC5 the second test always fails. If I swap the order of the PDF and PNG tests, it is still always the second one that fails.
These watches send the reports as email attachments, but based on the error message I don't think that is relevant to the problem.
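For context, each test creates a watch roughly like the sketch below (the exact jobParams, recipients, and credentials are elided; this is an approximation, not the tests' literal payload, and the PNG test uses test_report.png with the /api/reporting/generate/png path instead). The reporting attachment is what causes Watcher to POST to the Kibana reporting endpoint that shows up in the error:
PUT _watcher/watch/<watch_id>
{
  "trigger": { "schedule": { "interval": "300s" } },
  "input": { "none": {} },
  "condition": { "always": {} },
  "actions": {
    "email_admin": {
      "email": {
        "to": "...",
        "subject": "...",
        "attachments": {
          "test_report.pdf": {
            "reporting": {
              "url": "http://localhost:5601/api/reporting/generate/printablePdf?jobParams=..."
            }
          }
        }
      }
    }
  }
}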
An example of the .watcher-history doc for one of these failures:
00:25:12.817 .watcher-history*={"took":67,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.6931472,"hits":[{"_index":".watcher-history-10-2019.07.19","_type":"_doc","_id":"watcher_report1563554002370_44c82301-867c-4638-a051-5c5090211b9c-2019-07-19T16:45:37.861919Z","_score":0.6931472,"_source":{"watch_id":"watcher_report1563554002370","node":"thvOhxZ2SpqAk0pC-aWvdQ","state":"executed","status":{"state":{"active":true,"timestamp":"2019-07-19T16:45:37.861Z"},"last_checked":"2019-07-19T16:45:37.861Z","last_met_condition":"2019-07-19T16:45:37.861Z","actions":{"email_admin":{"ack":{"timestamp":"2019-07-19T16:45:37.861Z","state":"awaits_successful_execution"},"last_execution":{"timestamp":"2019-07-19T16:45:37.861Z","successful":false,"reason":""}}},"execution_state":"executed","version":-1},"trigger_event":{"type":"schedule","triggered_time":"2019-07-19T16:45:37.861Z","schedule":{"scheduled_time":"2019-07-19T16:45:37.542Z"}},"input":{"none":{}},"condition":{"always":{}},"result":{"execution_time":"2019-07-19T16:45:37.861Z","execution_duration":4,"input":{"type":"none","status":"success","payload":{}},"condition":{"type":"always","status":"success","met":true},"actions":[{"id":"email_admin","type":"email","status":"failure","error":{"root_cause":[{"type":"exception","reason":"Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"}],"type":"exception","reason":"Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"}}]},"messages":[]}}]}}
with the important part being:
Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]
On a local install, I kept Elasticsearch 7.3.0 BC5 and changed Kibana back to BC4, and the tests pass. Change Kibana back to BC5 and they fail again, so this really does appear to be caused by a Kibana change between those two BCs.
Steps to reproduce:
Expected behavior: Both watches should work as they have before.
Screenshots (if relevant):
Errors in browser console (if relevant):
Provide logs and/or server output (if relevant): I haven't found any errors in either the Kibana or Elasticsearch logs related to the timeout when triggering report generation.
Any additional context:
Pinging @elastic/kibana-stack-services
I'll take a look at this first thing Monday.
For timing purposes on a failing test, here's the watch being added:
16:46:30.247: { _index: '.watches',
_type: '_doc',
_id: 'watcher_report1563572456661',
_version: 1,
result: 'created',
_shards: { total: 1, successful: 1, failed: 0 },
_seq_no: 15,
_primary_term: 1 }
The watch has a 300s interval (we want it to complete before it gets triggered a second time).
At this point, just short of 5 minutes in, the watch hasn't run yet:
16:51:24.486:
.watcher-history*=
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
And by the next check (about 15 seconds later), it has failed:
16:51:40.061:
.watcher-history*=
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.5404451,
"hits": [
{
"_index": ".watcher-history-10-2019.07.19",
"_type": "_doc",
"_id": "watcher_report1563572456661_7dd99997-b417-40c1-937a-06554dc8e742-2019-07-19T21:51:30.409402Z",
"_score": 1.5404451,
"_source": {
"watch_id": "watcher_report1563572456661",
"node": "k9oLryI0R5-Vy68ht9wG6g",
"state": "executed",
"status": {
"state": {
"active": true,
"timestamp": "2019-07-19T21:51:30.409Z"
},
"last_checked": "2019-07-19T21:51:30.409Z",
"last_met_condition": "2019-07-19T21:51:30.409Z",
"actions": {
"email_admin": {
"ack": {
"timestamp": "2019-07-19T21:51:30.409Z",
"state": "awaits_successful_execution"
},
"last_execution": {
"timestamp": "2019-07-19T21:51:30.409Z",
"successful": false,
"reason": ""
}
}
},
"execution_state": "executed",
"version": -1
},
"trigger_event": {
"type": "schedule",
"triggered_time": "2019-07-19T21:51:30.409Z",
"schedule": {
"scheduled_time": "2019-07-19T21:51:30.243Z"
}
},
"input": {
"none": {}
},
"condition": {
"always": {}
},
"result": {
"execution_time": "2019-07-19T21:51:30.409Z",
"execution_duration": 15,
"input": {
"type": "none",
"status": "success",
"payload": {}
},
"condition": {
"type": "always",
"status": "success",
"met": true
},
"actions": [
{
"id": "email_admin",
"type": "email",
"status": "failure",
"error": {
"root_cause": [
{
"type": "exception",
"reason": "Watch[watcher_report1563572456661] reporting[test_report.png] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/png], status[408]"
}
],
"type": "exception",
"reason": "Watch[watcher_report1563572456661] reporting[test_report.png] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/png], status[408]"
}
}
]
},
"messages": []
}
}
]
}
}
I ran again with logging.events.log: ['warning', 'error', 'fatal', 'info', 'reporting'] in my kibana.yml
timeout3.log
I don't see anything revealing in that log. The timestamps in the log are UTC, which is 5 hours ahead of my local time used by the test logging. So matching against the above log, here's when the watch was created:
17:08:05.440: { _index: '.watches',
_type: '_doc',
_id: 'watcher_report1563573751941',
_version: 1,
result: 'created',
_shards: { total: 1, successful: 1, failed: 0 },
_seq_no: 21,
_primary_term: 1 }
>> PASS: chrome 11 on any platform - reporting app - watcher app - PNG Reporting watch - should successfully add a new watch for png Reporting (9ms)
Here's my current theory, based on the log entries that appear right at the end of the previous test and just before this one:
{"type":"log","@timestamp":"2019-07-19T22:07:44Z","tags":["reporting","screenshots","debug"],"pid":28493,"message":"screenshot took 839ms"}
{"type":"log","@timestamp":"2019-07-19T22:07:45Z","tags":["reporting","browser-driver","chromium-driver-factory","debug"],"pid":28493,"message":"deleting chromium user data directory at [/tmp/chromium-L2BcNI]"}
Maybe this cleanup step of deleting a directory is interfering with the creation of the next watch. I'll try putting some sleep between the tests and see if that changes the results.
Update: Adding a 30 second sleep between deleting the first watch and creating the second watch didn't have any impact on the failure.
Could it possibly be that the ES client connection I use for the first test gets disconnected for some reason and can't POST the second reporting URL?
Now that I think about it more, I don't think that's it. The ES client is only used to post the watch to Elasticsearch, and that works; it isn't used to generate the report.
There's definitely a bug here, with the error being Error response when trying to trigger reporting generation, but from the logs it looks like Kibana did generate a report.
The timing is interesting as well, as the two messages are so close together: 07:44Z vs 07:45Z. There isn't enough time between those steps to even run the job.
I'm taking a look now, but my guess is that there's an issue with a timeout set someplace.
I've been running a watch that generates a PDF locally for a while now, and I'm not hitting any issues.
It repeats every 5 minutes, which makes it look like Watcher is not trying to hold a connection open between triggers.
If I set server.keepaliveTimeout: 1200000 (20 minutes) in my kibana.yml then my tests pass. Without it they fail.
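For reference, that workaround is just this one line in kibana.yml (the value is in milliseconds):
# kibana.yml: workaround only; 1200000 ms = 20 minutes
server.keepaliveTimeout: 1200000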
Here's the snippet in watcher for the initial report generation. So the request must have been returning 200 before but returning 408 now? This happens before the polling / attachment retries begin.
Quoting an email from @jasontedor
There is a breaking difference in behavior from Kibana before this change was integrated and after this change was integrated.
The timeout itself is indeed unchanged (120s before and 120s after, which is why this was seen as a safe change), but what is different is how Kibana handles the socket timeout.
Prior to the change being integrated, Kibana would quietly close the connection on a socket that had timed out by sending a TCP FIN.
After the change was integrated, Kibana now closes the connection on a socket that has timed out by first sending an HTTP response (literally "HTTP/1.1 408 Request Timeout") and then closing the connection with a TCP FIN.
This behavior change causes the HTTP client to treat the response from Kibana as an actual response, which is delivered to the application (Watcher), which treats the non-200 response as a failure.
Prior to this change, the HTTP client would treat the FIN as meaning the connection had been closed and would establish a new one on which to send the POST request. In that case the application (again, Watcher) would be none the wiser that a managed HTTP connection had been closed and re-opened on its behalf.
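A minimal Node.js sketch of the difference Jason describes (an illustration of the mechanism only, not Kibana's actual server code; the 120s value is the socket timeout mentioned above):
import * as http from 'http';

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Old behavior, roughly: with no 'timeout' handler, Node destroys the idle
// socket after 120s, so the client only ever sees a TCP FIN and a pooled
// HTTP client quietly opens a fresh connection for its next request.
server.setTimeout(120000);

// New behavior, roughly: a 'timeout' handler first writes an HTTP 408 status
// line and then closes the socket, so a client that reuses the pooled
// connection reads that buffered 408 as the response to its next request.
// server.setTimeout(120000, (socket) => {
//   socket.end('HTTP/1.1 408 Request Timeout\r\n\r\n');
// });

server.listen(5601);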
It can be reproduced without needing Watcher; the steps below confirm what he describes and reproduce the problem on the 6.8, 7.2, and 7.3 branches.
$ socat TCP:localhost:5601 UNIX-LISTEN:/tmp/curl.sock,fork
$ netstat -na | grep "5601 "
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 127.0.0.1.5601 127.0.0.1.53524 ESTABLISHED
tcp4 0 0 127.0.0.1.53524 127.0.0.1.5601 ESTABLISHED
tcp4 0 0 127.0.0.1.5601 *.* LISTEN
(After letting the connection sit idle past the socket timeout, presumably the 120s discussed above:)
$ netstat -na | grep "5601 "
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 127.0.0.1.5601 127.0.0.1.53524 FIN_WAIT_2
tcp4 32 0 127.0.0.1.53524 127.0.0.1.5601 CLOSE_WAIT
tcp4 0 0 127.0.0.1.5601 *.* LISTEN
$ curl -v --unix-socket /tmp/curl.sock http://localhost:5601/fov/api
* Trying /tmp/curl.sock...
* Connected to localhost (/tmp/curl.sock) port 80 (#0)
> GET /fov/api HTTP/1.1
> Host: localhost:5601
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 408 Request Timeout
* no chunk, no close, no size. Assume close to signal end
<
* Closing connection 0
Without the change, the difference is that those 32 bytes aren't there in the Recv-Q (32 bytes matches the length of "HTTP/1.1 408 Request Timeout" plus the terminating CRLFs), and when you send the curl request you get Empty reply from server instead of an HTTP response.
I've removed myself from this since it's not in reporting territory, but I'm still confused/concerned about the 408 response on timeout. It seems to me that we shouldn't be "holding" this response on a keep-alive socket, and I'm not sure if this is something we can change since it appears to be how Node behaves. From my perspective we're applying the timeout to the underlying socket rather than to the HTTP request itself. Ideally we'd only have this timeout per HTTP request and not per socket.
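As a rough illustration of that distinction (hypothetical code, not Kibana's implementation): a per-request timeout only covers the time spent serving a request, while a per-socket timeout also fires on an idle keep-alive connection with no request in flight:
import * as http from 'http';

const server = http.createServer((req, res) => {
  // Hypothetical per-request timeout: armed when the request arrives and
  // cleared when the response finishes, so an idle pooled connection is
  // never handed a 408 out of the blue.
  const timer = setTimeout(() => {
    if (!res.headersSent) {
      res.writeHead(408);
      res.end();
    }
  }, 120000);
  res.on('finish', () => clearTimeout(timer));

  // Real request handling would go here; this sketch just responds immediately.
  res.end('ok');
});

// Per-socket timeout (the current behavior): fires after 120s of inactivity
// on the socket itself, even between requests on a keep-alive connection.
server.setTimeout(120000);

server.listen(5601);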
@jbudz What would be the drawback of removing just the 408 response on timeout - then it should behave as it did before? Do we use / intend to use the 408 to provide messaging in the Kibana app?
We'll potentially see this anywhere an idle client connection can be kept alive longer than the Kibana server's connection. If we configure the Kibana server timeout very low, we should even be able to see this in the Kibana app itself (some will be legitimate 408s, others not; it should be clear in the network view).
From the light reading I've done of the Node.js docs, @peterschretlen's comment is what I'm leaning towards as well. It seems our client(s) in Kibana have the current behavior built in, and I wonder what other users might hit if the 408 starts showing up. I think simply removing that timeout handler would work.
I'm always a bit leery of "unusual" HTTP responses - I had to look up 408, I've never seen that one before! I would think most clients would treat it as a generic 4XX response, meaning user error, don't try this again. The reaction from a "smart" client, if it thought the timeout was transient and success might be possible with more time, would presumably be to retry the request while asking (via some option) for a longer timeout.
I don't know enough about the actual scenario to suggest alternatives, though.
The drawback is mostly around messaging and debugging. The messier side of this is browser -> server interactions that result in undefined errors, more so when there's a proxy in front adding another layer of interpretation.
I'd argue the watcher client should be managing its own long polling timeout (although maybe it is, timestamps look iffy). Anything in between Kibana and Watcher could have different behavior and its own timeout, whether it's a 504, 502, or 408. I'll remove the message and we can bring that up for discussion later.
I'd argue the watcher client should be managing its own long polling timeout (although maybe it is, timestamps look iffy).
As I mentioned in the email that @peterschretlen posted here, the underlying client library does handle this transparently as long as it happens at the socket level, but once it's done at the HTTP level I think we are doing the right thing by treating 408 as an error (the underlying client library certainly has no idea what to do with it). Right now we treat report creation as binary: it either succeeded or it didn't. I am not a fan of making this more complex on the client side.