Kibana version: 7.3.0 BC5
Elasticsearch version: 7.3.0 BC5
Server OS version: CentOS 7
Browser version: N/A
Browser OS version: N/A
Original install method (e.g. download page, yum, from source, etc.): rpm default dist
Describe the bug: Two tests in the integration-test repo each create an interval watch that emails a Kibana dashboard report, one as a PDF and one as a PNG. On 7.3.0 BC4 and earlier both tests pass, but on 7.3.0 BC5 the second test always fails. If I swap the order of the PDF and PNG tests, it is still always the second one that fails.
These watches send the reports as email attachments, but based on the error message I don't think that is relevant to the problem.
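For context, each test creates a watch roughly like the sketch below (the exact jobParams, recipients, and credentials are elided; this is an approximation, not the tests' literal payload, and the PNG test uses test_report.png with the /api/reporting/generate/png path instead). The reporting attachment is what causes Watcher to POST to the Kibana reporting endpoint that shows up in the error:
PUT _watcher/watch/<watch_id>
{
  "trigger": { "schedule": { "interval": "300s" } },
  "input": { "none": {} },
  "condition": { "always": {} },
  "actions": {
    "email_admin": {
      "email": {
        "to": "...",
        "subject": "...",
        "attachments": {
          "test_report.pdf": {
            "reporting": {
              "url": "http://localhost:5601/api/reporting/generate/printablePdf?jobParams=..."
            }
          }
        }
      }
    }
  }
}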
An example of the .watcher-history doc for one of these failures:
00:25:12.817 .watcher-history*={"took":67,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":0.6931472,"hits":[{"_index":".watcher-history-10-2019.07.19","_type":"_doc","_id":"watcher_report1563554002370_44c82301-867c-4638-a051-5c5090211b9c-2019-07-19T16:45:37.861919Z","_score":0.6931472,"_source":{"watch_id":"watcher_report1563554002370","node":"thvOhxZ2SpqAk0pC-aWvdQ","state":"executed","status":{"state":{"active":true,"timestamp":"2019-07-19T16:45:37.861Z"},"last_checked":"2019-07-19T16:45:37.861Z","last_met_condition":"2019-07-19T16:45:37.861Z","actions":{"email_admin":{"ack":{"timestamp":"2019-07-19T16:45:37.861Z","state":"awaits_successful_execution"},"last_execution":{"timestamp":"2019-07-19T16:45:37.861Z","successful":false,"reason":""}}},"execution_state":"executed","version":-1},"trigger_event":{"type":"schedule","triggered_time":"2019-07-19T16:45:37.861Z","schedule":{"scheduled_time":"2019-07-19T16:45:37.542Z"}},"input":{"none":{}},"condition":{"always":{}},"result":{"execution_time":"2019-07-19T16:45:37.861Z","execution_duration":4,"input":{"type":"none","status":"success","payload":{}},"condition":{"type":"always","status":"success","met":true},"actions":[{"id":"email_admin","type":"email","status":"failure","error":{"root_cause":[{"type":"exception","reason":"Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"}],"type":"exception","reason":"Watch[watcher_report1563554002370] reporting[test_report.pdf] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]"}}]},"messages":[]}}]}}
with the important part being:
Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/printablePdf], status[408]
On a local install, I kept Elasticsearch 7.3.0 BC5 and changed Kibana back to BC4, and the tests pass. Change Kibana back to BC5 and they fail again, so this really does appear to be caused by a Kibana change between those two BCs.
Steps to reproduce:
Expected behavior: Both watches should work as they have before.
Screenshots (if relevant):
Errors in browser console (if relevant):
Provide logs and/or server output (if relevant): I haven't found any errors in either the Kibana or Elasticsearch logs related to the timeout when triggering report generation.
Any additional context:
Pinging @elastic/kibana-stack-services
I'll take a look at this first thing Monday.
For timing purposes on a failing test, here's the watch being added:
16:46:30.247: { _index: '.watches',
_type: '_doc',
_id: 'watcher_report1563572456661',
_version: 1,
result: 'created',
_shards: { total: 1, successful: 1, failed: 0 },
_seq_no: 15,
_primary_term: 1 }
The watch has a 300s interval (we want it to complete before it gets triggered a second time).
At this point, just short of 5 minutes in, the watch hasn't run yet:
16:51:24.486:
.watcher-history*=
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 0,
"relation": "eq"
},
"max_score": null,
"hits": []
}
}
And by the next check (about 15 seconds later), it has failed:
16:51:40.061:
.watcher-history*=
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1.5404451,
"hits": [
{
"_index": ".watcher-history-10-2019.07.19",
"_type": "_doc",
"_id": "watcher_report1563572456661_7dd99997-b417-40c1-937a-06554dc8e742-2019-07-19T21:51:30.409402Z",
"_score": 1.5404451,
"_source": {
"watch_id": "watcher_report1563572456661",
"node": "k9oLryI0R5-Vy68ht9wG6g",
"state": "executed",
"status": {
"state": {
"active": true,
"timestamp": "2019-07-19T21:51:30.409Z"
},
"last_checked": "2019-07-19T21:51:30.409Z",
"last_met_condition": "2019-07-19T21:51:30.409Z",
"actions": {
"email_admin": {
"ack": {
"timestamp": "2019-07-19T21:51:30.409Z",
"state": "awaits_successful_execution"
},
"last_execution": {
"timestamp": "2019-07-19T21:51:30.409Z",
"successful": false,
"reason": ""
}
}
},
"execution_state": "executed",
"version": -1
},
"trigger_event": {
"type": "schedule",
"triggered_time": "2019-07-19T21:51:30.409Z",
"schedule": {
"scheduled_time": "2019-07-19T21:51:30.243Z"
}
},
"input": {
"none": {}
},
"condition": {
"always": {}
},
"result": {
"execution_time": "2019-07-19T21:51:30.409Z",
"execution_duration": 15,
"input": {
"type": "none",
"status": "success",
"payload": {}
},
"condition": {
"type": "always",
"status": "success",
"met": true
},
"actions": [
{
"id": "email_admin",
"type": "email",
"status": "failure",
"error": {
"root_cause": [
{
"type": "exception",
"reason": "Watch[watcher_report1563572456661] reporting[test_report.png] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/png], status[408]"
}
],
"type": "exception",
"reason": "Watch[watcher_report1563572456661] reporting[test_report.png] Error response when trying to trigger reporting generation host[localhost], port[5601] method[POST], path[/api/reporting/generate/png], status[408]"
}
}
]
},
"messages": []
}
}
]
}
}
I ran again with logging.events.log: ['warning', 'error', 'fatal', 'info', 'reporting'] in my kibana.yml
timeout3.log
I don't see anything revealing in that log. The timestamps in the log are UTC, which is 5 hours ahead of my local time used by the test logging. So matching against the above log, here's when the watch was created:
17:08:05.440: { _index: '.watches',
_type: '_doc',
_id: 'watcher_report1563573751941',
_version: 1,
result: 'created',
_shards: { total: 1, successful: 1, failed: 0 },
_seq_no: 21,
_primary_term: 1 }
>> PASS: chrome 11 on any platform - reporting app - watcher app - PNG Reporting watch - should successfully add a new watch for png Reporting (9ms)
Here's my current theory, based on the log entries that appear right at the end of the previous test and just before this one:
{"type":"log","@timestamp":"2019-07-19T22:07:44Z","tags":["reporting","screenshots","debug"],"pid":28493,"message":"screenshot took 839ms"}
{"type":"log","@timestamp":"2019-07-19T22:07:45Z","tags":["reporting","browser-driver","chromium-driver-factory","debug"],"pid":28493,"message":"deleting chromium user data directory at [/tmp/chromium-L2BcNI]"}
Maybe this cleanup step of deleting a directory is interfering with the creation of the next watch. I'll try putting some sleep between the tests and see if that changes the results.
Update: Adding a 30 second sleep between deleting the first watch and creating the second watch didn't have any impact on the failure.
Could it possibly be that the ES client connection I use for the first test gets disconnected for some reason and can't POST the second reporting URL?
Now that I think about it more, I don't think that's it. The ES client is only used to post the watch to Elasticsearch, and that works; it isn't used to generate the report.
There's definitely a bug here, with the error being Error response when trying to trigger reporting generation, but from the logs it looks like Kibana did generate a report.
The timing is interesting as well, as the two messages are so close together: 07:44Z vs 07:45Z. There isn't enough time between those steps to even run the job.
I'm taking a look now, but my guess is that there's an issue with a timeout set someplace.
I've been running a watch that generates a PDF locally for a while now, and I'm not hitting any issues.
It repeats every 5 minutes, which makes it look like Watcher is not trying to hold a connection open between triggers.
If I set server.keepaliveTimeout: 1200000 (20 minutes) in my kibana.yml then my tests pass. Without it they fail.
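For reference, that workaround is just this one line in kibana.yml (the value is in milliseconds):
# kibana.yml: workaround only; 1200000 ms = 20 minutes
server.keepaliveTimeout: 1200000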
Here's the snippet in watcher for the initial report generation. So the request must have been returning 200 before but returning 408 now? This happens before the polling / attachment retries begin.
Quoting an email from @jasontedor
There is a breaking difference in behavior from Kibana before this change was integrated and after this change was integrated.
The timeout itself is indeed unchanged (120s before and 120s after, which is why this was seen as a safe change), but what is different is how Kibana handles the socket timeout.
Prior to the change being integrated, Kibana would quietly close the connection on a socket that had timed out by sending a TCP FIN.
After the change was integrated, Kibana now closes the connection on a socket that has timed out by first sending an HTTP response (literally "HTTP/1.1 408 Request Timeout") and then closing the connection with a TCP FIN.
This behavior change causes the HTTP client to treat the response from Kibana as an actual response, which is delivered to the application (Watcher), which treats the non-200 response as a failure.
Prior to this change, the HTTP client would treat the FIN as meaning the connection had been closed and would establish a new one on which to send the POST request. In that case the application (again, Watcher) would be none the wiser that a managed HTTP connection had been closed and re-opened on its behalf.
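A minimal Node.js sketch of the difference Jason describes (an illustration of the mechanism only, not Kibana's actual server code; the 120s value is the socket timeout mentioned above):
import * as http from 'http';

const server = http.createServer((req, res) => {
  res.end('ok');
});

// Old behavior, roughly: with no 'timeout' handler, Node destroys the idle
// socket after 120s, so the client only ever sees a TCP FIN and a pooled
// HTTP client quietly opens a fresh connection for its next request.
server.setTimeout(120000);

// New behavior, roughly: a 'timeout' handler first writes an HTTP 408 status
// line and then closes the socket, so a client that reuses the pooled
// connection reads that buffered 408 as the response to its next request.
// server.setTimeout(120000, (socket) => {
//   socket.end('HTTP/1.1 408 Request Timeout\r\n\r\n');
// });

server.listen(5601);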
It can be reproduced without needing Watcher; the steps below confirm what he describes and reproduce the problem on the 6.8, 7.2, and 7.3 branches.
$ socat TCP:localhost:5601 UNIX-LISTEN:/tmp/curl.sock,fork
$ netstat -na | grep "5601 "
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 127.0.0.1.5601 127.0.0.1.53524 ESTABLISHED
tcp4 0 0 127.0.0.1.53524 127.0.0.1.5601 ESTABLISHED
tcp4 0 0 127.0.0.1.5601 *.* LISTEN
(After letting the connection sit idle past the socket timeout, presumably the 120s discussed above:)
$ netstat -na | grep "5601 "
Proto Recv-Q Send-Q Local Address Foreign Address (state)
tcp4 0 0 127.0.0.1.5601 127.0.0.1.53524 FIN_WAIT_2
tcp4 32 0 127.0.0.1.53524 127.0.0.1.5601 CLOSE_WAIT
tcp4 0 0 127.0.0.1.5601 *.* LISTEN
$ curl -v --unix-socket /tmp/curl.sock http://localhost:5601/fov/api
* Trying /tmp/curl.sock...
* Connected to localhost (/tmp/curl.sock) port 80 (#0)
> GET /fov/api HTTP/1.1
> Host: localhost:5601
> User-Agent: curl/7.54.0
> Accept: */*
>
< HTTP/1.1 408 Request Timeout
* no chunk, no close, no size. Assume close to signal end
<
* Closing connection 0
Without the change, the difference is that those 32 bytes aren't there in the Recv-Q (32 bytes matches the length of "HTTP/1.1 408 Request Timeout" plus the terminating CRLFs), and when you send the curl request you get Empty reply from server instead of an HTTP response.
I've removed myself from this since it's not in reporting territory, but I'm still confused/concerned about the 408 response on timeout. It seems to me that we shouldn't be "holding" this response on a keep-alive socket, and I'm not sure if this is something we can change since it appears to be how Node behaves. From my perspective we're applying the timeout to the underlying socket rather than to the HTTP request itself. Ideally we'd only have this timeout per HTTP request and not per socket.
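As a rough illustration of that distinction (hypothetical code, not Kibana's implementation): a per-request timeout only covers the time spent serving a request, while a per-socket timeout also fires on an idle keep-alive connection with no request in flight:
import * as http from 'http';

const server = http.createServer((req, res) => {
  // Hypothetical per-request timeout: armed when the request arrives and
  // cleared when the response finishes, so an idle pooled connection is
  // never handed a 408 out of the blue.
  const timer = setTimeout(() => {
    if (!res.headersSent) {
      res.writeHead(408);
      res.end();
    }
  }, 120000);
  res.on('finish', () => clearTimeout(timer));

  // Real request handling would go here; this sketch just responds immediately.
  res.end('ok');
});

// Per-socket timeout (the current behavior): fires after 120s of inactivity
// on the socket itself, even between requests on a keep-alive connection.
server.setTimeout(120000);

server.listen(5601);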
@jbudz What would be the drawback of removing just the 408 response on timeout - then it should behave as it did before? Do we use / intend to use the 408 to provide messaging in the Kibana app?
We'll potentially see this anywhere an idle client connection can be kept alive longer than the Kibana server's connection. If we configure the Kibana server timeout very low, we should even be able to see this in the Kibana app itself (some will be legitimate 408s, others not; it should be clear in the network view).
From the light reading I've done of the Node.js docs, @peterschretlen's comment is what I'm leaning towards as well. It seems our client(s) in Kibana have the current behavior built in, and I wonder what other users might hit if the 408 starts showing up. I think simply removing that timeout handler would work.
I'm always a bit leery of "unusual" HTTP responses - I had to look up 408, I've never seen that one before! I would think most clients would treat it as a generic 4XX response, meaning user error, don't try this again. The reaction from a "smart" client, if it thought the timeout was transient and success might be possible with more time, would presumably be to retry the request while asking (via some option) for a longer timeout.
I don't know enough about the actual scenario to suggest alternatives, though.
The drawback is mostly around messaging and debugging. The messier side of this is browser -> server interactions that result in undefined errors, more so when there's a proxy in front adding another layer of interpretation.
I'd argue the watcher client should be managing its own long polling timeout (although maybe it is, timestamps look iffy). Anything in between Kibana and Watcher could have different behavior and its own timeout, whether it's a 504, 502, or 408. I'll remove the message and we can bring that up for discussion later.
I'd argue the watcher client should be managing its own long polling timeout (although maybe it is, timestamps look iffy).
As I mentioned in the email that @peterschretlen posted here, the underlying client library does handle this transparently as long as it happens at the socket level, but once it's done at the HTTP level I think we are doing the right thing by treating 408 as an error (the underlying client library certainly has no idea what to do with it). Right now we treat report creation as binary: it either succeeded or it didn't. I am not a fan of making this more complex on the client side.