Elasticsearch: Add timeout parameter to cat API

Created on 21 Jan 2015 · 7 comments · Source: elastic/elasticsearch

The nodes cat api appears to not have a timeout by default (and certainly not a way to set one). This suggests that it will wait forever on all the nodes to respond and may seem to hang. This is a request to add a timeout parameter.
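
A minimal sketch of the requested behavior from a client's point of view, assuming a hypothetical `timeout` query parameter (it does not exist at the time of this issue) and a cluster on the default local port; `requests` is a common third-party Python HTTP client:

```python
import requests

# Hypothetical: the proposed `timeout` parameter would bound how long the
# coordinating node waits on each node before returning whatever it has.
# Without it, this call can appear to hang if one node never answers.
resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "timeout": "30s"},  # `timeout` is the proposed addition
)
print(resp.text)
```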

:Core/Features/CAT APIs >enhancement help wanted

All 7 comments

@ppf2 If nodes are not responding, they will be disconnected, and the request will then return. Why does this API in particular need a specific timeout parameter, as opposed to just having a timeout on the client side?
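
For comparison, a client-side timeout is straightforward to set; a minimal sketch using Python's `requests` (the URL and the 30-second value are illustrative). The drawback, raised below, is that when the client gives up it gets nothing at all rather than partial results:

```python
import requests

try:
    # The client aborts after 30 seconds, but the server-side request keeps
    # running, and no partial information is returned to the caller.
    resp = requests.get("http://localhost:9200/_cat/nodes?v", timeout=30)
    print(resp.text)
except requests.exceptions.Timeout:
    print("_cat/nodes did not answer within 30s; no partial output available")
```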

They will indeed return once the nodes are disconnected by the master. The nodes may also recover or just take a long time to answer. I think the idea is to add an (optional) timeout parameter that acts at the node level, i.e., you get a response from all the nodes that did respond, instead of a global timeout with no answer at all. These cat commands are sometimes used when the cluster is under pressure, and you'd rather have partial information.

On Wed, Jan 21, 2015 at 10:37 AM, Pius [email protected] wrote:

The nodes cat api appears to not have a timeout by default (and certainly not a way to set one). This suggests that it will wait forever on all the nodes to respond and may seem to hang. This is a request to add a timeout parameter.

Reply to this email directly or view it on GitHub:
https://github.com/elasticsearch/elasticsearch/issues/9375

Specifically, add it to cat-nodes, cat-indices, and cat-recovery, which are broadcast to all nodes. Do we also need to add a timeout column and an exception message column?
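
Until such a parameter exists, one rough client-side workaround sketch (not the proposed server-side implementation): query each node's stats individually with a short client timeout, so a single stuck node cannot block information about the rest. The node IDs and URL are illustrative, and the caught exception plays roughly the role of the suggested exception message column:

```python
import requests

ES = "http://localhost:9200"
node_ids = ["vveylNq1SeWOHZe66Zlg8g", "BGFypOO2TMiZmOioHVLRZg"]  # from cluster state

for node_id in node_ids:
    try:
        r = requests.get(f"{ES}/_nodes/{node_id}/stats", timeout=5)
        print(node_id, "responded with HTTP", r.status_code)
    except requests.exceptions.RequestException as exc:
        # Roughly what a per-node exception message column could report.
        print(node_id, "failed:", exc)
```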

+1 on the timeout/exception columns. We have seen this quite a lot in the field, and it is a pain for admins (especially those with very large clusters): these APIs will not come back, and they have to go find out which node(s) are "bad" and may need to be restarted.

+1.
Same use case. We have a large cluster, one node has become unresponsive. If I can get a list of all _responsive_ nodes I can find the bad one by subtracting the good set from the set of all nodes (which I already have).
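
The set arithmetic described above is trivial once a list of responsive nodes is available; a minimal sketch, assuming the full node list is already known from elsewhere (inventory, an earlier snapshot) and the responsive set comes from a hypothetical timeout-enabled cat call:

```python
# Known full set of nodes (e.g. from an inventory file or earlier snapshot).
all_nodes = {"node-1", "node-2", "node-3", "node-4"}

# What a timeout-enabled _cat/nodes could return: only the nodes that answered.
responsive_nodes = {"node-1", "node-2", "node-4"}

# The unresponsive node(s) fall out by set difference.
bad_nodes = all_nodes - responsive_nodes
print(bad_nodes)  # {'node-3'}
```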

By the way, does this issue only affect the _cat APIs? We are also seeing the node info and node stats APIs not returning when nodes become unresponsive in the cluster and the master is busy recovering shards. If these other APIs have a timeout, what is the expected wait before they return with a partial result? For reference, here is the pending cluster tasks output (_cat/pending_tasks?v) from the affected cluster:

insertOrder timeInQueue priority source 
     712148       13.3h URGENT   shard-started ([logstash-2016.01.18][6], node[vveylNq1SeWOHZe66Zlg8g], [P], v[23], s[INITIALIZING], a[id=N0uMNZawSMG7u5Mm66vQ0Q], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), reason [after recovery from store] 
     712149       13.2h URGENT   shard-started ([logstash-2016.01.08][2], node[vveylNq1SeWOHZe66Zlg8g], [P], v[27], s[INITIALIZING], a[id=Ld5N0njGRdu4BmbVwtxffQ], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), reason [after recovery from store] 
     712152        4.9h URGENT   shard-started ([logstash-2016.01.05][7], node[vveylNq1SeWOHZe66Zlg8g], [P], v[34], s[INITIALIZING], a[id=SGLUBwbSSmKtnT_TXqK42w], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), reason [after recovery from store] 
     712154        4.4h URGENT   shard-started ([logstash-2016.02.02][5], node[BGFypOO2TMiZmOioHVLRZg], [R], v[9], s[INITIALIZING], a[id=DMJVmBsmT16Gv8j1M__ZwA], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:29:03.874Z], details[node_left[BGFypOO2TMiZmOioHVLRZg]]]), reason [after recovery (replica) from node [{San Carlos}{8903M2y7S0SAGwQHG5d9xA}{10.7.92.146}{10.7.92.146:9300}{disk_type=hdd, master=false}]]                                                 
     712142       22.2h HIGH     shard-failed ([logstash-2016.01.27][9], node[vveylNq1SeWOHZe66Zlg8g], [P], v[20], s[INITIALIZING], a[id=1Z6zZBt6Rk-z58CfgyLJTQ], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [master {Los Altos}{-VahHSUcQbeDDFfgyP86kg}{10.7.92.137}{10.7.92.137:9300}{data=false, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure] 
     712153        4.7h URGENT   shard-started ([logstash-2016.01.06][8], node[vveylNq1SeWOHZe66Zlg8g], [P], v[29], s[INITIALIZING], a[id=HHivVyWnQZWMn5rRMi7LJQ], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), reason [after recovery from store] 
     712126        2.3d HIGH     cluster_reroute(async_shard_fetch) 
     712016        6.8d NORMAL   master ping (from: [vveylNq1SeWOHZe66Zlg8g]) 
     712143       22.2h HIGH     shard-failed ([logstash-2016.01.30][8], node[vveylNq1SeWOHZe66Zlg8g], [P], v[9], s[INITIALIZING], a[id=lq6qETlmSByCoGBbftPSjA], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [master {Los Altos}{-VahHSUcQbeDDFfgyP86kg}{10.7.92.137}{10.7.92.137:9300}{data=false, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]  
     712146       13.9h HIGH     shard-failed ([logstash-2016.01.27][9], node[vveylNq1SeWOHZe66Zlg8g], [P], v[20], s[INITIALIZING], a[id=1Z6zZBt6Rk-z58CfgyLJTQ], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [master {Los Altos}{-VahHSUcQbeDDFfgyP86kg}{10.7.92.137}{10.7.92.137:9300}{data=false, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure] 
     712150        5.5h HIGH     shard-failed ([logstash-2016.01.27][9], node[vveylNq1SeWOHZe66Zlg8g], [P], v[20], s[INITIALIZING], a[id=1Z6zZBt6Rk-z58CfgyLJTQ], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [master {Los Altos}{-VahHSUcQbeDDFfgyP86kg}{10.7.92.137}{10.7.92.137:9300}{data=false, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure] 
     712139        1.2d HIGH     shard-failed ([logstash-2016.01.27][9], node[vveylNq1SeWOHZe66Zlg8g], [P], v[20], s[INITIALIZING], a[id=1Z6zZBt6Rk-z58CfgyLJTQ], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [failed to create shard] 
     711832       10.7d NORMAL   master ping (from: [sTVPg975QRGPGEebZ8tsOg]) 
     711974        7.6d NORMAL   master ping (from: [b_53kOVuQ5u8IbipsdUP5Q]) 
     712118          4d NORMAL   master ping (from: [vveylNq1SeWOHZe66Zlg8g]) 
     712056          6d NORMAL   master ping (from: [b_53kOVuQ5u8IbipsdUP5Q]) 
     712039        6.4d NORMAL   master ping (from: [vveylNq1SeWOHZe66Zlg8g]) 
     711844       10.6d NORMAL   master ping (from: [BGFypOO2TMiZmOioHVLRZg]) 
     712074        5.6d NORMAL   master ping (from: [b_53kOVuQ5u8IbipsdUP5Q]) 
     712147       13.9h HIGH     shard-failed ([logstash-2016.01.30][8], node[vveylNq1SeWOHZe66Zlg8g], [P], v[9], s[INITIALIZING], a[id=lq6qETlmSByCoGBbftPSjA], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [master {Los Altos}{-VahHSUcQbeDDFfgyP86kg}{10.7.92.137}{10.7.92.137:9300}{data=false, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]  
     711886       10.3d NORMAL   master ping (from: [sTVPg975QRGPGEebZ8tsOg]) 
     712151        5.5h HIGH     shard-failed ([logstash-2016.01.30][8], node[vveylNq1SeWOHZe66Zlg8g], [P], v[9], s[INITIALIZING], a[id=lq6qETlmSByCoGBbftPSjA], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [master {Los Altos}{-VahHSUcQbeDDFfgyP86kg}{10.7.92.137}{10.7.92.137:9300}{data=false, master=true} marked shard as initializing, but shard is marked as failed, resend shard failure]  
     711961        9.6d NORMAL   master ping (from: [sTVPg975QRGPGEebZ8tsOg]) 
     712141        1.1d HIGH     shard-failed ([logstash-2016.01.30][8], node[vveylNq1SeWOHZe66Zlg8g], [P], v[9], s[INITIALIZING], a[id=lq6qETlmSByCoGBbftPSjA], unassigned_info[[reason=NODE_LEFT], at[2016-02-03T19:35:58.610Z], details[node_left[vveylNq1SeWOHZe66Zlg8g]]]), message [failed to create shard] 
     712072        5.7d NORMAL   master ping (from: [vveylNq1SeWOHZe66Zlg8g]) 

@inqueue pointed out that the reporting of pending tasks can be off due to a bug, so we can ignore the timeInQueue values that run into days above :)

This would be a very very helpful feature.
