Query/Question
Hi,
I have an application that automatically creates and destroys some resources in Azure North Europe for use in automated tests.
Over the past week we have had many problems with resources taking a long time to create, or failing to create, presumably due to current high demand.
We have also noticed that sometimes the SDK reports a timeout & throws but the resource does eventually get created. In these cases, we catch the exception & retry creating the resource, but because the original resource does eventually get created we end up with many more resources than we needed.
I've been looking at how we can increase timeouts, or otherwise handle this situation better.
There are two things I want to try, but I don't fully understand the implications of:
- Setting IAzureClient.LongRunningOperationRetryTimeout to a high value.
- Setting IAzureClient.HttpClient.Timeout to a high value.

Would doing either of these (or a combination of both) mean that the SDK waits until ARM has actually created, or failed to create, the resource?
I don't mind resources taking a long time to create, but I want the process to be deterministic. Having our system think resources have failed to create, but then actually appear some time later is problematic.
If there are other ways of handling this, and I'm looking at the wrong place, let me know!
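For reference, this is roughly what I have in mind (a sketch only; the client type and the values are hypothetical, assuming one of the generated .NET management clients):

```csharp
// Hypothetical sketch: raising both timeouts on a generated management client.
// ComputeManagementClient and the values below are assumptions, not our real code.
var client = new ComputeManagementClient(credentials);

// Default polling interval (in seconds) between long-running-operation status checks.
client.LongRunningOperationRetryTimeout = 60;

// Overall timeout applied to each individual HTTP request.
client.HttpClient.Timeout = TimeSpan.FromMinutes(5);
```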
Environment:
dotnet --info output (for .NET Core projects): Linux, .NET Core 2.2.6

//cc: @yaohaizh
Hi Jesse, as you might already know, Yaohai left Microsoft. Could you ping @weidongxu-microsoft instead? Thanks.
Weidong, could you share some insights about this question? Thank you!
Setting IAzureClient.LongRunningOperationRetryTimeout is likely not related to the issue. It is the default retry-after value for an LRO (long-running operation). Usually the service returns a value which overrides it.
Setting IAzureClient.HttpClient.Timeout might help a bit. However, I think ARM itself has a timeout of about 1 or 2 minutes (https://github.com/Azure/azure-resource-manager-rpc/blob/master/v1.0/common-api-details.md#client-request-timeout), so a value larger than 2 minutes probably has no effect.
Depending on the nature of the problem, adding a RetryPolicy might help, if the failure is actually packet loss or a timeout during the polling phase of the LRO (after resource provisioning is accepted but not yet completed).
@yungezz
Thanks for that @weidongxu-microsoft. We're not having the problem any more, presumably because Azure Resource Manager is working as intended again.
We do have logging of the exceptions from back when they were happening, though. Is there anything I could look for to help identify what would help (e.g. the RetryPolicy you suggest)? Or perhaps additional conditions we could log on, so that we have the relevant data should this happen again.
For logging, one can enable it via
Azure.Configure().WithLogLevel(HttpLoggingDelegatingHandler.Level.BodyAndHeaders)
.Authenticate(...);
with additional instructions (following prints to console)
ServiceClientTracing.AddTracingInterceptor(new ConsoleTracer());
ServiceClientTracing.IsEnabled = true;
class ConsoleTracer : IServiceClientTracingInterceptor
{
    public void Information(string message)
    {
        Console.WriteLine(message);
    }

    public void TraceError(string invocationId, Exception exception)
    {
        Console.WriteLine("Exception in {0}: {1}", invocationId, exception);
    }

    public void ReceiveResponse(string invocationId, HttpResponseMessage response)
    {
        Console.WriteLine("ReceiveResponse {0}\n{1}", invocationId, response.AsFormattedString());
    }

    public void SendRequest(string invocationId, HttpRequestMessage request)
    {
        Console.WriteLine("SendRequest {0}\n{1}", invocationId, request.AsFormattedString());
    }

    public void Configuration(string source, string name, string value) { }

    public void EnterMethod(string invocationId, object instance, string method, IDictionary<string, object> parameters) { }

    public void ExitMethod(string invocationId, object returnValue) { }
}
It could be used with modifications, e.g. to only output 4xx and 5xx responses.
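For example, the ReceiveResponse method above could be modified to log only error responses:

```csharp
public void ReceiveResponse(string invocationId, HttpResponseMessage response)
{
    // Only log 4xx and 5xx responses; skip successes to reduce noise.
    if ((int)response.StatusCode >= 400)
    {
        Console.WriteLine("ReceiveResponse {0}\n{1}", invocationId, response.AsFormattedString());
    }
}
```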
Be aware that the bearer token is also printed out, and it is better to redact it.
As for the RetryPolicy, it might not be very helpful for logging. If you use it, the general suggestion is to enable it only for the GET method, and only for 408 and 5xx (minus 429) status codes.
Thanks, I'll add additional logging following that example.
RE the RetryPolicy: it sounds like a good idea for me to configure that too. Is there any reason not to, or anything to be cautious of (apart from limiting the status codes and only retrying GET requests)?
I would assume if it was always a good idea the SDK would have it enabled by default?
RetryPolicy behavior is pretty complicated and hard to predict across different use cases, and hence only a retry on 429 (too many requests) is included by default.
And whether it is helpful or not could depend on the nature of the failure. E.g. if the response to a PUT never gets back to you, a RetryPolicy is probably not going to help much.
You can configure it, but do be careful (and you might want to exclude 501 and 505, i.e. NOT_IMPLEMENTED and HTTP_VERSION_NOT_SUPPORTED).
Thanks for the info!