Describe the bug
I am looking for help troubleshooting a frequent error I am getting while using this SDK. The error details are below; it happens intermittently.
request to https://{accountName}.blob.core.windows.net/path failed, reason: connect ETIMEDOUT 13.70.99.30:443
at ClientRequest.<anonymous> (D:\home\site\wwwroot\node_modules\@azure\core-http\node_modules\node-fetch\lib\index.js:1455)
at ClientRequest.clsBind (D:\home\site\wwwroot\node_modules\cls-hooked\context.js:172)
at ClientRequest.emit (events.js:187)
at ClientRequest.emitted (D:\home\site\wwwroot\node_modules\emitter-listener\listener.js:134)
at TLSSocket.socketErrorListener (_http_client.js:391)
at TLSSocket.emit (events.js:182)
at emitErrorNT (internal/streams/destroy.js:82)
at emitErrorAndCloseNT (internal/streams/destroy.js:50)
at process._tickCallback (internal/process/next_tick.js:63)
To Reproduce
Steps to reproduce the behavior:
I do not have reliable reproduction steps.
Expected behavior
The SDK should not encounter timeouts when writing to the blob storage account.
Additional context
I understand there are no reliable reproduction steps, but any guidance that would help me fix or debug this bug would be really appreciated.
@ljian3377 @jiacfan Can you take a look?
@js-kyle
You can customize the retry and timeout settings by specifying StorageRetryOptions (the retryOptions field of StoragePipelineOptions) when creating the client; a sketch of overriding them follows the defaults below. The default settings are:
// Default values of StorageRetryOptions
const DEFAULT_RETRY_OPTIONS: StorageRetryOptions = {
  maxRetryDelayInMs: 120 * 1000,
  maxTries: 4,
  retryDelayInMs: 4 * 1000,
  retryPolicyType: StorageRetryPolicyType.EXPONENTIAL,
  secondaryHost: "",
  tryTimeoutInMs: undefined // Use server-side default timeout strategy
};
(You can also specify a per-operation timeout with the abortSignal option.)
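For example, here is a minimal sketch of overriding the retry settings when constructing the client (the account and key variables and the specific values are placeholders, not recommendations):

const { BlobServiceClient, StorageSharedKeyCredential } = require("@azure/storage-blob");

// account and key are placeholders for your storage account name and key
const credential = new StorageSharedKeyCredential(account, key);
const blobServiceClient = new BlobServiceClient(
  `https://${account}.blob.core.windows.net`,
  credential,
  {
    retryOptions: {
      maxTries: 6,                  // retry more times than the default 4
      retryDelayInMs: 2 * 1000,     // initial delay between retries
      maxRetryDelayInMs: 60 * 1000, // cap on the exponential backoff
      tryTimeoutInMs: 30 * 1000     // fail a single attempt after 30s instead of the server default
    }
  }
);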
You can enable request/response logging by setting the AZURE_LOG_LEVEL environment variable, or dynamically by importing setLogLevel from @azure/logger and calling it with a log level.
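For instance, a minimal sketch of turning on verbose logging at runtime:

// Equivalent to setting AZURE_LOG_LEVEL=info before starting the process
const { setLogLevel } = require("@azure/logger");
setLogLevel("info"); // also accepts "verbose", "warning", or "error"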
Not sure if these can help. @jeremymeng any insight?
Just a follow-up on this problem (I work with @js-kyle): we think this error is related to TCP connections not being reused despite keepAlive being enabled. Many connections were being created and hanging around, and that sometimes caused the total number of TCP connections to exceed the global limit.
We had a similar problem with storage-queue and fixed it by changing our implementation, but we couldn't find a fix for this one, so we have disabled keepAlive for now. The number of connections is still the same, but they close very quickly, which makes this error less likely.
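For reference, a sketch of that workaround — disabling keep-alive via the client options (keepAliveOptions is the same option the repro script below enables; account and sharedKeyCredential are placeholders):

const blobServiceClient = new BlobServiceClient(
  `https://${account}.blob.core.windows.net`,
  sharedKeyCredential,
  // With keep-alive off, each request opens its own connection and closes it promptly
  { keepAliveOptions: { enable: false } },
);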
@OP-Klaus thanks for sharing your insights into this problem. Network issues can become complex in the PaaS case, since how the underlying platform distributes and manages network resources is opaque to the user. Your workaround sounds feasible; feel free to let us know if you need further assistance from us.
@ramya-rao-a @jeremymeng Can you help check the keepAlive implementation in core-http? Please verify there is no connection leak when keep-alive is enabled.
@js-kyle @OP-Klaus do you have more information (code pattern, API used, etc.) that could help us reproduce the issue? We made a fix in https://github.com/Azure/azure-sdk-for-js/pull/5552 to reuse agents when keepAlive or a proxy is used. There could be situations where our caching doesn't work as expected. /cc @daviwil
@jeremymeng here's a script that reproduces it for me on v12.0.1:
'use strict';

const { BlobServiceClient, StorageSharedKeyCredential } = require('@azure/storage-blob');

const AZURE_CONTENT_ACCOUNT = process.env.AZURE_CONTENT_ACCOUNT;
const AZURE_CONTENT_KEY = process.env.AZURE_CONTENT_KEY;

const sharedKeyCredential = new StorageSharedKeyCredential(AZURE_CONTENT_ACCOUNT, AZURE_CONTENT_KEY);
const blobServiceClient = new BlobServiceClient(
  `https://${AZURE_CONTENT_ACCOUNT}.blob.core.windows.net`,
  sharedKeyCredential,
  { keepAliveOptions: { enable: true } },
);
const containerClient = blobServiceClient.getContainerClient('courses');

const setPageContentToAzure = (location, content) => {
  const blobClient = containerClient.getBlobClient(location);
  const blockBlobClient = blobClient.getBlockBlobClient();
  return blockBlobClient.upload(content, Buffer.byteLength(content));
};

const runTest = async () => {
  for (let i = 0; i < 500; i++) {
    await setPageContentToAzure('pageId' + i, '{ foo: \'bar\'}');
  }
};

runTest().then(_ => console.log('After runTest() resolved'));
Edit: Reduced the script size
@OP-Klaus Thank you very much for the repro code! We will try it out and report back our findings.
@jeremymeng no problem, thank you for looking into it. I have edited the script to be smaller now and confirmed the issue still happens.
Edit: To clarify, this script reproduces the bug where TCP connections don't get reused, not the timeout errors
I believe the cause is the following: for each of our clients (BlobServiceClient, ContainerClient, BlobClient, BlockBlobClient, etc.) there's an underlying ServiceClient instance that handles sending requests to and receiving responses from the Azure service. We cache the HTTP connection at the ServiceClient level. However, in our current design two storage clients will not share the same connection, even if they have the same URL to the corresponding Azure service resource (they do share the options from their parent clients). So what's happening in this repro code is that 500 block blob clients, and therefore 500 connections, are created with keepAlive enabled.
@OP-Klaus does your real scenario use blob clients with different locations? If so, you probably don't want to enable keepAlive, because HTTP connections are not shared among the clients and those connections would hang around for a much longer time. If you use the same blob client many times, then it makes sense to enable keepAlive and to cache the blob client instance based on its location.
BTW you can get a block blob client directly from the container client. Here's my attempt to cache the block blob clients:
let _blobClients = {};

const getAzureBlobClient = (containerClient, location) => {
  let client = _blobClients[location];
  if (!client) {
    // Get the block blob client directly from the container client and cache it by location
    client = _blobClients[location] = containerClient.getBlockBlobClient(location);
  }
  return client;
};

const setPageContentToAzure = (location, content) => {
  const containerClient = getAzureContainerClient(); // assumes a cached container client (see sketch below)
  const blockBlobClient = getAzureBlobClient(containerClient, location);
  return blockBlobClient.upload(content, Buffer.byteLength(content)); // byte length, not character count
};
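The snippet assumes a getAzureContainerClient helper; a hypothetical version that caches the container client the same way, reusing the blobServiceClient from the repro script, might look like this:

let _containerClient;

// Hypothetical helper: build the container client once and reuse it afterwards
const getAzureContainerClient = () => {
  if (!_containerClient) {
    _containerClient = blobServiceClient.getContainerClient('courses');
  }
  return _containerClient;
};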
Also it might be useful for the SDK to maintain some cache of clients when getXxxxClient() is called. /cc @bterlson
Also it might be useful for the SDK to maintain some cache of clients
Or we could make our clients share the same HTTP client.
Another workaround: since we allow passing in an HTTP client when creating the client, you can do the following:
const { DefaultHttpClient } = require("@azure/core-http");

const _httpClient = new DefaultHttpClient();

const getContainerClient = (account, key, container) => {
  console.log("creating a new BlobServiceClient to get a container client");
  const sharedKeyCredential = new StorageSharedKeyCredential(account, key);
  const blobServiceClient = new BlobServiceClient(
    `https://${account}.blob.core.windows.net`,
    sharedKeyCredential,
    { httpClient: _httpClient, keepAliveOptions: { enable: true } },
  );
  return blobServiceClient.getContainerClient(container);
};
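With this, every container client created through the helper sends requests through the shared _httpClient, so keep-alive connections can be reused across them. A hypothetical usage (account, key, and the container names are placeholders):

// Both clients share the single _httpClient instance defined above
const containerA = getContainerClient(account, key, "container-a");
const containerB = getContainerClient(account, key, "container-b");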
@azure/[email protected] has been released. I am closing this issue. @js-kyle @OP-Klaus please let us know if you are still seeing other issues.
Just to round this out for people viewing this issue in the future: the timeout problem we were experiencing was a problem with a server in a data center, which we have resolved by migrating.
Thanks for the work to optimise the client caching, and for all the attention this issue received. The support we have received has assured me that this library was a good choice for us.