I followed the code template here: https://docs.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml
I changed it for my dataset.
I can create the image successfully, and ACI deployment works fine too.
But when I call the API, it times out after almost a minute with the error below. I know my code will take a few minutes to run.
WebserviceException: Received bad response from service:
Response Code: 504
Headers: {'X-Ms-Request-Id': 'c58d065b-ff71-42df-b18e-1caaf23a7f10', 'Date': 'Wed, 06 Feb 2019 00:21:46 GMT', 'Content-Length': '109', 'Content-Type': 'text/plain; charset=utf-8'}
Content: b'Post http://localhost:5001/score: net/http: request canceled (Client.Timeout exceeded while awaiting headers)'
How can I change this timeout?
Thanks!
@blitzprecision
Thanks for the feedback. We are investigating the issue and will update you shortly.
@blitzprecision Could you please let us know if you are able to test the model successfully locally before deploying it as a service?
Yes, it ran fine locally. I also logged into the Docker image and could execute it there from the terminal as well. I thought there was a default 1-hour timeout, but it turns out I am hitting a 1-minute default timeout.
I have run the notebooks mentioned in the tutorial and was able to hit the service without any timeout. There are a couple of things I would suggest checking for your model:
1) I am sure you have changed the input data format and headers to match your model, but I see in the error that the Content-Type header is 'text/plain'. This tutorial uses JSON data as input and will fail if 'application/json' is not passed as the Content-Type header when the input is JSON.
2) The tutorial uses defaults of 1 core and 1 GB of RAM. If your image needs higher values, could you please update aciconfig in your configuration file, then register and redeploy your service?
3) If you need scalable deployments, we recommend using AKS instead of ACI.
4) Could you please check the container logs when you hit the scoring URI for more details on the error?
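For suggestion 2), here is a minimal sketch of what a higher-resource aciconfig could look like, assuming the azureml-core SDK used in the tutorial; the specific core/RAM values and the description are illustrative assumptions, not requirements:

```python
# Deployment configuration sketch for suggestion 2). The values shown are
# assumptions -- pick the cores and RAM your model actually needs.
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(
    cpu_cores=2,     # tutorial default is 1
    memory_gb=8,     # tutorial default is 1 GB
    description="scoring service with more CPU/RAM",
)
```

This configuration object is then passed to the deployment call as in the tutorial; it needs an Azure workspace to actually exercise.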
I have noticed from the logs that the timeout value is set correctly to 3600 seconds:
{"timestamp": "2019-02-06T17:03:13.342725Z", "message": "{\"requestId\": \"af9b9adc-e965-4388-9a71-7237bbd09cd6\", \"message\": \"Scoring Timer is set to 3600.0 seconds\", \"apiName\": \"/score\"}", "host": "wk-caas-ee8ed0a01869451a96a85a1705a0d85e-xxxxxxxxxxxxxxxxxxxxxx", "path": "/var/azureml-app/aml_logger.py", "tags": "%(module)s, %(asctime)s, %(levelname)s, %(message)s", "level": "INFO", "logger": "root", "stack_info": null}
Dear Rohit,
Many thanks for your time.
Here are my score.py and myenv.yml. I should have probably attached them earlier.
files.zip
I think text/plain is the Content-Type of the response.
Also, my workspace is in eastus2.
Also, this works fine when I use runUntil=3 but starts failing with runUntil=4 or more. My input file is 172 MB. When I read the image (22000x22000x3), it creates a matrix "a" of about 1.5 GB. I would suspect RAM to be the issue, except that it also still works with runUntil=2, where the matrix "a" has already been allocated anyway.
Thanks again for your time!
Just tried increasing the RAM to 8 GB. It still gives me the same error with runUntil=4.
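As a side note, the ~1.5 GB figure above is roughly consistent with loading the image as 8-bit values; a quick back-of-envelope check (assuming uint8 on load, with a float32 cast as a plausible worst case):

```python
# Back-of-envelope memory estimate for a 22000x22000x3 image.
# Assumes 1 byte per value when loaded (uint8) and 4 bytes if cast to float32.
width, height, channels = 22000, 22000, 3
n_values = width * height * channels             # 1,452,000,000 values
gib_uint8 = n_values / 1024**3                   # ~1.35 GiB as uint8
gib_float32 = n_values * 4 / 1024**3             # ~5.41 GiB as float32
```

If the scoring code casts the array to float32 or float64, peak memory can approach or exceed even the 8 GB configuration, which could matter independently of the timeout.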
@blitzprecision The service logs attached indicate the timeout is set to 3600000 ms (3600 seconds). You can check the container logs from the portal.
Please navigate to the resource group you are using and open the Container Instances resource. Then navigate to Settings => Containers => select the svc container => Logs.
Refresh the log after the API call and it should log more information to understand the issue.
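If it is more convenient, the same container logs can also be pulled from the command line; a sketch assuming the Azure CLI is installed and logged in, with placeholder resource names:

```shell
# Fetch logs from the scoring container group.
# myResourceGroup and myContainerGroup are placeholders for your own names.
az container logs \
    --resource-group myResourceGroup \
    --name myContainerGroup
```
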
Since this forum is restricted to issues with the Azure documentation or the steps in this tutorial, we recommend posting the issue on the MSDN ML forums so that the extended community can also pitch in with their thoughts.
Thanks Rohit. Here are the logs. They clearly show that the service times out after about a minute.
Not sure why this timeout occurs.
2019/02/07 11:13:05 Setting up HTTP server with sslEnabled: false
2019/02/07 11:13:05 Routing to: http://localhost:5001
2019/02/07 11:13:05 Authorization is disabled, continuing with no authorization keys.
2019/02/07 11:13:05 Setting up AppInsights logging.
2019/02/07 11:13:10 Incoming request: a250b937-8b1e-466f-b903-8286591ca6b8 GET /
2019/02/07 11:13:14 RequestID: a250b937-8b1e-466f-b903-8286591ca6b8 Returned Status Code: 200 Total request time: 3.864929635s Overhead: 42.3µs Backend call: 3.864887335s
2019/02/07 11:13:21 Incoming request: 0815906a-3531-4938-9fde-70587f3b37c4 GET /
2019/02/07 11:13:21 RequestID: 0815906a-3531-4938-9fde-70587f3b37c4 Returned Status Code: 200 Total request time: 1.410505ms Overhead: 7.7µs Backend call: 1.402805ms
2019/02/07 11:13:23 Incoming request: b6c80611-cb24-43b7-b438-b5bf52cf1be0 POST /score
2019/02/07 11:13:23 Authorization disabled. Continuing call.
2019/02/07 11:13:30 Incoming request: b4bddbe6-c276-41c6-919c-680de847c68f GET /score
2019/02/07 11:13:30 Authorization disabled. Continuing call.
2019/02/07 11:13:41 Incoming request: 9e21d677-2f1e-478a-bb16-e13f2be14b89 HEAD /score
2019/02/07 11:13:41 Incoming request: 76846ca5-f04a-4df3-b6e7-3fa9de92b306 GET /score
2019/02/07 11:13:41 Authorization disabled. Continuing call.
2019/02/07 11:14:23 Post http://localhost:5001/score: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2019/02/07 11:14:23 RequestID: b6c80611-cb24-43b7-b438-b5bf52cf1be0 Returned Status Code: 504 Total request time: 1m0.000283223s Overhead: 8.9µs Backend call: 1m0.000274323s
@blitzprecision The logs of the SVC container should provide some insight into the errors from the Python scripts. The corresponding request ID should have more details. For example, an error for a request I simulated to fail shows incorrect padding of the data passed:
2019/02/07 12:01:38 RequestID: 6dfd19f9-3037-458a-8a08-76273cc851ce Returned Status Code: 502 Total request time: 11.166434ms Overhead: 7.5µs Backend call: 11.158934ms
Log for the same request ID in the other container:
Error: Incorrect padding\n\nDuring handling of the above exception, another exception occurred
I would also suggest adding a timeout parameter to the requests.post() call. Per the documentation:
timeout (float or tuple) – (optional) How many seconds to wait for the server to send data before giving up, as a float, or a (connect timeout, read timeout) tuple.
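A sketch of what such a call could look like; the URL and payload are placeholders, and the (connect, read) tuple is in seconds per the documentation quoted above:

```python
# Sketch: a scoring call with an explicit client-side timeout.
# The scoring URL and payload shape are placeholders, not the real endpoint.
import requests

def call_scoring(url, payload, connect_timeout=10, read_timeout=600):
    """POST JSON to the scoring URI; wait up to read_timeout for a response."""
    resp = requests.post(
        url,
        json=payload,
        headers={"Content-Type": "application/json"},
        timeout=(connect_timeout, read_timeout),  # (connect, read) in seconds
    )
    resp.raise_for_status()
    return resp.json()
```

Note that this parameter only controls when the client gives up (raising requests.exceptions.ReadTimeout); it cannot extend a timeout enforced by the service's own gateway.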
Also, please post this question on the MSDN community forum or Stack Overflow.
Once you post your issue on the forums, it will have visibility across the community, which is a better-suited audience for these types of issues.
Please create a forum thread, post its link here, and tag me. I will now proceed to close this issue. Feel free to comment if you have any other queries.
Thanks Rohit. I tried the timeout parameter -- it did not work. As mentioned on the same page, 504 (which I am getting) refers to a gateway timeout; 408 refers to a request timeout.
Here are the full logs from the SVC container. I have masked out the storage account name and key.
SVC_logs.txt
I can see that the worker timeout is being reset to 300 here for some reason. This is worrisome. Ideas?
I could not see the error mentioned in the logs you attached. Try using debug-level logging. Also, as per the tutorial, we need to load the model in score.py's init():
from azureml.core.model import Model
from sklearn.externals import joblib

def init():
    global model
    # retrieve the path to the model file using the model name
    model_path = Model.get_model_path('sklearn_mnist')
    model = joblib.load(model_path)
I do not see this in your score.py file. Could you please recheck the implementation once?
Please create a forum thread on the MSDN community forum or Stack Overflow, where the community can provide insights on the issues you are facing.
Since the documentation for this tutorial works fine, I will close this issue here. Please update this thread with the MSDN or SO links once posted.
Thanks Rohit. It was clearly incorrect to close this ticket without explaining why the service throws a timeout error.
Also, not sure why you could not see the error in the logs. Please open the logs and look at line 16. You will find a string "worker timeout is set to 300".
Not loading a model should certainly not lead to a timeout error. Any better resolution?
Please note that we can certainly help with any documentation issues for this tutorial on this forum. The steps mentioned in the documentation work for the examples provided; for any custom implementation issues, we suggest posting on MSDN or SO, as I mentioned prior to closing. The worker-timeout setting is not an error, and it should be configurable in the Docker config file if required. I was actually referring to the 408 error mentioned in your earlier comment, as I could not spot it in the logs. Apologies for the confusion. Please add debug logging and check whether the model loads correctly. If there are any further issues, feel free to open a thread on MSDN or SO with a reference to this one.
Thanks. I thought it was obvious, but let me explain anyway.
408 refers to a request timeout. Adding a timeout parameter to the requests.post() call only fixes request timeouts. Clearly that is not what we need to fix since, as you also noticed, we do not have a 408 error in our logs; adding the timeout parameter was never going to help.
We have a 504 error, which is what I mentioned in my original post. Thanks for your time anyway.
@blitzprecision I have found a similar issue on an MSDN forum, along with the suggestions the community provided for it. We will also reach out to the product group team to check whether they can provide more documentation on ACI deployments.
Thanks Rohit. I looked at the thread; this surely helps. Since I expect this job to take some time, I will try an AmlCompute VM cluster for batch scoring instead. I appreciate your response on this even though the issue was closed.
@RohitMungi-MSFT If the 1-minute limit is actually enforced on ACI (as stated in https://social.msdn.microsoft.com/Forums/en-US/3f0d08cd-4a53-4445-ab11-597725115793/after-deploying-ml-model-got-time-out-when-call-the-service?forum=AzureMachineLearningService), it would be great to have it stated in the docs (at least, e.g., in https://docs.microsoft.com/en-us/azure/container-instances/container-instances-troubleshooting).
This 504 response can indeed be puzzling otherwise.
Hello everyone,
Has anyone found a solution to this issue? I have the same problem with the Azure ML service. Can we set the worker timeout via the image config in the Python SDK, or with an Azure App Service environment variable?
Encountered the same problem and used a trick to get around it.
Two tricks might be helpful:
a) Increase the number of cores when deploying the model. This reduces the time cost, so the running time may drop below the 1-minute threshold.
b) Split the data into many smaller pieces, send them individually, then combine the results.
Nevertheless, they're just tricks. I look forward to a better solution.
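Suggestion b) above can be sketched in plain Python; here score_chunk is a hypothetical stand-in for whatever posts one piece of the data to the service:

```python
# Sketch of suggestion b): send the input in smaller pieces so each request
# finishes well under the gateway's 1-minute limit, then stitch the results.

def split(data, n_chunks):
    """Split a list into n_chunks roughly equal pieces, preserving order."""
    size, rem = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def score_in_pieces(data, n_chunks, score_chunk):
    """Score each piece with one fast request and combine the results."""
    results = []
    for chunk in split(data, n_chunks):
        results.extend(score_chunk(chunk))
    return results
```
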
Abandoning the ACI deployment also seems like a good solution, but I didn't try it.
Had the exact same issue. Thanks to @SleepyRoy for the suggestion. The issue was that there was not enough compute to start the ML model defined in the init() function of the score.py file.
The solution was to add more cores (2 CPUs) and 8 GB of RAM so that init() runs faster and finishes before the 300-second timeout.