FastAPI: [QUESTION] about a threads issue with FastAPI.

Created on 8 Oct 2019 · 28 comments · Source: tiangolo/fastapi

Hi, I have a question about the threads issue with fastapi.

When I run the example from the tutorial with uvicorn main:app --reload --port 4681 --host 0.0.0.0 and the following main.py:

from fastapi import FastAPI
app = FastAPI()
@app.get("/")
def read_root():
    return {"Hello": "World"}
@app.get("/items/{item_id}")
def read_item(item_id: int, q: str = None):
    return {"item_id": item_id, "q": q}

it shows the following output:

INFO: Started server process [18983]
INFO: Waiting for application startup.
INFO: Uvicorn running on http://0.0.0.0:4681 (Press CTRL+C to quit)

__Then I use ps -o nlwp 18983 to see how many threads this process (18983) is using.__

However, every time I send a request to this service, the number of threads increases, and none are closed. To be more specific, __when I send 1000 requests, this process ends up with 1000 threads running.__

This is problematic because when I tried to serve other, more complicated applications, allocating an arbitrary number of threads eventually ran my machine out of resources.

Is there anything I could have done wrong?
Thanks in advance!

question

All 28 comments

Also interested in this question. I ran it on my machine with multiple workers and observed similar behaviour for each of them. I would like to know what's happening, as we are thinking of using FastAPI for a production service that would be running long term.

Can you share your environment (starlette, fastapi, uvicorn, python, and operating system versions)?

Can you check if you get the same behavior if you change the endpoints to be async def?

Also, can you check if you get the same behavior when running uvicorn without the --reload flag?

I get the same behavior with the --reload flag.
Interestingly, however, this is solved after I use async def ...

My environments:
Ubuntu 16.04.6 LTS
Python 3.6.9 :: Anaconda, Inc.
starlette 0.12.8
fastapi 0.38.1
uvicorn 0.9.0
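
For reference, here are the async variants of the original endpoints. With async def, FastAPI/Starlette runs the function directly on the event loop instead of dispatching it to the threadpool, which is why the thread count stays flat:

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def read_root():
    return {"Hello": "World"}

@app.get("/items/{item_id}")
async def read_item(item_id: int, q: str = None):
    return {"item_id": item_id, "q": q}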

I'm curious - does the same thing happen when using one of the docker images listed here?

It's not surprising that async def solves it: it's because of the use of run_in_threadpool. We need to figure out why the threadpool threads aren't going away or being reused.

In my case read_root and read_item were declared with async def... the problem is that they Depend(authenticate), which, if it is not also declared as async, results in the same behaviour even if the first two are declared as such.

PS: maybe also update the docs: https://fastapi.tiangolo.com/tutorial/security/http-basic-auth/

I can't reproduce this on macOS with Python 3.7. Following exactly the same steps as in @lzhbrian's original post, my thread count after 1000 requests (executed by calling curl in a bash for loop) is just 21.

If you look at the code for starlette's run_in_threadpool function, it just uses the built-in Python ThreadPoolExecutor (with default settings) to run the function. If you look in the standard library, you'll see max_workers = (os.cpu_count() or 1) * 5, so the maximum number of threads per process should be bounded by that value.

Does os.cpu_count() return a very large number on your machine?
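
For reference, you can check that bound directly on a given machine; this is just a quick sketch of the stdlib default described above (the (os.cpu_count() or 1) * 5 formula applies to the Python 3.6/3.7 versions discussed in this thread):

# check_default_bound.py - sketch: compute the default ThreadPoolExecutor bound
import os

# Default max_workers used by ThreadPoolExecutor() on Python 3.5-3.7
default_max_workers = (os.cpu_count() or 1) * 5
print("os.cpu_count() =", os.cpu_count())
print("default max_workers =", default_max_workers)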

__Sorry, I think I have made a mistake.__
__Doing what I described above ended up with only 365 threads.__

My os.cpu_count() returns 72; I guess this makes my max_workers = 360.

However, when I serve other services (e.g. a deep learning model) which already allocate hundreds of threads per worker (say 105 threads per execution), __I did observe the total number of threads of this process increasing by 105 per call.__

I am not sure whether this number is bounded by 105 x max_workers in my case.
I guess my problem is a very large max_workers, which makes for a very large 105 x max_workers upper bound. And it crashes even though it hasn't reached that upper bound?

Any clue?

max_workers = 360 just means that's the maximum number of threads that would be spawned by starlette before they start getting reused.

I'm guessing that the machine you are running on has a boatload of RAM; in that case, is there a problem with having a huge number of threads running?

I'm not sure what you mean by "increasing by 105 per call", but if you have lots of services each triggering lots of threads, you might need to redesign to limit that to some degree.

I don't know how easy it currently is to override the maximum thread count used by starlette's run_in_threadpool. I'm guessing it can't be overridden at all right now, and even if it could, it would require tweaks to FastAPI to expose it.

But if you are willing to get your hands dirty in the fastapi/starlette internals, I don't think it would be hard to expose this parameter somehow. If you want to go down that route though, I would create an issue about it in the starlette repo first; there may be an easier workaround (e.g., monkeypatching os.cpu_count() to return a smaller number).
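
For completeness, here is a rough sketch of the monkeypatching idea (it assumes the default executor has not yet been created when the patch runs; the explicit set_default_executor approach shown later in this thread is cleaner):

# main.py - sketch: cap the default executor size by patching os.cpu_count
# before asyncio lazily creates its default ThreadPoolExecutor.
import os

os.cpu_count = lambda: 2  # ThreadPoolExecutor() will then use 2 * 5 = 10 workers
# NOTE: this affects every other caller of os.cpu_count() in the process.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World"}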

@dmontagu yes - we would also like configurability there.

I would like to link my production configuration question here - https://github.com/tiangolo/fastapi/issues/551

However, we would like to be able to tune the threads, etc., since we actually serve a ton of models.

Also the "increasing by 105 per calling" is worrying. @lzhbrian can you create a minimal example that replicates this ?

My machine does indeed have several hundred gigabytes of RAM. However, I still haven't found out why, when the number of threads reaches several thousand, the process crashes and throws the following error:

OMP: Error #34: System unable to allocate necessary resources for OMP thread:
OMP: System error #11: Resource temporarily unavailable
OMP: Hint: Try decreasing the value of OMP_NUM_THREADS.

__I think in the meantime, I will just stick to adding async to the function to solve the problem.__

And here's a minimal example serving an image classification model using pytorch and pretrainedmodels; it increases the thread count by 36 per call (I am not sure whether this is related to my number of physical CPU cores = 36). After allocating several thousand threads, it crashes with the above OMP error.

main.py

import pretrainedmodels
import torch
import pretrainedmodels.utils as utils

model_name = 'resnet50'
model = pretrainedmodels.__dict__[model_name](num_classes=1000, pretrained='imagenet')
model.eval()
load_img = utils.LoadImage()
tf_img = utils.TransformImage(model) 

path_img = '/Users/lzhbrian/Desktop/download.jpg'
input_img = load_img(path_img)
input_tensor = tf_img(input_img)         # 3x400x225 -> 3x299x299 size may differ
input_tensor = input_tensor.unsqueeze(0) # 3x299x299 -> 1x3x299x299
input = torch.autograd.Variable(input_tensor, requires_grad=False)

from fastapi import FastAPI
app = FastAPI()

@app.get("/items/{item_id}")
def read_item(item_id: int, q: str = None):
    output_logits = model(input) # 1x1000
    return {"item_id": item_id}

Start the API using: uvicorn main:app --port 5423

Bash script for calling the API:

#!/bin/bash
for i in {1..1000}
do
   curl http://127.0.0.1:5423/items/4
done

I’m not very familiar with pytorch but I’d guess there is a better way to use the models for prediction that will persist/reuse the resources as opposed to kicking off more threads every time.
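
Not a FastAPI fix, but in this kind of setup it can also help to cap the threads PyTorch itself spawns per forward pass and to skip autograd for pure inference. A sketch of the relevant changes to the main.py example above (model and input refer to that example; the value 4 is arbitrary and untested against this exact case):

import torch

torch.set_num_threads(4)  # cap intra-op (OMP) threads PyTorch uses per forward pass

@app.get("/items/{item_id}")
def read_item(item_id: int, q: str = None):
    with torch.no_grad():             # inference only: no autograd bookkeeping
        output_logits = model(input)  # 1x1000, as in the example above
    return {"item_id": item_id}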

Hi,
We also serve a lot of models - I'm not sure if FastAPI is having an issue with pytorch or tensorflow specifically. These are very standard toolkits - should this issue be upstreamed?


Hello, I have configured the number of FastAPI threads by setting the default thread pool executor as follows:

# main.py
from concurrent.futures import ThreadPoolExecutor
import asyncio

loop = asyncio.get_running_loop()
loop.set_default_executor(ThreadPoolExecutor(max_workers=5))

I ended up using this solution because Starlette's run_in_threadpool uses the default executor when it calls loop.run_in_executor. I suppose it would be better if the executor could be configured explicitly in the Starlette or FastAPI configuration.
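
The same idea can also be applied in a startup event handler rather than at import time; a minimal sketch, assuming a standard FastAPI app object:

# main.py - sketch: limit the default executor from a startup event handler
import asyncio
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def limit_threadpool():
    # Starlette's run_in_threadpool uses the loop's default executor for
    # non-async endpoints and dependencies, so this caps those threads.
    loop = asyncio.get_running_loop()
    loop.set_default_executor(ThreadPoolExecutor(max_workers=5))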

@lzhbrian do you think this could work for your project?

I am using the FastAPI Project Template with this environment:

Python 3.7.2
fastapi==0.42.0
starlette==0.12.9
uvicorn==0.9.0

@emarbo Seems good, I will try it and post my results here, thanks!

Thanks everyone for the discussion here!

@emarbo is right. Starlette just calls standard Python run_in_executor or the equivalent with context vars.

But yeah, you could fine-tune the executor. In fact, you could even use a process pool executor instead of a thread pool (not that I recommend it).

Hi @tiangolo, I was wondering if this information should be placed in the documentation. I didn't find anything related in FastAPI or Starlette docs.

I believe it is relevant for determining the resources needed to run the application - for instance, the number of database connections. In my case, I faced the same error as described here because I didn't know how many threads the application was launching or how to configure it. In the end, I set the default thread pool executor and the SQLAlchemy pool to the same size.

Let me know if it makes sense to you and, in this case, if I should create a new issue for it.

@emarbo I have a production configuration issue currently open. I would propose merging these issues either here or there. https://github.com/tiangolo/fastapi/issues/551

Hi @sandys, I suppose it isn't my responsibility to decide that :sweat_smile:. This issue belongs to @lzhbrian, and although I agree the topics are related, in my opinion the goals are different, so I would keep them separate.
Here we are clarifying the FastAPI / Starlette threading internals (for non-async endpoints), while your discussion is about configuring Gunicorn workers and threads for production environments.

So, yes, I would like to have a small section in the advanced guide about using run_in_threadpool, but I wouldn't want to go very deep into the technical details, as that varies heavily by use case.

About your question @emarbo , the SQLAlchemy pool size would depend mostly on the DB, not on the threadpool used by FastAPI. This is because a single DB session will be used by the entire request, and it could be handled by different threads at different points (for each dependency), and at the same time, a single thread could handle more than one request at the same time. But the sessions would be separated through the dependencies, with one for each request, independent of which threads handle that request.

The idea of having the same SQLAlchemy pool size as the thread pool size applies to frameworks that handle a whole request in a single thread, but that's not the case here, and the isolation of the session is done not by the thread but by the dependencies.
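
To illustrate the point about session isolation being handled by the dependencies rather than by threads, the usual FastAPI pattern looks roughly like this (a sketch; the database URL and pool settings are placeholders, not recommendations):

# db_example.py - sketch: one SQLAlchemy session per request via a dependency
from fastapi import Depends, FastAPI
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, sessionmaker

# Placeholder URL and pool settings; size the pool for your DB, not the threadpool.
engine = create_engine("postgresql://user:password@localhost/dbname",
                       pool_size=5, max_overflow=10)
SessionLocal = sessionmaker(bind=engine)

app = FastAPI()

def get_db():
    # One session per request, opened and closed by the dependency,
    # independent of which threadpool thread runs the endpoint.
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get("/items/{item_id}")
def read_item(item_id: int, db: Session = Depends(get_db)):
    return {"item_id": item_id}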

Anyway, if you have more questions or any other issue, you could create a new issue :nerd_face:

For now, @lzhbrian , I think your problem was solved, right? May we close this issue?

Could you explain this in terms of some easy rules of thumb that we can use to tune this?
We are planning to go to production with FastAPI, and tuning pool size is one of our biggest questions.
I'm OK with overprovisioning here, so if you could just give a broad hint on how we should calculate this, that would be super helpful.


Sorry @sandys, but that's an advanced topic, and deeply advanced fine-tuning will probably depend a lot on each application and infrastructure.

The quick rule of thumb is to use the defaults from the tutorial. It's all made in a way that should be sufficient for most use cases.

There's a high chance that you don't really need to fine-tune that much for your use case, and the defaults should be fine.

But if you really need to fine-tune to the extreme, then that has to be done with your own domain expertise for your specific use case.


So, @lzhbrian if your question is solved, may we close this issue?

Sure, thanks for the help, everybody.

Request to add the option to change the number of threads to the main API. We have been debugging an issue caused by this for the past few weeks. Our FastAPI server is deployed as a single pod on an OpenShift Kubernetes platform with shared workers. Even though a single pod is limited to 2 CPUs, on a shared worker os.cpu_count() returned 16, which led to a total of 80 threads started by FastAPI. As a result, our server was crashing constantly in OpenShift. Using the workaround above, we were able to limit the thread count to 4. By the way, thanks for the great library @tiangolo; this is the only library with true support for a combination of CPU- and IO-bound endpoints.
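
A sketch of how that workaround can be made configurable per deployment, e.g. by reading a limit from an environment variable set on the pod (FASTAPI_MAX_THREADS is just an illustrative name, not a real FastAPI setting):

# main.py - sketch: cap the default executor from an environment variable
import asyncio
import os
from concurrent.futures import ThreadPoolExecutor

from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def configure_threadpool():
    max_threads = int(os.getenv("FASTAPI_MAX_THREADS", "4"))  # illustrative variable
    loop = asyncio.get_running_loop()
    loop.set_default_executor(ThreadPoolExecutor(max_workers=max_threads))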

Seconded here. In a pod-based production environment, the developer should have control over threads, processes, max queue depth, etc.

Virtualization would screw up the calculations for fastapi/uvicorn.


Thirded here. I have been trying to avoid that workaround, but being able to configure the threadpool if needed would be great!

Chiming in to +1 the question about configuring this in a pod-based production environment.
