I have just upgraded to Tensorflow 1.4 and Tensorboard 0.4. I had Tensorboard running for 20 hours. It was consuming 10GB of memory. I shut it down and restarted. Its memory consumption is increasing steadily at ~10MB per second.
I observe the same behaviour, especially when there's a lot of data, be it many small experiments, or few long running ones. I've had 64GB memory systems starting to swap after a while when opening 2-3 such tensorboards.
Same observation here.
What's the progress of this issue now? @jart can you elaborate on the reason for this problem?
Any chance you guys could post tensorboard --inspect --logdir mylogdir?
Hi @jart , here's the shell output of tensorboard --inspect --logdir mylogdir for one of my experiments.
sample.txt
Checking in, I'm also getting this in spades and I have to kill tensorboard at least once a day to keep it from grinding everything to a halt.
Here's one, this only got to about 2GB RAM before I shut it down. Other instances have gotten to 10GB as others have reported.
Same here (currently at 6GB). Is there a flag to disable loading the graph for example?
Hi, I am also observing this behavior. Is this fixed in the 1.5 version?
Any updates? Or anybody found a workaround for this problem?
In tensorboard 1.5 the issue is still there. Memory consumption is increasing steadily at ~10MB per second. Here is the output of
tensorboard --inspect --logdir mylogdir
I am having this same issue. The model is a simple LSTM that uses a pre-trained 600k x 300 dimension word embedding. I have 16 model versions and Tensorboard quickly consumes all 64Gb of memory on my machine. I am running Tensorboard 1.5. Here is the inspection log.
What helped in my case was never saving the graph. Make sure you do not add the graph somewhere and also pass graph=None to the FileWriter. Not a real solution, but maybe it helps.
+1
any news on this?
We're currently working on having a DB storage layer that puts information like the graphdef on disk rather than in memory. We'd be happy to accept a contribution that, for example, adds a flag to not load the GraphDef into memory, or perhaps saves a pointer to its file in memory to load it on demand, since the GraphDef is usually the very first thing inside an event log file.
Unfortunately, passing graph=None to the FileWriter didn't solve the issue. It still runs out of memory quite quickly, even with just a few models.
I'm also experiencing this issue with TensorBoard 1.9. Evicting GraphDef from memory might be an okay short-term solution but it's a fixed size, so it should only save a constant amount of memory. The problem for me is memory growth over time.
@jart is someone actively looking into this issue? It's fine if the answer is no, just want to understand where things are. Also, is there any additional information the community can provide to help diagnose what's going on?
I'm having the same thing from tensorboard 1.10
I'm also seeing the same thing with tensorboard 1.12.
Tensorboard occupies more and more memory as time goes by.
I ran it on a server, and it eventually took up to 60GB of memory...
Later I used a workaround:
# Restart tensorboard every few hours to cap its memory usage.
# Assumes ${logdir} and ${port} are already set.
sleep_t=6  # sleep time, in hours
times=0
while true
do
    tensorboard --logdir="${logdir}" --port="${port}" &
    last_pid=$!
    sleep "${sleep_t}h"
    kill -9 "${last_pid}"
    times=$((times + 1))
    echo "Restarted tensorboard ${times} times."
done
Kill and restart tensorboard periodically...
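A variation on the same workaround is to restart only when memory actually gets too high, instead of on a fixed timer. The following is a minimal sketch, not something taken from TensorBoard itself: the function names are hypothetical, it assumes Linux (it reads `VmRSS` from `/proc/<pid>/status`), and the limit and polling interval are placeholders to adjust.

```python
import subprocess
import time


def parse_rss_kb(status_text):
    """Return the VmRSS value (in kB) from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def read_rss_kb(pid):
    """Read the resident memory of a live process. Linux only (uses /proc)."""
    try:
        with open(f"/proc/{pid}/status") as f:
            return parse_rss_kb(f.read())
    except OSError:
        return None


def run_with_memory_cap(cmd, limit_kb, poll_seconds=60):
    """Run cmd, restarting it whenever its resident memory exceeds limit_kb."""
    restarts = 0
    while True:
        proc = subprocess.Popen(cmd)
        while proc.poll() is None:
            time.sleep(poll_seconds)
            rss = read_rss_kb(proc.pid)
            if rss is not None and rss > limit_kb:
                proc.terminate()
                proc.wait()
                restarts += 1
                print(f"Restarted {restarts} times (RSS was {rss} kB).")
                break
        else:
            # The command exited by itself, so stop supervising it.
            return proc.returncode
```

It could be called as, for example, `run_with_memory_cap(["tensorboard", "--logdir", "mylogdir"], limit_kb=8 * 1024 * 1024)` for an 8 GB cap. Like the shell loop, this only hides the growth; it does not fix it.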
I also hit this problem, with 70+GB :(
Guess what? I encounter the same issue, the only difference here is that I ran tensorboard on a server with 512GB memory, and yeah tensorboard ate all memory!!!
Yeah I'm confused why nobody cares about this issue.
A memory leak of this magnitude makes the tool basically useless.
Same problem (I'm using TB 1.12.2). Any news?
Does anyone have a set of logs they can share that reliably reproduces this issue?
I know a few people have tried reproducing this with several examples but did not notice steadily increasing memory. A repro would be really helpful in starting to investigate this!
It would also be especially helpful if this can be reproduced using the latest stable TensorBoard version (1.13.0) or our nightly releases (tb-nightly), and in an environment that we have access to - e.g. a docker container or a standard GCP VM instance.
What type of memory consumption should be considered "normal" for tensorboard?
I've got an instance of tensorboard 1.13.1 that's been running for about a month now, pointed at a logdir that holds info from 127 different runs (the logdir is 3.9 GB on disk), and it's currently consuming 21.8 GB of memory (going by the res column in htop).
Memory usage does appear to be increasing even when I am not actively doing any runs, at a rate of about 1 MB a minute.
As an update, over the last week tensorboard's memory utilization has grown from 21.8 to 27 GB, despite not doing any additional runs
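As a rough sanity check on those two readings, the average growth rate can be worked out directly. This is only back-of-the-envelope arithmetic, assuming exactly 7 days between the readings and 1 GB = 1024 MB; the function name is made up for illustration.

```python
def growth_mb_per_minute(start_gb, end_gb, days):
    """Average growth in MB per minute between two readings (1 GB = 1024 MB)."""
    minutes = days * 24 * 60
    return (end_gb - start_gb) * 1024 / minutes


rate = growth_mb_per_minute(21.8, 27.0, 7)
print(f"{rate:.2f} MB/min")
```

Under those assumptions the average comes out at roughly half a megabyte per minute, the same order of magnitude as the earlier estimate of about 1 MB a minute.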
It seems Tensorboard (at least versions below 1.14) is always loading all logs into RAM even if they are deactivated in the UI.
So the more runs/logs we have, the higher is the memory consumption. It forces us to clean our log folder once in a while to reduce the memory taken by Tensorboard.
A good thing would be to load in memory only logs for runs selected in the Tensorboard UI. That way the memory consumption will stay constant if we just select the last few runs, even if we have thousands of runs in the Tensorboard folder. A warning could be raised if one tries to load too many runs (if memory consumption reaches a limit specified in the TB configuration).
Another good thing would be to release the memory if the TB server has not received any ping from the TB client for a long time. That would prevent an unused TB server from eating up all the memory.
Apart from that, there could also exist a real memory leak...
I have the same issue, Tensorboard 1.14 was at 24 GB after running for about a day. This can't be only due to Tensorboard loading all logs into memory, since the total size of my logs on disk is between 1 and 2 GB. After restarting and waiting for all data to be shown in the UI, Tensorboard now uses 400 MB (although it will probably grow again over time).
Here is a set of logs that reliably causes Tensorboard's memory to increase over time for me (after around 1-2 days it uses several 10s of GBs):
https://drive.google.com/file/d/1h16uu2GsW5qFNLzuqIu1HV7rkNJEUosR/view?usp=sharing
The log directory contains a lot of non-Tensorboard files as well, maybe this is what causes the issue?
I usually leave a browser tab with the Tensorboard client open in the background as well.
I run Tensorboard 1.14 on Ubuntu 16.04 in a docker container. The log directory is in a volume mounted in the container. Let me know if you could use any other information.
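To check whether a log directory really is dominated by non-Tensorboard files, something like the sketch below could help. It assumes event files can be recognised by the substring `tfevents` in their file name (as in `events.out.tfevents.*`); the function name is hypothetical and nothing here is part of TensorBoard.

```python
import os


def count_logdir_files(logdir):
    """Walk logdir and count event files versus everything else."""
    event_files = 0
    other_files = 0
    for _dirpath, _dirnames, filenames in os.walk(logdir):
        for name in filenames:
            # Assumption: event files have "tfevents" in their file name.
            if "tfevents" in name:
                event_files += 1
            else:
                other_files += 1
    return event_files, other_files
```

Calling `count_logdir_files("mylogdir")` returns a pair `(event_files, other_files)`. A very large second number would at least be consistent with the guess that the extra files play a role, though it proves nothing on its own.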
I don't think any contributor cares about this issue. Nobody has
investigated it, although it has been open for years.
So don't expect much to happen.
(Except if someone wants to pay for support or something)
I don't think any contributor care about this issue
@rom1504: We care. I ran some profiling last week over a couple of days,
and will continue investigating as time permits:
This issue brought my machine to a grind today. Tensorboard chewed through ~ 50GB of RAM. @wchargin I appreciate you looking into this issue. Is it possible to raise its priority to P1 so you or someone on the team can carve out the necessary time to chase this issue down?
Seeing the same issue. Leaving it running has it consume 56/64gb of ram on my machine.
Ran into this issue again today. Tensorboard is basically unusable for me due to this bug.
Same here, I'm using tensorboard 2.0.0 and when I open two log directories with 6MB each, it uses 10GB of memory.
Using tensorboard 2.0.0. I have 900 different logs which takes about 200MB of hard disk space. When I start tensorboard, RAM consumption increases by 30GB.
I encountered the same issue. My remote server has 128GB of memory, and Tensorboard ate 35% of it, causing other programs to halt.
Happy 2nd birthday #766! 🎂🥳🎉
Look at how big you are now! You've eaten all the RAM we've given you like a good little bug, and you've grown stronger for it. Best wishes, and see you again in 2020.
Lots of love,
Tensorflow community
The problem still remains in the latest Tensorboard v2.1.0. I have ~10MB of log files and Tensorboard is allocating 13GB of RAM.
Hi folks - we're trying to get to the bottom of this, and we're sorry it's been such a longstanding problem.
For those of you on the thread who have experienced this, it would really help if you can comment with the following information:
python -c "import sys; print(sys.version)"
Hi, I was about to open a new issue for this but found you're already working on it. In my case, Tensorboard used 12GB of RAM and 20% of my CPU resources. I'll provide the details you asked for.
hparam-tuning to my machine, then opened it with tensorboard --logdir C:\Users\Oscar\PycharmProjects\________\hparam-tuning on my own PC to view the results.7.
Additional:
Diagnostics output
``````
--- check: autoidentify
INFO: diagnose_tensorboard.py version d515ab103e2b1cfcea2b096187741a0eeb8822ef
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=7, micro=5, releaselevel='final', serial=0)
INFO: os.name: nt
INFO: os.uname(): N/A
INFO: sys.getwindowsversion(): sys.getwindowsversion(major=10, minor=0, build=18363, platform=2, service_pack='')
--- check: package_management
INFO: has conda-meta: True
INFO: $VIRTUAL_ENV: None
--- check: installed_packages
WARNING: Could not generate requirement for distribution -ensorflow-gpu 2.0.0 (c:\users\oscar\anaconda3\envs_________\lib\site-packages): Parse error at "'-ensorfl'": Expected W:(abcd...)
INFO: installed: tensorboard==2.0.2
INFO: installed: tensorflow-gpu==2.0.0
INFO: installed: tensorflow-estimator==2.0.1
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.0.2'
--- check: tensorflow_python_version
2019-12-20 09:58:34.346839: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
INFO: tensorflow.__version__: '2.0.0'
INFO: tensorflow.__git_version__: 'v2.0.0-rc2-26-g64c3d382ca'
--- check: tensorboard_binary_path
INFO: Could not find files for the given pattern(s).
INFO: which tensorboard: None
--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
socket.SOCK_STREAM =
socket.AI_ADDRCONFIG =
socket.AI_PASSIVE =
Loopback flags:
Loopback infos: [(
Wildcard flags:
Wildcard infos: [(
--- check: readable_fqdn
INFO: socket.getfqdn(): 'Oscar-XPS-Laptop.lan'
--- check: stat_tensorboardinfo
INFO: directory: C:\Users\Oscar\AppData\Local\Temp.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=9570149209514493, st_dev=2585985196, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1576835418, st_mtime=1576835418, st_ctime=1576764046)
INFO: mode: 0o40777
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['C:\Users\Oscar\Anaconda3\envs\____________\lib\site-packages']; bad_roots (0): []
--- check: full_pip_freeze
WARNING: Could not generate requirement for distribution -ensorflow-gpu 2.0.0 (c:\users\oscar\anaconda3\envs________________\lib\site-packages): Parse error at "'-ensorfl'": Expected W:(abcd...)
INFO: pip freeze --all:
absl-py==0.8.1
astor==0.8.1
attrs==19.3.0
backcall==0.1.0
bleach==3.1.0
cachetools==3.1.1
certifi==2019.11.28
chardet==3.0.4
colorama==0.4.1
cycler==0.10.0
decorator==4.4.1
defusedxml==0.6.0
entrypoints==0.3
gast==0.2.2
google-auth==1.8.2
google-auth-oauthlib==0.4.1
google-pasta==0.1.8
grpcio==1.25.0
h5py==2.10.0
idna==2.8
importlib-metadata==1.2.0
ipykernel==5.1.3
ipython==7.10.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
jedi==0.15.1
Jinja2==2.10.3
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==5.3.4
jupyter-console==5.2.0
jupyter-core==4.6.1
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
Markdown==3.1.1
MarkupSafe==1.1.1
matplotlib==3.1.2
mistune==0.8.4
more-itertools==7.2.0
nbconvert==5.6.1
nbformat==4.4.0
notebook==6.0.2
numpy==1.17.4
oauthlib==3.1.0
opt-einsum==3.1.0
pandas==0.25.3
pandocfilters==1.4.2
parso==0.5.1
pickleshare==0.7.5
pip==19.3.1
prometheus-client==0.7.1
prompt-toolkit==3.0.2
protobuf==3.11.1
pyasn1==0.4.8
pyasn1-modules==0.2.7
Pygments==2.5.2
pyparsing==2.4.5
pyrsistent==0.15.6
python-dateutil==2.8.1
pytz==2019.3
pywin32==223
pywinpty==0.5.5
pyzmq==18.1.0
qtconsole==4.6.0
requests==2.22.0
requests-oauthlib==1.3.0
rsa==4.0
Send2Trash==1.5.0
setuptools==42.0.2.post20191203
six==1.13.0
tensorboard==2.0.2
tensorflow-estimator==2.0.1
tensorflow-gpu==2.0.0
termcolor==1.1.0
terminado==0.8.3
testpath==0.4.4
tornado==6.0.3
traitlets==4.3.3
urllib3==1.25.7
wcwidth==0.1.7
webencodings==0.5.1
Werkzeug==0.16.0
wheel==0.33.6
widgetsnbextension==3.5.1
wincertstore==0.2
wrapt==1.11.2
zipp==0.6.0
``````
No action items identified. Please copy ALL of the above output,
including the lines containing only backticks, into your GitHub issue
or comment. Be sure to redact any sensitive information.
Assigning this to @nfelt who is actively looking into this. Please reassign or unassign as appropriate.
Quick update everyone - we think we've narrowed this down to a memory leak in tf.io.gfile.isdir() which we've reported in TensorFlow as https://github.com/tensorflow/tensorflow/issues/35292.
In terms of a fix, it appears that by pure coincidence a change landed in TensorFlow yesterday that replaces the leaking code, so in our testing we're seeing at least a much lower rate of memory leakage when running TensorBoard against today's tf-nightly==2.1.0.dev20191220.
If you're still seeing the issue, please try running TensorBoard in an environment with that version of TensorFlow (the actual version of TensorFlow you use for generating the log data should not affect this) and let us know if it seems to resolve the issue or not.
We will see what we can do to try to work around the issue so that we can get a fix to you sooner than the next TF release that would include yesterday's change (2.2) - if possible we'll see if we can fix this on the TB side so that those who can't easily update TF to the most recent version have access to a fix.
@nfelt hi this is good news, thanks. Curious though: are you planning for an independent tensorboard build with this issue fixed?
Hi all,
I'm running tensorboard without tensorflow, and I no longer experience the huge memory consumption.
Tried to use the tf-nightly==2.1.0.dev20191220 version, but without success, the same problem still remains.
TensorBoard version: 2.2.0a20200106
TensorFlow version: 2.1.0-dev20191220
Python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]
Distributor ID: Ubuntu
Description: Ubuntu 18.04.3 LTS
Release: 18.04
Codename: bionic
pip install tf-nightly==2.1.0.dev20191220
tensorboard --port 9090 --bind_all --logdir ./
RES of htop.
I noticed that if I add a lot of files inside of the logdir folder, TensorBoard throws an exception:
TensorBoard 2.2.0a20200106 at http://anonymized:9090/ (Press CTRL+C to quit)
Exception in thread Reloader:
Traceback (most recent call last):
File "/env/lib/python3.7/threading.py", line 917, in _bootstrap_inner
File "/env/lib/python3.7/threading.py", line 865, in run
File "/env/lib/python3.7/site-packages/tensorboard/backend/application.py", line 660, in _reload
File "/env/lib/python3.7/site-packages/tensorboard/backend/event_processing/plugin_event_multiplexer.py", line 202, in AddRunsFromDirectory
File "/env/lib/python3.7/site-packages/tensorboard/backend/event_processing/io_wrapper.py", line 213, in <genexpr>
File "/env/lib/python3.7/site-packages/tensorboard/backend/event_processing/io_wrapper.py", line 164, in ListRecursivelyViaWalking
File "/env/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 676, in walk_v2
File "/env/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 606, in list_directory
File "/env/lib/python3.7/site-packages/tensorflow_core/python/lib/io/file_io.py", line 635, in list_directory_v2
tensorflow.python.framework.errors_impl.ResourceExhaustedError: ./; Too many open files
The memory issue also happens without adding a lot of small files inside the logdir. But since this recursive process opens a lot of files, it might be one of the root causes of the quick memory growth that happens on startup (as related to tf.io.gfile.isdir()). If the fix really is in tf-nightly==2.1.0.dev20191220, then there might be another leak hidden somewhere in these directory/file handling routines.
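One way to see whether descriptors are piling up before the "Too many open files" error hits is to compare the number of open descriptors of the TensorBoard process with the per-process limit. This is a sketch under assumptions: it is Linux-only (it lists `/proc/<pid>/fd`), the function names are hypothetical, and `soft_fd_limit()` reports the limit of the process running the script, which matches TensorBoard's only if both were started from the same shell with the same settings.

```python
import os
import resource


def open_fd_count(pid):
    """Number of file descriptors currently open in process pid (Linux /proc)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_fd_limit():
    """Soft limit on open file descriptors for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft
```

Polling `open_fd_count(<tensorboard pid>)` every few minutes and watching it climb toward `soft_fd_limit()` would suggest a descriptor leak in the directory walking code. A flat count would point elsewhere.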
Just to add another comment, if I run:
pip uninstall tf-nightly
As suggested by @adizhol, TensorBoard works fine and takes only 310MB of resident memory, which really seems to solve the issue. So it seems that this is definitely caused by tensorflow code. It gives the warning:
TensorFlow installation not found - running with reduced feature set
Which seems to limit the available features on TensorBoard.
Just adding more info, I think I found the culprit.
If you just use (on tensorboard/compat/__init__.py):
from tensorboard.compat.tensorflow_stub import pywrap_tensorflow
To force it to use the pywrap_tensorflow from TensorBoard itself, the memory issue just disappears. However, if you let it import the tensorflow.python.pywrap_tensorflow, which seems to be a swig extension, the memory leak returns. That explains why removing TensorFlow solves the issue. It seems that one method on the TensorFlow's pywrap_tensorflow is leaking a lot of memory.
It changes the memory usage from 16GB to around ~500MB.
I have 16 GB RAM and also suffer from this memory leak problem. After executing tensorboard through command prompt (Windows 10), it shows:
Unable to get first event timestamp for run W11-LSTM64-FC16L0D0-Run_0
W11-LSTM64-FC16L0D0-Run_0 is the name I assigned to my architecture. I ran it approximately one month ago, roughly 200 models ago. I did shut down my PC and restart all the processes, so this is not because I kept the PC running.
There are lots of lines that show the same "Unable to get first event timestamp".
After I moved all the old logs, the lines stopped showing and the problem with memory leak seems to disappear as well. I don't really know what happened but I guess @ismael-elatifi 's guess is correct.
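Moving old logs out of the way can be scripted. The sketch below is just one way to do it, with hypothetical names: it treats every top-level directory of the logdir as one run, and moves a run to an archive directory when nothing inside it has been modified for `max_age_days`. The archive directory is assumed to live outside the logdir.

```python
import os
import shutil
import time


def newest_mtime(path):
    """Most recent modification time of path or of any file below it."""
    latest = os.path.getmtime(path)
    for root, _dirs, files in os.walk(path):
        for name in files:
            latest = max(latest, os.path.getmtime(os.path.join(root, name)))
    return latest


def archive_old_runs(logdir, archive_dir, max_age_days, now=None):
    """Move top-level run directories untouched for max_age_days into archive_dir."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    # archive_dir must not be inside logdir, or it would be scanned as a run.
    os.makedirs(archive_dir, exist_ok=True)
    moved = []
    for entry in sorted(os.listdir(logdir)):
        run_path = os.path.join(logdir, entry)
        if os.path.isdir(run_path) and newest_mtime(run_path) < cutoff:
            shutil.move(run_path, os.path.join(archive_dir, entry))
            moved.append(entry)
    return moved
```

For example, `archive_old_runs("mylogdir", "mylogdir_archive", max_age_days=30)` returns the names of the runs it moved. Tensorboard would need a restart afterwards to drop whatever it had already loaded.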
One practical, easy way that I tried last night is that I right-clicked on the C drive and chose Properties. Then, I chose to clean up and selected Temporary files to be deleted from my computer.
By doing this, I freed up nearly 10 GB of my RAM, which was being used by Tensorboard.
If you’re interested in testing an alpha version of TensorBoard that
loads much faster (~100× throughput) and should have fewer memory leaks,
keep reading…
Caveats:
--samples_per_plugin, are not yet implemented.
To try this out:
Uninstall both tensorboard and tb-nightly, if you have them
installed.
Install nightly TensorBoard and the new data server:
pip install tb-nightly==2.5.0a20210121 tensorboard-data-server==0.2.0
Run TensorBoard as usual, but add the --load_fast argument:
tensorboard --logdir my/logs/ --bind_all --load_fast
On the dataset shared by @paulguerrero above in this thread, it takes
about 18.6 minutes to load without --load_fast, but only 8 seconds to
load with --load_fast.
You also do not need to have TensorFlow installed to use this.
Please provide feedback:
Thanks!