Models: im2txt becomes 10x slower after a few thousand steps

Created on 9 Feb 2017 · 4 comments · Source: tensorflow/models

@cshallue I've used im2txt for several months on multiple data sets, and computationally it is wonderful in many ways, but on my machine it invariably becomes at least 10x slower after a few thousand steps. You can see this in the attached log.txt: between steps 1001394-1002828 the model runs at 0.80-0.90 sec/step, and then, abruptly, like a step function, a regime change occurs and every step takes 10-40 seconds...

I haven't been able to figure this out, but I have noticed that it usually happens when I walk away from the machine with no other programs running (like some sort of Schrödinger's cat), because when I "jitter" the system with random commands and programs it often locks back into good performance...

In terms of CPU usage, what I observe is that while the model is computing efficiently, all 12 CPUs average roughly 30% utilization and the system looks healthy. But whenever the model is in its "slow mode", most of the 12 are completely idle and only one is actually doing work, and only sporadically at that.
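
To capture that pattern over time, I've been running a small per-CPU logger alongside training. This is just my own helper (it assumes psutil is installed; nothing here is part of im2txt), but it makes the "all ~30%" vs. "one busy, rest idle" regimes easy to spot in a log:

```python
# Hypothetical monitoring helper, not part of im2txt: logs per-CPU utilization
# once a second so the fast and slow regimes show up as distinct patterns.
import time
import psutil

while True:
    per_cpu = psutil.cpu_percent(interval=1.0, percpu=True)  # one float per logical CPU
    busy = sum(1 for p in per_cpu if p > 10.0)                # CPUs doing meaningful work
    print("busy CPUs: %2d / %d  |  %s" % (busy, len(per_cpu), per_cpu))
```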

I'm not sure whether this is the result of an additional set of threads that TensorFlow launches automatically (e.g. for handling evictions), which might be causing some sort of thrashing on my operating system.
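
One diagnostic I may try (my own sketch, assuming the TF 1.x Session API; im2txt itself doesn't do this out of the box) is explicitly capping TensorFlow's thread pools when building the session, to see whether thread oversubscription is involved:

```python
# Hypothetical diagnostic: cap the intra-op and inter-op thread pools
# to rule out CPU oversubscription as the cause of the slowdown.
import tensorflow as tf

config = tf.ConfigProto(
    intra_op_parallelism_threads=6,   # threads used inside a single op
    inter_op_parallelism_threads=2)   # threads used to run independent ops

with tf.Session(config=config) as sess:
    pass  # build and run the im2txt training graph here
```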

I'm working on a fresh Ubuntu 16.04 install with a Titan X GPU and 12 CPUs, running TensorFlow inside nvidia-docker.

Any thoughts are welcome. I figure others are probably encountering this too, and it's worth fixing if we're using this in production, because a 10x slowdown over the long run is pretty rough...

All 4 comments

@caisq Do you know of any docker issue that might cause this? The mouse jiggling aspect is amusing.

@girving @caisq @cshallue et al.

Following up here: I found that the bouncing back and forth between faster and slower training for im2txt is indeed a discrete, stepwise transition:
[plot: im2txt_training_efficiency]
The above is also highly sensitive to the choice of hyperparameters: when I lower the batch size, for example, the slowdown factor is more like 3x rather than 10-20x. I'm considering hacking together some sort of "jiggler" for the system, but I'm not sure where to start beyond simply restarting training whenever it slows down (the best idea so far, given the nice runway of 30,000 unadulterated steps you see above), or else trying something sillier, like a script that runs while True: print("<3 Tensorflow") across all 12 processors for a little while until the global optimum (of OS harmony) is found...
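
For the record, a minimal sketch of that "jiggler" idea (entirely my own hack, nothing to do with im2txt) would look something like this: it just keeps every logical core lightly busy for a short while to see whether the scheduler snaps training back into its fast regime.

```python
# Hypothetical "jiggler" hack: briefly occupy all logical CPUs with trivial work.
import multiprocessing
import time

def chatter(seconds=30):
    deadline = time.time() + seconds
    while time.time() < deadline:
        print("<3 Tensorflow")

if __name__ == "__main__":
    workers = [multiprocessing.Process(target=chatter)
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```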

Continuing the investigation: GPU utilization is also "hit or miss" / bursty / throttled in im2txt's "slow mode":
[screenshot: GPU utilization during "slow mode"]
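
For anyone who wants to reproduce this observation, a rough polling loop along these lines works (my own helper, assuming nvidia-smi's standard query flags; long stretches near 0% correspond to the bursty slow mode):

```python
# Hypothetical helper: sample GPU utilization once a second via nvidia-smi.
import subprocess
import time

while True:
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ])
    print("GPU util: %s%%" % out.strip().decode("utf-8"))
    time.sleep(1.0)
```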

Great news!

The problem has been solved. The issue was that the default batch size (examples computed per step) was too large. When I lowered it from 32 examples to 16, it's been smooth sailing:
[screenshot: steady sec/step after lowering the batch size]
I'm not sure exactly what was bottlenecking it, as my images were only ~125 kB each, but very happy to have this resolved.
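
For reference, this is roughly where the change goes. This sketch assumes im2txt's configuration module exposes a ModelConfig with a batch_size attribute; the exact names may differ slightly across versions of the repo:

```python
# Sketch of the tweak, assuming im2txt's configuration.ModelConfig has a
# batch_size attribute (check your checkout; names may vary).
from im2txt import configuration

model_config = configuration.ModelConfig()
model_config.batch_size = 16  # default was 32; halving it removed the slowdown here
```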
