Version: 5.6.0
Browser: Version 67.0.3396.99 (Official Build) (64-bit)
When fitting a fairly simple dense model (100-125-75 nodes per layer), we hit an out-of-memory exception. Performance also noticeably drops over time.
I'm training multiple models on a small dataset (~100 rows, 16 inputs per row). We get this exception on the second run, around epoch 1000.
The machine has 16 GB of RAM and an AMD RX 480 video card.
If we continue past the exception, training progresses, but still very slowly.
I'm training for a lot of epochs.
Here is the full training function.
async trainModel(model, Xdata, Ydata) {
  const xs = tf.tensor2d(Xdata);
  const ys = tf.tensor2d(Ydata);
  const yscaled = ys.mul(this.scaleFactor);
  await model.fit(xs, yscaled, {
    batchSize: 25,
    epochs: this.epochs,
    callbacks: {
      onEpochEnd: async (epoch, log) => {
        console.log(`Epoch ${epoch}: loss = ${log.loss}`);
      }
    }
  });
  // Quick sanity check on the fitted values
  let ypred = model.predict(xs);
  let ypredDescaled = ypred.div(this.scaleFactor);
  let pdata = ypredDescaled.dataSync();
  for (let i = 0; i < pdata.length; i++) {
    // TODO: evaluate differences
  }
  // tf.tidy() can't wrap an async function, so dispose of every allocated tensor manually
  xs.dispose();
  ys.dispose();
  yscaled.dispose();
  ypred.dispose();
  ypredDescaled.dispose();
}
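Since the function is async, tf.tidy() can't wrap it as a whole, but the synchronous post-training block could be tidied so its intermediates are disposed automatically. A minimal sketch of that variant (my rewording, not the original author's code):

const pdata = tf.tidy(() => {
  // Every tensor created inside the callback is disposed when tidy returns;
  // dataSync() yields a plain TypedArray, which tidy leaves alone, so pdata survives.
  const ypred = model.predict(xs);
  return ypred.div(this.scaleFactor).dataSync();
});

Note that xs, ys, and yscaled would still need manual dispose() calls, since they are created outside the tidy.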
I removed the logging callback (and the post-training predict) and lowered the epochs to 2500. This improved things considerably: about 14 models trained before the exception was thrown. Again, continuing seemed to resume training (although I haven't yet verified the trained models).
Having a similar issue with a super basic model. I'm running on a 2014 Dell XPS 15 though (i7-4712HQ, GT750M).
TensorFlow.js version
0.11.7
Browser Version
Version 67.0.3396.99 (Official Build) (64-bit)
Describe the problem
Performance degrades quickly and it soon completely locks up my Chrome tab (even though fit() should be async?), after which I have to kill it with the Chrome task manager. I'm not getting any 'this tab is slowing down your browser' alerts either.
const xs = tf.tensor2d([[0], [1]])
const ys = tf.tensor2d([[1], [0]])
const predictXs = tf.tensor2d([[0], [0.5], [1]])

const model = tf.sequential()
const hidden = tf.layers.dense({
  units: 4,
  inputShape: [1],
  activation: 'sigmoid'
})
model.add(hidden)

const output = tf.layers.dense({
  units: 1,
  activation: 'sigmoid'
})
model.add(output)

const sgdOpt = tf.train.sgd(0.1)
model.compile({
  optimizer: sgdOpt,
  loss: 'meanSquaredError'
})

const fitConfig = {
  epochs: 4000
}

model.fit(xs, ys, fitConfig).then((result) => {
  console.log(result.history.loss[0])
  let outputs = model.predict(predictXs)
  outputs.print()
})
I know I'm doing a sigmoid activation on a linear regression. I'm just trying to figure out how this all works. It doesn't have to be sensible as long as I can get a grasp on what's going on.
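One way to keep the tab responsive during a long fit() (a hedged sketch of a common pattern, not a confirmed fix for the lock-up): await tf.nextFrame() in an onEpochEnd callback. It resolves on the next requestAnimationFrame, so control is handed back to the browser between epochs.

model.fit(xs, ys, {
  epochs: 4000,
  callbacks: {
    onEpochEnd: async (epoch, logs) => {
      if (epoch % 100 === 0) console.log(`epoch ${epoch}: loss = ${logs.loss}`)
      await tf.nextFrame() // let the browser repaint before the next epoch
    }
  }
}).then(() => {
  model.predict(predictXs).print()
})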
It would be great for tfjs to have some sort of basic memory check against the target user's system. I've seen issues on low-spec systems such as Microsoft Surface tablets: they technically have some VRAM, but the amount is so low that tfjs should just use the CPU. From what I've read it's impossible for WebGL to reveal how much VRAM is available to it (a security issue?), which is a real pain in the ass; the only way around it is to identify low-spec systems by their User Agent, as sketched below.
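A hedged illustration of that User Agent workaround; the detection pattern here is entirely hypothetical, and a real implementation would need a curated list of known low-spec devices:

// Hypothetical pattern; maintain a real list of known low-spec devices in practice.
const LOW_SPEC_UA = /Surface|some-low-spec-device/i;
if (LOW_SPEC_UA.test(navigator.userAgent)) {
  tf.setBackend('cpu'); // skip WebGL entirely when VRAM is likely too scarce
}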
This looks like it might be a memory leak. Could you check how many tensors are created as your program runs, using tf.memory()? It returns an object with a numTensors property; if that is always increasing, that would indicate a memory leak. @nwesthoff, for your program you may want to add an onEpochEnd callback to print the number of tensors after each epoch.
We do have some work coming in future releases on preventing runaway memory leaks from bringing down the program (though it is still ideal not to leak memory).
cc @caisq, does model.fit call tf.nextFrame internally?
@tafsiri I'm having trouble with the onEpochEnd callback: I can make it fire by calling fitConfig.onEpochEnd() manually, but it isn't fired by the training function.
const fitConfig = {
  epochs: 2000,
  shuffle: true,
  onEpochEnd: () => {
    console.log(tf.memory().numTensors + " tensors")
  }
}
model.fit(tf.tensor(xs), tf.tensor(ys), fitConfig).then(() => {
  // stuff happens
})
Calling fitConfig.onEpochEnd() manually after the 2000 epochs reports 48 tensors, though I'm not sure how useful this is, as it doesn't show progress during training.
@nwesthoff the syntax for onEpochEnd looks more like this (it is part of the callbacks option; see model.fit for more details):
const fitConfig = {
  epochs: 2000,
  shuffle: true,
  callbacks: {
    onEpochEnd: () => {
      console.log(tf.memory().numTensors + " tensors")
    }
  }
}
model.fit(tf.tensor(xs), tf.tensor(ys), fitConfig).then(() => {
  // stuff happens
})
Try that and see if it works.
@tafsiri Ah, I misunderstood the docs; that works indeed. When running 2000 epochs it goes up from 279 tensors all the way to 2278 tensors (curiously close to one tensor per epoch). It doesn't reset after running the fit function on new data either, so this does indeed look like a memory leak. I've just read about the tf.dispose and tf.tidy functions, though I'm not sure how to apply them here, as the leak happens during model fitting. Is this something I can fix using the Layers API?
Going to refer this to @caisq in case this is because of a memory leak inside model.fit. We did have one that was fixed here: https://github.com/tensorflow/tfjs-layers/pull/252, so a fix may be incoming.
Any update on this issue? I'm also having issues with fit() leaking memory: after 10 epochs I run out of memory, and each epoch is very small too (16 32x32 RGB images). I have 2 GB of VRAM + 8 GB shared.
Thanks
Ok, so not sure if this helps, but I tested my code running on Node.js/Ubuntu as well. I had similar results there, with my entire 12 GB of RAM being eaten when trying to train a moderately sized model (65K rows, 2000 epochs). I would start with around 500 MB of memory, then grow to well over 2 GB per process (running 4 simultaneous trainings).
NOTE: I'm running on the CPU for this test.
It appears that (for Node, anyway) the leak is on the JS side. Periodically printing process.memoryUsage() shows that the heap and RSS grow continuously while training, while external stays fairly constant. I assume tensors count as external memory, rather than anything on the JS heap.
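For reference, a minimal sketch of that monitoring (my reconstruction, not the original code), logging the three process.memoryUsage() figures at the end of each epoch:

model.fit(xs, ys, {
  epochs: 2000,
  callbacks: {
    onEpochEnd: (epoch) => {
      const m = process.memoryUsage();
      const mb = (n) => (n / 1048576).toFixed(0); // bytes -> MB
      console.log(`epoch ${epoch}: heapUsed=${mb(m.heapUsed)}MB rss=${mb(m.rss)}MB external=${mb(m.external)}MB`);
    }
  }
});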
Some things I tried:
- Calling GC() in the epoch callback did not resolve the issue.
- Running multiple iterations of model.fit, with a GC between iterations, did not resolve the issue.
- Running multiple iterations of model.fit by saving the model, releasing it, calling GC(), calling tf.disposeVariables(), reloading the model, and continuing training still did not fix the issue.
- All of the above, plus calling tf.setBackend(tf.getBackend()), FINALLY FIXED IT!

I now do a full flush of the system every 250 epochs (sketched below). All of this significantly improved training performance and dropped my total memory usage to a nice low 200 MB, even lower than the starting conditions.
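For anyone wanting to replicate it, here is a minimal sketch of that periodic full flush under tfjs-node, using current API names (tf.loadLayersModel; in the 0.x releases the loader was tf.loadModel). The checkpoint path, compile arguments, and 250-epoch interval are illustrative, and global.gc() only exists when Node is started with --expose-gc.

const tf = require('@tensorflow/tfjs-node');

async function trainWithPeriodicFlush(model, xs, ys, totalEpochs, compileArgs) {
  const flushEvery = 250; // epochs between full flushes (illustrative)
  for (let done = 0; done < totalEpochs; done += flushEvery) {
    await model.fit(xs, ys, { epochs: Math.min(flushEvery, totalEpochs - done) });
    // Full flush: persist the model, release everything, rebuild the backend.
    await model.save('file://./flush-checkpoint'); // hypothetical path
    model = null;
    if (global.gc) global.gc();      // run with: node --expose-gc train.js
    tf.disposeVariables();           // drop all tracked variables
    tf.setBackend(tf.getBackend());  // re-select the backend, per the workaround above
    model = await tf.loadLayersModel('file://./flush-checkpoint/model.json');
    model.compile(compileArgs);      // a reloaded model must be re-compiled before fit()
    // Caution: if the backend reset invalidates xs/ys, recreate them here as well.
  }
  return model;
}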
I got the same problem, specifically with the tfjs-node backend. It turned out that the easiest solution that worked for me was to use jemalloc; this Stack Overflow answer explains pretty well what you have to do (the gist, as I understand it: install libjemalloc and preload it via LD_PRELOAD when starting Node).