INFO:root:Epoch[38] Batch [20] Speed: 18.34 samples/sec Train-accuracy=0.544643
INFO:root:Epoch[38] Batch [40] Speed: 18.33 samples/sec Train-accuracy=0.534375
INFO:root:Epoch[38] Batch [60] Speed: 18.32 samples/sec Train-accuracy=0.512500
INFO:root:Epoch[38] Batch [80] Speed: 18.31 samples/sec Train-accuracy=0.481250
INFO:root:Epoch[38] Train-accuracy=0.526786
INFO:root:Epoch[38] Time cost=76.241
INFO:root:Saved checkpoint to "./models/resnet-152/resnet-152-train-224-0039.params"
INFO:root:Epoch[38] Validation-accuracy=0.462500
INFO:root:Epoch[39] Batch [20] Speed: 18.32 samples/sec Train-accuracy=0.559524
INFO:root:Epoch[39] Batch [40] Speed: 18.35 samples/sec Train-accuracy=0.562500
INFO:root:Epoch[39] Batch [60] Speed: 18.33 samples/sec Train-accuracy=0.493750
INFO:root:Epoch[39] Batch [80] Speed: 18.32 samples/sec Train-accuracy=0.503125
INFO:root:Epoch[39] Train-accuracy=0.479167
INFO:root:Epoch[39] Time cost=75.345
[16:24:41] include/dmlc/logging.h:303: [16:24:41] src/io/local_filesys.cc:39: Check failed: std::fwrite(ptr, 1, size, fp_) == size FileStream.Write incomplete
Stack trace returned 10 entries:
[bt] (0) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN4dmlc2io10FileStream5WriteEPKvm+0x305) [0x7fce21576385]
[bt] (1) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZNK5mxnet7NDArray4SaveEPN4dmlc6StreamE+0x7a2) [0x7fce21281a92]
[bt] (2) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(_ZN5mxnet7NDArray4SaveEPN4dmlc6StreamERKSt6vectorIS0_SaIS0_EERKS4_ISsSaISsEE+0xb7) [0x7fce21281ec7]
[bt] (3) /home/yuanshuai/code/mxnet/python/mxnet/../../lib/libmxnet.so(MXNDArraySave+0x595) [0x7fce21168945]
[bt] (4) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call_unix64+0x4c) [0x7fce67532adc]
[bt] (5) /usr/lib/x86_64-linux-gnu/libffi.so.6(ffi_call+0x1fc) [0x7fce6753240c]
[bt] (6) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(_ctypes_callproc+0x48e) [0x7fce677495fe]
[bt] (7) /usr/lib/python2.7/lib-dynload/_ctypes.x86_64-linux-gnu.so(+0x15f9e) [0x7fce6774af9e]
[bt] (8) python(PyEval_EvalFrameEx+0x98d) [0x5244dd]
[bt] (9) python(PyEval_EvalCodeEx+0x2b1) [0x555551]
Traceback (most recent call last):
  File "train_ccs-train.py", line 57, in <module>
    fit.fit(args, sym, data.get_rec_iter)
  File "/home/yuanshuai/code/mxnet/example/image-classification/common/fit.py", line 187, in fit
    monitor = monitor)
  File "/home/yuanshuai/code/mxnet/python/mxnet/module/base_module.py", line 506, in fit
    callback(epoch, self.symbol, arg_params, aux_params)
  File "/home/yuanshuai/code/mxnet/python/mxnet/callback.py", line 58, in _callback
    save_checkpoint(prefix, iter_no + 1, sym, arg, aux)
  File "/home/yuanshuai/code/mxnet/python/mxnet/model.py", line 345, in save_checkpoint
    nd.save(param_name, save_dict)
  File "/home/yuanshuai/code/mxnet/python/mxnet/ndarray.py", line 2102, in save
    keys))
  File "/home/yuanshuai/code/mxnet/python/mxnet/base.py"
I ran into a similar problem. Check whether you are running out of disk space. The `FileStream.Write incomplete` check fails when `fwrite` writes fewer bytes than requested, which is consistent with a full disk. Checkpoint files are fairly large, so if you checkpoint frequently you can easily fill the disk. A single resnet-50 model can be as large as 96 MB.
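One way to catch this before the save fails is to check free space on the checkpoint directory up front. The snippet below is only a rough sketch: `has_free_space` and the 500 MB threshold are hypothetical names and values chosen for illustration, not part of MXNet, and `shutil.disk_usage` requires Python 3.3 or later (the traceback above is from Python 2.7, where `os.statvfs` would be the closest equivalent).

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` bytes free. Hypothetical helper, not an MXNet API."""
    return shutil.disk_usage(path).free >= required_bytes


# Assumed headroom per checkpoint; adjust to the size of your .params files.
REQUIRED_BYTES = 500 * 1024 * 1024

# "." stands in for the checkpoint directory (./models/resnet-152 in the log).
if not has_free_space(".", REQUIRED_BYTES):
    print("Low disk space: the next checkpoint may fail to write")
```

Running a check like this before training starts, or before each save, turns a crash in the middle of a run into an early warning.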
@jk314 It was running out of disk space. Thanks!!! :+1: