I am trying to run this command:
python example/image-classification/train_cifar10.py
And getting the segfault. It is very consistent.
INFO:root:Epoch[0] Batch [20] Speed: 18.95 samples/sec accuracy=0.164807
INFO:root:Epoch[0] Batch [40] Speed: 20.34 samples/sec accuracy=0.246875
INFO:root:Epoch[0] Batch [60] Speed: 21.66 samples/sec accuracy=0.308984
INFO:root:Epoch[0] Batch [80] Speed: 22.15 samples/sec accuracy=0.328125
INFO:root:Epoch[0] Batch [100] Speed: 22.38 samples/sec accuracy=0.362500
INFO:root:Epoch[0] Batch [120] Speed: 21.52 samples/sec accuracy=0.378125
INFO:root:Epoch[0] Batch [140] Speed: 22.96 samples/sec accuracy=0.417969
INFO:root:Epoch[0] Batch [160] Speed: 20.04 samples/sec accuracy=0.426563
INFO:root:Epoch[0] Batch [180] Speed: 18.58 samples/sec accuracy=0.430078
INFO:root:Epoch[0] Batch [200] Speed: 21.39 samples/sec accuracy=0.443750
INFO:root:Epoch[0] Batch [220] Speed: 17.69 samples/sec accuracy=0.469531
INFO:root:Epoch[0] Batch [240] Speed: 18.59 samples/sec accuracy=0.469141
INFO:root:Epoch[0] Batch [260] Speed: 21.68 samples/sec accuracy=0.470313
INFO:root:Epoch[0] Batch [280] Speed: 21.05 samples/sec accuracy=0.487891
INFO:root:Epoch[0] Batch [300] Speed: 22.30 samples/sec accuracy=0.503125
INFO:root:Epoch[0] Batch [320] Speed: 20.98 samples/sec accuracy=0.534766
INFO:root:Epoch[0] Batch [340] Speed: 15.53 samples/sec accuracy=0.526172
INFO:root:Epoch[0] Batch [360] Speed: 13.88 samples/sec accuracy=0.528516
Traceback (most recent call last):
File "example/image-classification/train_cifar10.py", line 79, in <module>
fit.fit(args, sym, data.get_rec_iter)
File "/Users/vikumar/incubator-mxnet/example/image-classification/common/fit.py", line 333, in fit
monitor=monitor)
File "/Users/vikumar/incubator-mxnet/python/mxnet/module/base_module.py", line 563, in fit
next_data_batch = next(data_iter)
File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 228, in __next__
return self.next()
File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 840, in next
check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
File "/Users/vikumar/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:58:31] src/io/recordio_split.cc:29: Check failed: (reinterpret_cast<size_t>(end) & 3UL) == 0U (1 vs. 0)
Stack trace returned 10 entries:
[bt] (0) 0 libmxnet.so 0x000000010ae2cf30 dmlc::StackTrace() + 272
[bt] (1) 1 libmxnet.so 0x000000010ae2ccdf dmlc::LogMessageFatal::~LogMessageFatal() + 47
[bt] (2) 2 libmxnet.so 0x000000010c68a01c dmlc::io::RecordIOSplitter::FindLastRecordBegin(char const*, char const*) + 444
[bt] (3) 3 libmxnet.so 0x000000010c68d10c dmlc::io::InputSplitBase::ReadChunk(void*, unsigned long*) + 284
[bt] (4) 4 libmxnet.so 0x000000010c68d230 dmlc::io::InputSplitBase::Chunk::Load(dmlc::io::InputSplitBase*, unsigned long) + 144
[bt] (5) 5 libmxnet.so 0x000000010c6a0b8f dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()::operator()() const + 895
[bt] (6) 6 libmxnet.so 0x000000010c6a071d void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()> >(void*) + 45
[bt] (7) 7 libsystem_pthread.dylib 0x00007fff903c293b _pthread_body + 180
[bt] (8) 8 libsystem_pthread.dylib 0x00007fff903c2887 _pthread_body + 0
[bt] (9) 9 libsystem_pthread.dylib 0x00007fff903c208d thread_start + 13
build command - make -j8 USE_DIST_KVSTORE=1
os: mac
What to do:
1. o/p of diagnose script:
88e9fe53272d:incubator-mxnet vikumar$ python /tmp/lk.py
----------Python Info----------
Version : 3.6.5
Compiler : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
Build : ('default', 'Apr 26 2018 08:42:37')
Arch : ('64bit', '')
------------Pip Info-----------
Version : 10.0.1
Directory : /Users/vikumar/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/Users/vikumar/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
objc[59758]: Class CaptureDelegate is implemented in both /usr/local/opt/opencv/lib/libopencv_videoio.3.4.dylib (0x10ada5938) and /Users/vikumar/anaconda3/lib/python3.6/site-packages/cv2/cv2.cpython-36m-darwin.so (0x1a17065ce0). One of the two will be used. Which one is undefined.
Version : 1.3.0
Directory : /Users/vikumar/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Darwin-16.7.0-x86_64-i386-64bit
system : Darwin
node : 88e9fe53272d.ant.amazon.com
release : 16.7.0
version : Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64
----------Hardware Info----------
machine : x86_64
processor : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0115 sec, LOAD: 0.5897 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0183 sec, LOAD: 0.2578 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0227 sec, LOAD: 0.1714 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0186 sec, LOAD: 0.2626 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0141 sec, LOAD: 0.3084 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0148 sec, LOAD: 0.0534 sec.
I am trying to build and repro on linux machine.
Thanks for reporting this issue @Vikas89
@mxnet-label-bot [Example, Bug]
I think there was a fix made in relevant area of code: https://github.com/dmlc/dmlc-core/commit/e3377de933d3e069635150fb330ce6cc33b57950#diff-855ba648d1f4003608aa37ba3d060043
I will retry this on latest code and check if I can reproduce this again. I will keep this thread updated.
I am not seeing this issue now, closing this
I encountered the issue again, sorry for the confusion.
If I revert the changes made in this commit in dmlc_core submodule, the issue is gone
https://github.com/dmlc/dmlc-core/commit/e3377de933d3e069635150fb330ce6cc33b57950#diff-855ba648d1f4003608aa37ba3d060043
My changes in dmlc-core/src/io/input_split_base.cc which fixes the problem
+ if (nread != max_size) {
+ *size = nread;
+ return true;
+ } else {
+ const char *bptr = reinterpret_cast<const char*>(buf);
+ // return the last position where a record starts
+ const char *bend = this->FindLastRecordBegin(bptr, bptr + max_size);
+ *size = bend - bptr;
+ overflow_.resize(max_size - *size);
+ if (overflow_.length() != 0) {
+ std::memcpy(BeginPtr(overflow_), bend, overflow_.length());
+ }
+
+ return true;
+ }
+ /* COMMENTOUT
if (nread == olen) { // add extra newline to handle files with NOEOL
char *bufptr = reinterpret_cast<char*>(buf);
bufptr[nread] = '\n';
nread++;
+ LOG(INFO) << " Added extra line to the file";
}
const char *bptr = reinterpret_cast<const char*>(buf);
@@ -241,6 +258,7 @@ bool InputSplitBase::ReadChunk(void *buf, size_t *size) {
std::memcpy(BeginPtr(overflow_), bend, overflow_.length());
}
return true;
+ */
@Vikas89 I was not able to reproduce the bug with dmlc-core hash 0a0e8addf92e1287fd7a25c6314016b8c0138dee. On the other hand, I was able to reproduce it when the dmlc-core submodule was brought up to the latest master. Which commit hash for dmlc-core were you using when running the example?
@Vikas89 @piyushghai I created a minimal reproduction of the bug:
#include <dmlc/io.h>
#include <string>
#include <utility>
#include <vector>
int main(int argc, char** argv) {
std::unique_ptr<dmlc::InputSplit> source(
dmlc::InputSplit::Create("./cifar10_val.rec", 0, 1, "recordio"));
source->BeforeFirst();
dmlc::InputSplit::Blob rec;
size_t sum = 0;
while (source->NextRecord(&rec)) {
sum += rec.size;
}
return 0;
}
It shows that the bug came into existence since dmlc/dmlc-core#452 is merged. For a running example, see https://github.com/hcho3/mxnet-issue12800-repro.
Submitted dmlc/dmlc-core#471, along with a test case.
@hcho3 Is this still an issue or has been fixed after your PR?
This is fixed.