Incubator-mxnet: getting segfault while running train_cifar10.py program in example directory

Created on 11 Oct 2018 · 9Comments · Source: apache/incubator-mxnet

I am trying to run this command:
python example/image-classification/train_cifar10.py

And getting the segfault. It is very consistent.

INFO:root:Epoch[0] Batch [20]   Speed: 18.95 samples/sec    accuracy=0.164807
INFO:root:Epoch[0] Batch [40]   Speed: 20.34 samples/sec    accuracy=0.246875
INFO:root:Epoch[0] Batch [60]   Speed: 21.66 samples/sec    accuracy=0.308984
INFO:root:Epoch[0] Batch [80]   Speed: 22.15 samples/sec    accuracy=0.328125
INFO:root:Epoch[0] Batch [100]  Speed: 22.38 samples/sec    accuracy=0.362500
INFO:root:Epoch[0] Batch [120]  Speed: 21.52 samples/sec    accuracy=0.378125
INFO:root:Epoch[0] Batch [140]  Speed: 22.96 samples/sec    accuracy=0.417969
INFO:root:Epoch[0] Batch [160]  Speed: 20.04 samples/sec    accuracy=0.426563
INFO:root:Epoch[0] Batch [180]  Speed: 18.58 samples/sec    accuracy=0.430078
INFO:root:Epoch[0] Batch [200]  Speed: 21.39 samples/sec    accuracy=0.443750
INFO:root:Epoch[0] Batch [220]  Speed: 17.69 samples/sec    accuracy=0.469531
INFO:root:Epoch[0] Batch [240]  Speed: 18.59 samples/sec    accuracy=0.469141
INFO:root:Epoch[0] Batch [260]  Speed: 21.68 samples/sec    accuracy=0.470313
INFO:root:Epoch[0] Batch [280]  Speed: 21.05 samples/sec    accuracy=0.487891
INFO:root:Epoch[0] Batch [300]  Speed: 22.30 samples/sec    accuracy=0.503125
INFO:root:Epoch[0] Batch [320]  Speed: 20.98 samples/sec    accuracy=0.534766
INFO:root:Epoch[0] Batch [340]  Speed: 15.53 samples/sec    accuracy=0.526172
INFO:root:Epoch[0] Batch [360]  Speed: 13.88 samples/sec    accuracy=0.528516
Traceback (most recent call last):
  File "example/image-classification/train_cifar10.py", line 79, in <module>
    fit.fit(args, sym, data.get_rec_iter)
  File "/Users/vikumar/incubator-mxnet/example/image-classification/common/fit.py", line 333, in fit
    monitor=monitor)
  File "/Users/vikumar/incubator-mxnet/python/mxnet/module/base_module.py", line 563, in fit
    next_data_batch = next(data_iter)
  File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 228, in __next__
    return self.next()
  File "/Users/vikumar/incubator-mxnet/python/mxnet/io/io.py", line 840, in next
    check_call(_LIB.MXDataIterNext(self.handle, ctypes.byref(next_res)))
  File "/Users/vikumar/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [09:58:31] src/io/recordio_split.cc:29: Check failed: (reinterpret_cast<size_t>(end) & 3UL) == 0U (1 vs. 0) 

Stack trace returned 10 entries:
[bt] (0) 0   libmxnet.so                         0x000000010ae2cf30 dmlc::StackTrace() + 272
[bt] (1) 1   libmxnet.so                         0x000000010ae2ccdf dmlc::LogMessageFatal::~LogMessageFatal() + 47
[bt] (2) 2   libmxnet.so                         0x000000010c68a01c dmlc::io::RecordIOSplitter::FindLastRecordBegin(char const*, char const*) + 444
[bt] (3) 3   libmxnet.so                         0x000000010c68d10c dmlc::io::InputSplitBase::ReadChunk(void*, unsigned long*) + 284
[bt] (4) 4   libmxnet.so                         0x000000010c68d230 dmlc::io::InputSplitBase::Chunk::Load(dmlc::io::InputSplitBase*, unsigned long) + 144
[bt] (5) 5   libmxnet.so                         0x000000010c6a0b8f dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()::operator()() const + 895
[bt] (6) 6   libmxnet.so                         0x000000010c6a071d void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, dmlc::ThreadedIter<dmlc::io::InputSplitBase::Chunk>::Init(std::__1::function<bool (dmlc::io::InputSplitBase::Chunk**)>, std::__1::function<void ()>)::'lambda'()> >(void*) + 45
[bt] (7) 7   libsystem_pthread.dylib             0x00007fff903c293b _pthread_body + 180
[bt] (8) 8   libsystem_pthread.dylib             0x00007fff903c2887 _pthread_body + 0
[bt] (9) 9   libsystem_pthread.dylib             0x00007fff903c208d thread_start + 13

Environment info

build command - make -j8 USE_DIST_KVSTORE=1
os: mac

What to do:
1. o/p of diagnose script:
88e9fe53272d:incubator-mxnet vikumar$ python /tmp/lk.py
----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)
Build        : ('default', 'Apr 26 2018 08:42:37')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /Users/vikumar/anaconda3/lib/python3.6/site-packages/pip
----------MXNet Info-----------
/Users/vikumar/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
objc[59758]: Class CaptureDelegate is implemented in both /usr/local/opt/opencv/lib/libopencv_videoio.3.4.dylib (0x10ada5938) and /Users/vikumar/anaconda3/lib/python3.6/site-packages/cv2/cv2.cpython-36m-darwin.so (0x1a17065ce0). One of the two will be used. Which one is undefined.
Version      : 1.3.0
Directory    : /Users/vikumar/incubator-mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Darwin-16.7.0-x86_64-i386-64bit
system       : Darwin
node         : 88e9fe53272d.ant.amazon.com
release      : 16.7.0
version      : Darwin Kernel Version 16.7.0: Thu Jun 21 20:07:39 PDT 2018; root:xnu-3789.73.14~1/RELEASE_X86_64
----------Hardware Info----------
machine      : x86_64
processor    : i386
b'machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI'
b'machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT'
b'machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C'
b'machdep.cpu.brand_string: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz'
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0115 sec, LOAD: 0.5897 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0183 sec, LOAD: 0.2578 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0227 sec, LOAD: 0.1714 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0186 sec, LOAD: 0.2626 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0141 sec, LOAD: 0.3084 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0148 sec, LOAD: 0.0534 sec.

I am trying to build and repro on linux machine.

Bug Example

Source

Vikas-kum

All 9 comments

Thanks for reporting this issue @Vikas89
@mxnet-label-bot [Example, Bug]

piyushghai on 11 Oct 2018

I think there was a fix made in relevant area of code: https://github.com/dmlc/dmlc-core/commit/e3377de933d3e069635150fb330ce6cc33b57950#diff-855ba648d1f4003608aa37ba3d060043

I will retry this on latest code and check if I can reproduce this again. I will keep this thread updated.

Vikas-kum on 11 Oct 2018

I am not seeing this issue now, closing this

Vikas-kum on 11 Oct 2018

👍1

I encountered the issue again, sorry for the confusion.

If I revert the changes made in this commit in dmlc_core submodule, the issue is gone
https://github.com/dmlc/dmlc-core/commit/e3377de933d3e069635150fb330ce6cc33b57950#diff-855ba648d1f4003608aa37ba3d060043

My changes in dmlc-core/src/io/input_split_base.cc which fixes the problem

+  if (nread != max_size) {
+    *size = nread;
+    return true;
+  } else {
+    const char *bptr = reinterpret_cast<const char*>(buf);
+    // return the last position where a record starts
+    const char *bend = this->FindLastRecordBegin(bptr, bptr + max_size);
+    *size = bend - bptr;
+    overflow_.resize(max_size - *size);
+    if (overflow_.length() != 0) {
+      std::memcpy(BeginPtr(overflow_), bend, overflow_.length());
+    }
+  
+    return true;
+  }
+    /* COMMENTOUT
   if (nread == olen) {  // add extra newline to handle files with NOEOL
     char *bufptr = reinterpret_cast<char*>(buf);
     bufptr[nread] = '\n';
     nread++;
+    LOG(INFO) << " Added extra line to the file";
   }

   const char *bptr = reinterpret_cast<const char*>(buf);
@@ -241,6 +258,7 @@ bool InputSplitBase::ReadChunk(void *buf, size_t *size) {
     std::memcpy(BeginPtr(overflow_), bend, overflow_.length());
   }
   return true;
+  */

Vikas-kum on 18 Oct 2018

👍1

@Vikas89 I was not able to reproduce the bug with dmlc-core hash 0a0e8addf92e1287fd7a25c6314016b8c0138dee. On the other hand, I was able to reproduce it when the dmlc-core submodule was brought up to the latest master. Which commit hash for dmlc-core were you using when running the example?

hcho3 on 19 Oct 2018

@Vikas89 @piyushghai I created a minimal reproduction of the bug:

#include <dmlc/io.h>
#include <string>
#include <utility>
#include <vector>

int main(int argc, char** argv) {
  std::unique_ptr<dmlc::InputSplit> source(
    dmlc::InputSplit::Create("./cifar10_val.rec", 0, 1, "recordio"));

  source->BeforeFirst();
  dmlc::InputSplit::Blob rec;
  size_t sum = 0;
  while (source->NextRecord(&rec)) {
    sum += rec.size;
  }

  return 0;
}

It shows that the bug came into existence since dmlc/dmlc-core#452 is merged. For a running example, see https://github.com/hcho3/mxnet-issue12800-repro.

hcho3 on 19 Oct 2018

Submitted dmlc/dmlc-core#471, along with a test case.

hcho3 on 19 Oct 2018

@hcho3 Is this still an issue or has been fixed after your PR?