Models: Can't train detector model using Google Cloud

Created on 25 Nov 2017 · 16 Comments · Source: tensorflow/models

I'm trying to train my own detector model based on the TensorFlow sample and this post. I succeeded in training locally on a MacBook Pro. The problem is that I don't have a GPU, and training on the CPU is too slow (about 25-40s per iteration).

So I'm trying to run the training on Google Cloud ML Engine, following the tutorial, but I can't get it to run properly.

My folder structure is described below:

+ data
 - train.record
 - test.record
+ models
 + train
 + eval
+ training
 - ssd_mobilenet_v1_coco

My steps to change from local training to Google Cloud training were:

1. Create a bucket in Google Cloud Storage and copy my local folder structure, with its files, into it;
2. Edit my pipeline.config file and change all paths from /Users/dev/detector/ to gs://bucketname/;
3. Create a YAML file with the default configuration provided in the tutorial;
4. Run:

gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://bucketname/models/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-east1 \
    --config /Users/dev/detector/training/cloud.yml \
    -- \
    --train_dir=gs://bucketname/models/train \
    --pipeline_config_path=gs://bucketname/data/pipeline.config
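For anyone repeating step 2, here is a minimal sketch of the path rewrite in Python. The function name `to_gcs_paths` and the sample `input_path` line are hypothetical and only illustrate the idea; the local prefix and bucket name are the ones used in this post. Note that the scheme has to be `gs://`.

```python
def to_gcs_paths(config_text, local_prefix, bucket):
    """Replace every occurrence of a local directory prefix with gs://<bucket>/.

    Hypothetical helper: it performs a plain text substitution and does not
    parse the pipeline config in any way.
    """
    gcs_prefix = 'gs://%s/' % bucket.strip('/')
    # Normalise the local prefix so it ends in exactly one slash.
    local = local_prefix.rstrip('/') + '/'
    return config_text.replace(local, gcs_prefix)


# Illustrative config line, not copied from a real pipeline.config.
sample = 'input_path: "/Users/dev/detector/data/train.record"'
print(to_gcs_paths(sample, '/Users/dev/detector/', 'bucketname'))
```

Reading pipeline.config into a string, passing it through this function and writing the result back would cover every path in one pass, which is less error-prone than editing each path by hand.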

Running this command gives me the following error message from the MLUnits:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'
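A commonly reported cause of this kind of TypeError is that the `*_pb2.py` files were generated by a newer `protoc` than the protobuf runtime installed on the worker. That is an assumption here, not something confirmed in this thread. A minimal sketch for checking the runtime version follows; the helper names are hypothetical, and the `3.3.0` default is only borrowed from the `protobuf>=3.3.0` requirement in the setup.py posted in this thread, not a verified threshold.

```python
import re


def version_tuple(version):
    """Turn a version string such as '3.4.0' into a tuple like (3, 4, 0)."""
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum='3.3.0'):
    """Return True if `installed` is at least `minimum` (illustrative default)."""
    return version_tuple(installed) >= version_tuple(minimum)


if __name__ == '__main__':
    try:
        import google.protobuf
        found = google.protobuf.__version__
        print('protobuf runtime %s, meets minimum: %s'
              % (found, meets_minimum(found)))
    except ImportError:
        print('protobuf is not installed in this environment')
```

Logging this at the start of the training module would show which protobuf version the Cloud ML Engine replica actually uses, which may differ from the local machine.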

Thanks in advance.

lucasharada

Most helpful comment

Now that TF 1.4 is available on CloudML Engine, the following changes fix this problem:

Make sure the runtime version in your YAML file is 1.4, e.g.:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
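As a quick sanity check before submitting a job, the runtime version can be read back from the YAML file. The sketch below is a simple line-based scan rather than a real YAML parser, and `find_runtime_version` is a hypothetical helper name.

```python
import re


def find_runtime_version(yaml_text):
    """Return the runtimeVersion value found in the text, or None.

    Line-based scan only: it does not understand YAML nesting or comments.
    """
    pattern = r'^\s*runtimeVersion:\s*["\']?([^"\'\s]+)["\']?\s*$'
    match = re.search(pattern, yaml_text, flags=re.MULTILINE)
    return match.group(1) if match else None


config = '''trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
'''
print(find_runtime_version(config))
```

If this returns None for your cloud.yml, the key is probably missing or misspelled, and the job would then not be pinned to 1.4.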

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess

from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install


class CustomCommands(install):
    """Install command that runs apt-get before the normal install."""

    def RunCustomCommand(self, command_list):
        # Run the command and capture stdout and stderr together for the log.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # Install the python-tk system package, then do the regular install.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)


REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    }
)

In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:

import matplotlib
matplotlib.use('agg')

In line 184 of object_detection/evaluator.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, in line 103 of object_detection/builders/optimizer_builder.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()
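If you would rather not hardcode one of the two locations, a fallback lookup is one option. The sketch below is not part of the fix described here: `first_available` is a hypothetical helper, and `fake_tf` is a stand-in object so that the example runs without TensorFlow installed (it uses `SimpleNamespace`, so it needs Python 3).

```python
from types import SimpleNamespace


def first_available(root, dotted_names):
    """Return the first attribute path in `dotted_names` that exists on `root`."""
    for dotted in dotted_names:
        obj = root
        try:
            for part in dotted.split('.'):
                obj = getattr(obj, part)
        except AttributeError:
            continue  # this path does not exist, try the next one
        return obj
    raise AttributeError('none of %r found' % (dotted_names,))


# Stand-in for a TensorFlow module that only has the contrib location.
fake_tf = SimpleNamespace(
    train=SimpleNamespace(),
    contrib=SimpleNamespace(
        framework=SimpleNamespace(
            get_or_create_global_step=lambda: 'global_step')))

get_step = first_available(fake_tf, [
    'train.get_or_create_global_step',
    'contrib.framework.get_or_create_global_step',
])
print(get_step())
```

With the real module, `root` would be the imported `tensorflow` package, and the same two dotted names would be tried in order.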

Hope this helps!

andersskog on 12 Dec 2017

👍3

All 16 comments

Make sure you have run the following from the models/research/ directory before running setup.py

export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
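For reference, the value that this export produces can be sketched in Python as follows. `build_pythonpath` is a hypothetical helper and `/work/models/research` is a made-up location for the models/research/ directory.

```python
import os


def build_pythonpath(research_dir, existing=''):
    """Append research_dir and research_dir/slim to an existing PYTHONPATH value."""
    # Drop empty entries so an unset PYTHONPATH does not add a stray separator.
    entries = [entry for entry in existing.split(os.pathsep) if entry]
    entries += [research_dir, os.path.join(research_dir, 'slim')]
    return os.pathsep.join(entries)


# Hypothetical checkout location, used for illustration only.
print(build_pythonpath('/work/models/research',
                       os.environ.get('PYTHONPATH', '')))
```

Both entries matter: the first makes `object_detection` importable and the second makes the `slim` packages importable.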

law826 on 25 Nov 2017

@law826 I already did that. Unfortunately, I'm getting the same error.

lucasharada on 26 Nov 2017

I heard someone fixed it by modifying setup.py. Maybe this will help.

macro-dadt on 28 Nov 2017

Did @macro-dadt's suggestion help?

aselle on 29 Nov 2017

I'm also trying to do what @lucasharada is doing (training with my own dataset). It runs fine locally on a MacBook Pro, although very slowly (approx. 25 sec per step). I get the exact same error when trying to run on Google Cloud ML Engine.

I tried the link @macro-dadt suggested and did not have any luck (it resulted in the same error).

Happy to provide more information if it helps diagnose what is happening.

jeffrwatts on 30 Nov 2017

I am getting this problem too!

cclough on 1 Dec 2017

👍2

@andersskog
Your answer doesn't work in my case.

Janekxyz on 3 Jan 2018

@andersskog I tried your answer, but after only a few steps it throws an out-of-memory error. The image resolution is not large (less than 600).

puma007 on 7 Jan 2018

@lucasharada I have the same error, have you solved it?

puma007 on 7 Jan 2018

@aselle I have the same error. Is there a solution now? Thanks! I have set runtime-version=1.4 on the gcloud command line, and the yml file also sets runtimeVersion: "1.4", but I still get the same error:

worker-replica-1
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'

puma007 on 7 Jan 2018

@lucasharada I got the same error. Any ideas?

ttungl on 3 Feb 2018

I have trained my model successfully on my MacBook Pro, but I cannot do the same on Google Cloud. I tried all the methods mentioned above, but none of them has worked so far.

mrainezty on 20 Jun 2018

👍1

https://github.com/tensorflow/models/pull/3490

can help you

zyxcambridge on 1 Jul 2018

maxwang7 added some commits on 28 Feb:
- 199e254: annotates / fixes tutorial instructions
- 82857bd: fixes tf_example_decoder.py
- 5ffed73: adds dependencies
- 2d76dce: FOR DEMONSTRATION ONLY; NOT FOR PUSHING …
maxwang7 requested review from derekjchow and jch1 as code owners on 28 Feb

The four classes this person modified look reliable.

zyxcambridge on 1 Jul 2018

Closing since this is resolved. Feel free to reopen if the issue still persists. Thanks!