Models: Can't train detector model using Google Cloud

Created on 25 Nov 2017  ·  16 Comments  ·  Source: tensorflow/models

I'm trying to train my own detector model based on the TensorFlow sample and this post. I succeeded in training locally on a MacBook Pro. The problem is that I don't have a GPU, and training on the CPU is too slow (about 25-40 s per iteration).

So I'm trying to run the training on Google Cloud ML Engine following the tutorial, but I can't get it to run properly.

My folder structure is shown below:

+ data
 - train.record
 - test.record
+ models
 + train
 + eval
+ training
 - ssd_mobilenet_v1_coco

My steps to change from local training to Google Cloud training were:

  1. Create a bucket in Google Cloud storage and copy my local folder structure with files;
  2. Edit my pipeline.config file and change all paths from /Users/dev/detector/ to gs://bucketname/;
  3. Create a YAML file with the default configuration provided in the tutorial;
  4. Run
gcloud ml-engine jobs submit training object_detection_`date +%s` \
--job-dir=gs://bucketname/models/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/dev/detector/training/cloud.yml \
-- \
--train_dir=gs://bucketname/models/train \
--pipeline_config_path=gs://bucketname/data/pipeline.config
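Steps 2 and 4 can also be scripted. The following is only a minimal sketch: the helper names `to_bucket_paths` and `build_submit_args` are hypothetical, the sample `input_path` line is illustrative rather than taken from my actual pipeline.config, and the argument list simply mirrors the flags in the command above without running anything.

```python
import time


def to_bucket_paths(config_text, local_root, bucket_root):
    """Replace every occurrence of a local path prefix with a bucket prefix."""
    return config_text.replace(local_root, bucket_root)


def build_submit_args(bucket, config_file, now=None):
    """Assemble the gcloud argument list shown above (it is not executed here)."""
    # Equivalent of the shell's `date +%s`: seconds since the epoch.
    stamp = int(time.time() if now is None else now)
    train_dir = "gs://%s/models/train" % bucket
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training",
        "object_detection_%d" % stamp,
        "--job-dir=" + train_dir,
        "--packages",
        "dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz",
        "--module-name", "object_detection.train",
        "--region", "us-east1",
        "--config", config_file,
        "--",
        "--train_dir=" + train_dir,
        "--pipeline_config_path=gs://%s/data/pipeline.config" % bucket,
    ]


# Illustrative config line (an assumption, not copied from the real file).
sample = 'input_path: "/Users/dev/detector/data/train.record"'
print(to_bucket_paths(sample, "/Users/dev/detector/", "gs://bucketname/"))
print(" ".join(build_submit_args("bucketname", "training/cloud.yml")))
```

Note that Cloud Storage paths use the gs:// scheme, as in the command above, so that is the prefix the rewritten config should contain.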

Doing so gives me the following error message from the MLUnits:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'

Thanks in advance.


Most helpful comment

Now that TF 1.4 is available on CloudML Engine, the following changes fix this problem:

Make sure the runtimeVersion in your YAML config is 1.4, e.g.:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
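To double-check which runtime version a config file actually requests before submitting, something like the sketch below may help. It is a line-based reader, not a YAML parser, and the helper name `read_runtime_version` is hypothetical.

```python
def read_runtime_version(yaml_text):
    """Return the runtimeVersion value as a string, or None if it is absent."""
    for line in yaml_text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key == "runtimeVersion":
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("\"'")
    return None


CONFIG = """trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
"""

print(read_runtime_version(CONFIG) == "1.4")
```

Keeping the value quoted in the YAML file, as above, means it is read as the string "1.4" rather than as a number.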

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess
from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install


class CustomCommands(install):
    """Install command that runs apt-get before the regular install step."""

    def RunCustomCommand(self, command_list):
        # Run the command, capturing stdout and stderr together for the log.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # Install python-tk (Tkinter) first, then continue with the normal install.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(
            ['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)


REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    }
)
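The 'protobuf>=3.3.0' requirement above is worth checking on the machine that runs the job. A common explanation for the TypeError about the 'file' keyword, which this thread does not confirm, is that the generated *_pb2.py files come from a newer protoc than the protobuf runtime that imports them. The sketch below compares version strings under that assumption; all helper names are hypothetical.

```python
import itertools


def version_tuple(version):
    """Turn '3.3.0' into (3, 3, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '3.3' and '3.3.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="3.3.0"):
    """True if the installed version is at least the minimum version."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_protobuf_version():
    """Return the installed protobuf version string, or None if it is missing."""
    try:
        import google.protobuf
    except ImportError:
        return None
    return google.protobuf.__version__


found = installed_protobuf_version()
if found is None:
    print("protobuf is not installed")
else:
    print("protobuf %s, meets minimum: %s" % (found, meets_minimum(found)))
```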

In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:

import matplotlib
matplotlib.use('agg')
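The order matters here: the backend is selected before matplotlib.pyplot is imported. 'agg' is a non-interactive backend that renders to image buffers and does not need a display. A minimal sketch of the same ordering, guarded in case matplotlib is not installed (the function name is hypothetical):

```python
def select_headless_backend():
    """Select the 'agg' backend before pyplot is imported; return its name."""
    try:
        import matplotlib
    except ImportError:
        return None  # matplotlib is not available in this environment
    matplotlib.use('agg')
    import matplotlib.pyplot as plt  # imported only after the backend is set
    return matplotlib.get_backend().lower()


print(select_headless_backend())
```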

In line 184 of object_detection/evaluator.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, in line 103 of object_detection/builders/optimizer_builder.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()
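If you would rather not hard-code one of the two call sites, a small lookup can pick whichever function the installed TensorFlow exposes. This is only a sketch: `resolve_global_step_fn` is a hypothetical helper, and the stand-in objects below replace the real tensorflow module so the snippet runs without TensorFlow installed.

```python
from types import SimpleNamespace


def resolve_global_step_fn(tf_module):
    """Return tf.train.get_or_create_global_step if present, else the contrib one."""
    train = getattr(tf_module, "train", None)
    fn = getattr(train, "get_or_create_global_step", None)
    if fn is not None:
        return fn
    return tf_module.contrib.framework.get_or_create_global_step


# Stand-ins for two TensorFlow layouts (assumptions for demonstration only).
with_train = SimpleNamespace(
    train=SimpleNamespace(get_or_create_global_step=lambda: "train"))
contrib_only = SimpleNamespace(
    train=SimpleNamespace(),
    contrib=SimpleNamespace(framework=SimpleNamespace(
        get_or_create_global_step=lambda: "contrib")))

print(resolve_global_step_fn(with_train)())    # prints "train"
print(resolve_global_step_fn(contrib_only)())  # prints "contrib"
```

In the real files you would pass the imported tensorflow module, e.g. resolve_global_step_fn(tf)().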

Hope this helps!

All 16 comments

Make sure you have run the following from the models/research/ directory before running setup.py

export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
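For reference, the export adds the current directory and its slim subdirectory to the module search path. A rough Python equivalent, with a hypothetical helper name, looks like this:

```python
import os


def extend_pythonpath(existing, research_dir):
    """Append research_dir and research_dir/slim to an existing PYTHONPATH value."""
    entries = [existing] if existing else []
    entries += [research_dir, os.path.join(research_dir, "slim")]
    return os.pathsep.join(entries)


# Run from models/research/, as the comment above says.
print(extend_pythonpath(os.environ.get("PYTHONPATH", ""), os.getcwd()))
```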

@law826 I already did that. Unfortunately, I'm getting the same error.

I heard someone fixed it by modifying setup.py. Maybe this will help.

Did @macro-dadt's suggestion help?

I'm also trying to do what @lucasharada is doing (training with my own dataset). It runs fine locally on a MacBook Pro, although very slowly (approx. 25 s per step). I get the exact same error when trying to run on Google Cloud ML Engine.

I did try the link @macro-dadt suggested and did not have any luck (resulted in the same error).

Happy to provide more information if it helps diagnose what is happening.

I am getting this problem too!

@andersskog
Your answer doesn't work in my case.

@andersskog I tried your answer, but after only a few steps it throws an out-of-memory error. The image resolution is not large (less than 600).

@lucasharada I have the same error, have you solved it?

@aselle I have the same error. Is there a solution now? Thanks! I set runtime-version=1.4 on the gcloud command line and runtimeVersion: "1.4" in the yml file, but I still get the same error:

worker-replica-1
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'

@lucasharada got the same error, any ideas?

I have trained my model successfully on my MacBook Pro, but I cannot do the same on Google Cloud. I tried all the methods mentioned above, but I still cannot get it to work.

@maxwang7 added some commits on 28 Feb:
  199e254  annotates / fixes tutorial instructions
  82857bd  fixes tf_example_decoder.py
  5ffed73  adds dependencies
  2d76dce  FOR DEMONSTRATION ONLY; NOT FOR PUSHING …
@maxwang7 requested review from derekjchow and jch1 as code owners on 28 Feb

The four classes this person modified are reliable.

Closing since this is resolved. Feel free to reopen if the issue still persists. Thanks!
