I'm trying to train my own detector model based on the TensorFlow sample and this post. I succeeded in training locally on a MacBook Pro. The problem is that I don't have a GPU, and training on the CPU is too slow (about 25-40 s per iteration).
So I'm trying to run it on Google Cloud ML Engine following the tutorial, but I can't get it to run properly.
My folder structure is described below:
+ data
 - train.record
 - test.record
+ models
 + train
 + eval
+ training
 - ssd_mobilenet_v1_coco
My steps to change from local training to Google Cloud training were:
gcloud ml-engine jobs submit training object_detection_`date +%s` \
--job-dir=gs://bucketname/models/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-east1 \
--config /Users/dev/detector/training/cloud.yml \
-- \
--train_dir=gs://bucketname/models/train \
--pipeline_config_path=gs://bucketname/data/pipeline.config
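The job name in the command needs a unique suffix, which the shell version takes from `date +%s`. As a minimal sketch, the same argument list can be assembled in Python. The function name `build_submit_command` and the relative config path are hypothetical; the flags are the ones used in the command above.

```python
import time


def build_submit_command(bucket, config_path, prefix="object_detection"):
    """Return (job_name, argv) for a `gcloud ml-engine jobs submit training` call."""
    # Unix timestamp suffix, the Python counterpart of `date +%s`.
    job_name = "{}_{}".format(prefix, int(time.time()))
    argv = [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--job-dir=gs://{}/models/train".format(bucket),
        "--packages",
        "dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz",
        "--module-name", "object_detection.train",
        "--region", "us-east1",
        "--config", config_path,
        "--",  # arguments after this marker go to object_detection.train
        "--train_dir=gs://{}/models/train".format(bucket),
        "--pipeline_config_path=gs://{}/data/pipeline.config".format(bucket),
    ]
    return job_name, argv


job_name, argv = build_submit_command("bucketname", "training/cloud.yml")
print(" ".join(argv))
```

The resulting list can be handed to `subprocess.check_call(argv)`, which avoids shell quoting and line-continuation problems.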
Doing so gives me the following error message from the ML units:
The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'
Thanks in advance.
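A `TypeError` about the `file` keyword inside a generated `*_pb2.py` module usually points to a protobuf runtime on the worker that is older than the `protoc` that generated the file. That is an assumption about this job, not a confirmed diagnosis. The sketch below compares the installed runtime against the `protobuf>=3.3.0` requirement from the setup.py posted later in this thread. The helper names are hypothetical.

```python
import importlib


def parse_version(text):
    """Turn '3.3.0' or '3.0.0b2' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def protobuf_is_recent_enough(minimum="3.3.0"):
    """Return True or False, or None when protobuf is not installed."""
    try:
        protobuf = importlib.import_module("google.protobuf")
    except ImportError:
        return None
    return parse_version(protobuf.__version__) >= parse_version(minimum)


print("protobuf >= 3.3.0: {}".format(protobuf_is_recent_enough()))
```

Running this at the top of the training module would show in the job logs which protobuf version the worker actually imports.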
Make sure you have run the following from the models/research/ directory before running setup.py:
export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim
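For anyone scripting this step, here is a rough Python equivalent that computes the same value. The function name `research_pythonpath` is hypothetical, and unlike the shell line it drops empty entries when `PYTHONPATH` is unset.

```python
import os


def research_pythonpath(research_dir, existing=""):
    """Return a PYTHONPATH value with research_dir and research_dir/slim appended."""
    entries = [p for p in existing.split(os.pathsep) if p]
    entries += [research_dir, os.path.join(research_dir, "slim")]
    return os.pathsep.join(entries)


# Run from models/research/; os.getcwd() plays the role of `pwd`.
value = research_pythonpath(os.getcwd(), os.environ.get("PYTHONPATH", ""))
print("export PYTHONPATH={}".format(value))
```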
@law826 I already did that. Unfortunately, I'm getting the same error.
I heard someone fixed it by modifying setup.py. Maybe this will help.
Did @macro-dadt's suggestion help?
I'm also trying to do what @lucasharada is doing (training with my own dataset). It runs fine locally on a MacBook Pro (although very slowly, approx. 25 s per step). I get the exact same error when trying to run on Google Cloud ML Engine.
I did try the link @macro-dadt suggested and did not have any luck (resulted in the same error).
Happy to provide more information if it helps diagnose what is happening.
I am getting this problem too!
Now that TF 1.4 is available on CloudML Engine, the following changes fix this problem:
Make sure the runtimeVersion in your YAML config is 1.4, e.g.:
trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard
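Since several people in this thread were unsure whether their config really requested 1.4, a quick check before submitting may help. This is only a sketch: it uses a regular expression rather than a YAML parser, and `read_runtime_version` is a hypothetical helper name.

```python
import re

SAMPLE_CONFIG = """\
trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
"""


def read_runtime_version(config_text):
    """Return the runtimeVersion value as a string, or None if it is absent."""
    match = re.search(
        r'^\s*runtimeVersion:\s*["\']?([\d.]+)["\']?\s*$',
        config_text,
        re.MULTILINE,
    )
    return match.group(1) if match else None


print(read_runtime_version(SAMPLE_CONFIG))
```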
Change setup.py to the following:
"""Setup script for object_detection."""

import logging
import subprocess

from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install


class CustomCommands(install):
    """Install command that first installs system packages via apt-get."""

    def RunCustomCommand(self, command_list):
        # Run the command, merging stderr into stdout so all output is logged.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # python-tk provides Tkinter, needed by matplotlib's Tk backend.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)


REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    },
)
In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:
import matplotlib
matplotlib.use('agg')
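The order matters here: the backend has to be selected before `matplotlib.pyplot` is imported, because cloud workers have no display. A minimal sketch of that ordering follows. The wrapper function `select_headless_backend` is hypothetical and only exists so the snippet also runs on machines without matplotlib.

```python
import importlib.util


def select_headless_backend():
    """Select the non-interactive 'agg' backend; return None without matplotlib."""
    if importlib.util.find_spec("matplotlib") is None:
        return None
    import matplotlib
    matplotlib.use("agg")  # must run before pyplot is imported
    import matplotlib.pyplot as plt  # noqa: F401 (imported only to show the order)
    return matplotlib.get_backend()


print(select_headless_backend())
```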
In line 184 of object_detection/evaluator.py, change
tf.train.get_or_create_global_step()
to
tf.contrib.framework.get_or_create_global_step()
Finally, in line 103 of object_detection/builders/optimizer_builder.py, change
tf.train.get_or_create_global_step()
to
tf.contrib.framework.get_or_create_global_step()
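Instead of editing both files by hand, the two call sites could share a small fallback helper. This is a sketch, not what the Object Detection API does: `resolve_global_step_fn` is a hypothetical name, and the stand-in objects below are not real TensorFlow modules, they only demonstrate the two lookup paths.

```python
from types import SimpleNamespace


def resolve_global_step_fn(tf):
    """Return get_or_create_global_step from tf.train, else tf.contrib.framework."""
    train = getattr(tf, "train", None)
    fn = getattr(train, "get_or_create_global_step", None)
    if fn is not None:
        return fn
    return tf.contrib.framework.get_or_create_global_step


# Stand-in modules (not real TensorFlow) covering both cases.
with_train = SimpleNamespace(
    train=SimpleNamespace(get_or_create_global_step=lambda: "train"))
contrib_only = SimpleNamespace(
    train=SimpleNamespace(),
    contrib=SimpleNamespace(
        framework=SimpleNamespace(get_or_create_global_step=lambda: "contrib")))

print(resolve_global_step_fn(with_train)())
print(resolve_global_step_fn(contrib_only)())
```

With a real installation the call would be `resolve_global_step_fn(tf)()` after `import tensorflow as tf`.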
Hope this helps!
@andersskog
Your answer doesn't work in my case.
@andersskog I tried your answer, but after only a few steps it throws an out-of-memory error. The image resolution is not large (less than 600).
@lucasharada I have the same error. Have you solved it?
@aselle I have the same error. Is there a solution now? Thanks! I set --runtime-version=1.4 on the gcloud command line and runtimeVersion: "1.4" in the yml file, but I still get the same error on worker-replica-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'
@lucasharada got the same error, any ideas?
I have trained my model successfully on my MacBook Pro, but I cannot do the same thing on Google Cloud. I tried all the methods mentioned above, but I still can't get it to work.
This PR may help you: https://github.com/tensorflow/models/pull/3490
maxwang7 added some commits on 28 Feb:
199e254 annotates / fixes tutorial instructions
82857bd fixes tf_example_decoder.py
5ffed73 adds dependencies
2d76dce FOR DEMONSTRATION ONLY; NOT FOR PUSHING …
maxwang7 requested review from derekjchow and jch1 as code owners on 28 Feb
This person modified four classes; the changes are reliable.
Closing since this is resolved. Feel free to reopen if the issue still persists. Thanks!