Environment:
(tensorflow_p36) ubuntu@ip-172-31-38-183:~$ mpirun --version
mpirun (Open MPI) 4.0.1
CUDA version:
CUDA Version 10.0.130
NCCL version:
Not sure
So I'm running Deep Learning AMI (Ubuntu) Version 24.2 (ami-02c253ecf7eaba73e) on AWS and using source activate tensorflow_p36, which provides TensorFlow and Keras 2.2. I then updated Horovod to the latest version because of this problem; the behaviour was the same before the update.
Initially I was just trying to run Horovod locally, and I'm getting this:
(tensorflow_p36) ubuntu@ip-172-31-38-183:~$ rm -rf data && horovodrun --verbose -np 1 -H localhost:1 python train.py
Filtering local host names.
Remote host found:
All hosts are local, finding the interfaces with address 127.0.0.1
Local interface found lo
Checking whether extension tensorflow was built with MPI.
WARNING: Logging before flag parsing goes to stderr.
W0924 16:36:28.773288 140112561821440 deprecation_wrapper.py:119] From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
W0924 16:36:28.773586 140112561821440 deprecation_wrapper.py:119] From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
Extension tensorflow was built with MPI.
mpirun --allow-run-as-root --tag-output -np 1 -H localhost:1 -bind-to none -map-by slot -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_include lo -x NCCL_SOCKET_IFNAME=lo -x CUDA_PATH -x XDG_SESSION_ID -x JAVA_LD_LIBRARY_PATH -x HOROVOD_CUDA_HOME -x TERM -x SHELL -x SSH_CLIENT -x CONDA_SHLVL -x CONDA_PROMPT_MODIFIER -x SSH_TTY -x USER -x HOROVOD_NCCL_HOME -x LS_COLORS -x LD_LIBRARY_PATH -x LD_LIBRARY_PATH_WITH_DEFAULT_CUDA -x CONDA_EXE -x PYTHON_VERSION -x MAIL -x PATH -x HOROVOD_GPU_ALLREDUCE -x CONDA_PREFIX -x ENV_NAME -x PWD -x JAVA_HOME -x LANG -x LD_LIBRARY_PATH_WITHOUT_CUDA -x TF_CPP_MIN_LOG_LEVEL -x SHLVL -x HOME -x KERAS_BACKEND -x CONDA_PYTHON_EXE -x PYTHONPATH -x LOGNAME -x XDG_DATA_DIRS -x SSH_CONNECTION -x JAVA_HOME_CONDA_BACKUP -x CONDA_DEFAULT_ENV -x LESSOPEN -x PKG_CONFIG_PATH -x XDG_RUNTIME_DIR -x JAVA_LD_LIBRARY_PATH_BACKUP -x LESSCLOSE -x _ -x HOROVOD_STALL_CHECK_TIME_SECONDS -x HOROVOD_STALL_SHUTDOWN_TIME_SECONDS -x HOROVOD_NUM_NCCL_STREAMS -x HOROVOD_MLSL_BGT_AFFINITY python train.py
[1,0]<stderr>:Using TensorFlow backend.
[1,0]<stderr>:WARNING: Logging before flag parsing goes to stderr.
[1,0]<stderr>:W0924 16:36:31.064738 140530409674496 deprecation_wrapper.py:119] From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/horovod/tensorflow/__init__.py:117: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.
[1,0]<stderr>:
[1,0]<stderr>:W0924 16:36:31.065012 140530409674496 deprecation_wrapper.py:119] From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/horovod/tensorflow/__init__.py:143: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.
[1,0]<stderr>:
[1,0]<stderr>:W0924 16:36:31.361919 140530409674496 deprecation_wrapper.py:119] From train.py:99: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
[1,0]<stderr>:
[1,0]<stdout>:Extracting all the files now...
[1,0]<stdout>:Done!
[1,0]<stdout>:Found 14894 images belonging to 6 classes.
[1,0]<stdout>:Found 3662 images belonging to 6 classes.
[1,0]<stderr>:W0924 16:36:45.087139 140530409674496 deprecation.py:506] From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/ops/init_ops.py:1251: calling VarianceScaling.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
[1,0]<stderr>:Instructions for updating:
[1,0]<stderr>:Call initializer instance with the dtype argument instead of passing it to the constructor
[1,0]<stdout>:Model: "sequential"
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:Layer (type) Output Shape Param #
[1,0]<stdout>:=================================================================
[1,0]<stdout>:conv2d (Conv2D) (None, 453, 255, 16) 448
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:max_pooling2d (MaxPooling2D) (None, 226, 127, 16) 0
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:conv2d_1 (Conv2D) (None, 226, 127, 32) 4640
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:max_pooling2d_1 (MaxPooling2 (None, 113, 63, 32) 0
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:conv2d_2 (Conv2D) (None, 113, 63, 64) 18496
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:max_pooling2d_2 (MaxPooling2 (None, 56, 31, 64) 0
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:flatten (Flatten) (None, 111104) 0
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:dense (Dense) (None, 512) 56885760
[1,0]<stdout>:_________________________________________________________________
[1,0]<stdout>:dense_1 (Dense) (None, 6) 3078
[1,0]<stdout>:=================================================================
[1,0]<stdout>:Total params: 56,912,422
[1,0]<stdout>:Trainable params: 56,912,422
[1,0]<stdout>:Non-trainable params: 0
[1,0]<stdout>:_________________________________________________________________
[1,0]<stderr>:W0924 16:36:45.344241 140530409674496 deprecation_wrapper.py:119] From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.
[1,0]<stderr>:
[1,0]<stderr>:W0924 16:36:45.346050 140530409674496 deprecation_wrapper.py:119] From /home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:186: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
[1,0]<stderr>:
[1,0]<stdout>:Epoch 1/100
[1,0]<stderr>:Traceback (most recent call last):
[1,0]<stderr>: File "train.py", line 283, in <module>
[1,0]<stderr>: validation_steps=round(3830 // batch_size)
[1,0]<stderr>: File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1433, in fit_generator
[1,0]<stderr>: steps_name='steps_per_epoch')
[1,0]<stderr>: File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 260, in model_iteration
[1,0]<stderr>: callbacks._call_batch_hook(mode, 'begin', step, batch_logs)
[1,0]<stderr>: File "/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/tensorflow/python/keras/callbacks.py", line 247, in _call_batch_hook
[1,0]<stderr>: batch_hook = getattr(callback, hook_name)
[1,0]<stderr>:AttributeError: 'BroadcastGlobalVariablesCallback' object has no attribute 'on_train_batch_begin'
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[60231,1],0]
Exit code: 1
--------------------------------------------------------------------------
(tensorflow_p36) ubuntu@ip-172-31-38-183:~$
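The traceback comes down to how tf.keras dispatches batch-level hooks: its training loop calls callbacks._call_batch_hook, which does getattr(callback, hook_name) on every callback with no fallback, while Horovod's BroadcastGlobalVariablesCallback from horovod.keras subclasses standalone Keras's Callback base, which in Keras 2.2 has no on_train_batch_begin. A minimal sketch of that mismatch, using simplified stand-in classes of my own (not the real APIs):

```python
class TfKerasCallback:
    """Mimics tf.keras.callbacks.Callback: defines batch-level hooks."""
    def on_train_batch_begin(self, batch, logs=None):
        pass

class OldKerasCallback:
    """Mimics standalone Keras 2.2's Callback: only epoch-level hooks."""
    def on_epoch_begin(self, epoch, logs=None):
        pass

class BroadcastCallback(OldKerasCallback):
    """Mimics hvd.callbacks.BroadcastGlobalVariablesCallback, which
    subclasses the standalone-Keras base, not the tf.keras one."""

def call_batch_hook(callbacks, hook_name, batch):
    # Mirrors tf.keras's callbacks._call_batch_hook: it looks the hook
    # up on *every* callback and assumes the attribute exists.
    for cb in callbacks:
        getattr(cb, hook_name)(batch)

# A tf.keras-native callback dispatches fine:
call_batch_hook([TfKerasCallback()], 'on_train_batch_begin', 0)

# ...but the standalone-Keras callback fails exactly like the traceback:
try:
    call_batch_hook([BroadcastCallback()], 'on_train_batch_begin', 0)
except AttributeError as e:
    print(e)  # ... object has no attribute 'on_train_batch_begin'
```

So the error is not in the fit_generator arguments at all; it's the two Keras lineages meeting in one callback list.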
My code looks like this. I've included the full script; it's verbose, but I thought that best:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
import horovod.keras as hvd
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from zipfile import ZipFile
import glob
from PIL import Image
import boto3
import botocore
import os
import numpy as np
import matplotlib.pyplot as plt
import shutil

hvd.init()

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())

ACCESS_KEY = '...'
SECRET_KEY = '.....'
BUCKET_NAME = 'product-ml'
KEY = 'data.zip'

# Download data
s3 = boto3.resource('s3', region_name='eu-west-1', aws_access_key_id=ACCESS_KEY,
                    aws_secret_access_key=SECRET_KEY)
try:
    s3.Bucket(BUCKET_NAME).download_file('data.zip', 'data.zip')
except botocore.exceptions.ClientError as e:
    if e.response['Error']['Code'] == "404":
        print("The object does not exist.")
    else:
        raise

with ZipFile('data.zip', 'r') as zip:
    # extracting all the files
    print('Extracting all the files now...')
    zip.extractall()
print('Done!')

# Split the extracted data into train/validation directories
dirs = os.listdir('./data')
if '.DS_Store' in dirs:
    dirs.remove('.DS_Store')
for d in dirs:
    files = os.listdir('./data/' + d)
    if not os.path.exists('./data/' + 'validation/'):
        os.mkdir('./data/' + 'validation/')
    if not os.path.exists('./data/' + 'train/'):
        os.mkdir('./data/' + 'train/')
    if not os.path.exists('./data/validation/' + d):
        os.mkdir('./data/validation/' + d)
    if not os.path.exists('./data/train/' + d):
        os.mkdir('./data/train/' + d)
    for file in files:
        if np.random.rand(1) < 0.2:
            shutil.move('./data/' + d + '/' + file, './data/' + 'validation' + '/' + d + '/' + file)
        else:
            shutil.move('./data/' + d + '/' + file, './data/' + 'train' + '/' + d + '/' + file)

PATH = os.path.join(os.path.dirname('./'), 'data')
train_dir = os.path.join(PATH, 'train')
validation_dir = os.path.join(PATH, 'validation')

batch_size = 60
epochs = 100
IMG_HEIGHT = 453
IMG_WIDTH = 255

train_image_generator = ImageDataGenerator(rescale=1./255)       # Generator for our training data
validation_image_generator = ImageDataGenerator(rescale=1./255)  # Generator for our validation data
train_data_gen = train_image_generator.flow_from_directory(batch_size=batch_size,
                                                           directory=train_dir,
                                                           shuffle=True,
                                                           target_size=(IMG_HEIGHT, IMG_WIDTH))
val_data_gen = validation_image_generator.flow_from_directory(batch_size=batch_size,
                                                              directory=validation_dir,
                                                              target_size=(IMG_HEIGHT, IMG_WIDTH))

model = Sequential([
    Conv2D(16, 3, padding='same', activation='relu', input_shape=(IMG_HEIGHT, IMG_WIDTH, 3)),
    MaxPooling2D(),
    Conv2D(32, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Conv2D(64, 3, padding='same', activation='relu'),
    MaxPooling2D(),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(6, activation='softmax')
])

# Adjust learning rate based on number of GPUs (naive approach).
opt = tf.keras.optimizers.Adadelta(1.0 * hvd.size())
# Add Horovod Distributed Optimizer.
opt = hvd.DistributedOptimizer(opt)

model.compile(optimizer=opt,
              loss=tf.keras.losses.categorical_crossentropy,
              metrics=['accuracy'])
model.summary()

filepath = "weights-improvement-{epoch:02d}-{val_acc:.2f}.hdf5"
callbacks_list = []
if hvd.rank() == 0:
    modelCheckPointCallBack = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                                              save_best_only=True, mode='max')
    callbacks_list.append(modelCheckPointCallBack)
callbacks_list.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0))
callbacks_list.append(hvd.callbacks.MetricAverageCallback())
callbacks_list.append(hvd.callbacks.LearningRateWarmupCallback(warmup_epochs=5, verbose=1))
callbacks_list.append(tf.keras.callbacks.ReduceLROnPlateau(patience=10, verbose=1))

history = model.fit_generator(
    train_data_gen,
    steps_per_epoch=round(14726 // batch_size),
    epochs=epochs,
    validation_data=val_data_gen,
    callbacks=callbacks_list,
    validation_steps=round(3830 // batch_size)
)
I've checked every example I can find that uses model.fit_generator, and even the one on here seems to do the same thing. Could anyone give me a pointer to where I'm going wrong, please?
Yep, got it; ignore this. I was being an idiot and mixing tf.keras and keras together. I'll post an answer shortly, sorry guys. It's funny how typing out a problem in detail generally solves it!
So yeah, if you look above at where I imported the tensorflow.keras models and callbacks, I just needed to remove the tensorflow part, since horovod.keras expects keras, not tf.keras! Silly me.
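For anyone hitting the same error: the point is that the model, layers, and callbacks must come from the same Keras that horovod.keras wraps, i.e. the standalone keras package rather than tf.keras. Roughly, the import block at the top of the script becomes something like this (a sketch of the fix described above, not a tested snippet; the rest of the script's tf.keras.* references, such as the Adadelta optimizer, need the same treatment):

```python
import tensorflow as tf
import horovod.keras as hvd                  # wraps standalone Keras
from keras.callbacks import ModelCheckpoint  # was tensorflow.keras.callbacks
from keras.models import Sequential          # was tensorflow.keras.models
from keras.layers import Dense, Conv2D, Flatten, Dropout, MaxPooling2D
from keras.preprocessing.image import ImageDataGenerator
```

Alternatively, staying entirely on tf.keras and importing horovod.tensorflow.keras as hvd instead would keep the two lineages consistent the other way around.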