Keras: np.load(path) should be updated to np.load(path, allow_pickle=True) in keras/datasets/

Created on 24 Apr 2019 · 7 comments · Source: keras-team/keras

System information

  • Have I written custom code (as opposed to using example directory): No, using keras/datasets/imdb.py
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10 b1803
  • TensorFlow backend (yes / no): yes
  • TensorFlow version: 1.13.1
  • Keras version: 2.2.4 (using keras from tensorflow)
  • Python version: 3.7.3
  • CUDA/cuDNN version: 10.0
  • GPU model and memory: NVIDIA GTX 960M 2GB

Describe the current behavior
ValueError: Object arrays cannot be loaded when allow_pickle=False

Describe the expected behavior
np.load(path) should be updated to np.load(path, allow_pickle=True)

Code to reproduce the issue
Line 86 of keras/datasets/imdb.py should read:

with np.load(path, allow_pickle=True) as f:
    x_train, labels_train = f['x_train'], f['y_train']
    x_test, labels_test = f['x_test'], f['y_test']

Other info / logs
numpy\lib\npyio.py", line 262, in __getitem__
    pickle_kwargs=self.pickle_kwargs)
numpy\lib\format.py", line 692, in read_array
    raise ValueError("Object arrays cannot be loaded when "
ValueError: Object arrays cannot be loaded when allow_pickle=False
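
For context, the error comes from NumPy itself: starting with NumPy 1.16.3, np.load() defaults to allow_pickle=False, and the IMDB arrays are object (pickled) arrays. Below is a minimal, self-contained sketch (not from this thread; demo.npz is just a throwaway file name) that reproduces the same ValueError:

import numpy as np

# Ragged sequences, like the IMDB reviews, can only be stored as object arrays,
# and object arrays are saved and loaded via pickle.
ragged = np.array([[1, 2, 3], [4, 5]], dtype=object)
np.savez('demo.npz', x_train=ragged)

with np.load('demo.npz') as f:  # NumPy >= 1.16.3: allow_pickle defaults to False
    try:
        f['x_train']
    except ValueError as e:
        print(e)  # Object arrays cannot be loaded when allow_pickle=False

with np.load('demo.npz', allow_pickle=True) as f:
    print(f['x_train'])  # loads fine once pickling is explicitly allowed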

Most helpful comment

@ranahamzaintisar12
@andrea0713
@aloko001

Anywhere that you are calling a load function on an object array, you need to change the call from load(path) to load(path, allow_pickle=True). In andrea0713's example link, imdb.load_data() calls np.load(path) without allowing pickles. Find imdb.py, locate where np.load() is called, and update its parameters.

Some context about why this is the case is available here: https://github.com/DistrictDataLabs/yellowbrick/issues/765

All 7 comments

This has been resolved; my local environment was a few hours out of date.

Hi, how did you resolve this error?

I thought the official tutorials would be bug-free, until this popped up when I ran the example at https://www.tensorflow.org/tutorials/keras/basic_text_classification.

@andrea0713 the line you sent still doesn't solve the issue. @bwalsh0 how did you solve the error?

@ranahamzaintisar12
@andrea0713
@aloko001

Anywhere that you are calling a load function on an object array, you need to change the call from load(path) to load(path, allow_pickle=True). In andrea0713's example link, imdb.load_data() calls np.load(path) without allowing pickles. Find imdb.py, locate where np.load() is called, and update its parameters.

Some context about why this is the case is available here: https://github.com/DistrictDataLabs/yellowbrick/issues/765

@bwalsh0 this sounds like you're altering the source code of Keras itself, correct?
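
For anyone who would rather not edit the installed Keras files, a common workaround (a sketch under that assumption, not the fix proposed in this thread) is to temporarily make allow_pickle=True the default for np.load around the imdb.load_data() call and restore the original afterwards:

import functools

import numpy as np
from keras.datasets import imdb

# Temporarily patch np.load so allow_pickle defaults to True, then restore it.
_np_load_orig = np.load
np.load = functools.partial(_np_load_orig, allow_pickle=True)
try:
    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
finally:
    np.load = _np_load_orig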

@bwalsh0 don't downgrade or upgrade anything; just use this to load the IMDB dataset:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

from keras.utils.data_utils import get_file
from keras.preprocessing.sequence import _remove_long_seq
import numpy as np
import json
import warnings


def load_data(path='imdb.npz', num_words=None, skip_top=0,
              maxlen=None, seed=113,
              start_char=1, oov_char=2, index_from=3, **kwargs):
    """Loads the IMDB dataset.

    # Arguments
        path: where to cache the data (relative to ~/.keras/dataset).
        num_words: max number of words to include. Words are ranked
            by how often they occur (in the training set) and only
            the most frequent words are kept.
        skip_top: skip the top N most frequently occurring words
            (which may not be informative).
        maxlen: sequences longer than this will be filtered out.
        seed: random seed for sample shuffling.
        start_char: The start of a sequence will be marked with this character.
            Set to 1 because 0 is usually the padding character.
        oov_char: words that were cut out because of the num_words
            or skip_top limit will be replaced with this character.
        index_from: index actual words with this index and higher.

    # Returns
        Tuple of Numpy arrays: (x_train, y_train), (x_test, y_test).

    # Raises
        ValueError: in case maxlen is so low
            that no input sequence could be kept.

    Note that the 'out of vocabulary' character is only used for
    words that were present in the training set but are not included
    because they're not making the num_words cut here.
    Words that were not seen in the training set but are in the test set
    have simply been skipped.
    """
    # Legacy support
    if 'nb_words' in kwargs:
        warnings.warn('The nb_words argument in load_data '
                      'has been renamed num_words.')
        num_words = kwargs.pop('nb_words')
    if kwargs:
        raise TypeError('Unrecognized keyword arguments: ' + str(kwargs))

    path = get_file(path,
                    origin='https://s3.amazonaws.com/text-datasets/imdb.npz',
                    file_hash='599dadb1135973df5b59232a0e9a887c')
    with np.load(path, allow_pickle=True) as f:
        x_train, labels_train = f['x_train'], f['y_train']
        x_test, labels_test = f['x_test'], f['y_test']

    rng = np.random.RandomState(seed)
    indices = np.arange(len(x_train))
    rng.shuffle(indices)
    x_train = x_train[indices]
    labels_train = labels_train[indices]

    indices = np.arange(len(x_test))
    rng.shuffle(indices)
    x_test = x_test[indices]
    labels_test = labels_test[indices]

    xs = np.concatenate([x_train, x_test])
    labels = np.concatenate([labels_train, labels_test])

    if start_char is not None:
        xs = [[start_char] + [w + index_from for w in x] for x in xs]
    elif index_from:
        xs = [[w + index_from for w in x] for x in xs]

    if maxlen:
        xs, labels = _remove_long_seq(maxlen, xs, labels)
        if not xs:
            raise ValueError('After filtering for sequences shorter than maxlen=' +
                             str(maxlen) + ', no sequence was kept. '
                             'Increase maxlen.')
    if not num_words:
        num_words = max([max(x) for x in xs])

    # by convention, use 2 as OOV word
    # reserve 'index_from' (=3 by default) characters:
    # 0 (padding), 1 (start), 2 (OOV)
    if oov_char is not None:
        xs = [[w if (skip_top <= w < num_words) else oov_char for w in x]
              for x in xs]
    else:
        xs = [[w for w in x if skip_top <= w < num_words]
              for x in xs]

    idx = len(x_train)
    x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])
    x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])

    return (x_train, y_train), (x_test, y_test)


def get_word_index(path='imdb_word_index.json'):
    """Retrieves the dictionary mapping words to word indices.

    # Arguments
        path: where to cache the data (relative to ~/.keras/dataset).

    # Returns
        The word index dictionary.
    """
    path = get_file(
        path,
        origin='https://s3.amazonaws.com/text-datasets/imdb_word_index.json',
        file_hash='bfafd718b763782e994055a2d397834f')
    with open(path) as f:
        return json.load(f)

Load the dataset, but only keep the top n words and zero the rest:

top_words = 10000
(X_train, y_train), (X_test, y_test) = load_data(path="imdb.npz",
                                                 num_words=top_words,
                                                 skip_top=0,
                                                 maxlen=None,
                                                 seed=113,
                                                 start_char=1,
                                                 oov_char=2,
                                                 index_from=3)

Truncate and pad the input sequences:

from keras.preprocessing import sequence

max_review_length = 500
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)
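
As a quick sanity check (assuming the full IMDB dataset downloaded, i.e. 25,000 training and 25,000 test reviews), the padded arrays should now have a fixed length of 500 per review:

print(X_train.shape)  # expected: (25000, 500)
print(X_test.shape)   # expected: (25000, 500)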
