Version: master
Operating System:
Linux version 3.10.0-862.9.1.el7.x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
GPU: Tesla V100
CUDA 9.2
Python 3.5.5
Arrow: N/A
pandas==0.20.3
numpy==1.14.5
Converting a pandas dataframe created from a numpy array with pygdf.DataFrame.from_pandas fails with: ValueError: Array contains non-contiguous buffer and cannot be transferred as a single memory region. Please ensure contiguous buffer with numpy .ascontiguousarray().
The underlying numpy array appears to be C contiguous, and recasting with np.ascontiguousarray as suggested in the traceback doesn't resolve the error, since the array is already in C contiguous order. When the same data is written to disk (with .to_csv) and loaded back into a dataframe, the issue doesn't come up; in that case the underlying array read by pd.read_csv appears to be FORTRAN contiguous. However, converting the in-memory array to FORTRAN contiguous order doesn't solve it either: it raises a different error, TypeError: __getitem__ on type 0 is not supported.
The following code snippet results in the ValueError listed above:
import pandas as pd
import numpy as np
import pygdf
arr1 = np.random.sample([5000, 10])
df = pd.DataFrame(arr1)
gdf = pygdf.DataFrame.from_pandas(df)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-36-a66f55e3e62f> in <module>()
7 df = pd.DataFrame(arr1)
8
----> 9 gdf = pygdf.DataFrame.from_pandas(df)
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in from_pandas(cls, dataframe)
1045 # Set columns
1046 for colk in dataframe.columns:
-> 1047 df[colk] = dataframe[colk].values
1048 # Set index
1049 return df.set_index(dataframe.index.values)
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in __setitem__(self, name, col)
199 self._cols[name] = self._prepare_series_for_add(col)
200 else:
--> 201 self.add_column(name, col)
202
203 def __delitem__(self, name):
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in add_column(self, name, data, forceindex)
416 raise NameError('duplicated column name {!r}'.format(name))
417
--> 418 series = self._prepare_series_for_add(data, forceindex=forceindex)
419 self._cols[name] = series
420
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in _prepare_series_for_add(self, col, forceindex)
391 The prepared Series object.
392 """
--> 393 col = self._sanitize_columns(col)
394 empty_index = isinstance(self._index, EmptyIndex)
395 series = Series(col)
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in _sanitize_columns(self, col)
347 col values
348 """
--> 349 series = Series(col)
350 if len(self) == 0 and len(self.columns) > 0 and len(series) > 0:
351 ind = series.index
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/series.py in __init__(self, data, index)
64 data = data._column
65 if not isinstance(data, columnops.TypedColumnBase):
---> 66 data = columnops.as_column(data)
67
68 if index is not None and not isinstance(index, Index):
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/columnops.py in as_column(arbitrary)
149 return datetime.DatetimeColumn.from_numpy(arbitrary)
150 else:
--> 151 return as_column(Buffer(arbitrary))
152 else:
153 return as_column(np.asarray(arbitrary))
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/buffer.py in __init__(self, mem, size, capacity, categorical)
33 if capacity is None:
34 capacity = size
---> 35 self.mem = cudautils.to_device(mem)
36 _BufferSentry(self.mem).ndim(1)
37 self.size = size
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/cudautils.py in to_device(ary)
15
16 def to_device(ary):
---> 17 dary, _ = cuda._auto_device(ary)
18 return dary
19
/conda/envs/gdf/lib/python3.5/site-packages/numba/cuda/api.py in _auto_device(ary, stream, copy)
319
320 def _auto_device(ary, stream=0, copy=True):
--> 321 return devicearray.auto_device(ary, stream=stream, copy=copy)
322
323
/conda/envs/gdf/lib/python3.5/site-packages/numba/cuda/cudadrv/devicearray.py in auto_device(obj, stream, copy)
644 copy=False,
645 subok=True)
--> 646 sentry_contiguous(obj)
647 devobj = from_array_like(obj, stream=stream)
648 if copy:
/conda/envs/gdf/lib/python3.5/site-packages/numba/cuda/cudadrv/devicearray.py in sentry_contiguous(ary)
618
619 else:
--> 620 raise ValueError(errmsg_contiguous_buffer)
621
622
ValueError: Array contains non-contiguous buffer and cannot be transferred as a single memory region. Please ensure contiguous buffer with numpy .ascontiguousarray()
However, the following code snippet works as expected. After reading the dataframe back into memory from disk, the from_pandas method succeeds.
import pandas as pd
import numpy as np
import pygdf
arr1 = np.random.sample([5000, 10])
df = pd.DataFrame(arr1)
df.to_csv('temp.csv')
df = pd.read_csv('temp.csv')
gdf = pygdf.DataFrame.from_pandas(df)
gdf
Initially, the array is C contiguous, and using np.ascontiguousarray doesn't change the flags.
arr1 = np.random.sample([5000, 10])
print(arr1.flags, '\n')
print(np.ascontiguousarray(arr1).flags)
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
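This seems consistent with how numpy views work: a column sliced out of a C contiguous 2-D array is a strided view (consecutive elements are one full row apart in memory), so the 1-D columns are non-contiguous even though the parent array is contiguous. That would explain why np.ascontiguousarray on the full 2-D array doesn't help. A small illustration:

```python
import numpy as np

arr = np.random.sample([5000, 10])
col = arr[:, 0]  # column view: elements are one row (10 float64s) apart

print(arr.flags['C_CONTIGUOUS'])  # True
print(col.flags['C_CONTIGUOUS'])  # False
print(col.strides)                # (80,) -- stride of 10 * 8 bytes

# Copying the view produces a contiguous 1-D array
print(np.ascontiguousarray(col).flags['C_CONTIGUOUS'])  # True
```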
However, after reading the csv file from disk into a pandas dataframe, the flags of the underlying array show that it is FORTRAN contiguous.
df = pd.read_csv('temp.csv')
df.values.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Simply converting the original array to be FORTRAN contiguous results in a different error.
import pandas as pd
import numpy as np
import pygdf
arr1 = np.random.sample([5000, 10])
df = pd.DataFrame(np.asfortranarray(arr1))
gdf = pygdf.DataFrame.from_pandas(df)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-47-748e296ff81e> in <module>()
7 df = pd.DataFrame(np.asfortranarray(arr1))
8
----> 9 gdf = pygdf.DataFrame.from_pandas(df)
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in from_pandas(cls, dataframe)
1047 df[colk] = dataframe[colk].values
1048 # Set index
-> 1049 return df.set_index(dataframe.index.values)
1050
1051 def to_records(self, index=True):
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in set_index(self, index)
323 df = DataFrame()
324 for k in self.columns:
--> 325 df[k] = self[k].set_index(index)
326 return df
327
/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in __getitem__(self, arg)
190 else:
191 msg = "__getitem__ on type {!r} is not supported"
--> 192 raise TypeError(msg.format(arg))
193
194 def __setitem__(self, name, col):
TypeError: __getitem__ on type 0 is not supported
I noticed some interesting behavior as well. If we create a pandas dataframe from an underlying ndarray that is C contiguous, dataframe.values is C contiguous but each column within the dataframe is not.
import pandas as pd
import numpy as np
import pygdf
arr1 = np.random.sample([50, 2])
df = pd.DataFrame(arr1)
print(df.values.flags,'\n')
for col in df.columns:
print(df[col].flags,'\n')
Produces the following:
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
Converting the initial numpy array to be F contiguous results in each column of the dataframe being C contiguous.
import pandas as pd
import numpy as np
import pygdf
arr1 = np.random.sample([50, 2])
arr1 = np.asfortranarray(arr1)
df = pd.DataFrame(arr1)
print(df.values.flags,'\n')
for col in df.columns:
print(df[col].flags,'\n')
Output
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : False
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
It seems like individual columns (series) within the dataframe should be C contiguous for the pygdf dataframe creation to work.
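Based on that observation, one possible workaround (a sketch on my part, not a pygdf-sanctioned fix) is to convert the backing array to FORTRAN order before constructing the pandas dataframe, so that each 1-D column view comes out C contiguous:

```python
import numpy as np
import pandas as pd

arr1 = np.random.sample([5000, 10])
# FORTRAN order stores each column contiguously, so the 1-D column
# views that from_pandas extracts are C contiguous.
df = pd.DataFrame(np.asfortranarray(arr1))
for col in df.columns:
    assert df[col].values.flags['C_CONTIGUOUS']
```

This only addresses the contiguity error; the numeric-column-name TypeError described below still has to be worked around separately.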
The second observation is that creating dataframes with numeric column names does not seem to work. Extending the same example above, we can get rid of the TypeError: __getitem__ on type 0 is not supported error by casting the column names to strings.
import pandas as pd
import numpy as np
import pygdf
arr1 = np.random.sample([50, 2])
arr1 = np.asfortranarray(arr1)
df = pd.DataFrame(arr1)
df.columns = df.columns.astype('U2')
print(df.columns)
gdf = pygdf.DataFrame.from_pandas(df)
print(gdf)
Output
Index(['0', '1'], dtype='object')
0 1
0 0.43086811049897766 0.30983810639925025
1 0.8829413121641994 0.6736099406985548
2 0.08481403911274055 0.8763565782167897
3 0.4240924267302395 0.17952897668958967
4 0.07004482230698317 0.3053835868163095
5 0.561770047921889 0.07266376764817684
6 0.1336645571467754 0.6001427745345261
7 0.3645319056738773 0.4069711616970072
8 0.9586693497650042 0.1047024360961687
9 0.900327384985905 0.43899216261934604
[40 more rows]