Cudf: Converting dataframe created from numpy array to GDF has contiguity issues

Created on 17 Aug 2018 · 1 comment · Source: rapidsai/cudf

  • [x] I am using the latest version of PyGDF from conda or built from master.
  • [x] I have included the following environment details:
    Linux Distro, Linux Kernel, GPU Model
  • [x] I have included the following version information for:
    Arrow, CUDA, Numpy, Pandas, Python
  • [x] I have included below a minimal working reproducer (if you are unsure how
    to write one see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).
Operating System:
Linux version 3.10.0-862.9.1.el7.x86_64
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"

GPU:  Tesla V100

CUDA 9.2

Python 3.5.5

Arrow: N/A
pandas==0.20.3
numpy==1.14.5

Converting a pandas dataframe created from a numpy array to a GDF raises: ValueError: Array contains non-contiguous buffer and cannot be transferred as a single memory region. Please ensure contiguous buffer with numpy .ascontiguousarray().

The underlying numpy array appears to be C contiguous, and recasting with np.ascontiguousarray as suggested in the traceback doesn't resolve the error, since the array is already in C-contiguous order. When the same data is written to disk (with the .to_csv method) and loaded back directly into a dataframe, the issue doesn't come up; in that case the underlying array read by pd.read_csv is FORTRAN contiguous. However, converting the in-memory array to FORTRAN-contiguous order doesn't solve it either: it results in a different error, TypeError: __getitem__ on type 0 is not supported.

The following code snippet results in the ValueError listed above:

import pandas as pd
import numpy as np
import pygdf


arr1 = np.random.sample([5000, 10])
df = pd.DataFrame(arr1)

gdf = pygdf.DataFrame.from_pandas(df)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-36-a66f55e3e62f> in <module>()
      7 df = pd.DataFrame(arr1)
      8 
----> 9 gdf = pygdf.DataFrame.from_pandas(df)

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in from_pandas(cls, dataframe)
   1045         # Set columns
   1046         for colk in dataframe.columns:
-> 1047             df[colk] = dataframe[colk].values
   1048         # Set index
   1049         return df.set_index(dataframe.index.values)

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in __setitem__(self, name, col)
    199             self._cols[name] = self._prepare_series_for_add(col)
    200         else:
--> 201             self.add_column(name, col)
    202 
    203     def __delitem__(self, name):

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in add_column(self, name, data, forceindex)
    416             raise NameError('duplicated column name {!r}'.format(name))
    417 
--> 418         series = self._prepare_series_for_add(data, forceindex=forceindex)
    419         self._cols[name] = series
    420 

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in _prepare_series_for_add(self, col, forceindex)
    391         The prepared Series object.
    392         """
--> 393         col = self._sanitize_columns(col)
    394         empty_index = isinstance(self._index, EmptyIndex)
    395         series = Series(col)

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in _sanitize_columns(self, col)
    347            col values
    348         """
--> 349         series = Series(col)
    350         if len(self) == 0 and len(self.columns) > 0 and len(series) > 0:
    351             ind = series.index

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/series.py in __init__(self, data, index)
     64             data = data._column
     65         if not isinstance(data, columnops.TypedColumnBase):
---> 66             data = columnops.as_column(data)
     67 
     68         if index is not None and not isinstance(index, Index):

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/columnops.py in as_column(arbitrary)
    149             return datetime.DatetimeColumn.from_numpy(arbitrary)
    150         else:
--> 151             return as_column(Buffer(arbitrary))
    152     else:
    153         return as_column(np.asarray(arbitrary))

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/buffer.py in __init__(self, mem, size, capacity, categorical)
     33         if capacity is None:
     34             capacity = size
---> 35         self.mem = cudautils.to_device(mem)
     36         _BufferSentry(self.mem).ndim(1)
     37         self.size = size

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/cudautils.py in to_device(ary)
     15 
     16 def to_device(ary):
---> 17     dary, _ = cuda._auto_device(ary)
     18     return dary
     19 

/conda/envs/gdf/lib/python3.5/site-packages/numba/cuda/api.py in _auto_device(ary, stream, copy)
    319 
    320 def _auto_device(ary, stream=0, copy=True):
--> 321     return devicearray.auto_device(ary, stream=stream, copy=copy)
    322 
    323 

/conda/envs/gdf/lib/python3.5/site-packages/numba/cuda/cudadrv/devicearray.py in auto_device(obj, stream, copy)
    644                 copy=False,
    645                 subok=True)
--> 646             sentry_contiguous(obj)
    647             devobj = from_array_like(obj, stream=stream)
    648         if copy:

/conda/envs/gdf/lib/python3.5/site-packages/numba/cuda/cudadrv/devicearray.py in sentry_contiguous(ary)
    618 
    619         else:
--> 620             raise ValueError(errmsg_contiguous_buffer)
    621 
    622 

ValueError: Array contains non-contiguous buffer and cannot be transferred as a single memory region. Please ensure contiguous buffer with numpy .ascontiguousarray()

However, the following code snippet works as expected: after reading the dataframe back into memory from disk, the from_pandas method succeeds.

import pandas as pd
import numpy as np
import pygdf

arr1 = np.random.sample([5000, 10])
df = pd.DataFrame(arr1)
df.to_csv('temp.csv')

df = pd.read_csv('temp.csv')
gdf = pygdf.DataFrame.from_pandas(df)
gdf

Initially, the array is C contiguous, and using np.ascontiguousarray doesn't change the flags.

arr1 = np.random.sample([5000, 10])
print(arr1.flags, '\n')
print(np.ascontiguousarray(arr1).flags)

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False 

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : True
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False

However, after reading the csv file from disk into a pandas dataframe, the flags of the underlying array show it is FORTRAN contiguous.

df = pd.read_csv('temp.csv')
df.values.flags
  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False
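The likely mechanism (my reading, not stated anywhere in the traceback): a column sliced out of a C-contiguous 2D array is a strided view, whereas a column of a FORTRAN-contiguous array is one contiguous slice of memory. A minimal numpy-only sketch:

```python
import numpy as np

arr = np.random.sample([5000, 10])  # C-contiguous by default

# A single column of a C-contiguous 2D array is a strided view:
# consecutive elements are 10 * 8 = 80 bytes apart, so the buffer
# cannot be transferred to the GPU as a single memory region.
col = arr[:, 0]
print(col.flags['C_CONTIGUOUS'])  # False

# In Fortran (column-major) order each column is stored contiguously,
# which is consistent with the pd.read_csv case working.
fcol = np.asfortranarray(arr)[:, 0]
print(fcol.flags['C_CONTIGUOUS'])  # True
```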

Simply converting the original array to FORTRAN-contiguous order results in a different error.

import pandas as pd
import numpy as np
import pygdf


arr1 = np.random.sample([5000, 10])
df = pd.DataFrame(np.asfortranarray(arr1))

gdf = pygdf.DataFrame.from_pandas(df)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-47-748e296ff81e> in <module>()
      7 df = pd.DataFrame(np.asfortranarray(arr1))
      8 
----> 9 gdf = pygdf.DataFrame.from_pandas(df)

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in from_pandas(cls, dataframe)
   1047             df[colk] = dataframe[colk].values
   1048         # Set index
-> 1049         return df.set_index(dataframe.index.values)
   1050 
   1051     def to_records(self, index=True):

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in set_index(self, index)
    323             df = DataFrame()
    324             for k in self.columns:
--> 325                 df[k] = self[k].set_index(index)
    326             return df
    327 

/conda/envs/gdf/lib/python3.5/site-packages/pygdf-0.1.0a2+293.g69f8656-py3.5.egg/pygdf/dataframe.py in __getitem__(self, arg)
    190         else:
    191             msg = "__getitem__ on type {!r} is not supported"
--> 192             raise TypeError(msg.format(arg))
    193 
    194     def __setitem__(self, name, col):

TypeError: __getitem__ on type 0 is not supported
Labels: bug, cuDF (Python), numpy, pandas

Most helpful comment

I noticed some interesting behavior as well. If we create a pandas dataframe from an underlying ndarray that is C contiguous, the dataframe's .values array is C contiguous, but each individual column within the dataframe is not.

import pandas as pd
import numpy as np
import pygdf

arr1 = np.random.sample([50, 2])
df = pd.DataFrame(arr1)
print(df.values.flags,'\n')
for col in df.columns:
    print(df[col].flags,'\n')

Produces the following:

  C_CONTIGUOUS : True
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False 

  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False 

  C_CONTIGUOUS : False
  F_CONTIGUOUS : False
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False 

Converting the initial numpy array to F-contiguous order results in each column of the dataframe being C contiguous.

import pandas as pd
import numpy as np
import pygdf

arr1 = np.random.sample([50, 2])
arr1 = np.asfortranarray(arr1)
df = pd.DataFrame(arr1)
print(df.values.flags,'\n')
for col in df.columns:
    print(df[col].flags,'\n')

Output

  C_CONTIGUOUS : False
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False 

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False 

  C_CONTIGUOUS : True
  F_CONTIGUOUS : True
  OWNDATA : False
  WRITEABLE : True
  ALIGNED : True
  WRITEBACKIFCOPY : False
  UPDATEIFCOPY : False 

It seems like individual columns (series) within the dataframe should be C contiguous for the pygdf dataframe creation to work.
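If that diagnosis holds, one workaround (my own sketch, not verified against pygdf) would be to hand over a C-contiguous copy of each column instead of the strided view; np.ascontiguousarray always returns a C-contiguous array, and copying a single 1-D column is cheap:

```python
import numpy as np
import pandas as pd

arr1 = np.random.sample([50, 2])
df = pd.DataFrame(arr1)

# Make a C-contiguous copy of every column. The pygdf.DataFrame
# construction itself is omitted here, so this only demonstrates
# that the copies would satisfy the contiguity check.
contig = {col: np.ascontiguousarray(df[col].values) for col in df.columns}
for name, values in contig.items():
    print(name, values.flags['C_CONTIGUOUS'])
```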

The second observation is that creating dataframes with numeric column names does not seem to work. Extending the same example above, we can get rid of the TypeError: __getitem__ on type 0 is not supported error by casting the columns to string.

import pandas as pd
import numpy as np
import pygdf

arr1 = np.random.sample([50, 2])
arr1 = np.asfortranarray(arr1)
df = pd.DataFrame(arr1)
df.columns = df.columns.astype('U2')  # cast the numeric column names to strings
print(df.columns)
gdf = pygdf.DataFrame.from_pandas(df)
print(gdf)

Output

Index(['0', '1'], dtype='object')
                     0                   1
 0 0.43086811049897766 0.30983810639925025
 1  0.8829413121641994  0.6736099406985548
 2 0.08481403911274055  0.8763565782167897
 3  0.4240924267302395 0.17952897668958967
 4 0.07004482230698317  0.3053835868163095
 5   0.561770047921889 0.07266376764817684
 6  0.1336645571467754  0.6001427745345261
 7  0.3645319056738773  0.4069711616970072
 8  0.9586693497650042  0.1047024360961687
 9   0.900327384985905 0.43899216261934604
[40 more rows]
