Cudf: [BUG] JSON reader throws error when ":" is present in data.

Created on 19 Aug 2020 · 14 Comments · Source: rapidsai/cudf

# With cuDF
import cudf 
json_input_string = '{"field": "s3://path"}'
gdf = cudf.read_json(json_input_string, lines=True, engine="cudf")

# With Pandas
import pandas as pd 
json_input_string = '{"field": "s3://path"}'
pdf = pd.read_json(json_input_string, lines=True)

The cuDF call throws ValueError: Protocol not known: {"field": "s3, but the Pandas call succeeds.

bug cuIO

All 14 comments

@kkraus14, is this easy to fix? It is blocking one of our pipelines from using the accelerated Kafka reader directly, so it would be great if you guys can look into this.

I don't know if it's easy to fix but we're also in the middle of starting to execute a large cuIO refactor so the timing is very non-ideal.

I'd defer to @vuule here for his thoughts.

I don't see the root cause yet, but it should take less than a day to fix. I would prefer to address this immediately.

I don't see the issue in a C++ test that should be equivalent to the posted steps:

  std::string buffer = "{\"field\": \"s3://path\"}";
  cudf_io::read_json_args in_args{cudf_io::source_info{buffer.c_str(), buffer.size()}};
  in_args.lines                       = true;
  cudf_io::table_with_metadata result = cudf_io::read_json(in_args);

  EXPECT_EQ(result.tbl->num_columns(), 1);
  EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);

Trying Python test next.

Yup, fails as described with a Python test.
Although, it only fails when the field value contains "://", runs fine with only ":" or ":/".

Got it: this is probably an issue with Python recognizing the whole input as a path. Lemme see if it's easy to patch

@chinmaychandak I have a workaround: if you wrap the input data in a StringIO, it runs fine! Does this unblock you?

@kkraus14 this is a Python issue. str input is treated as potentially being a filepath, while StringIO and BytesIO are assumed to be data buffers. We should either iron out this behavior or clearly document the parameter meaning vs. type.
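To make the str-versus-buffer distinction concrete, here is a minimal sketch using only the standard library. The helper describe_source is hypothetical and is not cuDF's actual dispatch code; it only illustrates why wrapping the same text in a StringIO removes the ambiguity.

```python
import io


def describe_source(path_or_data):
    """Hypothetical helper: report how a reader might interpret its input.

    This is an illustration of the behavior described in the thread,
    not the logic cuDF or Pandas actually use.
    """
    # In-memory buffers are unambiguous: they can only hold data.
    if isinstance(path_or_data, (io.StringIO, io.BytesIO)):
        return "buffer"
    # A plain str is ambiguous: it may be a file path, a URL, or raw JSON.
    if isinstance(path_or_data, str):
        return "path-or-data"
    raise TypeError("unsupported input type: %r" % type(path_or_data))


raw = '{"field": "s3://path"}'
wrapped = io.StringIO(raw)  # same text, but now clearly a data buffer

print(describe_source(raw))      # path-or-data
print(describe_source(wrapped))  # buffer
```

The wrapped object carries exactly the same characters as the original string, so the workaround changes only how the reader classifies the input, not the data itself.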

if you wrap the input data in a StringIO, it runs fine! Does this unblock you?

I see. Unfortunately, that won't work, since the data is being read using custreamz.kafka which uses read_json internally, so no chance of modifying the data before it hits read_json. :(

Thanks @vuule. I'll dig in on the Python side.

FYI: as of Pandas 1.1.0 this throws the same error as in cudf if fsspec is installed:

In [12]: test = '{"field": "s3://path"}'

In [13]: pd.read_json(test)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-13-2df30c944ee1> in <module>
----> 1 pd.read_json(test)

~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200
    201         return cast(F, wrapper)

~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297
    298         return wrapper

~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
    592     compression = infer_compression(path_or_buf, compression)
    593     filepath_or_buffer, _, compression, should_close = get_filepath_or_buffer(
--> 594         path_or_buf, encoding=encoding, compression=compression
    595     )
    596

~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
    220         try:
    221             file_obj = fsspec.open(
--> 222                 filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
    223             ).open()
    224         # GH 34626 Reads from Public Buckets without Credentials needs anon=True

~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/core.py in open(urlpath, mode, compression, encoding, errors, protocol, newline, **kwargs)
    397         newline=newline,
    398         expand=False,
--> 399         **kwargs
    400     )[0]
    401

~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/core.py in open_files(urlpath, mode, compression, encoding, errors, name_function, num, protocol, newline, auto_mkdir, expand, **kwargs)
    248         storage_options=kwargs,
    249         protocol=protocol,
--> 250         expand=expand,
    251     )
    252     if "r" not in mode and auto_mkdir:

~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
    558                     "share the same protocol"
    559                 )
--> 560         cls = get_filesystem_class(protocol)
    561         optionss = list(map(cls._get_kwargs_from_urls, urlpath))
    562         paths = [cls._strip_protocol(u) for u in urlpath]

~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/registry.py in get_filesystem_class(protocol)
    181     if protocol not in registry:
    182         if protocol not in known_implementations:
--> 183             raise ValueError("Protocol not known: %s" % protocol)
    184         bit = known_implementations[protocol]
    185         try:

ValueError: Protocol not known: {"field": "s3

We could handle this on the cuDF side, but given Pandas seems to have the same broken behavior I'm wondering if the fix should be upstream to not pass a Python string as input.

I believe the problem lies in this call:
https://github.com/rapidsai/cudf/blob/5fbffb6688f8df7a395f54fa552a615a70346b12/python/cudf/cudf/utils/ioutils.py#L963
From the context it looks like a passed string might be a path, whereas buffer types (BytesIO, StringIO) are assumed to be raw data.
IMO it makes sense to require a buffer input for raw data, just not sure about performance overhead from something like StringIO(a_str_object).

The problem is actually in this call: https://github.com/rapidsai/cudf/blob/branch-0.16/python/cudf/cudf/utils/ioutils.py#L965-L967

Because it has a :// in it, fsspec thinks it's a protocol.
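A rough sketch of how that misdetection can arise, assuming the protocol is taken to be everything before the first "://". The function guess_protocol is a simplified stand-in written for illustration, not fsspec's real implementation.

```python
def guess_protocol(text):
    """Simplified imitation of URL-style protocol detection.

    Assumption: the protocol is whatever precedes the first "://".
    This is not fsspec's real code.
    """
    if "://" in text:
        protocol, _rest = text.split("://", 1)
        return protocol
    return None  # no "://" marker, so nothing looks like a protocol


# The JSON payload from the report is split at the "://" inside the value.
print(guess_protocol('{"field": "s3://path"}'))  # {"field": "s3

# With only ":" or ":/" there is no "://" marker, so nothing is detected.
print(guess_protocol('{"field": "s3:/path"}'))   # None
print(guess_protocol('{"field": "a:b"}'))        # None
```

The first result matches the text in the reported error message, and the other two cases are consistent with the earlier observation that only "://" triggers the failure.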

We can work around it by wrapping the fsspec call in a try except for now.
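The try/except idea could look roughly like the sketch below. Everything here is hypothetical: lookup_filesystem stands in for the fsspec call, KNOWN_PROTOCOLS is an illustrative set, and resolve_source is not the code from either PR.

```python
# Illustrative set of protocols; a real registry would be much larger.
KNOWN_PROTOCOLS = {"file", "s3", "http", "https"}


def lookup_filesystem(urlpath):
    """Stand-in for the filesystem lookup; raises ValueError like the traceback."""
    protocol = urlpath.split("://", 1)[0] if "://" in urlpath else "file"
    if protocol not in KNOWN_PROTOCOLS:
        raise ValueError("Protocol not known: %s" % protocol)
    return ("path", urlpath)


def resolve_source(path_or_data):
    """Try to treat the string as a path or URL; fall back to raw data."""
    try:
        return lookup_filesystem(path_or_data)
    except ValueError:
        # The lookup rejected the string, so assume it is the data itself.
        # A real implementation would also check that a local path exists.
        return ("raw", path_or_data)


print(resolve_source("s3://bucket/data.json"))   # ('path', 's3://bucket/data.json')
print(resolve_source('{"field": "s3://path"}'))  # ('raw', '{"field": "s3://path"}')
```

Catching the exception keeps genuine URLs working while letting JSON text that merely contains "://" pass through as data, which is why the thread describes it as a stopgap and not a clean separation of paths from buffers.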

Opened PR https://github.com/rapidsai/cudf/pull/6082, which fixes the issue as filed in a pretty hacky way.

@kkraus14 is this PR along the lines of what you have in mind?

Looks like we came to about the same solution 😆. I think I covered a bit more of the edge case where we get a different ValueError, so should we go with #6081?

Ah, I didn't realize you're making the change too. I'll close my PR.
