# With cuDF
import cudf
json_input_string = '{"field": "s3://path"}'
gdf = cudf.read_json(json_input_string, lines=True, engine="cudf")
# With Pandas
import pandas as pd
json_input_string = '{"field": "s3://path"}'
pdf = pd.read_json(json_input_string, lines=True)
The cuDF call throws ValueError: Protocol not known: {"field": "s3, while the same input succeeds with Pandas.
@kkraus14, is this easy to fix? It is blocking one of our pipelines from using the accelerated Kafka reader directly, so it would be great if you guys can look into this.
I don't know if it's easy to fix but we're also in the middle of starting to execute a large cuIO refactor so the timing is very non-ideal.
I'd defer to @vuule here for his thoughts.
I don't see the root cause yet, but it should take less than a day to fix. I would prefer to address this immediately.
I don't see the issue in a C++ test that should be equivalent to the posted steps:
std::string buffer = "{\"field\": \"s3://path\"}";
cudf_io::read_json_args in_args{cudf_io::source_info{buffer.c_str(), buffer.size()}};
in_args.lines = true;
cudf_io::table_with_metadata result = cudf_io::read_json(in_args);
EXPECT_EQ(result.tbl->num_columns(), 1);
EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
Trying Python test next.
Yup, fails as described with a Python test.
Although, it only fails when the field value contains "://", runs fine with only ":" or ":/".
Got it: this is probably an issue with Python recognizing the whole input as a path. Lemme see if it's easy to patch
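To illustrate why only "://" triggers the failure, here is a minimal sketch of protocol inference by splitting on "://". The helper `infer_protocol` is hypothetical and is a simplified stand-in, not fsspec's or cuDF's actual code; it only reproduces the prefix seen in the error message.

```python
def infer_protocol(urlpath):
    """Return the text before the first '://', or None if there is none.

    Hypothetical, simplified stand-in for URL protocol inference.
    """
    if "://" in urlpath:
        return urlpath.split("://", 1)[0]
    return None


# Only the value containing "://" yields a (bogus) protocol.
for value in ("s3://path", "s3:/path", "s3:path"):
    line = '{"field": "%s"}' % value
    print(repr(value), "->", repr(infer_protocol(line)))
```

Under this assumption, the whole JSON line up to "://" is taken as the protocol name, which matches the `{"field": "s3` fragment in the reported ValueError.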
@chinmaychandak I have a workaround: if you wrap the input data in a StringIO, it runs fine! Does this unblock you?
@kkraus14 this is a Python issue. str input is treated as potentially being a filepath, while StringIO and BytesIO are assumed to be data buffers. We should either iron out this behavior or clearly document the parameter meaning vs. type.
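The str versus buffer distinction described above could be sketched roughly as follows. `read_json_lines` is a hypothetical reader written with the standard library only; it is not how cuDF or Pandas implement `read_json`, and it assumes that a str containing "://" is always treated as a URL.

```python
import io
import json


def read_json_lines(source):
    """Hypothetical JSON-lines reader that dispatches on the input type."""
    if isinstance(source, (io.StringIO, io.BytesIO)):
        # Buffer objects are assumed to hold the raw data itself.
        text = source.read()
        if isinstance(text, bytes):
            text = text.decode("utf-8")
    elif "://" in source:
        # A plain str containing "://" is assumed to be a URL.
        protocol = source.split("://", 1)[0]
        raise ValueError("Protocol not known: %s" % protocol)
    else:
        # Any other str is assumed to be a local file path.
        with open(source) as f:
            text = f.read()
    return [json.loads(line) for line in text.splitlines() if line.strip()]


data = '{"field": "s3://path"}'
rows = read_json_lines(io.StringIO(data))  # wrapping in StringIO avoids the path check
```

In this sketch the same string fails when passed directly and parses when wrapped in `io.StringIO`, which is the workaround suggested in the thread.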
if you wrap the input data in a StringIO, it runs fine! Does this unblock you?
I see. Unfortunately, that won't work, since the data is being read using custreamz.kafka which uses read_json internally, so no chance of modifying the data before it hits read_json. :(
Thanks @vuule. I'll dig in on the Python side.
FYI: as of Pandas 1.1.0 this throws the same error as in cudf if fsspec is installed:
In [12]: test = '{"field": "s3://path"}'
In [13]: pd.read_json(test)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-13-2df30c944ee1> in <module>
----> 1 pd.read_json(test)
~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
197 else:
198 kwargs[new_arg_name] = new_arg_value
--> 199 return func(*args, **kwargs)
200
201 return cast(F, wrapper)
~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
294 )
295 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296 return func(*args, **kwargs)
297
298 return wrapper
~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
592 compression = infer_compression(path_or_buf, compression)
593 filepath_or_buffer, _, compression, should_close = get_filepath_or_buffer(
--> 594 path_or_buf, encoding=encoding, compression=compression
595 )
596
~/miniconda3/envs/dev/lib/python3.7/site-packages/pandas/io/common.py in get_filepath_or_buffer(filepath_or_buffer, encoding, compression, mode, storage_options)
220 try:
221 file_obj = fsspec.open(
--> 222 filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
223 ).open()
224 # GH 34626 Reads from Public Buckets without Credentials needs anon=True
~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/core.py in open(urlpath, mode, compression, encoding, errors, protocol, newline, **kwargs)
397 newline=newline,
398 expand=False,
--> 399 **kwargs
400 )[0]
401
~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/core.py in open_files(urlpath, mode, compression, encoding, errors, name_function, num, protocol, newline, auto_mkdir, expand, **kwargs)
248 storage_options=kwargs,
249 protocol=protocol,
--> 250 expand=expand,
251 )
252 if "r" not in mode and auto_mkdir:
~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
558 "share the same protocol"
559 )
--> 560 cls = get_filesystem_class(protocol)
561 optionss = list(map(cls._get_kwargs_from_urls, urlpath))
562 paths = [cls._strip_protocol(u) for u in urlpath]
~/miniconda3/envs/dev/lib/python3.7/site-packages/fsspec/registry.py in get_filesystem_class(protocol)
181 if protocol not in registry:
182 if protocol not in known_implementations:
--> 183 raise ValueError("Protocol not known: %s" % protocol)
184 bit = known_implementations[protocol]
185 try:
ValueError: Protocol not known: {"field": "s3
We could handle this on the cuDF side, but given Pandas seems to have the same broken behavior I'm wondering if the fix should be upstream to not pass a Python string as input.
I believe the problem lies in this call:
https://github.com/rapidsai/cudf/blob/5fbffb6688f8df7a395f54fa552a615a70346b12/python/cudf/cudf/utils/ioutils.py#L963
From the context it looks like a passed string might be a path, whereas buffer types (BytesIO, StringIO) are assumed to be raw data.
IMO it makes sense to require a buffer input for raw data, just not sure about performance overhead from something like StringIO(a_str_object).
The problem is actually in this call: https://github.com/rapidsai/cudf/blob/branch-0.16/python/cudf/cudf/utils/ioutils.py#L965-L967
Because it has a :// in it, fsspec thinks it's a protocol.
We can work around it by wrapping the fsspec call in a try except for now.
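A rough sketch of that try/except idea is below. Everything in it is an assumption for illustration: `lookup_filesystem`, `resolve_source`, and `KNOWN_PROTOCOLS` are hypothetical names, and the lookup is a stand-in for the fsspec call, not the code in either PR.

```python
import io

# Illustrative set only; a real registry would be much larger.
KNOWN_PROTOCOLS = {"file", "s3", "gs", "http", "https"}


def lookup_filesystem(urlpath):
    """Hypothetical stand-in for the filesystem lookup that raises."""
    protocol = urlpath.split("://", 1)[0]
    if protocol not in KNOWN_PROTOCOLS:
        raise ValueError("Protocol not known: %s" % protocol)
    return protocol


def resolve_source(path_or_data):
    """Try the URL interpretation first; fall back to raw data on failure."""
    if isinstance(path_or_data, str) and "://" in path_or_data:
        try:
            return ("url", lookup_filesystem(path_or_data))
        except ValueError:
            # Not a recognizable URL, so treat the string as the data itself.
            return ("data", io.StringIO(path_or_data))
    return ("unchanged", path_or_data)


kind, source = resolve_source('{"field": "s3://path"}')
```

One trade-off of this approach is that catching ValueError broadly can hide unrelated failures from the lookup, which is part of why it reads as a stopgap rather than a clean fix.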
Opened PR https://github.com/rapidsai/cudf/pull/6082, which fixes the issue as filed in a pretty hacky way.
@kkraus14 is this PR along the lines of what you have in mind?
Looks like we came to about the same solution 😆. I think I covered a bit more of the edge case where we get a different ValueError, so should we go with #6081?
Ah, I didn't realize you're making the change too. I'll close my PR.