Cudf: [FEA] Add support for unsigned types in the ORC reader/writer

Created on 1 Jun 2020 · 10 Comments · Source: rapidsai/cudf

#5292 will add support for unsigned types in libcudf columns.

The ORC reader and writer code will need to be updated to support these new types: UINT8, UINT16, UINT32, and UINT64.

The specification https://orc.apache.org/specification/ORCv1/ indicates that unsigned and signed type values are stored differently:

_For signed integer types, the number is converted into an unsigned number using a zigzag encoding. Zigzag encoding moves the sign bit to the least significant bit using the expression (val « 1) ^ (val » 63) and derives its name from the fact that positive and negative numbers alternate once encoded._
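To make the quoted rule concrete, here is a minimal Python sketch of 64-bit zigzag encoding, reading the spec's shifts as `(val << 1) ^ (val >> 63)`. The function names `zigzag_encode` and `zigzag_decode` are illustrative only and are not taken from the ORC or libcudf code.

```python
MASK64 = (1 << 64) - 1  # keep results within 64 bits, since Python ints are unbounded


def zigzag_encode(val: int) -> int:
    """Map a signed 64-bit integer to an unsigned 64-bit integer."""
    if not -(1 << 63) <= val < (1 << 63):
        raise ValueError("val must fit in a signed 64-bit integer")
    # Python's >> on a negative int is an arithmetic shift, so (val >> 63)
    # is 0 for non-negative values and -1 (all ones) for negative values.
    return ((val << 1) ^ (val >> 63)) & MASK64


def zigzag_decode(enc: int) -> int:
    """Invert zigzag_encode: the lowest bit carries the sign."""
    return (enc >> 1) ^ -(enc & 1)


if __name__ == "__main__":
    for v in (0, -1, 1, -2, 2):
        print(v, "->", zigzag_encode(v))
```

Small magnitudes of either sign map to small unsigned values (0, -1, 1, -2, 2 become 0, 1, 2, 3, 4), which is what makes the subsequent variable-length encoding compact.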

cuIO feature request libcudf

All 10 comments

It doesn't seem like ORC supports unsigned types. This is from their implementation.

The specification https://orc.apache.org/specification/ORCv1/ indicates that unsigned and signed type values are stored differently

This is referring to how values are run-length encoded, as there are some cases where a value that needs to be encoded is known to be unsigned (e.g., the size of something). However, I do not see anything in the schema definition section of the ORC spec that would indicate how an unsigned type would be encoded as the data type of a column.

Without the ability to describe an unsigned type in the schema, interoperability between apps will not work. For proper interop we would need to encode it in a supported superset type (e.g., INT64, 33-bit decimal type, etc.). This also implies unsigned types would not be loaded directly from ORC files, since they cannot be described directly in the file schema.
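As a rough illustration of the superset-type idea, the sketch below widens each unsigned width to an ORC type that can hold its full range. The mapping table and the helper `to_orc_value` are assumptions made for this example, not part of libcudf or the ORC libraries; the only facts relied on are that ORC has 16-, 32- and 64-bit signed integer types and a decimal type, and that the largest UINT64 value has 20 decimal digits.

```python
from decimal import Decimal

# Assumed mapping: unsigned bit width -> an ORC type wide enough for its range.
ORC_SUPERSET = {
    8: "smallint",        # 16-bit signed holds 0..255
    16: "int",            # 32-bit signed holds 0..65535
    32: "bigint",         # 64-bit signed holds 0..2**32 - 1
    64: "decimal(20,0)",  # 2**64 - 1 needs 20 decimal digits
}


def to_orc_value(value: int, bits: int):
    """Return (orc_type, value) for an unsigned value of the given width."""
    if bits not in ORC_SUPERSET:
        raise ValueError(f"unsupported unsigned width: {bits}")
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} unsigned bits")
    orc_type = ORC_SUPERSET[bits]
    # Only the 64-bit case needs a non-integer carrier type.
    return (orc_type, Decimal(value)) if bits == 64 else (orc_type, value)


if __name__ == "__main__":
    print(to_orc_value(255, 8))
    print(to_orc_value(2**64 - 1, 64))
```

The cost of this approach is the one noted above: a reader sees the wider type in the schema, so the column does not come back as unsigned without extra information.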

If we restrict the use case to libcudf reading and writing, we could make round-tripping unsigned types through the ORC format work via libcudf-specific user metadata in the footer. For example, we could write some metadata in the footer describing which columns are actually unsigned rather than signed types and load them accordingly.
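The round trip described here could look roughly like the following toy sketch. A plain dict stands in for the ORC file and its footer user metadata; the key name `libcudf.unsigned_columns` and the functions `write_table` and `read_table` are hypothetical and do not correspond to any real libcudf or ORC API.

```python
import struct

UNSIGNED_KEY = "libcudf.unsigned_columns"  # hypothetical metadata key


def as_signed64(u: int) -> int:
    """Reinterpret the bits of an unsigned 64-bit value as signed."""
    return struct.unpack("<q", struct.pack("<Q", u))[0]


def as_unsigned64(s: int) -> int:
    """Reinterpret the bits of a signed 64-bit value as unsigned."""
    return struct.unpack("<Q", struct.pack("<q", s))[0]


def write_table(columns: dict, unsigned: set) -> dict:
    """Store unsigned columns as signed bit patterns and record their names."""
    data = {
        name: [as_signed64(v) if name in unsigned else v for v in values]
        for name, values in columns.items()
    }
    return {"data": data, "user_metadata": {UNSIGNED_KEY: ",".join(sorted(unsigned))}}


def read_table(file: dict) -> dict:
    """Restore unsigned columns using the names recorded in the metadata."""
    names = file["user_metadata"].get(UNSIGNED_KEY, "")
    unsigned = set(names.split(",")) if names else set()
    return {
        name: [as_unsigned64(v) if name in unsigned else v for v in values]
        for name, values in file["data"].items()
    }


if __name__ == "__main__":
    table = {"a": [0, 2**64 - 1], "b": [-5, 7]}
    stored = write_table(table, {"a"})
    print(stored["data"]["a"])  # what a reader unaware of the metadata would see
    print(read_table(stored) == table)
```

A reader that ignores the metadata key sees only the signed bit patterns in `stored["data"]`, which is the interoperability risk raised in this thread.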

However, the custom metadata approach would only work properly for libcudf, and other applications attempting to load this data could end up misinterpreting it. They will also likely misinterpret the min/max values in the stripe statistics, e.g. interpreting a max value as negative and wondering why the reported max value is less than the reported min.
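The statistics problem can be reproduced with plain integer arithmetic. This sketch assumes a writer that computes min/max over the unsigned values and stores their 64-bit patterns, and a reader that interprets those patterns as signed; `signed_view_of_stats` is an illustrative name, not an existing function.

```python
def reinterpret_as_int64(u: int) -> int:
    """Read an unsigned 64-bit bit pattern as a two's-complement signed value."""
    return u - (1 << 64) if u >= (1 << 63) else u


def signed_view_of_stats(values):
    """Min/max as written from unsigned data, then as seen by a signed reader."""
    written_min, written_max = min(values), max(values)
    return reinterpret_as_int64(written_min), reinterpret_as_int64(written_max)


if __name__ == "__main__":
    column = [1, 2**63 + 5]  # unsigned min is 1, unsigned max is 2**63 + 5
    seen_min, seen_max = signed_view_of_stats(column)
    print(seen_min, seen_max, seen_max < seen_min)
```

Any unsigned value at or above 2**63 turns negative under the signed reading, so the reader's "max" can fall below its "min", which could also break statistics-based filtering.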

Because of the potential interop issues, I personally would be OK if libcudf refused to write unsigned type columns, requiring the application to cast them to an ORC-supported type.

@kkraus14 @randerzander is there any demand for unsigned type support in ORC reader/writer from the Python side?

I'm guessing ORC doesn't support unsigned types because the main implementation is Java based which doesn't natively support unsigned integers.

@jlowe how does Spark write unsigned integers into ORC? From what I can tell Impala / Hive don't support unsigned integers with ORC and Arrow doesn't have an ORC writer.

how does Spark write unsigned integers into ORC?

By treating them as signed integers. Spark does not have an unsigned integer type.

Spark does not have an unsigned integer type.

Did not know this! Thanks.

@harrism sounds like we shouldn't prioritize this unless someone comes around and gives the explicit encoding expected with a reasoning for the encoding.

For what it's worth, the informal recommendation for Hive users with questions about unsigned ints has been to use Decimal types instead, since they can represent an even higher number of digits.

We are considering resolving this issue by creating a PR to throw an exception if a user tries to create an ORC file with unsigned integer columns. Or should we just leave the issue open in case someone wants us to support this?

Throwing and closing this issue sounds like a great idea from my perspective. If someone comes back with a proposal as to how to do it we can reopen and discuss.

The writer already throws "Unsupported ORC type" when passed an unsigned column. And it's unlikely the reader would ever encounter unsigned columns in ORC if no one's writing them.
