Ksql: Define Access Pattern for Metadata Fields

Created on 4 Nov 2019 · 8Comments · Source: confluentinc/ksql

Background: https://github.com/confluentinc/ksql/pull/3722#issuecomment-549401032

Meta columns, i.e. currently only ROWTIME, are the only columns which aren't driven by the data, i.e. they are implicit columns. As KSQL is enhanced we want to expose more metadata, e.g. the partition number, the message offset and, eventually, headers. However, doing so using the current pattern of one column per metadata field, will result in breaking changes: if a user has data with a column named partition then their query will break if we release a new version of KSQL with a partition meta column.

To avoid this, we should consider moving all the metadata under a single METADATA column of type STRUCT. We are then free to add new fields to the struct without risk of breaking peoples queries.

Initially, it would just be STRUCT<timestamp BIGINT>, but can quickly become STRUCT<timestamp BIGINT, partition INT, offset BIGINT> etc.

breaking-change engine

Source

big-andy-coates

Most helpful comment

a random, but related!, thought is that the 'meta-columns' may be better
exposed as scalar functions rather than masquerading as columns - e.g.
ROWTIME(), PARTITION() and so on. C.f. how the ROWNUM()
function works in Oracle and derivative systems (although not quite the
same, cos that's scoped to a resultset). I guess this goes back to the
age-old argument of one person's metadata being another person's data...

On Mon, Nov 4, 2019 at 7:25 AM Andy Coates notifications@github.com wrote:

Background: #3722 (comment)
https://github.com/confluentinc/ksql/pull/3722#issuecomment-549401032

Meta columns, i.e. currently only ROWTIME, are the only columns which
aren't driven by the data, i.e. they are implicit columns. As KSQL is
enhanced we want to expose more metadata, e.g. the partition number, the
message offset and, eventually, headers. However, doing so using the
current pattern of one column per metadata field, will result in breaking
changes: if a user has data with a column named partition then their
query will break if we release a new version of KSQL with a partition
meta column.

To avoid this, we should consider moving all the metadata under a single
METADATA column of type STRUCT. We are then free to add new fields to the
struct without risk of breaking peoples queries.

Initially, it would just be STRUCT, but can quickly
become STRUCT etc.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/confluentinc/ksql/issues/3734?email_source=notifications&email_token=ABCXJIH4VR3ABH436E6SRV3QSA5HDA5CNFSM4JIVFJT2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HWUHFDA,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABCXJIBF4POMO4QKOHTABDTQSA5HDANCNFSM4JIVFJTQ
.

blueedgenick on 4 Nov 2019

👍3

All 8 comments

On Mon, Nov 4, 2019 at 7:25 AM Andy Coates notifications@github.com wrote:

Background: #3722 (comment)
https://github.com/confluentinc/ksql/pull/3722#issuecomment-549401032

Meta columns, i.e. currently only ROWTIME, are the only columns which
aren't driven by the data, i.e. they are implicit columns. As KSQL is
enhanced we want to expose more metadata, e.g. the partition number, the
message offset and, eventually, headers. However, doing so using the
current pattern of one column per metadata field, will result in breaking
changes: if a user has data with a column named partition then their
query will break if we release a new version of KSQL with a partition
meta column.

To avoid this, we should consider moving all the metadata under a single
METADATA column of type STRUCT. We are then free to add new fields to the
struct without risk of breaking peoples queries.

Initially, it would just be STRUCT, but can quickly
become STRUCT etc.

—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
https://github.com/confluentinc/ksql/issues/3734?email_source=notifications&email_token=ABCXJIH4VR3ABH436E6SRV3QSA5HDA5CNFSM4JIVFJT2YY3PNVWWK3TUL52HS4DFUVEXG43VMWVGG33NNVSW45C7NFSM4HWUHFDA,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABCXJIBF4POMO4QKOHTABDTQSA5HDANCNFSM4JIVFJTQ
.

blueedgenick on 4 Nov 2019

👍3

I much prefer having them as scalar functions - it makes it so much easier to reason about these columns! They're not in the schema but they're always accessible. This basically models them exactly as we wanted them, and if a user wants them in the schema they could always have SELECT ROWTIME() as ROWTIME or whatever they want.

agavra on 4 Nov 2019

Interesting. This is certainly worth considering.

Related discussion: https://github.com/confluentinc/ksql/pull/3722#issuecomment-549401032

cc @MichaelDrogalis @derekjn what do you think to the idea of using system functions to access metadata like ROWTIME, partition and offset, rather than them being implicitly part of the schema.

big-andy-coates on 6 Nov 2019

That's a pretty interesting idea. If we project out what a stream or table's schema will look like when we support all the pieces of data that a Kafka record has, the schema will be pretty wide by default. I think scalar function sense here, but @derekjn has the best experience to know the pros/cons.

MichaelDrogalis on 7 Nov 2019

I think that storing row-level metadata in a more flexible format is a _really_ good idea that will undoubtedly pay dividends moving forward as we evolve the system. It also seems to me that metadata _storage_ and _access_ are somewhat orthogonal though. We can store metadata as a Struct and access that metadata using a number of different approaches, e.g.:

Direct access to the Struct: METADATA->ROWTIME
Scalar functions: ROWTIME()

And we can also just use the existing column reference approach if we simply recognized a ROWTIME reference (or any other metadata column) as a special column that results in a METADATA Struct lookup.

As far as I know, ROWNUM, ROWID etc. in Oracle are actually more like psuedocolumns that are referenced as columns and not necessarily function calls (could be wrong though!). I'm sure they map to a function call internally but cosmetically they look and feel like columns to a user. PostgreSQL's version of this pattern is system columns, where every table has a bunch of metadata columns that may be referenced but aren't shown as part of a table's schema. System columns in PostgreSQL are also referenced like regular columns and not function calls.

I think both the scalar function call approach and the psuedocolumn approach are pretty good, although I'm not sure that using function call syntax adds much value for users. Are the extra () bytes helpful?

What does everyone think about simply mapping metadata column references (e.g. ROWTIME) to a METADATA Struct lookup internally? That approach has the additional benefit of being congruent with how KSQL currently feels, and it's arguably more consistent with what's happening internally (row-level value retrieval).

derekjn on 7 Nov 2019

Makes sense @derekjn

I think what's key here is that the meta columns should not be returned if a user does SELECT * FROM X. If they _want_ the metadata they can via SELECT *, ROWTIME FROM X;.

If we go down this route, then it's probably ok to keep all the meta columns flat, i.e. no nesting in a STRUCT, just ROWTIME ROWOFFSET ROWPARTITION. Headers might need more thought... here a set of functions _may_ make more sense, but we don't need to worry about that now.

big-andy-coates on 7 Nov 2019

👍1

If we go down this route, then it's probably ok to keep all the meta columns flat, i.e. no nesting in a STRUCT

I think the key thing is storing metadata columns in such a way that adding new meta columns doesn't break the existing storage layout.

derekjn on 7 Nov 2019

👍1

I think when we say "storage" of the record metadata we need to be careful
to qualify what we mean - the metadata is, after all, actually stored in a
Kafka record, in a very specific manner. What I think we are discussing
here is the "model" of these specific metadata items, and there are 2 users
of that model: the human ksql user, who cares about how to access and
manipulate it; and also the rest of the ksql codebase, which wants to
understand the implications of how it should manipulate that metadata under
certain circumstances.

It's a long time ago now (2.5 years already? gulp!) since we first decided
to have ROWTIME in it's current form and, if memory serves, I was
responsible for that decision. The intent at that time was for it to be a
pseudocolumn, just like Oracle ROWNUM, primarily so that (a) it doesn't
have to be declared in every DDL to be accessible; and (b) we wanted to be
able, in a future enhancement, to allow some "pseudocolumns" to be
explicitly updated by a query (obviously more applicable to timestamp and
headers than to offset or partition number!). I say this because we thought
at the time that making them accessible via scalar functions would make
value-changing more syntactically difficult (what 'column' do you assign
the result of the manipulation to?).

I mention all this because it seems that the driving factor in how we
represent these metadata values to the user depends largely on what we want
a user to be able to do with them.... If they are read-only in every
circumstance, then pseudocolumns (whether presented as being in a
pseudo-struct or not) are pretty much equivalent to scalar functions for
access, with the exception of potential naming clashes.
I think we created confusion around what happens in projections, especially
through joins and groupings, by taking a shortcut for short-term expediency
and making the ROWKEY accessible in a similar way as ROWTIME (and that was
definitely my fault too). I've always said that ROWKEY should go away as
soon as we can properly represent a subset of the normal fields of a
message as forming the key, and the confusion this has caused about what
pseudocolumns are for and how they should behave has unfortunately
compounded as it took us much longer than anticipated to fix up the
handling of key fields.

So in conclusion I want to advocate that:
1 - ROWKEY should burn in fire asap (on which I think we all agree);
2 - other metadata fields (time, headers, offset, partition) should be
accessible through ubiquitous pseudocolumns much like ROWTIME today.
Namespacing them into a special struct may reduce the potential for name
conflicts but not remover it, and that suggests to me that the added
complexity isn't worth it. There are better ways to deal with naming
conflicts via aliasing.
3 - in a future version, some of them should be capable of being assigned
(i.e. updated), which is slightly different than the way something like
ROWNUM works but will be handy for certain usecases. (Imagine an aggregate
function which can combine the entries of multiple maps, and how you could
then use this to combine the header values of multiple messages flowing
through a groupby, and then you could assign the result of that udaf to be
the headers of the output record. I've been asked for this a few times
already).

On Thu, Nov 7, 2019, 10:36 PM Derek Nelson notifications@github.com wrote:

If we go down this route, then it's probably ok to keep all the meta
columns flat, i.e. no nesting in a STRUCT

I think the key thing is storing metadata columns in such a way that
adding new meta columns doesn't break the existing storage layout.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/confluentinc/ksql/issues/3734?email_source=notifications&email_token=ABCXJIHCXPPDWXHXQF35DH3QSSJ6DA5CNFSM4JIVFJT2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDOCH5Y#issuecomment-551298039,
or unsubscribe
https://github.com/notifications/unsubscribe-auth/ABCXJIAKEUEDNDNYG43XYATQSSJ6DANCNFSM4JIVFJTQ
.