A CTAS is like a materialized view. Yet, we allow an insert values into the sink topic of a CTAS, e.g. (RQTT test):
{
"name": "ctas",
"statements": [
"CREATE SOURCE test (id INT) WITH (kafka_topic='test_topic', value_format='JSON');",
"CREATE TABLE other AS SELECT COUNT(1) AS COUNT FROM TEST GROUP BY ROWKEY;",
"INSERT INTO other (ROWTIME, ROWKEY, COUNT) VALUES (1234, 'x', 10);"
],
"inputs": [
],
"outputs": [
{"topic": "OTHER", "timestamp": 1234, "x": "x", "value": {"ID": 10}}
]
}
We should NOT be allowing this, because we're not updating the aggregate state-store at the same time. In the above example we're INSERTing the value 10 for key x. But if a new message was to come in to test with key x, then KSQL will write key-> {"COUNT":1} to the sink topic, where the update should be key->{"COUNT":11}.
If we think of other as a materialized view, (and we should change or syntax too: #3773), then it makes sense that users can insert into other... it's a view!
The same is true for test in the example above. If the topic already exists we must assume there is some other system publishing to it, e.g. Connect dumping a postgres table. Again, allowing users to write to the topic is semantically incorrect. Any update we write will be overwritten by the source the next it has an update to that key.
As discussed in #3773, we should consider switching CREATE TABLE -> CREATE VIEW for tables where the topic is populated by systems outside of KSQL's control.
However, the scope of this bug is to disable the inserts for tables KSQL doesn't own. (Though we may need to split CREATE VIEW from CREATE TABLE to enable this).
If we _really_ want to keep insert into table, i.e. into materialized view, then we should consider disallowing by default, but having the error message explain the pitfalls and providing syntax to FORCE it through.
TBH, I'm also not convinced we should be supporting insert values into topics KSQL doesn't own/create. This is especially true of tables. Consider two KSQL clusters, where a sink topic from one is being used as a source for the other. This would suffer from the same problem as above: we may support producing to the topic, but the upstream processing isn't going to respect this new value, it's just going to overwrite it on the next update.
I feel similar for streams. If the source topic was created by another system then we should treat it as immutable, (including not being able to delete).
We should still allow inserts into streams and tables KSQL created, eg. where the users supplies WITH(PARTITIONS=x) and the underlying topic did not already exist. We should track if the topic was created by KSQL in the message written to the command topic.
In fact we should _only_ allow inserts into tables that are backed by topics KSQL created and which have no queries writing into them. The same argument probably exists for streams.
I've been thinking about this a lot in context of #3799, and a big question is defining the "ownership model" (see discussion on #3585). A simple "I own what I created" model breaks down pretty easily (again, see #3585), and this isn't an easy problem to solve (that's what ACL systems handle, and we can't rely on those in all environments). I definitely see the concern of letting it run wild as it can seriously mess up the state of things (especially tables, like you point out) and I agree we should figure out a good way to limit that feature.
On the other sign of the coin, I've found INSERT VALUES an immensely powerful development tool to prod at the system and see what happens. In fact, it's so powerful that I use it as a drop in replacement for the console-producer (which I struggle with every time) - so I don't think we shouldn't allow it to produce to topics KSQL didn't create.
My 2c and not any good suggestion...
+1 to the reported bug.
But @big-andy-coates also started a slightly larger discussion in https://github.com/confluentinc/ksql/issues/3954#issuecomment-557614760. Perhaps we should move this elsewhere, e.g. into a new issue? I have a few comments to share, too, but don't want to distract from the actual bug report here.
Also, what's interesting is that INSERT INTO does not work when you try it against a table in ksqlDB 0.6, even though @big-andy-coates test above suggests the opposite:
Failed to prepare statement: INSERT INTO can only be used to insert into a stream. USERS_ORIGINAL is a table.
This was originally reported by @mikebin elsewhere.
@miguno - could the distinction here be between INSERT VALUES (single record) - which _does_ work with tables in ksqlDB 0.6.0 - and INSERT INTO (streaming), which causes the error message?
Yes, that's right @mikebin!
Yep this is specifically about INSERT VALUES not INSERT INTO.
I think the same is true for tables imported into KSQL by CT statements: we should not allow users to insert rows into them.
Consider the situation where Connect is used to pull a table from Postgres into a table and then imported into KSQL using a CREATE TABLE command. At the moment INSERT VALUES will allow the user to produce new records to this topic. However, such updates will be overwritten if/when the source Postgres row is updated. So allowing users to insert into the table is semantically incorrect. Correct updates require the user to update the _source_ table, i.e. the one in post-gres. (See #3773, where I mention switch CT to CREATE VIEW for the cases where KSQL does not own the topic - the use of the term VIEW makes it clear that it is not updatable)
The only time we should allow inserts into tables is if KSQL created, (i.e. owns), the underlying topic. In such a situation KSQL manages the source TABLE and can have inserts with correct semantics, (though we should probably differentiate between INSERT and UPDATE).
On the other sign of the coin, I've found INSERT VALUES an immensely powerful development tool to prod at the system and see what happens. In fact, it's so powerful that I use it as a drop in replacement for the console-producer (which I struggle with every time) - so I don't think we shouldn't allow it to produce to topics KSQL didn't create.
I love this feature too! And I'm sure others do. However, I think we need to limit it to only allowing users to insert into streams, (which have no per-key update semantics), and tables that KSQL owns. If KSQL doesn't own the table than its actually a view.
Introducing CREATE MATERIALIZED VIEW as proposed in KLIP-11 would address this issue to some extent as it make clear that the table is read-only.
Copied from klip-11: https://github.com/confluentinc/ksql/pull/3799#issuecomment-559825880
Thinking more on the table vs mv discussion, I think @mjsax and @PeterLindner are close to being in alignment with the following, but writing it out to clarify my thoughts and get it down in.
Assumption:
- A stream is a stream of facts. Each fact stands alone. Hence it is possible to have a stream with many people writing to it.
- A table has per-key update semantics. It is therefore crucial that there is control over how a table is updated. In terms of KSQL and changelog topics, this means only a single writer per partition. Allowing arbitrary writing to a table leads to undesirable/undefined behavior, e.g. if two writers updating a row, the later writer will overwrite the first writers update.
- A materialized view is like a table, except its data is computed from upstream source(s). Updates to the view are not allowed. If you want to update the view, you need to update the source(s).
Proposal:
Split current
CREATE TABLEstatement in two:
CREATE MATERIALIZED VIEW: imports an existing topic containing a changelog. It is assumed the changelog is updated by some upstream entity. KSQL does not own the topic, will not allow writes to it and will not delete it.CREATE TABLE: creates a new topic or _explicitly_ takes ownership of the topic containing a changelog into KSQL. It is any assumed any pre-existing topic is _not_ updated by an another system. If it is updated by another system the behaviour is _undefined_. KSQL owns the topic. writes are allowed and KSQL willl automatically delete the topic if/when the table is dropped. This is essentially a 'static' rdbs style table. We should not allow persistent queries to write to it.Split
CREATE TABLE AS SELECTstatement in two:
CREATE MATERIALIZED VIEW AS SELECT: creates a new changelog topic, failing if the topic already exists, (current behaviour). KSQL owns the topic, but as its a view updates to it are not allowed. KSQL will delete the topic if the view is dropped.CREATE TABLE AS SELECT: future work: not proposed for now. Would create a static rdbs style table, which KSQL owns. Updates allowed. KSQL will delete on drop.IMHO this is clean and intuitive. Views do not support INSERT/UPDATE/DELETE, tables do. Users can be explicit about how they want KSQL to manage the changelog topic.
Switching to
MATERIALIZED VIEWs also sets us up in the future to support non-materializedVIEWS, which would allow people to define a set of views and then use them to build a final materialized output, i.e. it would allow users to avoid materializing intermediate steps like they have to today.Extending this to streams:
Spit
CREATE STREAMin two:
CREATE STREAM VIEW: likeCREATE MV, _imports_ a topic containing a stream into KSQL. KSQL does not own the topic. updates are not allowed. KSQL will not delete the topic.CREATE STREAM: likeCREATE TABLE: creates a new topic, or takes ownership of existing. KSQL owns topic and will delete it on a drop. Inserts only will be allowed. As streams don't need a single writer, inserts by other systems are supported.- Split
CREATE STREAM AS SELECTin two:CREATE STREAM VIEW AS SELECT: creates a read-only stream derived from some source(s). KSQL owns the topic. Inserts not allowed. KSQL will delete the topic on drop.CREATE STREAM AS SELECT: creates a stream, (and topic), from a one-time query. Owned by KSQL. Inserts allowed. Will be deleted on drop.
Most helpful comment
Yep this is specifically about
INSERT VALUESnotINSERT INTO.I think the same is true for tables imported into KSQL by CT statements: we should not allow users to insert rows into them.
Consider the situation where Connect is used to pull a table from Postgres into a table and then imported into KSQL using a
CREATE TABLEcommand. At the momentINSERT VALUESwill allow the user to produce new records to this topic. However, such updates will be overwritten if/when the source Postgres row is updated. So allowing users to insert into the table is semantically incorrect. Correct updates require the user to update the _source_ table, i.e. the one in post-gres. (See #3773, where I mention switch CT toCREATE VIEWfor the cases where KSQL does not own the topic - the use of the termVIEWmakes it clear that it is not updatable)The only time we should allow inserts into tables is if KSQL created, (i.e. owns), the underlying topic. In such a situation KSQL manages the source
TABLEand can have inserts with correct semantics, (though we should probably differentiate betweenINSERTandUPDATE).I love this feature too! And I'm sure others do. However, I think we need to limit it to only allowing users to insert into streams, (which have no per-key update semantics), and tables that KSQL owns. If KSQL doesn't own the table than its actually a view.