If I create a stream that starts a query in the background, e.g.
CREATE STREAM S AS SELECT c1, c2 FROM T;
Then I should be able to drop this stream _without_ needing to _find_ the query Id and terminate the query. e.g. I should be able to do:
DROP STREAM S;
This currently complains, due to inter-stream referential integrity checking. (The query is seen as a source to stream 'S', so 'S' can't be dropped while the query is running).
The issue here is that the query id is non-deterministic, being based on an internal counter. So its not possible to submit a script to the rest api that drops and recreates a stream, as you can't add suitable TERMINATE call.
The complexity comes about when we have INSERT INTO S queries running into the same stream/table. I think it is OK/correct in such a situation that the DROP STREAM S to complain. However, we should have a way of overriding this so that even that query is stopped.
I would propose that DROP STREAM S should automatically stop any query that was started as part of the original CREATE STREAM S AS statement. But the presence of other queries that insert _into_ the stream should cause it to fail, unless the user requests us to stop them, e.g. we could support something like:
-- use if you want any other source queries to automatically be stopped
FORCE DROP STREAM S;
The DROP command will also fail if this stream is used as a source for other queries. This should continue to cause the DROP to fail.
I guess an alternative would be some syntax to allow your to terminate a queries based on the stream/table they output too, e.g.
-- terminate all queries that output to stream S.
TERMINATE ALL FOR STREAM S;
This approach differs from having a straight DROP STREAM S automatically stopping the query started by its CSAS statement in that with the above TERMINATE FOR STREAM syntax the user can't choose to drop _only if the only query running is the one auto created during the CSAS statement_.
I prefer the implicit terminate approach because it differentiates between those queries created during the CSAS/CTAS and those that weren't. So you can do this:
CREATE STREAM S1 AS SELECT * FROM FOO;
-- This should work, automatically stopping the query created in the line above
DROP STREAM S1;
CREATE STREAM S2 AS SELECT * FROM FOO;
INSERT INTO S2 SELECT * FROM BAR;
-- Drop should fail as there are other queries that write to S2.
DROP STREAM S2;
-- a forced drop should still succeed:
FORCE DROP STREAM S2;
CREATE STREAM S2 AS SELECT * FROM FOO;
CREATE STREAM S3 AS SELECT * FROM S2;
-- Drop should fail as as this would break the query behind S3;
DROP STREAM S2;
-- Drop should _still_ fail as as this would break the query behind S3;
FORCE DROP STREAM S2;
Ephemeral / temporary queries, (e.g. from a user issuing a SELECT * FROM BLAH; statement in the CLI), should _not_ be included when determining if an entity can be dropped, i.e. an ephemeral query should _not_ stop an entity being dropped.
To do so would be the same as not allowing a DB Admin of a RDBMS system to not be able to drop a table if _any_ system was currently selecting from it.
Such ephemeral queries should terminate and, ideally, report a suitable message to the user.
Duplicate of #1317?
Related to #2112, which is more about a CASCADE DROP command to drop a hierarchy of statements.
@big-andy-coates: Thanks for putting this together. At large, this make sense to me.
Could you please clarify the behavior for a stream's/table's dependencies:
What's the desired behavior for queries/streams/tables that write to the [stream/table]-to-be-dropped when using a normal DROP? In comparison, what's the desired behavior for a forced DROP? This is, as you already alluded to above, not just about the [stream/table]-to-be-dropped itself but also what happens to these related queries/streams/tables.
What's the desired behavior for streams/tables/queries that read from the [stream/table]-to-be-dropped? In comparison, what's the desired behavior for a forced DROP? This is, as you already alluded to above, not just about the [stream/table]-to-be-dropped itself but also what happens to these related queries/streams/tables.
As you pointed out above, we also have the related ticket https://github.com/confluentinc/ksql/issues/2112 (~ DROP CASCADE).
My questions here:
I think your argument is that any "collateral damage" of this ticket (beyond terminating its own query from a CSAS/CTAS) should be limited to what I listed as (1) further above in this comment, whereas https://github.com/confluentinc/ksql/issues/2112 is more about (2)?
Regarding existing RDBMS: What's their behavior for DROP TABLE in the face of (1) and (2)? Will a DROP TABLE ... CASCADE, if supported (see below), cover both (1) and (2), or just either of the two?
Regarding the syntax: I'd prefer an approach based on DROP, too, rather than sth like TERMINATE ALL FOR STREAM S.
For what it's worth, here's what some RDBMS have chosen to implement.
PostgreSQL (docs):
Uses DROP TABLE myTable and DROP TABLE myTable CASCADE. DROP TABLE myTable is actually an alias for DROP TABLE myTable RESTRICT.
CASCADE
Automatically drop objects that depend on the table (such as views), and in turn all objects that depend on those objects (see Section 5.13).RESTRICT
Refuse to drop the table if any objects depend on it. This is the default.
MySQL (docs):
Considers DROP TABLE ... [RESTRICT | CASCADE] as valid syntax to make migration from other DBs like PostgreSQL easier, but itself it doesn't support this functionality.
The RESTRICT and CASCADE keywords do nothing. They are permitted to make porting easier from other database systems.
SQL Server:
From what I understand it doesn't support the DROP syntax directly. Instead, it says:
DROP TABLE cannot be used to drop a table that is referenced by a FOREIGN KEY constraint. The referencing FOREIGN KEY constraint or the referencing table must first be dropped. If both the referencing table and the table that holds the primary key are being dropped in the same DROP TABLE statement, the referencing table must be listed first.
Multiple tables can be dropped in any database. If a table being dropped references the primary key of another table that is also being dropped, the referencing table with the foreign key must be listed before the table holding the primary key that is being referenced.
When a table is dropped, rules or defaults on the table lose their binding, and any constraints or triggers associated with the table are automatically dropped. If you re-create a table, you must rebind the appropriate rules and defaults, re-create any triggers, and add all required constraints.
Note: There's also a related option, which is the ON DELETE CASCADE clause when you are defining the table (DDL), but its use case is different because it deletes records from any child tables but it doesn't cause SQL Server to delete the child tables when doing a DROP.
@miguno
I thought I'd made that pretty clear in my description - apparently not!
In answer to Q1:
DROP will only stop the query that was originally created with the stream/table. See 'Straight create / drop example' example above. It will fail if any other queries read or write to the stream/table.FORCE DROP will stop any query that write to the stream/table, i.e. it will also stop any queries started by INSERT INTO statements. See 'With insert into example' example above.In answer to Q2:
DROP and FORCE DROP will fail if there are downstream queries that read from the stream. (We'd need to support CASCADE DROP for that).INSERT INTO X SELECT * FROM Y _reads_ from Y. So it's not just CTAS / CSAS statements.Regarding Cascade drop:
I think your argument is that any "collateral damage" of this ticket (beyond terminating its own query from a CSAS/CTAS) should be limited to what I listed as (1) further above in this comment, whereas #2112 is more about (2)?
Yeah, [FORCE] DROP only affects things writing to this stream/table. CASCADE DROP would include other streams/tables built from this one. But as I said, they are complementary.
Regarding existing RDBMS: What's their behavior for DROP TABLE in the face of (1) and (2)? Will a DROP TABLE ... CASCADE, if supported (see below), cover both (1) and (2), or just either of the two?
I guess the closest thing an RDBMS will have is the cascade dropping of views built of a table. Which I guess is similar to having a hierarchy of streams and tables. So, like KSQL, they'll allow users to drop the leaves of the dependency graph, (.i.e. those things that have nothing reading from them), and will use a cascade drop to delete everything downstream of the entity being dropped. This is the same as (2).
RDMBS's obviously don't have long running queries like KSQL, so (1) doesn't apply. Because (1) is all about _queries_, not tables/streams.
From what I understand it [SQL Server] doesn't support the DROP syntax directly
I read that differently: it is supported, but you need to start at the leaves first. i.e. if you have table b with a foreign key to table a, then you need to drop table b before you can drop table a. This is just referential integrity of the dbs metadata.
Just having a chat with @hjafarpour about this and it struck me that this solution doesn't help when it comes to INSERT INTO commands.
Affectively, what most people want to do in a script is _drop and recreate_ on each statement. The above discussion focuses on _drop and recreate_ of CSAS/CSAT statements. But won't work for INSERT INTO statements. To drop and recreate the insert into statement I can't drop the INSERT INTO as it doesn't exist. I need to terminate the underlying query and then reissue my insert statement. But in a script I don't _know_ the query Id so can't so it.
So how does a user terminate and restart an insert into query?
INSERT INTO D SELECT a, b, func(c) FROM S;
???
Maybe what we need is a way for users to query KSQL for query ids!
SELECT QueryId FROM RUNNING_QUERIES WHERE SOURCE = 'S' AND DESTINATION = 'D' AND SQL CONTAINS 'Insert Into';
Then we can add the ability to select into a variable and/or subqueries:
TERMINATE ALL (SELECT QueryId FROM RUNNING_QUERIES WHERE SOURCE = 'S' AND DESTINATION = 'D' AND SQL CONTAINS 'Insert Into');
Thoughts?
Thanks for the follow-up, Andy.
RDMBS's obviously don't have long running queries like KSQL, so (1) doesn't apply. Because (1) is all about queries, not tables/streams.
Yeah. Handling INSERT INTO when DROP'ing a table wouldn't be an issue for RDBMS because the INSERT INTO isn't continuously running like KSQL's INSERT INTO.
And that's why then, in this ticket:
From what I understand it [SQL Server] doesn't support the DROP syntax directly
I read that differently: it is supported, but you need to start at the leaves first. i.e. if you have table b with a foreign key to table a, then you need to drop table b before you can drop table a. This is just referential integrity of the dbs metadata.
Oh, yes, of course you can do the manual work of first DROPing the dependencies, and then DROP'ing the actual table. I was talking about the cascading DROP variant, which isn't supported by SQL Server (and also not supported by MySQL).
Affectively, what most people want to do in a script is drop and recreate on each statement. The above discussion focuses on drop and recreate of CSAS/CSAT statements. But won't work for INSERT INTO statements. To drop and recreate the insert into statement I can't drop the INSERT INTO as it doesn't exist. I need to terminate the underlying query and then reissue my insert statement. But in a script I don't know the query Id so can't so it.
I think this is a valid concern, but: This ticket is about dropping tables/streams (CTAS/CSAS), and that we are making the case that the user should optionally be able to force the termination of any INSERT INTO queries that are writing into the table/stream.
Your new concern above is about how users could stop/restart (drop and recreate) an INSERT INTO statement in isolation, i.e. without touching the target stream or table.
Right?
Your new concern above is about how users could stop/restart (drop and recreate) an INSERT INTO statement in isolation, i.e. without touching the target stream or table.
I'm just realising that any solution we come up with for CSAS / CTAS should also be generic enough to work with INSERT INTO. Otherwise we're only solving half the problem.
The only solution that I can think of right now that can handle CSAS and CTAS and INSERT INTO is some queryable metadata that allows users to get at the queryids.
@hjafarpour @rodesai @dguy @apurvam @vcrfxia @rmoff can you think of any other statements we need to worry about? What's your opinion?
Late to the convo but here're my thoughts:
FORCE DROP STREAM S;
I think it would be better to retain DROP as the verb, and have FORCE / CASCADE / WITH EXTREME PREJUDICE operators after the object name
TERMINATE ALL FOR STREAM S;
This would still leave the stream object and make it a two-command process. I can't see a scenario where you'd want to do this, and not just the whole thing in one go
CREATE STREAM S2 AS SELECT * FROM FOO;
CREATE STREAM S3 AS SELECT * FROM S2;
We should still support an option to drop a stream _and all child dependencies_. This would happen when redeploying a pipeline. So maybe we want CASCADE here? I realise that this is touching on #2112 but we don't want to miss a trick by arbitrarily separating them just because it's two GH issues to start with.
Thinking about it, what's the reason not to automagically clean up queries if a drop is requested against a stream? I mean, should we default to "FORCE DROP" (i.e. drop a stream also terminates dependent queries)? Keep the simplest syntax for the most common operation, and then introduce a NO TERMINATE flag if we want to retain the query?
@rmoff
Thinking about it, what's the reason not to automagically clean up queries if a drop is requested against a stream? I mean, should we default to "FORCE DROP" (i.e. drop a stream also terminates dependent queries)? Keep the simplest syntax for the most common operation, and then introduce a NO TERMINATE flag if we want to retain the query?
I think we _should_ default to automatically terminating the _implicit_ query created when the stream was created, when it is later dropped. However, if you're suggesting we should also terminate _downstream_ queries, then I'd disagree - the reason we added referential integrity in the first place was to stop users dropping a query a down stream query relies on.
What about the following flow then:
DROP STREAM s and DROP TABLE t (optional syntax: DROP STREAM s RESTRICT): Deletes the stream/table and its "implicit" query, and nothing more.DROP STREAM s CASCADE and DROP TABLE t CASCADE: Same as DROP RESTRICT above, but additionally also deletes any downstream streams/tables, including their respective "implicit" queries.The one open question would then be how to handle any INSERT INTO statements for both (1) and (2), notably because for (2) the downstream streams/tables might have their own INSERT INTOs.
It seems we have the following things to worry about when dropping a stream or table:
INSERT INTO statements.If any of the above exist then, at the moment, the DROP will fail.
So maybe it's enough to just have CASCADE handle both 2 & 3?
@big-andy-coates :
So maybe it's enough to just have CASCADE handle both 2 & 3?
As I mentioned in my previous comment, I think CASCADE should certainly cover (3), and my question was whether it should also cover (2). Personally, I think CASCADE should cover (2) as well, and it should also cover the recursive scenario where any downstream streams/tables have their own INSERT INTOs.
Now, regarding UX, it's obvious that a DROP CASCADE is rather impactful and potentially dangerous -- a human mistake can be costly. Should we provide any functionality like a dry-run option to explain to the user what exactly would happen if the DROP CASCADE would be carried out? Think: "First, do a dry-run DROP CASCADE. Then, if the results look good to you, you can run the real DROP CASCADE."
How is this handled in RBDMS that support CASCADE? Is there any such safety measure?
Oracle, for example, has a recycle bin concept.
I don't think we should be too nanny-state about this. DROP CASCADE should
do just that. Maybe we could have a dry run flag available for people who
are scared? ;) But mandating it feels like not such a nice workflow.
On Mon, 10 Dec 2018 at 15:54, Michael G. Noll notifications@github.com
wrote:
@big-andy-coates https://github.com/big-andy-coates :
So maybe it's enough to just have CASCADE handle both 2 & 3?
As I mentioned in my previous comment, I think CASCADE should certainly
cover (3), and my question was whether it should also cover (2).
Personally, I think CASCADE should cover (2) as well, and it should also
cover the recursive scenario where any downstream streams/tables have their
own INSERT INTOs.Now, regarding UX, it's obvious that a DROP CASCADE is rather impactful
and potentially dangerous -- a human mistake can be costly. Should we
provide any functionality like a dry-run option to explain to the user what
exactly would happen if the DROP CASCADE would be carried out? Think:
"First, do a dry-run DROP CASCADE. If the results look good to you, you can
run the real DROP CASCADE."How is this handled in RBDMS that support CASCADE? Is there any such
safety measure?—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/confluentinc/ksql/issues/2177#issuecomment-445843190,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADgGHgKgDylDv9HK09X9aEyHWukZNJAxks5u3nWxgaJpZM4Ys8S_
.
Maybe we could have a dry run flag available for people who are scared? ;)
That's what I meant. A normal DROP CASCADE would "just do it". But cautious people might want to have the option to do a dry run first (via an optional flag).
IMHO, the minimum work would be DROP CASCADE without such a dry-run option, and the dry-run option would be the extra.
+1
On Tue, 11 Dec 2018 at 08:41, Michael G. Noll notifications@github.com
wrote:
Maybe we could have a dry run flag available for people who are scared? ;)
That's what I meant. A normal DROP CASCADE would "just do it". But
cautious people might want to have the option to do a dry run first (via an
optional flag).IMHO, the minimum work would be DROP CASCADE without such a dry-run
option, and the dry-run option would be the extra.—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
https://github.com/confluentinc/ksql/issues/2177#issuecomment-446102390,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADgGHmOPq9jnP_o_S9QKRJEe4Z2y805Yks5u32GlgaJpZM4Ys8S_
.
Sounds sensible. What would the dry run syntax look like?
DROP STREAM x CASCADE DRY RUN
???
+1 @big-andy-coates
DROP STREAM <name> [CASCADE | RESTRICT] [DRY RUN]
DROP TABLE <name> [CASCADE | RESTRICT] [DRY RUN]
https://github.com/confluentinc/ksql/issues/5987 looks to address this issue partially by terminating only the persistent query that belongs to the CAS statement. If other queries (inserts, cas, transient) are reading the stream/table then DROP will fail until any reference is terminated/dropped manually.
Marking https://github.com/confluentinc/ksql/issues/5987 as duplicate. I will use this issue to track the work as seems it is most popular feature to implement.
DROP STREAM already supports an optional DELETE TOPIC
https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/drop-stream/
Why not add another optional DELETE QUERY
@mingfang That would be useful too, but DROP STREAM is more natural and SQL standard than adding the DELETE QUERY.
The query running in the background for such stream/table is only used by the stream. A user shouldn't have to interact with the query. That's the idea of terminating the query automatically so users interact with the STREAM only (no queries).
I know I'm pretty late to the party, have no clues about the internals, not much knowledge about ksql itself, but an outsider's opinion :). I've read almost all tickets & PRs about TERMINATE ALL, DROP ... CASCADE, INSERT vs. UNION, and I have a personal issue with DROP S/T while a query is still running, which brought me here in the first place. So please, bear with me if I miss something.
IMHO you should take a step back and look from a higher level: The persistent query is not a first-class citizen in the KSQL language like streams or tables. I think this is the main reason why you have all these issues.
However, your queries are the R.E.A.L. powerhouse of ksql, giving us the ability to define our stream processors easily. To me, queries are the serverless functions of Kafka.
We have CREATE STREAM
We have CREATE TABLE
We should have CREATE QUERY
Imagine the possibilities:
CREATE QUERY _name_ AS SELECT
CREATE QUERY _name_ AS INSERT
CREATE QUERY _name_ AS SCRIPT
...
To me, an e.g. CREATE STREAM AS SELECT is nice as it consolidates 2 steps in one (create stream, create query) but I wouldn't bother so much about writing those two statements if I'd have to.
Btw. I personally think that DROP ... CASCADE is a bad style (also in SQL) and a dinosaur. If you want to drop entire functional areas in a production database you would do this rather carefully and step by step. If you want to do this in your dev database you start it from scratch and apply your migrations.
Most helpful comment
+1 @big-andy-coates