I am joining a stream with a table. The result of the join is only partially successful. Some of the values are accurately populated while others are null. I checked to make sure that the values exist in both the table and stream, and that the column used to join on is the key for both.
I'm using confluent to load the topics that are reading from tables in MSSQL using the jdbc.
Im then creating the stream and table from the corresponding topics using KSQL, and the data is all in JSON format.
The time stamps on the missing data in the table is older than that of the stream.
Any suggestions?
create stream casecode_contract_stream as select ct.projectid, ct.casecode, cs.isTrue from contract_stream cs left join casecode_table ct on cs.projectid = ct.projectid;
select * from casecode_contract_stream limit 1;
1532034321292 | 706083 | null | null | true
select * from casecode_contract_stream where casecode is not null limit 1;
1532034321292 | 705147 | 705147 | data1 | true
select * from casecode_table where projectid = 705147;
1532033878462 | 705147 | 705147 | data1
select * from casecode_table where projectid = 706083;
1532033878463 | 706083 | 706083 | data2
select * from contract_stream where projectid = 705147;
1532034321292 | 705147 | 705147 | true
select * from contract_stream where projectid = 706083;
1532034321292 | 706083 | 706083 | true
You may be hitting https://github.com/confluentinc/ksql/issues/1510 or https://github.com/confluentinc/ksql/issues/1405
Right now, there is a certain amount of non determinism in left joins, because there so guarantee fo the order in which messages are processed between the left and right side. It's either that, or the topic feeding your table is keyed incorrectly.
We took a different stream for a different topic that contains the 'projectid' column and joined it on the same table as it mentioned above. The results of which include the data that the join above was unable to join on. I do not see a difference between the two streams as far as the key that is being joined on, but for whatever reason we are still experiencing the issue described for that specific stream.
It appears as though this is an issue with the non determinism that you mention.
At this time, we've given up on using KSQL for these transformations. We're moving forward by having the JDBC connector ingest from a view in SQL that implements the join.
Hi, I have struggled with the exact same problem. With some help from confluent I was able to verify that joins are deterministic as long as you can guarantee the following sequence of events:
With this sequence carefully observed, all was fine. My test datasets are 1MIO table rows and 1MIO stream events. Initially only 280K joins worked and 720K didn't.
After applying the above order, 960K joins worked and 40K joins didn't which was exactly as expected - the test stream had some IDs not available in the table.
Of course, we'd all like joins to be fully deterministic, regardless of the order in which it gets set up and populated.
Thanks for the update @Eoan242
I am going to close this out since the issue is already tracked in #1510