Installation details
Scylla version (or git commit hash): 2.2.rc0-0.20180504.9aa172f
Cluster size: 1
OS (RHEL/CentOS/Ubuntu/AWS AMI): AWS AMI (CentOS 7.3)
Hardware details (for performance issues)
Platform (physical/VM/cloud instance type/docker): AWS i3.xlarge
We have been testing 2.2 on a test cluster and everything has been looking really good. We are eager to deploy it more widely.
However, tonight we got a strange error for a single row:
May 20 00:37:53 services scylla[32555]: [shard 2] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 22)
Selecting the row works fine but we get this error every time when trying to either update or delete:
cqlsh:veraminetest> select * from hosts where key = 1 and customerguid = 00000000-0000-1000-0000-000000000010 and systemguid = 6505035f-30fe-4698-98c0-757ded7cc987 and hostid = 46;
key | customerguid | systemguid | hostid | bitlockervolumesdisabled | bitlockervolumesenabled | bytesrecved | createdat | customerid | diskinformation | eventscount | groupid | groupname | groups | hostname | ipstr | lastcontact | lastdiagnosticsjson | lastdiagnosticstime | lastsessiondisconnecttime | lastsessionendtime | lastsessionstarttime | lastsystemuptime | lockconsoleonwake | osver | protocolbytecount | quarantined | sensorversion | sessionguid | uploadbytes | uploadcount | windowsfirewallstatus | wusummary
-----+--------------------------------------+--------------------------------------+--------+--------------------------+-------------------------+-------------+--------------------------+------------+-----------------------------------------------------------------------+-------------+---------+-----------+--------+----------+---------------------+--------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--------------------------+---------------------------+--------------------------+--------------------------+------------------+-------------------+---------------+-------------------+-------------+---------------+--------------------------------------+-------------+-------------+-----------------------+----------------
1 | 00000000-0000-1000-0000-000000000010 | 6505035f-30fe-4698-98c0-757ded7cc987 | 46 | null | null | null | 2018-05-10 03:34:33+0000 | 10 | [{diskname: 'C:\', disktype: 3, totalspace: 61437, freespace: 51941}] | null | null | null | null | sandbox | {'192.168.126.137'} | 2018-05-16 06:05:39+0000 | {"Timestamp":"202796244401","Time":"131706676184959653","Sequence":0,"SensorVersion":0,"SessionGuid":{"Data":""},"Service":{"Events":{"GeneratedEvents":766,"GeneratedEventErrors":0,"EventsByType":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,336,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,6,6,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,0,0,0,0,0,402],"EventBytesByType":["0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","101","0","7671","0","0","0","0","0","0","0","0","0","0","0","0","0","306","26","113","198","446","23","0","0","0","0","0","0","0","0","0","0","231","195","0","640","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","0","67","0","0","0","0","0","0","0","0","0","0","62538"]}},"SmbServer":{"Etw":{"ReceivedDataLength":"0","ReceivedEvents":0,"ProcessedDataLength":"0","ProcessedEvents":0,"ParsingErrors":0,"ParsingExceptions":0},"ActiveSmbSessions":0,"OpenSmbShares":0},"Dns":{"CachedEntries":0}} | 2018-05-13 06:46:58+0000 | 2018-05-16 06:05:39+0000 | 1970-01-01 00:00:00+0000 | 2018-05-13 06:47:10+0000 | 409386793 | True | Windows 7 SP1 | 8519584 | null | 17039367 | 7b6900aa-4f70-0ce6-3949-8a59a7540456 | 56090297 | 150 | Enabled | Not Configured
(1 rows)
cqlsh:veraminetest> update hosts set groupid = 21, groupname = 'My Default Group' where key = 1 and customerguid = 00000000-0000-1000-0000-000000000010 and systemguid = 6505035f-30fe-4698-98c0-757ded7cc987 and hostid = 46;
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for veraminetest.hosts - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:veraminetest> delete from hosts where key = 1 and customerguid = 00000000-0000-1000-0000-000000000010 and systemguid = 6505035f-30fe-4698-98c0-757ded7cc987 and hostid = 46;
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for veraminetest.hosts - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
I ran a repair on this host and it didn't seem to help. I also initiated a compaction of this table and that didn't seem to help either:
May 20 00:41:50 services scylla[32555]: [shard 1] compaction - Compacting [/var/lib/scylla/data/veraminetest/hosts-12c20330e1be11e7829b000000000000/veraminetest-hosts-ka-221-Data.db:level=0, /var/lib/scylla/data/veraminetest/hosts-12c20330e1be11e7829b000000000000/veraminetest-hosts-ka-217-Data.db:level=0, ]
May 20 00:41:50 services scylla[32555]: [shard 1] compaction - Compacted 2 sstables to [/var/lib/scylla/data/veraminetest/hosts-12c20330e1be11e7829b000000000000/veraminetest-hosts-ka-225-Data.db:level=0, ]. 68153 bytes to 46918 (~68% of original) in 106ms = 0.42MB/s. ~256 total partitions merged to 1.
As you can see, this table is tiny, a few dozen KB. We recently added two columns to the table via alter table... However, there have been no trouble updating any other row, it's just this one and its every time. What is the best way to troubleshoot this condition? For this test cluster, it's no big deal but we will need a procedure to resolve the issue in production. Thanks.
Jonathan
@duarten I suspect this is related to materialized views handling, can you confirm?
It doesn't ring a bell, but seems a likely suspect.
@veramine - can you be more specific - which alters have you executed ... we will try to reproduce (more than once) on our end.
I tried to repro this but couldn't. We only call cell_at in 3 places in the MV code, and they all look sane (they are used to fetch primary key columns of the base, or the regular column of the base in the view's PK, which at that point we ensure it exists in the update).
@veramine Do you recall what were the order of schema changes? What MVs were already created when the new columns were added? Which columns were those?
Also, if you could drop the MVs and retry the update, it would let us confirm the issue is within MV code.
Yes, I can confirm that this procedure resolved the issue:
1- Repro issue (it had "spread" to a second table row)
2- Drop each materialized view
3- Recreate each materailized view
4- Iterate over each row, re-writing each value to repopulate the materailzied view
5- Attempt same update
In this case, the update prior to the MV drop/recreate returned an error. After dropping and recreating the MV's, the update worked with no error.
I originally reproduced this with the exact query our application is making, setting a groupid and groupname on the row. However, I tested setting each independently. The update involving just the groupid worked fine. Updating just the groupname of either of those two problematic rows always returned an error. It's an interesting situation because groupname is actually NOT involved in any of the materialized views. Groupid is involved but groupname is not. You can see from the attached that groupname does not appear in any of the MV statements. I also saw errors in the application log that involved other columns also not participating in any of the MV's (protocolbytecount, uploadbytes, uploadcount, lastcontact, lastsessionendtime, lastsessiondisconnecttime).
Here was the order of operations involving the schema change:
First, our app is running and is making several updates per second to rows in hosts table. We are tracking state in this table so each row might get as many as 1 update every second.
1- alter table hosts add groupid int;
2- alter table hosts add groupname text
(..lots of CQL updates made programatically via gocql..)
3- CREATE MATERIALIZED VIEW hosts_groupid AS SELECT key, customerguid, systemguid, hostid, groupid, hostname FROM hosts WHERE key IS NOT NULL AND customerguid IS NOT NULL AND hostid IS NOT NULL AND systemguid IS NOT NULL AND groupid IS NOT NULL PRIMARY KEY (customerguid, groupid, systemguid, hostid, key);
(..lots more CQL updates made programatically via gocql.. everything is working fine..)
4- update hosts set groupid = ?, groupname = ? where key = ? and customerguid = ? and systemguid = ? and hostid = ?
I am just totally guessing here but I wonder if there is some situation that is exacerbated by a situation where 1) a table has several MV's (this host table has five), 2) it is UPDATED frequently, 3) add new columns without stopping the updates or inserts, 4) add new materialized view, 5) attempt to update a recently added column that is not present in any of the materialized views that was added after one of the materialized views was created.
We will watch our application logs to see if this returns. The fix above is sufficient for us for now. When I performed that fix, I did first drop and recreate the hosts_groupid MV, the one that involved the newly added column. After that drop and recreate, I tested the manual update and it was still broken. And then I dropped and recreated the other four older MV's that had been created prior to the addition of the groupid and groupname columns. It was the drop/create of one of those (not sure which, or maybe all of them) that caused the update to work again without error.
Hope that helps.
hostsschema.txt
Do you have the errors regarding those other columns?
Even if changing the schema while sending updates would triggers bugs, nothing could be so permanent.
I tried to repro again and failed (with 2.2.). We'll need a more sophisticated setup, I guess.
More technical notes:
The only thing that explains the behavior is indeed loading a value for which the column is null, and trying to fetch it. When we use a more recent schema to load an existing value, that column is null. But we should handle that, and we do, for the case where the null column is included in the view. We don't try to fetch unselected columns beyond iterating over them to calculate timestamps and the like. So my best guess is that we're missing a mutation::update() call somewhere, although I would expect reading with a particular schema would upgrade to it.
The other errors were similar, just different column ID's. For example:
May 20 14:09:52 services scylla[32555]: [shard 2] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 9)
May 20 14:10:00 services scylla[32555]: [shard 0] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 9)
and
May 21 12:44:55 services scylla[32555]: [shard 2] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 10)
May 21 12:46:20 services scylla[32555]: [shard 2] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 10)
Your technical notes are helpful for me. In response to that, anytime I add new columns, I will iterate through all rows and set those nulls to an empty value. Probably that will help us avoid these kind of issues.
Dropping and recreating all materialized views on this table seems to have caused the problem to disappear. But today it is back after I added two additional columns to the problematic 'hosts' table and then inserted a new row with some values not specified.
To recap, here is the full set of relevant events:
(recap of the above):
1 - Table with 20+ columns of various types, including collection types, and 4 materialized views
2 - Added two new columns, one of type text and one int
3 - Added a materialized view involving the int (not the text)
4 - Updates to a specific existing row (one of 25) began to fail when updating the new 'text' type column added in step 2 above
5 - Dropped and recreated the new materialized view that involved the newly added column. Did not alter the table to drop the newly added columns.
6 - Updates to the single existing row continued to fail when they involved the text column.
7 - At some point, a second row became affected in the same way.
8 - Dropped all materialized views on this table, re-added them all. Repopulated the views by iterating through the rows and updating each with their existing values
9 - Updates to all rows began to work again, no errors
(yesterday):
10 - Added two new columns to the hosts table, a text and a timestamp. Neither new field was involved with a materialized view (no changes to the MV's after adding the columns)
11 - Inserted a new row with several fields not populated in the initial insert statement
12 - Attempted to update a single text field, osver, on that new row. This operation failed with the same error as before. osver is not a new column and has been present for months.
cqlsh:veraminetest> update hosts set osver = 'Windows 7' where key = 1 and customerguid = 00000000-0000-1000-0000-000000000001 and systemguid = b4b66531-0e4a-0b54-3cc2-cbd5fbc7c9fb and hostid = 50;
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for veraminetest.hosts - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:veraminetest> update hosts set osver = '' where key = 1 and customerguid = 00000000-0000-1000-0000-000000000001 and systemguid = b4b66531-0e4a-0b54-3cc2-cbd5fbc7c9fb and hostid = 50;
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for veraminetest.hosts - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
We see a similar error as before in the scylla-server journal:
May 24 10:40:21 services scylla[7341]: [shard 3] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 24)
May 24 10:40:33 services scylla[7341]: [shard 3] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 24)
We don't see this issue on 2.1.3 even with our same read and write patterns using the same schema. We also have not added new columns to our 2.1.3 installation so that may be a factor in why we see this in 2.2rc0 and not 2.1.3. We will hold off on deploying 2.2rc1 for now to continue to observe this behavior on the single node test.
I just performed nodetool drain, systemctl stop scylla-server, systemctl start scylla-server and this problem has gone away again. We will try restarting the server after making schema changes to this table in order to avoid the issue.
Very odd. Next time this happens, it would be great if you try just doing a flush. Seems like we're storing an old schema somewhere, and rebooting refreshes all the schemas. I'll comb the code again to look for such a case.
I tried to repro again, but still couldn't; I did:
All works fine, on branch-2.2.
We're not storing an old schema anywhere :/
We hit this issue again today on a different table (groups2) and nodetool flush of just that one table and that did not seem to help:
$ nodetool flush -- veraminetest groups2
Here is what we see from database journalctl:
May 28 23:45:37 services scylla[23165]: [shard 2] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 8)
May 28 23:52:36 services scylla[23165]: [shard 3] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 8)
May 28 23:56:38 services scylla[23165]: [shard 3] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 8)
May 28 23:58:21 services scylla[23165]: [shard 3] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 8)
May 28 23:59:31 services scylla[23165]: [shard 2] compaction - Compacting [/var/lib/scylla/data/veraminetest/groups2-39faef905b2e11e8a2ea000000000002/veraminetest-groups2-ka-14-Data.db:level=0, /var/lib/scylla/data/veraminetest/groups2-39faef905b2e11e8a2ea000000000002/veraminetest-groups2-ka-10-Data.db:level=0, /var/lib/scylla/data/veraminetest/groups2-39faef905b2e11e8a2ea000000000002/veraminetest-groups2-ka-18-Data.db:level=0, /var/lib/scylla/data/veraminetest/groups2-39faef905b2e11e8a2ea000000000002/veraminetest-groups2-ka-6-Data.db:level=0, ]
May 28 23:59:31 services scylla[23165]: [shard 2] compaction - Compacted 4 sstables to [/var/lib/scylla/data/veraminetest/groups2-39faef905b2e11e8a2ea000000000002/veraminetest-groups2-ka-22-Data.db:level=0, ]. 76031 bytes to 52610 (~69% of original) in 110ms = 0.46MB/s. ~512 total partitions merged to 3.
May 28 23:59:48 services scylla[23165]: [shard 1] storage_proxy - exception during mutation write to 127.0.0.1: std::out_of_range (Column not found for id = 8)
Also tried nodetool flush of everything and that also did not seem to help. But I do think the nodetool flush was a good idea because this server has way too much memory for the load so it may be keeping everything in memory for a really long time.
One more clue for this new instance of the issue.. On this table, I haven't recently added any columns. However, I did drop a column yesterday. The row that is having trouble was already inserted at the time that I dropped the column. This table has four materialized views and the column that is causing the error is not involved in any of them. The particular column that is causing the error this time is an int, not a text as it had been previously.
cqlsh:veraminetest> select * from groups2 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
customerguid | groupid | autoactionid | autoassignmenthostnamepattern | binarypolicy | collectionpolicy | collectionpolicyid | concurrentupdates | configid | deceptionpolicy | deceptionpolicyid | filepathpolicyid | groupname | hostcount | incrementalconfigid | sensorfilepath | sensormd5 | sensorstate | sensorversion | sensorversionstring
--------------------------------------+---------+--------------+-------------------------------+--------------+--------------------+--------------------+-------------------+----------+-----------------+-------------------+------------------+--------------+-----------+---------------------+-------------------------------------------------------+----------------------------------+-------------+---------------+---------------------
00000000-0000-1000-0000-000000000001 | 51 | null | | Enabled | Maximum Monitoring | 232 | 10 | null | None | 0 | null | AWS test VMs | 2 | 9 | /uploads/updates/bca837f4b2ef9a2f74823cfa693f85a0.exe | bca837f4b2ef9a2f74823cfa693f85a0 | Dormant | 17039367 | 1.4.7
(1 rows)
cqlsh:veraminetest> update groups2 set concurrentupdates = 0 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for veraminetest.groups2 - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
@veramine
can you please provide us the system_schema tables (this should hold the persisted schema on the disk) - we can boot an instance with this schema information and check the schema id (similar to what will happen on reboot)
can you please run nodetool describe cluster (this will provide us the schema id in memory)
if they are not the same (the schema from the disk used on a new server, schema in memory) something is not working correctly.
we can also try and check the system_schema tables in memory for the missing field.
select * from system_schema.columns;
is the _missing_ column showing in the select ?
Sure. I've attached output of this command
$ cqlsh -e 'select * from system_schema.tables' > /tmp/sst.txt
And here is the nodetool describecluster output:
Cluster Information:
Name: veraminev2
Snitch: org.apache.cassandra.locator.SimpleSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
39223f1c-a446-3034-9349-836c46853e54: [127.0.0.1]
I don't see that guid in the result of the previous select:
jness@services:~$ grep 836c46853e54 /tmp/sst.txt
jness@services:~$ grep 39223f1c /tmp/sst.txt
jness@services:~$
The system_schema.columns select for this schema and this table name does not include the column that I recently dropped (filepathpolicypb blob). It does include the column that I am unable to update:
veraminetest | groups2 | concurrentupdates | NONE | 0x636f6e63757272656e7475706461746573 | regular | -1 | int
On Tue, May 29, 2018 at 11:07 AM, Veramine notifications@github.com wrote:
Sure. I've attached output of this command
$ cqlsh -e 'select * from system_schema.tables' > /tmp/sst.txt
And here is the nodetool describecluster output:
Cluster Information:
Name: veraminev2
Snitch: org.apache.cassandra.locator.SimpleSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
39223f1c-a446-3034-9349-836c46853e54: [127.0.0.1]I don't see that guid in the result of the previous select:
You will not see it - its something that scylla computes from all the
information in system_schema.
Can you please upload all the sstables in system_schema tar.gz
jness@services:~$ grep 836c46853e54 /tmp/sst.txt
jness@services:~$ grep 39223f1c /tmp/sst.txt
jness@services:~$The system_schema.columns select for this schema and this table name does
not include the column that I recently dropped (filepathpolicypb blob). It
does include the column that I am unable to update:veraminetest | groups2 | concurrentupdates | NONE | 0x636f6e63757272656e7475706461746573 | regular | -1 | int
sst.txt https://github.com/scylladb/scylla/files/2047286/sst.txt
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/scylladb/scylla/issues/3443#issuecomment-392689289,
or mute the thread
https://github.com/notifications/unsubscribe-auth/ADThCPTlpPmUCMWnv2eO1_-x_5M2jHOwks5t3QG3gaJpZM4UF_V7
.
Sure, here you go:
$ echo $report_uuid
F17D5596-BE3C-41C2-9C2D-50A82C75ADAD
$ curl -X PUT http://upload.scylladb.com/$report_uuid/system_schema.tgz -T system_schema.tgz
thanks
after loading your system_schema into a docker image I get
[root@3e57a5daeab3 data]# nodetool describecluster
Cluster Information:
Name: Test Cluster
Snitch: org.apache.cassandra.locator.SimpleSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
39223f1c-a446-3034-9349-836c46853e54: [172.17.0.2]
which is the same schema id the node is reporting.
and when I try to run your commands
select * from groups2 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
customerguid | groupid | autoactionid | autoassignmenthostnamepattern | binarypolicy | collectionpolicy | collectionpolicyid | concurrentupdates | configid | deceptionpolicy | deceptionpolicyid | filepathpolicyid | groupname | hostcount | incrementalconfigid | sensorfilepath | sensormd5 | sensorstate | sensorversion | sensorversionstring
--------------+---------+--------------+-------------------------------+--------------+------------------+--------------------+-------------------+----------+-----------------+-------------------+------------------+-----------+-----------+---------------------+----------------+-----------+-------------+---------------+---------------------
(0 rows)
cqlsh:veraminetest> update groups2 set concurrentupdates = 0 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
they pass
@veramine
Since the address in nodetool status is 127.0.0.1, and you have a single node listed - I want to verify - is this reproducing on a single ?
@duarten / @tgrabiec - would a memory dump help ?
Well, the issue is clearly related to us using an old schema somewhere. A dump would help if we could enumerate all schemas and see who references an old one.
Not sure if this matters but I found today that I could INSERT the same values that I was not able UPDATE.
cqlsh:veraminetest> select * from groups2 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
customerguid | groupid | autoactionid | autoassignmenthostnamepattern | binarypolicy | collectionpolicy | collectionpolicyid | concurrentupdates | configid | deceptionpolicy | deceptionpolicyid | filepathpolicyid | groupname | hostcount | incrementalconfigid | sensorfilepath | sensormd5 | sensorstate | sensorversion | sensorversionstring
--------------------------------------+---------+--------------+-------------------------------+--------------+--------------------+--------------------+-------------------+----------+-----------------+-------------------+------------------+--------------+-----------+---------------------+-------------------------------------------------------+----------------------------------+-------------+---------------+---------------------
00000000-0000-1000-0000-000000000001 | 51 | null | | Enabled | Maximum Monitoring | 232 | 10 | null | testp1 | 268 | null | AWS test VMs | 0 | 9 | /uploads/updates/739592aa476f96736b17b488a406fb54.exe | 739592aa476f96736b17b488a406fb54 | Dormant | 16973833 | 1.3.9
(1 rows)
cqlsh:veraminetest> update groups2 set concurrentupdates = 0 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for veraminetest.groups2 - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:veraminetest> insert INTO groups2 (customerguid, groupid, concurrentupdates) values (00000000-0000-1000-0000-000000000001, 51, 0);
cqlsh:veraminetest> select * from groups2 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
customerguid | groupid | autoactionid | autoassignmenthostnamepattern | binarypolicy | collectionpolicy | collectionpolicyid | concurrentupdates | configid | deceptionpolicy | deceptionpolicyid | filepathpolicyid | groupname | hostcount | incrementalconfigid | sensorfilepath | sensormd5 | sensorstate | sensorversion | sensorversionstring
--------------------------------------+---------+--------------+-------------------------------+--------------+--------------------+--------------------+-------------------+----------+-----------------+-------------------+------------------+--------------+-----------+---------------------+-------------------------------------------------------+----------------------------------+-------------+---------------+---------------------
00000000-0000-1000-0000-000000000001 | 51 | null | | Enabled | Maximum Monitoring | 232 | 0 | null | testp1 | 268 | null | AWS test VMs | 0 | 9 | /uploads/updates/739592aa476f96736b17b488a406fb54.exe | 739592aa476f96736b17b488a406fb54 | Dormant | 16973833 | 1.3.9
(1 rows)
@slivne yes I confirm that I am reproducing this on a single node.
Regarding the groups2 table (can you share the schema?), you dropped a column, inserted a row, and then failed to update it?
@duarten I can see that _base_non_pk_column_in_view_pk, which contains ids into the base schema, is filled-in by initialize_base_dependent_fields, which is only called from add_or_update_view(), which is called when a view changes. It seems it won't be updated if the base table schema changes, or am I missing something?
I've attached the groups2 schema. The sequence of events was something like this:
1- Create the table. Inserted/updated/deleted a bunch of rows.
2- Added several materialized views. Inserted/updated/deleted a bunch of rows.
3- Added 5 new columns (concurrentupdates, incrementalconfigid, autoactionid, filepathpolicyid, filepathpolicypb). Inserted/updated/deleted a bunch of rows.
4- Realized I didn't need one of those new columns (filepathpolicypb blob) so I dropped it. Inserted/updated/deleted a bunch of rows.
5- Then noticed errors in log updating a single one of the rows. Updates to all other rows seem to be working fine.
This morning doing a bunch of tests I found I could update that row using an insert statement but not an update so I am going to change our application to use inserts instead of updates for these tables that have materialized views.
That @tgrabiec hint gave me an idea. I "altered" each materialized view on this table and the problem went away.
cqlsh:veraminetest> update groups2 set concurrentupdates = 0 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
WriteFailure: Error from server: code=1500 [Replica(s) failed to execute write] message="Operation failed for veraminetest.groups2 - received 0 responses and 1 failures from 1 CL=ONE." info={'failures': 1, 'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
cqlsh:veraminetest> ALTER materialized view groups2_deceptionpolicyid with comment='test';
cqlsh:veraminetest> ALTER materialized view groups2_groupname with comment='test';
cqlsh:veraminetest> ALTER materialized view groups2_collectionpolicyid with comment='test';
cqlsh:veraminetest> alter materialized view groups2_sensormd5 with comment='test';
cqlsh:veraminetest> update groups2 set concurrentupdates = 0 where customerguid = 00000000-0000-1000-0000-000000000001 and groupid = 51;
cqlsh:veraminetest>
@tgrabiec Oh man, I was convinced we did. facepalm Thanks!!
Although it's weird that I couldn't reproduce this, since I added a ton of columns that should definitely change that id. Go figure. Will push a fix.
Alright, I was able to semi-reproduce it:
cqlsh:ks> create table cf (p int, c int, v1 int, v2 int, primary key (p, c));
cqlsh:ks> create materialized view vcf as select p, c, v1, v2 from cf where p is not null and c is not null and v1 is not null and v2 is not null primary key (v1, p, c);
cqlsh:ks> update cf set v1 = 4, v2 = 5 where p = 1 and c = 1;
cqlsh:ks> alter table cf add f int;
cqlsh:ks> alter table cf add o int;
cqlsh:ks> alter table cf add t int;
cqlsh:ks> alter table cf add x int;
cqlsh:ks> alter table cf add z int;
cqlsh:ks> update cf set v2 = 7 where p = 1 and c = 1;
cqlsh:ks> select * from vcf;
v1 | p | c | v2
----+---+---+----
4 | 1 | 1 | 5
I wasn't really paying attention to values before, as I was expecting a crash. But this makes sense: the wrong id is used to fetch the regular base column in the view's primary key, which probably points to a new column for which the cell is null, and thus we won't even consider it.