Signal-desktop: SQLITE_CORRUPT: database disk image is malformed

Created on 27 Feb 2019  ·  65Comments  ·  Source: signalapp/Signal-Desktop

Bug Description

Unhandled Promise Rejection
Error: SQLITE_CORRUPT: database disk image is malformed

People had the same issue in #2972 but it was prematurely closed before resolution a few days ago so I can't post my log file in there. Closed with reason "I'm going to close this, since that new dialog prevents you from being stuck in that 'Optimizing...' stage." -- not sure what this is referring to as the dialog just states the database is corrupt then the app closes, there is no resolution offered.

Steps to Reproduce

Using MacOS and Signal was working fine. It asked me to restart the app for the new version to install and then gives me a corruption error.

Platform Info

Signal Version:

1.22.0

Operating System:

MacOS Mojave 10.14.2

Link to Debug Log

{"name":"log","hostname":"MacBook-Pro.local","pid":10259,"level":30,"msg":"app ready","time":"2019-02-27T00:00:19.774Z","v":0}
{"name":"log","hostname":"MacBook-Pro.local","pid":10259,"level":30,"msg":"updateSchema: Current schema version: 9; Most recent schema version: 11; SQLite version: 3.20.1; SQLCipher version: 3.4.2;","time":"2019-02-27T00:00:19.789Z","v":0}
{"name":"log","hostname":"MacBook-Pro.local","pid":10259,"level":30,"msg":"updateToSchemaVersion10: starting...","time":"2019-02-27T00:00:19.790Z","v":0}
{"name":"log","hostname":"MacBook-Pro.local","pid":10259,"level":50,"msg":"Unhandled Promise Rejection: Error: SQLITE_CORRUPT: database disk image is malformed","time":"2019-02-27T00:00:19.798Z","v":0}
{"name":"log","hostname":"MacBook-Pro.local","pid":10266,"level":30,"msg":"app ready","time":"2019-02-27T00:00:52.195Z","v":0}
{"name":"log","hostname":"MacBook-Pro.local","pid":10266,"level":30,"msg":"updateSchema: Current schema version: 9; Most recent schema version: 11; SQLite version: 3.20.1; SQLCipher version: 3.4.2;","time":"2019-02-27T00:00:52.213Z","v":0}
{"name":"log","hostname":"MacBook-Pro.local","pid":10266,"level":30,"msg":"updateToSchemaVersion10: starting...","time":"2019-02-27T00:00:52.214Z","v":0}
{"name":"log","hostname":"MacBook-Pro.local","pid":10266,"level":50,"msg":"Unhandled Promise Rejection: Error: SQLITE_CORRUPT: database disk image is malformed","time":"2019-02-27T00:00:52.221Z","v":0}
Bug

Most helpful comment

Hi @wolever - Thanks for running that! Based on the verification of the per-page MACs we can at least conclude that the database is intact as SQLCipher wrote it. That helps effectively rule out a substantial set of corruption causes related to unintentional modification of the database file outside of the library, and makes it more likely that this is an "internal" corruption.

As a next step @aguynamedben is going to put together a standalone Electron app for us that we can use to perform some isolated testing under difference scenarios. We'll see if we can reproduce some corruption during a crash, etc.

All 65 comments

Ah, so in your case you just get the dialog with 'Copy Error and Quit' and 'Quit'? Yep, we could improve the database-specific error handler, so you get the option to delete everything and restart in this case as well.

There should be more logs available from the time when you ran into these errors - can you look for those? We're still looking for clues as to why database corruption happens.

Also, if you'd like this bug to track the ultimate resolution of SQLite corruption errors in Signal Desktop, fine by me.

To all future posters: we need as much information as we can get to help us track down what is causing the corruption. Did you restore from backup? Did you need to Force Close SIgnal Desktop, or turn off your computer without shutting it down normally? What exactly was your computer doing at the last time Signal Desktop was working properly?

Beyond that, we really, really need debug logs from right before corruption appeared. You can go into your Signal config directory (locations for each OS are listed here: https://github.com/signalapp/Signal-Desktop/blob/development/CONTRIBUTING.md#the-staging-environment) and zip up the logs directory and post it here or send it to [email protected].

I'll have a look for the logs when I'm back at work with my Mac. The app was working normally when I opened it, showed all my contacts and conversations. It said there is an update available so I let it download the update and after the update completed I launched the app again only to get the database is corrupt message.

To answer your question yes that is the dialogue I saw when opening it before deleting my local database.

I've also experienced this issue. Using mv "~/Library/Application Support/Signal"{,.old} to move the old config directory out of the way then restarting Signal does correct the issue.

I've emailed my logs to [email protected], and I'm happy to help if there's anything else I can do.

Assuming the database is at $config/sql/db.sqlite, what format is the database stored in? It doesn't seem like sqlite can read it, even with a correctly working Signal:

$ file …/Signal/sql/db.sqlite
…/Signal/sql/db.sqlite: data
$ sqlite3 …/Signal/sql/db.sqlite .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
/**** ERROR: (26) file is not a database *****/
ROLLBACK; -- due to errors

Hi, I am also experiencing this issue. Interestingly I have no folder at ~/Library/Application Support/Signal. It doesn't show up after installing Signal, or after telling the desktop app to clear all data, or after uninstalling and reinstalling. Furthermore, I tried creating a folder called Signal within Application Support and then doing what Wolever suggested, but the Terminal insisted there was "no such file or directory." As such, I don't seem to even have any debug logs that I can share.

I also tried deleting a random message from a conversation I had with someone on my phone, and then I re-linked Signal Desktop (after clearing all data from the Desktop app) to my phone. Interestingly, the deleted message still showed up in the thread on the Desktop app. This suggests to me that the messages are being stored somewhere on my computer that I can't find.

I also tried installing version 1.20.0 on my Mac again (which as I recall worked fine), but I ran into the same issues. Seems like something deeper in my system has gone wrong and I'm not sure where to look. If there is anything I can do to help, I would be happy to do so.

@wolever It's a SQLCipher database, so you can't use plain sqlite to open it. Thanks for sending your logs. Anything else you can say about the state of your computer around the time of discovering the corruption (or the last usage, when corruption might have happened), would be really useful. Did you have to do a hard restart of your computer? Did you restore from backup? Was there a Signal Desktop update?

@Arcuna First, let's talk about how you initially installed Signal Desktop. Was it from https://signal.org, or did you install the beta? Now, when you install a new version, what exactly do you see when you start it up? Just a dialog? Or does the window pop up?

HI, like Arcuna, a folder for Signal also does not appear in my application support. From what I can remember the errors started happening when I updated Signal to 1.22.0. At first, it would randomly crash with some pop-up update error appearing (can't remember the exact name of error) and I would have to reinstall. After reinstalling it would work for a while, then I started receiving error handling messages to one of my contacts and it gradually spread to all of them.

@funnypaperboy What version of macOS are you on? Can you show me how you get to the Application Support directory? Also, please open a new bug with full details (including a Debug Log) for your 'error handling message' issue.

I'm on macOS Sierra 10.12.2. I went to Macintosh HD -> Library -> Application Support. How do I open a new bug? Didn't I open one with all the details before and you closed it?

@scottnonnenberg I downloaded and installed Signal Desktop via signal.org. To my knowledge I have never installed a beta version. I just now re-downloaded and reinstalled it, and upon opening it I was greeted with a blue screen and the text "Loading messages. X so far..." This number X increased for a while until the app opened. Interestingly some recent messages still came through to the Desktop client, but they were all either images or very short messages (a message with the string :) made it through, and messages containing only emoji also seem to get through, but nothing with more text got through). Any messages containing longer strings that I've sent or received after February 12 don't appear to have reached the Desktop client. I've attached a few screenshots; I hope that helps. The messages indicating that I marked the contact as verified show up every time I clear all data on Desktop and then re-link it to my phone. I have indeed verified the contact in question but only once, and long ago.

To navigate to the folder I use Finder (i.e. the GUI rather than via command line). I am running macOS 10.14 and Signal Desktop 1.22.0.

For Signal1
Signal2
Signal3
Signal4

Sorry to double-post, but I just found something else interesting: Desktop still displays a copy of all messages I send and receive via my phone, so long as they contain no alphanumeric text. That means special characters and emoji work, but as soon as there's a letter in it, Desktop doesn't show it. HOWEVER, if I send or receive a message via my phone, that most recent message still shows up in the preview on the left-hand side of the screen on Desktop, regardless of the content. As soon as I click on that person's tab, the preview disappears and is replaced by the most recent non-alphanumeric message. (The preview also appears for a couple seconds even if I am already viewing the conversation in question, but then it vanishes shortly thereafter.) Finally, Desktop seems to think that any messages I send from my phone do not get received by the recipient; that is, it shows only one checkmark by the sent message, even though I can confirm (and my phone shows) that the recipient read the message on their phone. Attempting to send an emoji message from Desktop immediately gives me a Send Failed error. I have attached more screenshots. (In case this is relevant, the Activity Monitor consistently shows about 107-109% CPU usage for Signal, as well as 15-25% CPU usage for Signal Helper, so long as Signal is running. Opening and closing the app works normally; I do not have to Force Quit.)

I am sorry for my quick writing style. I hope this helps!

Desktop
Phone1
Phone2

@Arcuna Please reach out to [email protected]. Your installation is displaying quite a few unusual behaviors, and we definitely need to find your data directory.

Thanks, I got in touch with support. In the meantime, I feel like an idiot (@funnypaperboy may be interested to know this) and I successfully found the hidden Library folder located at Users/[My Name]/Library. (Navigating to my user folder and pressing Cmd + Shift + . revealed the hidden folders, for the luddites out there like me who may not know.) I can now confirm that Wolever's method works; removing the Signal folder solves the problem (albeit at the cost of losing all your messages on Desktop). I kept a copy of the original (problematic) Signal folder which I can easily put back where I found it, and I am happy to continue helping with debugging in any way I can.

@Arcuna you're a bloody genius mate, the problems fixed!

@wolever It's a SQLCipher database, so you can't use plain sqlite to open it. Thanks for sending your logs. Anything else you can say about the state of your computer around the time of discovering the corruption (or the last usage, when corruption might have happened), would be really useful. Did you have to do a hard restart of your computer? Did you restore from backup? Was there a Signal Desktop update?

@scottnonnenberg unfortunately I don't remember any specific details about what happened. Around the time I did have a few hard resets (my mac (10.14.2) was having some trouble waking from sleep), so that's definitely a possibility.

The only potentially abnormal behavior was that it would take Signal quite a long time to load (it would take ~minutes to load, and IIRC the "loading messages …" count finished at about 500), where it would typically load in ~seconds. The database file is ~20mb, and I had (very rough guess) about 10,000 messages in it.

Also, to elaborate on the behavior: when I would first launch Signal.app, it would correctly show any unread messages (ex, if I closed Signal.app, received a message, then opened Signal.app, it would correctly show that my contact had sent an unread message, and I don't recall whether or not it would show the preview), but when I would click the contact, the unread notification would be cleared, but only historical media messages would be displayed (it's possible older unread messages would show too; I don't recall). When I would try to send a message, it would show locally, but it would have the empty circle that indicating "not sent yet".

Would it be possible to retrieve the key used by sqlcipher so I could try loading the database?

@wolever Take a look at the other files in the Signal data directory. The key for sqlcipher is there.

I just ran into this problem as well. I had hard restarted my macbook right before the error occurred. Following @wolever's advice let me re-link my desktop app

Here are my logs: logs.zip

@scottnonnenberg-signal I also build Electron + SQLCipher app, and a few of our users are getting SQLITE_CORRUPT errors too. From your package.json it looks like this is happening for your users with version 3.2.1 of node-sqlcipher, which means SQLCipher 3.4.1 and node-sqlite3 3.15.2... is that right?

In our app, we're seeing this behavior with SQLCipher 4.0.1 and a custom-built version of node-sqlite3 4.0.6. I don't think we ever saw it with SQLCipher 3.x, which we used for 6 months.

For our app, I'm most suspicious of potential cause 1.1 listed here:

1.1. Continuing to use a file descriptor after it has been closed

I'm also suspicious that Electron's multi-process environment (main process and renderer process) may lead to Electron users accidentally making the mistake mentioned in this potential cause 2.6:

2.6. Carrying an open database connection across a fork()

If it's neither of those... I would be curious if it's a SQLCipher bug, but you're on a different version of SQLCipher, and it looks like nobody has reported any corruption bugs at https://discuss.zetetic.net/, so I'm guessing it's some kind of app lifecycle/forking thing with our app.

Godspeed! 🙏

UPDATE: I also asked here to see if any other SQLCipher+Electron users have dealt with this.

@aguynamedben It does seem like hard resets can cause it. But I also suspect that shutdown/update scenarios can sometimes result in two versions of the app competing for access to the database file. This is all edge case/race condition kind of stuff, so very difficult to nail down. But I suppose it could also have something to do with the database itself... it's a tough one.

I can't say conclusively that this is related, but I noticed that sql.js is using PRAGMA schema_version, which is generally not a safe practice. Manipulation of PRAGMA schema_version can lead to database corruption. It is almost always best to SQLite to manage the schema version internally.

For Signal's use case, i.e. tracking application file versions for migrations, the use of PRAGMA user_version would be a much more appropriate replacement.

It may be worth a brief investigation into whether this could be a contributing factor in these corruption incidents.

I just got this when my MacBook powered up after a battery death. Signal had been open prior to the unclean shutdown.

Database startup error:

Error: SQLITE_CORRUPT: database disk image is malformed

@oconnetf Thanks for the report. It looks like you got a dialog which allowed you to delete everything and start from scratch. Did that work for you?

Okay, update: I wasn't able to load the database file using the sqlcipher command line tool, but I was able to get it working by installing the @journeyapps/sqlcipher package.

I used that to write a quick script to query all the tables, and I think I've found the issue:

var fs = require('fs')
var sqlite3 = require('@journeyapps/sqlcipher').verbose()
var db = new sqlite3.Database('sql/db.sqlite')
var config = JSON.parse(fs.readFileSync('config.json'))
db.run(`PRAGMA key = "x'${config.key}'";`)
db.each(`select name from sqlite_master where type = 'table'`, (err, row) => {
  db.each(`select count(*) as count from ${row.name}`, (err, count) => {
    if (err) {
      console.error(`${row.name}: ${err}`)
      return
    }
    console.log(`${row.name}: ${count.count}`)
  })
})

Which yields:

messages_fts: Error: SQLITE_ERROR: no such module: fts5
messages: 13859
sqlite_stat1: 10
conversations: 162
identityKeys: 49
items: 24
preKeys: 73
signedPreKeys: 3
sessions: 29
messages_fts_idx: 145
messages_fts_content: 13859
messages_fts_docsize: 13859
messages_fts_config: 1
attachment_downloads: 0
messages_fts_data: Error: SQLITE_CORRUPT: database disk image is malformed
unprocessed: 64

You're saying that it's specifically the messages_fts_data table that's having problems, nothing else? Before we feel sure about that, you should probably get fts5 installed; it may have an effect on what you're seeing.

@wolever you'll need to enable fts5 (--enable-fts5). Once you do, I'd like to know what is the output from PRAGMA integrity_check;

Yea, I'm not sure what to do there. As far as I can tell it's working with my system sqlite install – I can run sqlite3 <<< "select fts5(42);" without error – but for some reason it doesn't work from node (ie, running db.each("select fts5(42)") yields SQLITE_ERROR: no such function: fds5.

@sjlombardo ah dang, didn't see your message when I posted. Do you happen to know why node would be using a different version of sqlite than the system version? Or would this be a "hunt the dependencies" kind of thing?

@wolever SQLite is basically just a C library, it's not a database service that runs on your system like you'd expect if you've used MySQL/PostgreSQL a bunch.

Signal Desktop embeds a special version of the SQLite C library that:

  • is actually SQLCipher - a customized version of SQLite with encryption enabled
  • is built with the --enable-fts5 flag for search indexing

The (corrupt) Signal database was created with SQLCipher + FTS5 enabled.

Your computer probably currently has:

  • The system SQLite (without SQLCipher) (type which sqlite3)
  • Node.js with the @journeyapps/sqlcipher package you installed. I don't think --enable-fts5 is set when the @journeyapps/sqlcipher package builds the C library.

In order to read your database (thank you for helping! 🙏) you need to install SQLCipher with FTS5 enabled.

The easiest way to do this if you're on a Mac is to use the SQLCipher via Homebrew (see recipe) (note -DSQLITE_ENABLE_FTS5 on line 27 of the recipe).

$ brew install sqlcipher
(output...)
$ which sqlcipher
/usr/local/bin/sqlcipher
$ sqlcipher /path/to/database.db
sqlite> PRAGMA KEY = "blah";
sqlite> PRAGMA integrity_check;

The sqlcipher binary installed by Homebrew is the same command-line tool as sqlite, except is has SQLCipher and has FTS5 enabled. If which sqlcipher fails, you probably need to add /usr/local/bin to your bash/ZSH path, or just run /usr/local/bin/sqlcipher /path/to/database.db.

@wolever and SQLCipher team... thank you so much for helping debug this. My app is having a seemingly similar issue, but we can't get a copy of a database where this has happened, so your help in debugging the corrupt database is really key!

@wolever A few more notes for you on trying to run PRAGMA integrity_check;

~Make sure you're using SQLCipher 4.x~

UPDATE: Sorry, it looks like Signal Desktop uses SQLCipher 3.4.2
After running the sqlcipher binary installed by Homebrew, you can use PRAGMA cipher_version; to ~verify you're running SQLCipher 4.x.~

sqlite> PRAGMA cipher_version;
4.1.0 community

Trying to open a SQLCipher 3.x file with SQLCipher 4.x will probably fail. I had an old 3.x version installed by homebrew... so had to brew uninstall sqlcipher and brew install sqlcipher.

You must set PRAGMA key (the SQLCipher key) before running the integrity check

falcon!ben:/Users/ben/code/tesla$ sqlcipher ~/Library/Application\ Support/Electron/ce-4.db
SQLCipher version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> PRAGMA integrity_check;
Error: file is not a database
sqlite> PRAGMA key = "my-key";
sqlite> PRAGMA integrity_check;
ok

🎉 Downgrading to sqlcipher 3.4.2 seems to have made things work.

There might be a better way to do it, but I downgraded with:

$ cd /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core/Formula
$ git show 4abec2d11a5eb374c6610ae924aae94c56883c99:./sqlcipher.rb > sqlcipher.rb
$ brew install sqlcipher

And here's the result:

$ sqlcipher sql/db.sqlite
SQLCipher version 3.20.1 2017-08-24 16:21:36
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> PRAGMA key = "x'6a…5d'";
sqlite> PRAGMA integrity_check;
*** in database main ***
Page 2494: btreeInitPage() returns error code 11
Page 9709 is never used
sqlite> select count(*) from messages_fts;
13859
sqlite> select count(*) from messages_fts_data;
Error: database disk image is malformed

@wolever Just to confirm, you tried to query every table but messages_fts_data is the only one that returned an error?

@scottnonnenberg I believe so, yes (assuming that the count(*) I ran on each table will do a scan that will touch every row; I did not select * from each table, though)

Well, one thing I can say is that the corruption errors precede our use of the fts5 module. We started using that early this year, but you can search for the error message and find issues from last year.

Of course, it is possible that the use of fts5 made the problem more common. Also worth noting that we didn't have any triggers before our version 8 schema (which also introduced fts5 usage).

@wolever The next step would be for us to look at what is in the database file around the location of page 2494. Can you please run the following command?

hexdump -C sql/db.sqlite | tail -n +159616 | head -n 130 > pages.txt

That this will output the contents of the encrypted database file around the page number referenced in the interity_check report into pages.txt. Because the database file is encrypted the output should not contain anything secret (it should be encrypted with your key). Even so, just to be safe take a quick look at pages.txt just to make sure there isn't any sensitive information present (for example, the "secrets" in the white box here). Assuming not, post the results up to a gist, or you can send it to us via email/PGP.

@sjlombardo great, I've emailed the dump to [email protected].

Additionally, if it would be helpful, I'd be happy to jump on a screen share debug session so you can poke around at the file live.

@scottnonnenberg does signal desktop use WAL or standard DELETE mode?

@wolever when the crash occurred, do you happen to know if there were any files ending in -journal or -wal in the folder with the database? If so, do you have them in addition to the db.sqlite file?

Hi,

In an attempt to gather some additional information regarding the corruption issue can someone provide details regarding:

  • The version of Electron being used within the application generating the corruption
  • Has this corruption issue occurred over multiple versions of Electron?

    • If so, what versions?

  • Are you building Electron from source, or installing binaries locally from npm?

    • If you build from source, do you make any material modifications to Electron?

@sjlombardo there are not currently any journal or wal files in the directory. I relaunched signal a few times, though, so it's entirely possible that they were cleaned up.

@developernotes unfortunately I don't remember the version of Signal that was active when the corruption happened. The corruption happened around January, so it's likely to have been the latest official macOS build from that time.

Hello @wolever. We have authored a small tool we'd like you to run on the corrupted database. Each page in a SQLCipher database has an attached MAC which is used to authenticate that the page contents are valid and have not been tampered with. This MAC is applied after encryption. The tool will run on your encrypted database and verify the state of each page, and report any issues. You can download it here:

https://github.com/sqlcipher/sqlcipher-tools/blob/master/verify.c

To build the tool run the following command (this assumes OpenSSL is installed in the default Homebrew location /usr/local/opt/openssl):

clang verify.c -DPAGESIZE=1024 -DPBKDF2_ITER=64000 -DHMAC_SHA1  -I /usr/local/opt/openssl/include -L /usr/local/opt/openssl/lib -l crypto -o verify

Then, run the verify program on your database, replacing the -k parameter with your database encryption key e.g.

./verify -f sql/db.sqlite -k "x'6a…5d'"

This should provide a report of the pages in the database, and will help us differentiate whether the database is internally corrupted (i.e. corrupted by the database library prior to encryption by SQLCipher), or if it is externally corrupted (i.e. the data was manipulated after the write of the encrypted content).

Can you please run this at your earliest convenience and let us know the output. Let us know if you have any questions on building or running the tool.

Hello @wolever have you had a chance to look at running the verify program on your corrupted database? Right now your database is the only database snapshot we have of the problem, it would be super helpful see the results. Thanks!

@sjlombardo hey! Sorry for the delay - I'm just getting back from some travel.

I've tried to run the tool, but unfortunately something doesn't seem to be working. It reports all pages as being invalid, even on a valid database.

I've compiled it for both versions 3 and 4 (verify3 and verify4), and this is what I get when I run it against a valid database:

$ cat config.json
{
  "key": "465...1c8"
}
$ ./verify3 -f sql/db.sqlite -k "x'465...1c8'" | tail -3
page 11772 is invalid
page 11773 is invalid
scanned 11773 pages and found 11773 invalid, database is corrupt
$ ./verify4 -f sql/db.sqlite -k "x'465...1c8'" | tail -3
page 2943 is invalid
page 2944 is invalid
scanned 2944 pages and found 2944 invalid, database is corrupt
$ ./verify3 -f sql/db.sqlite -k "465...1c8" | tail -3
page 11772 is invalid
page 11773 is invalid
scanned 11773 pages and found 11773 invalid, database is corrupt
$ ./verify4 -f sql/db.sqlite -k "465...1c8" | tail -3
page 2943 is invalid
page 2944 is invalid
scanned 2944 pages and found 2944 invalid, database is corrupt

I see a similar result running against the corrupt database, but I presume that's to be expected.

@wolever thanks so much for giving this a try! I forgot that signal desktop was using raw key syntax. I've just pushed an update to verify.c that fixes it to support raw hex keys. Can you grab the latest, compile for SQLCipher 3 settings and try it one more time?

Okay! That works better.

$ ./verify3 -f ./sql/db.sqlite -k "x'6a1b…375d'"
scanned 22256 pages and 0 are invalid, database is intact

And I've confirmed that using an incorrect key "breaks" it, reporting all pages as invalid, as does using verify4.

I've also double checked that this is the broken database (and not accidentally running it against my working database).

Hi @wolever - Thanks for running that! Based on the verification of the per-page MACs we can at least conclude that the database is intact as SQLCipher wrote it. That helps effectively rule out a substantial set of corruption causes related to unintentional modification of the database file outside of the library, and makes it more likely that this is an "internal" corruption.

As a next step @aguynamedben is going to put together a standalone Electron app for us that we can use to perform some isolated testing under difference scenarios. We'll see if we can reproduce some corruption during a crash, etc.

Okay, awesome! And like I mentioned, if it would be helpful, I'd be happy to get on screen share with someone to poke at the database and see what's up.

@scottnonnenberg - Cross-posting this from the SQLCipher discussion site... The new release of SQLCipher 4.2.0 adds PRAGMA cipher_integrity_check. It performs an independent check of each page in a database using the stored HMAC, and produces a list of an errors found. It might be a good idea to collect some information when these errors occur. If no problems are found with the "envelope" of the database by cipher_integrity_check, yet the database it is being reported as SQLITE_CORRUPT during use, then it is likely that some problem corrupted the database internally (e.g. logic error, improper use of the database API, etc.).

This might confirm what @wolever was seeing, and potentially narrow things down on a wider scale.

@sjlombardo We'll get on that - thanks!

We have another example of corruption on macOS: https://github.com/signalapp/Signal-Desktop/issues/3599

It looks as if the corruption happened not long after starting the app - a few queries succeeded before these started: SQLITE_CORRUPT: database disk image is malformed. The integrity check right after opening the database didn't catch it. I've asked for a restart of the app, which should give us integrity check results.

@scottnonnenberg were you able to upgrade to SQLCipher 4.2.0 and work in PRAGMA cipher_integrity_check? I still think it would be useful to determine conclusively whether the database in question is internally or externally corrupted.

From the log in #3599, yes we're on 4.2.0. Right after that line the cipher_integrity_check is run, and then the app starts. If the user restarts, the integrity check should fail, and that log output should be a pretty conclusive smoking gun.

INFO  2019-09-16T19:35:55.871Z updateSchema:
  Current user_version: 18;
  Most recent db schema: 18;
  SQLite version: 3.28.0;
  SQLCipher version: 4.2.0 community;
  (deprecated) schema_version: 43;

Hi @scottnonnenberg thanks, that's good to know. So it sounds like the database was externally intact, pointing to an internal database corruption.

As a next step @aguynamedben is going to put together a standalone Electron app for us that we can use to perform some isolated testing under difference scenarios. We'll see if we can reproduce some corruption during a crash, etc.

I apologize for not having done this yet. I just checked our Sentry logs, and we've seen this error with 7 users now. (We have far less users than Signal.) We seem to get this happening around once every 2 weeks. We're still tracking this, and soon may be able to put in some time making the example app.

@aguynamedben @sjlombardo We have our first failed integrity check result, which involved a Signal Desktop user losing all of their data. Electron v6.0.1.

The failed integrity check results:

Database integrity check failed: HMAC verification failed for page 21078

Here's the full log - you can see that the error started happening after quite a bit of uptime, since it was job 84042 which was the first to return the SQLITE_CORRUPT error: https://debuglogs.org/d164cfb89fa8acb3a852210fb6d8889afcad55421cabdd5bfb7a5640dca81e6a

What can we take away from this?

Hello @scottnonnenberg - the next step here would be to try to determine if there is an explanation for the external corruption. The first thing I'd like to do is examine the contents of the corrupted page (for example, if we see some recognizable content written there we might conclude that another process or thread wrote to the file handle).

I'd imagine it would be undesirable for the user to share their database, so instead we can have them share the hexdump of that one corrupted page, similar to what we did earlier in this thread using hexdump -C with some positioning to filter just the lines we want. Is Signal now using the default SQLCipher 4 settings, or the SQLCipher 3 compatibility settings?

hexdump outputs 16 bytes per line of output so the general command format will be

hexdump -C sql/db.sqlite | tail -n +<startpos> | head -n <length> > pages.txt

Where startups = page_no*(page_size/16) and length = (page_size/16)

For SQLCipher 4 the page size is 4096 so this should be:

hexdump -C sql/db.sqlite | tail -n +5395968 | head -n 256 > page.txt

For SQLCipher 3 compatibility settings the page size is 1024 so the command should be:

hexdump -C sql/db.sqlite | tail -n +1348992 | head -n 64 > page.txt

@sjlombardo Signal Desktop is now using the default SQLCipher 4 settings. But I'm not sure we'll ever get that extra information from a user. This user, for example, unblocked themselves by deleting the old database and starting over from scratch. What are our options without that additional detail?

@scottnonnenberg That is problematic, without seeing the page in question I don't know that we can deduce anything about a root cause. There are still many potential causes for an external corruption. It is interesting that this is the first one you've seen, but there have been a few others where the corruption was internal, Is that right?

Have you had contact with this user? if so, did they say that the corruption was precipitated by a crash / power loss event like some of the other cases? Could we ask if they may have a backup copy of the database, e.g. in Time Machine that we could have them restore and look at?

Alternately, going forward, if a corruption event is detected, could you offer the user to either capture and send the offending page in the log, or alternately make a local backup copy of the corrupted database in on their computer and request they contact support?

We haven't seen much in the way of similar corruption from other applications. It is possible that this has something specific to do with Electron (e.g. related to Electron embedding its own copy of SQLite, or result of file handling, API wrapper, etc). As a result I still think it would be extremely helpful to have a sample Electron project so we can do more extensive testing (@aguynamedben offered to prepare one).

@sjlombardo Well, let's start with a definition for 'external' vs. 'internal.' It sounds like you're saying that you believe some other process went and modified the db.sqlite file. Or another thread in the same Electron process?

@sjlombardo Here's an even more confusing one - this user sees SQLITE_CORRUPT errors when running the app, but on restart the integrity check doesn't fire: https://github.com/signalapp/Signal-Desktop/issues/3599#issuecomment-533763836

@scottnonnenberg is Signal running PRAGMA integrity_check at startup as well, or only PRAGMA cipher_integrity_check?

Only PRAGMA cipher_integrity_check;

@scottnonnenberg thanks for clarifying that. In that case, I don't really think this is unexpected because PRAGMA cipher_integrity_check is just verifying the external integrity of the database. In other words, the linked issue seems to be another internal corruption. More on that below.

Let's take a step back to summarize since this is a rather long-running thread and is also partially split with the topic that @aguynamedben posted on the SQLCipher discussion site.

To recap, there are at least two independently developed applications having a problem, Signal Desktop and the application developed by @aguynamedben. Both are experiencing intermittent data file corruption issues for unknown reasons. Reports of the corruption date back at least to SQLCipher 3.4.2 (released in 2017). Queries against a corrupted database result in a SQLITE_CORRUPT: database disk image is malformed error, indicating that there are problems with database file consistency. Both applications are built using Electron and that there may be some evidence that the issue affects users that "hard reset" their devices, i.e. force a power off, run out of battery, or force quit the app.

Since there are numerous ways to corrupt an SQLite database we then tried to come up with a way to narrow down the possibilities. With the help of @wolever, a Signal user that had a backup copy of a corrupt database, we were able to determine that there was a corruption reported on one of the tables. We wrote, and he helped us test, a small verifier program that checked the HMAC values of each page on the database. This same verification logic was added to a new SQLCipher-specific PRAGMA cipher_integrity_check included in 4.2.0. This feature does the same thing as the separate verifier program, i.e. test the HMAC values of each page in a database to ensure that they are consistent, reporting any errors, as follows:

  1. If the PRAGMA cipher_integrity_check reports an error it means that one or more of the pages in the database contain an invalid HMAC. This would be considered externally corrupted because the "signature" that SQLCipher adds to the database pages is invalid. SQLCipher will not even attempt to decrypt a page when the HMAC check fails, so the corruption is being reported prior to the SQLite database layer even processing the contents of the page.

  2. If PRAGMA cipher_integrity_check doesn't report an error the digital signatures on every page of the database were intact, yet there is still something unexpected in the page data that SQLCipher signed. This is considered internally corrupted because SQLCipher can check every page, decrypt the contents, and hand it off to the SQLite database layer for processing, so the SQLite layer is likely reporting the corruption.

While it is not conclusive, another way to think about internal vs. external is in terms of probable causes. For example, if PRAGMA cipher_integrity_check succeeds then rogue process/thread overwrite, memory corruption, physical drive / flash corruption, and file system corruption are all less likely. Those problems would have the potential to change the file contents or "flip bits" in an unpredictable way that would tend to cause the HMAC checks to fail. If a cipher_integrity_check failure does occur, it is useful to look at the database page in question. Even though it is encrypted, we might be able to make some observations. For instance, if we notice that the page contains some clearly non-random data (e.g. all zeros, debug log strings) then it might point towards a culprit.

That said, a successful PRAGMA cipher_integrity_check does not guarantee that the file is not corrupted. It is still possible to have a corrupted database where SQLCipher's signatures are correct. In fact, this was the case with @wolever's original database, which passed SQLCipher verification but was still corrupt. In this situation, running the built in SQLite integrity check, PRAGMA integrity_check, on the database might yield some additional information. Because the external HMAC check succeeded, we must consider the possibility that the SQLite database layer handed SQLCipher some "bad data", and SQLCipher encrypted it, signed it and wrote it. There are still numerous underlying reasons for this, e.g. locking issues with multiple processes, problems with the journal or recovery, internal logic errors, issues related to linking multiple versions of SQLite (a definite potential problem with Electron), manipulation of schema_version (which was an issue with Signal in earlier versions) etc. Any of these could legitimately result in a corrupt database where the SQLCipher HMACs are still correct for each page.

The difficulty now is that we don't have enough information to make any conclusions. There is at least one reported case of an external corruption from September 19th ("Database integrity check failed: HMAC verification failed for page 21078"). However, it sounds like the database has been deleted so we can't look at the page in question. There are at least two that appear to be internal: @wolever's and the one reported in #3599, but that may have been deleted as well. To be honest, we can't rule out a bug in SQLCipher (or SQLite) at this time, but we haven't had any large-scale reports of these issues across five releases in almost two years.

Given these circumstances, we would propose the following:

  1. If any corruption error occurs, preserve a snapshot / backup copy of the database, and ask the user to contact your support team so you can go through investigate steps (e.g. hexdump) for cases where cipher_integrity_check fails.
  2. Add a PRAGMA integrity_check to run after PRAGMA cipher_integrity_check when no errors are reported by the latter
  3. Attempt to reproduce the most likely environmental conditions related Electron. This is where the sample application for testing (like the one we discussed with @aguynamedben) would be very helpful.

Let me know if this makes sense with respect to the current state, difference between internal / external, and potential next steps.

@sjlombardo Thanks for the detailed summary. The latest Signal Desktop beta introduces the additional integrity check as recommended in item 2.

Regarding your item 3, we've already made a concerted effort to interrupt execution, shut down the device, etc. with no success. Personally, I've never seen a problem stemming from a hard Ctrl-C shutdown when doing local development.

And finally, your item 1 will remain quite difficult, as 1) we're a privacy-focused app and 2) users want to unblock themselves and use a functioning app.

Hello @scottnonnenberg-signal. Just to clarify point 1, I do understand that Signal is a privacy focused app and was certainly not suggesting that you capture user data without permission. However, in #3599 the advice to they user was to delete ~/Library/Application Support/Signal and start over from scratch. You might instead report an error and ask them to create a backup copy of the folder or database before deleting to start over. It would then be possible to ask them to run commands to extract just the specific bytes of the corrupted page, which is encrypted anyway with a random key. This is essentially what we did with @wolever. It’s just a thought, and if the user is not comfortable with it they could certainly decline, but it has the potential to provide important information.

Let me know if you have any other questions, otherwise we’ll sit tight and await more data.

Was this page helpful?
0 / 5 - 0 ratings