Clickhouse: Support 'lost key' pattern on DB level to satisfy GDPR requirements?

Created on 14 Mar 2018 · 9Comments · Source: ClickHouse/ClickHouse

I know that you, guys, have plans to add some support for UPDATE / DELETE to make it possible to implement 'right to be forgotten' of GDPR.

But also there is another quite nice approach to satisfy GDPR requirements which probably is easier to implement, and allows to process 'forget me' requests very efficiently.

It's called 'lost key' pattern. Idea is simple: all fields containing sensitive (personal) data are encrypted by per-user key, all non-sensitive but important numbers for aggregation can be stored 'as is'. Per user-keys are stored separately in DB with limited access (in case of ClickHouse it can be just a dictionary). If user decide to be forgotten - it's enough to remove his key from db with key, and no personal information will be available anymore, and at the same time all 'numerical' reports are not affected (you can still see the data of one user and you can see that it was one user, but you can't decrypt it). It will slow down access to fields with sensitive data but in some cases it's not always necessary to decrypt it, and some effecient algorithms can be chosen.

For example can work in a following manner:
```sql
CREATE TABLE xx (
a Encrypted(String),
b UInt8,
c Encrypted(Int8)
date Date,
userid UInt8,
) ENGINE ... EncryptionAlgorithm = ... EncryptionKey = dictGetString( 'keys_storage', 'key', userid);

SELECT decrypt(a) FROM xx; -- decrypted
SELECT a FROM xx; -- raw (encrypted)
```

BTW, any update about update / delete progress?

st-community-taken

Source

filimonov

👍5 🎉1

Most helpful comment

is there any update on this, any plans to release encryption at rest in a future version?

jenyacazacu on 1 Apr 2019

👍5

All 9 comments

That can also lead to big downgrade of compression ratio (implementing compression before encryption will be not so easy) and together with slowing down data reading / inserting can kill all the advantages.

filimonov on 14 Mar 2018

This is very clever idea!

Probably we can start with encryption/decryption functions.
Special encrypted data types and syntax in table definition may go later.

Compression rate will suffer. But not so much if we just encrypt only few certain fields out of many.
CPU cost will be not so much if we will be able to utilize AES-NI instructions.

If we encrypt values independently (one by one) we should also add some unused random bits before encryption, to avoid correlated results.

I'm not an expert in crypto. Could you please propose some details about possible interfaces of that functions?

alexey-milovidov on 17 Mar 2018

👍1

To utilize AES-NI instruction the only algorithm of choice is AES (in one of the modes).

I think that CTR mode of AES should be nice - reasonable good, fast, work in stream mode,
no padding needed (implementations should support SIMD, probably should support parallelization).

For algorithm itself you need to provide initialization vector (IV), the plaintext (or ciphertext to decode) and key (one of the sizes: 128, 192, 256 bits).

The proper way - is to generate IV randomly (crypto-secure random function needed) and store it with ciphertext, in that case you will get different ciphertexts for same plain text values. In that case it's really hard (~improssible) to crack the cipher. That is a native part of AES and proper usage (you're mentioned that as 'unused random bits before encryption, to avoid correlated results.')

You can also use the constant IV. It can be considered as safe when key is used only once / for one piece of data, because it will simplify some types of attacks, by giving properties:

if encrypt(A,KEY) = encrypt(B,KEY) then A=B
if encrypt(A,KEY) = substring( encrypt(B,KEY) ) then A=substring(B)
if prefix(encrypt(A,KEY)) = prefix( encrypt(B,KEY) ) then prefix(A)=prefix(B)
etc.

While being certain security hole, it can even be also desirable property for some kinds of encrypted data processing (so you will see that some encrypted attribute of user X remains fixed in time, or 2 encrypted urls have common prefix, i.e. probably came from the same domain). Another advantage of constant IV - you don't need to store / process initialization vector (IV) needed to decompress data.
So that approach can be cracked with smaller costs, but that will still require time / knowledge / efforts from educated person and some bigger amount of data encrypted with same key / IV combination. Generally if the encrypted data is not very sensitive - it still could be satisfactory protection. But it can't be considered as truly 'safe'.

Both modes can be supported (fixed IV for 'some' security, and random IV for 'fair' security), and IV (even constant) can always be attached to the ciphertext, but that will give +16 chars to each encrypted value (even for fixed IV case).

Key and initialization vector can be generated from passphrase string and salt using some key derivation function (for example PBKDF2 from PKCS#5). But that will require extra initialization steps before encrypt/decrypt, so for ClickHouse it's rather better to have key/IV in ready-to-use form - i.e. binary and with proper length (16/24/32 bytes depending of chosen option), without extra step of generation key/IV from passphrase/salt.

You can play a bit with encrypt/decrypt using openssl cli:

➜ echo 'aaaaa' | openssl enc -K 00000000000000000000000000000000 -iv 00000000000000000000000000000000 -aes-128-ctr -base64
B4gqtY6A
➜ echo 'B4gqtY6A' | openssl enc -K 00000000000000000000000000000000 -iv 00000000000000000000000000000000 -aes-128-ctr -base64 -d
aaaaa
# Openssl also has builtin benchmarks that can by run from bash like that:
➜ openssl speed

About interface: key and text are required. IV - may be can be provided as optional param? Text is any binary data, key is binary data of configured length, result is binary data.

Also (may be?) it will be better to leave possibility to use other algorithms in future, something like

encrypt('aes-128-ctr')(data, key)
encrypt('aes-128-ctr-noiv')(data, key) -- fixed IV mode?

In C++ you can use one of avaliable cryptography libraries. It looks like most of them should support AES-NI instruction.

Benchmark of one of them give some feeling of algorithm speed:
https://www.cryptopp.com/benchmarks.html

I'm not an expert in crypto.

I didn't have bigger practical experience with crypto in last 10 years, so just consider written above as IMHO.

filimonov on 19 Mar 2018

Update: as Clickhouse stores column data in blocks and each block is compressed and can be processed separately, encryprion can also be done on block level. In that case initialization vector could be unique per block, and it will give really tiny overhead.

filimonov on 27 Jun 2018

is there any update on this, any plans to release encryption at rest in a future version?

jenyacazacu on 1 Apr 2019

👍5

Any update on this? It's in the roadmap for next months?