Osquery: Lots of sst files under /var/osquery/osquery.db/

Created on 27 Apr 2018  Β·  38Comments  Β·  Source: osquery/osquery

We're seeing this issue where over time, osquery will start to utilize a lot of CPU. We have observed that when this happens, it is typically after osquery has been running for several months. We also observe that when this happens, there are 50k+ sst files under /var/osquery/osquery.db!

Couple of questions:

  • Is it normal to have so many files under there?
  • Is it safe to purge old files based on mtime? or atime?
  • Has anyone else observed so many files and/or a high spike in CPU after running for several months?

osquery info:

osquery_version = 2.8.0
     os_version = 14.04, Trusty Tahr
       os_build =
    os_platform = ubuntu

some bash showing the files and that they're all accessed within the last few days (this is on a host that is not yet observing the high CPU): https://pastebin.com/pvaxwdPm

danielpops@host:~ [00:34:09] βœ“
$ sudo ls -ult /var/osquery/osquery.db/ | wc -l
61566
danielpops@host:~ [00:38:56] βœ“
$ sudo ls -ult /var/osquery/osquery.db/ | head -100
total 268M
-rw-r--r-- 1 root root 1012 Apr 26 00:38 679003.sst
-rw-r--r-- 1 root root 1012 Apr 26 00:38 675074.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:38 455765.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:38 353732.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:38 334159.sst
-rw-r--r-- 1 root root  938 Apr 26 00:37 113605.sst
-rw-r--r-- 1 root root  976 Apr 26 00:37 087493.sst
-rw-r--r-- 1 root root  954 Apr 26 00:37 050921.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 859061.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 812067.sst
-rw-r--r-- 1 root root 1016 Apr 26 00:37 729877.sst
-rw-r--r-- 1 root root 1.7K Apr 26 00:37 718162.sst
-rw-r--r-- 1 root root  991 Apr 26 00:37 561517.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:37 534082.sst
-rw-r--r-- 1 root root  954 Apr 26 00:37 412668.sst
-rw-r--r-- 1 root root  954 Apr 26 00:37 126690.sst
-rw-r--r-- 1 root root 1004 Apr 26 00:37 1015311.sst
-rw-r--r-- 1 root root  991 Apr 26 00:37 749445.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 741632.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 436179.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 393039.sst
-rw-r--r-- 1 root root 2.6K Apr 26 00:37 318373.sst
-rw-r--r-- 1 root root  991 Apr 26 00:37 310502.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:37 302600.sst
-rw-r--r-- 1 root root 1.3K Apr 26 00:37 294761.sst
-rw-r--r-- 1 root root 7.4K Apr 26 00:37 148466.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 135382.sst
-rw-r--r-- 1 root root 7.2K Apr 26 00:37 074361.sst
-rw-r--r-- 1 root root 1010 Apr 26 00:37 067967.sst
-rw-r--r-- 1 root root  954 Apr 26 00:37 780804.sst
-rw-r--r-- 1 root root  978 Apr 26 00:37 765112.sst
-rw-r--r-- 1 root root  954 Apr 26 00:37 577169.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 538015.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 255512.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:37 239754.sst
-rw-r--r-- 1 root root  929 Apr 26 00:37 157219.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 038124.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 1003582.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:37 964578.sst
-rw-r--r-- 1 root root  967 Apr 26 00:36 1030977.sst
-rw-r--r-- 1 root root 1013 Apr 26 00:36 823841.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:36 761188.sst
-rw-r--r-- 1 root root  976 Apr 26 00:36 518444.sst
-rw-r--r-- 1 root root  991 Apr 26 00:36 498886.sst
-rw-r--r-- 1 root root 1.8K Apr 26 00:36 487106.sst
-rw-r--r-- 1 root root 1.7K Apr 26 00:36 459676.sst
-rw-r--r-- 1 root root 8.1K Apr 26 00:36 214101.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:36 205348.sst
-rw-r--r-- 1 root root  954 Apr 26 00:36 196631.sst
-rw-r--r-- 1 root root  975 Apr 26 00:36 174712.sst
-rw-r--r-- 1 root root 2.6K Apr 26 00:36 995791.sst
-rw-r--r-- 1 root root  991 Apr 26 00:36 440074.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:36 271159.sst
-rw-r--r-- 1 root root 7.0K Apr 26 00:36 263326.sst
-rw-r--r-- 1 root root 1011 Apr 26 00:36 247644.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:36 165968.sst
-rw-r--r-- 1 root root 2.5K Apr 26 00:36 984113.sst
-rw-r--r-- 1 root root 2.7K Apr 26 00:36 972367.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:36 952838.sst
-rw-r--r-- 1 root root 2.2K Apr 26 00:36 855162.sst
-rw-r--r-- 1 root root 1011 Apr 26 00:36 671142.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:36 122342.sst
-rw-r--r-- 1 root root  994 Apr 26 00:35 843455.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:35 831667.sst
-rw-r--r-- 1 root root 1.8K Apr 26 00:35 800320.sst
-rw-r--r-- 1 root root  966 Apr 26 00:35 776894.sst
-rw-r--r-- 1 root root 1012 Apr 26 00:35 706431.sst
-rw-r--r-- 1 root root  991 Apr 26 00:35 702440.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:35 424407.sst
-rw-r--r-- 1 root root  954 Apr 26 00:35 278999.sst
-rw-r--r-- 1 root root 7.6K Apr 26 00:35 035989.sst
-rw-r--r-- 1 root root 2.6K Apr 26 00:35 1007518.sst
-rw-r--r-- 1 root root 2.5K Apr 26 00:35 960661.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:35 937225.sst
-rw-r--r-- 1 root root  955 Apr 26 00:35 870804.sst
-rw-r--r-- 1 root root 1013 Apr 26 00:35 565420.sst
-rw-r--r-- 1 root root 1.8K Apr 26 00:35 444020.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:35 286845.sst
-rw-r--r-- 1 root root 5.6K Apr 26 00:35 109259.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:35 096208.sst
-rw-r--r-- 1 root root 1000 Apr 26 00:35 545837.sst
-rw-r--r-- 1 root root 1003 Apr 26 00:35 541909.sst
-rw-r--r-- 1 root root  991 Apr 26 00:35 522376.sst
-rw-r--r-- 1 root root 1.3K Apr 26 00:35 502780.sst
-rw-r--r-- 1 root root 1009 Apr 26 00:35 235863.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:35 227291.sst
-rw-r--r-- 1 root root 1.3K Apr 26 00:35 1034896.sst
-rw-r--r-- 1 root root  991 Apr 26 00:35 745553.sst
-rw-r--r-- 1 root root  966 Apr 26 00:35 667268.sst
-rw-r--r-- 1 root root 2.4K Apr 26 00:35 345860.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:35 338018.sst
-rw-r--r-- 1 root root  990 Apr 26 00:35 330171.sst
-rw-r--r-- 1 root root  947 Apr 26 00:35 065827.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:35 055177.sst
-rw-r--r-- 1 root root 1.1K Apr 26 00:35 1023142.sst
-rw-r--r-- 1 root root 1.2K Apr 26 00:35 808126.sst
-rw-r--r-- 1 root root  991 Apr 26 00:35 796409.sst
-rw-r--r-- 1 root root 1.8K Apr 26 00:35 569341.sst
-rw-r--r-- 1 root root 1.7K Apr 26 00:35 506668.sst
danielpops@host:~ [00:38:02] βœ“
$ sudo ls -ult /var/osquery/osquery.db/ | tail -100
-rw-r--r-- 1 root root 2.0K Apr 25 00:39 549787.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 545850.sst
-rw-r--r-- 1 root root  978 Apr 25 00:39 526282.sst
-rw-r--r-- 1 root root  953 Apr 25 00:39 306551.sst
-rw-r--r-- 1 root root  953 Apr 25 00:39 298670.sst
-rw-r--r-- 1 root root 3.6K Apr 25 00:39 290806.sst
-rw-r--r-- 1 root root  953 Apr 25 00:39 259400.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 251587.sst
-rw-r--r-- 1 root root  953 Apr 25 00:39 243701.sst
-rw-r--r-- 1 root root  950 Apr 25 00:39 044521.sst
-rw-r--r-- 1 root root  973 Apr 25 00:39 033853.sst
-rw-r--r-- 1 root root 1.2K Apr 25 00:39 031726.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 815994.sst
-rw-r--r-- 1 root root  954 Apr 25 00:39 753371.sst
-rw-r--r-- 1 root root  954 Apr 25 00:39 737707.sst
-rw-r--r-- 1 root root 1.8K Apr 25 00:39 510607.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 451867.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 917782.sst
-rw-r--r-- 1 root root 2.4K Apr 25 00:39 913905.sst
-rw-r--r-- 1 root root 1.2K Apr 25 00:39 902187.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 862986.sst
-rw-r--r-- 1 root root 1.2K Apr 25 00:39 839507.sst
-rw-r--r-- 1 root root 1.8K Apr 25 00:39 827783.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 788617.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 659481.sst
-rw-r--r-- 1 root root 2.6K Apr 25 00:39 655540.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 651620.sst
-rw-r--r-- 1 root root 2.6K Apr 25 00:39 620282.sst
-rw-r--r-- 1 root root  987 Apr 25 00:39 596759.sst
-rw-r--r-- 1 root root  954 Apr 25 00:39 592835.sst
-rw-r--r-- 1 root root  954 Apr 25 00:39 553712.sst
-rw-r--r-- 1 root root  966 Apr 25 00:39 479251.sst
-rw-r--r-- 1 root root  954 Apr 25 00:39 377342.sst
-rw-r--r-- 1 root root 1.0K Apr 25 00:39 357670.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 117970.sst
-rw-r--r-- 1 root root 1.2K Apr 25 00:39 091836.sst
-rw-r--r-- 1 root root  966 Apr 25 00:39 053052.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 976278.sst
-rw-r--r-- 1 root root  955 Apr 25 00:39 968481.sst
-rw-r--r-- 1 root root  992 Apr 25 00:39 898268.sst
-rw-r--r-- 1 root root  991 Apr 25 00:39 690717.sst
-rw-r--r-- 1 root root 2.4K Apr 25 00:39 585016.sst
-rw-r--r-- 1 root root 1016 Apr 25 00:39 369490.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 282908.sst
-rw-r--r-- 1 root root 7.9K Apr 25 00:39 267245.sst
-rw-r--r-- 1 root root  988 Apr 25 00:39 131054.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 104926.sst
-rw-r--r-- 1 root root 1.2K Apr 25 00:39 042388.sst
-rw-r--r-- 1 root root 1014 Apr 25 00:39 1011404.sst
-rw-r--r-- 1 root root  955 Apr 25 00:39 999697.sst
-rw-r--r-- 1 root root 1.6K Apr 25 00:39 910004.sst
-rw-r--r-- 1 root root 1.7K Apr 25 00:39 686825.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 647705.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 643809.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 632052.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 628111.sst
-rw-r--r-- 1 root root 1.2K Apr 25 00:39 624199.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 612424.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:39 608499.sst
-rw-r--r-- 1 root root 2.5K Apr 25 00:39 604584.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 161634.sst
-rw-r--r-- 1 root root  929 Apr 25 00:38 139757.sst
-rw-r--r-- 1 root root  991 Apr 25 00:38 070104.sst
-rw-r--r-- 1 root root 1.9K Apr 25 00:38 784713.sst
-rw-r--r-- 1 root root  991 Apr 25 00:38 494958.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 389129.sst
-rw-r--r-- 1 root root  975 Apr 25 00:38 170348.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 1027056.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 956732.sst
-rw-r--r-- 1 root root 3.0K Apr 25 00:38 948925.sst
-rw-r--r-- 1 root root  989 Apr 25 00:38 933357.sst
-rw-r--r-- 1 root root 1003 Apr 25 00:38 769052.sst
-rw-r--r-- 1 root root  991 Apr 25 00:38 722077.sst
-rw-r--r-- 1 root root 1.6K Apr 25 00:38 467487.sst
-rw-r--r-- 1 root root 1015 Apr 25 00:38 420476.sst
-rw-r--r-- 1 root root 6.2K Apr 25 00:38 059439.sst
-rw-r--r-- 1 root root  971 Apr 25 00:38 029588.sst
-rw-r--r-- 1 root root  955 Apr 25 00:38 847357.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 483185.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 231720.sst
-rw-r--r-- 1 root root 1.2K Apr 25 00:38 222921.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 187850.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 179083.sst
-rw-r--r-- 1 root root 2.2K Apr 25 00:38 941131.sst
-rw-r--r-- 1 root root 1.1K Apr 25 00:38 878664.sst
-rw-r--r-- 1 root root 1003 Apr 25 00:38 635972.sst
-rw-r--r-- 1 root root 1012 Apr 25 00:38 581092.sst
-rw-r--r-- 1 root root  953 Apr 25 00:38 326284.sst
-rw-r--r-- 1 root root    0 Dec  1  2016 LOG
-rw-r--r-- 1 root root    0 Dec  1  2016 LOG.old.1480623856134798
-rw-r--r-- 1 root root    0 Nov  7  2016 LOG.old.1480623856126275
-rw-r--r-- 1 root root    0 Nov  7  2016 LOG.old.1478520192771897
-rw-r--r-- 1 root root    0 Aug 30  2016 LOG.old.1478520192722831
-rw-r--r-- 1 root root    0 Aug 30  2016 LOG.old.1472605102050147
-rw-r--r-- 1 root root    0 Aug 23  2016 LOG.old.1472605102037343
-rw-r--r-- 1 root root    0 Aug 23  2016 LOG.old.1471996707661786
-rw-r--r-- 1 root root    0 Jun 15  2016 LOG.old.1471996707644407
-rw-r--r-- 1 root root    0 Jun 15  2016 LOG.old.1466011983926467
-rw-r--r-- 1 root root   37 Jun 15  2016 IDENTITY
-rw-r--r-- 1 root root    0 Jun 15  2016 LOCK
database question

All 38 comments

  • Is it normal to have so many files under there? No, these should be cleaned up over time, but this has been an issue in the past, see #3984 as it _might_ be related, but this shouldn't normally be the case.
  • Is it safe to purge old files based on mtime? or atime? Maybe. That's a tough one, as it's unknown what might be stored in there as it depends on your configuration. The worst case here is that you might lose data if you have some _really_ long standing differential queries lingering, but I'd bet that they can be pruned, however which ones to prune is difficult. Worst case scenario here - you could just blow away the whole DB after an upgrade and see if the issue comes back, which I don't think it will on later versions of osquery.
  • Has anyone else observed so many files and/or a high spike in CPU after running for several months? I'd hit up the Slack with this question, as some of the folks who had experienced this issue in the past idle there and might be able to offer insight into what symptoms they were having when this issue was around.

That all being said, is there any chance you could upgrade to 2.11.2 and see if this issue persists? There've been quite a few database changes in the later 2.X releases, so this issue _might_ have been resolved. Hope this information helps!

Thanks @muffins!! I'll see about upgrading at some point in the near future. I guess i'll upgrade a host with lots of sst files (without purging them first) and see what happens!

@danielpops right on. My bet is that nothing will happen on the hosts with large amounts of .sst files, as I don't _think_ that we do any sort of purging. I'm more curious if see if this can be replicated on newer versions of osquery, but as you mentioned this took quite a bit of time to trigger. I don't think we've seen that in our environment lately, but I would definitely encourage you to hit up Slack and see if other folks have seen this if you haven't already.

So I upgraded to the latest version (3.2.6 at the time of this post) and observed more or less about the same load. I then tried purging files in the /var/osquery/osquery.db/ folder that had an mtime > 90 days (while osqueryd was running). This resulted in osquery noticing something was wrong and resetting the db folder:

I0531 14:40:06.416247  5026 database.cpp:131] Resetting the database plugin: rocksdb
I0531 14:40:06.458199  5023 aws_log_forwarder.h:222] aws_kinesis: Successfully sent 1 out of 1 log records
I0531 14:40:06.473909  5026 rocksdb.cpp:132] Opening RocksDB handle: /var/osquery/osquery.db
I0531 14:40:06.556735  5026 rocksdb.cpp:66] RocksDB: [ERROR] [db/version_set.cc:1105] Unable to load table properties for file 8304344 --- IO error: While open a file for random read: /var/osquery/osquery.db/8304344.sst: No such file or directory
I0531 14:40:06.556872  5026 rocksdb.cpp:66] RocksDB: [ERROR] [db/version_set.cc:1105] Unable to load table properties for file 8296138 --- IO error: While open a file for random read: /var/osquery/osquery.db/8296138.sst: No such file or directory
I0531 14:40:06.608741  5026 rocksdb.cpp:66] RocksDB: [ERROR] [db/version_set.cc:1105] Unable to load table properties for file 8500660 --- IO error: While open a file for random read: /var/osquery/osquery.db/8500660.sst: No such file or directory
I0531 14:40:06.608876  5026 rocksdb.cpp:66] RocksDB: [ERROR] [db/version_set.cc:1105] Unable to load table properties for file 8500628 --- IO error: While open a file for random read: /var/osquery/osquery.db/8500628.sst: No such file or directory
I0531 14:40:06.608965  5026 rocksdb.cpp:66] RocksDB: [ERROR] [db/version_set.cc:1105] Unable to load table properties for file 8530488 --- IO error: While open a file for random read: /var/osquery/osquery.db/8530488.sst: No such file or directory
I0531 14:40:06.609026  5026 rocksdb.cpp:66] RocksDB: [ERROR] [db/version_set.cc:1105] Unable to load table properties for file 8545199 --- IO error: While open a file for random read: /var/osquery/osquery.db/8545199.sst: No such file or directory
[...]
I0531 14:40:07.649065  5026 rocksdb.cpp:66] RocksDB: [ERROR] [db/version_set.cc:1105] Unable to load table properties for file 2338 --- IO error: While open a file for random read: /var/osquery/osquery.db/002338.sst: No such file or directory
W0531 14:40:08.157758  5026 rocksdb.cpp:214] Backing up RocksDB database: /var/osquery/osquery.db.backup
W0531 14:40:08.158018  5026 rocksdb.cpp:221] Destroying RocksDB database due to corruption

I'm actually not sure if that behavior is unique to the latest version or not. I'm also not sure if that same thing would happen if I did the purging while osquery was _stopped_. I'll do some more testing here.

Edit: That behavior of osquery resetting the db folder is consistent on the older version i'm using too (2.8.1)

W0531 15:15:01.547013  5616 rocksdb.cpp:209] Removed previous RocksDB database backup: /var/osquery/osquery.db.backup
W0531 15:15:01.547236  5616 rocksdb.cpp:214] Backing up RocksDB database: /var/osquery/osquery.db.backup
W0531 15:15:01.547498  5616 rocksdb.cpp:221] Destroying RocksDB database due to corruption

Edit: The same behavior occurs on v2.8.1 if I do the purging of files with mtime > 60 days while osquery is not running.

$ sudo ls -al /var/osquery/osquery.db/ | wc -l
50405
$ sudo service osqueryd stop
$ sudo mkdir /var/osquery/osquery.db.bak
$ sudo find /var/osquery/osquery.db -type f -name '*' -mtime +60 -exec mv {} /var/osquery/osquery.db.bak/ \;
$ sudo ls -al /var/osquery/osquery.db/ | wc -l
7043
$ sudo osqueryd --flagfile=/etc/osquery/osquery.flags --config_path=/etc/osquery/osquery.conf --pidfile=/var/run/osqueryd.pidfile --verbose
W0531 15:38:02.382848 12035 rocksdb.cpp:214] Backing up RocksDB database: /var/osquery/osquery.db.backup
W0531 15:38:02.393203 12035 rocksdb.cpp:221] Destroying RocksDB database due to corruption

Awesome, thanks for reporting back! That behavior seems to make sense, I believe we have quite a bit of logic around monitoring the DB state. It’s also, unfortunately, not surprising that osquery still had references to some of the older sst files where old differentials might be stored. I’m not positive what the next step for triage here is gunna be. If you have your hands on one of these databases would you mind sharing it with me out of band? We can possibly take a look at what’s in there and maybe understand what’s causing this.

I'm not sure if there is much left to triage, per se. This is only a problem for me because my fleet is entirely on the older version of osquery where the sst files can pile up in to the 10's of thousands. My dilemma is that I want to get rid of 10's of thousands of these sst files, but I don't want to lose the historical context of the last ~30-60 days.

Unless there are any other ideas or suggestions, I should probably just bite the bullet and upgrade everywhere and purge all historical sst files, thus losing any historical context and potentially seeing lots of duplicate added events for some time...

FYI I've been running osquery v3.2.6 since May 31 on one host. On that host's /var/osquery/osquery.db folder, there are 5600 files now after running for ~28 days. That seems about in line with the previous versions. Seems to be about ~200 files per day:

dpopes@host:~ [17:42:11] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 26" | wc -l
201
dpopes@host:~ [17:42:17] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 25" | wc -l
192
dpopes@host:~ [17:42:21] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 24" | wc -l
214
dpopes@host:~ [17:42:23] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 23" | wc -l
208
dpopes@host:~ [17:42:26] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 22" | wc -l
217
dpopes@host:~ [17:42:28] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 21" | wc -l
192
dpopes@host:~ [17:42:30] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 20" | wc -l
186
dpopes@host:~ [17:42:36] βœ“
$ sudo ls -al /var/osquery/osquery.db/ | grep " Jun 19" | wc -l
183
dpopes@host:~ [17:42:40] βœ“
$ osqueryi --version
osqueryi version 3.2.6

Oh. Another thing! I was semi successful at purging older sst files and getting them in a state where osquery could still read them. The methodology was as follows:

  • Stop osquery
  • Purge a bunch of old records older than some threshold (30 days?) based on mtime
  • "repair" the database using the RocksDB-Repairer (https://github.com/facebook/rocksdb/wiki/RocksDB-Repairer)
  • Start osquery

By following this procedure, osquery was able to start up and never did the thing where it purges the database due to corruption. This may be a good approach that we can can instrument in our infrastructure, but it requires more testing, and still probably entails the potential for duplicate records to be replayed.

Hi,

I have the same issue:

# osqueryi --version
osqueryi version 3.2.4
# pwd
/var/osquery
# du -hs osquery.db/
557M    osquery.db/
# ls -rlt osquery.db/|wc -l
5453

Thanks
Best regards

This will happen if your config has _events tables enabled, but no scheduled queries for those events.
For example, if I have the following, but no "SELECT * FROM socket_events" in my schedule, then those socket_events cached events .sst files will not be deleted.

    "disable_events": false,
    "disable_audit": false,
    "audit_allow_process_events": true,
    "audit_allow_sockets": true,
    "audit_allow_fim_events": false,

Thanks for following up, @packetzero! I think my config doesn't meet the requirements you've stated. I have no _tables enabled afaik and am not using any of the audit system. I will take a look and try and see if I can pass some combination of flags to see any different results.

@muffins Actually I have another observation about this phenomena! Looking at my sst files, I see the following interesting thing:

danielpops@danielpops:~/scratch/osquery (master) [15:35:22] βœ“
$ sudo docker run --rm -it -u root -v /var/osquery:/var/osquery dpopes-rocksdb ./sst_dump --file=/v
ar/osquery/osquery.db.bak/ --command=scan | head -100
from [] to []
Process /var/osquery/osquery.db.bak//040206.sst
Sst file format: block-based
'aws_kinesis_r_1527720740_160249' seq:1288577, type:0 =>
'aws_kinesis_r_1527720864_160250' seq:1288639, type:0 =>
'aws_kinesis_r_1527720864_160251' seq:1288640, type:0 =>
'aws_kinesis_r_1527720864_160252' seq:1288641, type:0 =>
'aws_kinesis_r_1527720864_160253' seq:1288642, type:0 =>
'aws_kinesis_r_1527720864_160254' seq:1288643, type:0 =>
'aws_kinesis_r_1527720864_160255' seq:1288644, type:0 =>
'aws_kinesis_r_1527720864_160256' seq:1288645, type:0 =>
'aws_kinesis_r_1527720864_160257' seq:1288646, type:0 =>
'aws_kinesis_r_1527720864_160258' seq:1288647, type:0 =>
'aws_kinesis_r_1527720864_160259' seq:1288648, type:0 =>
'aws_kinesis_r_1527720864_160260' seq:1288649, type:0 =>
'aws_kinesis_r_1527720864_160261' seq:1288650, type:0 =>
'aws_kinesis_r_1527720864_160262' seq:1288651, type:0 =>
'aws_kinesis_r_1527720864_160263' seq:1288652, type:0 =>
'aws_kinesis_r_1527720864_160264' seq:1288653, type:0 =>
'aws_kinesis_r_1527720864_160265' seq:1288654, type:0 =>
'aws_kinesis_r_1527720864_160266' seq:1288655, type:0 =>
'aws_kinesis_r_1527720864_160267' seq:1288656, type:0 =>
'aws_kinesis_r_1527720864_160268' seq:1288657, type:0 =>
'aws_kinesis_r_1527720864_160269' seq:1288658, type:0 =>
'aws_kinesis_r_1527720864_160270' seq:1288659, type:0 =>
'aws_kinesis_r_1527720864_160271' seq:1288660, type:0 =>
'aws_kinesis_r_1527720864_160272' seq:1288661, type:0 =>
'aws_kinesis_r_1527720864_160273' seq:1288662, type:0 =>
'aws_kinesis_r_1527720864_160274' seq:1288663, type:0 =>
'aws_kinesis_r_1527720864_160275' seq:1288664, type:0 =>
'aws_kinesis_r_1527720864_160276' seq:1288665, type:0 =>
'aws_kinesis_r_1527720864_160277' seq:1288666, type:0 =>
'aws_kinesis_r_1527720864_160278' seq:1288667, type:0 =>
'aws_kinesis_r_1527720864_160279' seq:1288668, type:0 =>
'aws_kinesis_r_1527720864_160280' seq:1288669, type:0 =>
'aws_kinesis_r_1527720864_160281' seq:1288670, type:0 =>
'aws_kinesis_r_1527720864_160282' seq:1288671, type:0 =>
'aws_kinesis_r_1527720864_160283' seq:1288672, type:0 =>
'aws_kinesis_r_1527720864_160284' seq:1288673, type:0 =>
'aws_kinesis_r_1527720864_160285' seq:1288674, type:0 =>
'aws_kinesis_r_1527720864_160286' seq:1288675, type:0 =>
'aws_kinesis_r_1527720864_160287' seq:1288676, type:0 =>
'aws_kinesis_r_1527720864_160288' seq:1288677, type:0 =>
'aws_kinesis_r_1527720864_160289' seq:1288678, type:0 =>
'aws_kinesis_r_1527720864_160290' seq:1288679, type:0 =>
'aws_kinesis_r_1527720864_160291' seq:1288680, type:0 =>
'aws_kinesis_r_1527720864_160292' seq:1288681, type:0 =>
'aws_kinesis_r_1527720864_160293' seq:1288682, type:0 =>
'aws_kinesis_r_1527720864_160294' seq:1288683, type:0 =>
'aws_kinesis_r_1527720864_160295' seq:1288684, type:0 =>
'aws_kinesis_r_1527720916_160296' seq:1288710, type:0 =>
'aws_kinesis_r_1527720916_160297' seq:1288711, type:0 =>
'aws_kinesis_r_1527720916_160298' seq:1288712, type:0 =>
'aws_kinesis_r_1527720916_160299' seq:1288713, type:0 =>
'aws_kinesis_r_1527720916_160300' seq:1288714, type:0 =>
'aws_kinesis_r_1527720916_160301' seq:1288715, type:0 =>
'aws_kinesis_r_1527720916_160302' seq:1288716, type:0 =>
'aws_kinesis_r_1527720916_160303' seq:1288717, type:0 =>
'aws_kinesis_r_1527720916_160304' seq:1288718, type:0 =>
'aws_kinesis_r_1527720916_160305' seq:1288719, type:0 =>
'aws_kinesis_r_1527720916_160306' seq:1288720, type:0 =>
'aws_kinesis_r_1527720916_160307' seq:1288721, type:0 =>
'aws_kinesis_r_1527720916_160308' seq:1288722, type:0 =>
'aws_kinesis_r_1527720916_160309' seq:1288723, type:0 =>
'aws_kinesis_r_1527720916_160310' seq:1288724, type:0 =>
'aws_kinesis_r_1527720916_160311' seq:1288725, type:0 =>
'aws_kinesis_r_1527720916_160312' seq:1288726, type:0 =>
'aws_kinesis_r_1527720916_160313' seq:1288727, type:0 =>
'aws_kinesis_r_1527720922_160314' seq:1288805, type:0 =>
'aws_kinesis_r_1527720922_160315' seq:1288806, type:0 =>
'aws_kinesis_r_1527720922_160316' seq:1288807, type:0 =>
'aws_kinesis_r_1527720922_160317' seq:1288808, type:0 =>
'aws_kinesis_r_1527720922_160318' seq:1288809, type:0 =>
'aws_kinesis_r_1527720922_160319' seq:1288810, type:0 =>
'aws_kinesis_r_1527720922_160320' seq:1288811, type:0 =>
'aws_kinesis_r_1527720922_160321' seq:1288812, type:0 =>
'aws_kinesis_r_1527720922_160322' seq:1288813, type:0 =>
'aws_kinesis_r_1527720922_160323' seq:1288814, type:0 =>
'aws_kinesis_r_1527720922_160324' seq:1288815, type:0 =>
'aws_kinesis_r_1527720922_160325' seq:1288816, type:0 =>
'aws_kinesis_r_1527720922_160326' seq:1288817, type:0 =>
'aws_kinesis_r_1527720922_160327' seq:1288818, type:0 =>
'aws_kinesis_r_1527720922_160328' seq:1288819, type:0 =>
'aws_kinesis_r_1527720922_160329' seq:1288820, type:0 =>
'aws_kinesis_r_1527720922_160330' seq:1288821, type:0 =>
'aws_kinesis_r_1527720922_160331' seq:1288822, type:0 =>
'aws_kinesis_r_1527720922_160332' seq:1288823, type:0 =>
'aws_kinesis_r_1527720922_160333' seq:1288824, type:0 =>
'aws_kinesis_r_1527720922_160334' seq:1288825, type:0 =>
'aws_kinesis_r_1527720922_160335' seq:1288826, type:0 =>
'aws_kinesis_r_1527720922_160336' seq:1288827, type:0 =>
'aws_kinesis_r_1527720922_160337' seq:1288828, type:0 =>
'aws_kinesis_r_1527720922_160338' seq:1288829, type:0 =>
'aws_kinesis_r_1527720922_160339' seq:1288830, type:0 =>
'aws_kinesis_r_1527720922_160340' seq:1288831, type:0 =>
'aws_kinesis_r_1527720922_160341' seq:1288832, type:0 =>
'aws_kinesis_r_1527720922_160342' seq:1288833, type:0 =>
'aws_kinesis_r_1527720922_160343' seq:1288834, type:0 =>
'aws_kinesis_r_1527720922_160344' seq:1288835, type:0 =>
'aws_kinesis_r_1527720922_160345' seq:1288836, type:0 =>

danielpops@danielpops:~/scratch/osquery (master) [15:38:59] βœ—
$ sudo docker run --rm -it -u root -v /var/osquery:/var/osquery dpopes-rocksdb ./sst_dump --file=/v
ar/osquery/osquery.db.bak/ --command=scan | tail -10
'aws_kinesis_r_1526281906_89269' seq:221305, type:0 =>
'aws_kinesis_r_1526281906_89270' seq:221306, type:0 =>
'aws_kinesis_r_1526281906_89271' seq:221307, type:0 =>
'aws_kinesis_r_1526281906_89272' seq:221308, type:0 =>
'aws_kinesis_r_1526281906_89273' seq:221309, type:0 =>
'aws_kinesis_r_1526281906_89274' seq:221310, type:0 =>
'aws_kinesis_r_1526281906_89275' seq:221311, type:0 =>
'aws_kinesis_r_1526281980_89276' seq:221325, type:0 =>
'aws_kinesis_r_1526281980_89277' seq:221326, type:0 =>
'aws_kinesis_r_1526281984_89278' seq:221334, type:0 =>

Note that there are all those tables with dynamic looking names in them, e.g. aws_kinesis_r_<timestamp>_<number>, which appears to come from BufferedLogForwarder (https://github.com/facebook/osquery/blob/3.2.6/osquery/logger/plugins/buffered.cpp#L255), but maybe these aren't getting properly managed by the kinesis logger? Is it possible that the presence of those indefinitely never being cleared up may be causing those osquery files to stick around forever?

The output lines from sst_dump ending with '=>' indicate items (keys) that have been deleted. If the files are hanging around, there's probably one or more keys have haven't been deleted. For files that have most deleted (e.g. '040206.sst') can you see which of the keys haven't been deleted?

Also, do you see the following string in your WARN log?
"aws_log_forwarder::send() error_count:"
This is one codepath that will cause keys to not be deleted, but retried EVERY time. The AWS logger implementation is not very error savvy.

Regarding CPU usage, BufferedLogger derived loggers like KinesisLogger call database::remove() for each key (e.g. each row in a query result). For RocksDB, this will turn into a sync for anything that is not an events table, so with many files, I could see this causing contention. We should look at also skipping sync for domain==kLogs.

@packetzero when running in verbose mode, I do see intermittently a bunch of =>

E0122 18:09:56.783711 19018 aws_util.cpp:178] Exception making HTTP request to URL (https://kinesis.us-west-1.amazonaws.com): std::exception

and then in one case, I see that and a similar looking message adjacent to eachother =>

E0122 18:19:05.391729 19018 aws_util.cpp:178] Exception making HTTP request to URL (https://kinesis.us-west-1.amazonaws.com): std::exception
E0122 18:19:05.391970 19018 aws_log_forwarder.h:247] aws_kinesis: Complete request failure: Unable to connect to endpoint

Probably unrelated, on that same host I see lots of

I0122 18:11:49.167207 19021 scheduler.cpp:83] Executing scheduled query ops_iptables_rules: SELECT * FROM iptables;
I0122 18:11:49.173919 19021 virtual_table.cpp:291] Error casting bytes (51875880084) to INTEGER
I0122 18:11:49.175422 19021 virtual_table.cpp:291] Error casting protocol () to INTEGER
I0122 18:11:49.175456 19021 virtual_table.cpp:291] Error casting packets (5707531830) to INTEGER
I0122 18:11:49.175473 19021 virtual_table.cpp:291] Error casting bytes (2023969024884) to INTEGER
I0122 18:11:49.175503 19021 virtual_table.cpp:291] Error casting protocol () to INTEGER
I0122 18:11:49.175524 19021 virtual_table.cpp:291] Error casting packets (5707531822) to INTEGER
I0122 18:11:49.175541 19021 virtual_table.cpp:291] Error casting bytes (2023969020276) to INTEGER
I0122 18:11:49.175567 19021 virtual_table.cpp:291] Error casting protocol () to INTEGER
I0122 18:11:49.175616 19021 virtual_table.cpp:291] Error casting protocol () to INTEGER
I0122 18:11:49.175667 19021 virtual_table.cpp:291] Error casting packets (6100607407) to INTEGER
I0122 18:11:49.175688 19021 virtual_table.cpp:291] Error casting bytes (1821528008154) to INTEGER
I0122 18:11:49.175716 19021 virtual_table.cpp:291] Error casting protocol () to INTEGER
I0122 18:11:49.175740 19021 virtual_table.cpp:291] Error casting packets (6100607407) to INTEGER
I0122 18:11:49.175756 19021 virtual_table.cpp:291] Error casting bytes (1821528008154) to INTEGER

Also, a few more observations that may be useful or not =>

From a co-worker =>

I'm seeing sst files accumulating at the same rate. I did notice something though - some of the files are being cleaned up. The sst count skips, so some files are getting removed regularly. It looks like it removes files every 5 minutes, but if more than one sst file was created in the last 5 minutes, only the latest is getting removed.

Another unrelated observation, is that on a particular host, I see this very frequently in the logs:

I0122 18:40:06.180625 21856 scheduler.cpp:83] Executing scheduled query forensics_etc_hosts: SELECT * FROM etc_hosts;
I0122 18:40:06.199512 21856 database.cpp:131] Resetting the database plugin: rocksdb
I0122 18:40:06.199759 21856 rocksdb.cpp:132] Opening RocksDB handle: /var/osquery/osquery.db

Seems like the rocksdb handle is getting reset a bunch. not sure what's up with that. On that particular host, there are a bunch of log files piling up under /tmp/ =>

dpopes@the_server:~ [18:44:33] βœ“
$ ls /tmp | grep osquery | wc -l
36210
dpopes@the_server:~ [18:44:47] βœ“
$ ls /tmp | grep osquery | tail -30
osqueryd.the_server.invalid-user.log.WARNING.20190122-174804.19561
osqueryd.the_server.invalid-user.log.WARNING.20190122-174805.19587
osqueryd.the_server.invalid-user.log.WARNING.20190122-181808.22360
osqueryd.the_server.invalid-user.log.WARNING.20190122-181808.22386
osqueryd.the_server.invalid-user.log.WARNING.20190122-181808.22416
osqueryd.the_server.invalid-user.log.WARNING.20190122-181808.22444
osqueryd.the_server.invalid-user.log.WARNING.20190122-181809.22455
osqueryd.the_server.root.log.ERROR.20190118-101832.24457
osqueryd.the_server.root.log.ERROR.20190118-103720.29744
osqueryd.the_server.root.log.ERROR.20190118-103733.29941
osqueryd.the_server.root.log.ERROR.20190118-104708.30955
osqueryd.the_server.root.log.ERROR.20190118-105147.19009
osqueryd.the_server.root.log.ERROR.20190122-183410.21843
osqueryd.the_server.root.log.INFO.20190118-101832.24457
osqueryd.the_server.root.log.INFO.20190118-103720.29744
osqueryd.the_server.root.log.INFO.20190118-103733.29941
osqueryd.the_server.root.log.INFO.20190118-104708.30955
osqueryd.the_server.root.log.INFO.20190118-104722.19007
osqueryd.the_server.root.log.INFO.20190118-104722.19009
osqueryd.the_server.root.log.INFO.20190122-182945.21841
osqueryd.the_server.root.log.INFO.20190122-182945.21843
osqueryd.the_server.root.log.WARNING.20190118-101832.24457
osqueryd.the_server.root.log.WARNING.20190118-103720.29744
osqueryd.the_server.root.log.WARNING.20190118-103733.29941
osqueryd.the_server.root.log.WARNING.20190118-104708.30955
osqueryd.the_server.root.log.WARNING.20190118-105147.19009
osqueryd.the_server.root.log.WARNING.20190122-183410.21843
osqueryd.ERROR@
osqueryd.INFO@
osqueryd.WARNING@

The 'Resetting the database plugin' stuff happens regularly on all systems I have observed, so I consider that normal.

For the files in /tmp/, that's weird. Also, it seems strange to see the usernames in the filenames. You should check the osqueryd processes running. Could it be that it's setup to run as a user process on that particular system? Seems like some unusual config going on there.

Not sure where we go from here. I am unable to reproduce this with 3.3.1. It may be up to you to try to figure out the root cause unless someone else sees it. Can you try with v3.3.1?

I see 'shed7' is having the same problem, with 38,000 lines like 'data.auditeventpublisher.process_events.0003350252' seq:6482196, type:0 =>' . I believe that lines ending with the arrow '=>' are indexes that were never deleted, but the data has.

Looking at the code in osquery/events/events.cpp, I can see a scenario where this is possible. Indexes are only being deleted if all queries have been completed. But I don't think the query results are posted yet. The getIndexes() call will call expireIndexes(). I suspect that if you add the LOG statement below in expireIndexes() it will be logged in your case.

void EventSubscriberPlugin::expireIndexes(
    const std::string& list_type,
    const std::vector<std::string>& indexes,
    const std::vector<std::string>& expirations) {
  if (!executedAllQueries()) {
    LOG(WARNING) << "DBG Not deleting indexes, since all Queries not finished. size:" << indexes.size() << " queries_.size:" << queries_.size();
    return;
  }

I wrote a fake_events table that can be configured to send many events on any platform (MacOS, Linux, or Windows). If I crank it up fast enough (--fake_run_pause_millis=25 --fake_events_per_run=77), I do see some unusual logging messages from rocksdb:

I0210 23:05:58.978260 69574656 rocksdb.cpp:68] RocksDB: [WARN] [db/column_family.cc:675] [queries] Stalling writes because we have 15 immutable memtables (waiting for flush), max_write_buffer_number is set to 16 rate 16777216

On MacOS running 3.3.1, I can see the number of files grow in DB dir (208 files in a minute or so, but mostly .log files, rather than .sst). Eventually they get pruned down to 18 files after the Resetting the database plugin: rocksdb happens. I'll try it on Linux tomorrow.

So any developers wanting to reproduce this can use: fake_events changes

On Ubuntu 18.04 running 3.3.1, I get the same results, where the files are cleaned up. The number of files grows, with a .log file every time the events exceeded limit 50000 by: 254 messages are seen, but after a rocksdb reset get cleaned up. I was hoping to try on Ubuntu 14 to match this ticket, but the 'build deps' fails, refusing to install linux brew.

Do you mind installing the deb on the Ubuntu 14 machine? https://pkg.osquery.io/deb/osquery_3.3.2_1.linux.amd64.deb

sha256sum: 6c54a55df1feaf410064bc87696f9829d0787fb5b9c7beabeb4f5cca7ce6c3fb

This will happen if your config has _events tables enabled, but no scheduled queries for those events.
For example, if I have the following, but no "SELECT * FROM socket_events" in my schedule, then those socket_events cached events .sst files will not be deleted.

    "disable_events": false,
    "disable_audit": false,
    "audit_allow_process_events": true,
    "audit_allow_sockets": true,
    "audit_allow_fim_events": false,

@packetzero If osquery has flag for events_max=10000 or events_expiry=1800, the .sst files will be deleted even without scheduled queries. right?

I was working on a related fix today. There are so many issues with buffered logger...

  • on windows 2012, it takes 40 seconds to delete 500 records (one-by-one), with moderate activity. Each delete call will need to acquire DB lock, so it ends up waiting. Even more so if you have events hammering your DB. The result is your data is VERY slow to report.
  • it should be using fixed length keys like events code (10-digit 0-prefixed number), since rocksdb uses string compare. So when scanning keys, it will put them in order 'aws_kinesis_r_1527720922_160339' < 'aws_kinesis_r_1526281906_89269' . So your server-side timestamp / ordering checking will probably be throwing out some records as 'already seen'. Could also lead to data files not being cleaned up.

Hi @danielpops I am having same issue in my environment. I am thinking to take your suggestion that deleting old sst files. Do you see any issue doing so? Did you do it via chef or you did it manually?
I have to use chef to delete files since I have many systems having this issue.
One stupid question, can we reset rocksdb? how?
Thanks!

Hi @muffins / @danielpops / @packetzero

The size of osquery.db has been growing since last December. I'm running osquery v3.3.2 on Debian Stretch.

➜ du -sh osquery.db
3.2G    osquery.db

The data inside the SST files is of Type:1. Would expired query results show up as Type:0? Further, I'm only using the TLS logger.

➜  rocksdb_sst_dump --file=osquery.db/133751.sst --command=scan  | grep ' type:1' | wc -l
    4645
➜  rocksdb_sst_dump --file=osquery.db/133751.sst --command=scan  | grep ' type:0' | wc -l
       0

The query grabs data from virtual table process_open_sockets and I don't have _event queries

➜ rocksdb_sst_dump --osquery.db/133751.sst --command=scan --read_num=1
from [] to []
Process /osquery.db/133751.sst
Sst file format: block-based
'tls_r_1578230693_21233' seq:0, type:1 => {"name":"pack_process_open_sockets","hostIdentifier":"[...]","calendarTime":"Sun Jan  5 13:24:36 2020 UTC","unixTime":1578230676,"epoch":0,"counter":406,"config_hash":"6beac24247f41edc40fa0f54324eb627ff05a976","columns":{"local_address":"[...]","local_port":"[...]","path":"[...]},"action":"added"}

Since Dec. 7th there's 2MB SST files.

➜  tools git:(master) ls -l osquery.db/ | head -15
.rw-r--r-- [...]     2 MB Sat Dec  7 02:58:16 2019 064081.sst
.rw-r--r-- [...]   2 MB Sat Dec  7 02:58:16 2019 064082.sst
.rw-r--r-- [...]    2 MB Sat Dec  7 03:05:01 2019 064103.sst
.rw-r--r-- [...]    2 MB Sat Dec  7 03:05:01 2019 064105.sst
.rw-r--r-- [...]     2 MB Sat Dec  7 03:05:01 2019 064107.sst

It's my understanding that stored differential data expires? Also, I restarted the osquery service and osquery.db remained that same size.

Does anyone have data/experiences as to whether this still occurs on the 4.x series of releases?

It still occurs on 4.1.1. We have made systemd timer that periodically cleans sst-files based on mtime and restarts osquery, but that timer got broken on one machine and now i see

● osqueryd.service - The osquery Daemon
   Loaded: loaded (/usr/lib/systemd/system/osqueryd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-02-06 11:24:40 UTC; 6 months 7 days ago
...
sudo ls -lah /usr/share/osquery/osquery.dbq | wc -l
47871

I have made the backup before droping the database, and will try to analyse it later.

UPD: Most of records (12779794/13087785) are "deleted" tls_r records from tls logging plugin.

Could this issue stem from the differential engine? In other words, RocksDB stores previous query results to calculate differentials for recent query results. However, let's imagine over a month period a query is constantly changed. Thus, RocksDB stores copies of past queries that don't reflect the current query in configuration. How does osquery purge cache data from months ago?

If anyone has a large DB directory, or one with many SST files, can you compress and email it to me [email protected]?

My goal is to understand if there is a fixable root cause to the unexpected growth and creation of these files.

@gbedoya Just to understand, does it happen with a new database (deleting the old osquery.db file) on 4.x?

~99% of the data produced from a 3.2 GB osquery.db from last year, during a month period, was related to a query using the _processes_ and _process_open_sockets_ tables. Further, this query was executed on a 5-15 min interval.

It's been a while, but I think the problem was either rocksdb usage for :

  • buffered logging (aws_kinesis, tls_logger, etc.)
  • event cache where table has multiple interval values (socket_events with queries for 1 min, 5 min, 30 min).
    I ended up writing replacements for both and problem went away.
    @gbedoya What logger are you using? Can you try filesystem logger temporarily and see if growth continues?

TLS logging to a central endpoint.

From SST files ...

'tls_r_1578230693_21233' seq:0, type:1 => {"name":"pack_process_open_sockets","hostIdentifier":"[...]","calendarTime":"Sun Jan  5 13:24:36 2020 UTC","unixTime":1578230676,"epoch":0,"counter":406,"config_hash":"6beac24247f41edc40fa0f54324eb627ff05a976","columns":{"local_address":"[...]","local_port":"[...]","path":"[...]},"action":"added"}

I left an instance running for a week configured to publish logs every 5s to a non-existent endpoint. I then pegged the box with process events for hours, which created about 50M of process events. Afterward the box remained more it less idle.

Results: 1567 SST files, 107MB of data total. Almost all of the SST files are 39,658B, and almost every file belongs to the logs database column family.

I think I have a fix https://github.com/osquery/osquery/pull/6606 and that should apply retroactively.

Then, with the merge of #6606 is this issue resolved?

Yes and 4.5.0 includes that fix.

Thanks @theopolis . Cloudflare will upgrade to v4.5.0 for this feature and ARM support.

Awesome, let me know on Slack if there’s anything you need.

Was this page helpful?
0 / 5 - 0 ratings

Related issues

sanghyunhong picture sanghyunhong  Β·  19Comments

Iwant2learn picture Iwant2learn  Β·  23Comments

andrew-d picture andrew-d  Β·  12Comments

ageis picture ageis  Β·  12Comments

iyerabhi07 picture iyerabhi07  Β·  14Comments