While setting up my version of the fluxorama Docker container to load both flux-accounting and the job-archive module, I noticed that the inactive jobs written to the .db file do not persist after the Flux system instance is restarted.
My reproducer:
From within the container, I'll submit a couple of jobs:
[fluxuser@1880c353bc24 ~]$ flux mini submit -n 3 hostname
453353930752
[fluxuser@1880c353bc24 ~]$ flux mini submit -n 2 hostname
405773746176
These two jobs can be seen with both flux jobs -A and in the job-archive DB, whose location is in /run/flux/jobs.db:
sqlite> SELECT userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------- ---------------- -------------- ---------------- -----------------------------------------------------------------------------
2001 453353930752 1594834322.78774 1594834322.808 1594834322.87507 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-2"}}]}}
2001 405773746176 1594835454.78189 1594835454.801 1594835454.86345 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}}
After a systemctl restart flux, the two jobs will still show with flux jobs -A, but not in the job-archive DB:
sqlite> SELECT userid,id,t_submit,t_run,t_inactive,R from jobs;
sqlite>
The .db file gets a new modification time after the Flux instance is restarted, but there are no previously completed jobs written there anymore.
IMHO, I don't think it is a blocker for the testing environment I was planning on working with Ryan Day, but it's just something I noticed.
What's the dbpath you are using?
dbpath is /run/flux/jobs.db
I'm confident that the data can persist in the db after the flux instance is brought down. As an experiment, I tried this:
--- a/t/t2220-job-archive.t
+++ b/t/t2220-job-archive.t
@@ -8,7 +8,7 @@ export FLUX_CONF_DIR=$(pwd)
test_under_flux 4 job
ARCHIVEDIR=`pwd`
-ARCHIVEDB="${ARCHIVEDIR}/jobarchive.db"
+ARCHIVEDB="/tmp/achu/jobarchive.db"
So basically instead of creating a new file each time the tests are run, set a path where data from a prior test will be loaded. First test run succeeds, second test run sees the prior data and tests fail.
For this issue, the question is if the db might be removed somehow at the end of the instance running? Or if the instance is shut down in some way that some data is not flushed to disk?
I asked @cmoussa1 if he could try storing the db someplace "safe" as a first test. Given that it's Docker and systemd, I'm not sure the path /run/flux is a safe place to store a db (i.e. the directory could be wiped after the instance stops running).
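One quick way to probe the "safe place" question is to look at the filesystem type behind a candidate directory. The following is a minimal sketch, assuming a Linux host with GNU coreutils `stat`; the helper names `fs_type` and `is_volatile_dir` are made up for illustration. It only reports whether a directory sits on a memory-backed filesystem, which would not survive a reboot or container restart. It does not tell you whether systemd or anything else removes the directory when the service stops.

```shell
#!/bin/sh
# Print the filesystem type backing a path (GNU coreutils stat assumed).
fs_type() {
    stat -f -c %T "$1"
}

# Succeed if the path sits on a memory-backed filesystem (tmpfs/ramfs),
# i.e. its contents will not survive a reboot or container restart.
is_volatile_dir() {
    case "$(fs_type "$1" 2>/dev/null)" in
        tmpfs|ramfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the two parent locations discussed in this thread.
for dir in /run /var/lib; do
    if [ -d "$dir" ]; then
        if is_volatile_dir "$dir"; then
            echo "$dir: volatile ($(fs_type "$dir"))"
        else
            echo "$dir: $(fs_type "$dir")"
        fi
    fi
done
```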
Ah, yeah, content-store is kept under an alternate path in this environment (I think I hinted to @cmoussa1 that job-archive db should go in rundir but that probably wasn't correct).
content.backing-path /usr/var/lib/flux/content.sqlite
Sorry for the misdirection!
Our Docker image probably needs to be fixed, though: it should be /var/lib, not /usr/var/lib. We probably need to add --localstatedir=/var to the default configure args.
I was able to point the job-archive DB to /usr/var/lib/flux/, where content.sqlite is located. Jobs do in fact remain in the DB after a restart:
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------- ---------------- ---------------- ---------------- -----------------------------------------------------------------------------
2201 1553133993984 1594921654.80753 1594921654.82732 1594921654.89485 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}}
2001 1260606455808 1594921637.3711 1594921637.389 1594921637.44179 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 1238141763584 1594921636.03251 1594921636.05075 1594921636.12109 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
Then after a systemctl restart flux:
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------- ---------------- ---------------- ---------------- -----------------------------------------------------------------------------
2201 1553133993984 1594921654.80753 1594921654.82732 1594921654.89485 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}}
2001 1260606455808 1594921637.3711 1594921637.389 1594921637.44179 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 1238141763584 1594921636.03251 1594921636.05075 1594921636.12109 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
The one thing I noticed is that root privileges are required to access this database here. Maybe this is expected, and the behavior we want with access to the job-archive DB. With SQLite, I believe the user trying to open the database _also_ needs access to the directory the database file is located in [1].
FWIW, I tried specifying a custom location elsewhere (i.e. I made a new directory under / called /new-dir and specified dbpath='/new-dir/jobs.db'), but I get a cmb.insmod: No such file or directory error.
[root@394b0121e8cc ~]# flux module load job-archive dbpath=/new-dir/jobs.db
flux-module: cmb.insmod: No such file or directory
@cmoussa1, the newest fluxorama image has the change to --localstatedir=/var, so the content.sqlite backing store will be found under /var/lib/flux. You might want to read the relevant attr (content.backing-path) in your rc1 script so that the job-archive db is always co-located with the content store.
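A rough sketch of what that rc1 logic could look like, assuming the `content.backing-path` attribute shown earlier in this thread is readable with `flux getattr` and that the DB keeps the `jobs.db` file name used above. The function names `archive_dbpath_for` and `load_job_archive` are hypothetical, and the module load line simply mirrors the `flux module load job-archive dbpath=...` invocation already used in this issue.

```shell
#!/bin/sh
# Map the content backing-store path to a job-archive DB path in the
# same directory. "jobs.db" is the file name used earlier in this thread.
archive_dbpath_for() {
    printf '%s/jobs.db\n' "$(dirname "$1")"
}

# Hypothetical rc1 helper: look up where the content store lives and
# load job-archive with a dbpath next to it. Must run inside a Flux
# instance, so it is only defined here, not called.
load_job_archive() {
    backing=$(flux getattr content.backing-path) || return 1
    flux module load job-archive dbpath="$(archive_dbpath_for "$backing")"
}
```

With the backing store at /var/lib/flux/content.sqlite, `archive_dbpath_for` yields /var/lib/flux/jobs.db, so calling `load_job_archive` from rc1 would place both files in the same directory without hard-coding the prefix.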
Just confirmed this - thanks @grondo!
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------ ---------------- ---------------- ---------------- ---------------------------------------------------------------------------
2001 817520181248 1594931850.05394 1594931850.07371 1594931850.13204 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 799014912000 1594931848.95024 1594931848.97049 1594931849.03598 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
sqlite> .exit
[root@98a7bf4b57b6 ~]# systemctl restart flux
[root@98a7bf4b57b6 ~]# sqlite3 /var/lib/flux/jobs.db
SQLite version 3.26.0 2018-12-01 12:34:55
Enter ".help" for usage hints.
sqlite> .mode columns
sqlite> .headers on
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid id t_submit t_run t_inactive R
---------- ------------ ---------------- ---------------- ---------------- ---------------------------------------------------------------------------
2001 817520181248 1594931850.05394 1594931850.07371 1594931850.13204 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001 799014912000 1594931848.95024 1594931848.97049 1594931849.03598 {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
sqlite>
Side question - do you think the bank/accounting database should also reside in /var/lib/flux? I think this would mean only root would be able to interact with both databases (the job-archive DB and the bank/accounting DB)
Good question. I'm not sure how the bank/accounting DBs work. The utilities need direct rw access for now?
Yes, both DBs need direct rw access, as well as ownership of the directory the database file resides in (this is a consequence of using SQLite).
Then it does seem like those DBs need to go in a different directory. Perhaps with group permissions for a new fluxadmin group? At least for now...
I think so too. The problem I've been running into when trying to place the job-archive DB in a custom location, however, is that I get a cmb.insmod: No such file or directory error, even when I create the new directory as root.
[root@844cdba731d4 ~]# mkdir /new-dir
[root@394b0121e8cc ~]# flux module load job-archive dbpath=/new-dir/jobs.db
flux-module: cmb.insmod: No such file or directory
FWIW, I tried this by running a single instance on one of the LC machines, and got the same error. Maybe I am misunderstanding the dbpath option.
I wouldn't choose /new-dir, since it isn't a directory that is going to exist on any Linux system. In the standard filesystem hierarchy, this DB probably _also_ should exist under /var/lib, so maybe /var/lib/flux-accounting? Maybe for the example Docker image, the directory should have flux.fluxadmin permissions. Be sure to create the directory before you load the job-archive module.
More discussion probably needed on how this might work in a production situation.
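For the example image, the directory setup could be sketched roughly as follows. This is only an illustration under assumptions: the `fluxadmin` group does not exist yet and is just the name floated above, `/var/lib/flux-accounting` is the suggested (not agreed) location, and `make_db_dir` is a made-up helper. The setgid bit is an extra choice here so that files created in the directory inherit its group.

```shell
#!/bin/sh
# Create a DB directory that is group-writable and setgid, so files
# created in it inherit the directory's group.
# $1 = directory, $2 = optional owner:group spec (changing owner needs root).
make_db_dir() {
    dir=$1
    spec=${2:-}
    mkdir -p "$dir" || return 1
    if [ -n "$spec" ]; then
        chown "$spec" "$dir" || return 1
    fi
    chmod 2770 "$dir"
}
```

As root, and before loading the module, this might be called as `make_db_dir /var/lib/flux-accounting flux:fluxadmin`.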
Typically this error indicates there's a path issue, e.g. like /new-dir doesn't exist. Although what you're doing above seems legit. Can you see what is in the flux broker logs?
Probably because the mkdir is being done as root, and flux runs as the flux user.
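The ownership theory can be checked before loading the module. A small sketch, with a hypothetical helper name `db_dir_usable`: it succeeds only if the directory that would hold the DB exists and the calling user can create files in it. It does not explain why the broker reports "No such file or directory" rather than a permission error; the broker logs are still the place to look for that.

```shell
#!/bin/sh
# Succeed only if the directory that will hold the DB file exists and
# the current user has write and search permission on it.
# $1 = intended dbpath, e.g. /new-dir/jobs.db
db_dir_usable() {
    dir=$(dirname "$1")
    [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]
}
```

Run it as the user the broker runs as (the flux user in this setup), since root passes the write check on almost any directory.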