Flux-core: job-archive: job records are not persistent in DB after instance restart

Created on 15 Jul 2020 · 16 Comments · Source: flux-framework/flux-core

While setting up my version of the fluxorama Docker container to load both flux-accounting and the job-archive module, I noticed that the inactive jobs that get written to the .db file are not persistent after a system instance restart of Flux.

My reproducer:

From within the container, I'll submit a couple of jobs:

[fluxuser@1880c353bc24 ~]$ flux mini submit -n 3 hostname
453353930752
[fluxuser@1880c353bc24 ~]$ flux mini submit -n 2 hostname
405773746176

These two jobs can be seen both with flux jobs -A and in the job-archive DB, which is located at /run/flux/jobs.db:

sqlite> SELECT userid,id,t_submit,t_run,t_inactive,R from jobs;
userid      id             t_submit          t_run           t_inactive        R                                                                            
----------  -------------  ----------------  --------------  ----------------  -----------------------------------------------------------------------------
2001        453353930752  1594834322.78774  1594834322.808  1594834322.87507  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-2"}}]}}
2001        405773746176  1594835454.78189  1594835454.801  1594835454.86345  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}} 

After a systemctl restart flux, the two jobs will still show with flux jobs -A, but not in the job-archive DB:

sqlite> SELECT userid,id,t_submit,t_run,t_inactive,R from jobs;
sqlite>

The .db file gets a new modification time after the Flux instance is restarted, but there are no previously completed jobs written there anymore.

IMHO, I don't think it is a blocker for the testing environment I was planning on working with Ryan Day, but it's just something I noticed.

All 16 comments

What's the dbpath you are using?

dbpath is /run/flux/jobs.db

I'm confident that the data can persist in the db after the flux instance is brought down. As an experiment, I tried this:

--- a/t/t2220-job-archive.t
+++ b/t/t2220-job-archive.t
@@ -8,7 +8,7 @@ export FLUX_CONF_DIR=$(pwd)
 test_under_flux 4 job

 ARCHIVEDIR=`pwd`
-ARCHIVEDB="${ARCHIVEDIR}/jobarchive.db"
+ARCHIVEDB="/tmp/achu/jobarchive.db"

So basically, instead of creating a new file each time the tests are run, I set a path where data from a prior test will be loaded. The first test run succeeds; the second test run sees the prior data and the tests fail.

For this issue, the question is whether the db might be removed somehow when the instance stops running, or whether the instance is shut down in a way that leaves some data unflushed to disk.

I asked @cmoussa1 if he could try storing the db someplace "safe" as a first test. Given it's Docker and systemd, I'm not sure the path /run/flux is a safe place to store a db (i.e. the directory could be wiped after the instance stops running).
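
One way to test the suspicion above is to look at what kind of filesystem backs the directory. On many systemd-based systems /run is a tmpfs, which does not survive a reboot, and systemd also removes a unit's RuntimeDirectory= when the service stops; whether either applies to /run/flux in this container is not confirmed in this thread. The sketch below is only a diagnostic aid: the function names and the DBDIR variable are hypothetical, and fs_type_of assumes GNU (or BusyBox) stat.

```shell
#!/bin/sh
# Hypothetical helpers (not part of flux-core): report whether a directory
# lives on a memory-backed filesystem.

# Print the filesystem type of the given directory (assumes GNU/BusyBox stat).
fs_type_of() {
    stat -f -c %T "$1"
}

# Succeed if the named filesystem type is memory-backed.
is_volatile_fstype() {
    case "$1" in
        tmpfs|ramfs) return 0 ;;
        *) return 1 ;;
    esac
}

# Inspect the directory that would hold the job-archive DB.
# DBDIR is an assumed override; /run is the parent of the path in this issue.
dbdir=${DBDIR:-/run}
if [ -d "$dbdir" ]; then
    fstype=$(fs_type_of "$dbdir")
    if is_volatile_fstype "$fstype"; then
        echo "$dbdir is on $fstype: contents will not survive a reboot"
    else
        echo "$dbdir is on $fstype"
    fi
fi
```

A tmpfs result alone would not explain data loss across a plain service restart, so it is worth also checking whether the directory itself is recreated when the service restarts.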

Ah, yeah, content-store is kept under an alternate path in this environment (I think I hinted to @cmoussa1 that job-archive db should go in rundir but that probably wasn't correct).

content.backing-path                    /usr/var/lib/flux/content.sqlite

Sorry for the misdirection!

Our docker image probably needs to be fixed, though: it should be /var/lib, not /usr/var/lib. We probably need to add --localstatedir=/var to the default configure args.

I was able to point the job-archive DB to /usr/var/lib/flux/, where content.sqlite is located. Jobs do in fact remain in the DB after a restart:

sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid      id             t_submit          t_run             t_inactive        R                                                                            
----------  -------------  ----------------  ----------------  ----------------  -----------------------------------------------------------------------------
2201        1553133993984  1594921654.80753  1594921654.82732  1594921654.89485  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}}
2001        1260606455808  1594921637.3711   1594921637.389    1594921637.44179  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}  
2001        1238141763584  1594921636.03251  1594921636.05075  1594921636.12109  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}} 

Then after a systemctl restart flux:

sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid      id             t_submit          t_run             t_inactive        R                                                                            
----------  -------------  ----------------  ----------------  ----------------  -----------------------------------------------------------------------------
2201        1553133993984  1594921654.80753  1594921654.82732  1594921654.89485  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0-1"}}]}}
2001        1260606455808  1594921637.3711   1594921637.389    1594921637.44179  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}  
2001        1238141763584  1594921636.03251  1594921636.05075  1594921636.12109  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}

The one thing I noticed is that root privileges are required to access this database here. Maybe this is expected, and the behavior we want with access to the job-archive DB. With SQLite, I believe the user trying to open the database _also_ needs access to the directory the database file is located in [1].

FWIW, I tried specifying a custom location elsewhere (i.e. I made a new directory under / called /new-dir and specified dbpath=/new-dir/jobs.db), but I get a cmb.insmod: No such file or directory error:

[root@394b0121e8cc ~]# flux module load job-archive dbpath=/new-dir/jobs.db
flux-module: cmb.insmod: No such file or directory

@cmoussa1, the newest fluxorama image has the change to --localstatedir=/var, so sqlite.db content-cache will be found under /var/lib/flux. You might want to process the relevant attr in your rc1 script in order to always place the job-archive db co-located with content cache.
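
A minimal sketch of what that rc1 fragment might look like, assuming the content.backing-path attribute is set as shown earlier in this thread. The helper name archive_dbpath_for and the jobs.db filename are illustrative choices, not something flux-core provides; flux getattr is the existing flux-core command for reading a broker attribute, and the dbpath= option is the one used above.

```shell
#!/bin/sh
# Sketch for an rc1 fragment. archive_dbpath_for and the jobs.db filename are
# assumptions for illustration; the attribute name and dbpath= option are
# taken from this thread.

# Given the content backing file path, print a job-archive DB path
# in the same directory.
archive_dbpath_for() {
    printf '%s/jobs.db\n' "$(dirname "$1")"
}

# Only talk to Flux when the command exists and the attribute can be read.
if command -v flux >/dev/null 2>&1 &&
   backing=$(flux getattr content.backing-path 2>/dev/null); then
    flux module load job-archive dbpath="$(archive_dbpath_for "$backing")"
fi
```

With a backing path of /var/lib/flux/content.sqlite, the helper yields /var/lib/flux/jobs.db, which matches the location used in the confirmation below.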

Just confirmed this - thanks @grondo!

sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid      id            t_submit          t_run             t_inactive        R                                                                          
----------  ------------  ----------------  ----------------  ----------------  ---------------------------------------------------------------------------
2001        817520181248  1594931850.05394  1594931850.07371  1594931850.13204  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001        799014912000  1594931848.95024  1594931848.97049  1594931849.03598  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
sqlite> .exit
[root@98a7bf4b57b6 ~]# systemctl restart flux
[root@98a7bf4b57b6 ~]# sqlite3 /var/lib/flux/jobs.db 
SQLite version 3.26.0 2018-12-01 12:34:55
Enter ".help" for usage hints.
sqlite> .mode columns
sqlite> .headers on
sqlite> select userid,id,t_submit,t_run,t_inactive,R from jobs;
userid      id            t_submit          t_run             t_inactive        R                                                                          
----------  ------------  ----------------  ----------------  ----------------  ---------------------------------------------------------------------------
2001        817520181248  1594931850.05394  1594931850.07371  1594931850.13204  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
2001        799014912000  1594931848.95024  1594931848.97049  1594931849.03598  {"version":1,"execution":{"R_lite":[{"rank":"0","children":{"core":"0"}}]}}
sqlite> 

Side question: do you think the bank/accounting database should also reside in /var/lib/flux? I think this would mean only root would be able to interact with both databases (the job-archive DB and the bank/accounting DB).

Good question. I'm not sure how the bank/accounting DBs work. The utilities need direct rw access for now?

Yes, both DBs need direct rw access, as well as ownership of the directory the database file resides in (this is a consequence of using SQLite).

Then it does seem like those DBs need to go in a different directory. Perhaps with group permissions for a new fluxadmin group? At least for now...

I think so too. The problem I've been running into when trying to place the job-archive DB in a custom location, however, is that I get a cmb.insmod: No such file or directory error, even when I create the new directory as root.

[root@844cdba731d4 ~]# mkdir /new-dir
[root@394b0121e8cc ~]# flux module load job-archive dbpath=/new-dir/jobs.db
flux-module: cmb.insmod: No such file or directory

FWIW, I tried this by running a single instance on one of the LC machines, and got the same error. Maybe I am misunderstanding the dbpath option.

I wouldn't choose /new-dir, since it isn't a directory that is going to exist on any Linux system. In the standard filesystem hierarchy, this DB probably _also_ should exist under /var/lib, so maybe /var/lib/flux-accounting? Maybe for the example docker image, the directory should have flux.fluxadmin permissions. Be sure to create the directory before you load the job-archive module.

More discussion probably needed on how this might work in a production situation.

Typically this error indicates there's a path issue, e.g. /new-dir doesn't exist. Although what you're doing above seems legit. Can you see what is in the flux broker logs?

Probably because the mkdir is being done as root, and flux runs as the flux user.
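
If that is the cause, a quick pre-flight check run as the same user as the broker should catch it before the module is loaded. SQLite creates its journal files next to the database, so the user needs write and search permission on the directory, not just on the .db file. The sketch below is a hypothetical helper, not part of flux-core; check_db_dir is a name made up here.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of flux-core): verify that the
# current user can create files in the directory that will hold the DB.
# Argument: the intended dbpath, e.g. /new-dir/jobs.db
check_db_dir() {
    dir=$(dirname "$1")
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    if [ ! -w "$dir" ] || [ ! -x "$dir" ]; then
        echo "no write access to: $dir"
        return 2
    fi
    echo "ok: $dir"
}
```

The result is only meaningful when the function runs as the flux user, since root passes the access test regardless of the directory's ownership.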
