Describe the bug
This issue aggregates one "old" and several new issues with ACME in 20.03 and unstable. Since some of them are dependent on each other to be of concern, I write them all in the same issue.
man systemd.timer
for confirmation about this point, or check systemd logs after one day to confirm that the service was indeed never restarted)lego renew || lego run
where "run" would be the one to run when the list changes)Poke: @aanderse @arianvp @Mic92 @m1cr0man @emilazy @flokli since you were concerned with the previous issue about acme.
Issue 2 was introduced in https://github.com/NixOS/nixpkgs/pull/72056 to fix https://github.com/NixOS/nixpkgs/issues/48845
However it seems like a bit of a workaround to me that seems to be more of a bug in switch-to-configuration.pl
which doesn't restart oneshots
properly so maybe the cure was worse than the harm and we should try fixing it another way.
Shouldn’t the solution be:
I don’t see how any other configuration could work, if ACME runs as regular user you would need to give world-writable rights to .well-known/challenge (in case two jobs with different users would run simultaneously, which happens for instance when you define two new certificates during deployment).
In any case the configuration in tempfiles cannot work if you have more than one user : it will try to create .well-known/challenge as both user, but only one of them will ever win...
Another possibility would be to use groups and force every "user" using certificates to belong to this group. (EDIT: this would not allow to assign a different group to the resulting certificate, not good)
About #48845, the way I see it is to have a acme-foo-setrights service whose only job would be to set rights on the final destination (and would have Wants/After acme-foo + WantedBy multi-user.target). This one would duplicate the last part of acme-foo’s job, but as it’s idempotent we don’t care much, and it could have the RemainAfterExit which would become harmless too in this situation (it would only be useful during deployment, which would anyway restart it regardless of this flag)
What do you think?
I may do the work to implement that if you agree with the ideas
The lego client should definitely not be run as root, that's a huge attack surface that has no need for elevated privileges beyond reading and writing out certificates.
Currently if you're using the nginx module with HTTP-01 challenges the certificates are owned as nginx:nginx
and lego runs as nginx
; I personally think it would be best to run as an acme
account with corresponding group, have the certificates owned by acme:acme
, turn on allowKeysForGroup
by default, and put nginx
in the acme
group, to reduce the privilege it runs with further and take away nginx's write access to the certificates.
Not sure if we can assume ACL support, but if we do, we might also be able to just use setfacl
after copying files into the final location to grant read access to the certificates for the users configured for each certificate, without having to require them to be in certain groups.
I think we should probably revert https://github.com/NixOS/nixpkgs/pull/72056, and see if we can fix it differently.
I just uncovered a new bug in Lego which might be blocking, but it might be a bug for upstream rather: when we change the domains list, it doesn’t regenerate the certificate...
(It might be due to the fact that the script runs lego renew || lego run
where "run" would be the one to run at that stage)
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/go-no-go-meeting-nixos-20-03-markhor/6495/25
(moved from #85152)
@m1cr0man I think a rewrite of the module would be great, as long as compatibility with the basic configuration options was kept; there's a fair bit held over from simp_le that no longer makes sense, and a lot of nits with the current implementation. I know @yegortimoshenko was working on a refactor at some point; maybe they'd be able to share WIP code?
For what it's worth, one possible way to fix the various renewal bugs in one shot would be to derive the certificate subdirectory from a hash of its configuration like /var/lib/acme/.lego/$domain-$hash
; that way, any changes to the OCSP Must-Staple configuration, alternate domains, etc. would change the underlying internal path to the certificate and force a new one to be issued. We'd still need explicit acme-*-{force,revoke}.service
or similar that users could run manually for edge-cases, but that should ensure that the 90% case is handled out of the box without hacks to manually check the configuration differences and remove certificates. (The implementation of force renewal will still probably involve manually removing certificates from underneath lego since I think it doesn't implement this directly; it'd be best to just rename the directory to $domain-$hash-old-$timestamp
or similar so that people can still revoke them if necessary.)
Of course, in an ideal world, lego would be able to handle forcing renewals on certificate configuration changes itself, but I'm not sure it tracks enough state currently to be able to do this.
I just opened https://github.com/NixOS/nixpkgs/pull/85223
I'm not sure if we're doing ourselves a favour by having this meta issue - maybe the individual issues should be put into separate issues…
How about a project board?
@flokli : merged them all because of inte-dependencies between them, feel free to cut it in pieces :)
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
I agree we should revert https://github.com/NixOS/nixpkgs/pull/72056 now that I come to think of it doesn't it actually completely break renewal? not only activation?! as the timer
will never be able to re-fire the unit
Edit: Reverting PRs:
Could other people jump in to fix the last two issues? @immae perhaps? I don't have enough time to work on all of them and I don't want to be blocking the 20.03 release
Sure
The third checkbox is the easy one:
By writing the PR I found one more issue that could be problematic...
# Test that existing cert is older than new cert
KEY=${spath}/certificates/${keyName}.key
if [ -e $KEY -a $KEY -nt key.pem ]; then
(...)
fi
In the simp_le age, the key.pem file was reused at each run. It’s not the case in lego, unless we use the --reuse-key
.
I’m not concerned with this change but I know this can have a big consequence for people who use key pinning (I’m not sure of the name, I’m thinking of the feature where the web server says to the browser "My public key is _Foo_, if I give you any certificate with a different public key in the next year be worried").
Maybe we should add something about it in the documentation? (note that since there is no migration from switching from simp_le to lego people who use that will already have issues about this feature)
Web browsers do not support this feature, but it's often used in native apps like banking apps indeed.
I think a note for now should be sufficient, but exposing this --reuse-key
feature in the module sounds like a good idea to me
Note: I don’t think the remaining issues should be considered blocking for the release of 20.03, provided we have proper documentations that acme certificates is not an ideal world: If I understand correctly, the remaining issues are either just annoying or wontfix (regarding migration scripts from old setup)
Should I propose a PR with --reuse-key @arianvp ?
EDIT:
I had a look at the last checkbox (changing the domains list, or as @emilazy said it can concern other options too), lego officially doesn’t support it (PR welcome though: https://github.com/go-acme/lego/issues/507 the issue is slightly different but raises the same concern).
I guess a way to solve it simply would be to store the whole command line in /var/lib/acme/my_cert/command_line, compare the current one with the previous one and switch "renew" to "run" in case it changed. However, the flags like reuse-key cannot be used in "run" so we’re back with the issue of key pinning...
See my previous comment for a simple way to address renewing certificates on changed settings without any complicated hacks or manual comparisons.
See my previous comment for a simple way to address renewing certificates on changed settings without any complicated hacks or manual comparisons.
Ah I just reread it and now I understand it better, sorry. Ok I’ll try to implement that.
See my previous comment for a simple way to address renewing certificates on changed settings without any complicated hacks or manual comparisons.
@emilazy : I implemented that, but I’m not satisfied, because it breaks the reusePrivateKey option. I had a look at the code from lego, and the only way I found to honor this option is to have a /var/lib/acme/.lego/${certname}
(so no hash), and a conditional run like this (pseudo bash code):
if (no existing dir); then
lego run ...
elif (important parameter changed); then
lego renew --days 0 ...
else
lego renew --days ${cfg.validMinDays}
fi
The reason being that lego run
will always recreate a new private key and all else, regardless of the content of the directory.
If I use a hashed key like you suggested, then it will always be empty at the first run. I can trick and copy the files from the old location, but I feel a scenario like above where I compute the hash to determine the (important parameter changed)
and store it in the state directory would be more clean.
Does that sound good to you?
I opened #85534 which is supposedly relevant.
@immae could you update your bullet list with what is fixed please?
Done @arianvp . I don’t know how to handle the last one yet, I made a sketch of a proposition above but got no answer so I don’t know what to do
(and the reuse key option wasn’t merged finally, maybe I should reopen an issue for that one before it gets lost?)
@immae There's also another solution for the last issue, though maybe a bit hacky, which would be to use 999
(or even a larger value) for --days
when we detect that the list of domain names (and other things, such as the contact address) has changed, either by comparing the command line (stripped of the validMinDays
flag, of course), or by reading the list of alternate names from the certificate file. This way we can keep using renew
and allow --reuse-key
but still be able to force a renewal. I don't know if that's better than your proposed solution.
I think that last issue is a serious one, as it breaks user expectations : I don't remember this happening with simp_le
. I have a fixed version of the module on my servers, and I can submit a PR with these changes if it's accepted as a good way to do it.
lego
is supposed to implement a combined renew/run command, which should make all of this easier, but I don't know when it will be done.
@pstch : the standard way of lego to say "infinite" is to put 0 for --days
, and it was my suggestion indeed :)
Except that in my implementation I store a hash of the relevant arguments in the .lego folder, and whenever the hash changes it means that the certificate needs to be regenerated. I’m happy with that implementation as far as I am concerned, but it’s just implementation details: the "big" part in the issue is deciding when you need to force renew and when you don’t.
I wrote a (big) change to acme handling. Its current state is here in my repository: https://git.immae.eu/?p=perso/Immae/Config/Nix.git;a=blob;f=modules/private/certificates.nix;hb=HEAD#l76
It should solve most issues reported here and there (note that contrary to my previous comment the --days 0
flag doesn’t work, I didn’t recheck why but either my memory faults me or it’s an unreleased recent change upstream)
advantages:
issues:
I may create PR for some of the issues if you’re happy with that change.
For people tracking this, I've been taking a look at what @immae has done, what emilazy has shared, and my own work to rewrite the module. It's going well, just slowed down during the week by work. I will open a PR this weekend. I've resolved most of the issues immae highlighted above except for the backwards compatibility, although adding a migration script would be possible.
There's a number of design decisions I've made that will be open for discussion on the PR, and some other issues I have thought of in the process of rewriting it.
How are people finding the new module? :slightly_smiling_face: I addressed everything that was mentioned in this issue, and since the merge there's only been a small bit of feedback all of which was external to the module itself (either lego or LE related). I'll give it a month and then close this ticket.
I'd say no news / new bug reports are good news :-)
Thanks a lot for all the work that has been done - I think we're in a wayy better situation than before :heart:
Thanks a lot for your work @m1cr0man ! I've been using the new module for three weeks now, mostly without any problem. I think the changes are really good. However, I've find a (minor) problem : switching to wildcard certificates fails because the old domain names are still requested, and the request fails because although these names are already validated, LE refuses certificate certificate requests with both a wildcard and subdomains.
Nov 08 19:54:49 example acme-example.com-start[32187]: 2020/11/08 19:54:49 acme: error: 400 :: POST :: https://acme-staging-v02.api.letsencrypt.org/acme/new-order :: urn:ietf:params:acme:error:malformed :: Error creating new order :: Domain name "sub.example.com" is redundant with a wildcard domain in the same request. Remove one or the other from the certificate request., url:
Using systemctl clean acme-example.com.service
fixed the problem.
Most helpful comment
I just uncovered a new bug in Lego which might be blocking, but it might be a bug for upstream rather: when we change the domains list, it doesn’t regenerate the certificate...
(It might be due to the fact that the script runs
lego renew || lego run
where "run" would be the one to run at that stage)