Moby: Missing from Swarm mode: --cap-add

Created on 19 Aug 2016 · 102 comments · Source: moby/moby

Some form of --cap-add or an optional elevated-privilege system would be required for accessing GPIO pins on ARM devices. Since ARM is becoming better supported by the Docker engine, I would like to raise this for attention.

We tend to need to write to /dev/mem, and there is currently a capability for that in "regular flavoured swarm".

I would like to build out some IoT PoCs with Docker and Swarm mode, and support for this would really help. CC @DieterReuter @StefanScherer
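
For context, granting capabilities and devices is already expressible with plain docker run (and classic swarm), but docker service create has no equivalent flags; a minimal sketch, with a placeholder image name:

# plain Docker engine: grant the capability and the device explicitly
docker run --cap-add SYS_RAWIO --device /dev/mem my-gpio-image

# swarm mode: no --cap-add / --device equivalent exists
docker service create --name gpio my-gpio-image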

area/swarm kind/enhancement

Most helpful comment

FYI. This is very unofficial information, but I'll try to share what I know, because people are very eagerly asking about this feature.

Because Mirantis acquired Docker Enterprise and some Docker Inc. employees moved there, it is currently very unclear when they will be able to get the release process working again, which is why at least I don't know what the next Docker version will be or when it will be released.

However, the whole feature is implemented and works as far as I can see, so whoever wants to test it can do so by downloading the latest nightly build of the Docker engine (dockerd) from https://master.dockerproject.org and my custom build of the Docker CLI from https://github.com/olljanat/cli/releases/tag/beta1
You can also find usage examples for the CLI at https://github.com/docker/cli/pull/2199 and for Stack at https://github.com/docker/cli/pull/1940. If you find bugs in those, please leave a comment on the corresponding PR. Also note that the syntax might still change during review.

All 102 comments

Ugh, /dev/mem gives you full access to the whole memory of the machine, i.e. even more than root. Isn't there some saner API for GPIO?

I know that it's opening a can of worms. There is also /dev/i2c, which would be very useful too. There are some workarounds, but they generally involve root at some point.

Writing to /sys/class/gpio/ may also be an option, but may involve re-writing Pi libraries.

This might need more looking into, here are a couple of related links.

https://dissectionbydavid.wordpress.com/2013/10/21/raspberry-pi-using-gpio-wiringpi-without-root-sudo-access/

http://elinux.org/RPi_GPIO_Code_Samples

/sys/class/gpio seems a sane interface, and docker service create --mount src=/sys/class/gpio,dst=/sys/class/gpio,type=bind ... ought to work right now - can you test?
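
For anyone trying that bind-mount, a rough sketch of driving a pin through the sysfs interface from inside the container (pin 17 is just an example; this is the generic Linux sysfs GPIO API):

echo 17 > /sys/class/gpio/export            # expose the pin as gpio17
echo out > /sys/class/gpio/gpio17/direction # configure it as an output
echo 1 > /sys/class/gpio/gpio17/value       # drive it high
echo 17 > /sys/class/gpio/unexport          # release the pin when done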

It would be worth trying. I checked with the guys at Pimoroni and they advised against using these interfaces, citing high latency.

@alexellis
Could you please clarify your phrase "[Pimoroni] advised against using these interfaces claiming high latency"

I am currently (for a Docker demo/training) developing containers using the Pimoroni Piglow and Display-O-Tron HAT on RPis. I ran into the issue of the --privileged flag. I didn't find it very "well behaved" and preferred to show/teach the principle of limiting access.
I finally got it to work by specifying the --device /dev/i2c-1 flag.

Is this the recommended way to do it, or did I miss something?
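
For reference, the working invocation looked roughly like this (image and script names are placeholders):

docker run --device /dev/i2c-1 my-piglow-image python3 blink.py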

By the way, this is where I store my experiments https://github.com/jmMeessen/rpi-docker-images
Next step is to containerize the Display-O-Tron and integrate the exercises into a swarm (and write/publish some notes).

PS: I am using the latest Hypriot distribution, V1.1.0

This is a bit off-topic but hopefully @justincormack et al will tolerate it.

@jmMeessen the /sys/class/gpio interface allows interaction with GPIO (not I2C) and has high latency. Pimoroni have advised against it, and the majority of their devices assume full unprivileged access to memory, using memory mapping etc. with the GPIO pins.

I2C is a different scenario - a single device is often enough for interaction, i.e. /dev/i2c-1. This is how scroll-phat works, for instance.

In general you should not need to port or re-write any of their code to use it in Docker, but you will often need to run a privileged container. Privileged containers are not possible with Swarm Mode, but classic swarm will allow them to run.

So in summary: I2C may be different, but GPIO-based libraries probably need full access to memory. Sometimes the Pimoroni code just wraps existing libraries - so take each one case by case. Maybe even install the Pimoroni libraries with pip? I think they are working to port all libraries to apt-get.
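
For an I2C-only device, a Dockerfile along these lines should work (image tag and package name are assumptions based on the libraries mentioned above):

FROM arm32v6/python:3
# Pimoroni library from PyPI; talks to the HAT over /dev/i2c-1
RUN pip install scrollphat
COPY app.py .
CMD ["python3", "app.py"]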

Classic swarm: https://github.com/alexellis/datacenter-sensor

ARM Docker images (including Pimoroni):
https://github.com/alexellis/docker-arm/tree/master/images/armv6
https://github.com/alexellis/docker-arm/tree/master/images/armhf

Hypriot or Raspbian should not make a difference - they are both Debian derivatives which run the same Debian (.deb) packages from get.docker.com.

See https://github.com/docker/swarmkit/pull/1722 for initial discussion of capabilities and privileged framework for swarmkit.

While waiting for https://github.com/docker/swarmkit/pull/1722, is there a way (besides recompiling the daemon) to change the default capabilities of spawned containers (even daemon-wide) to be able to run those that require something like NET_ADMIN?

I would like to use keepalived in a swarm, which requires NET_ADMIN capability.

FWIW, Elasticsearch requires the IPC_LOCK capability, making it impossible to deploy a Swarm Mode stack with Elasticsearch until this is resolved...

Hi all! Like @sirlatrom, I tried to deploy a Swarm Mode stack with Elastic (image: "docker.elastic.co/elasticsearch/elasticsearch:5.3.0"). It fails due to the IPC_LOCK capability.

@a-jung the link is broken.

@albers Sorry for that. I just pasted the URI which I use in the compose file to pull the image from elastic.co. Their installation docs use cap_add: with - IPC_LOCK in the example file.
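
For reference, the relevant fragment of Elastic's Compose example looks roughly like this; docker-compose accepts it, but docker stack deploy ignores cap_add with a warning:

elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:5.3.0
  cap_add:
    - IPC_LOCK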

Any update on this? Are there plans to include this feature in swarm mode?

I'm currently using standalone swarm to deploy my containers with cap_add in a cloud, but I'm encountering many issues... swarm mode would ease the pain.

Please give us an ETA on this.

Thanks

+1

Coming here for Elasticsearch too... right now I added the ES node manually to the attachable swarm network... but this does not scale well ;)

Need cap_add to mount a GPU device in swarm mode.

We need NET_ADMIN for an OpenVPN server we need to add to our overlay network in swarm mode.

@waltherg I am waiting for this feature for OpenVPN as well. The workaround was to install keepalived in each container. Once (if possible) this is added, I will migrate to Swarm mode.

That sounds interesting. Could you elaborate a bit on what you did?

On 12 May 2017 00:32, "Joaquin" notifications@github.com wrote:

@waltherg https://github.com/waltherg I am waiting for this feature for
the OpenVPN as well. Work Around was to install keepalived on each
container. Once (if possible) this is added then I will migrate to Swarm
mode.


  • Install keepalived inside the docker container.
  • Did the normal config (in keepalived.conf) with 1 VIP & 2 physical IPs.
  • I use docker macvlan networking.
  • Added net.ipv4.ip_nonlocal_bind=1 (sysctls) through docker compose.
  • Put the VIP in the "local" setting of the OpenVPN conf file.

It works perfectly. Just got it working now.

The two containers will send each other VRRP through the unicast addresses specified in the keepalived conf, and if a docker container is stopped (you can test it manually) then the BACKUP server will take over the VIP and it will work normally (people will need to re-register), but the switchover only takes about 1 second.

Just a workaround until this is possibly added to a future swarm release.

@joaquin386 Would you mind sharing more details on your keepalived solution, e.g. a Dockerfile and compose file? I also tried dockerized keepalived but was unsuccessful: The floating IPs were not released on container shutdown. I used the keepalived package that comes with Ubuntu 16.04.

@albers sorry, I didn't see your post. Do you still need that info? I'm now watching the issue, since --cap_add for swarm mode really interests me.

@joaquin386 yes, I'm still very much interested. Thanks. Looks like #32981 will open new possibilities here as well, allowing global services to access the host network.

+1 for at least being able to add NET_ADMIN capability.

Any progress?

+1

Is there a time frame for this? I have to decide whether or not to drop the idea of running Docker Swarm.

Redis Cluster requires sys_resource in order to start correctly. This blocks use of Swarm mode.

@albers check this gist for the config of both keepalived instances, and also the CMD command of the Dockerfile -> https://gist.github.com/joaquin386/44293cc729f1715601b18b5c8e6fdfda
What I saw as important was:

Create a macvlan network (10.100.11.0) (from Puppet):

docker_network { "external-10.100.11.0":
  ensure  => present,
  driver  => 'macvlan',
  subnet  => "10.100.11.107/24",
  gateway => "10.100.11.1",
  options => ["macvlan_mode=bridge","parent=ens160"],
}

In the compose file, add the network:

networks:
  frontend:
    external:
      name: external-10.100.11.0

In the compose file, under the service, add:

networks:
  frontend:
    ipv4_address: 10.100.11.107
sysctls:
  - net.ipv4.ip_nonlocal_bind=1

I also have this value because I use it for OpenVPN (I do not know if this one is needed for keepalived, but for sure it is needed for OpenVPN):

cap_add:
  - NET_ADMIN

You will have an eth0 interface in your docker image which will be used for keepalived.

Please add! Needed for FUSE.

Please add! Needed for Vault.

+1 for at least being able to add the SYS_ADMIN capability. Use case: an NFS server inside a Docker container.

+1 Needed for headless Chrome/Puppeteer.

Is there an ETA yet?

+1

+1 Cannot run the million12/haproxy service in swarm mode when missing cap_add NET_ADMIN.

+1

Any update?

Is the NET_ADMIN capability in Swarm mode going to be a thing?

I'm trying to run a container that redirects traffic to a non-containerised destination using netfilter (iptables), and this container should be reachable through a Traefik swarm deployment, configured using just variables and stack definitions.

Scenario With NET_ADMIN caps :

Traefik → "host1 match" → container1_running_apache_service
Traefik → "host2 match" → container2_running_nextcloud_service
Traefik → "host3 match" → container3_with_net_admin_caps(redir to) → Non-containerised-destination

Scenario without them:

Traefik → "host1 match" → container1_running_apache_service
Traefik → "host2 match" → container2_running_nextcloud_service
Traefik → "host3 match" → container3_ANOTHER_PROXY → Non-containerised-destination

+1

OpenJDK (jmap -heap) needs SYS_PTRACE.

Need cap_add NET_ADMIN for kylemanna/openvpn in a docker stack.

dnsmasq[1]: setting capabilities failed: Operation not permitted

+1 for dnsmasq

Are these ones related to this? (or some of the use cases)
https://github.com/moby/moby/issues/24865
https://github.com/docker/swarmkit/issues/1244

For those there is a proposal in https://github.com/docker/swarmkit/issues/2682

I found out about this restriction because I tried using https://hub.docker.com/r/factual/s3-backed-ftp/ but, scrolling up, I haven't seen any activity from the core members on this issue for over a year: https://github.com/moby/moby/issues/25885#issuecomment-318605282

@trajano there is a proposal in docker/swarmkit#2682.

Comment there if it fits your needs?

EDIT: There now looks to be a suggested solution in this message:
https://github.com/moby/moby/issues/24862#issuecomment-428308152

I found a solution to the problem, and I can now also use cap_net_admin in swarm mode.
You have to modify the runtime source code to add the capabilities you need (it becomes a local default setting).
For example, I added CAP_NET_ADMIN to my runtime (nvidia-container-runtime):
wanyvic/nvidia-container-runtime.
After that, rebuild it, start a container (in swarm mode), and run:
capsh --print
CAP_NET_ADMIN can be found:

root@25303a54ebb3:/# capsh --print
Current: =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_admin,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
uid=0(root) gid=0(root) groups=

This method is not good: it also can't set cap_add or cap_drop in docker-compose.yml.
But I can't find another way to solve it.

FYI. I got bored of following this non-progressing discussion, so I started implementing it myself.

It will need multiple PRs:

  • [x] Allow giving an exact list of capabilities instead of adding to / dropping from the default ones: #38380
  • [x] Moby bump to Swarmkit
  • [x] Swarmkit side implementation docker/swarmkit#2795
  • [x] Swarmkit bump to Moby
  • [x] Another PR to moby to support Swarmkit side Capabilities setting #39173
  • [x] Swarmkit and Moby bump to docker/cli
  • [x] Client side implementation with stack docker/cli#1940
  • [x] Client side implementation without stack docker/cli#2199
  • [ ] Docs update.

So these will not be ready anytime soon but maybe before summer...

+1

+1

Any news on adding capabilities to swarm services?

+1 for NET_ADMIN

3 years and counting...

@megastef @redhog the status and schedule in https://github.com/moby/moby/issues/25885#issuecomment-447657852 are still valid. The first part of the solution ( #38380 ) will ship as part of API version 1.40 (which is released as part of Docker 19.03) and the rest of the solution will be part of API version 1.41 (whatever Docker version ends up containing it).

It needed two versions, as the old solution needed quite a big refactor (you can see the whole discussion on #38380 if you are interested).

Status update: the Swarmkit side was approved and the Moby side is waiting for review on #39173

@olljanat Cool! Thank you so much for the hard work!

#39173 was merged, so this feature will ship as part of Docker 19.06 / 19.09 (not sure of the actual version yet).

Now it's time to discuss how the CLI part should be implemented.

PR docker/cli#1940 will add support for stack files like this:

version: "3.9"
services:
  test:
    image: ollijanatuinen/capsh
    networks:
      - test
    capabilities:
      - CAP_NET_ADMIN
      - CAP_MKNOD
      - CAP_SYS_ADMIN

networks:
  test:
    driver: overlay

but the question is: is that enough, or should we also have

  • a --capabilities switch on docker service create
  • --capabilities-add and --capabilities-rm switches on docker service update

How are users actually planning to use this? Note that it is now an exact list of capabilities, so creating a service even with the default capabilities (unless you leave that option undefined) would need quite a long command.
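
To make the two options concrete, here are hypothetical invocations (the syntax was still under discussion at this point and may change during review):

# exact-list style at create time
docker service create --capabilities CAP_NET_ADMIN,CAP_MKNOD --name test ollijanatuinen/capsh

# add/remove style on update
docker service update --capabilities-add CAP_SYS_ADMIN --capabilities-rm CAP_MKNOD test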

  1. Please add --capabilities switch on docker service create.
  2. docker service update is optional in our case but feel free to go further with this also.

Great job so far.
Cheers.

Why not --cap-add like it is in containers?

On Wed, 12 Jun 2019 at 07:02, alen-z notifications@github.com wrote:


  1. Please add --capabilities switch on docker service create.
  2. docker service update is optional in our case but feel free to go
    further with this also.

Great job so far.
Cheers.


Why not --cap-add like it is in containers?

@prologic because then the switches on service update would be --cap-add-add and --cap-add-rm, which is ugly. This was mentioned in old comments/PRs and was the biggest reason why the original implementation was not approved a couple of years ago.

EDIT: link to original comment https://github.com/moby/moby/pull/26849#discussion_r80228719

One of the principles of keeping Docker Swarm simple is that everything is available in service create, service update, and stack YAML. You can expect that a feature is implemented in all three. Teams have reasons for going services-only or stacks-only, so I'd prefer not to see this diverge from the original goals. Any real-world service command is "ugly" in that it's hundreds of characters and not easily typed, but not every use case can/will use stacks.

My vote is it's in all the commands, or it'll be less useful.

Agreed, it should be in all three.

One thing to look at is;

  • do we want the short-hand options? --capabilities=all (convenient, but easy to shoot-oneself-in-the-foot)
  • how do we handle service update --cap-add=foo / --cap-rm=foo on a service that was started without --capabilities=<custom list>?

    • services without --capabilities set won't have a list of capabilities in the service-spec

    • "diffing" won't be possible, unless the CLI / client requests the active set of capabilities from the service/daemon somehow

Basically; we need to prevent this situation;

User creates a service without setting capabilities; containers will have the default set of capabilities

docker service create --name myservice busybox

User attempts to _add_ a capability

docker service update --cap-add NET_ADMIN myservice

However, the service now ends up having a single capability (NET_ADMIN, nothing else).

Alternatively (very verbose, and less convenient); implement docker service update --capabilities instead, and require the user to always provide the _exact_ list of capabilities that needs to be set, instead of providing "diffing" flags (x-add / x-rm).

  • pro: result will be clear
  • con: very verbose
  • meh: probably ok for services defined in a stack (docker-compose.yml); easy to add/remove a capability by editing the compose file; much _less_ convenient when using the CLI (docker service update); easy to make mistakes there.

Also, how would one distinguish between removing any added capabilities (with --cap-rm) vs. 'resetting' to having not specified any, which arguably should lead to the default capabilities? The same goes for --capabilities, would --capabilities "" mean no capabilities or the default set?

Also, how would one distinguish between removing any added capabilities (with --cap-rm) vs. 'resetting' to having not specified any

Possibly for the CLI we'd need (ugh) magic values (--capabilities=default / --capabilities=none).

One thing to look into is "what are the defaults" (also look at https://github.com/moby/moby/pull/39297); if defaults can be configured per daemon, the result in a Swarm situation will be unpredictable; a task deployed on one node might get different capabilities than a task deployed on another node.

My _ideal_ would be that the Service spec fully describes the service's capabilities, which means that when creating a service, its spec contains all the capabilities that it has; also in the "default" situation. This would take care of that situation (and take care of situations where the defaults are changed). In addition, if we want configurable defaults; those defaults should be specified at the SwarmKit / manager level, not per daemon (at least when deploying a service).

That would likely be a breaking change though (as in; existing services won't have those values set)

In addition, if we want configurable defaults; those defaults should be specified at the SwarmKit / manager level, not per daemon (at least when deploying a service).

That sounds like the best option, and could be implemented by the manager explicitly sending the set of capabilities along with any task, even when the default set is requested (by whatever means that is expressed).

That sounds like the best option, and could be implemented by the manager explicitly sending the set of capabilities along with any task, even when the default set is requested (by whatever means that is expressed).

That's a bit of a grey area; IIRC, there have been some discussions in the past about "altering" the create/update requests server-side. Those boiled down to: an API call to _create_ a service, followed by an API call to _inspect_ that service, should produce the same information (barring current 'state' etc.).

I commented similar things on a couple of other PRs; what would (likely) be needed is a way for the client to get the defaults from the manager/daemon, so the sequence of events would be something like:

Create a service:

  • fetch defaults
  • apply config set by user to the defaults
  • send create request to the daemon/manager

Update a service

  • fetch current service-spec
  • apply changes set by user
  • send update request to the daemon/manager

/cc @dperny

an API call to _create_ a service, followed by an API call to _inspect_ that service, should produce the same information (barring current 'state' etc.).

That definitely makes sense. It could still be achieved for the docker service commands by having them build the correct API call, since they also set e.g. replicated mode etc. even if not specified on the CLI. But obviously, existing API clients would be hit by this change, since their requests would not include a capabilities list.

So, the API would be able to distinguish them (it would be either nil (not set), or [] (set, but explicitly empty)), so likely it would be able to fallback to the old behaviour in case of nil.

So, the API would be able to distinguish them (it would be either nil (not set), or [] (set, but explicitly empty)), so likely it would be able to fallback to the old behaviour in case of nil.

That's great news, and, I would argue, supports the case for docker service create/update to add the default capabilities (however they are determined) to services in order to ensure the same set of capabilities across the swarm, whereas the API clients that do not specify a set of capabilities would get the current behaviour of using the engines' own default sets.

That's a bit of a grey area; IIRC, there have been some discussions in the past about "altering" the create/update requests server-side. Those boiled down to: an API call to _create_ a service, followed by an API call to _inspect_ that service, should produce the same information (barring current 'state' etc.).

I commented similar things on a couple of other PRs; what would (likely) be needed is a way for the client to get the defaults from the manager/daemon, so the sequence of events would be something like:

Create a service:

  • fetch defaults
  • apply config set by user to the defaults
  • send create request to the daemon/manager

Update a service

  • fetch current service-spec
  • apply changes set by user
  • send update request to the daemon/manager

There is at least my proposal to support changing defaults in swarm level on docker/swarmkit#2794

That sounds like the best option, and could be implemented by the manager explicitly sending the set of capabilities along with any task, even when the default set is requested (by whatever means that is expressed).

That can be done, but we need to verify that it does not break Windows containers.

@thaJeztah @sirlatrom I wanted to follow up with you on what the plan is for capabilities. It sounds like the plan is for the client to ask the daemon for the default capabilities and then specify the complete list of capabilities? Is that correct?

To be discussed here; currently there's no endpoint in place to pass the defaults to the client, so that would have to be implemented

@tjmehta hello, is it possible to have an ETA?

@tconrado I have no idea who @tjmehta is or why you pinged him/her here, but from my side the ETA is still the next version. 19.03 looks to be delayed, so I assume there will not be a 19.06; most probably it is 19.09 (whose code freeze is in September), and hopefully that is still released during this year.

Status update: this has been a bit on hold during the summer, but I now have the cap_add/cap_drop/privileged settings working with stack using https://github.com/docker/cli/pull/1940 PTAL and provide comments on that. I will create a separate PR to provide the docker service command flags.

@olljanat was the PR created?

@cjdcordeiro for the command line? Not yet, as someone needs to review https://github.com/docker/cli/pull/1940 first.

@olljanat What about the REST API?

@mhemrg already merged; please see https://github.com/moby/moby/issues/25885#issuecomment-447657852

Hello,
In which version of docker will it be available to be used as parameters in docker compose?
Thanks!

@olljanat Thank you so much for implementing this! I hope they're going to review the docker/cli#1940 PR soon. I'd be really happy if the command flags were implemented :)

FYI. This is very unofficial information, but I'll try to share what I know, because people are very eagerly asking about this feature.

Because Mirantis acquired Docker Enterprise and some Docker Inc. employees moved there, it is currently very unclear when they will be able to get the release process working again, which is why at least I don't know what the next Docker version will be or when it will be released.

However, the whole feature is implemented and works as far as I can see, so whoever wants to test it can do so by downloading the latest nightly build of the Docker engine (dockerd) from https://master.dockerproject.org and my custom build of the Docker CLI from https://github.com/olljanat/cli/releases/tag/beta1
You can also find usage examples for the CLI at https://github.com/docker/cli/pull/2199 and for Stack at https://github.com/docker/cli/pull/1940. If you find bugs in those, please leave a comment on the corresponding PR. Also note that the syntax might still change during review.

Hi =) Thanks @olljanat for implementing this =) I really need this feature! It would be very cool if you could post an update once you know which docker release will contain this feature, and when that will be =) Thanks for your work =) Kind regards

@olljanat I really appreciate your efforts in implementing this feature. I have followed your suggestions and everything is as it should be, except for one minor problem:
Autocompletion in the docker CLI does not suggest any of the --privileged, --cap-add or --cap-drop flags. I'm not sure if this is a bug/WIP or a misconfiguration on my side.

@information-security completion is not part of the PRs yet.
@olljanat I can take care of bash completion when your PRs are merged.

In light of the uncertainty regarding the next Docker Swarm release - a very "hack-ish" way of granting additional capabilities to Docker Swarm containers already exists. I'd rather not write a complete step-by-step guide, as this hack can (temporarily) disable your node and has some other dire consequences if not used with care, so only a brief overview follows for the daring:

  • the Docker daemon can be configured to run a custom OCI-compliant runtime when creating a new container - instead of the default runc (see the command-line options and configuration keys default-runtime, runtimes, add-runtime)
  • what if... this custom runtime was used as a proxy for the default runc, injecting additional capabilities into the container bundle configuration before passing the torch to runc? Something similar to:
#!/usr/bin/python
import json
import os
import sys

# default runc binary
runc = "/usr/bin/runc"

# capabilities to add to every container
capabilities = ["CAP_NET_ADMIN", "CAP_SYS_ADMIN"]


# adds capabilities to a bundle by extending bundle's config.json
def addCapabilities(bundle, capabilities):
    with open(bundle + "/config.json") as configFile:
        config = json.load(configFile)

    config["process"]["capabilities"]["bounding"].extend(capabilities)
    config["process"]["capabilities"]["effective"].extend(capabilities)
    config["process"]["capabilities"]["inheritable"].extend(capabilities)
    config["process"]["capabilities"]["permitted"].extend(capabilities)

    with open(bundle + "/config.json", "w") as configFile:
        json.dump(config, configFile)


def main():
    for i in range(len(sys.argv)):
        if (sys.argv[i] == "--bundle"):
            addCapabilities(sys.argv[i + 1], capabilities)
            break

    os.execv(runc, sys.argv)

main()

Again - beware, here be dragons:

  • Firstly, this will grant additional capabilities to all containers on a given node. Additional filtering based on bundle identification is highly recommended.
  • Secondly, and more importantly, no sanity checks are performed in the code above - it assumes that all Docker versions out there invoke runc with the --bundle parameter every time a container has to be created, and that the bundle's config.json is always structured according to the expected pattern. If this is not the case, the script will most likely crash, preventing any container from being created on the node in question... not to mention potential incompatibilities with manually created and started containers, etc.

You have been warned.
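
For completeness, registering such a proxy runtime boils down to a daemon.json along these lines (the script path is an example), followed by a daemon restart:

{
  "runtimes": {
    "runc-hack": { "path": "/usr/local/bin/runc-hack" }
  },
  "default-runtime": "runc-hack"
}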

Building on the answer from @akomelj (thank you so much for this!), I've expanded it slightly to better mimic privileged mode.

Looking at https://github.com/docker/swarmkit/issues/1030#issuecomment-231144514, there are more things to do, specifically regarding device mounts, and applying every capability in existence. See the code.

#!/usr/bin/python3
import json
import os
import pathlib
from typing import List

import sys

# default runc binary
NEXT_RUNC = "/usr/bin/runc"

# capabilities to add to every container
# http://man7.org/linux/man-pages/man7/capabilities.7.html
ADDITIONAL_CAPABILITIES = [
    "CAP_AUDIT_CONTROL", "CAP_AUDIT_READ", "CAP_AUDIT_WRITE", "CAP_BLOCK_SUSPEND",
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER", "CAP_FSETID",
    "CAP_IPC_LOCK", "CAP_IPC_OWNER", "CAP_KILL", "CAP_LEASE", "CAP_LINUX_IMMUTABLE",
    "CAP_MAC_ADMIN", "CAP_MAC_OVERRIDE", "CAP_MKNOD", "CAP_NET_ADMIN",
    "CAP_NET_BIND_SERVICE", "CAP_NET_BROADCAST", "CAP_NET_RAW", "CAP_SETGID",
    "CAP_SETFCAP", "CAP_SETPCAP", "CAP_SETUID", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_CHROOT", "CAP_SYS_MODULE", "CAP_SYS_NICE", "CAP_SYS_PACCT",
    "CAP_SYS_PTRACE", "CAP_SYS_RAWIO", "CAP_SYS_RESOURCE", "CAP_SYS_TIME",
    "CAP_SYS_TTY_CONFIG", "CAP_SYSLOG", "CAP_WAKE_ALARM"
]


# mimics GetDevices in
# https://github.com/opencontainers/runc/blob/master/libcontainer/devices/devices.go
def get_devices(path: pathlib.Path) -> List[pathlib.Path]:
    result = []
    children = list(path.iterdir())
    for c in children:
        if c.is_dir():
            if c.name not in ["pts", "shm", "fd", "mqueue",
                              ".lxc", ".lxd-mounts", ".udev"]:
                result.extend(get_devices(c))
        elif c.name == "console" or c.name.startswith("video"):
            continue
        else:
            result.append(c)

    result = [d for d in result
              if d.exists() and (d.is_block_device() or d.is_char_device())]

    return result


# adds capabilities and devices to a bundle by extending its config.json
def add_capabilities(bundle, capabilities):
    with open(bundle + "/config.json") as config_file:
        config = json.load(config_file)

    config["process"]["capabilities"]["bounding"].extend(capabilities)
    config["process"]["capabilities"]["effective"].extend(capabilities)
    config["process"]["capabilities"]["inheritable"].extend(capabilities)
    config["process"]["capabilities"]["permitted"].extend(capabilities)

    for c in config["linux"]["resources"]["devices"]:
        c["allow"] = True

    # mimics WithDevices in
    # https://github.com/moby/moby/blob/master/daemon/oci_linux.go
    device_paths = get_devices(pathlib.Path("/dev/"))
    config["linux"]["devices"] = [
        {
            "type": "c",
            "path": str(d),
            "minor": os.minor(os.stat(str(d.resolve())).st_rdev),
            "access": "rwm",
            "allow": True,
            "major": os.major(os.stat(str(d.resolve())).st_rdev),
            "uid": 0,
            "gid": 0,
            "filemode": 777
        }
        for d in device_paths
    ]

    with open(bundle + "/config.json", "w") as config_file:
        json.dump(config, config_file)

    with open("/tmp/runcdebug.json", "w") as debug_file:
        json.dump(config, debug_file)


def main():
    for i in range(len(sys.argv)):
        if sys.argv[i] == "--bundle":
            bundle_filename = sys.argv[i + 1]
            add_capabilities(bundle_filename, ADDITIONAL_CAPABILITIES)
            break

    os.execv(NEXT_RUNC, sys.argv)


if __name__ == '__main__':
    main()


To apply changes, do the following:

#!/bin/sh

set -e
set -u

# runc-hack.py is the above spaghetti
cp runc-hack.py /root/runc-hack
chmod u+x /root/runc-hack

cp /etc/docker/daemon.json /etc/docker/daemon.json.old || true
if [ -f /etc/docker/daemon.json ];
    then cat /etc/docker/daemon.json
    else echo "{}"
fi \
    | jq '.+ {"runtimes": {"runc-hack": {"path": "/root/runc-hack"}},
"default-runtime": "runc-hack"}' \
    | tee /etc/docker/daemon.json.new
mv /etc/docker/daemon.json.new /etc/docker/daemon.json

systemctl daemon-reload
systemctl restart docker

Verified on a Swarm worker with Engine 19.03.1 on Debian 9; the master did not have this _fix_ applied.

This is still a huge hack and it Works For Me(tm). Don't use it irresponsibly. It was the least bad solution to my problem and I feel very dirty using it. But hey, it's up to everyone to decide for themselves.

edit@2019-12-30: a _slight_ mis-scripting in the deployment section
edit@2020-01-10: add failing on error to the deployment script

@sstanovnik in your script, the line mv /etc/docker/daemon.json.new /etc/docker/daemon.json cannot find the file daemon.json.new. What can I do?

You very likely weren't running the script as a superuser, so the command above that (tee specifically) wasn't able to create the file. Run the script as superuser and you shouldn't have a problem.

My fault for letting the script continue on error. Editing to add

set -e
set -u

makes the script exit on the first error (and treat undefined variables as errors).

You very likely weren't running the script as a superuser, so the command above that (tee specifically) wasn't able to create the file. Run the script as superuser and you shouldn't have a problem.

My fault for letting the script continue on error. Editing to add

set -e
set -u

makes the script exit on the first error (and treat undefined variables as errors).

I log in as root, but my system is CentOS 7, not Debian 9. Is that the reason?

You very likely weren't running the script as a superuser, so the command above that (tee specifically) wasn't able to create the file. Run the script as superuser and you shouldn't have a problem.
My fault for letting the script continue on error. Editing to add

set -e
set -u

makes the script exit on the first error (and treat undefined variables as errors).

I log in as root, but my system is CentOS 7, not Debian 9. Is that the reason?

What I did:

  1. Write a .py file named runc-hack.py and copy the first Python code into it.
  2. Write a sh file named hack.sh and copy the sh code into it.
  3. Put the two files in a dir.
  4. chmod +x hack.sh
  5. ./hack.sh

Then I get the error:

mv: cannot stat '/etc/docker/daemon.json.new': No such file or directory

Depending on the use case, a workaround is to bind-mount /var/run/docker.sock from the swarm host(s) into the service, then run docker run --privileged ... or docker run --cap-add ... from within the service to execute your actual privileged commands. (You'll have to install the docker CLI in the image for the service.) The innermost container that you docker run in this way will have the privileges/capabilities of the swarm host rather than of the service, and the service just becomes a thin container layer.

My use case was a Jenkins agent swarm cloud (see https://github.com/jenkinsci/docker-swarm-plugin/issues/58), and I already had the host's /var/run/docker.sock bind-mounted onto the service for doing things like docker stack deploy ..., so this was a natural workaround for running commands in a Jenkins build that required capabilities (like mounting an NFS drive for deployment).
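
A minimal sketch of the pattern (image names are placeholders; the launcher image only needs the docker CLI):

docker service create --name launcher \
  --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  docker:latest \
  docker run --rm --cap-add NET_ADMIN my-privileged-image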

@arseniybanayev, oooh, that's neat!

Just to verify: I assume that you start your child container in foreground mode. When Docker Swarm terminates your launcher container, your child container gets terminated as well, right? I know it should work that way, but I just want to check before I hack my Swarm to pieces.

@akomelj that's right; the service executes docker run without the background option -d, so if the service is terminated then the innermost container should also be terminated. I haven't tried this myself, but it should be easy to test.

@arseniybanayev thanks for replying, and an excellent solution to this problem.

I actually had to test this, as I'm dying to get rid of the hacked runc provisioning on my Swarm, and it works flawlessly! I created a general-purpose lightweight image from docker:latest - this image simply spins up a new container based on passed-in environment variables.

In case anyone tries the same route - here are the Dockerfile and entrypoint.sh script for building your own launcher image. Admittedly, the launch could be done with a single environment variable, but I wanted to split the configuration of the child container across multiple variables just for clarity. Both files should be self-explanatory.

Dockerfile:

# official Docker (CLI) image
FROM docker:latest

# launch parameters
ENV LAUNCH_IMAGE            hello-world
ENV LAUNCH_PULL             false
ENV LAUNCH_CONTAINER_NAME=  
ENV LAUNCH_PRIVILEGED       false
ENV LAUNCH_INTERACTIVE      false
ENV LAUNCH_TTY              false
ENV LAUNCH_HOST_NETWORK     false
ENV LAUNCH_ENVIRONMENT=
ENV LAUNCH_VOLUMES=
ENV LAUNCH_EXTRA_ARGS=

# add entrypoint.sh launcher script
ADD entrypoint.sh   /

# run the image
ENTRYPOINT /entrypoint.sh

entrypoint.sh:

#!/bin/sh
# pull latest image version
if [ "$LAUNCH_PULL" = true ]; then
    echo "Pulling $LAUNCH_IMAGE: docker pull $LAUNCH_IMAGE"
    docker pull $LAUNCH_IMAGE
fi

# build launch parameters
DOCKER_ARGS="run --rm"
[ -n "$LAUNCH_CONTAINER_NAME" ] && DOCKER_ARGS="$DOCKER_ARGS --name $LAUNCH_CONTAINER_NAME"
[ "$LAUNCH_PRIVILEGED" = true ] && DOCKER_ARGS="$DOCKER_ARGS --privileged"
[ "$LAUNCH_INTERACTIVE" = true ] && DOCKER_ARGS="$DOCKER_ARGS -i"
[ "$LAUNCH_TTY" = true ] && DOCKER_ARGS="$DOCKER_ARGS -t"
[ "$LAUNCH_HOST_NETWORK" = true ] && DOCKER_ARGS="$DOCKER_ARGS --net host"
[ "$LAUNCH_PRIVILEGED" = true ] && DOCKER_ARGS="$DOCKER_ARGS --privileged"
DOCKER_ARGS="$DOCKER_ARGS $LAUNCH_ENVIRONMENT $LAUNCH_VOLUMES $LAUNCH_EXTRA_ARGS $LAUNCH_IMAGE"

echo "Running $LAUNCH_IMAGE: exec docker $DOCKER_ARGS"
exec docker $DOCKER_ARGS

And here are the relevant Stack parts using launcher image from above to launch another container.

version: "3.5"

services:
  gate:
    image: registry.aember.com:5000/aember/swarm-launcher:latest

    environment:
      LAUNCH_IMAGE: registry.aember.com:5000/sh-btq-gate:latest
      LAUNCH_PULL: "true"
      LAUNCH_PRIVILEGED: "true"
      LAUNCH_HOST_NETWORK: "true"
      LAUNCH_ENVIRONMENT: "--env INSTANCE={{.Node.Hostname}}"
      LAUNCH_VOLUMES: "-v /var/run/btq.json:/btq.json -v /docker/data/btq:/var/run/btq -v /etc/localtime:/etc/localtime:ro"

    volumes:
      - /var/run/docker.sock:/var/run/docker.sock

@akomelj This works great; the only problem I have is that when using private registries, docker pull $LAUNCH_IMAGE fails to authenticate.
This happens despite passing --with-registry-auth to docker stack deploy, which only authenticates the launcher service to download its launcher image, leaving the application image unreachable.

The behavior doesn't make sense if the auth data were stored with the daemon (since it is shared), so I guess it is stored with the docker client?

@dorintt, I'm not an authority on Swarm, but I'd guess that auth data is stored with the daemon in the Swarm raft data and access to it is limited - otherwise any docker client on a Swarm worker node could access it, which is probably not a good idea.

What you can do is create a Swarm config containing a Docker config.json with login credentials for your private registry (perform docker login on some host and use $HOME/.docker/config.json as the contents of your Swarm config). You can add this config to your Docker launcher stack and change the launcher script to create a symbolic link from the mounted config file to $HOME/.docker/config.json. This, in theory, should allow docker pull to authenticate to the registry and pull the image. A rough sketch follows.
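
Untested, but the mechanics would be roughly as follows (the config name is arbitrary):

# on a host where docker login has already been performed
docker config create registry-auth $HOME/.docker/config.json

# mount it where the docker CLI in the launcher looks for credentials
docker service create \
  --config src=registry-auth,target=/root/.docker/config.json \
  --mount type=bind,src=/var/run/docker.sock,dst=/var/run/docker.sock \
  my-launcher-image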


You can also expand the entrypoint.sh file and add the following before # pull latest image version:

# does a docker login first
if [ -n "${LOGIN_USER}" ] && [ -n "${LOGIN_PASSWORD}" ]; then
  echo "Logging in"
  echo "${LOGIN_PASSWORD}" | docker login -u "${LOGIN_USER}" --password-stdin ${LOGIN_REGISTRY}
fi

For convenience, I've packed everything in a docker repository here: ixdotai/swarm-launcher

Guys, I am really trying to follow you on this but I'm unable to, so I'm asking if you could help, please; maybe @tlex or @akomelj.

What I have, as probably most of us discussed here, is a container that I need to run with cap-add=NET_ADMIN and devices=/dev/net/tun:/dev/net/tun (this is required for bringing up an OpenVPN connection from a docker worker container) OR it also works without these flags but with --privileged.

My ready-to-work images are on the nodes under the name "dvv".

This is what worked when I was establishing the OpenVPN connection externally:

sudo docker service create -e access_token=something --mode global --name "DVV" dvv

Now I want to move the connection inside the container. I've done it; but as I need to run all this in swarm, and swarm obviously does not support the higher privileges, I am trying to understand how to do it with either a YAML docker-compose file or a single command. I don't have a lot of experience with docker service creation. I am trying the following:

sudo docker service create -e LAUNCH_IMAGE=dvv -e LAUNCH_PRIVILEGED="true" -e LAUNCH_ENVIRONMENTS="access_token=something" ixdotai/swarm-launcher:dev-master

But it does not seem to work... I think it works if I run it manually with docker run -v /var/run/docker.sock:/var/run/docker.sock ..., but -v, yet again, is not supported by services... Can you please guide me through this situation, on how exactly to run a privileged container via this wrapper? Consider that I haven't built a lot of docker services :D

Thanks

@sxiii The idea behind the image is to create (and launch) a service inside the stack. So the best way to do it would be to create a compose file as in the first example in the README.md.

Since the question isn't related to docker in general but to the image in particular, I suggest you open an issue in ix-ai/swarm-launcher for it.

Hi, is there documentation on why the --cap-add argument is not implemented yet in Docker Swarm?

I see there is a "hack-ish" solution to this, but it leaves no explanation of the possible security flaws.

To be more specific: the --cap-add NET_ADMIN issue, for modifying iptables.

@kidfrom the full implementation is already included in the Docker engine codebase and it will be part of the next version. CLI support is also implemented, but unfortunately those PRs have been open for almost a year, waiting for one of the maintainers to finalize review. You can see the last status message at https://github.com/moby/moby/issues/25885#issuecomment-557790402
