Kubernetes: rlimit support

Created on 18 Jan 2015 · 122 comments · Source: kubernetes/kubernetes

https://github.com/docker/docker/issues/4717#issuecomment-70373299

Now that this is in, we should define how we want to use it.

area/isolation kind/feature priority/important-soon sig/node

Most helpful comment

Instead of everyone having their own hacks and custom entrypoints just to set rlimit, I would also like to +1 allowing users to set it through the k8s API. Is this being worked on at all?

All 122 comments

/cc @vishh @rjnagal @vmarmol

We can set a sane default for now. Do we want this to be exposed as a knob in the spec, or do we prefer a low/high toggle? The only advantage of a toggle is that we can possibly avoid too many jobs with high values landing on the same machine.


Are there any downsides to setting high limits by default these days? I
can't keep straight what bugs we have fixed internally that might not have
been accepted upstream, especially regarding things like memcg accounting
of kernel structs.


+1 to toggle, putting it in the spec is overkill IMO.

If there is a toggle for "few" vs "many" everyone will choose "many". We
need to understand and document why "few" is the better choice most of the
time, and think about how to restrict who uses "many".


Kernel memory accounting seems to be disabled in our container vm image.
The overall fd limit might also be a factor to consider. Given these
constraints, providing a toggle option makes sense.


One way to restrict "many" would be to take the global machine limits into account and use them in scheduling.
I don't think we have or are planning to add user-based capabilities.


For our project we would use both "few" and "many". The lower limit would be for our worker containers (stateless) and the high limit would be for our storage containers (stateful).

+1 to toggle, but what exactly do "few" and "many" mean?
Also, what are the implications for scheduling?

I don't think few and many are useful categorizations. I also disagree with the stateless vs. storage distinction. Many frontends need lots of fds for sockets.

I would assume that we would at best only do minimal checks in the scheduler, as these resources would be highly overcommitted. We can have an admission check on the node side to reject pod requests, or inform the scheduler when it's running low - more of an out-of-resource model.

For the "large" and "few" values, we can start with the typical Linux max for the resource as "large", and the typical default as "few".

@bgrant0607 what kind of model did you have in mind for representing these as resources?

I don't know that we need to track these values in the scheduler. They are more for DoS prevention than allocating a finite resource.

I'm skeptical that "large" and "few" are adequate, because the lack of numerical values would make it difficult for users to predict what category they should request, and the choice might not even be portable and/or stable over time. Do you think users wouldn't know how many file descriptors to request, for example? That seems like it can be computed with simple arithmetic based on the number of clients one wants to support, for instance.

What's the downside of just exposing numerical parameters?

I agree we should choose reasonable, modest defaults.
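
For concreteness, a numeric knob might have looked something like this in the pod spec (a purely hypothetical sketch; the ulimits field below is invented for illustration and has never been part of the API):

containers:
- name: web
  ulimits:
  - name: nofile
    soft: 65536
    hard: 65536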

"small" and "large" feels clunky, but I like the ability for site admins to define a few grades of service and then let users choose. I think it works pretty well internally - at least most users survive with the default of "small".


Re. admin-defined policies, see https://github.com/docker/docker/issues/11187

@bgrant0607: Is this something that we can consider for v1.1?

We can, but I'll have 0 bandwidth to think about it for the next month, probably.

cc @erictune

The Docker rlimit feature is process-based, not cgroup-based (of course, the upstream kernel doesn't have an rlimit cgroup yet). This means:

  • The limit is applied to the container's root process, and all child processes inherit it
  • There is no control over how many child processes are created
  • Processes started by docker exec do not inherit the same limit

Based on the above, I don't think this is a very useful feature, or at least, not an easy-to-use feature to specify and manage.
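
For what it's worth, those semantics can be observed with the docker CLI directly (a minimal sketch, assuming a local docker daemon; the container name is arbitrary and exact exec behavior varies by Docker version):

docker run --rm --ulimit nofile=2048:2048 busybox sh -c 'ulimit -n'
# prints 2048: the root process and its children get the limit

docker run -d --name ulimit-demo --ulimit nofile=2048:2048 busybox sleep 300
docker exec ulimit-demo sh -c 'ulimit -n'
# may print the daemon default rather than 2048: exec'd processes
# are not children of the container's root process
docker rm -f ulimit-demo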

Where does this stand? Is it available through any config?

@shaylevi2: Look at the previous comment. Docker's current implementation isn't what we need.

I keep getting asked questions in this space, so maybe a pointer to documentation somewhere explaining why it's not supported yet would be useful. Typing that, I admit it is odd to write something about why we don't do something, but @dchen1107's earlier comment is helpful.

@thockin @bgrant0607 looks like this is not going to be addressed? This is definitely an issue with both elasticsearch and cassandra running on K8s. What is the recommended best practice to get RLIMIT_MEMLOCK happy?

cc @kubernetes/sig-node-feature-requests

Do those applications also require huge pages?


Yep Cassandra and Elastic - jemalloc for Cassandra

Redis as well uses jemalloc

So it's fair to say you need both features supported on the node?


:) Big want :)

JVM huge pages with jemalloc, and Redis is in C with jemalloc

Does privileged mode allow for that now? I think we are actually at the kernel / docker level, not k8s

@rjnagal Is there a more recent kernel limit cgroup RFC?

http://thread.gmane.org/gmane.linux.kernel.cgroups/12687

What is the status of this?
To run a high-traffic nginx site I need to increase the number of open files.
By default the value is:

ulimit -n
4096

The nginx configuration to increase the number of open files is

worker_rlimit_nofile 10240;

However on startup, I'm seeing this in error.log

[alert] 78#78: setrlimit(RLIMIT_NOFILE, 10240) failed (1: Operation not permitted)

When starting docker, I would set the ulimit to resolve this problem, e.g.

--ulimit nofile=262144:262144 

How can I do this in kubernetes?

@mingfang have you tried privileged mode?

@chrislovecnm the hugepages design doc is being worked on by @sjenning; perhaps you can elaborate on use cases for ES and Cassandra... those are very relevant for RH as well since we use them in products.

He and I were discussing ulimits yesterday wrt memlock, and also that we might need to namespace vm.hugetlb_shm_group in the kernel. Very similar use cases. It makes sense to collapse all those goals into one design doc and phase the implementation accordingly.

I'm having trouble understanding why the comment by @dchen1107 is a showstopper... perfect enemy of good? ulimits cover a broad range of needs, from running a basic webserver to high performance computing. Privileged mode is not an option in many environments, and I think a lot of people would prefer to use kubernetes in those environments, even if it comes with certain caveats.

@tmandry what features do you need?

@tmandry - the problem is that per-process limits are not very useful for isolation purposes. There are some uses, I admit, but every facet of the API has a cost to develop, test, maintain, document, and comprehend, so we need to balance that. Which rlimits are you interested in?


In my opinion, regardless of whether it is useful to set up a per-process limit, there is a fact: many people need to increase the file descriptor or other rlimits in order for their processes to work in some scenarios. Examples would be servers or proxies with many concurrent connections (this is the situation for me and other people who have already commented here).

I am in favor of letting the user choose custom values. I agree with @bgrant0607's comment about a predefined set of levels not being portable, and I also think there is nothing against letting a user customize the numeric values (if a user wants to increase a file descriptor limit, he wants it because he knows what it means). I also see it as easier to implement in Kubernetes than having to define those levels, configure them, and so on.

Furthermore, from my perspective, a user is likely to expect to be able to tune the rlimits of his process in a container as he does in a VM. This degree of control is what custom values would achieve.

Another thing that may be interesting to consider: I think it would make sense to be able to restrict the rlimits a user can set via Pod Security Policies.

I said previously "a user is likely to expect to be able to tune the rlimits of his process in a container as he does in a VM". But, on the other hand, as a cluster admin, I would like to give the user that freedom while still being able to set reasonable limits.

I need this feature to be a workaround for the issue
https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1350

In docker run there is --ulimit.
Can I do this in kubernetes now?

Agreed on the fd limit, that's something I've run into a few times. In my case, the ability to pin cores by raising the rtprio ulimit is also important.

Fair enough. Raising a limit is different than isolating a resource.
rlimit is more appropriate for one than the other.


This issue currently makes it impossible to run Elasticsearch on Kubernetes without performance degradation. 😰

[WARN ][bootstrap] Unable to lock JVM Memory: error=12,reason=Out of memory
[WARN ][bootstrap] This can result in part of the JVM being swapped out.
[WARN ][bootstrap] Increase RLIMIT_MEMLOCK, soft limit: 65536, hard limit: 65536

We've kinda gotten around this by editing the ENTRYPOINT script available in the official Elasticsearch Docker image.

Setting this at the top ...

#!/bin/bash

set -e

echo '1: Before ulimit'
ulimit -l
ulimit -l unlimited
ulimit -l
echo '2. After ulimit'

... gives us ...

2017-01-13T09:00:36.117522472Z 1: Before ulimit
2017-01-13T09:00:36.117551488Z 64
2017-01-13T09:00:36.117569173Z unlimited
2017-01-13T09:00:36.117574588Z 2. After ulimit

Ohh man, I didn't realize this was still open.

If folks want to self host a larger cluster this is paramount to fix.

/cc @jbeda @luxas @vishh @philips

If folks want to self host a larger cluster this is paramount to fix.

Well, we did some tinkering and we just override it with docker's config :-/
So... there is an escape hatch.

@timothysc what exactly would i need to override?

@itskingori the official Elasticsearch Docker image has bin/es-docker running under the user elasticsearch, which has no privilege to run the ulimit command; how did you achieve that?

@hiscal2015 By giving the container some extra capabilities ... like this ... 👇

---
apiVersion: apps/v1beta1
kind: StatefulSet
spec:
  template:
    spec:
      containers:
        - name: elasticsearch
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
                - SYS_RESOURCE

@itskingori Still no luck, here's my Dockerfile:

FROM docker.elastic.co/elasticsearch/elasticsearch:5.2.0
ADD elasticsearch.yml /usr/share/elasticsearch/config/
USER root
RUN chown elasticsearch:elasticsearch config/elasticsearch.yml
RUN sed -i '2i\set -e\' /usr/share/elasticsearch/bin/es-docker
RUN sed -i '3i\ulimit -l unlimited\' /usr/share/elasticsearch/bin/es-docker
USER elasticsearch

And here's my yaml:

    spec:
      containers:
      - name: es-master
        securityContext:
          privileged: true
          capabilities:
            add:
              - IPC_LOCK
              - SYS_RESOURCE

And here's the error msg:

3/20/2017 7:44:14 PMbin/es-docker: line 3: ulimit: max locked memory: cannot modify limit: Operation not permitted

Am I missing something?

@hiscal2015 Could you please edit your comment with "triple ticks" to format the pasted files? Like this:

``` dockerfile
[ dockerfile goes here ]
```

``` yaml
[ yaml goes here ]
```

It will help everyone. Thanks!

@hiscal2015 I'm not sure what you're trying to do in your Dockerfile ... it seems you're trying to do things during the build phase. I can't follow your idea through to an approach that could work.

I'm just gonna share what we did ... we overrode the image ENTRYPOINT (which was docker-entrypoint.sh) to run what we want first. Like this ...

FROM elasticsearch:2.4.3-alpine

COPY configs/elasticsearch.yml /usr/share/elasticsearch/config/elasticsearch.yml
COPY custom-entrypoint.sh /custom-entrypoint.sh

VOLUME /usr/share/elasticsearch/data

ENTRYPOINT ["/custom-entrypoint.sh"]
CMD ["elasticsearch"]

Where custom-entrypoint.sh is ...

#!/bin/bash

set -e

ulimit -l unlimited

exec /docker-entrypoint.sh "$@"

The idea is that custom-entrypoint.sh will run when the container starts, which means ulimit -l unlimited will run before the command that starts Elasticsearch (which is what we want).

When we started trying to run ES 5.0 we hit this issue as well. We're running an init-container to prep the node. I was unable to get ulimit inside a running container to do the job, so I wrote this: https://github.com/samsung-cnct/set_max_map_count. It requires /proc to be mounted into the init container only, which gives me the heebies, but it works.

An example using this is here: https://github.com/samsung-cnct/elasticsearch-kubernetes/blob/master/es.yaml

Note: that was to figure things out; we were running on k8s v1.4. The chart that is based on that works for 1.5+ and is here: https://github.com/samsung-cnct/k2-charts/tree/master/elasticsearch


The pids cgroup (https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt) allows limiting the number of processes per cgroup, which is critical for avoiding fork bombs. Docker seems to have added support since the 1.11 release: https://github.com/docker/docker/pull/18697
Can we expose pids-limit in the k8s pod container spec?
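
For reference, the docker-level knob looks like this (a sketch using the --pids-limit flag that shipped in Docker 1.11; at the time there was no pod-spec equivalent):

docker run -d --name capped --pids-limit=100 nginx
docker inspect --format '{{.HostConfig.PidsLimit}}' capped
# prints 100; a fork bomb inside the container now stalls at 100 pids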

Instead of everyone having their own hacks and custom entrypoints just to set rlimit, I would also like to +1 allowing users to set it through the k8s API. Is this being worked on at all?

Is this being worked on at all?

No, but I think I'd like to enqueue this for @kubernetes/sig-cluster-lifecycle-feature-requests b/c it will directly affect them.

cc @kow3ns @jagosan

@itskingori - I spent a good bit of time trying to get memlock working with Elasticsearch in Kubernetes. Thanks to you I was able to get it going.

Would really prefer if kubernetes natively supported ulimits.

If you are interested in setting up an Elasticsearch cluster on k8s that requires ulimit -l unlimited, I'll share mine with all of you.

https://github.com/cesargomezvela/elasticsearch

Take a look inside the docker/run.sh file.

# allow for memlock if enabled
if [ "$MEMORY_LOCK" == "true" ]; then
    ulimit -l unlimited
fi
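
For that branch to fire, the pod spec presumably sets MEMORY_LOCK in the container environment, along these lines (a sketch; the variable name is taken from the run.sh snippet above):

env:
- name: MEMORY_LOCK
  value: "true"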

You may also be interested in the following lines that are inside the file kubernetes/elasticsearch-master.yaml

spec:
  replicas: 3
  template:
    metadata:
      annotations:
        pod.alpha.kubernetes.io/init-containers: '[
          {
            "name":"init-sysctl",
            "image":"busybox:1.27.2",
            "imagePullPolicy":"IfNotPresent",
            "command":[
              "sysctl",
              "-w",
              "vm.max_map_count=262144"
            ],
            "securityContext":{
                "privileged":true
            }
          }
        ]'
...
        securityContext:
          privileged: true
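
Note that pod.alpha.kubernetes.io/init-containers is the old alpha annotation form; on clusters from roughly Kubernetes 1.6 onward the same init container can be expressed in the spec proper. A sketch of the equivalent:

spec:
  template:
    spec:
      initContainers:
      - name: init-sysctl
        image: busybox:1.27.2
        imagePullPolicy: IfNotPresent
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true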

@timothysc @tnachen @bgrant0607 @cesargomezvela @gdmello So with securityContext/privileged/true and/or capabilities/add/SYS_RESOURCE, code running in the container is able to run ulimit now, right? Is there still a TODO item left in this issue?

For example, do we want to let pod authors explicitly specify the limits (like the cpu/memory requests/limits) without needing to bump their capability/privilege?

Notes:

Personally, I believe it is a good idea to let pod authors set explicit limits without bumping capabilities or privileges, for these reasons:

  • The SYS_RESOURCE capability may allow applications to perform operations you might not want to allow. This is even more true for privileged containers.
  • While some applications do need to set limits themselves (as they or their startup scripts are designed to do so), in other cases you may want to set them at pod creation time and not allow the application to change them at runtime (for security reasons, or for resource control reasons in public environments...).

Another problem with ulimit in an entrypoint is that you can't change the hard limit as non-root, and you can't do it via sudo either. Normally, in Linux, you do it via /etc/security/limits.conf, but docker ignores it, so you'll have to use root as the default user.

If you're looking for an elasticsearch image with ulimit in the entrypoint and elasticsearch running as the elasticsearch user, feel free to use/fork https://github.com/wodby/elasticsearch

@kubernetes/sig-node-feature-requests So would we be ok with doing something just for docker here? I can't tell if other CRI implementations can easily support this feature. (cc @timothysc @tnachen @bgrant0607)

There are plenty more arguments for why this is required at https://github.com/moby/moby/issues/4717, and it was merged in https://github.com/moby/moby/pull/9437. All that K8s should do is allow passing these arguments to the container it runs. Why is there even discussion here about cgroups support? That has been resolved at the container level. All you have to do is allow passing the extra argument, and let all the people who are creating ugly wrapper-script+privileged workarounds finally do the proper thing and just pass the argument they wanted to, which has been available to them for the last 3 years!

@kesor totally fair. I am digging up old issues like this and trying to see what we can do. Yes, it's easy to pass arguments to docker for sure, but we need to figure out if something can be done for other CRI implementations.

@kesor @csandanov @palonsoro @cesargomezvela @gdmello - Do you have some time to kick the tires of #58904 and provide some feedback?

File max limits are tricky and IMO (theoretically) the limits are set at multiple levels (container, docker daemon on the node, node itself and all the way to the storage tier/filesystem).

@dims Should there be a check at the PV level in addition? Example - what if at the container level we set the limit to 1M, and a user decides to go with EFS as the PV, where file-max limits can't be configured and the limit is low?

@dims after I saw your diff, what's the ETA for this docker run --ulimit in kubernetes? Is it in the kubernetes roadmap?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Are there any plans for this issue?
We're running the elassandra docker image in k8s and need the ability to change ulimits. Maybe someone has some hacks for how to set them for elassandra?
I've tried adding

args:
    - "--ulimit memlock=10000000:10000000"

with

 securityContext:
     privileged: true
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

to the stateful set specification, but it didn't work.

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

then supply the same capabilities you mentioned (privileged isn't required):

 securityContext:
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

@eyalzek Thanks! It works

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

then supply the same capabilities you mentioned (privileged isn't required):

 securityContext:
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

Can you please help me understand how to get this working with k8s?

@rewt Hi, sure
I created a custom-entrypoint.sh file (it should call the original entrypoint script; you can find out how to call it in the original Dockerfile of the image in question). For me (and most images) it looks like this:

#!/bin/bash

# Set memlock limit
ulimit -l 33554432

# Call original entrypoint script
exec /docker-entrypoint.sh "${@}"

Then I created a Dockerfile to build a custom elassandra image. It looks like this:

FROM strapdata/elassandra:5.5.0.20

COPY custom-entrypoint.sh /custom-entrypoint.sh

ENTRYPOINT ["/custom-entrypoint.sh"]
CMD ["bin/cassandra"]

Place those files in the same folder and just build the custom image with the docker build command. Then you can push it to your personal Docker Hub repo in order to use it in k8s in the cloud. Once you have your custom image, you can use it in the pod spec definition, adding the security context capabilities mentioned in your quote.
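
For concreteness, the build-and-push step might look like this (yourrepo is a placeholder for your own registry namespace):

docker build -t yourrepo/elassandra-custom:5.5.0.20 .
docker push yourrepo/elassandra-custom:5.5.0.20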

Here is an example of the StatefulSet that I've used: https://github.com/cybercongress/cybernode/blob/master/kubernetes-definitions/search/elassandra.yaml
And here is an example of building the custom image: https://github.com/cybercongress/cybernode/tree/master/docker/elassandra

These required as well?

securityContext:
  capabilities:
    add:
      - IPC_LOCK
      - SYS_RESOURCE


@rewt yep, just look at the example of StatefulSet

@rewt yep, just look at the example of StatefulSet

what does "/docker-entrypoint.sh" contain? I placed a shell command, but ulimit -l does not change.

@rewt yep, just look at the example of StatefulSet

what does "/docker-entrypoint.sh" contain? I placed a shell command, but ulimit -l does not change.

It's just the original entrypoint script for the elassandra image

@rewt yep, just look at the example of StatefulSet

what does "/docker-entrypoint.sh" contain? I placed a shell command, but ulimit -l does not change.

It's just the original entrypoint script for the elassandra image

Can you tell me if your value for ulimit -l changed?

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

...sorry for being obtuse, but can someone please explain why popular images like cassandra/elasticsearch wouldn't do this already as part of their entrypoint script?

Also, any reason this couldn't just be run as part of the Dockerfile, as is done here? https://github.com/imyoungyang/docker-swarm-elasticsearch/blob/master/Dockerfile#L12

@jamesdh because that line in the Dockerfile has no effect on a container that is running based on that image. The ulimit is a property of a running process; you cannot set it before the process is actually running, and its effect will only apply to that process and its children. In the case of your example in the Dockerfile, the command will only apply to that single line #12 and will have zero effect on any other line that comes later in that file, or on the eventual container created from the image that is based on that Dockerfile.

@jamesdh also, Docker, as the parent process, can change the limits of its children (running containers), and in many cases when a child needs to change its limits there are insufficient permissions to do so. This is why it is important for K8s to allow admins to pass these parameters to the docker run command: in most cases it simply cannot be done from a running container.
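
To make the scope point concrete: a ulimit call in a Dockerfile affects only the single RUN step it appears in (a sketch; some-build-step is a placeholder):

# This raises the limit only for this one RUN step's shell:
RUN ulimit -n 65536 && some-build-step
# Later steps, and containers started from the image, will not see 65536.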

It is critically important for me to set the ulimit -n for all the containers on specific nodes in my cluster. Is there any documentation on how to do this?

long-term-issue (note to self)

I've documented my experiences here: https://github.com/helm/charts/issues/8348 and have since moved from Pires to the Helm chart, as there were some issues with plugins installing on master nodes.


@rewt I can't get your solution (https://github.com/rewt/elasticsearch-mlock) to work, because ulimit can't do this as a non-root user, and elastic won't run as root. K8s doesn't seem to respect the /etc/security/limits.conf file of the host (or the image - which makes sense) so I can't find a way to get this right for elastic. We have swap on, which is why it's a problem at all.

You need to:

  1. Create a new Dockerfile
  2. Add code to the container definition to allow execution of ulimit

https://github.com/rewt/elasticsearch-mlock


@rewt It's your point 2 that I'm struggling with. I have exactly the same dockerfile as you (elastic 5, but I doubt that's the problem), and I have the following in my pod definition:

securityContext:
  privileged: true
  capabilities:
    add:
    - IPC_LOCK
    - SYS_RESOURCE

Am I missing something?

I found the placement of the securityContext code was very specific, as described in the link above. I placed it as follows:

    - name: ES_JAVA_OPTS
      value: "-Djava.net.preferIPv4Stack=true -Xms{{ .Values.data.heapSize }} -Xmx{{ .Values.data.heapSize }}"
    {{- range $key, $value := .Values.cluster.env }}
    - name: {{ $key }}
      value: {{ $value | quote }}
    {{- end }}

begin new code

    securityContext:
      capabilities:
        add:
          - IPC_LOCK
          - SYS_RESOURCE

end new code

    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"


@rewt Thanks for your help. That is the configuration that I have, but it doesn't seem to allow ulimit to act as non-root in this way, and I don't understand why it should. I don't want to take over this thread though - I think we'll look for an alternative solution.

Can't you pass in -e JAVA_OPTS="-Xmx4096m" with sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run -e JAVA_OPTS="-Xmx4096m" rancher/rancher-agent:v2.1.1...? I am pretty sure this works around this issue, as I have seen it as a solution with ES operators for Kubernetes.

@tobymiller1 in the solution by @rewt, for it to actually work you need to add a USER root stanza to the Dockerfile and then drop to the nobody (or whichever) user after you execute the ulimit -l unlimited command in the alternate wrapper.

@tobymiller1 If you want to be able to run the image as a non-root user in kubernetes, you can use setcap directly on the executable. RUN setcap cap_ipc_lock=+ep ./your_exe && setcap cap_sys_resource=+ep ./your_exe

You will need the libcap package installed in alpine to get that command. Add that line before you switch down to your limited user.

In Linux, capabilities are lost when the UID changes from 0 (root) to non-zero (your limited user).

# Do other stuff
RUN setcap cap_ipc_lock=+ep ./your_exe && setcap cap_sys_resource=+ep ./your_exe
USER your_limited_user

The executable can now change user limits and lock memory.

Another item to note: aufs does NOT support capabilities, as it doesn't support extended attributes. You must use another FS like overlay2. Make sure your Kubernetes provider is not using aufs.
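
As a sanity check that the file capabilities actually stuck (a sketch; getcap ships in the same libcap package, and output formatting varies by version):

getcap ./your_exe
# e.g. ./your_exe = cap_ipc_lock,cap_sys_resource+ep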

@jasongerard Just tried your solution but it doesn't work; is there anything I missed?

I am using Alpine as the base image and want to run "ulimit -l unlimited" in entrypoint.sh as the non-root user esuser. My Kubernetes cluster is running on Ubuntu 16.04 and using the overlay2 FS.

RUN apk add --no-cache libcap && setcap cap_ipc_lock=+ep /entrypoint.sh && setcap cap_sys_resource=+ep /entrypoint.sh
USER esuser

I got an error message like this:
/entrypoint.sh: line 21: ulimit: max locked memory: cannot modify limit: Operation not permitted

@wahaha2001 you are calling setcap on a shell script. The executable that gets invoked requires the capability. In the case of a script, whatever executable is in your '#!' line will need the capability.

@jasongerard, thanks for the quick response. I am using #!/bin/bash in the entrypoint.sh, so I changed it to:

RUN apk add --no-cache libcap && setcap cap_ipc_lock=+ep /bin/bash && setcap cap_sys_resource=+ep /bin/bash
USER esuser

But still get same error :(

@wahaha2001 You will still need to add the IPC_LOCK capability to your pod.

apiVersion: v1
kind: Pod
metadata:
  name: somename
spec:
  containers:
  - name: somename
    image: somename:latest
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      capabilities:
        add: ["IPC_LOCK"]
  restartPolicy: Never

Putting this here so I can find it again later: https://do-db2.lkml.org/lkml/2011/6/19/170

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

then supply the same capabilities you mentioned (privileged isn't required):

 securityContext:
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

This only works if we use root to run the application. Any solution for non-root users?

@manojtr see my comments above about running for non-root. You need to use setcap.

@jasongerard I tried the below and I am getting a standard_init_linux.go:207: exec user process caused "operation not permitted" error when I simply build the image using:

docker build -t elasticsearch-2.4.6:dev --rm -f image/Dockerfile.v2 image
RUN setcap cap_ipc_lock=+ep /bin/bash && setcap cap_sys_resource=+ep /bin/bash

USER elasticsearch
CMD ["/bin/bash", "bin/es-docker"]

For the Elasticsearch case, changing the rlimit/locking memory is only a concern if you can't disable swapping, correct? AFAIK swapping is disabled on Kubernetes nodes, so this shouldn't be an issue, right?

btw I have tried the following for Cassandra:

      initContainers:
      - name: increase-memlock-ulimit
        image: busybox
        command: ["sh", "-c", "ulimit -l unlimited"]
        securityContext:
          privileged: true

with

      securityContext:
        capabilities:
          add:
            - IPC_LOCK
            - SYS_RESOURCE

but it is not working; I still see errors about RLIMIT_MEMLOCK and the value inside the container is not changed to unlimited...

@guitmz I do not believe an initContainer will work. In that case, you are setting the ulimit for _that_ container, not the pod as a whole.

Below are links to an example Dockerfile and pod.yaml for running a process successfully as non-root with the ability to change the ulimit. Please note that the process in this example changes the ulimit with a syscall.

Dockerfile: https://github.com/jasongerard/mlockex/blob/master/Dockerfile
pod.yaml: https://github.com/jasongerard/mlockex/blob/master/pod.yaml

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Can we remove the stale bot for this issue?

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Context: sig-architecture subgroup work to search for technical debt
Category: Enterprise Readiness
Reason: community interest, people trying workarounds, useful for enterprise workloads, improves enterprise adoption

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

+1

/cc

+1

+1

+1

I would also like to vote for this issue. It has been open for 5 years, and I think it really makes sense to be able to configure the file descriptor limit per container, similar to the way we can configure requests and limits for CPU, memory, and huge pages. For proxy-like software, file descriptors are a key limiting factor, and it is extremely important to be able to request and control this resource for the whole lifecycle of the container, as with CPU.

I'd like to vote for this issue as well; it would be really nice to have this feature available out of the box in Kubernetes instead of using workarounds and/or hacks.

Ran into this today in an EKS Fargate context where I can't easily override Docker ulimit settings. Would definitely appreciate Kubernetes config support for this.

+1

+1
