Kubernetes: rlimit support

Created on 18 Jan 2015 · 122 comments · Source: kubernetes/kubernetes

https://github.com/docker/docker/issues/4717#issuecomment-70373299

Now that this is in, we should define how we want to use it.

area/isolation kind/feature priority/important-soon sig/node

Most helpful comment

Instead of everyone having their own hacks and custom entrypoints just to set rlimit, I would also like to +1 allowing users to set it through the k8s API. Is this being worked on at all?

All 122 comments

/cc @vishh @rjnagal @vmarmol

We can set a sane default for now. Do we want this to be exposed as a knob in the spec, or do we prefer a low/high toggle? The only advantage of a toggle is that we can possibly avoid too many jobs with high values landing on the same machine.


Are there any downsides to setting high limits by default these days? I
can't keep straight what bugs we have fixed internally that might not have
been accepted upstream, especially regarding things like memcg accounting
of kernel structs.


+1 to toggle, putting it in the spec is overkill IMO.

If there is a toggle for "few" vs "many" everyone will choose "many". We
need to understand and document why "few" is the better choice most of the
time, and think about how to restrict who uses "many".


Kernel memory accounting seems to be disabled in our container vm image.
The overall fd limit might also be a factor to consider. Given these
constraints, providing a toggle option makes sense.


One way to restrict "many" would be to take the global machine limits into account and use them in scheduling.
I don't think we have or are planning to add user-based capabilities.


For our project we would use both "few" and "many". The lower limit would be for our worker containers (stateless) and the high limit would be for our storage containers (stateful).

+1 to toggle, but what exactly do "few" and "many" mean?
Also, what are the implications for scheduling?

I don't think few and many are useful categorizations. I also disagree with the stateless vs. storage distinction. Many frontends need lots of fds for sockets.

I would assume that we would at best only do minimal checks in the scheduler, as these resources would be highly overcommitted. We can have an admission check on the node side to reject pod requests, or inform the scheduler when it's running low - more of an out-of-resource model.

For the "large" and "few" values, we can start with the typical Linux max for the resource as "large", and the typical default as "few".

@bgrant0607 what kind of model did you have in mind for representing these as resources?

I don't know that we need to track these values in the scheduler. They are more for DoS prevention than allocating a finite resource.

I'm skeptical that "large" and "few" are adequate, because the lack of numerical values would make it difficult for users to predict what category they should request, and the choice might not even be portable and/or stable over time. Do you think users wouldn't know how many file descriptors to request, for example? That seems like it can be computed with simple arithmetic based on the number of clients one wants to support, for instance.

What's the downside of just exposing numerical parameters?

I agree we should choose reasonable, modest defaults.
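
For concreteness, a numeric knob might have looked something like this in the pod spec (a purely hypothetical sketch; the ulimits field below is invented for illustration and has never been part of the API):

containers:
- name: web
  ulimits:
  - name: nofile
    soft: 65536
    hard: 65536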

"small" and "large" feels clunky, but I like the ability for site admins to define a few grades of service and then let users choose. I think it works pretty well internally - at least most users survive with the default of "small".


Re. admin-defined policies, see https://github.com/docker/docker/issues/11187

@bgrant0607: Is this something that we can consider for v1.1?

We can, but I'll have 0 bandwidth to think about it for the next month, probably.

cc @erictune

The Docker rlimit feature is process-based, not cgroup-based (of course, the upstream kernel doesn't have an rlimit cgroup yet). This means:

  • The limit is applied to the container's root process, and all child processes inherit it
  • There is no control over how many child processes are created
  • Processes started by docker exec do not inherit the same limit

Based on the above, I don't think this is a very useful feature, or at least, not an easy-to-use feature to specify and manage.
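
For what it's worth, those semantics can be observed with the docker CLI directly (a minimal sketch, assuming a local docker daemon; the container name is arbitrary and exact exec behavior varies by Docker version):

docker run --rm --ulimit nofile=2048:2048 busybox sh -c 'ulimit -n'
# prints 2048: the root process and its children get the limit

docker run -d --name ulimit-demo --ulimit nofile=2048:2048 busybox sleep 300
docker exec ulimit-demo sh -c 'ulimit -n'
# may print the daemon default rather than 2048: exec'd processes
# are not children of the container's root process
docker rm -f ulimit-demo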

Where does this stand? Is it available through any config?

@shaylevi2: Look at the previous comment. Docker's current implementation isn't what we need.

I keep getting asked questions in this space, so maybe a pointer to documentation somewhere explaining why it's not supported yet would be useful. Typing that, I admit it is odd to write something about why we don't do something, but @dchen1107's earlier comment is helpful.

@thockin @bgrant0607 looks like this is not going to be addressed? This is definitely an issue with both elasticsearch and cassandra running on K8s. What is the recommended best practice to get RLIMIT_MEMLOCK happy?

cc @kubernetes/sig-node-feature-requests

Do those applications also require huge pages?


Yep Cassandra and Elastic - jemalloc for Cassandra

Redis as well uses jemalloc

So it's fair to say you need both features supported on the node?


:) Big want :)

JVM huge pages with jemalloc, and Redis is in C with jemalloc

Does privileged mode allow for that now? I think we are actually at the kernel / docker level, not k8s

@rjnagal Is there a more recent kernel limit cgroup RFC?

http://thread.gmane.org/gmane.linux.kernel.cgroups/12687

What is the status of this?
To run a high-traffic nginx site I need to increase the number of open files.
By default the value is:

ulimit -n
4096

The nginx configuration to increase the number of open files is

worker_rlimit_nofile 10240;

However on startup, I'm seeing this in error.log

[alert] 78#78: setrlimit(RLIMIT_NOFILE, 10240) failed (1: Operation not permitted)

When starting docker, I would set the ulimit to resolve this problem, e.g.

--ulimit nofile=262144:262144 

How can I do this in kubernetes?

@mingfang have you tried privileged mode?

@chrislovecnm the hugepages design doc is being worked on by @sjenning; perhaps you can elaborate on use cases for ES and Cassandra... those are very relevant for RH as well since we use them in products.

He and I were discussing ulimits yesterday wrt memlock, and also that we might need to namespace vm.hugetlb_shm_group in the kernel. Very similar use cases. It makes sense to collapse all those goals into one design doc and phase the implementation accordingly.

I'm having trouble understanding why the comment by @dchen1107 is a showstopper... perfect enemy of good? ulimits cover a broad range of needs, from running a basic webserver to high performance computing. Privileged mode is not an option in many environments, and I think a lot of people would prefer to use kubernetes in those environments, even if it comes with certain caveats.

@tmandry what features do you need?

@tmandry - the problem is that per-process limits are not very useful for isolation purposes. There are some uses, I admit, but every facet of the API has a cost to develop, test, maintain, document, and comprehend, so we need to balance that. Which rlimits are you interested in?


In my opinion, regardless of whether it is useful to set up a per-process limit, there is a fact: many people need to increase the file descriptor or other rlimits in order for their processes to work in some scenarios. Examples would be servers or proxies with many concurrent connections (this is the situation for me and other people who have already commented here).

I am in favor of letting the user choose custom values. I agree with @bgrant0607's comment about a predefined set of levels not being portable, and I also think there is nothing against letting a user customize the numeric values (if a user wants to increase a file descriptor limit, he wants it because he knows what it means). I also see it as easier to implement in Kubernetes than having to define those levels, configure them, and so on.

Furthermore, from my perspective, a user is likely to expect to be able to tune the rlimits of his process in a container as he does in a VM. This degree of control is what custom values would achieve.

Another thing that may be interesting to consider: I think it would make sense to be able to restrict the rlimits a user can set via Pod Security Policies.

I said previously "a user is likely to expect to be able to tune the rlimits of his process in a container as he does in a VM". But, on the other hand, as a cluster admin, I would like to give the user that freedom while still being able to set reasonable limits.

I need this feature to be a workaround for the issue
https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1350

In docker run there is --ulimit.
Can I do this in kubernetes now?

Agreed on the fd limit, that's something I've run into a few times. In my case, the ability to pin cores by raising the rtprio ulimit is also important.

Fair enough. Raising a limit is different than isolating a resource.
rlimit is more appropriate for one than the other.


This issue currently makes it impossible to run Elasticsearch on Kubernetes without performance degradation. 😰

[WARN ][bootstrap] Unable to lock JVM Memory: error=12,reason=Out of memory
[WARN ][bootstrap] This can result in part of the JVM being swapped out.
[WARN ][bootstrap] Increase RLIMIT_MEMLOCK, soft limit: 65536, hard limit: 65536

We've kinda gotten around this by editing the ENTRYPOINT script available in the official Elasticsearch Docker image.

Setting this at the top ...

#!/bin/bash

set -e

echo '1: Before ulimit'
ulimit -l
ulimit -l unlimited
ulimit -l
echo '2. After ulimit'

... gives us ...

2017-01-13T09:00:36.117522472Z 1: Before ulimit
2017-01-13T09:00:36.117551488Z 64
2017-01-13T09:00:36.117569173Z unlimited
2017-01-13T09:00:36.117574588Z 2. After ulimit

Ohh man, I didn't realize this was still open.

If folks want to self host a larger cluster this is paramount to fix.

/cc @jbeda @luxas @vishh @philips

If folks want to self host a larger cluster this is paramount to fix.

Well, we did some tinkering and we just override it with docker's config :-/
So... there is an escape hatch.

@timothysc what exactly would i need to override?

@itskingori the official Elasticsearch Docker image has bin/es-docker running under the user elasticsearch, which has no privilege to run the ulimit command; how did you achieve that?

@hiscal2015 By giving the container some extra capabilities ... like this ... 👇

---
apiVersion: apps/v1beta1
kind: StatefulSet
spec:
  template:
    spec:
      containers:
        - name: elasticsearch
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
                - SYS_RESOURCE

@itskingori Still no luck, here's my Dockerfile:

FROM docker.elastic.co/elasticsearch/elasticsearch:5.2.0
ADD elasticsearch.yml /usr/share/elasticsearch/config/
USER root
RUN chown elasticsearch:elasticsearch config/elasticsearch.yml
RUN sed -i '2i\set -e\' /usr/share/elasticsearch/bin/es-docker
RUN sed -i '3i\ulimit -l unlimited\' /usr/share/elasticsearch/bin/es-docker
USER elasticsearch

And here's my yaml:

    spec:
      containers:
      - name: es-master
        securityContext:
          privileged: true
          capabilities:
            add:
              - IPC_LOCK
              - SYS_RESOURCE

And here's the error msg:

3/20/2017 7:44:14 PMbin/es-docker: line 3: ulimit: max locked memory: cannot modify limit: Operation not permitted

Am I missing something?

@hiscal2015 Could you please edit your comment with "triple ticks" to format the pasted files? Like this:

``` dockerfile
[ dockerfile goes here ]
```

``` yaml
[ yaml goes here ]
```

It will help everyone. Thanks!

@hiscal2015 I'm not sure what you're trying to do in your Dockerfile ... it seems you're trying to do things during the build phase. I can't follow your idea through to an approach that could work.

I'm just gonna share what we did ... we overrode the image ENTRYPOINT (which was docker-entrypoint.sh) to run what we want first. Like this ...

FROM elasticsearch:2.4.3-alpine

COPY configs/elasticsearch.yml /usr/share/elasticsearch/config/elasticsearch.yml
COPY custom-entrypoint.sh /custom-entrypoint.sh

VOLUME /usr/share/elasticsearch/data

ENTRYPOINT ["/custom-entrypoint.sh"]
CMD ["elasticsearch"]

Where custom-entrypoint.sh is ...

#!/bin/bash

set -e

ulimit -l unlimited

exec /docker-entrypoint.sh "$@"

The idea is that custom-entrypoint.sh will run when the container starts, which means ulimit -l unlimited will run before the command that starts Elasticsearch (which is what we want).

When we started trying to run ES 5.0 we hit this issue as well. We're running an init-container to prep the node. I was unable to get ulimit inside a running container to do the job, so I wrote this: https://github.com/samsung-cnct/set_max_map_count. It requires /proc to be mounted into the init container only, which gives me the heebies, but it works.

An example using this is here: https://github.com/samsung-cnct/elasticsearch-kubernetes/blob/master/es.yaml

Note: that was to figure things out; we were running on k8s v1.4. The chart that is based on that works for 1.5+ and is here: https://github.com/samsung-cnct/k2-charts/tree/master/elasticsearch


The pids cgroup (https://www.kernel.org/doc/Documentation/cgroup-v1/pids.txt) allows limiting the number of processes per cgroup, which is critical for avoiding fork bombs. Docker seems to have added support since the 1.11 release: https://github.com/docker/docker/pull/18697
Can we expose pids-limit in the k8s pod container spec?
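
For reference, the docker-level knob looks like this (a sketch using the --pids-limit flag that shipped in Docker 1.11; at the time there was no pod-spec equivalent):

docker run -d --name capped --pids-limit=100 nginx
docker inspect --format '{{.HostConfig.PidsLimit}}' capped
# prints 100; a fork bomb inside the container now stalls at 100 pids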

Instead of everyone having their own hacks and custom entrypoints just to set rlimit, I would also like to +1 allowing users to set it through the k8s API. Is this being worked on at all?

Is this being worked on at all?

No, but I think I'd like to enqueue this for @kubernetes/sig-cluster-lifecycle-feature-requests b/c it will directly affect them.

cc @kow3ns @jagosan

@itskingori - I spent a good bit of time trying to get memlock working with Elasticsearch in Kubernetes. Thanks to you I was able to get it going.

Would really prefer if kubernetes natively supported ulimits.

If you are interested in setting up an Elasticsearch cluster on k8s that requires ulimit -l unlimited, I'll share mine with all of you.

https://github.com/cesargomezvela/elasticsearch

Take a look inside the docker/run.sh file.

# allow for memlock if enabled
if [ "$MEMORY_LOCK" == "true" ]; then
    ulimit -l unlimited
fi
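
For that branch to fire, the pod spec presumably sets MEMORY_LOCK in the container environment, along these lines (a sketch; the variable name is taken from the run.sh snippet above):

env:
- name: MEMORY_LOCK
  value: "true"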

You may also be interested in the following lines that are inside the file kubernetes/elasticsearch-master.yaml

spec:
  replicas: 3
  template:
    metadata:
      annotations:
        pod.alpha.kubernetes.io/init-containers: '[
          {
            "name":"init-sysctl",
            "image":"busybox:1.27.2",
            "imagePullPolicy":"IfNotPresent",
            "command":[
              "sysctl",
              "-w",
              "vm.max_map_count=262144"
            ],
            "securityContext":{
                "privileged":true
            }
          }
        ]'
...
        securityContext:
          privileged: true
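
Note that pod.alpha.kubernetes.io/init-containers is the old alpha annotation form; on clusters from roughly Kubernetes 1.6 onward the same init container can be expressed in the spec proper. A sketch of the equivalent:

spec:
  template:
    spec:
      initContainers:
      - name: init-sysctl
        image: busybox:1.27.2
        imagePullPolicy: IfNotPresent
        command: ["sysctl", "-w", "vm.max_map_count=262144"]
        securityContext:
          privileged: true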

@timothysc @tnachen @bgrant0607 @cesargomezvela @gdmello So with securityContext/privileged/true and/or capabilities/add/SYS_RESOURCE, code running in the container is able to run ulimit now, right? Is there still a TODO item left in this issue?

For example, do we want to let pod authors explicitly specify the limits (like the cpu/memory requests/limits) without needing to bump their capability/privilege?

Notes:

Personally, I believe it is a good idea to let pod authors set explicit limits without bumping capabilities or privileges, for these reasons:

  • The SYS_RESOURCE capability may allow applications to perform operations you might not want to allow. This is even more true for privileged containers.
  • While some applications do need to set limits themselves (as they or their startup scripts are designed to do so), in other cases you may want to set them at pod creation time and not allow the application to change them at runtime (for security reasons, or for resource control reasons in public environments...).

Another problem with ulimit in an entrypoint is that you can't change the hard limit as non-root, and you can't do it via sudo either. Normally, in Linux, you do it via /etc/security/limits.conf, but docker ignores it, so you'll have to use root as the default user.

If you're looking for an elasticsearch image with ulimit in the entrypoint and elasticsearch running as the elasticsearch user, feel free to use/fork https://github.com/wodby/elasticsearch

@kubernetes/sig-node-feature-requests So would we be ok with doing something just for docker here? I can't tell if other CRI implementations can easily support this feature. (cc @timothysc @tnachen @bgrant0607)

There are plenty more arguments for why this is required at https://github.com/moby/moby/issues/4717, and it was merged in https://github.com/moby/moby/pull/9437. All that K8s should do is allow passing these arguments to the container it runs. Why is there even discussion here about cgroups support? That has been resolved at the container level. All you have to do is allow passing the extra argument, and let all the people who are creating ugly wrapper-script+privileged workarounds finally do the proper thing and just pass the argument they wanted to, which has been available to them for the last 3 years!

@kesor totally fair. I am digging up old issues like this and trying to see what we can do. Yes, it's easy to pass arguments to docker for sure, but we need to figure out if something can be done for other CRI implementations.

@kesor @csandanov @palonsoro @cesargomezvela @gdmello - Do you have some time to kick the tires of #58904 and provide some feedback?

File max limits are tricky and IMO (theoretically) the limits are set at multiple levels (container, docker daemon on the node, node itself and all the way to the storage tier/filesystem).

@dims Should there be a check at the PV level in addition? Example - what if at the container level we set the limit to 1M, and a user decides to go with EFS as the PV, where file-max limits can't be configured and the limit is low?

@dims after I saw your diff, what's the ETA for this docker run --ulimit in kubernetes? Is it in the kubernetes roadmap?

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Are there any plans for this issue?
We're running the elassandra docker image in k8s and need the ability to change ulimits. Maybe someone has some hacks for how to set them for elassandra?
I've tried adding

args:
    - "--ulimit memlock=10000000:10000000"

with

 securityContext:
     privileged: true
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

to the stateful set specification, but it didn't work.

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

then supply the same capabilities you mentioned (privileged isn't required):

 securityContext:
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

@eyalzek Thanks! It works

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

then supply the same capabilities you mentioned (privileged isn't required):

 securityContext:
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

Can you please help me understand how to get this working with k8s?

@rewt Hi, sure
I created a custom-entrypoint.sh file (it should call the original entrypoint script; you can find out how to call it in the original Dockerfile of the image in question). For me (and most images) it looks like this:

#!/bin/bash

# Set memlock limit
ulimit -l 33554432

# Call original entrypoint script
exec /docker-entrypoint.sh "${@}"

Then I created a Dockerfile to build a custom elassandra image. It looks like this:

FROM strapdata/elassandra:5.5.0.20

COPY custom-entrypoint.sh /custom-entrypoint.sh

ENTRYPOINT ["/custom-entrypoint.sh"]
CMD ["bin/cassandra"]

Place those files in the same folder and just build the custom image with the docker build command. Then you can push it to your personal Docker Hub repo in order to use it in k8s in the cloud. Once you have your custom image, you can use it in the pod spec definition, adding the security context capabilities mentioned in your quote.
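
For concreteness, the build-and-push step might look like this (yourrepo is a placeholder for your own registry namespace):

docker build -t yourrepo/elassandra-custom:5.5.0.20 .
docker push yourrepo/elassandra-custom:5.5.0.20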

Here is an example of the StatefulSet that I've used: https://github.com/cybercongress/cybernode/blob/master/kubernetes-definitions/search/elassandra.yaml
And here is an example of building the custom image: https://github.com/cybercongress/cybernode/tree/master/docker/elassandra

These required as well?

securityContext:
  capabilities:
    add:
      - IPC_LOCK
      - SYS_RESOURCE


@rewt yep, just look at the example of StatefulSet

@rewt yep, just look at the example of StatefulSet

what does "/docker-entrypoint.sh" contain? I placed a shell command, but ulimit -l does not change.

@rewt yep, just look at the example of StatefulSet

what does "/docker-entrypoint.sh" contain? I placed a shell command, but ulimit -l does not change.

It's just the original entrypoint script for the elassandra image

@rewt yep, just look at the example of StatefulSet

what does "/docker-entrypoint.sh" contain? I placed a shell command, but ulimit -l does not change.

It's just the original entrypoint script for the elassandra image

Can you tell me if your value for ulimit -l changed?

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

...sorry for being obtuse, but can someone please explain why popular images like cassandra/elasticsearch wouldn't do this already as part of their entrypoint script?

Also, any reason this couldn't just be run as part of the Dockerfile, as is done here? https://github.com/imyoungyang/docker-swarm-elasticsearch/blob/master/Dockerfile#L12

@jamesdh because that line in the Dockerfile has no effect on a container that is running based on that image. The ulimit is a property of a running process; you cannot set it before the process is actually running, and its effect will only apply to that process and its children. In the case of your example in the Dockerfile, the command will only apply to that single line #12 and will have zero effect on any other line that comes later in that file, or on the eventual container created from the image that is based on that Dockerfile.

@jamesdh also, Docker, as the parent process, can change the limits of its children (running containers), and in many cases when a child needs to change its limits there are insufficient permissions to do so. This is why it is important for K8s to allow admins to pass these parameters to the docker run command: in most cases it simply cannot be done from a running container.
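
To make the scope point concrete: a ulimit call in a Dockerfile affects only the single RUN step it appears in (a sketch; some-build-step is a placeholder):

# This raises the limit only for this one RUN step's shell:
RUN ulimit -n 65536 && some-build-step
# Later steps, and containers started from the image, will not see 65536.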

It is critically important for me to set the ulimit -n for all the containers on specific nodes in my cluster. Is there any documentation on how to do this?

long-term-issue (note to self)

I've documented my experiences here: https://github.com/helm/charts/issues/8348 and have since moved from Pires to the Helm chart, as there were some issues with plugins installing on master nodes.


@rewt I can't get your solution (https://github.com/rewt/elasticsearch-mlock) to work, because ulimit can't do this as a non-root user, and elastic won't run as root. K8s doesn't seem to respect the /etc/security/limits.conf file of the host (or the image - which makes sense) so I can't find a way to get this right for elastic. We have swap on, which is why it's a problem at all.

You need to:

  1. Create a new Dockerfile
  2. Add code to the container definition to allow execution of ulimit

https://github.com/rewt/elasticsearch-mlock


@rewt It's your point 2 that I'm struggling with. I have exactly the same dockerfile as you (elastic 5, but I doubt that's the problem), and I have the following in my pod definition:

securityContext:
  privileged: true
  capabilities:
    add:
    - IPC_LOCK
    - SYS_RESOURCE

Am I missing something?

I found the placement of the securityContext code was very specific, as described in the link above. I placed it as follows:

    - name: ES_JAVA_OPTS
      value: "-Djava.net.preferIPv4Stack=true -Xms{{ .Values.data.heapSize }} -Xmx{{ .Values.data.heapSize }}"
    {{- range $key, $value := .Values.cluster.env }}
    - name: {{ $key }}
      value: {{ $value | quote }}
    {{- end }}

begin new code

    securityContext:
      capabilities:
        add:
          - IPC_LOCK
          - SYS_RESOURCE

end new code

    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"


@rewt Thanks for your help. That is the configuration that I have, but it doesn't seem to allow ulimit to act as non-root in this way, and I don't understand why it should. I don't want to take over this thread though - I think we'll look for an alternative solution.

Can't you pass in -e JAVA_OPTS="-Xmx4096m" with sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run -e JAVA_OPTS="-Xmx4096m" rancher/rancher-agent:v2.1.1...? I am pretty sure this works around this issue, as I have seen it as a solution with ES operators for Kubernetes.

@tobymiller1 in the solution by @rewt, for it to actually work you need to add a USER root stanza to the Dockerfile and then drop to the nobody (or whichever) user after you execute the ulimit -l unlimited command in the alternate wrapper.

@tobymiller1 If you want to be able to run the image as a non-root user in kubernetes, you can use setcap directly on the executable. RUN setcap cap_ipc_lock=+ep ./your_exe && setcap cap_sys_resource=+ep ./your_exe

You will need the libcap package installed in alpine to get that command. Add that line before you switch down to your limited user.

In Linux, capabilities are lost when the UID changes from 0 (root) to non-zero (your limited user).

# Do other stuff
RUN setcap cap_ipc_lock=+ep ./your_exe && setcap cap_sys_resource=+ep ./your_exe
USER your_limited_user

The executable can now change user limits and lock memory.

Another item to note: aufs does NOT support capabilities, as it doesn't support extended attributes. You must use another FS like overlay2. Make sure your Kubernetes provider is not using aufs.
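
As a sanity check that the file capabilities actually stuck (a sketch; getcap ships in the same libcap package, and output formatting varies by version):

getcap ./your_exe
# e.g. ./your_exe = cap_ipc_lock,cap_sys_resource+ep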

@jasongerard Just tried your solution but it doesn't work; is there anything I missed?

I am using Alpine as the base image and want to run "ulimit -l unlimited" in entrypoint.sh as the non-root user esuser. My Kubernetes cluster is running on Ubuntu 16.04 and using the overlay2 FS.

RUN apk add --no-cache libcap && setcap cap_ipc_lock=+ep /entrypoint.sh && setcap cap_sys_resource=+ep /entrypoint.sh
USER esuser

I got an error message like this:
/entrypoint.sh: line 21: ulimit: max locked memory: cannot modify limit: Operation not permitted

@wahaha2001 you are calling setcap on a shell script. The executable that gets invoked requires the capability. In the case of a script, whatever executable is in your '#!' line will need the capability.

@jasongerard, thanks for the quick response. I am using #!/bin/bash in the entrypoint.sh, so I changed it to:

RUN apk add --no-cache libcap && setcap cap_ipc_lock=+ep /bin/bash && setcap cap_sys_resource=+ep /bin/bash
USER esuser

But still get same error :(

@wahaha2001 You will still need to add the IPC_LOCK capability to your pod.

apiVersion: v1
kind: Pod
metadata:
  name: somename
spec:
  containers:
  - name: somename
    image: somename:latest
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000
      capabilities:
        add: ["IPC_LOCK"]
  restartPolicy: Never

Putting this here so I can find it again later: https://do-db2.lkml.org/lkml/2011/6/19/170

@arturalbov we do something similar with elasticsearch/cassandra, just create a custom image with an alternate entrypoint script, e.g:

#!/bin/bash

# Set memlock limit to unlimited
ulimit -l unlimited

# Call original entrypoint script
exec /usr/local/bin/docker-entrypoint.sh "${@}"

then supply the same capabilities you mentioned (privileged isn't required):

 securityContext:
     capabilities:
         add:
             - IPC_LOCK
             - SYS_RESOURCE

This only works if we use root to run the application. Any solution for non-root users?

@manojtr see my comments above about running for non-root. You need to use setcap.

@jasongerard I tried the below and I am getting a standard_init_linux.go:207: exec user process caused "operation not permitted" error when I simply build the image using:

docker build -t elasticsearch-2.4.6:dev --rm -f image/Dockerfile.v2 image
RUN setcap cap_ipc_lock=+ep /bin/bash && setcap cap_sys_resource=+ep /bin/bash

USER elasticsearch
CMD ["/bin/bash", "bin/es-docker"]

For the Elasticsearch case, changing the rlimit/locking memory is only a concern if you can't disable swapping, correct? AFAIK swapping is disabled on Kubernetes nodes, so this shouldn't be an issue, right?

btw I have tried the following for Cassandra:

      initContainers:
      - name: increase-memlock-ulimit
        image: busybox
        command: ["sh", "-c", "ulimit -l unlimited"]
        securityContext:
          privileged: true

with

      securityContext:
        capabilities:
          add:
            - IPC_LOCK
            - SYS_RESOURCE

but it is not working; I still see errors about RLIMIT_MEMLOCK and the value inside the container is not changed to unlimited...

@guitmz I do not believe an initContainer will work. In that case, you are setting the ulimit for _that_ container, not the pod as a whole.

Below are links to an example Dockerfile and pod.yaml for running a process successfully as non-root with the ability to change the ulimit. Please note that the process in this example changes the ulimit with a syscall.

Dockerfile: https://github.com/jasongerard/mlockex/blob/master/Dockerfile
pod.yaml: https://github.com/jasongerard/mlockex/blob/master/pod.yaml

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

Can we remove the stale bot for this issue?

/remove-lifecycle stale

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

Context: sig-architecture subgroup work to search for technical debt
Category: Enterprise Readiness
Reason: community interest, people trying workarounds, useful for enterprise workloads, improves enterprise adoption

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

/remove-lifecycle stale

+1

/cc

+1

+1

+1

I would also like to vote for this issue. It has been open for 5 years, and I think it really makes sense to be able to configure the file descriptor limit per container, similar to the way we can configure requests and limits for CPU, memory, and huge pages. For proxy-like software, file descriptors are a key limiting factor, and it is extremely important to be able to request and control this resource for the whole lifecycle of the container, as with CPU.

I'd like to vote for this issue as well; it would be really nice to have this feature available out of the box in Kubernetes instead of using workarounds and/or hacks.

Ran into this today in an EKS Fargate context where I can't easily override Docker ulimit settings. Would definitely appreciate Kubernetes config support for this.

+1

+1
