Skaffold: High CPU usage on Mac OS X

Created on 8 May 2018 · 28 Comments · Source: GoogleContainerTools/skaffold

Actual behavior

The skaffold process sustains roughly 100% CPU usage when started with the "skaffold dev" command. The project itself is quite small (4 deployments of third-party services and 1 locally built Docker image), and the high CPU usage is present even without any local file changes.

VBoxHeadless process also has a sustained 150-200% CPU usage in Activity Monitor.

Information

Skaffold version: v0.5.0
Operating system: Mac OS X 10.12.6
Content of skaffold.yaml:

apiVersion: skaffold/v1alpha2
kind: Config
build:
  artifacts:
  - imageName: my-tools-image
    workspace: ./tools
    docker: {}
  local: {}
deploy:
  kubectl:
    manifests:
    - ./kubernetes/*

kind/bug

All 28 comments

Also seeing this behavior on 10.11.6. I've been having issues with running out of file descriptors in other apps, though I'm not sure if that's relevant to what's going on here or not...

I have seen it also on linux (Linux turtle 4.16.7-1-ARCH #1 SMP PREEMPT Wed May 2 21:12:36 UTC 2018 x86_64 GNU/Linux)

Ref #549

I've instrumented my local build of skaffold and found that this line:

https://github.com/GoogleContainerTools/skaffold/blob/762c3ada56d758913e15dfaadfa73b0e68328e8b/pkg/skaffold/kubernetes/log.go#L74

seems to suddenly return a nil/empty Event repeatedly. I have been able to trigger it by switching from one internet connection to another. I think the watch channel gets closed and we're stuck in an infinite loop because of it.

I'm not certain this is the source of the problem, but I'm experimenting locally with:

    case evt, ok := <-watcher.ResultChan():
        if !ok {
            // The watch channel has been closed; try reopening it
            // instead of spinning on zero-value events.
            logrus.Infof("Log aggregator watch has stopped, respawning it.")
            a.Start(ctx, client)
            break
        }
        // ... existing handling of evt continues here ...

I've been running this for a few days now and while the CPU usage issue is gone, I've seen the "Log aggregator watch has stopped, respawning it." message spewed to my logs over and over again.

I guess there's something that goes bad elsewhere in the log watching state machine and just respawning it again is not sufficient. This change is better than nothing as it gives a landmark in the logs to launch deeper investigations from. Anyone have ideas about what might be going on?

We have an mtime watcher now that should be better.

@dlorenc can you elaborate please? I looked at the linked PR but there is not much detail in it. Does the mtime watcher basically poll large directories for file modifications? I'm experiencing this issue when switching between my home network and my office network. If the mtime solution uses polling, this will probably be bad for my use case, since I'm using nodejs and the node_modules tree can be extremely large (gigabytes of deep directory structure).

I'm very curious about the underlying cause of the problem with the inotify/fsevents solution. I've used those tools a lot on other projects and switched networks without issues, as I routinely work from home and from the office on the same laptop.

Is the mtime watcher the default on 0.7.0, or did it miss the cut? If it's the default then I think I'm hitting the 100% CPU usage with the mtime watcher after suspending the laptop for > 1 hour while skaffold dev is running.

I see this issue while using a custom file watching system I wrote that's completely independent of mtime and kevent.

My current suspect is the log tailing logic...

On Sun, Jun 17, 2018, 18:36 Chris Corbyn notifications@github.com wrote:

Is the mtime watcher the default on 0.7.0, or did it miss the cut? If it's the default then I think I'm hitting the 100% CPU usage with the mtime watcher after suspending the laptop for > 1 hour while skaffold dev is running.


Oh yeah, I've had the 100% CPU issue using kail for log tailing before. stern didn't have that issue, so I switched to that. Not sure how skaffold is doing it, but you might be onto something there.

I seem to have run into the 100% CPU usage again after around 45 minutes of inactivity.

Interestingly I'm running multiple skaffold processes (microservices architecture) and they all collectively started using 100% CPU at the same time. So something environmental causes it to start happening.

Can we re-open this issue?


Thanks @r2d4 - I'll see if I can pinpoint where the issue is.

I've done some rudimentary debugging (fmt.Println in spots I suspect to be problematic) and it seems that when this issue starts happening, it's in log.go's Start() func.

for {
    select {
    case <-ctx.Done():
        return
    case evt := <-watcher.ResultChan():
        fmt.Println("Got watcher evt")
        if evt.Type != watch.Added && evt.Type != watch.Modified {
            fmt.Println("Wrong evt.Type")
            continue
        }
        // ... rest of the event handling ...
    }
}

It seems to start thrashing there, receiving events in what looks like a tight loop. This matches up with what @ajbouh observed too.

https://github.com/kubernetes/client-go/issues/12

watcher error after 20m

When I create a watch on pods e.g. like this:
...
I have the trouble that after exactly 20 minutes, the ResultChan() receives infinite messages, but they are all empty.

It is expected: the server will time out the watch connection (see the code). Though I'm not sure why it's 20 minutes in your case; it should time out after 30-60 minutes if you didn't change your server's default configs, because the default MinRequestTimeout is 30 minutes.

Example code that uses Watcher correctly: https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/tools/cache/reflector.go#L358

This probably matches what I see. Every time this issue starts happening for me, it's when log output from the containers has been idle for a while (e.g. when I go for lunch).

I think we may need to reopen this.

I am observing this with low activity, also with skaffold dev. I am running v1.9.1 and it's just tailing logs from one pod. It's also watching for changes in a folder to sync (manual FileSync). Any pointers on how to reduce CPU usage?


The full command is:

$ ps -ef | grep skaffold
  501 18378 18350   0  3:19pm ??       113:05.89 skaffold dev --filename skaffold.yml --profile qc --trigger=manual --cleanup=false --auto-build=false --auto-deploy=false --default-repo eu.gcr.io/repo --port-forward=true --status-check=true

Can anyone reopen this issue or point to a more recent one? On the 2018 Macs the fans run at full speed when skaffold dev is left monitoring logs.

@davidfernandezm wow, that's quite high. have you tried running with a different watcher? skaffold dev --trigger=polling may fix your issue temporarily. i'd love to know more about your project though, is the folder that you're watching for sync really big?

Thanks a lot for your response @nkubala. There's one folder being watched that has around 5000 files, it's my suspicion that's the culprit. I will try the polling trigger to see if that has any effect.

This is what my skaffold.yml profile looks like:

...
- name: qc
  build:
    artifacts:
    - image: "skaffold-qc"
      context: '..'
      sync:
        manual:
        - src: 'KVP/**/*.class'
          dest: /usr/local/tomcat/webapps/ROOT
          strip: 'KVP'
        - src: 'KVP/client/**/*.*'
          dest: /usr/local/tomcat/webapps/ROOT
          strip: 'KVP'
        - src: 'deploy/external-configuration/app/external-configuration/*.*'
          dest: /app/external-configuration
          strip: 'deploy/external-configuration/app/external-configuration'
      jib:
        args:
        - "-PskipBuildClient=true"
    local: {}
  deploy:
    helm:
      releases:
      - name: "{{.USER}}-skaffold-qc"
        chartPath: "qumucloud"
        values:
          image: "skaffold-qc"
        setValueTemplates:
          initDb: "{{.INIT_DB}}"
        valuesFiles:
        - "qumucloud/values-local-{{.USER}}.yaml"
        recreatePods: true
      flags:
        upgrade:
        - --install
    # This is the time that skaffold waits before assuming a deployment has timed out.
    # It needed to be increased as QC can sometimes take 3 or more minutes to be built & deployed the first time round.
    statusCheckDeadlineSeconds: 240

and on JIB side:

...
jib {
    from.image = 'eu.gcr.io/qumu-dev/qumu-cloud-base-local:1.0.0'

    extraDirectories.paths = ['deploy/external-configuration',
                              'build/libs', 
                              'deploy/database', 
                              'scripts/job-processor', 
                              'database/liquibase', 
                              'src/main/jib'] 

    extraDirectories.permissions = [
            '/app/entrypoint.sh': '755',
            '/app/restart-tomcat.sh': '755',
            '/liquibase': '755'
    ]

    container.workingDirectory = "/app"
    container.entrypoint = ["/app/entrypoint.sh"]
    container.appRoot = '/usr/local/tomcat/webapps/ROOT'

    skaffold {
        watch {
            includes = [ "KVP/" ]
        }
    }
}

@nkubala thanks a lot for your suggestion, it fixed the CPU issue... now it's sitting at around 20% (it has been running for 3 hours). I still want to confirm, but could you point to some doc or explanation of the differences between manual, polling, and notify, and why manual goes crazy with CPU?

Thanks a lot again!

@davidfernandezm glad to hear that fixed your issue! we have some brief documentation on the trigger types at https://skaffold.dev/docs/workflows/dev/ under "File Watcher and Watch Modes", but this could definitely be improved. I don't have time to do this right now but if you'd like to become a contributor feel free and throw some improvements together :)

FYI it's not manual that goes crazy, it's fsnotify. especially on OS X, when a project has lots and lots of files, the CPU usage skyrockets. the polling trigger simply computes changes in your workspace on a time interval, so it doesn't actually have to "monitor" anything. not as quick, but much less overhead.

I'd love to become a contributor, but I am still learning Go basics. I am actually reading the code and learning a lot. Congrats on such a great product to speed up K8s and cloud adoption in dev environments.

Thanks @davidfernandezm!

I don't know if this is the right place for this, but this is related to File Sync of thousands of files, which won't work at the moment. A block like this:

artifacts:
- image: "skaffold-qc"
  context: '..'
  sync:
    manual:
    - src: 'KVP/**/*.class'
      dest: /usr/local/tomcat/webapps/ROOT
      strip: 'KVP'
    - src: 'KVP/client/**/*.*'
      dest: /usr/local/tomcat/webapps/ROOT
      strip: 'KVP'
    - src: 'deploy/external-configuration/app/external-configuration/*.*'
      dest: /app/external-configuration
      strip: 'deploy/external-configuration/app/external-configuration'

in our scenario is trying to sync 4000+ files and getting:

time="2020-08-03T15:54:49+01:00" level=warning msg="Skipping deploy due to sync error: deleting files: starting command /usr/local/bin/kubectl --context gke_qumu-dev_europe-west1-b_qumu-dev-dev1 exec qumucloud-dfernandez-85dc9c9d85-xn9xk --namespace default -c qumucloud-dfernandez -- rm -rf -- /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/cache/Delegate.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/content/chapter/ChapterApi.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/integration/stoneshot/GetSessionsInvitedForContactResponse.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/SsoUserMapper.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/jwt/JwtSigningException.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/api/dto/AsyncJobDto.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/integration/stone...
-- MANY MORE FILE PATHS --
INF/classes/com/kuluvalley/guru/server/presentation/controller/admin/MetadataTypeEnumController.class /usr/local/tomcat/webapps/ROOT/WEB-INF/classes/com/kuluvalley/guru/server/security/KeyStoreConfig.class: fork/exec /usr/local/bin/kubectl: argument list too long"

key part: /usr/local/bin/kubectl: argument list too long"

Is there any workaround for when such a number of files needs to be deleted/synced?

@davidfernandezm interesting....never seen that one before haha. would you mind opening a new issue to track that?
