What happened: Prow's config-updater plugin failed due to the job configmap being too large
```
{
  insertId: "tey3tsg1971nnw"
  jsonPayload: {
    author: "BenTheElder"
    component: "hook"
    error: "update config map err: ConfigMap "job-config" is invalid: []: Too long: must have at most 1048576 characters"
    event-GUID: "7c77cca0-41db-11e9-83d0-bbff3a601bc3"
    event-type: "pull_request"
    level: "error"
    msg: "Error handling PullRequestEvent."
    org: "kubernetes"
    plugin: "config-updater"
    pr: 11702
    repo: "test-infra"
    url: "https://github.com/kubernetes/test-infra/pull/11702"
  }
  labels: {
    compute.googleapis.com/resource_name: "fluentd-gcp-v3.2.0-56c8g"
    container.googleapis.com/namespace_name: "default"
    container.googleapis.com/pod_name: "hook-587bd464fc-pgdwd"
    container.googleapis.com/stream: "stderr"
  }
  logName: "projects/k8s-prow/logs/hook"
  receiveTimestamp: "2019-03-08T19:51:01.491289557Z"
  resource: {
    labels: {
      cluster_name: "prow"
      container_name: "hook"
      instance_id: "7000980459144515921"
      namespace_id: "default"
      pod_id: "hook-587bd464fc-pgdwd"
      project_id: "k8s-prow"
      zone: "us-central1-f"
    }
    type: "container"
  }
  severity: "ERROR"
  timestamp: "2019-03-08T19:50:55Z"
}
```
What you expected to happen: The config updater should not fail so we can update job configs
How to reproduce it (as minimally and precisely as possible): have a giant job config and merge a PR that updates the config
Please provide links to example occurrences, if any: PR that triggered this: https://github.com/kubernetes/test-infra/pull/11702
/priority critical-urgent
/area prow
some options:
converting to JSON on upload would be a pretty huge win:
```console
$ cat $CONFIG_FILE | dd > /dev/null
21+1 records in
21+1 records out
10814 bytes transferred in 0.002618 secs (4130893 bytes/sec)
$ go run $HOME/yaml2json.go $CONFIG_FILE | dd > /dev/null
10+1 records in
10+1 records out
5569 bytes transferred in 0.742876 secs (7497 bytes/sec)
$ go run $HOME/yaml2yaml.go $CONFIG_FILE | dd > /dev/null
12+1 records in
12+1 records out
6387 bytes transferred in 0.559138 secs (11423 bytes/sec)
```
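For reference, a yaml2json helper like the one benchmarked above only needs a few lines of Go. Here is a rough sketch (not the exact script used above), assuming sigs.k8s.io/yaml for the conversion:

```go
// yaml2json: read a YAML file and print it as compact JSON.
// A rough sketch, not the exact script benchmarked above.
package main

import (
	"fmt"
	"log"
	"os"

	"sigs.k8s.io/yaml" // converts YAML using JSON-compatible semantics
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <config.yaml>", os.Args[0])
	}
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatalf("reading %s: %v", os.Args[1], err)
	}
	out, err := yaml.YAMLToJSON(raw)
	if err != nil {
		log.Fatalf("converting %s: %v", os.Args[1], err)
	}
	fmt.Println(string(out))
}
```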
You could shard by branch or job type or repo or org...
See how we shard our config:
That would require changing the mounts, though:
(I also have a presubmit job on my config repo that ensures that every file under config/ matches at least one shard, since there is no way to have a negative glob)
Shard validator tool is specific to our config but pretty simple: https://github.com/openshift/ci-operator-prowgen/blob/master/cmd/config-shard-validator/main.go
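The core of that check is tiny; here is a rough sketch of the idea in Go (the shard globs below are placeholders, not the real ones, and the actual tool linked above does more):

```go
// shardcheck: fail if any YAML under config/ is not covered by a shard glob.
// A minimal sketch of the idea; the globs are hypothetical placeholders.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// shardGlobs would mirror whatever globs the config-updater maps use.
var shardGlobs = []string{
	"config/jobs/*/*.yaml",
	"config/jobs/*/*/*.yaml",
}

func main() {
	unmatched := 0
	walkErr := filepath.Walk("config", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		for _, glob := range shardGlobs {
			if ok, _ := filepath.Match(glob, path); ok {
				return nil // this file is covered by a shard
			}
		}
		fmt.Printf("%s matches no shard glob\n", path)
		unmatched++
		return nil
	})
	if walkErr != nil {
		fmt.Fprintln(os.Stderr, walkErr)
		os.Exit(1)
	}
	if unmatched > 0 {
		os.Exit(1)
	}
}
```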
circling back on this:
We should be OK for a brief while now: after making the generated jobs denser (JSON), we dropped about 48k from the configmap.
Sharding is a good answer to look at going forward, but we'll have to identify a good way to shard:
```console
[bentheelder@bentheelder:~/go/src/k8s.io/test-infra·2019-03-08T15:43:25-0800·master@d08ab52ca]
$ du -h -d1 config/jobs
292K    config/jobs/kubernetes-sigs
16K     config/jobs/kubernetes-incubator
24K     config/jobs/gke
16K     config/jobs/apache-spark-on-k8s
24K     config/jobs/kubernetes-csi
12K     config/jobs/cadvisor
132K    config/jobs/kubernetes-security
16K     config/jobs/tensorflow
16K     config/jobs/helm
24K     config/jobs/GoogleCloudPlatform
28K     config/jobs/kubeflow
16K     config/jobs/bazelbuild
20K     config/jobs/containerd
1.2M    config/jobs/kubernetes
1.8M    config/jobs
[bentheelder@bentheelder:~/go/src/k8s.io/test-infra·2019-03-08T15:43:27-0800·master@d08ab52ca]
$ du -h -d1 config/jobs/kubernetes
48K     config/jobs/kubernetes/test-infra
12K     config/jobs/kubernetes/enhancements
20K     config/jobs/kubernetes/sig-instrumentation
68K     config/jobs/kubernetes/sig-aws
12K     config/jobs/kubernetes/cloud-provider-openstack
12K     config/jobs/kubernetes/cloud-provider-aws
8.0K    config/jobs/kubernetes/sig-scheduling
8.0K    config/jobs/kubernetes/minikube
120K    config/jobs/kubernetes/sig-gcp
12K     config/jobs/kubernetes/publishing-bot
68K     config/jobs/kubernetes/sig-scalability
12K     config/jobs/kubernetes/cluster-registry
16K     config/jobs/kubernetes/sig-azure
224K    config/jobs/kubernetes/generated
16K     config/jobs/kubernetes/sig-release
12K     config/jobs/kubernetes/community
80K     config/jobs/kubernetes/sig-node
12K     config/jobs/kubernetes/sig-apps
12K     config/jobs/kubernetes/sig-storage
96K     config/jobs/kubernetes/sig-testing
108K    config/jobs/kubernetes/sig-cluster-lifecycle
20K     config/jobs/kubernetes/sig-autoscaling
12K     config/jobs/kubernetes/cloud-provider-gcp
12K     config/jobs/kubernetes/org
12K     config/jobs/kubernetes/cloud-provider-azure
12K     config/jobs/kubernetes/cloud-provider-vsphere
72K     config/jobs/kubernetes/sig-network
12K     config/jobs/kubernetes/sig-api-machinery
32K     config/jobs/kubernetes/sig-cli
12K     config/jobs/kubernetes/release
16K     config/jobs/kubernetes/kops
1.2M    config/jobs/kubernetes
```
In OpenShift we enforce that jobs are in files that match org-repo-branch-jobtype.yaml -- you could do something similar, allowing for custom suffixes or whatever to keep current naming the same. If you do that and enforce it in a presubmit, you can shard simply by job type, branch, whatever.
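A naming rule like that is basically one regexp in a presubmit. A quick illustrative sketch (the pattern is a guess at the shape, not the actual OpenShift rule):

```go
// namecheck: demonstrate an org-repo-branch-jobtype.yaml filename rule.
// Illustrative only; the real rule also allows custom suffixes.
package main

import (
	"fmt"
	"regexp"
)

// jobtype here is assumed to be one of the Prow job types.
var jobFileName = regexp.MustCompile(`^.+-.+-.+-(presubmits|postsubmits|periodics)\.yaml$`)

func main() {
	for _, name := range []string{
		"kubernetes-test-infra-master-presubmits.yaml", // accepted
		"sig-testing.yaml",                             // rejected
	} {
		fmt.Println(name, jobFileName.MatchString(name))
	}
}
```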
gzip the job configs
just throwing this out there:
```console
$ find ./config/jobs/ -name "*.yaml" -exec cat {} + > all-files
$ gzip < all-files > all-files.gz
$ stat --format=%s all-files
997270
$ stat --format=%s all-files.gz
69305
```
The compression ratio is about 14:1.
gzip was considered at length by @cjwagner and me, but it's complicated -- we need to gzip on the input in the plugin as well as on the output by every config loader. Sharding requires 0 code change and works great. That's what we chose for simplicity and I don't really see why y'all can't, either :)
There is only one config loading code path across the components. Your shard checker and file name enforcement each require more code than optional gzip support would require... gzip is rather simple to add.
Some concerned API machinery folks saw this bug and spoke with me about it; gzip was one of their first suggestions (as was JSON). It's a pretty obvious choice given how trivially compressible our data is.
@cjwagner why was gzip rejected? Where was this conversation? Do you want to rearrange the configs and shard?
I don't recall discussing this at length and I'm definitely for it if we can make it work more easily. I was favoring config sharding as a temporary mitigation that didn't require additional code, but if validation requires extra code then that isn't really a selling point.
One concern mentioned was that incremental updates might be harder if the config is gzipped, but I'm not sure that's true?
@stevekuznetsov Is there some complexity to adding optional gzip to the config package that we're missing?
Yeah I thought the downsides to gzip were:
If none of those are blockers, gzip seems OK.
Delaying by a 13x reduction is a pretty significant delay. If we surpass ~ 14.6MB of uncompressed config we need to either rethink the config format or shard the cluster itself.
Fair, shouldn't be much.
Not sure what you mean, do you mean apply to the key contents? We don't do that currently IIRC. If we gzip the files -> keys in configmaps it should be fine for updates.
Yeah, agreed that if the config balloons more than 14x it would be a good time to think about what on earth we're doing lol.
I think no-more-code-added-to-Prow was a strong reason when we chose sharding; it was just the simplest thing.
If we GZIP each file individually we'd not get the full 14x compression though, so I don't think we need that.
If we implement an opt-in gzipping on the config-uploader plugin and transparent extraction in the config loading in Prow, that should be fine?
gzip has no concept of files, just a stream of data, so we'd need to convert them to a document stream and support reading that, or tar the files and then gzip.
The code for individual file gzip read/write is simpler even if we lose a little to extra gzip headers. It also allows us to incrementally migrate keys in place.
Err I meant we would get more compression from doing the entire blob of data at once instead of compressing each key individually? But yes, we could do that as well if we need to do incremental updates
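For the per-key variant, the "transparent extraction" piece is just sniffing the gzip magic bytes when loading. A rough sketch of what that could look like (illustrative only, not the actual Prow code):

```go
// Opt-in gzip for configmap values: compress on upload, transparently
// decompress on load by sniffing the gzip magic bytes.
// Illustrative sketch only, not the actual Prow implementation.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gzipValue compresses one file's contents before it is written into a key.
func gzipValue(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(raw); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// loadValue returns the plain contents whether or not the key was gzipped,
// which is what allows keys to be migrated incrementally.
func loadValue(data []byte) ([]byte, error) {
	if len(data) < 2 || data[0] != 0x1f || data[1] != 0x8b {
		return data, nil // not gzipped, use as-is
	}
	r, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	in := []byte("periodics:\n- name: example-job\n")
	gz, err := gzipValue(in)
	if err != nil {
		log.Fatal(err)
	}
	out, err := loadValue(gz)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("round-tripped %d -> %d -> %d bytes\n", len(in), len(gz), len(out))
}
```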
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/close
@stevekuznetsov: Closing this issue.
In response to this:
/close