What happened: Prow's config-updater plugin failed due to the job configmap being too large
```
{
  insertId: "tey3tsg1971nnw"
  jsonPayload: {
    author: "BenTheElder"
    component: "hook"
    error: "update config map err: ConfigMap "job-config" is invalid: []: Too long: must have at most 1048576 characters"
    event-GUID: "7c77cca0-41db-11e9-83d0-bbff3a601bc3"
    event-type: "pull_request"
    level: "error"
    msg: "Error handling PullRequestEvent."
    org: "kubernetes"
    plugin: "config-updater"
    pr: 11702
    repo: "test-infra"
    url: "https://github.com/kubernetes/test-infra/pull/11702"
  }
  labels: {
    compute.googleapis.com/resource_name: "fluentd-gcp-v3.2.0-56c8g"
    container.googleapis.com/namespace_name: "default"
    container.googleapis.com/pod_name: "hook-587bd464fc-pgdwd"
    container.googleapis.com/stream: "stderr"
  }
  logName: "projects/k8s-prow/logs/hook"
  receiveTimestamp: "2019-03-08T19:51:01.491289557Z"
  resource: {
    labels: {
      cluster_name: "prow"
      container_name: "hook"
      instance_id: "7000980459144515921"
      namespace_id: "default"
      pod_id: "hook-587bd464fc-pgdwd"
      project_id: "k8s-prow"
      zone: "us-central1-f"
    }
    type: "container"
  }
  severity: "ERROR"
  timestamp: "2019-03-08T19:50:55Z"
}
```
What you expected to happen: The config updater should not fail so we can update job configs
How to reproduce it (as minimally and precisely as possible): have a giant job config and merge a PR that updates the config
Please provide links to example occurrences, if any: PR that triggered this: https://github.com/kubernetes/test-infra/pull/11702
/priority critical-urgent
/area prow
some options:
converting to JSON on upload would be a pretty huge win:
```console
$ cat $CONFIG_FILE | dd > /dev/null
21+1 records in
21+1 records out
10814 bytes transferred in 0.002618 secs (4130893 bytes/sec)
$ go run $HOME/yaml2json.go $CONFIG_FILE | dd > /dev/null
10+1 records in
10+1 records out
5569 bytes transferred in 0.742876 secs (7497 bytes/sec)
$ go run $HOME/yaml2yaml.go $CONFIG_FILE | dd > /dev/null
12+1 records in
12+1 records out
6387 bytes transferred in 0.559138 secs (11423 bytes/sec)
```
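For reference, a yaml2json helper like the one benchmarked above only needs a few lines of Go. Here is a rough sketch (not the exact script used above), assuming sigs.k8s.io/yaml for the conversion:

```go
// yaml2json: read a YAML file and print it as compact JSON.
// A rough sketch, not the exact script benchmarked above.
package main

import (
	"fmt"
	"log"
	"os"

	"sigs.k8s.io/yaml" // converts YAML using JSON-compatible semantics
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <config.yaml>", os.Args[0])
	}
	raw, err := os.ReadFile(os.Args[1])
	if err != nil {
		log.Fatalf("reading %s: %v", os.Args[1], err)
	}
	out, err := yaml.YAMLToJSON(raw)
	if err != nil {
		log.Fatalf("converting %s: %v", os.Args[1], err)
	}
	fmt.Println(string(out))
}
```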
You could shard by branch or job type or repo or org...
See how we shard our config:
That would require changing the mounts, though:
(I also have a presubmit job on my config repo that ensures that every file under config/ matches at least one shard, since there is no way to have a negative glob)
Shard validator tool is specific to our config but pretty simple: https://github.com/openshift/ci-operator-prowgen/blob/master/cmd/config-shard-validator/main.go
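The core of that check is tiny; here is a rough sketch of the idea in Go (the shard globs below are placeholders, not the real ones, and the actual tool linked above does more):

```go
// shardcheck: fail if any YAML under config/ is not covered by a shard glob.
// A minimal sketch of the idea; the globs are hypothetical placeholders.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// shardGlobs would mirror whatever globs the config-updater maps use.
var shardGlobs = []string{
	"config/jobs/*/*.yaml",
	"config/jobs/*/*/*.yaml",
}

func main() {
	unmatched := 0
	walkErr := filepath.Walk("config", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".yaml") {
			return err
		}
		for _, glob := range shardGlobs {
			if ok, _ := filepath.Match(glob, path); ok {
				return nil // this file is covered by a shard
			}
		}
		fmt.Printf("%s matches no shard glob\n", path)
		unmatched++
		return nil
	})
	if walkErr != nil {
		fmt.Fprintln(os.Stderr, walkErr)
		os.Exit(1)
	}
	if unmatched > 0 {
		os.Exit(1)
	}
}
```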
circling back on this:
We should be OK for a brief while now: after making the generated jobs denser (JSON), we dropped about 48k from the configmap.
Sharding is a good answer to look at going forward, but we'll have to identify a good way to shard:
```console
[bentheelder@bentheelder:~/go/src/k8s.io/test-infra·2019-03-08T15:43:25-0800·master@d08ab52ca]
$ du -h -d1 config/jobs
292K    config/jobs/kubernetes-sigs
16K     config/jobs/kubernetes-incubator
24K     config/jobs/gke
16K     config/jobs/apache-spark-on-k8s
24K     config/jobs/kubernetes-csi
12K     config/jobs/cadvisor
132K    config/jobs/kubernetes-security
16K     config/jobs/tensorflow
16K     config/jobs/helm
24K     config/jobs/GoogleCloudPlatform
28K     config/jobs/kubeflow
16K     config/jobs/bazelbuild
20K     config/jobs/containerd
1.2M    config/jobs/kubernetes
1.8M    config/jobs
[bentheelder@bentheelder:~/go/src/k8s.io/test-infra·2019-03-08T15:43:27-0800·master@d08ab52ca]
$ du -h -d1 config/jobs/kubernetes
48K     config/jobs/kubernetes/test-infra
12K     config/jobs/kubernetes/enhancements
20K     config/jobs/kubernetes/sig-instrumentation
68K     config/jobs/kubernetes/sig-aws
12K     config/jobs/kubernetes/cloud-provider-openstack
12K     config/jobs/kubernetes/cloud-provider-aws
8.0K    config/jobs/kubernetes/sig-scheduling
8.0K    config/jobs/kubernetes/minikube
120K    config/jobs/kubernetes/sig-gcp
12K     config/jobs/kubernetes/publishing-bot
68K     config/jobs/kubernetes/sig-scalability
12K     config/jobs/kubernetes/cluster-registry
16K     config/jobs/kubernetes/sig-azure
224K    config/jobs/kubernetes/generated
16K     config/jobs/kubernetes/sig-release
12K     config/jobs/kubernetes/community
80K     config/jobs/kubernetes/sig-node
12K     config/jobs/kubernetes/sig-apps
12K     config/jobs/kubernetes/sig-storage
96K     config/jobs/kubernetes/sig-testing
108K    config/jobs/kubernetes/sig-cluster-lifecycle
20K     config/jobs/kubernetes/sig-autoscaling
12K     config/jobs/kubernetes/cloud-provider-gcp
12K     config/jobs/kubernetes/org
12K     config/jobs/kubernetes/cloud-provider-azure
12K     config/jobs/kubernetes/cloud-provider-vsphere
72K     config/jobs/kubernetes/sig-network
12K     config/jobs/kubernetes/sig-api-machinery
32K     config/jobs/kubernetes/sig-cli
12K     config/jobs/kubernetes/release
16K     config/jobs/kubernetes/kops
1.2M    config/jobs/kubernetes
```
In OpenShift we enforce that jobs are in files that match org-repo-branch-jobtype.yaml -- you could do something similar, allowing for custom suffixes or whatever to keep current naming the same. If you do that and enforce it in a presubmit, you can shard simply by job type, branch, whatever.
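A naming rule like that is basically one regexp in a presubmit. A quick illustrative sketch (the pattern is a guess at the shape, not the actual OpenShift rule):

```go
// namecheck: demonstrate an org-repo-branch-jobtype.yaml filename rule.
// Illustrative only; the real rule also allows custom suffixes.
package main

import (
	"fmt"
	"regexp"
)

// jobtype here is assumed to be one of the Prow job types.
var jobFileName = regexp.MustCompile(`^.+-.+-.+-(presubmits|postsubmits|periodics)\.yaml$`)

func main() {
	for _, name := range []string{
		"kubernetes-test-infra-master-presubmits.yaml", // accepted
		"sig-testing.yaml",                             // rejected
	} {
		fmt.Println(name, jobFileName.MatchString(name))
	}
}
```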
gzip the job configs
just throwing this out there:
```console
$ find ./config/jobs/ -name "*.yaml" -exec cat {} + > all-files
$ gzip < all-files > all-files.gz
$ stat --format=%s all-files
997270
$ stat --format=%s all-files.gz
69305
```
The compression ratio is about 14:1.
gzip was considered at length by @cjwagner and me, but it's complicated -- we need to gzip on the input in the plugin as well as on the output by every config loader. Sharding requires 0 code change and works great. That's what we chose for simplicity and I don't really see why y'all can't, either :)
There is only one config loading code path across the components. Your shard checker and file name enforcement each require more code than optional gzip support would require... gzip is rather simple to add.
Some concerned API machinery folks saw this bug and spoke with me about it; gzip was one of their first suggestions (as was JSON). It's a pretty obvious choice given how trivially compressible our data is.
@cjwagner why was gzip rejected? Where was this conversation? Do you want to rearrange the configs and shard?
I don't recall discussing this at length and I'm definitely for it if we can make it work more easily. I was favoring config sharding as a temporary mitigation that didn't require additional code, but if validation requires extra code then that isn't really a selling point.
One concern mentioned was that incremental updates might be harder if the config is gzipped, but I'm not sure that's true?
@stevekuznetsov Is there some complexity to adding optional gzip to the config package that we're missing?
Yeah I thought the downsides to gzip were:
If none of those are blockers, gzip seems OK.
Delaying by a 13x reduction is a pretty significant delay. If we surpass ~ 14.6MB of uncompressed config we need to either rethink the config format or shard the cluster itself.
Fair, shouldn't be much.
Not sure what you mean, do you mean apply to the key contents? We don't do that currently IIRC. If we gzip the files -> keys in configmaps it should be fine for updates.
Yeah, agreed that if the config balloons more than 14x it would be a good time to think about what on earth we're doing lol.
I think no-more-code-added-to-Prow was a strong reason when we chose sharding; it was just the simplest thing.
If we GZIP each file individually we'd not get the full 14x compression though, so I don't think we need that.
If we implement an opt-in gzipping on the config-uploader plugin and transparent extraction in the config loading in Prow, that should be fine?
gzip has no concept of files, just a stream of data, so we'd need to convert them to a document stream and support reading that, or tar the files and then gzip.
The code for individual file gzip read/write is simpler even if we lose a little to extra gzip headers. It also allows us to incrementally migrate keys in place.
Err I meant we would get more compression from doing the entire blob of data at once instead of compressing each key individually? But yes, we could do that as well if we need to do incremental updates
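For the per-key variant, the "transparent extraction" piece is just sniffing the gzip magic bytes when loading. A rough sketch of what that could look like (illustrative only, not the actual Prow code):

```go
// Opt-in gzip for configmap values: compress on upload, transparently
// decompress on load by sniffing the gzip magic bytes.
// Illustrative sketch only, not the actual Prow implementation.
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

// gzipValue compresses one file's contents before it is written into a key.
func gzipValue(raw []byte) ([]byte, error) {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	if _, err := w.Write(raw); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// loadValue returns the plain contents whether or not the key was gzipped,
// which is what allows keys to be migrated incrementally.
func loadValue(data []byte) ([]byte, error) {
	if len(data) < 2 || data[0] != 0x1f || data[1] != 0x8b {
		return data, nil // not gzipped, use as-is
	}
	r, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	in := []byte("periodics:\n- name: example-job\n")
	gz, err := gzipValue(in)
	if err != nil {
		log.Fatal(err)
	}
	out, err := loadValue(gz)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("round-tripped %d -> %d -> %d bytes\n", len(in), len(gz), len(out))
}
```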
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/close
@stevekuznetsov: Closing this issue.
In response to this:
/close