When installing OKD v3.11 with ELK, the fluentd containers keep crashing with state "Crash Loop Back-off".
I first tried to install the cluster with the Ansible version from requirements.txt (installed via pip), but the playbook threw the error "_es_node is undefined" and stopped at the openshift-logging install stage, although it didn't show any failed item.
After downgrading Ansible (via pip) to 2.8.1 (I also tried 2.6, 2.6.2, 2.6.4 and 2.8.4), the fluentd pods can't start.
Since I installed it successfully 4 days ago, I also tried some older commits and switched to a commit from Jan 16. No luck.
$ ansible --version
ansible 2.8.1
config file = /home/ansible/openshift-ansible/ansible.cfg
configured module search path = [u'/home/ansible/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /home/ansible/.local/lib/python2.7/site-packages/ansible
executable location = /home/ansible/.local/bin/ansible
python version = 2.7.5 (default, Aug 7 2019, 00:51:29) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]
$ rpm -q openshift-ansible
openshift-ansible-3.11.165-1
Expected: the fluentd containers should be running.
[root@master2 ~]# oc get pods -n openshift-logging
logging-es-data-master-x4gi8bkf-1-69f6c 2/2 Running 0 21m
logging-fluentd-26fwc 0/1 CrashLoopBackOff 8 21m
logging-fluentd-9qb28 0/1 CrashLoopBackOff 8 21m
logging-fluentd-bnfzv 0/1 CrashLoopBackOff 8 21m
logging-fluentd-d6vbl 0/1 CrashLoopBackOff 8 21m
logging-fluentd-kw8bp 0/1 CrashLoopBackOff 8 21m
logging-fluentd-xbshv 0/1 CrashLoopBackOff 8 21m
logging-fluentd-zcrgz 0/1 CrashLoopBackOff 8 21m
logging-kibana-1-j9mm7 2/2 Running 0 22m
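To track how many pods are affected while trying fixes, the listing above can be summarized with a small helper. This is only a sketch: the function name `count_crashloop` is made up for illustration, and it assumes the default `oc get pods` column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Count pods whose STATUS column (third field) reads CrashLoopBackOff.
# Reads `oc get pods` style output on stdin.
count_crashloop() {
  awk '$3 == "CrashLoopBackOff" { n++ } END { print n + 0 }'
}

# Usage (assumes a logged-in `oc` session):
#   oc get pods -n openshift-logging --no-headers | count_crashloop
```

A result of 0 after a change means no pod in the namespace is in CrashLoopBackOff anymore.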
For some reason, fluentd is in the CrashLoopBackOff state because it can't find /etc/fluent/metrics/tls.crt:
[root@master2 ~]# cat /var/log/fluentd/fluentd.log
2020-01-25 16:24:03 -0300 [error]: unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_sysopen - /etc/fluent/metrics/tls.crt"
2020-01-25 16:24:03 -0300 [error]: suppressed same stacktrace
2020-01-25 16:24:03 -0300 [error]: fluent/log.rb:362:error: unexpected error error_class=Errno::ENOENT error="No such file or directory @ rb_sysopen - /etc/fluent/metrics/tls.crt"
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-prometheus-1.3.0/lib/fluent/plugin/in_prometheus.rb:68:in `read'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluent-plugin-prometheus-1.3.0/lib/fluent/plugin/in_prometheus.rb:68:in `start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:203:in `block in start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:192:in `block (2 levels) in lifecycle'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:191:in `each'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:191:in `block in lifecycle'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:178:in `each'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:178:in `lifecycle'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/root_agent.rb:202:in `start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/engine.rb:274:in `start'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/engine.rb:219:in `run'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:805:in `run_engine'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:549:in `block in run_worker'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:730:in `main_process'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/supervisor.rb:544:in `run_worker'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/lib/fluent/command/fluentd.rb:316:in `<top (required)>'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/share/gems/gems/fluentd-1.4.1/bin/fluentd:8:in `<top (required)>'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `load'
2020-01-25 16:24:03 -0300 [error]: fluent/supervisor.rb:729:main_process: /opt/rh/rh-ruby25/root/usr/local/bin/fluentd:23:in `
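The stack trace is long, but the only actionable part is the path in the Errno::ENOENT message. As a rough sketch, a helper like the one below pulls the distinct missing paths out of the log. The name `missing_files` is invented here, and the pattern assumes the message format shown in this log.

```shell
# Print each distinct path that fluentd reported as missing (Errno::ENOENT).
# Reads a fluentd log on stdin.
missing_files() {
  grep 'Errno::ENOENT' \
    | sed -n 's/.*rb_sysopen - \([^"]*\)".*/\1/p' \
    | sort -u
}

# Usage on a node:
#   missing_files < /var/log/fluentd/fluentd.log
```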
[root@master2 ~]# cat /etc/redhat-release
CentOS Linux release 7.7.1908 (Core)
[OSEv3:vars]
ansible_ssh_user=ansible
ansible_become=true
openshift_deployment_type=origin
openshift_release=v3.11
openshift_master_cluster_method=native
openshift_portal_net=10.1.128.0/18
osm_cluster_network_cidr=10.1.0.0/17
osm_host_subnet_length=9
openshift_use_calico=True
openshift_use_openshift_sdn=False
os_sdn_network_plugin_name='cni'
openshift_console_install=true
openshift_console_hostname=console.openshift.local
openshift_master_cluster_hostname=okd-int.openshift.local
openshift_master_cluster_public_hostname=okd.openshift.local
openshift_master_default_subdomain=apps.openshift.local
openshift_disable_check=disk_availability,docker_storage,memory_availability,docker_image_availability
openshift_use_crio=True
openshift_use_crio_only=False
openshift_crio_enable_docker_gc=True
openshift_docker_options=--bip 10.0.0.1/24 --log-opt max-size=100M --log-opt max-file=3 --insecure-registry 10.1.128.0/17 --insecure-registry 10.0.0.0/24 --log-driver=json-file
openshift_master_identity_providers=[{'name': 'Local Authentication', 'login': 'true', 'challenge': 'true', 'kind': 'HTPasswdPasswordIdentityProvider'}]
openshift_master_htpasswd_users={'admin': '$apr1$4qJTE7h9$j9zVzh43pFMjaCa/wuVlY.', 'developer': '$apr1$VtHG.FnT$b0XJ3355yxtDzqtiwb7Ag/' }
os_firewall_use_firewalld=true
openshift_hosted_registry_cert_expire_days=3650
openshift_ca_cert_expire_days=5475
openshift_node_cert_expire_days=3650
openshift_master_cert_expire_days=3650
etcd_ca_default_days=5475
openshift_master_dynamic_provisioning_enabled=true
openshift_enable_unsupported_configurations=True
openshift_hosted_registry_storage_kind=nfs
openshift_hosted_registry_storage_access_modes=['ReadWriteMany']
openshift_hosted_registry_storage_host=nfs.openshift.local
openshift_hosted_registry_storage_nfs_directory=/exports
openshift_hosted_registry_storage_nfs_options='*(rw,root_squash)'
openshift_hosted_registry_storage_volume_name=registry
openshift_hosted_registry_storage_volume_size=40Gi
openshift_metrics_install_metrics=true
openshift_metrics_hawkular_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_cassandra_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_heapster_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_metrics_storage_kind=nfs
openshift_metrics_storage_access_modes=['ReadWriteOnce']
openshift_metrics_storage_host=nfs.openshift.local
openshift_metrics_storage_nfs_directory=/exports
openshift_metrics_storage_volume_name=metrics
openshift_metrics_storage_volume_size=20Gi
openshift_logging_install_logging=true
openshift_logging_storage_kind=nfs
openshift_logging_kibana_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_curator_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_es_nodeselector={"node-role.kubernetes.io/infra": "true"}
openshift_logging_storage_access_modes=['ReadWriteOnce']
openshift_logging_storage_nfs_options='*(rw,root_squash)'
openshift_logging_storage_host=nfs.openshift.local
openshift_logging_storage_nfs_directory=/exports
openshift_logging_storage_volume_name=logging
openshift_logging_storage_volume_size=15Gi
openshift_logging_elasticsearch_storage_type=pvc
openshift_logging_es_pvc_size=15Gi
openshift_logging_es_pvc_storage_class_name=''
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_prefix=logging
openshift_logging_es_memory_limit=2Gi
openshift_node_groups=[{'name': 'node-config-master-crio', 'labels': ['node-role.kubernetes.io/master=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-infra-crio', 'labels': ['node-role.kubernetes.io/infra=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-compute-crio', 'labels': ['node-role.kubernetes.io/compute=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-master-infra-crio', 'labels': ['node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/compute=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-all-in-one-crio', 'labels': ['node-role.kubernetes.io/infra=true', 
'node-role.kubernetes.io/compute=true' ,'node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/compute=true', 'runtime=cri-o'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]},{'name': 'node-config-all-in-one', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true', 'node-role.kubernetes.io/compute=true']}, {'name': 'node-config-master-infra', 'labels': ['node-role.kubernetes.io/master=true', 'node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute', 'labels': ['node-role.kubernetes.io/compute=true']}, {'name': 'node-config-master', 'labels': ['node-role.kubernetes.io/master=true']}, {'name': 'node-config-infra', 'labels': ['node-role.kubernetes.io/infra=true']}, {'name': 'node-config-compute-crio-prod', 'labels': ['node-role.kubernetes.io/compute=true', 'runtime=cri-o', 'node-role.kubernetes.io/environment=prod'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.runtime-request-timeout','value': ['10m'] }]}, {'name': 'node-config-compute-crio-stage', 'labels': ['node-role.kubernetes.io/compute=true', 'runtime=cri-o', 'node-role.kubernetes.io/environment=stage', 'node-role.kubernetes.io/build=true'], 'edits': [{ 'key': 'kubeletArguments.container-runtime','value': ['remote'] }, { 'key': 'kubeletArguments.container-runtime-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 'kubeletArguments.image-service-endpoint','value': ['/var/run/crio/crio.sock'] }, { 'key': 
'kubeletArguments.runtime-request-timeout','value': ['10m'] }]}]
osm_default_node_selector='node-role.kubernetes.io/environment=stage'
openshift_builddefaults_nodeselectors={'node-role.kubernetes.io/build': 'true'}
[nfs]
nfs.openshift.local
[etcd]
master1.openshift.local
master2.openshift.local
master3.openshift.local
[masters]
master1.openshift.local
master2.openshift.local
master3.openshift.local
[lb]
lb.openshift.local
[nodes]
master[1:3].openshift.local openshift_node_group_name='node-config-master-infra-crio'
node[1:2].openshift.local openshift_node_group_name='node-config-compute-crio-prod'
node[3:4].openshift.local openshift_node_group_name='node-config-compute-crio-stage'
I noticed it starts to happen as soon as Elasticsearch rolls out.
If I stop the ES pod, the fluentd containers start running again. Is it related to the ES memory limit?
It seems the fluentd image is broken. I downgraded it to the image tagged v3.10 and the container is now running. I'm not sure which GitHub repository is responsible for this image.
Since I have another setup that runs with no issues, I looked up the image digest there and pulled that image onto my new system (where the pods are crashing). I then changed the daemonset to the "v3.10" image tag so that I could delete the image tagged v3.11 and reuse that tag for the image I had pulled.
After that I changed the daemonset back to the v3.11 tag and deleted the v3.10 image.
yum install -y podman
crictl pull openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
Now change the daemonset to use the image with the "v3.10" tag and wait for all pods to terminate, so that you can delete the image tagged v3.11 once it is no longer in use:
[root@master1 ~]# crictl images | grep fluentd
docker.io/openshift/origin-logging-fluentd v3.11 33b86066482f9 480MB
[root@master1 ~]# crictl rmi 33b86066482f9
podman tag 33b86066482f9 docker.io/openshift/origin-logging-fluentd:v3.11
Now change the daemonset back to the image with the "v3.11" tag and wait for all pods to terminate, so that you can delete the image tagged "v3.10" once it is no longer in use:
[root@master1 ~]# crictl images | grep fluentd
docker.io/openshift/origin-logging-fluentd v3.10 33b86066482f9 480MB
[root@master1 ~]# crictl rmi 33b86066482f9
Finally it's working :)
I encountered this error when I restarted fluentd on OKD 3.11 and a new image was pulled (tag v3.11 was recently updated).
I was able to work around it by adding METRICS_CERT and METRICS_KEY to the DaemonSet definition, either after deployment or beforehand in the template at roles/openshift_logging_fluentd/templates/fluentd.j2
It seems that the following change is involved:
https://github.com/openshift/origin-aggregated-logging/pull/1565
This is only a workaround though. Someone involved in the change above should be able to give more insight for a proper fix.
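For the "after deployment" variant, one possible way to add the two variables is `oc set env`. The sketch below is an assumption, not the exact change made above: the thread does not say which values METRICS_CERT and METRICS_KEY should hold, so the function (a made-up helper) takes them as arguments and, by default, only prints the command it would run.

```shell
# Build (and optionally run) the `oc set env` call that adds the two variables
# to the logging-fluentd DaemonSet. The values are NOT given in this thread;
# pass whatever matches your deployment.
set_fluentd_metrics_env() {
  cert_value=$1
  key_value=$2
  set -- oc -n openshift-logging set env daemonset/logging-fluentd \
    "METRICS_CERT=$cert_value" "METRICS_KEY=$key_value"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"   # default: only print the command
  else
    "$@"        # DRY_RUN=0: actually run it
  fi
}

# Usage (placeholders, not real values):
#   set_fluentd_metrics_env <cert-value> <key-value>
#   DRY_RUN=0 set_fluentd_metrics_env <cert-value> <key-value>
```

Changing the environment of the DaemonSet triggers a rollout of the fluentd pods, so check the pod status again afterwards.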
It could be related, but this pull request is from 10 months ago and the image was working at least until 21 Jan. At some point after that the image broke.
I see they merged it about 10 months ago; I don't know whether it is the cause.
@uselessidbr
Thank you for your information.
I was able to run openshift-logging-fluentd with the following commands on the master.
oc login -u system:admin
oc -n openshift-logging set image daemonset logging-fluentd *=openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
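Pinning by digest is what makes this workaround stable: a tag such as v3.11 can be re-pushed, while a sha256 digest always refers to the same image. A small check like the following (a sketch; `is_digest_pinned` is a hypothetical helper name) tells whether a given image reference is digest-pinned.

```shell
# Succeeds (exit 0) when an image reference is pinned by sha256 digest,
# fails when it uses a tag or no digest at all.
is_digest_pinned() {
  printf '%s\n' "$1" | grep -Eq '@sha256:[0-9a-f]{64}$'
}

# Usage (assumes a logged-in `oc` session and a single-container DaemonSet):
#   image=$(oc -n openshift-logging get daemonset logging-fluentd \
#     -o jsonpath='{.spec.template.spec.containers[0].image}')
#   is_digest_pinned "$image" && echo "pinned" || echo "tag-based"
```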
I have the same problem (OKD 3.11); only the 3.10 image works.
We are investigating but you can always work around the issue by building the image yourself from:
https://github.com/openshift/origin-aggregated-logging/blob/release-3.11/fluentd/Dockerfile.centos7
$fluentdir>docker build -t openshift/logging-fluentd:v3.11 -f Dockerfile.centos7 .
The latest image had the same digest as the 3.11 and 4.x tags. Just make sure they are decoupled, to guarantee that 3.11 will not be updated when latest is updated.
oc -n openshift-logging set image daemonset logging-fluentd *=openshift/origin-logging-fluentd@sha256:c26a9c143c76e32982c4acb520b9f468b303e5df4f0f8873364acae0f5083e0a
It worked for me as well. Thank you.
Does the problem persist in new deployments? The v3.11 image digest is still
the same as v4.0 and latest.