Calico: pod cannot access another pod on different node

Created on 31 Jan 2019 · 9 Comments · Source: projectcalico/calico

Description

I deploy the cluster with the inventory below:

[OSEv3:children]
masters
nodes
etcd
lb

[OSEv3:vars]

# Ensure these variables are set for bootstrap

# openshift required configuration variables follow
ansible_user=root
ansible_become=yes
openshift_deployment_type=origin
openshift_release="3.11"
openshift_version="3.11"
openshift_master_default_subdomain=apps.xxx.com
openshift_master_cluster_hostname=openshift.xxx.com
openshift_master_cluster_public_hostname=openshift.xxx.com
openshift_master_public_console_url=https://openshift.xxx.com:8443/console
openshift_master_public_api_url=https://openshift.xxx.com:8443
openshift_master_api_port=8443
openshift_master_console_port=8443
openshift_clusterid=xxx
debug_level=2
openshift_disable_check=memory_availability,disk_availability,docker_image_availability
containerized=true

template_service_broker_install=True
ansible_service_broker_install=True
ansible_service_broker_remove=False
openshift_console_install=True

# openshift_deployment_type is required for installation
openshift_cloudprovider_kind=openstack
openshift_use_calico=True
openshift_use_openshift_sdn=False
os_sdn_network_plugin_name="cni"
osm_cluster_network_cidr=10.1.0.0/16

openshift_storageclass_default="true"
openshift_storageclass_name="standard"
openshift_storageclass_provisioner="cinder"
openshift_storageclass_parameters={"fstype": "xfs", "type": "sata"}

openshift_cloudprovider_openstack_auth_url="https://console.xxx.com:13000/v2.0"
openshift_cloudprovider_openstack_username="xxx"
openshift_cloudprovider_openstack_password="xxx"
openshift_cloudprovider_openstack_tenant_name="xxx"

# hosted registry swift
openshift_hosted_registry_storage_kind=object
openshift_hosted_registry_storage_provider=swift
openshift_hosted_registry_storage_swift_container="xxx"
openshift_hosted_registry_storage_swift_authurl="https://console.xxx.com:13000/v2.0"
openshift_hosted_registry_storage_swift_username="xxx"
openshift_hosted_registry_storage_swift_password="xxx"
openshift_hosted_registry_storage_swift_tenant="xxx"

# prometheus
openshift_cluster_monitoring_operator_install=True
openshift_cluster_monitoring_operator_prometheus_storage_enabled=True
openshift_cluster_monitoring_operator_prometheus_storage_capacity=100Gi
openshift_cluster_monitoring_operator_alertmanager_storage_enabled=True
openshift_cluster_monitoring_operator_alertmanager_storage_capacity=10Gi

# openshift metrics
openshift_metrics_install_metrics=True
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_storage_volume_size=50Gi

# ELK logging
openshift_logging_install_logging=True
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_ops_memory_limit=4Gi
openshift_logging_es_pvc_size=200Gi
openshift_logging_es_nodeselector={"node-role.kubernetes.io/compute":"true","staging":"true"}

# cluster specific settings may be placed here
[masters]
openshift-master-1

[etcd]
openshift-master-1

[lb]

[nodes]
openshift-compute-1 openshift_node_group_name='node-config-compute' openshift_schedulable=True
openshift-compute-2 openshift_node_group_name='node-config-compute' openshift_schedulable=True
openshift-master-1 openshift_node_group_name='node-config-master' openshift_schedulable=True
openshift-infra-1 openshift_node_group_name='node-config-infra' openshift_schedulable=True
openshift-infra-2 openshift_node_group_name='node-config-infra' openshift_schedulable=True

The pod "prometheus-k8s" on openshift-infra-1 cannot reach the pod "alertmanager-main" on openshift-infra-2, so the "prometheus-k8s" pod keeps reporting errors. The error message is ""

Version

The OpenShift version is v3.11.

Troubleshooting

I did some basic troubleshooting on the nodes and found the following:

On openshift-infra-1:

[root@openshift-infra-1 ~]# ip r | grep blackhole
blackhole 10.1.89.0/26 proto bird
[root@openshift-infra-1 ~]# ip a | grep tun
7: tunl0@NONE: <NOARP,UP,LOWER_UP> mtu 1440 qdisc noqueue state UNKNOWN group default qlen 1000
inet 10.1.89.0/32 brd 10.1.89.0 scope global tunl0

In openshift-infra-1 calico node container:

cat /etc/calico/confd/config/bird_aggr.cfg

# Generated by confd
# ------------- Static black hole addresses -------------
protocol static {
  route 10.1.89.0/26 blackhole;
}

# Aggregation of routes on this host; export the block, nothing beneath it.
function calico_aggr ()
{
  # Block 10.1.89.0/26 is confirmed
  if ( net = 10.1.89.0/26 ) then { accept; }
  if ( net ~ 10.1.89.0/26 ) then { reject; }
}
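The two rules in calico_aggr can be read as "export the /26 block itself, suppress anything more specific inside it". A minimal Python sketch of that reading follows. It assumes BIRD's `net ~ prefix` means "net is contained in prefix", and the `"no match"` return value is only an illustration of falling through to whatever filter runs next.

```python
import ipaddress

# The block assigned to openshift-infra-1, taken from bird_aggr.cfg above.
BLOCK = ipaddress.ip_network("10.1.89.0/26")


def calico_aggr(net: str) -> str:
    """Mimic the two filter rules for one candidate route."""
    candidate = ipaddress.ip_network(net)
    if candidate == BLOCK:            # if ( net = 10.1.89.0/26 ) then { accept; }
        return "accept"
    if candidate.subnet_of(BLOCK):    # if ( net ~ 10.1.89.0/26 ) then { reject; }
        return "reject"
    return "no match"                 # assumption: a later filter decides


for route in ("10.1.89.0/26", "10.1.89.5/32", "10.1.12.0/26"):
    print(route, calico_aggr(route))
```

Under this reading, other nodes learn a single 10.1.89.0/26 route for openshift-infra-1 rather than one route per pod.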

In etcd,

[root@openshift-master-1 ~]# etcdctl --cert=/etc/etcd/peer.crt --key=/etc/etcd/peer.key --cacert=/etc/etcd/ca.crt --endpoints=[https://192.168.1.12:2379] get /calico/resources/v3/projectcalico.org/nodes/openshift-infra-1
/calico/resources/v3/projectcalico.org/nodes/openshift-infra-1
{"kind":"Node","apiVersion":"projectcalico.org/v3","metadata":{"name":"openshift-infra-1","uid":"7cdfa494-fef5-11e8-95c0-fa163e056d1f","creationTimestamp":"2018-12-13T16:38:15Z"},"spec":{"bgp":{"ipv4Address":"192.168.1.17/24","ipv4IPIPTunnelAddr":"10.1.89.0"},"orchRefs":[{"nodeName":"openshift-infra-1","orchestrator":"k8s"}]}}
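To compare this value across nodes, the Node record can be parsed rather than read by eye. The sketch below is only an illustration: it uses the JSON returned above pasted in as a string, and the variable names are mine.

```python
import json

# Node record for openshift-infra-1, copied from the etcdctl output above.
node_json = (
    '{"kind":"Node","apiVersion":"projectcalico.org/v3",'
    '"metadata":{"name":"openshift-infra-1",'
    '"uid":"7cdfa494-fef5-11e8-95c0-fa163e056d1f",'
    '"creationTimestamp":"2018-12-13T16:38:15Z"},'
    '"spec":{"bgp":{"ipv4Address":"192.168.1.17/24",'
    '"ipv4IPIPTunnelAddr":"10.1.89.0"},'
    '"orchRefs":[{"nodeName":"openshift-infra-1","orchestrator":"k8s"}]}}'
)

node = json.loads(node_json)
bgp = node["spec"]["bgp"]

# Host address used for BGP peering, and the address assigned to tunl0.
print(node["metadata"]["name"], bgp["ipv4Address"], bgp["ipv4IPIPTunnelAddr"])
```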

I suspect that ipv4IPIPTunnelAddr cannot be set to "10.1.89.0" and should instead be in 10.1.89.[1-254]. Can anyone help diagnose this problem? I could not find where ipv4IPIPTunnelAddr gets written into etcd.


All 9 comments

I suspect that ipv4IPIPTunnelAddr cannot be set to "10.1.89.0" and should instead be in 10.1.89.[1-254]. Can anyone help diagnose this problem? I could not find where ipv4IPIPTunnelAddr gets written into etcd.

It should be OK for that address to end in .0.
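A quick address-arithmetic check is consistent with that answer. This is a sketch using only values from the issue; it shows nothing about Calico's IPAM itself. The tunnel address sits inside the node's /26 block, and because tunl0 carries it as a /32, the interface has no separate network or broadcast address for ".0" to collide with.

```python
import ipaddress

cluster_cidr = ipaddress.ip_network("10.1.0.0/16")   # osm_cluster_network_cidr
block = ipaddress.ip_network("10.1.89.0/26")         # blackhole route on openshift-infra-1
tunnel = ipaddress.ip_interface("10.1.89.0/32")      # tunl0 address on openshift-infra-1

print(block.subnet_of(cluster_cidr))   # is the block carved from the cluster CIDR?
print(tunnel.ip in block)              # is the tunnel address inside the node's block?
print(tunnel.network.num_addresses)    # how many addresses the /32 interface network holds
```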

Have you confirmed that pods->pod on the same host works OK?

A few other things to check:

  • Do you see the expected /26 routes on each node, routing traffic between k8s nodes?
  • Have you configured your underlying network (e.g., AWS) to allow Calico's use of IPIP?
  • Can you verify where packets are getting dropped? i.e., iptables on the sending host, the underlying network, iptables on the remote host, somewhere else?
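For the first check, one way to summarize a node's routing table is to separate the local blackhole block from the per-node routes that go out over tunl0. The sketch below parses text in the shape of `ip route` output. Only the blackhole line in the sample comes from this issue; the other sample lines, the 192.168.1.18 next hop, and the function name are hypothetical.

```python
def calico_routes(ip_route_output):
    """Split `ip route` text into local blackhole blocks and remote tunl0 routes."""
    local_blocks, remote_blocks = [], []
    for line in ip_route_output.splitlines():
        tokens = line.split()
        if len(tokens) < 2:
            continue
        if tokens[0] == "blackhole":
            # Block owned by this node, e.g. "blackhole 10.1.89.0/26 proto bird".
            local_blocks.append(tokens[1])
        elif "via" in tokens and "dev" in tokens:
            device = tokens[tokens.index("dev") + 1]
            if device == "tunl0":
                # Route to another node's block through the IPIP tunnel.
                next_hop = tokens[tokens.index("via") + 1]
                remote_blocks.append((tokens[0], next_hop))
    return local_blocks, remote_blocks


# Hypothetical sample; only the blackhole line is taken from the issue.
SAMPLE = """\
default via 192.168.1.1 dev eth0
blackhole 10.1.89.0/26 proto bird
10.1.12.0/26 via 192.168.1.18 dev tunl0 proto bird onlink
"""

local_blocks, remote_blocks = calico_routes(SAMPLE)
print("local blocks:", local_blocks)
print("remote blocks:", remote_blocks)
```

On a healthy node one would expect one remote entry per other node in the cluster; a missing entry points at BGP peering rather than at the data path.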

@caseydavenport ,

Have you confirmed that pods->pod on the same host works OK?

I confirm that pods on the same host can reach each other.

A few other things to check:

Do you see the expected /26 routes on each node, routing traffic between k8s nodes?

yes

Have you configured your underlying network (e.g., AWS) to allow Calico's use of IPIP?

Yes, I deployed Calico with IPIP on OpenStack VMs, and the underlying network ACLs and security groups are correctly configured.

Can you verify where packets are getting dropped? i.e., iptables on the sending host, the underlying network, iptables on the remote host, somewhere else?

Yes, I am sure the packets are received on the host's physical NICs and dropped somewhere in the routing stage, but I am not sure exactly where (pre-routing, forwarding, or ...). Just one host cannot reach all other hosts and vice versa.

By the way, I first deleted the infra-1 node together with all of its routing IP configuration in etcd (related to the 10.1.89.0/26 blackhole) and joined the node again. I got the same routing IP, 10.1.89.0/26, and the network issue remained. I then did the same thing again, but without deleting the routing IP configuration in etcd, so I got a new routing IP, 10.1.89.19/26, and the network issue disappeared. Can anyone provide more information or documentation to help me troubleshoot this issue? Thanks.

@mslovy Is this still a problem?

You said

Just one host cannot reach all other hosts and vice versa.

Is that a separate problem or part of the problem where pods on one host cannot reach pods on another host?

That is part of the problem.

If one host cannot reach the other hosts then you need to resolve that before you should expect pod-to-pod traffic to work.
The only reason Calico would be blocking traffic between hosts is if you have set up Host Endpoints, but since you have not mentioned using them, I am assuming you have not set up any.

Hello everyone,

I deployed a Kubernetes cluster on Azure VMs and am facing the networking issue described below.

Issue: pods on a different node cannot be reached from a host.
A curl command to such a pod IP fails. This is true for every host in the cluster.

All the hosts can reach one another. They are all in the same network.

Any help is deeply appreciated.

Kubernetes Version: 1.15.3
Node Type : Azure VM
OS : CoreOs 1967.6.0
Kubespray : release-2.11
ETCD Version: 3.3.10
Cluster Networking Solution : Calico

Solved it using Flannel. Seems Calico is not compatible with Azure by default.

@RajdeepSardar Yes, Calico networking is not possible on Azure, but Calico can still be used for Network Policy. See https://docs.projectcalico.org/v3.9/reference/public-cloud/azure for the options available.

@mslovy Are your nodes on Azure?

Solved it using Flannel. Seems Calico is not compatible with Azure by default.

You can enable Calico VXLAN encapsulation for Azure. By default Calico uses IP-in-IP which won't work on Azure.

See https://docs.projectcalico.org/v3.9/networking/vxlan-ipip
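As a rough illustration of what switching encapsulation involves, the sketch below builds an IPPool resource with IP-in-IP turned off and VXLAN turned on, and prints it as JSON. The pool name and the 10.233.64.0/18 CIDR are assumptions, not values from this thread; substitute the cluster's real pod CIDR. The linked page remains the authority on the full procedure.

```python
import json


def vxlan_ippool(cidr, name="default-ipv4-ippool"):
    """Build a Calico v3 IPPool that uses VXLAN instead of IP-in-IP."""
    return {
        "apiVersion": "projectcalico.org/v3",
        "kind": "IPPool",
        "metadata": {"name": name},
        "spec": {
            "cidr": cidr,
            "ipipMode": "Never",     # disable IP-in-IP, which Azure does not carry
            "vxlanMode": "Always",   # encapsulate all pod traffic in VXLAN
            "natOutgoing": True,
        },
    }


# Assumed pod CIDR; replace with the value used by the cluster.
pool = vxlan_ippool("10.233.64.0/18")
print(json.dumps(pool, indent=2))
```

The printed document could then be saved to a file and applied with calicoctl, assuming a Calico version that supports VXLAN.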

See https://docs.projectcalico.org/v3.9/reference/public-cloud/azure for the options available.

Looks like we need to update that to include Calico VXLAN encap!
