Spring-cloud-netflix: Eureka Server and Client incompatible between Angel (1.0.x) and Brixton (1.1.x)

Created on 19 Apr 2016 · 22 comments · Source: spring-cloud/spring-cloud-netflix

After a fair amount of investigation, I have found that the Eureka server and client components behave oddly when the two versions are mixed. This issue is mostly for others who run into the same problem on the upgrade path. A fix would be great, but at a minimum this behavior should be documented if it is going to persist.

There are four scenarios in total (tested with Angel.SR6 and Brixton.RC2):

1) Server (Brixton) and Client (Brixton)
2) Server (Brixton) and Client (Angel)
3) Server (Angel) and Client (Brixton)
4) Server (Angel) and Client (Angel)

Scenarios 1 and 4 work as expected. Scenarios 2 and 3 do not.

With 2, the client registers, but then re-registers on every renewal. I found that this is because the instanceId the client sends doesn't match what is stored in the Eureka server. With this setup I don't have access to the hostname via the configuration properties, so I can't populate the instanceId with the same value that the Eureka server holds without writing custom code.

With 3, the client registers on the first attempt, but if the instance later can't be found in the Eureka server, it is never able to re-register and return to the UP state. Note also that I had to make the eureka.instance.metadataMap.instanceId property look exactly like the second half of the instanceId created dynamically by the org.springframework.cloud.commons.util.IdUtils class. In my test case the resulting value was ${spring.application.name}:${server.port}, and the instanceId field was ${spring.cloud.client.hostname}:${spring.application.name}:${server.port}. Digging further, I found that the com.netflix.discovery.shared.transport.jersey.AbstractJerseyEurekaHttpClient#sendHeartBeat method was updated to check whether the response has a body. The response here is a 404 whose body doesn't match what the client expects, so a mapping exception is thrown. This path is what causes the failure to re-register.

All 22 comments

I did some more poking around after the release of Brixton and discovered a couple more things. I'm also now providing the configuration that I'm using to test this error.

https://github.com/shanman190/eureka-error

There are two branches in the above project that demonstrate the middle two scenarios (2 and 3).

My test environment is the following:

Ubuntu 16.04 LTS
Oracle Java 8u91
Network: DHCP

When running in configuration 2 (server Brixton, client Angel), the instanceId is configured the same way Spencer recommends in pr-608, following the customers service example. What I notice here is that when the client calls the server, it queries for instanceId = localhost:eureka-client:8080, which doesn't exist in the server. In the server, the instance can instead be queried by instanceId = eureka-client:8080. From what I've searched through in the Angel codebase, there is no way to either get the hostname onto the instanceId field at query time on the Eureka server side so that the values match, or remove the localhost prefix from the instanceId property on the client side. The end result is that the client re-registers its lease at every renewal.
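The renew/re-register loop described for configuration 2 can be sketched with a toy in-memory registry. This is only an illustration under stated assumptions: the class and method names (RenewLoopSketch, register, renew, countReRegistrations) are hypothetical and are not Eureka or Spring Cloud code; only the two id strings come from the comment above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the renew loop in configuration 2 (hypothetical names, not Eureka code). */
public class RenewLoopSketch {
    // Server-side lease table keyed by instance id.
    private final Map<String, String> leases = new HashMap<>();

    /** Stores a lease under the given id and returns an HTTP-like 204. */
    int register(String id, String app) {
        leases.put(id, app);
        return 204;
    }

    /** Returns 200 when a lease exists for the id, 404 otherwise. */
    int renew(String id) {
        return leases.containsKey(id) ? 200 : 404;
    }

    /** Sends the given number of heartbeats and counts how many ended in a re-registration. */
    static int countReRegistrations(String storedId, String renewId, int heartbeats) {
        RenewLoopSketch server = new RenewLoopSketch();
        server.register(storedId, "eureka-client");
        int reRegistrations = 0;
        for (int i = 0; i < heartbeats; i++) {
            if (server.renew(renewId) == 404) {
                // The client re-registers, but the lease is stored under storedId again,
                // so the next renew with renewId misses as well.
                server.register(storedId, "eureka-client");
                reRegistrations++;
            }
        }
        return reRegistrations;
    }

    public static void main(String[] args) {
        // Mismatched ids: every heartbeat misses and triggers a re-registration.
        System.out.println(countReRegistrations("eureka-client:8080", "localhost:eureka-client:8080", 3));
        // Matching ids: the lease is found and no re-registration happens.
        System.out.println(countReRegistrations("eureka-client:8080", "eureka-client:8080", 3));
    }
}
```

With mismatched ids the count equals the number of heartbeats; once the stored id and the renew id agree, it drops to zero.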

When running in configuration 3 (server Angel, client Brixton), the client initially registers on startup with instanceId = eureka-client:8080. When it comes time to renew, however, the lookup in my setup sends an instanceId of IP:eureka-client:8080, while the server expects localhost:eureka-client:8080. I can force the IP to resolve to the hostname by adding an entry with the DHCP address to my hosts file, but that only works until the IP is renewed. Even then a further issue appears: the client registers on startup just fine with a hostname of localhost and an instanceId of eureka-client:8080, but that doesn't match the instanceId used when it renews the lease, because the renewal uses the IP version. Once the two are consistent, registration and renewal work correctly. The last bit of trouble I ran into was that if the Eureka server is restarted at this point, the Brixton client can't re-register without a restart of the application. I found that this is caused by the JSON body that Spring Boot returns on an error response.

In summary,
Config 2:

  • instanceId for the instance doesn't match what eureka server expects causing continuous re-registration

Config 3:

  • instanceId value changes between startup and renewal when using DHCP, so the renewal lookup misses; the client is then unable to re-register because of the JSON error response.
  • for the same reason, the client is unable to re-register after the Eureka server has been restarted.

+1 for this, suffering from incompatibility between 1.0.3 server and 1.1.0 client.

afe8f4d0ef6b3064df655453f941c61c1a8769a4 introduces a fix for 3) by setting the following:

eureka.instance.instanceId=${spring.cloud.client.hostname}:client1
eureka.instance.metadataMap.instanceId=${eureka.instance.instanceId}

client1 is arbitrary; it could be the port, an id from the platform, a random number, etc.

Angel Server expects the value of eureka.instance.hostname to prefix instanceId. The angel client does it by default. @dsyer why did we do that? Was it for running on a single machine?

@spencergibb That part definitely looks good to fix the instanceId problem. Is there a way that you know of to easily disable the 404 json with the Angel eureka server? That would fix the second part of 3 preventing the instance from re-registering.

@shanman190 I'm not seeing a 404 for 3 with my fix.

@spencergibb The 404 comes from the Eureka server after the server application has been restarted and the client sends its next heartbeat. In my example I'm running a single node, so there is no replication. The problem originates from a change between the Eureka DiscoveryClient.HeartbeatThread (1.1.147) and AbstractJerseyEurekaHttpClient (1.4.6), which is where the heartbeat moved to. The old code looked for a 404 status code coming back from the Eureka server; the new code looks at whether the response body has content.

DiscoveryClient.HeartbeatThread:
https://github.com/Netflix/eureka/blob/v1.1.147/eureka-client/src/main/java/com/netflix/discovery/DiscoveryClient.java#L1591

AbstractJerseyEurekaHttpClient:
https://github.com/Netflix/eureka/blob/v1.4.6/eureka-client/src/main/java/com/netflix/discovery/shared/transport/jersey/AbstractJerseyEurekaHttpClient.java#L104

I realize that my previous comment isn't very clear, so here is some more detail.

Steps:
1) Start server
2) Start client and wait until registration is complete
3) Restart server; the client is no longer registered, but still thinks that it is
4) Client sends heartbeat
5) Server sends 404 with a Spring Boot JSON response body (here's where things start to go wrong)
6) AbstractJerseyEurekaHttpClient begins capturing the values of the response. Because the response has a JSON body, it also tries to deserialize that body into an InstanceInfo object, which causes an exception to be thrown up the stack.

If there were a way to disable the JSON response body, the failing deserialization would be skipped. The Eureka response would then return to the renew method of the DiscoveryClient, which checks for the 404 response code and re-registers the client with the server.
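The difference between the two heartbeat-handling behaviors described above can be sketched as follows. This is a simplified stand-in, not the actual Eureka client code: Response, parseInstanceInfo and both shouldReRegister methods are hypothetical names, and the error body is a shortened illustration of a Spring Boot error response rather than a captured one.

```java
/** Simplified contrast of status-based vs. body-based heartbeat handling (hypothetical names). */
public class HeartbeatSketch {
    /** Minimal stand-in for an HTTP response. */
    static final class Response {
        final int status;
        final String body; // null when the response has no entity

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    /** Stand-in for InstanceInfo deserialization: accepts only bodies with an "instance" element. */
    static String parseInstanceInfo(String body) {
        if (!body.contains("\"instance\"")) {
            throw new IllegalStateException("cannot map response body to InstanceInfo");
        }
        return body;
    }

    /** Older behavior as described: the decision depends on the status code alone. */
    static boolean shouldReRegisterByStatus(Response r) {
        return r.status == 404;
    }

    /** Newer behavior as described: any body is parsed before the status is examined. */
    static boolean shouldReRegisterWithBodyParsing(Response r) {
        if (r.body != null) {
            parseInstanceInfo(r.body); // throws for an error body that is not an InstanceInfo
        }
        return r.status == 404;
    }

    public static void main(String[] args) {
        Response bootError = new Response(404, "{\"status\":404,\"error\":\"Not Found\"}");
        System.out.println(shouldReRegisterByStatus(bootError));
        try {
            shouldReRegisterWithBodyParsing(bootError);
        } catch (IllegalStateException e) {
            // The 404 check is never reached, so the client never re-registers.
            System.out.println("heartbeat failed: " + e.getMessage());
        }
    }
}
```

In this sketch a 404 without a body still leads to re-registration in both variants, which is why suppressing the JSON error body on the server would sidestep the problem.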

Thanks for all the analysis. Here's my summary:

  • We should give a low priority to scenario 3 (Brixton client, Angel server) because upgrading the server should always be easy.
  • Commit a0c1b4c0 should fix scenario 2 (Angel client, Brixton server).
  • The 404 problem in #1033 is actually a barrier for the Brixton client, irrespective of the server anyway, so we should fix that urgently as well (in the server if necessary). It might even fix scenario 3.

Update: #1033 is only a problem with the Angel server, so it's not a high priority. As far as I can tell we have it all covered in the issue. There might still be some issues with the client thrashing its registration (see comments with log snippets in #1013), so I'm going to leave this open for a bit in case we can work out what that means.

@dsyer Which branch is commit a0c1b4c0 on?
Scenario 2 is very important because we cannot require every Eureka client to cut a new release when we upgrade the Eureka server to Brixton.

It's on master (now).

I've tested the issue as well using the latest Brixton snapshot and can confirm that upgrading the eureka server to the Brixton snapshot allows both eureka clients running Angel and Brixton to properly register and re-register after the server has been restarted. This most definitely fixes scenario 2 and gives users an upgrade path from Angel to Brixton when utilizing Eureka service discovery.

@dsyer You're welcome. Glad to help and thank you and @spencergibb for all the work that went into correcting this issue!

Will this fix be part of release 1.1.1?
Thx
Christophe

@ouaibsky yes, @dsyer's fix for item 3 a0c1b4c05d9c98d98a863b248608b2fff1177b7f would be part of 1.1.1 (Brixton.SR1). My fix for item 2 (https://github.com/spring-cloud/spring-cloud-netflix/issues/978#issuecomment-219168252) would be part of 1.0.x and Angel.SR7.

Thx a lot
Christophe

Is there any thought on when the Brixton.SR1 release is going to be?

Probably next week (still waiting for a couple of bug fixes).

What is the process for pushing releases to Maven Central?
I can see 1.1.1 and 1.1.2 in the Spring repo: https://repo.spring.io/release/org/springframework/cloud/spring-cloud-netflix-eureka-server/

but nothing in central: http://search.maven.org/#search|gav|1|g%3A%22org.springframework.cloud%22%20AND%20a%3A%22spring-cloud-netflix-eureka-server%22

Christophe

The process isn't finished yet.

Brixton SR1 fixed the problem for me. Good job, thanks.

In my opinion this works only when the Angel client has no custom 'eureka.instance.metadataMap.instanceId' set.

In my case the Angel client has:

eureka:  
  instance:
    metadataMap:
      instanceId: ${spring.application.name}:${spring.application.instance_id:${random.value}}

In this case the client tries to re-register itself every 30 seconds:

2016-06-16 17:37:39.620  INFO 11044 --- [pool-2-thread-1] com.netflix.discovery.DiscoveryClient    : DiscoveryClient_XXX-EUREKA-CLIENT-ANGEL/U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6 - Re-registering apps/XXX-EUREKA-CLIENT-ANGEL
2016-06-16 17:37:39.620  INFO 11044 --- [pool-2-thread-1] com.netflix.discovery.DiscoveryClient    : DiscoveryClient_XXX-EUREKA-CLIENT-ANGEL/U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6: registering service...
2016-06-16 17:37:39.621  INFO 11044 --- [pool-2-thread-1] com.netflix.discovery.DiscoveryClient    : DiscoveryClient_XXX-EUREKA-CLIENT-ANGEL/U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6 - registration status: 204
2016-06-16 17:38:09.661  INFO 11044 --- [pool-2-thread-1] com.netflix.discovery.DiscoveryClient    : DiscoveryClient_XXX-EUREKA-CLIENT-ANGEL/U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6 - Re-registering apps/XXX-EUREKA-CLIENT-ANGEL
2016-06-16 17:38:09.661  INFO 11044 --- [pool-2-thread-1] com.netflix.discovery.DiscoveryClient    : DiscoveryClient_XXX-EUREKA-CLIENT-ANGEL/U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6: registering service...
2016-06-16 17:38:09.662  INFO 11044 --- [pool-2-thread-1] com.netflix.discovery.DiscoveryClient    : DiscoveryClient_XXX-EUREKA-CLIENT-ANGEL/U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6 - registration status: 204

Brixton.SR1 Eureka Server logs:

2016-06-16 17:37:39.620  WARN 11392 --- [nio-8761-exec-5] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: XXX-EUREKA-CLIENT-ANGEL - U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6
2016-06-16 17:37:39.620  WARN 11392 --- [nio-8761-exec-5] c.n.eureka.resources.InstanceResource    : Not Found (Renew): XXX-EUREKA-CLIENT-ANGEL - U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6
2016-06-16 17:37:39.621  INFO 11392 --- [nio-8761-exec-7] c.n.e.registry.AbstractInstanceRegistry  : Registered instance XXX-EUREKA-CLIENT-ANGEL/U245496 with status UP (replication=false)
2016-06-16 17:38:09.661  WARN 11392 --- [nio-8761-exec-6] c.n.e.registry.AbstractInstanceRegistry  : DS: Registry: lease doesn't exist, registering resource: XXX-EUREKA-CLIENT-ANGEL - U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6
2016-06-16 17:38:09.661  WARN 11392 --- [nio-8761-exec-6] c.n.eureka.resources.InstanceResource    : Not Found (Renew): XXX-EUREKA-CLIENT-ANGEL - U245496:xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6
2016-06-16 17:38:09.662  INFO 11392 --- [nio-8761-exec-9] c.n.e.registry.AbstractInstanceRegistry  : Registered instance XXX-EUREKA-CLIENT-ANGEL/U245496 with status UP (replication=false)

com.netflix.eureka.registry.AbstractInstanceRegistry.register registers the instance under its hostname (the value returned by instanceInfo.getId()) when no instanceId is set, and instanceId is never set by Angel clients.
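A minimal sketch of that fallback, assuming the id resolution works as described in this comment. The helper names (registryKey, angelRenewId) are hypothetical and do not exist in Eureka; the hostname and id values are taken from the logs above.

```java
/** Illustrates why the renew id misses the registered key (hypothetical helpers, not Eureka code). */
public class InstanceKeySketch {
    /** Mirrors the described fallback: use instanceId when set, otherwise the hostname. */
    static String registryKey(String instanceId, String hostName) {
        return (instanceId != null && !instanceId.isEmpty()) ? instanceId : hostName;
    }

    /** Id the Angel client sends on renew, per the logs: hostname plus the metadata instanceId. */
    static String angelRenewId(String hostName, String metadataInstanceId) {
        return hostName + ":" + metadataInstanceId;
    }

    public static void main(String[] args) {
        String host = "U245496";
        String metadataId = "xxx-eureka-client-angel:983d24591515138d72191e85eb0479a6";

        String stored = registryKey(null, host);          // Angel client: instanceId field not set
        String renewed = angelRenewId(host, metadataId);  // what the renew request looks up

        System.out.println("stored under: " + stored);
        System.out.println("renew asks for: " + renewed);
        System.out.println("match: " + stored.equals(renewed));
    }
}
```

The stored key corresponds to the "Registered instance XXX-EUREKA-CLIENT-ANGEL/U245496" line in the server log, while the renew id corresponds to the "Not Found (Renew)" line.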

Thanks for the logs. Can you please open a new issue specifically about this scenario?
