Nest: ClientProxy on Redis random CONNECTION_BROKEN

Created on 20 Feb 2020 · 10 comments · Source: nestjs/nest

Bug Report

Current behavior

I'm building an application with a microservices architecture, and everything works fine. The underlying services are orchestrated by a central Gateway that emits events and sends messages using a ClientProxy configured with a Redis connection.

The thing is that sometimes (there doesn't seem to be a particular reason causing this) Nest complains about an error in the console:

[Nest] 93917   - 02/20/2020, 10:23:34   [ClientProxy] Object:
{
  "code": "CONNECTION_BROKEN",
  "origin": {}
}

If I then try to make another HTTP call to the Gateway, which would send a message to the respective service(s), everything gets stuck.

If I reboot the Gateway and retry the call without rebooting the underlying service, the call reaches the service, but the service crashes when it tries to emit an event after some processing.

The strange thing is that the service gets activated via @MessagePattern, which (as far as I understand) uses the ClientProxy as well. So if the connection is broken, how can it still receive the message from the gateway?
I would expect it to crash whenever it tries to do anything with Redis at this point... Or am I missing something?

Another thing to mention is that the Redis instance is a managed Redis service hosted on DigitalOcean. When the CONNECTION_BROKEN error occurs, it happens on both the service and the Gateway, which leads me to think that it is the DO service itself that somehow "cuts" the connection after a while.

Input Code

Gateway code

// auth.controller.ts (the public API that the users can call on the Gateway)
@Post('auth/classic')
classicAuth(@Body() dto: ClassicAuthDto, @Response() httpResp: ExpressResp) {
    this.callService(this.identitySvc.tryClassicAuth(dto.email, dto.password), httpResp);
}

// ...
// callService method (inside base class)
protected callService<T>(stream: Observable<BaseResponse<T>>, httpResponse: Response): void {
    const sub = stream.pipe(
        // Forward the microservice response to the HTTP client
        tap(resp => httpResponse.status(resp.statusCode).send(resp)),
        catchError(err => this.handleResponseError(err, httpResponse)),
        // Clean up the subscription once the stream completes or errors
        finalize(() => sub.unsubscribe()),
    ).subscribe();
}


// auth.service.ts (the service that sends the message to the underlying microservice)
tryClassicAuth(email: string, password: string): Observable<BaseResponse<AuthResponseDto>> {
    return this.bus.send(IdentityMessages.identityClassicAuth, { email, password });
}

Microservice code

// auth.controller.ts (the method that receives the message from Redis)
@MessagePattern(IdentityMessages.identityTokenRefresh)
onTokenRefresh(data: IRefreshTokenModel) {
    Logger.debug(this.onTokenRefresh.name, IdentityController.name);
    return this.identitySvc.tokenRefresh(data.refreshToken);
}


// auth.service.ts (the method that emits the event)
private async generateAuthResponse(userUid: string): Promise<IAuthResponse> {
    let shellyUser = await this.getShellyUser(userUid);
    if (!shellyUser) {
        shellyUser = await this.initShellyUser(userUid);
        Logger.log(`New user initialized: ${shellyUser.uid}`);
    }

    const token = this.signToken(shellyUser.uid);
    const refreshToken = this.signRefreshToken(shellyUser.uid);

    this.pushInMemoryUser(token, shellyUser);

    // The bus is the ClientProxy
    const sub = this.bus.emit(IdentityEvents.newUserAuthenticated, { uid: shellyUser.uid }).subscribe(() => {
        Logger.debug(`Event emitted: ${IdentityEvents.newUserAuthenticated}`, IdentityService.name);
        sub.unsubscribe();
    });

    return {
        token,
        type: 'Bearer',
        refresh: refreshToken,
    };
}

Update 1

I've tried to change the Redis provider from DigitalOcean to ScaleGrid, and this happened:

The Gateway (so it seems for now) did not throw the CONNECTION_BROKEN error, while the underlying service did. Of course, this time the gateway was fully operational and ready to accept requests, but when the requests arrived at the underlying service after the CONNECTION_BROKEN error (still confused about this though), the service blew up while trying to emit a message on an already closed connection.

Then I've tried moving the underlying service to one machine and the gateway to another. Same scenario.

Can I now safely assume that the problem resides in Nest, or in the library used for the Redis communication? Or again, maybe I'm missing something?

I'm pretty stuck on this, and I wouldn't want to switch over to TCP 😭

Update 2

It looks like if I pull a Redis Docker image locally and bind it only to 127.0.0.1:6379:6379 (therefore allowing only local communication), it seems to keep up. No CONNECTION_BROKEN so far... So what could it be? I'm not a Redis expert, by the way.

Expected behavior

I expect the connection to stay alive forever, but if this depends on external factors, I expect the framework to attempt an automatic reconnect upon failure (?)

Possible Solution

An event that we can subscribe to so we can reconnect manually, or an automatic reconnection procedure handled by Nest itself?

Environment


Nest version: 6.10.14

For Tooling issues:

  • Node version: 12.16.0
  • Platform: Mac
microservices question

All 10 comments

The strange thing is, that the service gets activated via @MessagePattern, which indeed uses the ClientProxy? So if the connection is broken, how can it receive the message from the gateway?
I'm surely expecting to crash whenever it tries to do anything with Redis at this point... Or am I missing something?

Redis pub/sub uses 2 separate clients, respectively a "subscriber" and a "publisher".
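
For context, here is a minimal sketch of that two-client pattern using the plain node_redis API the Redis transport of that era was built on (the channel name and payload are just illustrative): a client that has issued SUBSCRIBE can only run subscriber commands, so a duplicated client handles publishing. That is also why one of the two connections can break while the other keeps receiving messages.

import { createClient } from 'redis'; // node_redis v3 style API

// One client dedicated to subscribing...
const subscriber = createClient({ url: 'redis://localhost:6379' });
// ...and a duplicate of it dedicated to publishing.
const publisher = subscriber.duplicate();

subscriber.on('message', (channel, message) => {
  console.log(`Received on ${channel}: ${message}`);
});
subscriber.subscribe('identity_events');

// Regular commands and PUBLISH go through the second connection.
publisher.publish('identity_events', JSON.stringify({ uid: '123' }));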

I expect the connection to stay alive forever, but if this depends on external factors, I expect the framework to try an automatic reconnect upon failure (?)

Have you tried setting the retryAttempts and the retryDelay values?

transport: Transport.REDIS,
options: {
    url: 'redis://localhost:6379', // your URL
    retryAttempts: 10,
    retryDelay: 3000, // in milliseconds
},
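
The same retryAttempts and retryDelay options also apply on the listening side when the microservice itself is bootstrapped. A minimal sketch, assuming a standalone microservice bootstrap (the URL is a placeholder):

import { NestFactory } from '@nestjs/core';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';
import { AppModule } from './app.module';

async function bootstrap() {
  // The listener side accepts the same Redis retry options as the ClientProxy.
  const app = await NestFactory.createMicroservice<MicroserviceOptions>(AppModule, {
    transport: Transport.REDIS,
    options: {
      url: 'redis://localhost:6379', // your URL
      retryAttempts: 10,
      retryDelay: 3000, // in milliseconds
    },
  });
  await app.listenAsync();
}
bootstrap();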

I've tried to change the Redis provider from DigitalOcean to ScaleGrid, and this happened:

I'm not very familiar with ScaleGrid so I can't really help here. Have you tried asking on our Discord channel? https://discordapp.com/channels/nestjs

Hey @kamilmysliwiec, sorry for the delay, but I was away from the office. Just got back. During my absence I've tried to understand a bit more what was going on. Here's what I've found out:

Nest is working like a charm. I was very sceptical about it being a Nest bug, but it still could have been a possibility.

The good news is that the problem was solved just by increasing the memory of the Redis machine. Since it was a managed service hosted on a t2.micro instance, it was pretty lightweight. So I switched to a bigger machine managed by me, hiding it behind a firewall.

I've also tried what you suggested with the retry attempts, but still without success. It looks like when the service was cutting the connection it was actually "dying" (?)

Anyhow, the main cause was the Redis service's memory itself (pretty confident about it).

Also, I'm guessing the Redis path is soon to be abandoned, since we will move to RabbitMQ sooner or later.

Hi there,
I was experiencing the same error. I tried to discuss it here, but no one else seems to have run into it yet.

I somehow figured out a hack to handle it. I had CacheManager injected in both AppModule and in my PagesAPIModule. I was also using it for caching the SSR HTML against each URL.

When I removed it from the AppModule and from SSR, and only started using it in PagesAPIModule, the application started to work.

I would love to find out the reason, and a proper solution. For example, if I want to use it in my AppController API routes and also for the SSR service, how would I do it?
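
One pattern that might help here (just a sketch with illustrative module names, not a verified fix for this issue) is to register the cache once in a shared module and re-export it, so the HTTP routes and the SSR service share a single Redis-backed cache client:

import { CacheModule, Module } from '@nestjs/common';
// Store and connection details are placeholders for whatever cache-manager store you use.
import * as redisStore from 'cache-manager-redis-store';

@Module({
  imports: [
    CacheModule.register({
      store: redisStore,
      host: 'localhost',
      port: 6379,
    }),
  ],
  exports: [CacheModule],
})
export class SharedCacheModule {}

// AppModule and PagesAPIModule would then both import SharedCacheModule
// instead of each calling CacheModule.register() on their own.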

We seem to be experiencing the same issue. We were first running Redis inside our k8s cluster, which didn't cause any issues. For production we moved Redis outside the cluster to the managed solution on DigitalOcean. We seem to have enough memory available (1 GB) and are running at 50% usage. We are using Redis for events. After a while the following error occurs:

2020-10-21T15:01:35.167024685Z [Auth-Service] 214 - Error 10/21/2020, 3:01:35 PM [ClientProxy] Error: Redis connection in broken state: retry aborted.  - {"trace":""}
2020-10-21T15:01:35.178638071Z Error: Redis connection in broken state: retry aborted.
2020-10-21T15:01:35.178685472Z     at RedisClient.connection_gone (/app/node_modules/redis/index.js:568:30)
2020-10-21T15:01:35.178694556Z     at TLSSocket.<anonymous> (/app/node_modules/redis/index.js:230:14)
2020-10-21T15:01:35.178702373Z     at Object.onceWrapper (events.js:421:28)
2020-10-21T15:01:35.178709765Z     at TLSSocket.emit (events.js:327:22)
2020-10-21T15:01:35.178716816Z     at TLSSocket.EventEmitter.emit (domain.js:505:15)
2020-10-21T15:01:35.178723709Z     at endReadableNT (_stream_readable.js:1221:12)
2020-10-21T15:01:35.178730803Z     at processTicksAndRejections (internal/process/task_queues.js:84:21) {
2020-10-21T15:01:35.178737803Z   code: 'CONNECTION_BROKEN',
2020-10-21T15:01:35.178744496Z   origin: Error: Retry time exhausted
2020-10-21T15:01:35.178751419Z       at RedisStreamClient.createRetryStrategy (/app/node_modules/@mark_hoog/redis-streams-transport/dist/redis.client.js:70:20)
2020-10-21T15:01:35.178758598Z       at Object.retry_strategy (/app/node_modules/@mark_hoog/redis-streams-transport/dist/redis.client.js:57:50)
2020-10-21T15:01:35.178765646Z       at RedisClient.connection_gone (/app/node_modules/redis/index.js:553:41)
2020-10-21T15:01:35.178772337Z       at TLSSocket.<anonymous> (/app/node_modules/redis/index.js:230:14)
2020-10-21T15:01:35.178779857Z       at Object.onceWrapper (events.js:421:28)
2020-10-21T15:01:35.178786652Z       at TLSSocket.emit (events.js:327:22)
2020-10-21T15:01:35.178793297Z       at TLSSocket.EventEmitter.emit (domain.js:505:15)
2020-10-21T15:01:35.178800402Z       at endReadableNT (_stream_readable.js:1221:12)
2020-10-21T15:01:35.178807300Z       at processTicksAndRejections (internal/process/task_queues.js:84:21)
2020-10-21T15:01:35.178814496Z }

It seems to work fine at first, but after a while (569543 ms in this case) the connection is broken.
[EDIT] It seems to be different per service. Sometimes the disconnect happens after 10 minutes, sometimes after half an hour or a few hours.

@stijlbreuk same exact thing. Also, I confirm the "per service" disconnect behaviour. Each service dies at a different time after a successful connection.

I'm experiencing similar issues with a Kubernetes cluster using the bitnami/redis chart. Relevant info:

  • Got a microservice called Media, connecting to Redis via app.connectMicroservice() (roughly as in the sketch after this list)
  • Got an app called Api, using ClientProxy to send messages to Media
  • Api after some time logs the CONNECTION_BROKEN error with an empty origin field
  • Media does not show any errors, even though it connects to the same Redis node
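
For reference, a minimal sketch of what that Media bootstrap roughly looks like as a hybrid application (the module name, URL and port are placeholders, not taken from the actual code):

import { NestFactory } from '@nestjs/core';
import { MicroserviceOptions, Transport } from '@nestjs/microservices';
import { MediaModule } from './media.module';

async function bootstrap() {
  const app = await NestFactory.create(MediaModule);

  // Attach a Redis-based microservice listener to the same application.
  app.connectMicroservice<MicroserviceOptions>({
    transport: Transport.REDIS,
    options: {
      url: 'redis://redis-master:6379', // placeholder in-cluster service URL
      retryAttempts: 10,
      retryDelay: 3000,
    },
  });

  await app.startAllMicroservicesAsync();
  await app.listen(3000);
}
bootstrap();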

@RafaelVidaurre still fighting with this... did you find any solutions?

@caiusCitiriga I fixed it yesterday. For me the fix was pretty weird though; I'm not sure what was broken, but it seems it had to do with how I was connecting to Redis. There's definitely a problem in the sense that the error carries no useful info.

So, in my case one of the following things was the issue:

  1. I just had an undefined env variable (REDIS_URI), and was passing that to ClientProxyFactory. Probably I would've gotten a more explicit error if this were the case, but who knows.
  2. ClientProxyFactory didn't like the way the redis URL was formed.
  3. I had some issue with my Redis node in Kubernetes related to the password. I was using the default auto-generated password, and that was sometimes causing other issues with the slave nodes, so I ended up setting a fixed password instead in my values.yaml file (I use Helm).

In case this helps anyone, this is how I'm building ClientProxy now.

_(Please excuse the static URL there; injecting those env values via the ConfigModule would probably be more proper for a Nest app.)_

import { Module } from '@nestjs/common';
import { ClientProxyFactory, Transport } from '@nestjs/microservices';

import { environment } from '../../environments/environment';
import { MEDIA_SERVICE } from '../constants/injectables';

const redis = environment.redis;
let redisCredentialsString = '';

if (redis.password) {
  redisCredentialsString = `${redis.username || ''}:${redis.password}@`;
}
const redisUrl = `redis://${redisCredentialsString}${redis.host}:${redis.port}/0`;

@Module({
  providers: [
    {
      provide: MEDIA_SERVICE,
      useFactory: () => {
        return ClientProxyFactory.create({
          transport: Transport.REDIS,
          options: {
            url: redisUrl,
            retryAttempts: 20,
            retryDelay: 3000,
          },
        });
      },
    },
  ],
  exports: [MEDIA_SERVICE],
})
export class MediaServiceModule {}
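
For completeness, the client created by that factory would typically be consumed like this (the consuming class and message pattern are illustrative):

import { Inject, Injectable } from '@nestjs/common';
import { ClientProxy } from '@nestjs/microservices';

import { MEDIA_SERVICE } from '../constants/injectables';

@Injectable()
export class MediaGatewayService {
  constructor(@Inject(MEDIA_SERVICE) private readonly mediaClient: ClientProxy) {}

  // send() returns an Observable that emits the microservice's response.
  getMediaStatus() {
    return this.mediaClient.send({ cmd: 'media.status' }, {});
  }
}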

@RafaelVidaurre thank you so much for sharing! I'm currently working on another project now, but as soon as I have time I'll go back to the old one and try implementing this change 😊 I'll let you know if it fixes the problem.

Because of course... it's appearing randomly 🤣

@RafaelVidaurre unfortunately my Redis URL seemed to be fine, since it's a dead simple redis://redis; the Redis service is a Docker container running locally and not exposed to the outside world...

But the temporary solution that seems to be working is the following one:
https://stackoverflow.com/questions/56242848/kill-nestjs-node-js-process-while-lost-redis-connection-microservice
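
For anyone landing here later, a rough sketch of that kind of workaround (an assumption of what the linked approach boils down to, not a copy of it): run a small watchdog connection next to the app and exit the process once the connection is reported broken, so Docker/Kubernetes restarts the service with a fresh connection. The retry numbers are arbitrary.

import { createClient } from 'redis'; // node_redis v3 style API

// Hypothetical watchdog client whose only job is to detect a dead connection.
const watchdog = createClient({
  url: process.env.REDIS_URI,
  retry_strategy: options => {
    // Give up after ~10 attempts; node_redis then emits a CONNECTION_BROKEN error
    // with the returned error as its origin (the same shape as the error logged earlier in this thread).
    if (options.attempt > 10) {
      return new Error('Retry time exhausted');
    }
    return 3000; // wait 3 s between attempts
  },
});

watchdog.on('error', (err: any) => {
  if (err && err.code === 'CONNECTION_BROKEN') {
    console.error('Redis connection is gone, exiting so the orchestrator can restart the service', err);
    process.exit(1);
  }
});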

