Ray: Connection Issue - Reset by Peer

Created on 22 Sep 2018  ·  18 Comments  ·  Source: ray-project/ray

The same code on the same EC2 instance now throws the error below. It appears to come from this line in my code: register_trainable("exp", obj_function)

I have redis '2.10.6' installed.

Any hints on the cause?

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:60723 to respond...
Waiting for redis server at 127.0.0.1:48080 to respond...
Starting local scheduler with the following resources: {'CPU': 96, 'GPU': 0}.

======================================================================
View the web UI at http://localhost:8889/notebooks/ray_ui33174.ipynb?token=74b2c0738a07e667d60580a4b2f24d7317954eaf830ab0ef
======================================================================

---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
    589             for item in command:
--> 590                 self._sock.sendall(item)
    591         except socket.timeout:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/redis/client.py in execute_command(self, *args, **options)
    666         try:
--> 667             connection.send_command(*args)
    668             return self.parse_response(connection, command_name, **options)

~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_command(self, *args)
    609         "Pack and send a command to the Redis server"
--> 610         self.send_packed_command(self.pack_command(*args))
    611 

~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
    602             raise ConnectionError("Error %s while writing to socket. %s." %
--> 603                                   (errno, errmsg))
    604         except:

ConnectionError: Error 104 while writing to socket. Connection reset by peer.

During handling of the above exception, another exception occurred:

ConnectionResetError                      Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
    589             for item in command:
--> 590                 self._sock.sendall(item)
    591         except socket.timeout:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input-6-48c88d974e24> in <module>()
     75         ray.init()
     76 
---> 77     register_trainable("exp", obj_function) #registers the above config and the objective function
     78 
     79     hpo=HyperOptSearch(space, max_concurrent=4, reward_attr="neg_mean_loss") #smaller is better for log loss

~/anaconda3/lib/python3.6/site-packages/ray/tune/registry.py in register_trainable(name, trainable)
     36         raise TypeError("Second argument must be convertable to Trainable",
     37                         trainable)
---> 38     _global_registry.register(TRAINABLE_CLASS, name, trainable)
     39 
     40 

~/anaconda3/lib/python3.6/site-packages/ray/tune/registry.py in register(self, category, key, value)
     77         self._to_flush[(category, key)] = pickle.dumps(value)
     78         if _internal_kv_initialized():
---> 79             self.flush_values()
     80 
     81     def contains(self, category, key):

~/anaconda3/lib/python3.6/site-packages/ray/tune/registry.py in flush_values(self)
     99     def flush_values(self):
    100         for (category, key), value in self._to_flush.items():
--> 101             _internal_kv_put(_make_key(category, key), value, overwrite=True)
    102         self._to_flush.clear()
    103 

~/anaconda3/lib/python3.6/site-packages/ray/experimental/internal_kv.py in _internal_kv_put(key, value, overwrite)
     29     worker = ray.worker.get_global_worker()
     30     if overwrite:
---> 31         updated = worker.redis_client.hset(key, "value", value)
     32     else:
     33         updated = worker.redis_client.hsetnx(key, "value", value)

~/anaconda3/lib/python3.6/site-packages/ray/utils.py in _wrapper(*args, **kwargs)
    322                 def _wrapper(*args, **kwargs):
    323                     with self.lock:
--> 324                         return orig_attr(*args, **kwargs)
    325 
    326                 self._wrapper_cache[attr] = _wrapper

~/anaconda3/lib/python3.6/site-packages/redis/client.py in hset(self, name, key, value)
   1990         Returns 1 if HSET created a new field, otherwise 0
   1991         """
-> 1992         return self.execute_command('HSET', name, key, value)
   1993 
   1994     def hsetnx(self, name, key, value):

~/anaconda3/lib/python3.6/site-packages/redis/client.py in execute_command(self, *args, **options)
    671             if not connection.retry_on_timeout and isinstance(e, TimeoutError):
    672                 raise
--> 673             connection.send_command(*args)
    674             return self.parse_response(connection, command_name, **options)
    675         finally:

~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_command(self, *args)
    608     def send_command(self, *args):
    609         "Pack and send a command to the Redis server"
--> 610         self.send_packed_command(self.pack_command(*args))
    611 
    612     def can_read(self, timeout=0):

~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
    601                 errmsg = e.args[1]
    602             raise ConnectionError("Error %s while writing to socket. %s." %
--> 603                                   (errno, errmsg))
    604         except:
    605             self.disconnect()

ConnectionError: Error 104 while writing to socket. Connection reset by peer.

All 18 comments

Does anyone know of any changes that may have occurred, or of things to troubleshoot for this issue?

This is on a single machine, right?

Can you look at the logs under /tmp/ray or /tmp/raylogs/ depending on your version of Ray and see if the files that start with redis-* say anything interesting?
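For anyone following the suggestion above, here is a minimal sketch for pulling the interesting lines out of those logs. It assumes the logs live under /tmp/ray or /tmp/raylogs as mentioned in this thread; the function names and the keyword list are hypothetical choices, not part of Ray.

```python
import glob
import os

# Directories named in the thread; which one exists depends on the Ray version.
DEFAULT_LOG_DIRS = ("/tmp/ray", "/tmp/raylogs")


def find_redis_logs(log_dirs=DEFAULT_LOG_DIRS):
    """Return every file whose name starts with 'redis-' under the given directories."""
    matches = []
    for log_dir in log_dirs:
        # '**' with recursive=True also matches files directly inside log_dir.
        pattern = os.path.join(log_dir, "**", "redis-*")
        matches.extend(p for p in glob.glob(pattern, recursive=True) if os.path.isfile(p))
    return sorted(matches)


def flagged_lines(path, keywords=("WARNING", "ERROR", "Protocol error")):
    """Return the lines of one log file that contain any of the keywords."""
    with open(path, errors="replace") as handle:
        return [line.rstrip("\n") for line in handle if any(k in line for k in keywords)]


if __name__ == "__main__":
    for log_path in find_redis_logs():
        for line in flagged_lines(log_path):
            print(f"{log_path}: {line}")
```

Running it right after the error occurs should show whether Redis logged anything at the moment the connection was reset.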

What if you just do import ray; ray.init() and nothing else. Does that work?

Hello @robertnishihara, I have the same issue here.
When I use my own TensorFlow code with ray.tune, it gives me this error message:

redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

But when I run tune_mnist_ray.py on my computer, it works well and runs to completion.


The redis-* file in /tmp/raylogs says something like this:

36799:M 21 Oct 17:23:46.804 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
36799:M 21 Oct 17:23:46.804 # Server started, Redis version 3.9.102
36799:M 21 Oct 17:23:46.804 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
36799:M 21 Oct 17:23:46.804 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.

Is there any other info I can give you so that you can help us fix this bug?
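The three warnings in that log refer to kernel settings that can be read directly. Below is a minimal sketch that checks them, assuming a Linux host; the function names are hypothetical. These warnings are common on default Linux configurations, and nothing in this thread ties them to the connection reset, so treat the output as background information only.

```python
def read_setting(path):
    """Return the stripped contents of a /proc or /sys file, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def check_kernel_settings(read=read_setting):
    """Return one message per setting that matches a warning in the Redis log above."""
    problems = []

    somaxconn = read("/proc/sys/net/core/somaxconn")
    if somaxconn is not None and int(somaxconn) < 511:
        problems.append(f"somaxconn is {somaxconn}, below the requested TCP backlog of 511")

    overcommit = read("/proc/sys/vm/overcommit_memory")
    if overcommit == "0":
        problems.append("vm.overcommit_memory is 0")

    # The active THP mode is the one shown in square brackets.
    thp = read("/sys/kernel/mm/transparent_hugepage/enabled")
    if thp is not None and "[always]" in thp:
        problems.append("Transparent Huge Pages are set to 'always'")

    return problems


if __name__ == "__main__":
    for message in check_kernel_settings():
        print(message)
```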

Does your own code work on your own laptop?

I'm having the same issue. Did anybody solve it?

me too

Can you try the nightly wheels (see https://ray.readthedocs.io/en/latest/installation.html#trying-snapshots-from-master) and see if the issue still occurs?

Hi, thanks for responding. I tried the nightly wheels for Python 3.6 and the same error still occurs:

redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

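Description: If PySpark (spark = SparkSession.. ) and the data loading are placed outside the Ray code as a data source, and the result is then passed into the Ray code, this error is raised.

Solution: Move the Spark initialization into Ray's initialization (_setup(self, config)) and define a data-loading function inside the Ray class.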
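The shape of that fix can be sketched without Ray or Spark installed. The class below is a hypothetical stand-in for a ray.tune Trainable subclass: only the _setup(self, config) method name comes from the comment above, and the session and data helpers are placeholders for the real Spark calls.

```python
class HypotheticalTrainable:
    """Stand-in for a ray.tune Trainable subclass; all names are illustrative."""

    def _setup(self, config):
        # In the real case the SparkSession would be created here, so it
        # lives inside the worker process and is never serialized.
        self.session = self._create_session(config)
        self.data = self._load_data(config["n_rows"])

    def _create_session(self, config):
        # Placeholder for something like SparkSession.builder.getOrCreate().
        return {"app_name": config.get("app_name", "hypothetical-app")}

    def _load_data(self, n_rows):
        # Placeholder for a Spark query; returns plain Python rows.
        return [{"row": i} for i in range(n_rows)]


# Only small, plain values are passed in from outside.
config = {"n_rows": 3, "app_name": "demo"}
trainable = HypotheticalTrainable()
trainable._setup(config)
print(trainable.data)
```

The point of the design is that the config stays tiny, while the session and the data are built where they are used.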

I get the sense this is an issue if your environment has a large memory consumption. @robertnishihara is there an easy way around this?

@dmadeka If the issue is that Redis is crashing because it is using too much memory, you can limit it by calling ray.init(redis_max_memory=10**9) or something like that (in bytes).
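Since the value is in bytes, it may help to derive it from the machine's memory instead of hard-coding it. This is a minimal sketch assuming a Linux host; the helper names and the 10 percent default are hypothetical, and the ray.init call is left as a comment because the redis_max_memory parameter is taken from the comment above and may not exist in every Ray version.

```python
import os


def redis_cap_bytes(total_bytes, percent=10):
    """Return a memory cap in bytes equal to the given percentage of total_bytes."""
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range 1..100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_bytes * percent // 100


def physical_memory_bytes():
    """Total physical memory on Linux, computed from sysconf values."""
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


# Hypothetical usage, following the comment above:
# import ray
# ray.init(redis_max_memory=redis_cap_bytes(physical_memory_bytes()))
```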

I'm having the same issue. When trying to register a new environment, the following error comes up:


ConnectionResetError                      Traceback (most recent call last)
~/miniconda/lib/python3.7/site-packages/redis/connection.py in send_packed_command(self, command)
    599             for item in command:
--> 600                 self._sock.sendall(item)
    601         except socket.timeout:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
in
      1 from ray.tune.registry import register_env
----> 2 register_env("CryptoTrain", env_creator)

~/miniconda/lib/python3.7/site-packages/ray/tune/registry.py in register_env(name, env_creator)
     61     if not isinstance(env_creator, FunctionType):
     62         raise TypeError("Second argument must be a function.", env_creator)
---> 63     _global_registry.register(ENV_CREATOR, name, env_creator)
     64 
     65 

~/miniconda/lib/python3.7/site-packages/ray/tune/registry.py in register(self, category, key, value)
     89         self._to_flush[(category, key)] = pickle.dumps(value)
     90         if _internal_kv_initialized():
---> 91             self.flush_values()
     92 
     93     def contains(self, category, key):

~/miniconda/lib/python3.7/site-packages/ray/tune/registry.py in flush_values(self)
    111     def flush_values(self):
    112         for (category, key), value in self._to_flush.items():
--> 113             _internal_kv_put(_make_key(category, key), value, overwrite=True)
    114         self._to_flush.clear()
    115 

~/miniconda/lib/python3.7/site-packages/ray/experimental/internal_kv.py in _internal_kv_put(key, value, overwrite)
     40 
     41     if overwrite:
---> 42         updated = worker.redis_client.hset(key, "value", value)
     43     else:
     44         updated = worker.redis_client.hsetnx(key, "value", value)

~/miniconda/lib/python3.7/site-packages/redis/client.py in hset(self, name, key, value)
   2672         Returns 1 if HSET created a new field, otherwise 0
   2673         """
-> 2674         return self.execute_command('HSET', name, key, value)
   2675 
   2676     def hsetnx(self, name, key, value):

~/miniconda/lib/python3.7/site-packages/redis/client.py in execute_command(self, *args, **options)
    772         connection = pool.get_connection(command_name, **options)
    773         try:
--> 774             connection.send_command(*args)
    775             return self.parse_response(connection, command_name, **options)
    776         except (ConnectionError, TimeoutError) as e:

~/miniconda/lib/python3.7/site-packages/redis/connection.py in send_command(self, *args)
    618     def send_command(self, *args):
    619         "Pack and send a command to the Redis server"
--> 620         self.send_packed_command(self.pack_command(*args))
    621 
    622     def can_read(self, timeout=0):

~/miniconda/lib/python3.7/site-packages/redis/connection.py in send_packed_command(self, command)
    611                 errmsg = e.args[1]
    612             raise ConnectionError("Error %s while writing to socket. %s." %
--> 613                                   (errno, errmsg))
    614         except:  # noqa: E722
    615             self.disconnect()

ConnectionError: Error 104 while writing to socket. Connection reset by peer.

HELP PLEASE!!!!!! It worked until yesterday, and neither the code nor anything else has changed. The only change is the size of the files I'm using as the data for the environment, but setting the Redis memory or the object store memory higher is not helping at all; the issue remains the same.

Can you share a script for reproducing the issue?

I have the same issue on my GPU server, where Redis is not installed.

ConnectionError: Error 104 while writing to socket. Connection reset by peer.

Can I remove the redis connection?

How big is your data set? I kept getting ConnectionResetError: [Errno 104] Connection reset by peer errors. Then, I followed the guide on working with large objects (mine were in the GB range) here: https://ray.readthedocs.io/en/latest/tune-usage.html#handling-large-datasets
and I no longer got that error.
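The tracebacks in this thread show the registered object being pickled and sent to Redis in a single HSET, so its pickled size is worth checking before registering it. Here is a minimal sketch using only the standard library. The 512 MB threshold is an assumption based on Redis's default limit for a single bulk string, and the helper names and the file path are hypothetical. Ray pickles functions together with the data they reference, which plain pickle does not reproduce, so a config dict stands in for the registered object here.

```python
import pickle

# Assumption: Redis's default limit for one bulk string is 512 MB.
REDIS_BULK_LIMIT_BYTES = 512 * 1024 * 1024


def pickled_size(obj):
    """Return the number of bytes obj occupies once pickled."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def fits_in_redis(obj, limit=REDIS_BULK_LIMIT_BYTES):
    """Return True if the pickled object stays within the given limit."""
    return pickled_size(obj) <= limit


# Embedding the data makes the payload grow with the data set.
embedded = {"lr": 0.01, "data": bytes(5_000_000)}

# Passing a location instead keeps the payload tiny; the path is hypothetical.
by_reference = {"lr": 0.01, "data_path": "/hypothetical/path/train.npy"}

print(pickled_size(embedded), pickled_size(by_reference))
```

If the first number approaches the limit, loading the data inside the trainable, as the linked guide describes, keeps it out of the registration call.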

I had the same issue and solved it by following @flowersw's suggestion on large datasets.

Resolved, if not please reopen.

The guide on handling large datasets mentioned above is now located at https://docs.ray.io/en/latest/tune/tutorials/tune-usage.html#handling-large-datasets
