The same code on the same EC2 instance now throws the error below. It appears to come from this line in my code: register_trainable("exp", obj_function)
I have redis '2.10.6' installed.
Any hints on the cause?
Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:60723 to respond...
Waiting for redis server at 127.0.0.1:48080 to respond...
Starting local scheduler with the following resources: {'CPU': 96, 'GPU': 0}.
======================================================================
View the web UI at http://localhost:8889/notebooks/ray_ui33174.ipynb?token=74b2c0738a07e667d60580a4b2f24d7317954eaf830ab0ef
======================================================================
---------------------------------------------------------------------------
ConnectionResetError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
589 for item in command:
--> 590 self._sock.sendall(item)
591 except socket.timeout:
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/redis/client.py in execute_command(self, *args, **options)
666 try:
--> 667 connection.send_command(*args)
668 return self.parse_response(connection, command_name, **options)
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_command(self, *args)
609 "Pack and send a command to the Redis server"
--> 610 self.send_packed_command(self.pack_command(*args))
611
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
602 raise ConnectionError("Error %s while writing to socket. %s." %
--> 603 (errno, errmsg))
604 except:
ConnectionError: Error 104 while writing to socket. Connection reset by peer.
During handling of the above exception, another exception occurred:
ConnectionResetError Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
589 for item in command:
--> 590 self._sock.sendall(item)
591 except socket.timeout:
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
<ipython-input-6-48c88d974e24> in <module>()
75 ray.init()
76
---> 77 register_trainable("exp", obj_function) #registers the above config and the objective function
78
79 hpo=HyperOptSearch(space, max_concurrent=4, reward_attr="neg_mean_loss") #smaller is better for log loss
~/anaconda3/lib/python3.6/site-packages/ray/tune/registry.py in register_trainable(name, trainable)
36 raise TypeError("Second argument must be convertable to Trainable",
37 trainable)
---> 38 _global_registry.register(TRAINABLE_CLASS, name, trainable)
39
40
~/anaconda3/lib/python3.6/site-packages/ray/tune/registry.py in register(self, category, key, value)
77 self._to_flush[(category, key)] = pickle.dumps(value)
78 if _internal_kv_initialized():
---> 79 self.flush_values()
80
81 def contains(self, category, key):
~/anaconda3/lib/python3.6/site-packages/ray/tune/registry.py in flush_values(self)
99 def flush_values(self):
100 for (category, key), value in self._to_flush.items():
--> 101 _internal_kv_put(_make_key(category, key), value, overwrite=True)
102 self._to_flush.clear()
103
~/anaconda3/lib/python3.6/site-packages/ray/experimental/internal_kv.py in _internal_kv_put(key, value, overwrite)
29 worker = ray.worker.get_global_worker()
30 if overwrite:
---> 31 updated = worker.redis_client.hset(key, "value", value)
32 else:
33 updated = worker.redis_client.hsetnx(key, "value", value)
~/anaconda3/lib/python3.6/site-packages/ray/utils.py in _wrapper(*args, **kwargs)
322 def _wrapper(*args, **kwargs):
323 with self.lock:
--> 324 return orig_attr(*args, **kwargs)
325
326 self._wrapper_cache[attr] = _wrapper
~/anaconda3/lib/python3.6/site-packages/redis/client.py in hset(self, name, key, value)
1990 Returns 1 if HSET created a new field, otherwise 0
1991 """
-> 1992 return self.execute_command('HSET', name, key, value)
1993
1994 def hsetnx(self, name, key, value):
~/anaconda3/lib/python3.6/site-packages/redis/client.py in execute_command(self, *args, **options)
671 if not connection.retry_on_timeout and isinstance(e, TimeoutError):
672 raise
--> 673 connection.send_command(*args)
674 return self.parse_response(connection, command_name, **options)
675 finally:
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_command(self, *args)
608 def send_command(self, *args):
609 "Pack and send a command to the Redis server"
--> 610 self.send_packed_command(self.pack_command(*args))
611
612 def can_read(self, timeout=0):
~/anaconda3/lib/python3.6/site-packages/redis/connection.py in send_packed_command(self, command)
601 errmsg = e.args[1]
602 raise ConnectionError("Error %s while writing to socket. %s." %
--> 603 (errno, errmsg))
604 except:
605 self.disconnect()
ConnectionError: Error 104 while writing to socket. Connection reset by peer.
Does anyone know of any changes that may have caused this, or what to troubleshoot for this issue?
This is on a single machine, right?
Can you look at the logs under /tmp/ray or /tmp/raylogs/ depending on your version of Ray and see if the files that start with redis-* say anything interesting?
What if you just do import ray; ray.init() and nothing else. Does that work?
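In case it helps anyone following along, here is a rough sketch of how the redis-* logs could be scanned for suspicious lines. It assumes the log files sit under /tmp/ray or /tmp/raylogs as mentioned above; the function name scan_redis_logs and the keyword list are my own choices for illustration, not anything Ray provides.

```python
import glob
import os


def scan_redis_logs(log_dirs=("/tmp/ray", "/tmp/raylogs"),
                    keywords=("WARNING", "ERROR", "OOM")):
    """Return (path, line) pairs from redis-* files containing any keyword.

    The directories and keywords are assumptions for illustration; adjust
    them to wherever your Ray version writes its logs.
    """
    hits = []
    for log_dir in log_dirs:
        # "**" with recursive=True also matches files directly in log_dir.
        pattern = os.path.join(log_dir, "**", "redis-*")
        for path in sorted(glob.glob(pattern, recursive=True)):
            if not os.path.isfile(path):
                continue
            with open(path, errors="replace") as fh:
                for line in fh:
                    if any(keyword in line for keyword in keywords):
                        hits.append((path, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    for path, line in scan_redis_logs():
        print(f"{path}: {line}")
```

Missing directories simply yield no matches, so it should be safe to run regardless of which of the two locations exists on your machine.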
Hello @robertnishihara, I have the same issue here.
When I use my own TensorFlow code with ray.tune, it gives me this error message:
redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.
But when I run tune_mnist_ray.py on my computer, it works well and runs to completion.
The redis-* file under /tmp/raylogs says something like this:
36799:M 21 Oct 17:23:46.804 # WARNING: The TCP backlog setting of 511 cannot be enforced because /proc/sys/net/core/somaxconn is set to the lower value of 128.
36799:M 21 Oct 17:23:46.804 # Server started, Redis version 3.9.102
36799:M 21 Oct 17:23:46.804 # WARNING overcommit_memory is set to 0! Background save may fail under low memory condition. To fix this issue add 'vm.overcommit_memory = 1' to /etc/sysctl.conf and then reboot or run the command 'sysctl vm.overcommit_memory=1' for this to take effect.
36799:M 21 Oct 17:23:46.804 # WARNING you have Transparent Huge Pages (THP) support enabled in your kernel. This will create latency and memory usage issues with Redis. To fix this issue run the command 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' as root, and add it to your /etc/rc.local in order to retain the setting after a reboot. Redis must be restarted after THP is disabled.
Is there any other info I can give you so that you can help us fix this bug?
Does your own code work on your own laptop?
I'm having the same issue. Did anybody solve it?
me too
Can you try the nightly wheels (see https://ray.readthedocs.io/en/latest/installation.html#trying-snapshots-from-master) and see if the issue still occurs?
Hi, thanks for responding. I tried the nightly wheels for Python 3.6 and the same error still occurs:
redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.
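Description: If you put the PySpark session (spark = SparkSession.. ) and the data fetching outside the Ray code as a data source, and then pass it into the Ray code, this error is raised.
Solution: Move the Spark initialization into Ray's initialization (_setup(self, config)), and define a data-fetching function inside the Ray class.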
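The shape of that workaround could look roughly like the sketch below. To keep it runnable without Ray or PySpark, FakeSession is a hypothetical stand-in for a heavy handle such as a SparkSession, and MyTrainable is a plain class that only mimics the _setup(self, config) hook mentioned above; neither is the real Ray or Spark API.

```python
class FakeSession:
    """Hypothetical stand-in for a heavy handle such as a SparkSession."""

    def __init__(self, app_name):
        self.app_name = app_name

    def read_rows(self, n_rows):
        # Pretend to fetch n_rows records from some backing store.
        return [{"id": i} for i in range(n_rows)]


class MyTrainable:
    """Plain class mimicking the _setup hook; not Ray's Trainable."""

    def _setup(self, config):
        # Create the session here, inside the trainable, instead of
        # building it outside and passing the object in.
        self.session = FakeSession(config["app_name"])
        self.rows = self._load_data(config["n_rows"])

    def _load_data(self, n_rows):
        # Data fetching lives on the class, as described above.
        return self.session.read_rows(n_rows)


if __name__ == "__main__":
    trainable = MyTrainable()
    trainable._setup({"app_name": "demo", "n_rows": 3})
    print(len(trainable.rows))
```

The point is only that the config carries small values (names, sizes) and the heavy objects are built inside _setup, so nothing large has to be serialized when the trainable is registered.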
I get the sense this is an issue when your environment consumes a lot of memory. @robertnishihara is there an easy way around this?
@dmadeka If the issue is that Redis is crashing because it is using too much memory, you can limit it by calling ray.init(redis_max_memory=10**9) or something like that (in bytes).
I'm having the same issue. When trying to register a new environment, the following error comes up:
ConnectionResetError Traceback (most recent call last)
~/miniconda/lib/python3.7/site-packages/redis/connection.py in send_packed_command(self, command)
599 for item in command:
--> 600 self._sock.sendall(item)
601 except socket.timeout:
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
1 from ray.tune.registry import register_env
----> 2 register_env("CryptoTrain", env_creator)
~/miniconda/lib/python3.7/site-packages/ray/tune/registry.py in register_env(name, env_creator)
61 if not isinstance(env_creator, FunctionType):
62 raise TypeError("Second argument must be a function.", env_creator)
---> 63 _global_registry.register(ENV_CREATOR, name, env_creator)
64
65
~/miniconda/lib/python3.7/site-packages/ray/tune/registry.py in register(self, category, key, value)
89 self._to_flush[(category, key)] = pickle.dumps(value)
90 if _internal_kv_initialized():
---> 91 self.flush_values()
92
93 def contains(self, category, key):
~/miniconda/lib/python3.7/site-packages/ray/tune/registry.py in flush_values(self)
111 def flush_values(self):
112 for (category, key), value in self._to_flush.items():
--> 113 _internal_kv_put(_make_key(category, key), value, overwrite=True)
114 self._to_flush.clear()
115
~/miniconda/lib/python3.7/site-packages/ray/experimental/internal_kv.py in _internal_kv_put(key, value, overwrite)
40
41 if overwrite:
---> 42 updated = worker.redis_client.hset(key, "value", value)
43 else:
44 updated = worker.redis_client.hsetnx(key, "value", value)
~/miniconda/lib/python3.7/site-packages/redis/client.py in hset(self, name, key, value)
2672 Returns 1 if HSET created a new field, otherwise 0
2673 """
-> 2674 return self.execute_command('HSET', name, key, value)
2675
2676 def hsetnx(self, name, key, value):
~/miniconda/lib/python3.7/site-packages/redis/client.py in execute_command(self, *args, **options)
772 connection = pool.get_connection(command_name, **options)
773 try:
--> 774 connection.send_command(*args)
775 return self.parse_response(connection, command_name, **options)
776 except (ConnectionError, TimeoutError) as e:
~/miniconda/lib/python3.7/site-packages/redis/connection.py in send_command(self, *args)
618 def send_command(self, *args):
619 "Pack and send a command to the Redis server"
--> 620 self.send_packed_command(self.pack_command(*args))
621
622 def can_read(self, timeout=0):
~/miniconda/lib/python3.7/site-packages/redis/connection.py in send_packed_command(self, command)
611 errmsg = e.args[1]
612 raise ConnectionError("Error %s while writing to socket. %s." %
--> 613 (errno, errmsg))
614 except: # noqa: E722
615 self.disconnect()
ConnectionError: Error 104 while writing to socket. Connection reset by peer.
HELP PLEASE!!!!!! It worked until yesterday, and neither the code nor anything else has changed. The only change is the size of the files I'm using as the data for the environment, but setting the Redis memory or object store memory higher is not helping at all... the issue remains the same.
Can you share a script for reproducing the issue?
I have the same issue when running without Redis on my GPU server.
ConnectionError: Error 104 while writing to socket. Connection reset by peer.
Can I remove the redis connection?
How big is your data set? I kept getting ConnectionResetError: [Errno 104] Connection reset by peer errors. Then, I followed the guide on working with large objects (mine were in the GB range) here: https://ray.readthedocs.io/en/latest/tune-usage.html#handling-large-datasets
and I no longer got that error.
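A quick way to check whether you are in the large-object situation is to measure how big your data is once serialized. The tracebacks above show the registry pickling the registered value and writing it to Redis with HSET, so a very large payload is a plausible suspect. The sketch below uses the standard library pickle only; the helper names and the 10 MiB threshold are arbitrary choices of mine, and Ray may use a different pickler that also captures data referenced by your function, so treat the number as a rough indication.

```python
import pickle

# Arbitrary illustrative threshold, not a documented Ray or Redis limit.
DEFAULT_LIMIT_BYTES = 10 * 1024 * 1024


def pickled_size_bytes(obj):
    """Return the size in bytes of obj when pickled with the stdlib pickler."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))


def is_too_large(obj, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return True if the pickled form of obj exceeds limit_bytes."""
    return pickled_size_bytes(obj) > limit_bytes


if __name__ == "__main__":
    small_config = {"lr": 0.01, "layers": [64, 64]}
    large_blob = bytes(16 * 1024 * 1024)  # 16 MiB of zeros as dummy data
    print("config too large:", is_too_large(small_config))
    print("blob too large:", is_too_large(large_blob))
```

If the data your trainable or environment creator references comes out in the hundreds of megabytes or more, the large-datasets guide linked above is probably the place to look.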
I had the same issue and solved it by following @flowersw's suggestion on large datasets.
Resolved. If not, please reopen.
The guide on handling large datasets linked above is now located at https://docs.ray.io/en/latest/tune/tutorials/tune-usage.html#handling-large-datasets