Every now and again (usually after leaving the computer for several hours), I lose all ability to talk with the sway IPC, however it can work intermittently:

strace on the script when the above error has occurred, which seems like the sway server has gone:
For reference this is what normal output looks like:
So far the simplest way I have been able to reproduce this (albeit have only tried this once so maybe I just got lucky), was to have 5 terminals endlessly looping swaymsg -t get_tree and then leaving it until the next morning.
Need some pointers on what else to try in order to pinpoint what is going on.
Possibly related to #5043
Have been stepping through sway and swaymsg with gdb but haven't been able to pinpoint anything yet.
Interestingly, apart from get_tree and get_outputs, all the other commands work, even though they are all using the same socket fd... :/
OK I finally pinpointed it down.
The server is shutting down the IPC connection because the response is too large (just over 4MB) (ipc-server:912
I dumped the response contents using gdb:
https://raw.githubusercontent.com/ammgws/sway/temporary/gdbmemdump.prettyjson
The reason the response is so large is because the modes array for one of my three monitors has tens of thousands of entries:
rg "refresh\":" gdbmemdump.prettyjson | wc -l
73481
@emersion Any pointers on how to continue debugging this issue? I have no idea why it's reporting so many different (refresh) modes.
Eh, that's pretty interesting. It's a good thing that https://github.com/swaywm/wlroots/pull/1968 has been merged, otherwise all these modes would have been sent to each client.
Can you check whether https://github.com/ascent12/drm_info reports also these modes? And can you try reading the EDID with https://git.linuxtv.org/edid-decode.git/ too?
I restarted my computer before I could check the output of those commands, but it just happened again so here they are:
before=output of the command from 2 days ago
after=now
drm_info -j (it cuts off at 612 lines if not in JSON format)
before clock appears 107 times
after clock appears 107 times
Both appear to be the same to me.
This seems like an amdgpu bug to me.
Starting two days ago, I ran get_outputs every 5 minutes and this was the result: https://gist.github.com/ammgws/39fde4a420f448cf4b5cec4b28d3c890
You can see the output start growing in size after the March 3 00:47 entry.
There is a pattern with the size increase, as the exact same increments (94941, 92064) appear a lot, which means its probably just repeating the same data each time.
cat swaymsg_getoutputs_382 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
32
cat swaymsg_getoutputs_383 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
280
cat swaymsg_getoutputs_384 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
1303
cat swaymsg_getoutputs_385 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
2295
cat swaymsg_getoutputs_387 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
4310
cat swaymsg_getoutputs_388 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
5302
cat swaymsg_getoutputs_389 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
6325
cat swaymsg_getoutputs_390 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
7317
cat swaymsg_getoutputs_391 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
8340
cat swaymsg_getoutputs_392 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
9332
cat swaymsg_getoutputs_394 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
11347
cat swaymsg_getoutputs_395 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
12339
cat swaymsg_getoutputs_396 | jq -r '.[] | select(.name=="HDMI-A-1")' | rg "refresh" | wc -l
13362
...
etc
@emersion Was there something in the output that made you think that?
Yes, the kernel reporting duplicate modes (see drm_info output). Please report this to amdgpu.
OK, I will see if I can get around to reporting it to amdgpu sometime soon.
Although I wonder what triggers sway to start increasing the modes array since the output of drm_info looks unchanged from the start, yet the increase in get_outputs appears to start at a random time.
Wait, can you dump your configuration? Do you use output โฆ mode --custom?
My config is here
Not using --custom
I see. Are you sure the drm_info output doesn't grow?

I only ran it once before and after so I am not sure if it grew in between or not.
I can try recording that next time as well?
I could try get a sway debug log as well, but I need something will autorotate/trim the log file while sway is running, otherwise it gets too big too soon and I don't have much HDD space at the moment.
Would you happen to have any recommendations for that?
This time I ran drm_info at the same time. The output of drm_size doesn't balloon, but it does fluctuate a little bit (haven't checked the actual files yet but I think the suspect monitor is dropping in and out), and when those fluctuations happen it affects sway.
I may have a way to be able to trigger this without leaving it all day, so in that case I can provide a sway debug log as well.
Haven't had time to cull it down further, but this is my sway debug log (cut columns to 150 chars): https://gist.github.com/324282fc87fdc32bded8b49972bf4143
The monitor in question (HDMI-A-1) is connecting/disconnecting a lot. I guess every time the modeset is requested it keeps adding the modes to the ever-growing modes array?
rg "' .*connected" sway_cut.log | wc -l
409
rg "Requesting modeset for 'HDMI-A-1'" sway_cut.log | wc -l
202
This is the size of get_outputs at 5 min intervals:
2020-03-04 19:27 12977
2020-03-04 19:32 12977
2020-03-04 19:37 12977
2020-03-04 19:42 12977
2020-03-04 19:47 12981
2020-03-04 19:52 12981
2020-03-04 19:57 8891
2020-03-04 20:02 113647
2020-03-04 20:07 205711
2020-03-04 20:12 300652
2020-03-04 20:17 392716
2020-03-04 20:22 487657
I will have a further look tomorrow to see if there is a smoking gun in the log.
sort swaymsg_getoutputs_41 | uniq | wc -l
116
sort swaymsg_getoutputs_1 | uniq | wc -l
113
sort swaymsg_getoutputs_41 | uniq -c | sort -nr | bat
1 โ 38665 {
2 โ 38662 },
3 โ 9979 "height": 480,
4 โ 9978 "refresh": 60000
5 โ 9972 "width": 720,
6 โ 8732 "width": 1280,
7 โ 8728 "refresh": 50000
8 โ 7483 "refresh": 59940
9 โ 7481 "width": 1920,
10 โ 7481 "height": 1080,
11 โ 6235 "height": 720,
12 โ 4988 "width": 640,
13 โ 3741 "height": 576,
14 โ 2496 "width": 800,
15 โ 2496 "height": 600,
16 โ 2494 "width": 1024,
17 โ 2494 "height": 768,
18 โ 1253 "width": 1440,
19 โ 1248 "height": 1024,
20 โ 1247 "width": 1680,
21 โ 1247 "refresh": 60317
22 โ 1247 "refresh": 60020
23 โ 1247 "refresh": 60004
24 โ 1247 "refresh": 59901
25 โ 1247 "refresh": 59883
26 โ 1247 "refresh": 56250
27 โ 1247 "height": 900,
28 โ 1247 "height": 800,
29 โ 1247 "height": 1050,
30 โ 1246 "refresh": 70082
31 โ 1246 "refresh": 70069
32 โ 1246 "refresh": 66667
33 โ 1246 "refresh": 59910
34 โ 1246 "height": 400,
35 โ 15 },
36 โ 13 "width": 0,
37 โ 11 "y": 0,
38 โ 11 "height": 0
39 โ 10 "x": 0,
40 โ 9 ],
41 โ 3 "window_rect": {
42 โ 3 "window": null,
43 โ 3 "urgent": false,
44 โ 3 "type": "output",
45 โ 3 "transform": "normal",
This suggests to me that the same modes are duplicated over and over again.
To be honest I'm a bit lost in the code base, but I'm trying to track down whether these duplicate entries already exist in wlr_output->modes before it's used by the IPC code, or whether the duplication happens after that.
@emersion Would checking for duplicates while building the modes_array JSON be an acceptable solution? Or is it just a bandaid fix?
Would checking for duplicates while building the modes_array JSON be an acceptable solution? Or is it just a bandaid fix?
No, the root cause needs to be discovered.
I believe I can reproduce this issue - in my case, it appears to coincide with a noticeable cursor lag when moving between window boundaries. Debug log here: https://gist.github.com/kliu128/8125d498a470f682726bce1389f80264 I'm using one HiDPI 3000x2000 internal monitor plus two HDMI monitors connected through separate USB-C adapters, which it seems every Linux display manager has choken on (GNOME refuses to use DPMS to turn them off, KDE Wayland thinks one of them disconnects whenever it turns off) except for sway for the most part. The issue seems to manifest primarily after using DPMS to turn off the displays and then waiting a long time. It seems like this cursor lag issue has also been seen on reddit, e.g. https://www.reddit.com/r/swaywm/comments/hf9b8r/cursor_lag_when_crossing_window_boundaries/, but not sure if it's the same cause.
A temporary workaround I've found is to run swaymsg reload before turning the displays back on, as this seems to reset the modes array and make the problem go away.
I can easily reproduce by unplugging and plugging back HDMI cable in my AMD GPU connected to monitor through passive DVI-D to HDMI adapter. But swaymsg reload doesn't reset array in this case and drm_info shows no duplicates.
Most helpful comment
No, the root cause needs to be discovered.