Synapse: Synapse segfaults on alpinelinux

Created on 22 Sep 2016  路  33Comments  路  Source: matrix-org/synapse

Hi,
Im running Synapse v0.18.0 with Python 2.7.12. The users are authenticated agains our LDAP server.
When using Riot.im on iOS everything works ok, but the instant I login with Chrome, or Safari, the servers crash and I guess the DB gets corrupted because I can't even use the iOS App anymore.

The las request Riot make before the crash is always a synapse.rest.client.v2_alpha.sync, like:

2016-09-22 14:40:55,243 - synapse.access.https.8448 - 59 - INFO - GET-1- <IP_ADDRESS> - 8448 - Received request: GET /_matrix/client/r0/sync?filter=0&timeout=0&since=s1_2_0_1_1_1_1&access_token=<redacted>

Which produces the following log:

2016-09-22 14:40:55,243 - synapse.access.https.8448 - 59 - INFO - GET-1- 201.214.200.227 - 8448 - Received request: GET /_matrix/client/r0/sync?filter=0&timeout=0&since=s1_2_0_1_1_1_1&access_token=<redacted>
2016-09-22 14:40:55,245 - synapse.storage.txn - 224 - DEBUG - - [TXN START] {get_user_by_access_token-2}
2016-09-22 14:40:55,246 - synapse.storage.txn - 280 - DEBUG - - [TXN END] {get_user_by_access_token-2} 0.621094
2016-09-22 14:40:55,246 - synapse.util.logcontext - 243 - DEBUG - - Unexpected logging context: GET-1@7f045188cbb0 is not sentinel
2016-09-22 14:40:55,247 - synapse.rest.client.v2_alpha.sync - 114 - INFO - GET-1- /sync: user=DomainSpecificString(localpart=u'username', domain=u'matrix.example.com'), timeout=0, since='s1_2_0_1_1_1_1', set_presence='online', filter_id='0', device_id=u'UNNJJGIJAV'
2016-09-22 14:40:55,247 - synapse.metrics - 212 - INFO - - Collecting gc 0
2016-09-22 14:40:55,248 - synapse.storage.txn - 224 - DEBUG - - [TXN START] {insert_client_ip-3}
2016-09-22 14:40:55,249 - synapse.metrics - 212 - INFO - - Collecting gc 1
2016-09-22 14:40:55,249 - synapse.storage._base - 483 - DEBUG - - [SQL] UPDATE user_ips SET last_seen = ? WHERE access_token = ? AND ip = ? AND user_id = ? AND user_agent = ? AND device_id = ? Args=[1474555255247, 'MDAyMGxvY2F0aW9uIG1hdHJpeC56Ym94YXBwLmNvbQowMDEzaWRlbnRpZmllciBrZXkKMDAxMGNpZCBnZW4gPSAxCjAwMmRgPCAxNDc0NTU4NDUyNDQ4CjAwMmZzaWduYXR1cmUgU08gtiGusp7e3by4rWozEPvS4ZPLmNJvopHRCcuwXU8K', '201.224.209.256', u'@username:matrix.example.com', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36', u'U769GIJAV']
2016-09-22 14:40:55,275 - synapse.storage.txn - 280 - DEBUG - - [TXN END] {insert_client_ip-3} 27.266113
2016-09-22 14:40:55,276 - synapse.storage.txn - 224 - DEBUG - - [TXN START] {get_user_filter-4}
2016-09-22 14:40:55,276 - synapse.storage.txn - 280 - DEBUG - - [TXN END] {get_user_filter-4} 0.373047
2016-09-22 14:40:55,278 - synapse.util.logcontext - 243 - DEBUG - GET-1- Unexpected logging context: sentinel is not GET-1@7f045188cbb0
2016-09-22 14:40:55,278 - synapse.storage.txn - 224 - DEBUG - GET-1- [TXN START] {get_updated_account_data_for_user-5}
/start.sh: line 57:    16 Segmentation fault      (core dumped) python -m synapse.app.homeserver --config-path /data/homeserver.yaml

_Note_: I changed the real username, domain and IPAddress.

Before the Crash, Riot make the following request without problem:

  1. https://matrix.example.com/_matrix/client/r0/presence/username/status
  2. https://matrix.example.com/_matrix/client/unstable/keys/upload/JZBZGPVPPK?access_token=
  3. https://matrix.example.com/_matrix/client/r0/voip/turnServer?access_token
  4. https://matrix.example.com/_matrix/client/r0/pushrules/?access_token
  5. https://matrix.example.com/_matrix/client/r0/user/user/filter?access_token
  6. https://matrix.example.com/_matrix/client/r0/sync?filter=0&timeout=30000&_cacheBuster=1474554706574&access_token
  7. https://matrix.example.com/_matrix/client/unstable/sendToDevice/m.new_device/m147455436634.0?access_token=
  8. https://matrix.example.com/_matrix/client/r0/user/username/account_data/m.direct?access_token
  9. https://matrix.example.com/_matrix/client/r0/publicRooms?access_token=

Tha last return an empty {} because there is no public rooms. And the next request is the one that crash the server:

https://matrix.example.com/_matrix/client/r0/sync?filter=0&timeout=30000&since=s1_2_0_1_1_1_1&access_token=

Yesterday I debuged this and found that the crash happen when the json.loads method runs, of the file storage/account_data.py:

global_account_data = {     
  row["account_data_type"]: json.loads(row["content"]) for row in rows
}

The SegFault Info of GDB is:

Core was generated by `python -m synapse.app.homeserver --config-path /data/homeserver.yaml'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f04551a3a30 in JSON_DecodeObject () from /usr/lib/python2.7/site-packages/ujson.so

And the trace:

#0  0x00007f04551a3a30 in JSON_DecodeObject () from /usr/lib/python2.7/site-packages/ujson.so
#1  0x00007f04551a198f in JSONToObj () from /usr/lib/python2.7/site-packages/ujson.so
#2  0x00007f045b330cd8 in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#3  0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#4  0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#5  0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#6  0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#7  0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#8  0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#9  0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#10 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#11 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#12 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#13 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#14 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#15 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#16 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#17 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#18 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#19 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#20 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#21 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#22 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#23 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#24 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#25 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#26 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#27 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#28 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#29 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#30 0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#31 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#32 0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#33 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#34 0x00007f045b330e5a in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#35 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#36 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#37 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#38 0x00007f045b33237f in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#39 0x00007f045b330dfc in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#40 0x00007f045b330dfc in PyEval_EvalFrameEx () from /usr/lib/libpython2.7.so.1.0
#41 0x00007f045b332b4c in PyEval_EvalCodeEx () from /usr/lib/libpython2.7.so.1.0
#42 0x00007f045b2d6e02 in ?? () from /usr/lib/libpython2.7.so.1.0
#43 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#44 0x00007f045b2c5d07 in ?? () from /usr/lib/libpython2.7.so.1.0
#45 0x00007f045b2ba13c in PyObject_Call () from /usr/lib/libpython2.7.so.1.0
#46 0x00007f045b32ac88 in PyEval_CallObjectWithKeywords () from /usr/lib/libpython2.7.so.1.0
#47 0x00007f045b357a95 in ?? () from /usr/lib/libpython2.7.so.1.0
#48 0x00007f045b6674ce in ?? () from /lib/ld-musl-x86_64.so.1
#49 0x0000000000000000 in ?? ()

Most helpful comment

For anyone who is following this discussion. We added a fix for stacksize issues within python which should fix the ujson segfault.

Its in alpine edge and backported to v3.7. Hope somebody can verify this so we can close this issue.

ref: https://github.com/alpinelinux/aports/commit/2f35283fec8ec451fe5fb477dd32ffdcc0776e89

All 33 comments

I think this is an issue with the docker image. I suggest raising it at https://github.com/silvio/docker-matrix.

The problem isn't docker related. This problem exists on all alpine installations without docker.

I have this issue too, on Synapse v0.18.7, would you please reopen the issue?

It's an issue with alpine linux, not synapse.

okay, thanks

Has this issue been reported on their side somewhere? I can not find anything in their bugtracker.
Edit: After searching some more, and still not finding anything, I posted a bug report: https://bugs.alpinelinux.org/issues/6765

Yes, same problem. Can synapse avoid using ujson?

@ptman I don't think it will. Dendrite however will most likely not use ujson, once that is ready. Until then, I guess you'll just need to avoid alpine linux for running synapse.

I also tried to report this problem to ujson https://github.com/esnme/ultrajson/issues/254

We're not going to stop using ujson just because alpinelinux ship a broken version of it, no.

(or rather, whatever brokenness it is that alpinelinux has)

@richvdh I second that, but I think it will be a good idea to Throw a better error when the json parsing fails.

A segfault takes out the whole python process. We can't catch it and recover from within python.

hah! well, i guess that's one way to work around the segfaulting dep :D

How much faster than json is ujson anyway?

The following workaround seems to be working for me: https://github.com/esnme/ultrajson/issues/254#issuecomment-314862445

Despite wishful thinking on my part, this doesn't seem to have been fixed at ultrajson or alpinelinux, so it looks like we're going to have to find a workaround for this.

How much faster than json is ujson anyway?

"much", according to erik.

@richvdh hasn't the workaround already been found? s/ujson/json/g
It's a simple enough patch.

also, there is an open PR, they seem to be working on it over at ujson: https://github.com/esnme/ultrajson/pull/281

@richvdh hasn't the workaround already been found? s/ujson/json/g

We'd prefer not to drop all use of ujson, for performance reasons.

Could anyone who has been having trouble with this see if the patch in https://github.com/matrix-org/synapse/commit/e961569e9330da2108fbf75d8d51938a953c8469 helps them?

I've switched to Debian as a base long ago, but I still have the dockerfile from when I tried to use alpine. Do you want it?

I was trying to avoid setting up docker environments and hoping somebody else could test it...

I've tested your commit in this image. So far so good. I will come back here if I notice something bad.

For anyone who is following this discussion. We added a fix for stacksize issues within python which should fix the ujson segfault.

Its in alpine edge and backported to v3.7. Hope somebody can verify this so we can close this issue.

ref: https://github.com/alpinelinux/aports/commit/2f35283fec8ec451fe5fb477dd32ffdcc0776e89

@clandmeter Seems to work great, thanks
@richvdh e961569 no longer seems necessary

Anyone interested can take a look at https://github.com/ptman/synapse-docker and https://hub.docker.com/r/ptman/synapse/

great news. @silvio any chance you could bump your docker images to include the fix?

@richvdh: @silvio has stopped maintaining the docker image, and it was Debian based for more than a year now anyway.

@jcgruenhage @richvdh : This is true, the current state of the dockerimages are using debian. I have added @andreaspeters as contributor to my docker repository. Maybe he wants to switch back to alpine based distributions.

@silvio @richvdh @jcgruenhage : I my fork I changed to "debian:stable-slim". Before I will merge it with the official docker-matrix version, I need some peoples who like to test it (https://hub.docker.com/r/avhost/docker-matrix/tags/). Just now, I have no plan to change back to alpine.

thanks all. either way it sounds like we can finally close this bug.

Was this page helpful?
0 / 5 - 0 ratings