Che: Plugin brokering sometimes fails due to WebSocket closing

Created on 26 Nov 2019 · 3 Comments · Source: eclipse/che

I've got a relatively large VS Code extension (~30 MB) that I'm trying to load into my Che workspace. You can find the meta.yaml for the associated Che plugin here

We're finding that our internal OCP4 clusters have a relatively slow connection to the download server where the extension is hosted (http://download.eclipse.org/), so plugin brokering sometimes times out. To prevent this, we've set CHE_WORKSPACE_PLUGIN__BROKER_WAIT__TIMEOUT__MIN to a larger value (such as 15 minutes).

However, plugin brokering still fails occasionally even with the larger timeout. This time the cause is different: the WebSocket appears to get disconnected, and then the plugin broker crashes. We see the following in the logs for the plugin broker:

2019/11/26 16:44:23 Copying VS Code plugin ''
2019/11/26 16:44:23 Copying VS Code extension archive from '/tmp/vscode-extension-broker782205207/codewind-theia.vsix' to '/plugins/eclipse.codewind-plugin.latest.oapumhxgpi.codewind-theia.vsix' for plugin ''
2019/11/26 16:44:23 Trying to send event of type 'broker/log' to closed tunnel 'tunnel-1'

This corresponds to the following line in the plugin broker, which calls log.Fatalf (causing the plugin broker to exit):
https://github.com/eclipse/che-plugin-broker/blob/21952b6098bd8edab883c290561f6e1cd08d22da/common/connect.go#L35

log.Fatalf("Trying to send event of type '%s' to closed tunnel '%s'", e.Type(), tb.tunnel.ID())

Should the plugin broker attempt to reconnect the WebSocket here rather than crashing? Is there anything that can be done to prevent the WebSocket from disconnecting?
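As a rough illustration of the alternative being asked about, here is a minimal Go sketch in which publishing to a closed tunnel buffers the event and returns an error instead of exiting the process. This is not the che-plugin-broker code: `Tunnel`, `BufferingBus`, `Reattach`, `ErrTunnelClosed`, `memTunnel`, and `logEvent` are all hypothetical names made up for this example, and how the broker would actually obtain a new tunnel after a disconnect is left out entirely.

```go
package main

import (
	"errors"
	"log"
	"sync"
)

// Event has the same minimal shape as in the log line above: it exposes a Type().
type Event interface {
	Type() string
}

// Tunnel is a hypothetical stand-in for the broker's WebSocket tunnel.
type Tunnel interface {
	ID() string
	IsClosed() bool
	Send(e Event) error
}

// ErrTunnelClosed is returned instead of calling log.Fatalf.
var ErrTunnelClosed = errors.New("tunnel closed")

// BufferingBus (hypothetical) keeps events that could not be delivered
// so they can be replayed once a new tunnel is attached.
type BufferingBus struct {
	mu      sync.Mutex
	tunnel  Tunnel
	pending []Event
}

// Publish sends the event if the tunnel is open; otherwise it logs,
// buffers the event, and returns ErrTunnelClosed without exiting.
func (b *BufferingBus) Publish(e Event) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.tunnel == nil || b.tunnel.IsClosed() {
		log.Printf("tunnel closed, buffering event of type '%s'", e.Type())
		b.pending = append(b.pending, e)
		return ErrTunnelClosed
	}
	return b.tunnel.Send(e)
}

// Reattach swaps in a new tunnel and flushes buffered events in order.
// Events that fail to send stay in the buffer.
func (b *BufferingBus) Reattach(t Tunnel) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tunnel = t
	for len(b.pending) > 0 {
		if err := t.Send(b.pending[0]); err != nil {
			return err
		}
		b.pending = b.pending[1:]
	}
	return nil
}

// memTunnel is an in-memory Tunnel used only to exercise the sketch.
type memTunnel struct {
	id     string
	closed bool
	sent   []Event
}

func (t *memTunnel) ID() string     { return t.id }
func (t *memTunnel) IsClosed() bool { return t.closed }
func (t *memTunnel) Send(e Event) error {
	if t.closed {
		return ErrTunnelClosed
	}
	t.sent = append(t.sent, e)
	return nil
}

// logEvent is a toy event carrying the 'broker/log' type seen in the logs.
type logEvent struct{ text string }

func (logEvent) Type() string { return "broker/log" }
```

With this shape, a caller that publishes to a closed `memTunnel` gets `ErrTunnelClosed` back and the event stays buffered; after `Reattach` with an open tunnel, the buffered event is delivered and later `Publish` calls go straight through. Whether buffering, dropping, or retrying is the right policy for the real broker is a separate design question.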

area/plugin-broker lifecycle/stale severity/P2

johnmcollier

Most helpful comment

@johnmcollier a possible workaround for your issue is to use an offline plugin registry.

@amisevsk I am setting severity/P2 because I think that you are in the middle of a refactoring of the plugin broker and it may address this problem (hence there is no real need to add this issue to next sprint backlog). But I may be wrong and we may need a P1 here to make sure that it gets included in the next sprints.