Aspnetcore: Blazor signalr connection breaks after syncing large dom / dom element

Created on 23 Apr 2019  路  27Comments  路  Source: dotnet/aspnetcore

Describe the bug

On login in our application we load a logo which is displayed. Normally this is quite small but since it is user specific it sometimes can be large. With preview 4 the server connection is broken when this is too large. The client then "freezes" - i.e. does not respond to clicks. The clicks do generate websocket messages but no responses. After a timeout the auto reconnect establishes a new connection and everything is then ok (until the next large dom sync).

This worked fine in preview 3.

To Reproduce

I tried to reproduce with a small update to the template project but could not. Let me know if I should try again. It is 100% reproducible in our application.

I've artificially generated various size images and it seems to break when the image is between 25440 and 25450 bytes. Tracing in chrome the websocket message for this dom sync is between 43772 and 43784 bytes (working / non-working).

Please let me know if there is anything I can do to get more diagnostics on this for the case where it fails.

Done area-blazor bug

All 27 comments

It sounds like you are hitting the buffer size limit in SignalR.

@anurse Is there any reason to not remove/increase the limit on the outbound buffer size?

Is there any reason to not remove/increase the limit on the outbound buffer size?

The main reason for the limit is to ensure that clients can't over-allocate server resources. I think I vaguely remember that this may not be as much of an issue in recent previews because of how Pipelines tracks the limit now (+ @davidfowl)

The main reason for the limit is to ensure that clients can't over-allocate server resources.

I can understand limiting the inbound buffer size for this reason, but doesn't the server completely control how much data it wants to send to the client?

Turns out in preview4 we shipped with a 32 MB limit right? Are you sure you're hitting that?

@tn-5 does anything show up in the server logs to indicate the nature of the error?

doesn't the server completely control how much data it wants to send to the client?

I'm having deja vu about this conversation. I feel like we talked about this before and removed the outbound limit. Or maybe Blazor configured it away... I can't find the issue though.

Turns out in preview4 we shipped with a 32 MB limit right? Are you sure you're hitting that?

This is true. Due to a typo we actually had a 32 MB limit in preview 4, so it doesn't seem too likely this would be the case.

doesn't the server completely control how much data it wants to send to the client?

Yes, so you can set the limit higher if you'd like.

After some more testing I'm convinced it is not related to a buffer size per se. At this point I'm thinking it is either a timing issue or something related to the actual dom diff code itself. I've tested very large dom syncs (total dom + individual component) - this all works 100%. In this specific instance the reproducible error happens when I request the data from an async api. I'll investigate further to see if it can be reproduced more reliably.

I've managed to reproduce this reliably with some small changes to the template project. it seems to be a time related issue linked to a large dom sync. The large dom makes some code take a bit longer and then the signalr messages sequences are different. It happens specifically when navigating from one page to another.

Trace detail

Working case:

trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 75 bytes.
trce: Microsoft.AspNetCore.Components.Browser.Rendering.RemoteRenderer[101]
      Begin remote rendering of components on client bsz_lKfQEEbDuvRBtUAOmw.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 123, EndOfMessage: True.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 93, EndOfMessage: True.
dbug: Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher[1]
      Received hub invocation: InvocationMessage { InvocationId: "", Target: "BeginInvokeDotNetFromJS", Arguments: [ 6, Microsoft.AspNetCore.Components.Server, NotifyLocationChanged, 0, ["http://localhost:49993/"] ], StreamIds: [  ] }.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 26, EndOfMessage: True.
dbug: Microsoft.AspNetCore.Components.Server.Circuits.RemoteUriHelper[0]
      Location changed to 'http://localhost:49993/'.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.

Failure case:

trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 75 bytes.
trce: Microsoft.AspNetCore.Components.Browser.Rendering.RemoteRenderer[101]
      Begin remote rendering of components on client S5PSgzO0YLBrGnYHx1rSHQ.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 123, EndOfMessage: True.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 93, EndOfMessage: True.
dbug: Microsoft.AspNetCore.SignalR.Internal.DefaultHubDispatcher[1]
      Received hub invocation: InvocationMessage { InvocationId: "", Target: "BeginInvokeDotNetFromJS", Arguments: [ 6, Microsoft.AspNetCore.Components.Server, NotifyLocationChanged, 0, ["http://localhost:49993/"] ], StreamIds: [  ] }.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[9]
      Message received. Type: Binary, size: 26, EndOfMessage: True.
trce: Microsoft.AspNetCore.Http.Connections.Internal.Transports.WebSocketsTransport[11]
      Sending payload: 86 bytes.
dbug: Microsoft.AspNetCore.Components.Server.Circuits.RemoteUriHelper[0]
      Location changed to 'http://localhost:49993/'.

You'll see in the failure case there is a websocketstransport message between the "invoke notifylocationchanged" and the "remoteurihelper - location changed to..." message.

Reproduce

  • dotnet new blazorserverside
  • Add large dom element to counter.razor
  • <img src="data:image/png;base64, @dummydata" alt="[LOGO]" />
  • Initialise to large data (500k should reliably break it) - functions block:
string dummydata = "";
protected override async Task OnInitAsync()
{
    Random rnd = new Random();
    int i = rnd.Next(32, 127);
    dummydata  = new string((char)(i), 500000); 
}

At this point the application works fine, navigating between different pages works as expected

What seems to then break it is when navigation occurs from another layout (it may also be that there are more different elements to sync). Anyway -, to reproduce:

  • Create a layout file Layout1.razor (no specific content required):
@inherits LayoutComponentBase
<div>@Body</div>
  • Change the fetchdata.razor file to use this layout
  • Add a button on fetchdata to enable navigation back to the counter page (due to the layout change this is required):
    <button onclick="@(() => UriHelper.NavigateTo("counter"))">Counter</button>
  • This also requires: @inject IUriHelper UriHelper

In this scenario the issue reproduces reliably:

  1. Run application
  2. Navigate to "fetchdata" page
  3. Navigate to "counter" page
  4. Clicks are not being processed. After a timeout the connection is reestablished and it works again

@danroth27 I think I saw this the other day when trying out a separate bug.

I think what is happening is that the server takes a long time to process the event/render and the signalr connection gives up along the way.

I will investigate more, but I鈥檓 not sure we will be able to do anything reasonable here.

I was rendering 100.000 elements at the time, so I didn鈥檛 give it much importance given that I was able to reliably render at least 1000 and that if you render large payloads these things are going to happen eventually.

I also have a proposal based on this to not produce large batch updates but to split them into smaller ones to improve responsiveness.

For the record, I would avoid delivering assets this way.

  • <img src="data:image/png;base64, @dummydata" alt="[LOGO]" />

For the record, I would avoid delivering assets this way.

I'd definitely agree with @javiercn here. SignalR isn't really designed with large message payloads in mind. It's designed for frequent small-to-medium size updates. Generally, when users ask about transferring large files "through" SignalR, we strongly recommend hosting the content at some URL (blob storage, CDN, etc.) and sending the URL through SignalR instead of the actual content.

I agree in general that this should not be done, at least not without a good reason.

In our use case this logo is a client logo in a large multi-tenant installation. The logo is stored in the database and retrieved from an API and then displayed. Normally the logo should be small (< 20k) and retrieved only once (since it is present on the main layout across all pages).

We are designing our Blazor app primarily to be used client side (wasm) which means the code will directly receive this data from the api using our normal authentication etc.

We only found this issue originally since there was a misconfiguration on one tenant (resulting in a very large logo - several 100k).

@javiercn I am concerned about the way this is exhibiting itself. In the failure case all the data is actually transferred correctly and the client side seems ok - it still keeps sending signalr messages to the server. It seems more likely that some state in the signalr handling on the server side is getting confused than a a simple timeout (we are talking total transfer times of several ms here, not seconds). I am guessing that this is rather indicative of a bug related to receiving responses out of an expected sequence.

@pranavkm didn鈥檛 you fix this issue the other day in the message pack implementation?

In our use case this logo is a client logo in a large multi-tenant installation. The logo is stored in the database and retrieved from an API and then displayed.

Generally we would also recommend not storing the image content in the database (instead storing it on blob storage and storing a link) but I realize you may not want to redesign your entire app :).

I'd still advise you consider using a standard server-side ASP.NET Core middleware to allow you to fetch a tenant's image using standard HTTP given the necessary data (tenant ID, authentication, etc.) and then use that URL in the image tag. data URLs require transferring the entire image on every render which means you can't take advantage of browser caching. If, instead, you fill in a URL like <img src=/api/tenant/{tenantId}/logo />, the browser can cache that image appropriately and save a lot of wasted bandwidth (even for small logos) on full-page refreshes.

This is a little tangential of course, since we do still need to investigate the Blazor issue further. Just throwing my design thoughts out there 馃樃

@davidfowl I've tested again with latest nightly (11681) and the problem is still there, so it is not fixed.

When I tested now I realised that it only happens when the navigation is performed by using NavigateTo from the UriHelper.

In the repro example you can copy Mainlayout.razor to Layout1.razor. When navigating using the button click (which uses NavigateTo) it fails. When clicking the link in the navpane it works.

So it seems the issue is somehow related to the UriHelper implementation

I dunno if the fix is merged but there is an issue that @pranavkm found in the current impl. I dunno if it's related to this but it's a bug in the protocol parsing which might be related.

I've investigated this in more detail. The root cause is that there is an issue with the lock handling in the signalR HubConnectionContext.

Specifically, the use case I have fails if the following sequence of events happen:

  1. A call to WriteAsync does not complete immediately and it returns a CompleteWriteAsync task
  2. Immediately afterwards another message is sent, no immediate lock is obtained and it returns a CompleteWriteAsync task.

In this scenario task [2] is triggered when task [1] releases the lock (but before the actual return from CompleteWriteAsync). Upon completion of [2] the _writeLock count is 1 (which is correct). Control flows back to CompleteWriteAsync(to return) and at that point the _writeLock count is 0 (incorrect).

At this point the lock seems to be allocated (although it shouldn't be). No further comms is then allowed until the connection is reset.

I am not sure what the root cause of this is, it may be a subtle issue with the ValueTask being awaited (maybe it caches the same one & then awaits it multiple times which is not allowed?).

So to summarise this issue occurs when the CompleteWriteAsync is still busy and another message needs to be sent and ends up in WriteSlowAsync, awaiting the lock.

@anurse Can we get somebody to look at this? Thanks for looking deeper @tn-5

@BrennanConroy can you take a look?

Some more testing observations:

  • For the failure case the thread ids of the CompleteWriteAsync and WriteSlowAsync tasks are the same
  • I can get it to work reliably by changing the await _writeLock.WaitAsync(); to await _writeLock.WaitAsync(100000, default); where 100000 is an arbitrary large number.

In the second case the SemaphoreSlim implementation executes a different code path and the observed thread ids for the CompleteWriteAsync and WriteSlowAsync tasks are different.

Based on my testing it looks like this could be a workaround until someone with enough time can get to the bottom of this. Logically it almost seems as if there is an issue with the SemaphoreSlim implementation in this specific context, but given that this has been around a while it does seem a bit unlikely.

Adding area-signalr and co-assigning @BrennanConroy to do investigation from SignalR side.

FYI, I have not ignored this. I am able to repro the issue, unfortunately I am having debugger issues which is making it difficult to make progress here.

@BrennanConroy
I think the root cause is the await task; in CompleteWriteAsync. I think it should be await task.AsTask(); since the result of this call is wrapped in a new ValueTask by the caller.

That shouldn鈥檛 matter

I have the same error when I try to transfer a html page with a weight of 23,000 characters (no pictures, just html), the application crashes as described in the first post.

We're continuing to dig through this. Moving out to preview7 because we aren't confident we can have a fix for preview6.

We've identified the issue and @rynowak is working on a fix

Was this page helpful?
0 / 5 - 0 ratings