Apollo-ios: [Bug] 0.27.1 introduced an bug when two queries are sent one after another

Created on 17 May 2020 · 34Comments · Source: apollographql/apollo-ios

For this function in ApolloClient,
@discardableResult public func fetch(query: Query,
cachePolicy: CachePolicy = .returnCacheDataElseFetch,
context: UnsafeMutableRawPointer? = nil,
queue: DispatchQueue = DispatchQueue.main,
resultHandler: GraphQLResultHandler? = nil) -> Cancellable

[Bug] 0.27.1 introduced a bug when calling the function above, one after another, with different queries. One of the two queries would not call the "resultHandler" callback. Sometimes the one that does call the callback would come back with this error: dataForRequestNotFound.

This bug does not reproduce in 0.24.1. I will check whether it reproduces in 0.25.1 and report back.

bug networking-stack

Source

MaximillionMai

Most helpful comment

We can no longer reproduce the issue on 0.28.0. @designatednerd thank you so much!

yuriiik on 10 Jun 2020

🎉5

All 34 comments

@MaximillionMai Can you confirm which version this breaks on for you? That will help me do a comparison between that version and the previous version to figure out what might have changed here.

designatednerd on 17 May 2020

Hi @designatednerd! We're experiencing the same issue. Testing shows that it reproduces in 0.27.1 and does NOT reproduce in 0.24.1 and 0.25.1.

yuriiik on 19 May 2020

I have just noticed this too, in v0.27.1. In my case it's when calling watch(...) simultaneously for two different queries. I receive a dataForRequestNotFound error occasionally for one of the queries.

sfoulston on 19 May 2020

We've been seeing this occasionally also, have not had time to look into it with more depth as we were assuming it was the dogshit wifi we have currently. Just a series of normal fetches in our case

sneakyness on 19 May 2020

Looks like the problem is here - the data object that's been dequeued for the task which has just finished is nil, which is weird, since we create an empty data object for each task before we start it.

This was introduced in 0.27.0 so anything prior to that will not be affected.

Thought: Are any of you calling cancel directly on the Cancelable returned from URLSessionClient?

designatednerd on 19 May 2020

Nope! The only place I reliably see it is when I'm calling a series of fetch requests one after another, at one point I kick off a request in parallel for each of the objects received, but the request that fails isn't always one of those.

sneakyness on 19 May 2020

@sneakyness Got a rough guess on how many requests at once you've got going before you see this issue?

designatednerd on 19 May 2020

Also, that error has an associated value that should have the originalRequest from the task that we couldn't pull data for - are you able to tell if there's any pattern to what's failing vs what's working?

designatednerd on 19 May 2020

2-3 completed (each one must finish before the next can start), then 1+ fetch for one type while also making a singular request of another. The 1+ is commonly 4-10+, making more does not seem to change how sporadic it is. I'll start logging that now but IME it's always either of the last two

sneakyness on 19 May 2020

Thanks. Woof, sporadic bugs are my faaaavorite to troubleshoot 😆

designatednerd on 19 May 2020

Unfortunately I'm in the middle of an API Model refactor, if I'm lucky I should be able to automate firing this process over and over to try and reproduce by end of week

sneakyness on 19 May 2020

👍1

Just pushed up a branch where I added a test that checks if 300 identical URL requests hitting the client at the same time (from different threads) create different IDs, and it passes.

Throwing some breakpoints in and doing some po's, it looks like it does create one instance per ID of the data in the self.datas property for URLSessionClient, and none of them are nil. That puts a kibosh on my theory of "Maybe different requests are winding up with the same ID if they're being sent at the same time, causing the data to be deallocated when the first request finishes."

The only place the data for a particular task ID gets removed is either in clearTask or clearAllTasks - clearAllTasks only gets hit when the session invalidates, which would give you a different error. clearTask only gets hit either a) If you cancel via URLSessionClient's cancel(task:) method or b) If the task actually completes.

I've added this level of detail because I'm basically stumped. Clearly, something's causing the data for a particular URLSessionTask identifier key to get nulled out. What the hell is doing that? If anyone's got thoughts, would love to hear them.

designatednerd on 19 May 2020

I'm using DispatchQueue.concurrentPerform, not super sure on how different that is from .userInitiated.

I'm not doing this specifically for the network requests, there's a decent amount of objects being initialized and sorted, just happened to be a clean spot to snag an id I needed for the request.

Actually I've just realized that it might get called multiple times per index, do you think that's causing it? 🤡

sneakyness on 19 May 2020

Actually I've just realized that it might get called multiple times per index

I'm not sure what you mean by this

designatednerd on 19 May 2020

If it's parallel can't it rarely call the same index twice? Especially in cases where the number of tasks is >8 (afaik 8 is the max number of threads it'll kick off)

sneakyness on 19 May 2020

I still am missing something - what is "it" that could be calling the same index multiple times in this context - do you mean the URLSession creating tasks with the same identifier, or something else?

designatednerd on 19 May 2020

concurrentPerform, so it might be calling the same fetch multiple times, but that doesn't explain why it would fail with "dataForRequestNotFound" instead of eventually just succeeding

sneakyness on 19 May 2020

Ah! That makes more sense. I updated the test to use concurrentPerform and yep, multiple tasks with the same ID are getting generated.

designatednerd on 19 May 2020

(basically if multiple tasks with the same ID are getting generated, then my "I will unique this based on the task identifier" strategy falls apart entirely)

designatednerd on 20 May 2020

My case may be different from others in that the first request was fine. If we know for sure this is the cause, I can do whatever to make sure it's only called once.

If there was an implementation of fetch/watch/etc like fetchOnce that either returned a request if there's no task ID, or did nothing/emit something that said this was the case, would that be a good bandaid for the others' cases pending a bigger change?

sneakyness on 20 May 2020

The first request would probably be fine in what you were describing because it's not trying to make multiple requests at the same time, it's once you start making multiple concurrent requests that the ID incrementor fails. Kind of wish they'd gone for a UUID instead of an auto-incrementing int for that identifier...

I'll think on that - for now I need to look at what to do about the fact that the ID isn't unique if it's hit concurrently repeatedly, because I thought Apple was handling that and they ain't. 🙃

designatednerd on 20 May 2020

👋Can I ask a couple of you folks to try out the fix/multifetch branch to see if the changes there make a difference? I've updated the dictionary to key off the URLSessionTask rather than its identifier, so this should stop the issue where something gets ripped out of the dictionary too early. The tests are passing but I'd love to get confirmation this is working in the real world before I PR it.

designatednerd on 21 May 2020

Just tested the fix/multifetch branch. Unfortunately the issue still persists. In our case we just make two identical serial calls to fetch function on the same queue, and for one of these invocations resultHandler is not called back. It happens roughly 4/5 times, sometimes it works as expected.

yuriiik on 22 May 2020

this sounds like completion handler overwritten somewhere without cancelling and new call created without canceling old one

RolandasRazma on 22 May 2020

@RolandasRazma That's definitely what was happening when using the task identifier as the key, since the task identifier would not get incremented if the session was getting hit quickly enough, and everything for the particular key would get hammered by the next task.

@yuriiik - Are you making other calls at the same time, or just the two identical calls? Also, thank you for testing, it's a bit frustrating that the concurrentPerform test didn't uncover this 😞

designatednerd on 22 May 2020

@designatednerd – just two identical calls, one after another.

yuriiik on 22 May 2020

The var is different in mine but otherwise they are the same call. Applied branch, saw the bug once, set a breakpoint, never saw it again 👻

The process of triggering the request was streamlined, I was able to run it on repeat long enough to forget about it and still got nothing!

sneakyness on 22 May 2020

@sneakyness Did it continue to work fine after you removed the breakpoint? I'm thinking it might be a Heisenbug (in that it changes when observed, probably because the breakpoint slows things up enough to stop whatever was happening concurrently before) if it goes back to not working. If it randomly broke once and then worked forevermore, I have no goddamned idea 🙃

@yuriiik Something occurred to me - can you confirm that you're not using Query Watchers? There's a bug in the current version where we're super-agressively cancelling in-flight fetches if a cache result comes in that's fixed with #1156 (which is merged but not shipped). If you are using query watchers, do you mind pulling master and letting me know if that fixes the issue?

designatednerd on 23 May 2020

That lines up with what I figured was going on earlier, the only time it was happening was either after the backend had been rebuilt and the query did a cold run, or if we were in the "best worst zone" of wifi coverage which caused a similar amount of delay.

This weekend I've got some nebulous and grand refactoring planned for the backend, I suspect once the schema tidies up this will disappear. If I manage to cause it again after that I will let you know. I'm inclined to say that this error is a "smell" and that ultimately it's best to find a way to get the result you want with less requests, even if it shouldn't happen

sneakyness on 23 May 2020

@designatednerd Sorry to get back to you late! It was indeed introduced in version 0.27.1.
For me, it repros when I call two different query calls one right after another.

MaximillionMai on 24 May 2020

@designatednerd we're not using Query Watchers, so that's not the case I think.

yuriiik on 24 May 2020

OK, I think I got it. Basically: Concurrency is hard. 🙃

Please see #1227.

designatednerd on 27 May 2020

1227 was merged and is now available as part of `0.28.0`. If you are still experiencing any problems around this, please open a new issue indicating what version you are using.

Thanks for uncovering this!

designatednerd on 28 May 2020

We can no longer reproduce the issue on 0.28.0. @designatednerd thank you so much!

yuriiik on 10 Jun 2020

🎉5

Was this page helpful?

0 / 5 - 0 ratings

Related issues

Apollo in an Objective-C project

jeromeDms · 5Comments

GraphQLError parsing issue in response?.parsedResponse

StanislavCekunov · 3Comments

Respsone timeout

hiteshborse12 · 4Comments

Add support for Automatic Persisted Queries

ashiemke · 5Comments

Generic parameter 'Query' could not be inferred

farice · 4Comments

Apollo-ios: [Bug] 0.27.1 introduced an bug when two queries are sent one after another

Most helpful comment

All 34 comments

1227 was merged and is now available as part of 0.28.0. If you are still experiencing any problems around this, please open a new issue indicating what version you are using.

Related issues

1227 was merged and is now available as part of `0.28.0`. If you are still experiencing any problems around this, please open a new issue indicating what version you are using.