Lighthouse: Incorrect main document used when retrieving page load errors

Created on 21 Oct 2020 · 5Comments · Source: GoogleChrome/lighthouse

Provide the steps to reproduce

Run LH programmatically on page that redirects a similarly matching URL (e.g. same startsWith(request.url)), where the redirect response does not return text/html as its mime type.

_More Information_

I'm not sure how common this for non-authenticated redirects, but in our case we are running LH on authenticated pages using cookie-based login (similar to this doc, but with Node API).

The redirect response when hitting the requestedUrl (prior to auth) does not return HTML, and so the content-type response header is empty. It then redirects to the authentication server, and finally back to the originally requested document this time with the correct mime type text/html.

What is the current behavior?

LH reports error NOT_HTML even though document loaded in the final URL after redirecting is text/html.

I don't see any documentation on this, but according @brendankenny in #11430, it should be the final matching URL (that is an html document) when loading the page.

The error occurred starting in version 6.2.0 of LH from #11042.

I have a fix locally by using the _last_ matching HTML document (as opposed to the first) in findMainDocument(records, finalURL).

A naive one-line fix that works on my site in test is:

// should probably search backwards using loop
const mainResource = records.slice().reverse().find(request => finalURL.startsWith(request.url) &&
        URL.equalWithExcludedFragments(request.url, finalURL));

As for the logic when finalUrl is not supplied, I think the comment on line 452 still applies.

What is the expected behavior?

LH doesn't fail when valid redirect eventually responds with HTML on a matching finalUrl.

Environment Information

Affected Channels: Node
Lighthouse version: >=6.2.0
Chrome version: N/A
Node.js version: 12.9.2
Operating System: MacOS (and Docker Linux Container)

Related issues

#11042
#11430

P1.5 bug needs-discussion

Source

bradfrosty

Most helpful comment

Hey @patrickhulce, I should have a PR up soon. All good from my end to contribute and CLA has been signed.

bradfrosty on 30 Oct 2020

🎉1 👍1

All 5 comments

Thanks for filing @bfrost2893! A change to findMainDocument is definitely an ultimate solution, but is a sneakily massive change that affects many performance metrics (see https://github.com/GoogleChrome/lighthouse/issues/8984). We'll likely do that work in a major version bump in 8.0.0 far from now.

The more intermediate fix is better resolving the ultimate document just for this error detection similar to how our redirects audit catches client-side redirects.

patrickhulce on 21 Oct 2020

Makes sense re: findMainDocument. It sounds like using the "final document" instead of the "main document" causes some perf metrics to appear much faster, ignoring the cost of redirects?

I tested locally and confirmed it fixes our issue with a new function NetworkAnalyzer.findFinalDocument() based off the redirects audit as you said. Is this the kind of fix you were thinking?

// network-analyzer.js

class NetworkAnalyzer {
  // ...

  static findFinalDocument(networkRecords, finalURL) {
    const documentRequests = [];
    let networkRecord = networkRecords.find(record => record.url === finalURL);
    while (networkRecord) {
      documentRequests.push(networkRecord);
      networkRecord = networkRecord.redirectDestination;
    }

    if (!documentRequests.length) {
      return undefined;
    }

    return documentRequests[documentRequests.length - 1];
  }
}

and then in GatherRunner

// gather-runner.js

class GatherRunner {
  // ...

  static getPageLoadError(passContext, loadData, navigationError) {
    const {networkRecords} = loadData;
    /** @type {LH.Artifacts.NetworkRequest|undefined} */
    let mainRecord;
    try {
      mainRecord = NetworkAnalyzer.findMainDocument(networkRecords, passContext.url);
    } catch (_) {}

    // Modification: retrieve `finalRecord` using `NetworkAnalyzer.findFinalDocument()`
    let finalRecord; 
    try {
      finalRecord = NetworkAnalyzer.findFinalDocument(networkRecords, passContext.url);
    } catch (_) {}

    const networkError = GatherRunner.getNetworkError(mainRecord);
    const interstitialError = GatherRunner.getInterstitialError(mainRecord, networkRecords);

    // Modification: pass `finalRecord` instead of `mainRecord`
    const nonHtmlError = GatherRunner.getNonHtmlError(finalRecord);

    // ... rest of function
  }
}

Should getNetworkError() and getInterstitialError() ever expect finalRecord as opposed to mainRecord?

bradfrosty on 21 Oct 2020

Is this the kind of fix you were thinking?

Roughly, yes :) interested in a PR if you've come this far already? 😃

Should getNetworkError() and getInterstitialError() ever expect finalRecord as opposed to mainRecord?

I can't currently think of a case where it would matter. Those errors are for fatal page errors so the redirect problem (the difference between your case and current findMainDocument) shouldn't apply.

patrickhulce on 21 Oct 2020

@patrickhulce sorry for the delay. Figuring out the legal approval on my end before signing the CLA and submitting a contribution 🙂

bradfrosty on 27 Oct 2020

Hey @patrickhulce, I should have a PR up soon. All good from my end to contribute and CLA has been signed.

bradfrosty on 30 Oct 2020

🎉1 👍1

Was this page helpful?

0 / 5 - 0 ratings