Puppeteer: Different behavior between { headless: false } and { headless: true }

There could be any number of things going on. They could be looking for the Headless added to the UA string and blocking that. Or they could be using some techniques to detect automated access and prevent it.

If it works in non-headless and fails in headless then the site itself is doing something to prevent automated access. So you'd need to figure out what that is and work around it or move on. Some things are easy to get around (like modifying the UA string) while others are non-trivial to bypass.

Garbee on 2 Sep 2017

I am also facing the same issue.

When Headless is false

page url ===> http://lvh.me:3000/dashboard

When Headless is true

page url ===> about:blank
(node:29206) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 1)

kaushiksundar on 3 Sep 2017

👍1

Can anyone provide an actual example file to run that reproduces this issue?

Garbee on 3 Sep 2017

I will try to find something public that I can post. My example is confidential so I can't share it.

optikalefx on 3 Sep 2017

@Garbee just FYI, I'm setting the UA, so I don't think that's it. And I'm performing things like delays and mouse movement etc. Since the only difference is the headless: true it leads me to believe that there is something going on in the lib, and not on the site that I'm scraping. But I will keep trying and hopefully will find an example to post.

Are there other kinds of debugging maybe that can help point to where an issue might be?

optikalefx on 3 Sep 2017

👍4

@Garbee Here is the code. This happens only for localhost if I give the actual website URL (http://www.google.com... etc) it is working for both options.

const browser = await puppeteer.launch({headless: true});
  const page = await browser.newPage();
  await page.goto('localhost:3000', {
          networkIdleTimeout: 1000,
          waitUntil: 'networkidle',
          timeout: 3000000
        });
  console.log(page.url());

Output:
about:blank

Expected output:
localhost:3000

If headless is false I am getting the expected output.

kaushiksundar on 3 Sep 2017

I'll thicken the plot. I've started debugging the POST requests to my amazon login. When headless is set to true, Amazon is making an additional POST request that I don't recognize. That doesn't exist when headless is set to false. So that says to me something else is changing with this setting that I don't yet know.

optikalefx on 3 Sep 2017

I've also inspected the request and response for both headless and non-headless. They seem to be identical in nature.

optikalefx on 3 Sep 2017

In non-headless mode, screenshots work differently because my screen is in HiDPI mode (MacBook Retina). Here's one of the 'different' screenshots:
example

LoganDark on 3 Sep 2017

Remember the protocol is required for urls in goto.

@LoganDark that is a different issue completely. Please file your own for triage and discussion.

Garbee on 3 Sep 2017

👍3

Different issue? Well, I didn't know that because of the title.

LoganDark on 3 Sep 2017

👍1

Reading the issue description as well, nothing stands out to me that would make my issue completely different. Here are the parts that made me think my issue did belong here:

I'm curious to know what changes there are between running as headless true vs false.

So I'm trying to figure out what headless: true is doing that is different from when it's not headless.

LoganDark on 3 Sep 2017

👍2

@Garbee Yes giving the protocol in goto solves the issue.

await page.goto('http://localhost:3000', {
          networkIdleTimeout: 1000,
          waitUntil: 'networkidle',
          timeout: 3000000
        });
console.log(page.url());

If I don't give the protocol for google.com, I get the error Error: Protocol error (Page.navigate): Cannot navigate to invalid URL undefined, whereas for the localhost case above I get about:blank. The error handling is done differently for localhost. Shouldn't it give the protocol error there too?

await page.goto('www.google.com', {
          networkIdleTimeout: 1000,
          waitUntil: 'networkidle',
          timeout: 3000000
        });
console.log(page.url());

kaushiksundar on 3 Sep 2017
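
The missing-protocol pitfall above can be guarded against before calling page.goto. The following is a minimal sketch, not part of Puppeteer: withProtocol is a hypothetical helper name, and defaulting to http is an assumption that suits local development servers.

```javascript
// Hypothetical helper (not a Puppeteer API): add a scheme when one is missing.
function withProtocol(target, defaultScheme = 'http') {
  // Matches "scheme://..." as well as scheme-only URLs such as about:blank or data:...
  const hasScheme =
    /^[a-z][a-z0-9+.-]*:\/\//i.test(target) || /^(about|data):/i.test(target);
  return hasScheme ? target : `${defaultScheme}://${target}`;
}

// Usage inside an async function, with `page` obtained from browser.newPage():
//   await page.goto(withProtocol('localhost:3000'));
//   console.log(page.url());
```

A regular expression is used instead of the URL constructor because 'localhost:3000' parses as a valid URL whose scheme is 'localhost:', which would hide the problem.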

@LoganDark Sorry about the poorly worded title for the issue. There is nothing I can do about that. Your issue is with screenshot functionality while this was opened about some navigational problems. They are entirely distinct separated issues. Therefore a new issue is required to focus on your problem.

@kaushik-sundar Throwing an error for missing the protocol is a good idea IMO. I'll need to look into it though as it could be non-trivial to setup well due to the number of allowed protocols.

Garbee on 3 Sep 2017

My apologies on the title, but I do agree that the protocol issue is separate. My issue is that something about the request from the browser differs when headless is on vs off, causing the site in question to act differently.

optikalefx on 3 Sep 2017

Here is a gist of the problem. With params.isHeadless as false the browser opens and the form successfully logs in, whereas with it true I get an auth error page (which I actually _cannot_ replicate through normal means no matter what kinds of correct/incorrect credential permutations I try to use).

Since the problem is behind an auth wall (or rather, the act of authenticating itself) I cannot share the _exact_ code with my own credentials. However if you have or create your own vendorcentral account you should be able to see this behavior.

I wrote the code in such a way that it works for some other services as well, such as imgur. For this, just change params.url (to https://imgur.com/signin for example). It works on Imgur, which implies that Amazon _is_ doing something explicit, however we have been as of yet unable to determine what that is, because as @optikalefx has said we have tried sporadic mouse movement, delayed typing, etc.

Note: I'll open another unrelated issue for this eventually as I need to do more research and experimentation, but I found that page.press('Enter') does not actually press the enter key. At least for me and my environment.

rosshadden on 3 Sep 2017

but I found that page.press('Enter') does not actually press the enter key

Try page.press('Return') as well..?

LoganDark on 4 Sep 2017

@LoganDark That didn't work either. I probably shouldn't have brought it up here at all, completely unrelated. Let's ignore it.

rosshadden on 4 Sep 2017

I'm curious to know what changes there are between running as headless true vs false.

@optikalefx The major change is a user agent - chrome headless identifies itself as HeadlessChrome. Try running the following script in headless and headful modes:

const puppeteer = require('puppeteer');

(async() => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  console.log(await page.evaluate(() => navigator.userAgent));
  await browser.close();
})();

User agent is sent with every request as a user-agent header. If there's a need, user-agent could be changed with the page.setUserAgent method.

In non-headless mode, screenshots work differently because my screen is in HiDPI mode (MacBook Retina). Here's one of the 'different' screenshots:

@LoganDark please, file a separate issue.

Here is a gist of the problem.

@rosshadden try overriding user-agent in your gist. If this doesn't help, please file a separate issue.

aslushnikov on 5 Sep 2017

👍1
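
One way to apply the user-agent override described above without hard-coding a version string is to reuse the browser's own UA with the headless marker removed. This is a rough sketch: stripHeadless and useHeadfulLookingUserAgent are hypothetical names, and only browser.userAgent() and page.setUserAgent() are Puppeteer calls.

```javascript
// Hypothetical helper: turn "HeadlessChrome/x.y" into "Chrome/x.y".
function stripHeadless(userAgent) {
  return userAgent.replace('HeadlessChrome', 'Chrome');
}

// Hypothetical helper: read the browser's UA and set the patched value on a page.
async function useHeadfulLookingUserAgent(browser, page) {
  const original = await browser.userAgent();
  const patched = stripHeadless(original);
  await page.setUserAgent(patched);
  return patched;
}

// Usage inside an async function, before the first page.goto():
//   const ua = await useHeadfulLookingUserAgent(browser, page);
//   console.log(ua);
```

As later comments in this thread show, changing the user agent alone is not always enough.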

From @Garbee:

@LoganDark that is a different issue completely. Please file your own for triage and discussion.

From @Garbee again:

Therefore a new issue is required to focus on your problem.

From @aslushnikov

@LoganDark please, file a separate issue.

Yeah, 3 times already I've been told to file a different issue.

I haven't. And I won't right now.

Stop telling me to.

LoganDark on 6 Sep 2017

👎13

@aslushnikov we need to re-open this ticket IMO. I'm sorry that this issue had unrelated things in it. Setting the user-agent doesn't change anything - as in something is still different about the request. The result of that user-agent log after it's set is exactly what I set it to.

Can you think of anything else that changes when headless is set to true? Something that Amazon is able to detect? Maybe something about cookies? Maybe you could guide me in the right direction in the code and I can look through myself. Being unfamiliar with the codebase would make having a quick guidance very helpful.

optikalefx on 6 Sep 2017

👍7

There are a few ways Amazon can be detecting headless access. Nothing can really be done internally about them if Amazon is implementing any techniques like this.

The only primary difference is the Headless in the UA string. Beyond that, everything should be functioning the same from the user perspective of headless, as stated before.

Garbee on 6 Sep 2017

@Garbee super interesting. So, why can't we just define things like language, plugins etc? I can't set things on navigator, but I can polyfill other methods to prevent detection. Maybe you guys can set the navigator settings?

optikalefx on 6 Sep 2017

It looks like I can polyfill navigator using

Object.defineProperties(navigator, {
     'plugins': {
         value: ['adBlock'],
          writable: true
     }
});

optikalefx on 6 Sep 2017

Well I polyfilled everything in that article, and it passes all of those tests after the goto statement. But it still is getting caught. quite interesting.

optikalefx on 6 Sep 2017
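
A possible reason the polyfills are still caught is timing: overrides applied after goto run later than the page's own scripts. Puppeteer's page.evaluateOnNewDocument registers a function that runs in each new document before page scripts. The sketch below assumes that is the relevant difference; installNavigatorOverrides is a hypothetical name, and the override values must be JSON-serializable.

```javascript
// Hypothetical helper; page.evaluateOnNewDocument() is the Puppeteer call doing the work.
function installNavigatorOverrides(page, overrides) {
  return page.evaluateOnNewDocument((values) => {
    // This function is serialized and runs inside the page,
    // so it must not reference variables from the Node.js side.
    for (const [name, value] of Object.entries(values)) {
      Object.defineProperty(navigator, name, {
        get: () => value,
        configurable: true,
      });
    }
  }, overrides);
}

// Usage inside an async function, before page.goto():
//   await installNavigatorOverrides(page, { languages: ['en-US', 'en'] });
```

This only covers properties read from navigator; it does nothing about request headers or other signals a site may check.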

@aslushnikov While my gist doesn't have a UA set, setting it was the first thing @optikalefx tried when we discovered this problem. What I can do is update my gist with setting the UA and the polyfills/workarounds we have tried since.

rosshadden on 6 Sep 2017

@optikalefx @rosshadden Chrome headless is built on top of the content/ layer and doesn't include the chrome/ layer, whereas headful Chrome includes both the content/ and chrome/ layers. So naturally, there might be multiple subtle ways to detect headless.

More on chromium architecture could be found here:

aslushnikov on 13 Sep 2017

As mentioned in the article @Garbee posted the headless version does not have languages set on the navigator object.

Note also that the headless version will not have languages set in its Accept-Language Header. Some sites (ASP.NET in my experience) require this header to be set. Other sites are looking for this header specifically to identify headless browsers.

I copied the value from an example request generated by my normal chrome install. There is probably a more minimal setting for this header that works.

await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
});

koreus7 on 10 Jan 2018

👍33 ❤11 🎉6 👀3 👎1
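
For anyone who prefers to build the header value from a list of languages rather than copy it from a browser, here is a small sketch. buildAcceptLanguage is a hypothetical helper, and the descending 0.1 steps for the q weights are an assumption that happens to reproduce the value quoted above.

```javascript
// Hypothetical helper: build an Accept-Language value with descending q weights.
function buildAcceptLanguage(languages) {
  return languages
    .map((tag, index) => {
      if (index === 0) return tag; // the first entry carries the implicit q=1
      const q = Math.max(0.1, 1 - index * 0.1).toFixed(1);
      return `${tag};q=${q}`;
    })
    .join(',');
}

// Usage inside an async function, before page.goto():
//   await page.setExtraHTTPHeaders({
//     'Accept-Language': buildAcceptLanguage(['en-GB', 'en-US', 'en'])
//   });
```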

@koreus7 - Solution worked for Amazon issue reported by @optikalefx

hvaoc on 20 Apr 2018

Full code to Scrape Amazon behind Login Wall - Optimized and Works in Headless Mode (Avoid BOT detection)

// Get addresses from Amazon Address Book

const puppeteer = require('puppeteer');

(async () => {

  // Syntactic Sugar

  const Navigate = async (url) => {
    await page.goto(url);
  }

  const EnterText = async (selector, text) => {
    await page.click(selector);
    await page.keyboard.type(text);
  }

  const ClickNavigate = async (selector, waitFor = -1) => {
    await page.click(selector);
    if (waitFor >= 0) {
      await page.waitFor(waitFor*1000)
    }
    else {
      await page.waitForNavigation();
    }
  }

  // Main Flow

  const C_HEADELESS = true
  const C_OPTIMIZE = true
  const C_SLOWMOTION = 0 // slow down by X ms

  const browser = await puppeteer.launch({
    headless: C_HEADELESS,
    slowMo: C_SLOWMOTION
  });
  const page = await browser.newPage();

  // To ensure Amazon doesn't detect it as a Bot
  await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'
  });

  // No unwanted resources
  if (C_OPTIMIZE) {
    await page.setRequestInterception(true);
    const block_ressources = ['image', 'stylesheet', 'media', 'font', 'texttrack', 'object', 'beacon', 'csp_report', 'imageset'];
    page.on('request', request => {
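      // resourceType is a method and must be called; includes() also matches
      // the first entry ('image'), which the previous indexOf(...) > 0 check skipped
      if (block_ressources.includes(request.resourceType()))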
        request.abort();
      else
        request.continue();
    });
  }

  // Creds
  const USER_EMAIL = "YOUR_EMAIL_HERE"
  const USER_PASSWORD = "YOUR_PASSWORD_HERE"

  // Home Page constants
  const U_HOMEPAGE = 'https://amazon.com'
  const U_LOGIN_PAGE = 'https://www.amazon.com/ap/signin?clientContext=135-8638983-8261231&openid.return_to=https%3A%2F%2Fwww.amazon.com%2Fa%2Faddresses&openid.identity=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&openid.assoc_handle=usflex&openid.mode=checkid_setup&marketPlaceId=ATVPDKIKX0DER&openid.claimed_id=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0%2Fidentifier_select&pageId=usflex&openid.ns=http%3A%2F%2Fspecs.openid.net%2Fauth%2F2.0&openid.pape.max_auth_age=900&siteState=clientContext%3D143-3525329-4850620%2CsourceUrl%3Dhttps%253A%252F%252Fwww.amazon.com%252Fa%252Faddresses%2Csignature%3Dnull'
  const S_LOGIN_LINK = '#nav-link-accountList'

  // Optimized the flow to reach the address book faster. The trick is to first try to open the target page manually before login, where you will be hit
  // by the Amazon login wall, and capture that URL, which now has the return page set in the openid.return_to field.
  // This helps to land on the target page directly after login without having to browse through the heavy home page.
  // Caution: Trying to go to the Address Book directly (or any page with sensitive information) will challenge the user with an additional password screen.

  // Commented, since this is now optimized
  // ------------------------------------------
  // // Go to Home Page
  // await Navigate(U_HOMEPAGE)
  //
  // // Go to Login Page
  // await ClickNavigate(S_LOGIN_LINK, 1)
  // ------------------------------------------

  // Go directly to Login Page
  await Navigate(U_LOGIN_PAGE) // USER-ACTION

  // Login Page constants
  const S_EMAIL_TEXT = '#ap_email'
  const S_CONTINUE_BUTTON = '#continue'
  const S_PASSWORD_TEXT = '#ap_password'
  const S_SIGNIN_BUTTON = '#signInSubmit'

  // Login - Step 1
  await EnterText(S_EMAIL_TEXT, USER_EMAIL); // USER-ACTION
  await ClickNavigate(S_CONTINUE_BUTTON); // USER-ACTION

  // Login - Step 2
  await EnterText(S_PASSWORD_TEXT, USER_PASSWORD); // USER-ACTION
  await ClickNavigate(S_SIGNIN_BUTTON); // USER-ACTION

  // Enter password again - Secondary Protection - This is required only if you try to land on the page with sensitive information directly
  await EnterText(S_PASSWORD_TEXT, USER_PASSWORD); // USER-ACTION
  await ClickNavigate(S_SIGNIN_BUTTON); // USER-ACTION

  // AddressBook constants
  const U_ADDRESSBOOK = 'https://www.amazon.com/a/addresses'
  const S_ADDRESS_TILE = '.normal-desktop-address-tile'

  const S_ADDRESS_FULLNAME = '#address-ui-widgets-FullName'
  const S_ADDRESS_LINEONE = '#address-ui-widgets-AddressLineOne'
  const S_ADDRESS_LINETWO = '#address-ui-widgets-AddressLineTwo'
  const S_ADDRESS_CITYSTATEPOSTALCODE ='#address-ui-widgets-CityStatePostalCode'
  const S_ADDRESS_COUNTRY = '#address-ui-widgets-Country'
  const S_ADDRESS_PHONENUMBER = '#address-ui-widgets-PhoneNumber'
  const S_ADDRESS_NODEFAULT = '.address-section-no-default'
  const S_ADDRESS_DEFAULT = '.default-section'
  const S_ADDRESS_DEFAULT_FRESH = '#ya-myab-fresh-address-icon'
  const S_ADDRESS_DEFAULT_AMAZON = '#ya-myab-default-shipping-address-icon'

  // Commented, since this is now optimized
  // ------------------------------------------
  // // Go to AddressBook
  // await Navigate(U_ADDRESSBOOK)
  // ------------------------------------------

  // Get All Addresses
  const allAddressElements = await page.$$(S_ADDRESS_TILE);

  const getAddresses = allAddressElements.map(async (addressElement) => {

    let defaultAddressforAmazon = false
    let defaultAddressforFresh = false

    const defaultAddressElement = await addressElement.$(S_ADDRESS_DEFAULT)
    if (defaultAddressElement !== null) {
      const defaultAddressForAmazonElement = await defaultAddressElement.$(S_ADDRESS_DEFAULT_AMAZON)
      defaultAddressforAmazon = defaultAddressForAmazonElement ? true: false

      const defaultAddressForFreshElement = await defaultAddressElement.$(S_ADDRESS_DEFAULT_FRESH)
      defaultAddressforFresh = defaultAddressForFreshElement ? true: false
    }

    const fullNameElement = await addressElement.$(S_ADDRESS_FULLNAME)
    const fullName = await (await fullNameElement.getProperty('innerHTML')).jsonValue();

    const addressLineOneElement = await addressElement.$(S_ADDRESS_LINEONE)
    const addressLineOne = await (await addressLineOneElement.getProperty('innerHTML')).jsonValue();

    const addressLineTwoElement = await addressElement.$(S_ADDRESS_LINETWO)
    const addressLineTwo = addressLineTwoElement ? await (await addressLineTwoElement.getProperty('innerHTML')).jsonValue() : '';

    const cityStatePostalCodeElement = await addressElement.$(S_ADDRESS_CITYSTATEPOSTALCODE)
    const cityStatePostalCode = await (await cityStatePostalCodeElement.getProperty('innerHTML')).jsonValue();

    const countryElement = await addressElement.$(S_ADDRESS_COUNTRY)
    const country = await (await countryElement.getProperty('innerHTML')).jsonValue();

    const phoneNumberElement = await addressElement.$(S_ADDRESS_PHONENUMBER)
    let phoneNumber = await (await phoneNumberElement.getProperty('innerHTML')).jsonValue();
    phoneNumber = phoneNumber.split(':')
    phoneNumber = phoneNumber[1].trim()

    return {
      FullName: fullName,
      AddressLineOne: addressLineOne,
      AddressLineTwo: addressLineTwo,
      CityStatePostalCode: cityStatePostalCode,
      Country: country,
      PhoneNumber: phoneNumber,
      DefaultAddressforAmazon: defaultAddressforAmazon,
      DefaultAddressforFresh: defaultAddressforFresh
    }

  });

  let addresses = await Promise.all(getAddresses)
  console.log(addresses)
  await browser.close();
})();

hvaoc on 20 Apr 2018

🎉15 ❤14 👍11

This is an absolute pearl. Thanks for sharing the code above.

mercmobily on 30 May 2018

I would also like to add that, for our implementation, we turned on 2FA and will keep it on. We have set up a number with Twilio, or a Twilio-like service, to receive the SMS code, and our login script then fetches that code from Twilio and enters it into the 2FA prompt. We require this because Amazon sometimes asks for it, and rather than writing retry code for the cases where it does, we just always assume 2FA.

optikalefx on 30 May 2018

For what it's worth I've also found that adding the following user agents override can help smooth over differences in some cases:

await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')

The UA I've provided is just an example. You can use any valid UA that matches an existing browser.

jondlm on 7 Jun 2018

👍12 ❤10 🎉7 🚀1 👎1

I noticed another difference: in non-headless mode the address seems to change from localhost to 127.0.0.1, which makes it difficult to assert on the URL.

felixfbecker on 8 Jun 2018
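
If the localhost vs 127.0.0.1 difference gets in the way of URL assertions, one option is to normalize loopback hosts before comparing. This is a sketch under that assumption; normalizeLoopback is a hypothetical helper that relies only on the standard URL class.

```javascript
// Hypothetical helper: rewrite loopback IP hosts to "localhost" before comparing URLs.
function normalizeLoopback(rawUrl) {
  const url = new URL(rawUrl);
  if (url.hostname === '127.0.0.1' || url.hostname === '[::1]') {
    url.hostname = 'localhost';
  }
  return url.href;
}

// Usage in a test, with `page` from browser.newPage():
//   const actual = normalizeLoopback(page.url());
//   const expected = normalizeLoopback('http://localhost:3000/dashboard');
//   console.log(actual === expected);
```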

As @jondlm said, setting the user agent makes headless Selenium behave the same as non-headless Selenium. Thanks.

roeniss on 21 Aug 2018

@koreus7 setting the languages works like a charm!

stefpe on 12 Nov 2018

I got it working by adding these two:

await page.setExtraHTTPHeaders({
    'Accept-Language': 'en-US,en;q=0.9'
});
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36');

Many thanks to @koreus7 & @jondlm. It won't work if either one of them is left out.

P/S: I was trying to access this site www.blibli.com

jslim89 on 9 Apr 2019

👍8 🎉1

I've made a fake user agent generator that works pretty fine!

function* generateUserAgent() {
  let webkitVersion = 10;
  let chromeVersion = 1000;

  const so = [
    'Windows NT 6.1; WOW64',
    'Windows NT 6.2; Win64; x64',
    "Windows NT 5.1; Win64; x64",
    'Macintosh; Intel Mac OS X 10_12_6',
    "X11; Linux x86_64",
    "X11; Linux armv7l"
  ];
  let soIndex = Math.floor(Math.random() * so.length);

  while (true) {
    yield `Mozilla/5.0 (${so[soIndex++ % so.length]}) AppleWebKit/537.${webkitVersion} (KHTML, like Gecko) Chrome/56.0.${chromeVersion}.87 Safari/537.${webkitVersion} OPR/43.0.2442.991`;

    webkitVersion++;
    chromeVersion++;
  }
}

const userAgents = generateUserAgent();

// ...
await page.setUserAgent(userAgents.next().value);

endel on 28 Sep 2019

👍2

So headless true/false changes the user agent and other things?
I have two different tests that work in headless: false mode but fail in headless: true mode, due to differences in font rendering and in the time needed for a button to become clickable, but I cannot share them because the website is confidential.
I think headless true/false should not change the rendering process.
Should I consider setting a common user agent to make the behaviour more consistent?
Thanks.

andreabisello on 26 Nov 2019

My case is completely the opposite of the OP's situation. I got an Amazon's robot check while headless mode:false, and bypass while headless mode:true. I solved this issue thanks to @koreus7 Many thanks 👍

heathera2016 on 27 Nov 2019

Using @koreus7 and @jondlm comments solved my problem

gdossant on 14 Dec 2019

Recently, I had the same experience of getting blocked for using a headless browser while scraping a popular website. Even after adding proper headers and a user agent, it didn't work.

Finally used puppeteer-extra with stealth mode plugin which fixed the problem.

This thread helped me a lot to figure out what all could go wrong.

Thanks @Garbee @optikalefx

Bhabaranjan19966 on 24 Dec 2019

@Bhabaranjan19966 so this https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra with this https://www.npmjs.com/package/puppeteer-extra-plugin-stealth ? i will try, thanks.

andreabisello on 24 Dec 2019

Not working for me: headless and GUI mode render the page in slightly different ways.

andreabisello on 24 Dec 2019

@Bhabaranjan19966 so this https://github.com/berstend/puppeteer-extra/tree/master/packages/puppeteer-extra with this https://www.npmjs.com/package/puppeteer-extra-plugin-stealth ? i will try, thanks.

Yes, those are the two repositories that fixed my problem. @andreabisello

Bhabaranjan19966 on 27 Dec 2019

🎉1
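
For reference, wiring up the two packages mentioned above looks roughly like this. It is a sketch based on the documented usage of puppeteer-extra and puppeteer-extra-plugin-stealth; launchWithStealth and the stealthRegistered flag are hypothetical names, and both packages (plus puppeteer itself) are assumed to be installed.

```javascript
// Tracks whether the plugin has been registered, so repeated calls do not add it twice.
let stealthRegistered = false;

// Hypothetical wrapper around puppeteer-extra with the stealth plugin enabled.
async function launchWithStealth(launchOptions = { headless: true }) {
  // Required lazily so that loading this file does not fail when the packages are absent.
  const puppeteer = require('puppeteer-extra');
  const StealthPlugin = require('puppeteer-extra-plugin-stealth');
  if (!stealthRegistered) {
    puppeteer.use(StealthPlugin());
    stealthRegistered = true;
  }
  return puppeteer.launch(launchOptions);
}

// Usage inside an async function:
//   const browser = await launchWithStealth();
//   const page = await browser.newPage();
```

Whether this is enough depends on the site, as the mixed reports in this thread suggest.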

I'm having this same issue with peapod.com right now. In headful mode, my program runs successfully. In headless mode, I'm screenshotting to debug and see that the link is clicked, spinner is activated, but the page never changes. How can I debug this better? @aslushnikov , could you provide me some guidance?

pgibler on 18 Apr 2020
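
One debugging approach for cases like this is to capture the request headers of the same navigation in headful and headless runs and compare them. The sketch below shows only the comparison step; diffHeaders is a hypothetical helper, and the header objects are assumed to come from Puppeteer's request.headers(), which may not include every header the browser adds at the network layer.

```javascript
// Hypothetical helper: report header names whose values differ between two runs.
function diffHeaders(headful, headless) {
  const names = new Set([...Object.keys(headful), ...Object.keys(headless)]);
  const differences = {};
  for (const name of names) {
    if (headful[name] !== headless[name]) {
      // A value of undefined means the header was absent in that run.
      differences[name] = { headful: headful[name], headless: headless[name] };
    }
  }
  return differences;
}

// Collecting headers in each run, with `page` from browser.newPage():
//   page.on('request', (request) => {
//     console.log(request.url(), JSON.stringify(request.headers()));
//   });
```

A missing accept-language entry or a user-agent containing HeadlessChrome would show up directly in the result.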

Recently, I had the same experience of getting blocked for using a headless browser while scraping a popular website. Even after adding proper headers and a user agent, it didn't work.

Finally used puppeteer-extra with stealth mode plugin which fixed the problem.

This thread helped me a lot to figure out what all could go wrong.

Thanks @Garbee @optikalefx

The stealth mode did the trick for me too! TYVM

mewtcor on 24 Apr 2020

😄1 👍1

None of these suggested solutions work on Mac OS X. To reproduce:

Change your system language to something other than en-US or en, so that applications use that locale.
Test a browser extension or web site that is internationalised by selected user locale.
It is impossible to test or change the browser locale to en-US on non-headless mode at least.

What I am trying to do, is setup testing with Puppeteer for my browser extension Spellbook.

I have the first test now passing on Mac OS X (using some Finnish strings), and it is probably failing on other systems when you do yarn run test:puppeteer, because I use every method of setting the locale: https://github.com/peterhil/spellbook/commit/3480a73ed841f81cfac1ab99137820ea2aa5b6d6

peterhil on 7 May 2020

For what it's worth I've also found that adding the following user agents override can help smooth over differences in some cases:
await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36')
The UA I've provided is just an example. You can use any valid UA that matches an existing browser.

Add this just below where page is defined.

harshvats2000 on 27 Dec 2020
