Site-kit-wp: Improve handling for domain names containing unicode bidirectional control (and other special) characters

Created on 16 May 2020  ·  24Comments  ·  Source: google/site-kit-wp

Bug Description

From a report in the forums, the Site Kit service is not fully compatible with unicode domain extensions. In this case, the Hebrew .קום.

The service doesn't proceed for the site (in punnycode) after pressing "Add Site" at Step 3 _Set up Search Console_ :

stuck at step 3 set up search console

However, the unicode domain is added to Search Console:

unicode added to search console

Steps to reproduce

  1. On a site with a unicode extension.קום, proceed through the setup process
  2. At Step 3 _Set up Search Console_ , click on "Add Site"
  3. Process stops
  4. Once fixed and setup is complete, data requests to Search Console fail with an error.

Additional details

  • The domain as stored and returned by Search console are in Unicode and may include bidirectional control characters
  • We Also need to review how other services handle these characters to ensure our matching works as expected (for example in Analytics).
  • In general I expect that when requesting data from APIs we may need to strip these characters.

Additional Context

  • PHP Version: 7.3.12
  • OS: MacOS
  • Browser: Chrome
  • Plugin Version: 1.8.0
  • Device: MacBook Air

Also addressed unicode compatibility in #794


_Do not alter or remove anything below. The following sections will be managed by moderators only._

Acceptance criteria

  • Domain names with RTL characters including the TLD should work correctly in Site Kit.
  • In particular, during "Step 3" of setup the domain check should work and allow the user to succeed.
  • Data requests for the domain to the Search console API should work as expected
  • Analytics, PSI and AdSense setup and data requests should also work correctly.

Implementation Brief

  • introduce a class Util\Google_URL_Normalizer with a single public normalize_url method that takes a $url parameter normalizes it by removing any trailing slash, removing directional control characters, decoding unicode characters and converting to lowercase:
public function normalize_url( $url ) {
        // Remove bidirectional control characters. 
        $url = preg_replace( array( '/\xe2\x80\xac/', '/\xe2\x80\xab/' ), '', $url );
        $url = $this->decode_unicode_url_or_domain( $url );
        $url = untrailingslashit( $url );
        $url = strtolower( $url );

        return $url;
}

  • Call this function on any url passed before making Search Console API data requests:

    • In /Users/adamjs/repositories/site-kit-wp/includes/Modules/Search_Console.php-

      In create_data_request

  • Change $data_request['page'] = $data['url']; to $data_request['page'] = $url_normalizer->normalize_url( $data['url'] );
  • Change $site_url = $data['siteURL']; to $site_url = $url_normalizer->normalize_url( $data['siteURL'] );

  • Adjust the domains when used for comparisons in the legacy setup flow:

    • in /Users/adamjs/repositories/site-kit-wp/includes/Core/Util/Google_URL_Matcher_Trait.php
    • Adjust is_url_match, remove the existing domain calls to decode_unicode_url_or_domain and untrailingslashit, call normalize_url instead:
$url  = $url_normalizer->normalize_url( $url );
$compare = $url_normalizer->normalize_url( $compare );
  • Similarly adjust is_domain_match:
$domain  = $url_normalizer->normalize_url( $domain );
$compare = $url_normalizer->normalize_url( $compare );

Testing

  • Add unit tests for normalize_url, we have example data for the tests in the existing test_is_url_match and test_is_domain_match.

QA Brief

  • To test, domain must contain directional control characters.
  • Note that the fix here also consolidated previous fixes for unusual domains so QA should include testing for:

    • domains set up in WP with CamelCase

    • domains registered with Unicode characters (eg türkish.com) and entered in wp settings as



      • unicode or as


      • punicode (http://xn--trkish-3ya.com/).



  • Support should have access to unicode/directional test sites set up if you need them. for CamelCase.com, changing WordPress settings is sufficient for testing.

Changelog entry

  • Improve support for URLs containing unicode, mixed case, and bidirectional control characters when requesting and sending data to Search Console.
Search Console Site Verification P1 Bug

All 24 comments

@ernee Closing this as resolved. Please feel free to re-open if this is still an issue.

@ernee - I did some research against staging and the test site you provided and discovered the following:

  • The domain is stored as Punicode in WP, and added as such to Search Console via API.
  • When we query the Search Console API for the list of sites, the site is included in the list, however it contains Unicode Bidirectional Control Characters. This is because the URL characters are in Hebrew (RTL) while the http://, . and / parts are in english (LTR).

eg: "siteUrl": "http://\u202bביקדקוםה\u202c.\u202bקום\u202c/",

Because the URL contains these control characters it does not pass our comparison tests and the system determines that user does not have the site in their Search Console account, and sends them back to Step 3 to add it.

Here are some more details on these control characters and how to handle them: https://www.w3.org/International/articles/inline-bidi-markup/uba-basics

I tested a change in staging that strips these control characters before comparing the domains and was able to complete step 3 on your test site.

When I returned to Site Kit with setup complete, I discovered a new, related issue - the Search Console API rejects the data request as coming from an invalid URL. Site Kit shows an error:

image

I was able to reproduce this error using the API explorer:

image

Removing the control characters fixes the issue:

image

Therefore I expect we need to remove the control characters on the plugin side as well when sending queries. Moving this to AC which I will work on next.

Added AC, moving to IB

I started work on the IB then realized I need to do some additional testing with other APIs to see how we handle these domains.

For example with Analytics, when we search for a matching domain, will the match work to find an existing set up domain? Does this affect the 'New Account' provisioning flow? With Analytics, once we have the correct Account/Property/View selected, our API requests use these IDs, not the domain name - except for requests that include the page url (wp-admin data and details page). Do these work correctly with the control characters present?

Similar questions arise with AdSense and even PSI - we need to test each API to make sure each part of the setup and data retrieval works as expected when domains contain the control characters.

@felixarntz - I'd like to suggest that we refocus this ticket to only address these domains for the setup including the initial Search Console data request. I can then create follow up tickets for each API and we can research them separately. What do you think?

@adamsilverstein Sounds good, we can limit this issue to Search Console. So this would mean if I understand correctly:

  • Handle such domain names in any Search Console report API requests
  • Handle such domain names in the URL / domain comparison in Site Kit which happens as part of the legacy setup flow

Does that sound right?

Yep, that sounds right.

  • Note: check if data request work with lower case and unicode characters.
  • Ensure comparison helpers match service code: lowercase, unicode and control character removal.

@adamsilverstein Assigning back to you to amend the IB after your findings regarding the above.

For the IB so far: I would suggest to add the new logic to Google_URL_Matcher_Trait which has the related existing logic already. A normalize_url method could be added there which contains the bidirectional control character piece, but potentially (see above) also the unicode-to-punicode and lowercase parts.

@adamsilverstein

In includes/Context.php add a new helper normalize_url_for_request

I meant this function shouldn't go into Context, it should live in a more specific place. Maybe we could introduce a class Util\Google_URL_Normalizer with a single public normalize_url method. Then this class could be used within Util\Google_URL_Matcher_Trait and also anywhere else where this normalization is needed (e.g. sending Search Console data requests).

Another point, do we really need different normalize_url functions for a request vs for a comparison? Regarding our conversation the other day, what was the outcome of testing data requests with unicode URLs and camelcase URLs?

I meant this function shouldn't go into Context, it should live in a more specific place. Maybe we could introduce a class Util\Google_URL_Normalizer with a single public normalize_url method.

Good suggestion, I'll update the IB.

Another point, do we really need different normalize_url functions for a request vs for a comparison?

I will do more testing to be sure, in my testing previously we do need to remove control characters before sending the domain to reporting requests with the Search console API. We need to test the requirements for each API, checking upper/lower case, unicode and control characters in the domain.

@felixarntz Updated IB as per your suggestion.
Tested Search console requests again using the API explorer. Verified:

  • URLs with control characters do not work in request, i get an error: {Domain} is not a valid Search Console site URL. Removing them fixes the issue.
  • Uppercase/lowercase are not an issue when sending data request to search console: case is ignored.
  • I'm still testing unicode vs. punicode data requests, unfortunately my test sites don't appear to have any search console data making the testing difficult. However, I do not get the not a valid Search Console site URL error when trying unicode or punicode. Also, when adding a domain to search console, it treats the punicode and unicode versions as the same domain (only shows once in the UI). I think we are fine here.

@jamesozzie or @ernee Can you confirm that since we fixed the unicode domain verification setup issue (with domains like "http://türkish.com/") we haven't heard of any follow up issues with getting data for these domains? ie. once users were unblocked from setup, did they start getting data from search console?

@adamsilverstein I do not recall reports of issues getting data for unicode domains.

@adamsilverstein I do not recall reports of issues getting data for unicode domains.

Thanks for confirming. testing is challenging because our test site doesn't generate real traffic.

@adamsilverstein The IB here still includes different normalize_url_for_request and normalize_url_for_comparison functions. Can we not have a single one that covers both (Util\Google_URL_Normalizer::normalize_url)? If camel-case is ignored for the request, we can still normalize it the same way, and I'd imagine the same about the Unicode/Punycode. When you test for the latter, if you don't get any error back, that should mean it work fine (i.e. is ignored). If it was a problem, I'd expect you to either get an "invalid SC URL" error or a permissions error because it would be interpreted as a different SC property. But if you get a success response with "zero" data, that should all be good.

In that case, I think we should rely on a single normalization method and use that throughout.

In that case, I think we should rely on a single normalization method and use that throughout.

Agreed, I will update the IB with a single normalize_url function.

  • Updated IB with a combined normalize_url function.
  • Added note about adding tests.

IB ✅

@adamsilverstein Can you expand the QA brief to also test with the other two types of URLs (camel-case and Unicode in general)?

After spending more time on this, adding tests and trying to reproduce the original bug, I found some issues with the implementation outlined here.

Moving this back to IB to clean up the approach.

@felixarntz - This is ready for review. The trickiest part here was removing the control characters from the string. My original str_replace approach only if I pasted the actual control character into the code which was confusing (you can't see them, and they do strange things to your cursor when you use arrow keys to navigate thru them). In the end I settles on an approach that uses preg_replace and the UTF-8 conversions of these characters.

Tested

Installed, activated and setup SK on a site with unicode characters
image

Confirmed no errors displayed. Ran a series of simple regression on the site, no errors found.

Passed QA ✅

I did some follow up testing here after sending some traffic to our test site and still see some potentially related issues:

  • ~When I reset the plugin and re-activated Analytics, the setup process did not preselect my domain as it should.~
  • ~I don't see the Analytics snippet on the page, even after enabling it (and visiting the site not logged in)~
  • I don't see Search Console data even though the API response seems to show some data.

The setup issues I was seeing were resolved by re-installing the develop version. Still checking on the "Zero Data" state, may be just because there is so little data.

The missing search console data seems unrelated the to issue here and more to do with the data on the test site. I will continue to send some traffic there and then test in the upcoming days when that data becomes visible in the API.

✅ Confirming I new see data in the Dashboard on the test site.

Was this page helpful?
0 / 5 - 0 ratings