Google-cloud-ruby: Non-UTF8 strings in Google-Cloud-Language response

Created on 12 Feb 2019 · 3 Comments · Source: googleapis/google-cloud-ruby

Issue

We have started getting invalidly encoded strings from Google::Cloud::Language::V1::AnalyzeEntitiesResponse in production: occasionally on February 11th, and much more frequently on February 12th (today).
This error never occurred across thousands of API calls before yesterday.
After some investigation, it looks like a curly apostrophe ’ (U+2019) is being returned as the truncated byte sequence \xE2\x80, but only when it is the last character in a returned string.
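
To make the byte-level picture concrete: in UTF-8 the apostrophe ’ (U+2019) is the three bytes E2 80 99, so \xE2\x80 is that character with its final byte missing. A minimal sketch in plain Ruby (no API call involved; the variable names are illustrative only):

```ruby
# U+2019 (right single quotation mark) encodes to three bytes in UTF-8.
apostrophe = "\u2019"
apostrophe_hex = apostrophe.bytes.map { |b| format("%02X", b) }.join(" ")

# The string from the response stops after only two of those three bytes.
truncated = "Women\xE2\x80"

puts apostrophe_hex             # E2 80 99
puts truncated.valid_encoding?  # false
```

This is why the JSON generator later complains about a partial character: the string is tagged as UTF-8 but ends in the middle of a multi-byte sequence.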

Environment details

  • OS: Mac OS 10.13.6
  • Ruby version: 2.5.1
  • Gem name and version: google-cloud-language (0.31.2)

Steps to reproduce

Use the analyze entities endpoint with the following text:

Don’t Discount Women’s Colleges – Women’s colleges are often perceived as isolated and uptight, but in reality, they are just as academically challenging and rigorous as co-ed campuses.

The returned payload:

{"entities"=>
  [{"name"=>"Women's Colleges -- Women's", "type"=>"PERSON", "metadata"=>{}, "salience"=>0.6430798768997192, "mentions"=>[{"text"=>{"content"=>"Women’s Colleges – Women\xE2\x80", "begin_offset"=>17}, "type"=>"PROPER", "sentiment"=>nil}], "sentiment"=>nil},
   {"name"=>"colleges", "type"=>"ORGANIZATION", "metadata"=>{}, "salience"=>0.15619903802871704, "mentions"=>[{"text"=>{"content"=>"colleges", "begin_offset"=>50}, "type"=>"COMMON", "sentiment"=>nil}], "sentiment"=>nil},
   {"name"=>"reality", "type"=>"OTHER", "metadata"=>{}, "salience"=>0.12969617545604706, "mentions"=>[{"text"=>{"content"=>"reality", "begin_offset"=>111}, "type"=>"COMMON", "sentiment"=>nil}], "sentiment"=>nil},
   {"name"=>"campuses", "type"=>"LOCATION", "metadata"=>{}, "salience"=>0.07102492451667786, "mentions"=>[{"text"=>{"content"=>"campuses", "begin_offset"=>184}, "type"=>"COMMON", "sentiment"=>nil}], "sentiment"=>nil}],
 "language"=>"en"}

For some reason the API drops the s that follows the second apostrophe. Even within the same string, "Women’s Colleges – Women\xE2\x80", the two apostrophes are handled differently: the first is intact and the second is truncated.

Code example

> client = Google::Cloud::Language.new(credentials: my_credentials)
> test_str = "Don’t Discount Women’s Colleges – Women’s colleges are often perceived as isolated and uptight, but in reality, they are just as academically challenging and rigorous as co-ed campuses."
> payload = client.analyze_entities({content: test_str, type: :PLAIN_TEXT})
> payload.as_json ## returns the above Ruby hash snippet
> payload.as_json.to_json
JSON::GeneratorError: partial character in source, but hit end
> object_in_db.update(some_json_field: payload.to_json)
JSON::GeneratorError: source sequence is illegal/malformed utf-8
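
Until the API response is fixed, one possible stopgap is to scrub invalid byte sequences before serializing. The sketch below is only an assumption about how that might look: `deep_scrub` is a hypothetical helper, not part of the google-cloud-language gem, and the `mention` hash is a hand-written stand-in for one entry of `payload.as_json`. It relies on `String#scrub` and `Hash#transform_values` from core Ruby.

```ruby
require "json"

# Hypothetical helper (not part of google-cloud-language): walks a nested
# Hash/Array and replaces invalid byte sequences in every String value.
# Hash keys are left untouched.
def deep_scrub(value, replacement = "\uFFFD")
  case value
  when String then value.scrub(replacement)
  when Array  then value.map { |v| deep_scrub(v, replacement) }
  when Hash   then value.transform_values { |v| deep_scrub(v, replacement) }
  else value
  end
end

# Stand-in for one mention from payload.as_json, with the truncated apostrophe.
mention = { "text" => { "content" => "Women\xE2\x80", "begin_offset" => 17 } }

cleaned = deep_scrub(mention)
json = cleaned.to_json  # serializes without JSON::GeneratorError
```

Note that this is lossy: the truncated apostrophe becomes a replacement character and the dropped s is not recovered, so it only keeps the JSON serialization from raising.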

The google-cloud-language gem has not had any new releases, and we have been using it successfully for months with no issues, so I suspect the problem is not in the Ruby gem but in the Natural Language API itself.

Since we don't have a Google support contract, this seemed like the best way to inform Google of the issue. Hopefully it also helps others who run into the same problem.

language p1 bug

All 3 comments

Thank you so much for posting this. I was able to reproduce this. I've filed a bug with the Natural Language API team. I will post updates when they are available.

@bendillinger I believe this has been resolved. Can you confirm that the API response is now as expected?

Everything looks good now! Thanks for the update.
