We have been getting poorly encoded strings from Google::Cloud::Language::V1::AnalyzeEntitiesResponse in production: sporadically on February 11th, and much more frequently on February 12th (today).
This error had never happened across thousands of API calls before yesterday.
After some investigation, it looks like a curly apostrophe ’ (U+2019, which is \xE2\x80\x99 in UTF-8) is being returned as \xE2\x80, with the final byte missing, but only when it's the last character in a string.
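To make the truncation concrete, here is a minimal Ruby sketch. It does not call the API; it only rebuilds the bytes by hand (the variable names are made up for illustration) to show that U+2019 takes three bytes in UTF-8 and that a string ending after the first two of them is not valid UTF-8.

```ruby
# U+2019 (right single quotation mark) is encoded as three bytes in UTF-8.
apostrophe = "\u2019"
apostrophe_hex = apostrophe.bytes.map { |b| format("%02X", b) }

# Rebuild the tail of the bad mention by hand: "Women" followed by only the
# first two bytes of the apostrophe, then tagged as UTF-8.
truncated = ("Women".b + [0xE2, 0x80].pack("C*")).force_encoding(Encoding::UTF_8)

# The same string with all three bytes present, for comparison.
complete = ("Women".b + [0xE2, 0x80, 0x99].pack("C*")).force_encoding(Encoding::UTF_8)

puts apostrophe_hex.join(" ")      # hex bytes of the apostrophe
puts truncated.valid_encoding?     # the two-byte tail is not valid UTF-8
puts complete.valid_encoding?      # the three-byte form is valid
puts complete == "Women\u2019"
```

Checking `valid_encoding?` on each mention's content is a cheap way to detect affected responses before they reach a JSON serializer.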
Mac OS 10.13.6
Ruby 2.5.1
google-cloud-language (0.31.2)
Use the analyze entities endpoint with the following text:
Don’t Discount Women’s Colleges – Women’s colleges are often perceived as isolated and uptight, but in reality, they are just as academically challenging and rigorous as co-ed campuses.
The returned payload:
{"entities"=>
[{"name"=>"Women's Colleges -- Women's", "type"=>"PERSON", "metadata"=>{}, "salience"=>0.6430798768997192, "mentions"=>[{"text"=>{"content"=>"Women’s Colleges – Women\xE2\x80", "begin_offset"=>17}, "type"=>"PROPER", "sentiment"=>nil}], "sentiment"=>nil},
{"name"=>"colleges", "type"=>"ORGANIZATION", "metadata"=>{}, "salience"=>0.15619903802871704, "mentions"=>[{"text"=>{"content"=>"colleges", "begin_offset"=>50}, "type"=>"COMMON", "sentiment"=>nil}], "sentiment"=>nil},
{"name"=>"reality", "type"=>"OTHER", "metadata"=>{}, "salience"=>0.12969617545604706, "mentions"=>[{"text"=>{"content"=>"reality", "begin_offset"=>111}, "type"=>"COMMON", "sentiment"=>nil}], "sentiment"=>nil},
{"name"=>"campuses", "type"=>"LOCATION", "metadata"=>{}, "salience"=>0.07102492451667786, "mentions"=>[{"text"=>{"content"=>"campuses", "begin_offset"=>184}, "type"=>"COMMON", "sentiment"=>nil}], "sentiment"=>nil}],
"language"=>"en"}
For some reason the API is dropping the s that follows the apostrophe, and even within the same string, "Women’s Colleges – Women\xE2\x80", the two apostrophes are encoded differently: the first is intact and the second is truncated.
> client = Google::Cloud::Language.new(credentials: my_credentials)
> test_str = "Don’t Discount Women’s Colleges – Women’s colleges are often perceived as isolated and uptight, but in reality, they are just as academically challenging and rigorous as co-ed campuses."
> payload = client.analyze_entities({content: test_str, type: :PLAIN_TEXT})
> payload.as_json ## returns the above Ruby hash snippet
> payload.as_json.to_json
JSON::GeneratorError: partial character in source, but hit end
> object_in_db.update(some_json_field: payload.to_json)
JSON::GeneratorError: source sequence is illegal/malformed utf-8
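Until the API response is fixed, one possible stopgap is to scrub invalid byte sequences before serializing. This is only a sketch: `scrub_strings` is a hypothetical helper, not part of the gem, and the sample hash is hand-built to mimic the shape of the payload above. Note that scrubbing swaps the truncated bytes for a replacement character, so the dropped apostrophe and s are not recovered.

```ruby
require "json"

# Hypothetical helper (not part of google-cloud-language): walk a nested
# Hash/Array and replace invalid UTF-8 byte sequences in every String.
def scrub_strings(value, replacement = "\uFFFD")
  case value
  when String then value.scrub(replacement)
  when Array  then value.map { |v| scrub_strings(v, replacement) }
  when Hash   then value.transform_values { |v| scrub_strings(v, replacement) }
  else value
  end
end

# Hand-built stand-in for the bad mention: "Women" plus a truncated apostrophe.
bad = ("Women".b + [0xE2, 0x80].pack("C*")).force_encoding(Encoding::UTF_8)
sample = {
  "entities" => [
    { "mentions" => [{ "text" => { "content" => bad, "begin_offset" => 17 } }] }
  ]
}

clean = scrub_strings(sample)
json  = JSON.generate(clean) # serializes now that every string is valid UTF-8
puts json
```

In the session above, this would correspond to something like `scrub_strings(payload.as_json).to_json`, assuming `as_json` returns plain hashes, arrays, and strings as shown.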
I know the google-cloud-language gem has not had any new releases, and we've been using it successfully for months with no issues. So I suspect that the issue is not with the Ruby gem but with the NLP API itself.
Since we don't have a Google support contract, this seemed like the best way to inform Google of the issue. Hopefully this can be insightful to others stumbling across this issue.
Thank you so much for posting this. I was able to reproduce this. I've filed a bug with the Natural Language API team. I will post updates when they are available.
@bendillinger I believe this has been resolved. Can you confirm that the API response is now as expected?
Everything looks good now! Thanks for the update.