Docs: Microsoft.ML sentiment-analysis results differ between Microsoft.ML versions

Created on 15 Aug 2018 · 25 comments · Source: dotnet/docs

There seems to be an issue with this sample.

The sample works with the NuGet package Microsoft.ML 0.3.0.
When running 0.4.0 you get different results.

That is, both results in the console are positive for 0.4.0. The first result should be negative if it is supposed to match 0.3.0, which I'd assume it should.


Document details

⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.

Area - ML.NET Guide · P1 · product-question · waiting-on-feedback

All 25 comments

Same issue; I get both as positive even if I use the exact wording from the sample, i.e. 'RUDE'.
I also noted that using .NET Core 2.0 in the csproj file gives lower percentages than using 2.1.
I tried bringing in the csproj, project.cs, and SentimentData.cs files from the samples on GitHub but got the same results; that's how I saw the test results differ based on .NET Core version.

Indeed, output with 0.3.0 version:

PredictionModel quality metrics evaluation
------------------------------------------
Accuracy: 66.67%
Auc: 94.44%
F1Score: 75.00%

Sentiment Predictions
---------------------
Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Negative
Sentiment: He is the best, and the article should say that. | Prediction: Positive

Output with 0.4.0 version:

PredictionModel quality metrics evaluation
------------------------------------------
Accuracy: 61.11%
Auc: 85.19%
F1Score: 72.00%

Sentiment Predictions
---------------------
Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Positive
Sentiment: He is the best, and the article should say that. | Prediction: Positive

Note that the evaluation metric values have also changed.

(I've run the code that is in the dotnet/samples repo)

It's not clear from the release notes what might have caused the observed change.

The trainer used is FastTreeBinaryClassifier. Its history shows mostly documentation changes along with some API updates.

One more guess is that this commit might have introduced different behavior.

@OliaG do you know who might explain the observed difference in the tutorial output?

@GalOshri, or @sfilipi, any insight on this?

This is likely due to the datasets being very small so random fluctuations can cause large differences in metrics (the 5% difference in accuracy just means that 1 out of the 18 examples in the test data flipped).

The other factor is that the tree parameters ({ NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 }) are very small, which limits the performance of the trees. When I switch to the default parameters, the AUC matches between 0.3 and 0.4. The accuracy is still different, but only corresponds to one example's prediction being flipped.

I tried the same pipeline on a much larger dataset and the AUC/accuracy differed by around 0.3% (which is likely still due to the relatively small tree parameters and the current inability to set a random seed).
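For reference, a minimal sketch of the tutorial's training pipeline on the 0.x LearningPipeline API follows. The class and property names reflect the 0.3/0.4-era tutorial and may differ slightly between releases, and _dataPath stands in for the tutorial's data file path.

// Sketch of the tutorial-era pipeline (Microsoft.ML 0.3/0.4 LearningPipeline API).
// Names are approximate; the API changed between 0.x releases.
var pipeline = new LearningPipeline();
pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>());
pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
// The small tree parameters below are the ones discussed above; they limit
// model quality on this tiny dataset and amplify version-to-version noise.
pipeline.Add(new FastTreeBinaryClassifier
{
    NumLeaves = 5,
    NumTrees = 5,
    MinDocumentsInLeafs = 2
});
var model = pipeline.Train<SentimentData, SentimentPrediction>();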

@GalOshri

This is likely due to the datasets being very small so random fluctuations can cause large differences in metrics

Is it possible to produce such random fluctuations when running against a fixed version of Microsoft.ML? What causes the results to differ between versions 0.3 and 0.4, yet stay stable within one version?

@GalOshri the code uses TextFeaturizer, which seems to have been affected by dotnet/machinelearning#548 (am I correct here?). So the numerical features produced by the 0.3 and 0.4 versions might differ, and thus the trained models differ. Is my understanding correct?

Good point. It seems like the implementation changed, which likely affects the predictions. @zeahmed, can you please verify this?

Due to the change in https://github.com/dotnet/machinelearning/pull/548, one of the predictions changed.

Previously, it was 12/18 = 0.666666
Now, it is 11/18 = 0.61111

The dataset is too small to draw any conclusions from. However, @justinormont may be able to help us with this.
@justinormont, do we have benchmarks for it? Can you confirm whether the change in TextFeaturizer has adversely affected the benchmarks?

@zeahmed, is this file considered a benchmark? That file was changed as part of dotnet/machinelearning#548 as well.

@pkulikov, no, these are just tests. Benchmarks are different.

I get more accurate results with the previous version (0.3.0); nothing else changed in my solution (sentiment analysis).

I tried running the solution today. It didn't initially work for me with the given hyperparameters, but when I changed them to NumLeaves = 50, NumTrees = 50, MinDocumentsInLeafs = 20, it gave me correct results.
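For illustration, on the 0.x API that change amounts to constructing the classifier with the larger values (only these three numbers differ from the tutorial):

// Larger, deeper forest: the hyperparameters reported above to give correct results.
pipeline.Add(new FastTreeBinaryClassifier
{
    NumLeaves = 50,
    NumTrees = 50,
    MinDocumentsInLeafs = 20
});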

Thanks, @mangeshw. @GalOshri - should I update the tutorial accordingly?

Running the sample code provided on GitHub with ML.NET 0.5.0 causes predictions to be positive regardless of the sentiment text provided. I've even used negative texts from the training set; they still return a positive prediction.

However, adding data from the training set to the testing set causes the accuracy to go up in the metrics output, so my theory is there's something wrong within the Predict method only.

Looks like it's just the output displaying "Positive" for negative sentiments and vice versa.
https://github.com/dotnet/samples/pull/329

Microsoft.ML 0.6.0

Accuracy: 94.44%
Auc: 98.77%
F1Score: 94.74%

Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Positive
Sentiment: He is the best, and the article should say that. | Prediction: Negative

The last code snippet has an error:
foreach (var item in sentimentsAndPredictions)
{
Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Negative" : "Positive")}");
}
Console.WriteLine();

Should be

foreach (var item in sentimentsAndPredictions)
{
Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
}
Console.WriteLine();

PredictionModel quality metrics evaluation

Accuracy: 94.44%
Auc: 98.77%
F1Score: 94.74%

@bolajiniy
Try taking a look at the training set; it should be pretty obvious that 0 = positive and 1 = negative.
Please reference https://github.com/dotnet/samples/pull/329
Let me know if you still believe it to be an error.

Perhaps we should increase the size of the sample dataset. It is currently sized for a unit test, and I don't think its small size is valuable for example code. If the same dataset is used in unit tests, we should ensure the runtime of the unit tests is not adversely impacted. The only point of the unit tests is to see if _something_ changed; they are _not_ expected to make useful models.

Similar issues were mentioned in the main repo: https://github.com/dotnet/machinelearning/issues/708#issuecomment-425706329:

Try running on the full-sized dataset: https://aka.ms/tlc-resources/benchmarks/WikiDetoxAnnotated160kRows.tsv

The wikipedia-detox-250-line-data.tsv dataset is a 250-row sample of the original 160k rows. Training on a sample this small won't create a useful model.

If someone wants to try a run on the full dataset, this would give the upper bound for how well a specific pipeline (transforms + learner) will perform on the dataset. Running on any small sample of the dataset will do worse (plus noise). We could find an optimal trade-off between sample size (which determines runtime) and output accuracy. This would let us pick a more optimally sized dataset for the example.
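One way to run such a sweep is sketched below using the newer MLContext API; fullDataPath and pipeline are assumed to be set up as in the tutorial, and the API names come from later ML.NET releases rather than the 0.x versions discussed here.

// Sketch of a sample-size sweep: train on progressively larger subsets of the
// full dataset and watch where AUC stops improving. Assumes 'pipeline' is an
// estimator chain built as in the tutorial and 'fullDataPath' points to the
// 160k-row TSV.
var mlContext = new MLContext(seed: 0);
var fullData = mlContext.Data.LoadFromTextFile<SentimentData>(fullDataPath, hasHeader: true);
var split = mlContext.Data.TrainTestSplit(fullData, testFraction: 0.2);

foreach (long rows in new long[] { 250, 1000, 5000, 20000, 80000 })
{
    var trainSample = mlContext.Data.TakeRows(mlContext.Data.ShuffleRows(split.TrainSet), rows);
    var model = pipeline.Fit(trainSample);
    var metrics = mlContext.BinaryClassification.Evaluate(model.Transform(split.TestSet));
    Console.WriteLine($"{rows,6} training rows -> AUC {metrics.AreaUnderRocCurve:P2}");
}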

I tried the full dataset on v0.6.0:

Accuracy: 50.00 %
Auc: 50.00 %
F1Score: NaN

Sentiment Predictions

Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Positive
Sentiment: He is the best, and the article should say that. | Prediction: Positive

Hi team, I am using the sample dataset with version 0.6.0, and I see the output displaying opposite prediction results.

Output:
Warning: Format error at (83,3)-(83,4011): Illegal quoting
Processed 251 rows with 0 bad values and 1 format errors
Warning: Format error at (83,3)-(83,4011): Illegal quoting
Processed 251 rows with 0 bad values and 1 format errors
Warning: Format error at (83,3)-(83,4011): Illegal quoting
Processed 251 rows with 0 bad values and 1 format errors
Not adding a normalizer.
Making per-feature arrays
Changing data from row-wise to column-wise
Bad value at line 1 in column Label
Warning: Format error at (83,3)-(83,4011): Illegal quoting
Processed 251 rows with 1 bad values and 1 format errors
Processed 250 instances
Binning and forming Feature objects
Reserved memory for tree learner: 12771792 bytes
Starting to train ...
Not training a calibrator because it is not needed.
--------------end of datapath------------------
Bad value at line 1 in column Label
Processed 19 rows with 1 bad values and 0 format errors

PredictionModel quality metrics evaluation

Accuracy: 94.44%
Auc: 98.77%
F1Score: 94.74%

Sentiment Predictions

Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Positive
Sentiment: He is the best, and the article should say that. | Prediction: Negative
Sentiment: Hi There, Thanks for writing to us! Please can you explain bit more on your ask? | Prediction: Negative
Sentiment: Hi this is ridiculous | Prediction: Positive

Code:
Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Negative" : "Positive")}");

From the output we can see there are illegal-quoting warnings when reading line 83 of the data, so I added a double quote at the end of line 83; after that, the output displays Negative prediction results for all the statements.

Let me know if I am doing anything wrong.

I'm getting this with v0.7.0:

=============== Evaluating Model accuracy with Test data===============
Model quality metrics evaluation
--------------------------------
Accuracy: 94.44%
Auc: 98.77%
F1Score: 94.74%
=============== End of model evaluation ===============
=============== Prediction Test of loaded model with a multiple samples ===============
Sentiment: This is a very rude movie | Prediction: Toxic | Probability: 0.5297049
Sentiment: He is the best, and the article should say that. | Prediction: Toxic | Probability: 0.9918675
=============== End of predictions ===============

Changing "He is the best, and the article should say that." to "Very good" makes it not toxic. Seems the training data is not big enough.

I'd much prefer it if the sample were updated so that it works as described. As it is, it gives the impression that something's wrong with ML.NET or that the user didn't follow the example correctly.

Anyway, I wrote a quick gist that transforms the full Wikipedia toxicity dataset (see https://meta.wikimedia.org/wiki/Research:Detox/Data_Release) into the file format expected by the example, and the numbers seem to make more sense as far as the Toxic detection is concerned. Hope it turns out to be useful to someone else.
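A rough sketch of the idea (not the actual gist) is below. It assumes the source file is tab-separated with a header row, with the comment text in the second column and the toxicity label in the third; all of these column positions are assumptions about the source data layout.

// Hypothetical converter: turn an annotated toxicity TSV into the
// label<TAB>text layout the tutorial's SentimentData/TextLoader expects.
// Column positions and the header row are assumptions.
using System;
using System.IO;
using System.Linq;

class DetoxConverter
{
    static void Main(string[] args)
    {
        string inputPath = args[0];   // annotated toxicity TSV
        string outputPath = args[1];  // file to feed to the tutorial

        var converted = File.ReadLines(inputPath)
            .Skip(1)                                   // skip header row
            .Select(line => line.Split('\t'))
            .Where(cols => cols.Length >= 3)
            .Select(cols => $"{cols[2]}\t{cols[1]}");  // label first, then text

        File.WriteAllLines(outputPath, converted);
    }
}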

Just for the record, the output with 0.8.0:

=============== Prediction Test of loaded model with a multiple samples ===============
Sentiment: This is a very rude movie | Prediction: Toxic | Probability: 0,5297049
Sentiment: Please refrain from adding nonsense to Wikipedia. | Prediction: Not Toxic | Probability: 0,1813062
Sentiment: He is the best, and the article should say that. | Prediction: Toxic | Probability: 0,9918675
=============== End of predictions ===============

@nickntg: Thanks for building your script.

We also store a full version of the processed Wikipedia Detox dataset on our CDN for real-world performance benchmarking of ML.NET (link).

I'd recommend trying it with bigrams + tri-char-grams using AveragedPerceptron with iter = 10.
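A minimal sketch of that configuration on the MLContext API follows; the option and property names (TextFeaturizingEstimator.Options, WordBagEstimator.Options, numberOfIterations) come from later ML.NET releases and are assumptions relative to the 0.x API discussed in this thread.

// Sketch: word bigrams + character trigrams for text featurization, then
// AveragedPerceptron with 10 iterations. Names follow ML.NET 1.x.
var mlContext = new MLContext(seed: 0);

var textOptions = new TextFeaturizingEstimator.Options
{
    WordFeatureExtractor = new WordBagEstimator.Options { NgramLength = 2 },
    CharFeatureExtractor = new WordBagEstimator.Options { NgramLength = 3 }
};

var pipeline = mlContext.Transforms.Text
    .FeaturizeText("Features", textOptions, nameof(SentimentData.SentimentText))
    .Append(mlContext.BinaryClassification.Trainers.AveragedPerceptron(
        labelColumnName: "Label",
        featureColumnName: "Features",
        numberOfIterations: 10));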

