If the training and test data are biased toward, for example, one class, then the training process will be biased as well.
http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html
This isn't a Keras issue.
https://www.quora.com/Which-balance-strategy-to-learn-from-my-very-imbalanced-dataset
So when the data is unbalanced, one possible strategy is:
When you update weights on a minibatch during training, consider the proportions of the two classes in the minibatch and update the weights accordingly.
So if I have many more negatively labelled samples than positive ones, it may be good to create batches with the same number of negative and positive samples. Or, if I want to emphasize negatives, I need the option to put more negatives into the batches.
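That balanced-batch idea could be sketched like this (a rough illustration, not code from the thread; the 50/50 split, the function name, and sampling with replacement are my assumptions):

```python
import numpy as np

def balanced_batch_generator(X, y, batch_size=32):
    """Yield minibatches containing equal numbers of positive and negative samples."""
    pos_idx = np.where(y == 1)[0]
    neg_idx = np.where(y == 0)[0]
    half = batch_size // 2
    while True:
        batch_idx = np.concatenate([
            np.random.choice(pos_idx, half),  # oversample the minority class
            np.random.choice(neg_idx, half),  # subsample the majority class
        ])
        np.random.shuffle(batch_idx)
        yield X[batch_idx], y[batch_idx]
```

A generator like this could be handed to `fit_generator`, so each gradient step sees a balanced batch even though the full data set is not.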
Of course, I could make the data balanced by adding the same negatives many times, to trick Keras?
Some info relevant to the answer may be found in:
https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/keras-users/LYo7sqE75N4/9K2TJHngCAAJ
I have tried to "balance" out the classes by setting class_weight={0:1, 1:100000}.
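Rather than hand-picking an extreme ratio like that, the weights can be derived from the class frequencies. A small sketch (the inverse-frequency formula used here is one common convention, not something prescribed in this thread):

```python
import numpy as np

def inverse_frequency_weights(y):
    """Per-class weights inversely proportional to class frequency:
    weight_c = n_samples / (n_classes * n_samples_in_class_c)."""
    classes, counts = np.unique(y, return_counts=True)
    total = len(y)
    return {int(c): total / (len(classes) * n) for c, n in zip(classes, counts)}
```

The resulting dict can be passed as the `class_weight` argument of `model.fit`, giving the minority class a proportionally larger say in the loss.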
https://www.quora.com/In-classification-how-do-you-handle-an-unbalanced-training-set
http://stackoverflow.com/questions/30486033/tackling-class-imbalance-scaling-contribution-to-loss-and-sgd
http://metaoptimize.com/qa/questions/11636/training-neural-networks-using-stochastic-gradient-descent-on-data-with-class-imbalance
http://wiki.pentaho.com/display/DATAMINING/SMOTE
https://www.quora.com/Whats-a-good-approach-to-binary-classification-when-the-target-rate-is-minimal
http://scikit-learn.org/stable/auto_examples/svm/plot_separating_hyperplane_unbalanced.html#example-svm-plot-separating-hyperplane-unbalanced-py
https://github.com/fchollet/keras/issues/177
Loss scaling would happen inside objectives.py functions, using a class_weight parameter set in model.fit or model.train. The amount of changes needed to get it rolling would be minimal.
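The loss scaling described there can be illustrated in plain NumPy (just the math, not the actual objectives.py code; the function name and shapes are my own):

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, class_weight, eps=1e-7):
    """Binary cross-entropy where each sample's loss is scaled by its class weight."""
    y_pred = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    per_sample = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # look up the weight for each sample according to its true class
    weights = np.where(y_true == 1, class_weight[1], class_weight[0])
    return np.mean(weights * per_sample)
```

With `class_weight={0: 1, 1: 100}`, mistakes on the positive class cost 100x more, which is exactly the effect the `class_weight` parameter is meant to have on the objective.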
The problem is not as simple as it seems; here are more links:
http://ro.uow.edu.au/cgi/viewcontent.cgi?article=10491&context=infopapers
"A supervised learning approach for imbalanced data sets", by Giang H. Nguyen, Abdesselam Bouzerdoum, and Son Lam Phung (University of Wollongong).
and more links
http://arxiv.org/pdf/1508.03422.pdf
Cost-Sensitive Learning of Deep Feature Representations from Imbalanced Data
http://www.cs.utah.edu/~piyush/teaching/ImbalancedLearning.pdf
Learning from Imbalanced Data
and even a PhD thesis:
http://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=4544&context=etd
A balanced approach to the multi-class imbalance problem
Lawrence Mosley
Iowa State University
This example shows how to deal with unbalanced data: http://pastebin.com/0QHtPGzJ , but it still doesn't work for my task.
How are you doing with your task? I ran into the same imbalance problem, classifying time-series data sets where the proportion of the minority class is about 0.2%. I tried oversampling methods like SMOTE, but it didn't work.
@Sandy4321 @danielgy
Training:
- model.fit(X_train, Y_train, nb_epoch=5, batch_size=32, class_weight='auto')
- Use validation_data instead of validation_split in fit(). That way you can provide an unbalanced validation set and val_loss becomes a better measure of real performance. (Not sure if this isn't implicitly taken care of with validation_split.)

Evaluation:
Results are bad.
- Test on an unbalanced test set.
- Average Precision, aka AUC of the Precision-Recall Curve (AUC of PR). In contrast to ROC AUC, this measure incorporates class imbalance.
- AUC_PR = average_precision_score(y_true=y_test, y_score=model.predict(X_test), average='weighted')

Let us know how this works for extreme class imbalance.
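A self-contained version of that evaluation step, with toy data standing in for model.predict(X_test) (the labels and scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Imbalanced ground truth: 8 negatives, 2 positives.
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Hypothetical predicted scores; here the two positives are ranked highest.
y_score = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.9, 0.7])

# Area under the Precision-Recall curve; unlike ROC AUC, it reflects class imbalance.
auc_pr = average_precision_score(y_true=y_test, y_score=y_score, average='weighted')
```

Because the positives here are ranked above every negative, the AP comes out at 1.0; a random scorer would land near the positive-class prevalence (0.2) instead.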
There is nothing related to class_weight = 'auto' in Keras code. Don't use it! Check https://github.com/fchollet/keras/issues/5116.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.
@fabioperez
Thanks for pointing out the 'auto' mistake. Edited post. Nice catch.
@Sandy4321
Presumably you convert probabilities into binary predictions via a threshold of 0.5. Your model may learn this class imbalance and put the actual best prediction threshold above or below 0.5. The ROC measure, for example, tries all thresholds and is quite robust against class imbalance. It can also give you things like the break-even point, etc. You can easily calculate the optimal threshold automatically, no need to learn it, if your point is decision automation anyway. Happy to hear what you found.
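Calculating the optimal threshold automatically can be sketched like this (picking the point that maximizes TPR minus FPR, i.e. Youden's J statistic, which is one common choice and an assumption on my part, not something specified above):

```python
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, y_score):
    """Pick the decision threshold maximizing TPR - FPR (Youden's J) on the ROC curve."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]
```

On an imbalanced problem the threshold returned this way often differs markedly from 0.5, which is exactly the effect described above.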