
Far-Butterscotch-436 t1_j0a9083 wrote

5% imbalance isn't bad. Just use a cost function with a metric that handles imbalance, e.g. the weighted average binomial deviance, and you'll be fine.
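A minimal sketch of what the comment suggests: a binomial deviance (log loss) where the minority class gets a larger weight. The `pos_weight` choice of roughly `n_neg / n_pos` is an assumption for illustration, not a rule from the comment.

```python
import numpy as np

def weighted_binomial_deviance(y_true, p_pred, pos_weight):
    """Binomial deviance with a larger weight on the positive
    (minority) class; pos_weight ~ n_neg / n_pos is one common
    balanced choice (an assumption here, not the only option)."""
    p = np.clip(p_pred, 1e-15, 1 - 1e-15)  # avoid log(0)
    w = np.where(y_true == 1, pos_weight, 1.0)
    return -np.mean(w * (y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Hypothetical 10-row sample with one positive: up-weighting the
# positive makes the loss penalize missing it much more heavily.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
p = np.full(10, 0.1)
loss_unweighted = weighted_binomial_deviance(y, p, pos_weight=1.0)
loss_weighted = weighted_binomial_deviance(y, p, pos_weight=9.0)
```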

Also, you can build a downsampling ensemble and compare its performance. Don't downsample to 50/50; aim for at least 10% minority.
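A sketch of that downsampling ensemble, under assumed synthetic data (the ~5% positive rate, logistic-regression members, and five ensemble members are all illustrative choices, not from the comment): each member keeps all positives, downsamples negatives to roughly a 10% positive fraction, and the ensemble averages predicted probabilities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: many rows, few features, ~5% positives.
n, d = 5000, 8
X = rng.normal(size=(n, d))
y = (X[:, 0] + 0.5 * rng.normal(size=n) > 1.8).astype(int)

pos_idx = np.where(y == 1)[0]
neg_idx = np.where(y == 0)[0]

# Downsample negatives so positives make up ~10% of each member's
# training set (not 50/50, per the comment's advice).
target_pos_frac = 0.10
n_neg_keep = int(len(pos_idx) * (1 - target_pos_frac) / target_pos_frac)

models = []
for seed in range(5):
    member_rng = np.random.default_rng(seed)
    keep_neg = member_rng.choice(
        neg_idx, size=min(n_neg_keep, len(neg_idx)), replace=False
    )
    idx = np.concatenate([pos_idx, keep_neg])
    models.append(LogisticRegression().fit(X[idx], y[idx]))

# Ensemble prediction: mean probability across members.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
```

Averaging probabilities (rather than hard votes) keeps the output usable for ranking and threshold tuning on the original imbalanced distribution.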

You've got a good problem: lots of observations and few features.


Far-Butterscotch-436 t1_ivuhdm4 wrote

Easy: use all the training data, with smaller label weights for the uncertain examples. But keep in mind: if the data is uncertain, how can you trust it? If you say a label is uncertain, is there a probability that the label is incorrect? How will you measure performance on your uncertain data vs. your certain data? Boosting algorithms will certainly overfit to the noisy labels, so it will be difficult.
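A sketch of the label-weighting idea with a boosting model. The data, the 30% flip rate on uncertain labels, and the 0.3 down-weight are all assumptions for illustration; evaluating on the certain rows only reflects the comment's point that uncertain labels can't serve as trustworthy ground truth.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 "certain" rows plus 500 "uncertain" rows
# whose labels may be wrong.
X = rng.normal(size=(1500, 4))
y = (X[:, 0] > 0).astype(int)
# Flip ~30% of the uncertain labels to simulate label noise.
flip = rng.random(500) < 0.3
y[1000:] = (y[1000:] ^ flip).astype(int)

# Smaller sample weight on the uncertain rows (0.3 is an assumed value).
weights = np.ones(1500)
weights[1000:] = 0.3

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2)
clf.fit(X, y, sample_weight=weights)

# Evaluate on the certain rows only: the uncertain labels cannot be
# trusted as ground truth for measuring performance.
acc_certain = clf.score(X[:1000], y[:1000])
```

Keeping the trees shallow and the number of estimators modest is one way to limit the overfitting-to-noise risk the comment warns about.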
