Description
I'm trying to plot a reliability diagram for data that has been resampled.
My dataset is very imbalanced (rows with positive targets are less than 1% of the unsampled data), and a common approach to dealing with an imbalanced dataset is random undersampling of the negative targets. As a result, each negative row in my sample actually "represents" 20x as many rows of the original data.
When I plot the reliability diagram, the probabilities are all way off. This is to be expected, because the diagram does not account for the sampling: frequencies computed on the sampled data do not reflect the true probabilities I want my calibration to output. Using sample weights fixes this issue (most of sklearn's models support a sample_weight parameter).
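To make the effect concrete, here is a minimal sketch with hypothetical numbers (100 positives, 495 retained negatives, each negative weighted 20x; these counts are illustrative assumptions, not my real data). It shows how the unweighted positive rate of the sample differs from the weighted one, which recovers the base rate of the original data:

```python
import numpy as np

# Hypothetical counts: 100 positives kept as-is, 9,900 negatives
# undersampled 20x down to 495 rows.
n_pos, n_neg_kept, neg_weight = 100, 495, 20.0

y = np.concatenate([np.ones(n_pos), np.zeros(n_neg_kept)])

# Each retained negative stands in for 20 original rows; positives count once.
weights = np.where(y == 1, 1.0, neg_weight)

# Positive rate as seen in the undersampled data (inflated).
unweighted_rate = y.mean()

# Positive rate after re-weighting: 100 / (100 + 495 * 20) = 1%.
weighted_rate = np.average(y, weights=weights)
```

The same distortion applies inside every bin of a reliability diagram, which is why the observed frequencies look far too high without weights.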
What I Did
I fixed this issue by changing a few lines in the plot_reliability_diagram function.
I added an optional parameter weights=None and used it when computing the per-bin statistics:
    if weights is None:
        weights = np.ones_like(x)
    mean_count_array = np.array([[np.average(y[digitized_x == i], weights=weights[digitized_x == i]),
                                  sum(weights[digitized_x == i]),
                                  np.average(x[digitized_x == i], weights=weights[digitized_x == i])]
                                 for i in np.unique(digitized_x)])
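For anyone who wants to try the idea outside the library, below is a self-contained sketch of the same per-bin computation. The function name weighted_reliability_bins, the n_bins parameter, and the equal-width binning on [0, 1] are my own assumptions for illustration; they are not the actual internals of plot_reliability_diagram.

```python
import numpy as np

def weighted_reliability_bins(y, x, weights=None, n_bins=10):
    """Hypothetical helper: per-bin [weighted fraction of positives,
    total weight, weighted mean predicted probability]."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if weights is None:
        weights = np.ones_like(x)  # unweighted case: every row counts once
    weights = np.asarray(weights, dtype=float)

    # Equal-width bins on [0, 1]; using only the interior edges gives
    # bin indices 0..n_bins-1 and puts x == 1.0 in the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    digitized_x = np.digitize(x, edges[1:-1])

    rows = []
    for i in np.unique(digitized_x):  # only non-empty bins are visited
        mask = digitized_x == i
        rows.append([np.average(y[mask], weights=weights[mask]),
                     weights[mask].sum(),
                     np.average(x[mask], weights=weights[mask])])
    return np.array(rows)

# Toy usage: negatives carry weight 20, positives weight 1.
scores = np.array([0.2, 0.2, 0.8, 0.8])
labels = np.array([0, 1, 0, 1])
w = np.array([20.0, 1.0, 20.0, 1.0])
bins = weighted_reliability_bins(labels, scores, weights=w, n_bins=2)
```

In this toy example each bin holds one positive and one negative, so the unweighted fraction of positives would be 0.5, while the weighted fraction is 1/21 because each negative counts 20 times.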