Allow use of sample weights for dealing with imbalanced data #32

@amits-biga

Description

I'm trying to plot a reliability diagram for data that has been undersampled.
My dataset is very imbalanced: rows with positive targets make up less than 1% of the unsampled data. A common approach to dealing with an imbalanced dataset is random undersampling of the negative targets, so each remaining negative row actually "represents" 20x as many rows.
When I plot the reliability diagram, my probabilities are all way off. This is expected, since the diagram does not account for the sampling: probabilities computed on the sampled data do not reflect the true probabilities I want my calibration to output. Sample weights fix this issue (most of sklearn's models support the sample_weight parameter).
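To illustrate why the weights matter, here is a minimal sketch with made-up numbers (100 positives, and 500 negatives kept out of a hypothetical 10,000). It assumes negatives were kept at a rate of 1/20 and all positives were kept, so each negative gets a weight of 20; the counts are illustrative and not from my data.

```python
import numpy as np

# Hypothetical undersampled labels: 100 positives, 500 negatives
# (the 500 negatives stand in for 10,000 original negative rows).
y = np.concatenate([np.ones(100), np.zeros(500)])

# Assumption: negatives were kept at a rate of 1/20 and positives were all kept,
# so each kept negative carries a weight of 20 and each positive a weight of 1.
weights = np.where(y == 1, 1.0, 20.0)

# Positive rate as seen in the undersampled data (inflated).
unweighted_rate = y.mean()

# Weighted positive rate, an estimate of the rate in the original data.
weighted_rate = np.average(y, weights=weights)

print(f"unweighted: {unweighted_rate:.4f}, weighted: {weighted_rate:.4f}")
```

The unweighted rate is what an unweighted reliability diagram effectively compares the predictions against, which is why the curve looks so far off.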

What I Did

I fixed this issue by changing a few lines in the plot_reliability_diagram function.
I added an optional parameter weights=None.

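    # Default to uniform weights, which reproduces the unweighted behavior.
    if weights is None:
        weights = np.ones_like(x)

    # One row per occupied bin: weighted mean of y, total weight, weighted mean of x.
    mean_count_array = np.array([[np.average(y[digitized_x == i], weights=weights[digitized_x == i]),
                                  np.sum(weights[digitized_x == i]),
                                  np.average(x[digitized_x == i], weights=weights[digitized_x == i])]
                                 for i in np.unique(digitized_x)])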
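For anyone who wants to try this outside the library, here is a self-contained sketch of the same per-bin computation. The function name `weighted_reliability_bins`, the `n_bins` parameter, and the equal-width binning on [0, 1] are my own assumptions for illustration and may not match how plot_reliability_diagram bins internally; I also assume `y` holds the 0/1 targets and `x` the predicted probabilities.

```python
import numpy as np

def weighted_reliability_bins(y, x, weights=None, n_bins=10):
    """Hypothetical helper: weighted per-bin statistics for a reliability diagram.

    Returns one row per occupied bin:
    [weighted mean of y, total weight, weighted mean of x].
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if weights is None:
        weights = np.ones_like(x)  # uniform weights give the unweighted result
    weights = np.asarray(weights, dtype=float)

    # Assumption: equal-width bins on [0, 1]; only interior edges are passed
    # to np.digitize so that x == 1.0 falls into the last bin.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    digitized_x = np.digitize(x, edges[1:-1])

    rows = []
    for i in np.unique(digitized_x):
        mask = digitized_x == i
        w = weights[mask]
        rows.append([np.average(y[mask], weights=w),
                     w.sum(),
                     np.average(x[mask], weights=w)])
    return np.array(rows)

# Toy usage: two bins, negatives weighted 20x.
bins = weighted_reliability_bins(y=[1, 0, 0, 1], x=[0.1, 0.1, 0.9, 0.9],
                                 weights=[1, 20, 20, 1], n_bins=2)
print(bins)
```

Note that np.average raises a ZeroDivisionError if the weights in a bin sum to zero, so zero weights would need separate handling.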
