How To Deal with Imbalanced data? — part 1

Goutham Chandrasekaran
5 min read · Aug 13, 2020


So you think you have found the perfect data to build your ML model? All the labels and data points in the dataset match your criteria, or you built a custom dataset through inspection and web-scraping techniques. But what if the dataset is highly imbalanced (i.e. 80% of the data belongs to one class)? Let's see how to deal with this problem.

Art of Balancing [source : google]

What is imbalanced data, and what is the problem with it?

As said earlier, imbalanced data is data where the majority of the points belong to a single class/label. Machine learning models need a reasonably balanced dataset to learn features from every class and produce a model with good accuracy. If the data is imbalanced, the model may learn little from the minority class, and hence it becomes biased towards the majority class.
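To make that bias concrete, here is a minimal sketch (with made-up data, not from this post) showing that a classifier which always predicts the majority class already scores 80% accuracy on an 80/20 dataset while never detecting the minority class.

# hypothetical illustration: accuracy looks fine, minority recall is zero
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

X = np.random.rand(100, 3)                 # 100 made-up samples
y = np.array([0] * 80 + [1] * 20)          # 80% class 0, 20% class 1

majority_clf = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = majority_clf.predict(X)

print(accuracy_score(y, y_pred))           # 0.80
print(recall_score(y, y_pred))             # 0.0 (minority class is never caught)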


Examples of Imbalanced data

  • Credit card fraud detection / fraud detection in general
  • Spam detection in emails
  • Analyzing pay-per-click stream advertisements
  • Web-scraped data (as part of custom data creation)
  • Disease detection systems

In all the above examples, there is a high possibility of data imbalance. Let's take spam detection, where we predict whether an email is spam or not and decide whether it goes to the spam folder or the inbox. (e.g. out of 100 emails, maybe 75 will fall under the not-spam category and the remaining 25 will be spam, or vice versa.)

Similarly, out of thousands of card transactions, only a few may be fraudulent. So straight away we can see the chance of imbalanced data.

How to deal with it?

Important note: This blog focuses only on addressing the imbalance in the dataset, not on the data pre-processing and model-building parts.

There are generally two ways of dealing with imbalanced data: oversampling and downsampling techniques.

Let's look at the oversampling techniques here and leave the downsampling techniques for part 2 of this blog.

Oversampling the data

The Python Package Index (PyPI) has a library for resampling data, called imbalanced-learn. This library supports various oversampling and downsampling techniques. Let's see the list of oversampling techniques available in imbalanced-learn (a quick import sketch follows the list):

  • Random minority over-sampling with replacement
  • SMOTE — Synthetic Minority Over-sampling Technique
  • SMOTENC — SMOTE for Nominal and Continuous features
  • bSMOTE(1 & 2) — Borderline SMOTE of types 1 and 2
  • SVM SMOTE — Support Vectors SMOTE
  • ADASYN — Adaptive synthetic sampling approach for imbalanced learning
  • KMeans-SMOTE
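
As a quick orientation (class names as exposed by imbalanced-learn; the pattern shown is illustrative, not prescriptive), these samplers all live in imblearn.over_sampling and share the same fit_resample interface:

# import sketch for the oversamplers listed above
from imblearn.over_sampling import (
    RandomOverSampler,   # random minority over-sampling with replacement
    SMOTE,               # plain SMOTE
    SMOTENC,             # SMOTE for nominal + continuous features
    BorderlineSMOTE,     # bSMOTE types 1 and 2 (kind="borderline-1" / "borderline-2")
    SVMSMOTE,            # SVM SMOTE
    ADASYN,              # adaptive synthetic sampling
    KMeansSMOTE,         # KMeans-SMOTE
)

# every sampler follows the same pattern:
# X_resampled, y_resampled = SMOTE().fit_resample(X, y)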

Let's discuss the most popular and effective oversampling technique, SMOTE.

SMOTE

The Synthetic Minority Oversampling Technique, or SMOTE for short, generates synthetic data for the minority class using k-nearest neighbors.

According to this paper, a synthetic sample is generated by taking the difference between a minority-class point and one of its nearest neighbors, multiplying this difference by a random number between 0 and 1, and adding the result to the original point. This method improves the quantity and quality of the minority class while using only the original minority-class data.
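
In other words, each synthetic sample lies somewhere on the line segment joining a minority point and one of its nearest minority neighbors. A minimal sketch of that interpolation step (illustrative values, not the library's internal code):

# one SMOTE interpolation step, on made-up 2-D points
import numpy as np

x_i  = np.array([2.0, 3.0])   # a minority-class point
x_nn = np.array([4.0, 5.0])   # one of its k nearest minority neighbors

lam = np.random.uniform(0, 1)             # random number between 0 and 1
x_synthetic = x_i + lam * (x_nn - x_i)    # new point between x_i and x_nn
print(x_synthetic)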

Let's see an example of where we can use this technique. We will take a toy dataset of credit card fraud detection with 100 rows in total.
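
The post loads this toy data from a prepared file. If you want to follow along without it, a comparable 100-row imbalanced dataset can be sketched with scikit-learn's make_classification; the column names and the roughly 88/12 split below are assumptions made to mirror the example, not the author's actual data.

# hedged sketch of a comparable toy dataset (not the original file)
import pandas as pd
from sklearn.datasets import make_classification

X_toy, y_toy = make_classification(
    n_samples=100, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.88, 0.12], flip_y=0, random_state=42,
)
df = pd.DataFrame(X_toy, columns=['V1', 'V2', 'V3', 'V4'])
df['Class'] = y_toy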

Dataset:

df.head()

This toy dataset has only 100 rows; let's see the class distribution below.

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Class', data=df)
plt.title('Class Distributions \n (0: No Fraud || 1: Fraud)', fontsize=14)
df.Class.value_counts()

As we can see, only 12 entries are Fraud transactions.

With SMOTE, we are not over-sampling the minority class with replacement; instead, we synthetically generate new points for the minority class in order to balance the dataset.
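
To see the difference concretely, here is a small sketch with made-up one-dimensional values (not the post's dataset): random over-sampling only repeats existing minority rows, while SMOTE interpolates new ones between them.

# random duplication vs. synthetic interpolation, on toy 1-D data
import numpy as np
from imblearn.over_sampling import RandomOverSampler, SMOTE

X_demo = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2], [1.3], [1.4]])
y_demo = np.array([1, 1, 1, 0, 0, 0, 0, 0])   # 3 minority vs 5 majority samples

X_ros, _ = RandomOverSampler(random_state=0).fit_resample(X_demo, y_demo)
X_sm,  _ = SMOTE(k_neighbors=2, random_state=0).fit_resample(X_demo, y_demo)

# X_ros only repeats 0.0 / 0.1 / 0.2, while X_sm contains new interpolated values in between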

Below we set k_neighbors=4 in SMOTE, which means we consider the 4 nearest minority-class points around each original point when generating synthetic data. If we only had 2 data points in the minority class (i.e. 98 rows are Not Fraud and only 2 are Fraud), we would have to set k_neighbors=1, because only 2 points are original and the new synthetic data is generated solely from these original minority-class points.


# imports
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE

sm = SMOTE(k_neighbors=4)

# split data-points and label
X = np.array(df.loc[:, df.columns != 'Class'])
y = np.array(df['Class'])

# fit SMOTE to X and y (fit_sample was renamed fit_resample in recent imbalanced-learn)
Xnew, ynew = sm.fit_resample(X, y)

Now that we have fitted SMOTE to the data, let's visualize the result to check the class distribution.

# plotting the class distribution after SMOTE
pd.Series(ynew).value_counts().plot(kind='bar')

Here we can see that SMOTE has increased the size of the minority class and transformed our imbalanced dataset into a balanced one.

print("Length of data before sampling : ",len(df))
print("Length of data after SMOTE : ",len(df_new))

We can see that SMOTE has added 76 more rows in favor of the minority class (Fraud in our case). Using this balanced data, we can build a model with good accuracy and a good recall value.

End Note

This blog focused on the intuition behind SMOTE and how to apply it to tackle imbalanced datasets.

As this is a toy dataset, we did not build a model and compare accuracy, recall, and F1 score. However, in part 2 of this blog, we will use a proper imbalanced dataset, build models, compare the metrics before and after oversampling, and also look into downsampling techniques, which are useful in some cases.
