Once you start collecting data to train classification models in machine learning, you might notice that your dataset is imbalanced.
This is a common phenomenon that occurs more often than we might think. In this article, let’s discuss the problems an imbalanced dataset can cause and take a look at the possible solutions.
What is an imbalanced dataset?
In a classification problem, the dataset consists of a large number of pairs of data points and labels.
A label is the class associated with a data point.
If the distribution of the labels is far from uniform, the dataset is called imbalanced.
Let’s understand this with some examples:
Case 1: In a two-class classification problem, let’s say you have 100k data points. The dataset is imbalanced if only 10k data points are from class 1 and the rest are from class 2. The distribution ratio here is 1:9.
For the same case, what if 45k data points are from class 1 and the remaining 55k are from class 2? Though it is slightly imbalanced, it can be treated as a balanced dataset, because a difference this small is unlikely to noticeably affect the effectiveness of the algorithm.
Case 2: In multi-class classification, the distribution of data points could be dominated by a few classes. In this case, too, the dataset is imbalanced.
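Before choosing a remedy, it helps to measure how skewed the labels actually are. The snippet below is a minimal sketch of that check using only the Python standard library; the helper names class_distribution and imbalance_ratio are made up for this illustration.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: fraction of data points carrying that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)

# Case 1 from the text: 10k points in class 1, 90k points in class 2.
labels = [1] * 10_000 + [2] * 90_000
print(class_distribution(labels))  # {1: 0.1, 2: 0.9}
print(imbalance_ratio(labels))     # 9.0

# The 45k / 55k variant gives a ratio only slightly above 1.
print(imbalance_ratio([1] * 45_000 + [2] * 55_000))
```

The same two functions work unchanged for the multi-class case, where a ratio far above 1 signals that a few classes dominate.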
What is the problem with an imbalanced dataset?
Data is power. Any machine learning algorithm, even the strongest, is useless without data to feed on. An algorithm is a logical arrangement that learns to perform a certain action or deliver a certain result using data. Without a good dataset, even the best algorithm cannot deliver good results. Conversely, even a weak algorithm can deliver very good results when it is given a high-quality dataset with a good number of data points.
If a dataset is biased towards one class, the algorithm will also be biased towards the same class.
An algorithm learns from the data it is fed. If we provide more examples of a particular class while feeding data, the machine learning model will learn more from those examples. In the worst case, whatever data you feed, the model will presume it belongs to the majority class (the class with more data points).
In short, an imbalanced dataset can make the model dumb.
Workarounds for an imbalanced dataset
The simplest approach: Can you collect more data? Easily?
The easiest way to tackle the imbalanced dataset problem is to collect more data for the class with the low distribution ratio. However, this simple approach has its own disadvantages. Collecting more data is a costly and time-consuming process in most cases.
If data collection takes little time and money, this is the preferred approach, but often collecting more data is not a feasible option. Let’s take a look at alternative approaches.
Upsampling and downsampling
In scenarios where collecting more data is not an option, upsampling the minority class or downsampling the majority class can rebalance the dataset.
Example: You have 100k data points for a two-class classification problem. Out of these, 10k data points are associated with the positive class and 90k are associated with the negative class.
Upsampling the minority class: Here, we take the 10k positive data points and repeat each of them 9 times. Thus, we would have 90k examples for both the positive and negative classes, and the dataset size would be 180k data points.
Downsampling the majority class: For this approach, we choose 10k data points randomly from the majority class. Then we will have 10k data points from each class and the total dataset size will be 20k data points.
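The arithmetic of the two strategies can be checked with a short sketch. It uses only the standard library, the tuples stand in for real data points, and the function names are hypothetical helpers written for this example.

```python
import random

def upsample_by_replication(minority, majority):
    """Repeat every minority point so the class roughly matches the majority size."""
    copies = len(majority) // len(minority)
    return minority * copies

def downsample_randomly(majority, target_size, seed=0):
    """Pick target_size majority points at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, target_size)

# Placeholder data: 10k positive and 90k negative points, as in the example.
positives = [("pos", i) for i in range(10_000)]
negatives = [("neg", i) for i in range(90_000)]

upsampled_positives = upsample_by_replication(positives, negatives)
print(len(upsampled_positives) + len(negatives))    # 180000

downsampled_negatives = downsample_randomly(negatives, len(positives))
print(len(positives) + len(downsampled_negatives))  # 20000
```

Resampling is normally applied to the training split only, so that the evaluation data keeps the real class distribution.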
While downsampling, we may lose significant information, whereas during upsampling, we may get redundant information. If this problem is impacting performance, we may choose ensemble learning techniques: bagging and boosting.
In bagging, we build many samples by bootstrap sampling (sampling with replacement) and train one model on each sample. At the end, we take a majority vote across the models to predict the class of a data point. Random forest is one such bagging technique.
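The bagging procedure can be sketched in a few lines. This is a toy illustration, not a random forest: the "model" is a single threshold on one feature, and the data, function names, and the choice of 25 models are all assumptions made for the example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) points with replacement; redraw if a class is missing,
    because the toy learner below needs both classes."""
    while True:
        sample = rng.choices(data, k=len(data))
        if len({label for _, label in sample}) == 2:
            return sample

def train_threshold_model(sample):
    """Toy 1-D learner: the midpoint between the two class means."""
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def bagging_fit(data, n_models=25, seed=0):
    """Train one threshold model per bootstrap sample."""
    rng = random.Random(seed)
    return [train_threshold_model(bootstrap_sample(data, rng))
            for _ in range(n_models)]

def bagging_predict(thresholds, x):
    """Each model votes 1 if x is above its threshold; the majority wins."""
    votes = Counter(1 if x > t else 0 for t in thresholds)
    return votes.most_common(1)[0][0]

# Imbalanced toy data: 30 negatives in [0, 2.9], 10 positives in [7, 7.9].
data = [(i * 0.1, 0) for i in range(30)] + [(7 + i * 0.1, 1) for i in range(10)]
models = bagging_fit(data)
print(bagging_predict(models, 9.0))  # 1
print(bagging_predict(models, 1.0))  # 0
```

An odd number of models avoids tied votes in the two-class case.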
On the other hand, in boosting, we build multiple weak learners in sequence and take a weighted sum of their outputs, which results in a strong predictive model. This article explains boosting in detail.
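To make the weighted sum of weak learners concrete, here is a small AdaBoost-style sketch. The text does not name a specific boosting algorithm, so AdaBoost with one-feature threshold stumps is an assumption chosen for illustration, as are the toy data and all function names.

```python
import math

def stump_predict(x, threshold, polarity):
    """polarity=+1 predicts +1 below the threshold; polarity=-1 predicts +1 above it."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, polarity) != y)
            if best is None or err < best[0]:
                best = (err, t, polarity)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    weights = [1 / n] * n
    ensemble = []  # entries are (alpha, threshold, polarity)
    for _ in range(rounds):
        err, t, polarity = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, polarity))
        # Misclassified points gain weight, correct ones lose weight.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def boosted_predict(ensemble, x):
    """Sign of the alpha-weighted sum of the weak learners' votes."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# No single threshold separates these labels, but three stumps together do.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = adaboost(xs, ys, rounds=3)
predictions = [boosted_predict(ensemble, x) for x in xs]
print(predictions == ys)  # True
```

Each single stump misclassifies at least three of the ten points here, while the weighted combination of three stumps classifies all of them correctly.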
Note that ensemble-based algorithms are computationally more expensive.
Data augmentation is a technique to artificially build the data. Consider this: what if we can build 80k data points from 10k positive data points?
With data augmentation, we can build similar augmented data points based on original data points.
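One simple technique for data augmentation is perturbation, where we add a small amount of noise (e.g. 0.000273) to the features of a data point while keeping the class label the same.
The SMOTE algorithm is a related data augmentation technique: instead of adding noise, it creates synthetic minority-class points by interpolating between a minority data point and one of its nearest minority-class neighbors.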
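Both ideas fit in a short standard-library sketch. The perturb function adds small Gaussian noise to each feature. The smote_like function follows the usual description of SMOTE, which interpolates between a minority point and one of its nearest minority neighbours; it is a simplified illustration rather than a reference implementation, and the data, names, and parameter values are assumptions.

```python
import math
import random

def perturb(point, scale, rng):
    """Return a copy of point with small Gaussian noise added to each feature."""
    return [value + rng.gauss(0, scale) for value in point]

def smote_like(minority, k, n_new, rng):
    """Create n_new synthetic points on the line segments between a minority
    point and one of its k nearest minority neighbours.
    Assumes the minority class has at least two points."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic

rng = random.Random(0)
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]

noisy_copy = perturb(minority[0], scale=0.001, rng=rng)
new_points = smote_like(minority, k=2, n_new=6, rng=rng)
print(len(new_points))  # 6
```

Every augmented point keeps the minority class label, so the new points can simply be appended to the positive examples in the training set.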
Change your perspective
This trick takes a different route. Instead of looking for clever hacks to solve the classification problem, change your perspective. If we are dealing with an imbalanced dataset that has very few examples of one class, we can treat the majority class as normal behavior and the rare class as anomalies, which converts the classification problem into an anomaly detection problem.
Once we frame the problem in terms of anomaly detection, we can use anomaly detection algorithms to solve it.
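As a rough illustration of the reframing, the sketch below learns what "normal" looks like from the majority class only and flags anything far from it. The z-score rule, the threshold of 3, and the one-feature toy data are assumptions chosen for simplicity; real anomaly detectors are usually more elaborate.

```python
import statistics

def fit_normal_profile(values):
    """Summarize the majority (normal) class by its mean and standard deviation."""
    return statistics.mean(values), statistics.stdev(values)

def is_anomaly(x, profile, z_threshold=3.0):
    """Flag x when it lies more than z_threshold standard deviations from the mean."""
    mean, std = profile
    return abs(x - mean) / std > z_threshold

# Toy majority-class readings clustered between 9.5 and 10.5.
normal_readings = [10 + 0.1 * (i % 11 - 5) for i in range(1000)]
profile = fit_normal_profile(normal_readings)

print(is_anomaly(10.2, profile))  # False
print(is_anomaly(15.0, profile))  # True
```

Note that the rare class is not needed to fit the profile at all, which is why this framing suits datasets with very few minority examples.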
A final trick
Measuring the effectiveness of an algorithm that was trained on an imbalanced dataset is tricky.
For classification, we generally use accuracy as a performance metric. As mentioned earlier, imbalanced data may make the model dumb: whatever you feed it, it will always predict the majority class.
If we consider the scenario discussed above, the dumb model will always predict the negative class, as 90% of the data is from the negative class.
In this situation, if we choose accuracy as the performance metric, the model will have 90% training accuracy (because 90% of the data points are negative, so the model predicts the right class 90 times out of 100) even though it only ever predicts the negative class.
For imbalanced datasets, it is advisable to use the F1 score, precision, recall, or area under the ROC curve as a performance metric instead of relying on accuracy alone.
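The gap between accuracy and the other metrics is easy to see on the always-negative model. The sketch below computes the metrics by hand with the standard library; treating precision as 0.0 when the model makes no positive predictions is a convention assumed here, and the function name is hypothetical.

```python
def classification_report(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 10 positive and 90 negative points; the model always predicts negative.
y_true = [1] * 10 + [0] * 90
y_pred = [0] * 100
report = classification_report(y_true, y_pred)
print(report)  # {'accuracy': 0.9, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

Accuracy looks respectable at 0.9, while recall and F1 of 0.0 expose that the model never finds a single positive example.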
The approach you use to tackle the imbalanced dataset problem depends on the type of data, the nature of the problem, and the availability of resources. Whichever approach you choose, make sure to pick the right performance metric.