Data PreProcessing for Machine Learning Made Easy. Part 3

Fahad Akbar

5 min read · Mar 7, 2020

Clubbing Infrequent Levels & Dealing With Untrained Levels

This is the third tutorial on the Python package PreProcess1. The second tutorial can be found here:

Data PreProcessing for Machine Learning Made Easy. Part 2

Label Encoding & Dealing with Zero / Near Zero Variance

medium.com

Clubbing Infrequent Levels

Occasionally you will come across categorical variables that have sufficient variance overall, yet contain certain levels that rarely appear. With so few examples, the model is unlikely to learn anything useful about those levels. For example, say we have a data set with 100 examples and a categorical feature, City_Name, with 20 cities in it (so there is a good amount of variation). A frequency distribution, however, reveals that two cities (say Winnipeg & Cooksville) appear only twice each in the entire data set (4 examples out of 100 in total). Because they appear so infrequently, our model is more likely to make errors when making a prediction about these cities.

One solution is to club such examples together. In our example, we will replace Winnipeg & Cooksville levels with one combined level named “other_city”. The end result would be 4 examples for “other_city”, which is better than having two cities with only two examples each.

When applied, PreProcess1 will take care of it automatically. Under the hood, we compute a frequency distribution and then club together all the levels that fall below a user-defined threshold. By default, the threshold in the path is the bottom 5%. All levels below this threshold will be clubbed together as “others_infrequent”. Let's see that in action!
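To make the idea concrete, here is a minimal sketch of frequency-based clubbing in plain Python. It is not PreProcess1's actual code: the function name `club_rare_levels`, its parameters, and the reading of the threshold as "share of examples below 5%" are all assumptions made for illustration. It also only clubs when at least two levels fall below the threshold, since merging a single rare level would just rename it.

```python
from collections import Counter


def club_rare_levels(values, threshold=0.05, new_level="others_infrequent"):
    """Merge levels whose share of the examples is below `threshold`.

    Hypothetical helper for illustration, not part of PreProcess1.
    """
    counts = Counter(values)
    total = len(values)
    rare = {level for level, n in counts.items() if n / total < threshold}

    # Clubbing a single rare level would only rename it, so leave it as is.
    if len(rare) < 2:
        return list(values)

    return [new_level if v in rare else v for v in values]


# Made-up data in the spirit of the City_Name example: 100 rows, two rare cities
# (only four cities are used here to keep the example short).
cities = ["Toronto"] * 60 + ["Ottawa"] * 36 + ["Winnipeg"] * 2 + ["Cooksville"] * 2
clubbed = club_rare_levels(cities)
print(Counter(clubbed))
```

Winnipeg and Cooksville each make up 2% of the rows, so both end up in the combined level, which then holds 4 examples.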

Most & Least Frequent Levels in the feature ‘native-country’

Holand-Netherlands, Scotland, Hungary & Honduras seem to reside at the bottom of the distribution, so let's turn the club_rare_level feature on and see how they are dealt with.

Let's check what the transformed data looks like (remember, it will be one-hot encoded):

We can see that Holand-Netherlands, Scotland, Honduras, Hungary were all combined to make a new level (native-country_others_infrequent) as they were all below the threshold individually.

One important note is that clubbing rare levels will only work if there are at least TWO levels below the threshold. If there is only one infrequent level, it doesn't make sense to club it with a level that is frequent enough!

You may also have noticed a feature in the transformed data set named “native-country_not_available”. This implies that ‘native-country’ had missing values, which were imputed with the “not_available” strategy, as this is the default option in the path.
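As a rough illustration of where such a column comes from, the sketch below fills missing values with an explicit “not_available” level and then one-hot encodes the result. The helper `one_hot` and the tiny sample are made up for this example, and the column naming (`feature_level`) simply mirrors the names seen in the transformed data; the package's internals may differ.

```python
def one_hot(values, feature, missing_label="not_available"):
    """One-hot encode a list of levels; None is treated as a missing value.

    Hypothetical helper for illustration, not part of PreProcess1.
    """
    # Replace missing entries with an explicit level before encoding.
    filled = [missing_label if v is None else v for v in values]
    levels = sorted(set(filled))
    columns = [f"{feature}_{level}" for level in levels]
    rows = [[int(v == level) for level in levels] for v in filled]
    return columns, rows


columns, rows = one_hot(["Canada", None, "Mexico", "Canada"], "native-country")
print(columns)
for row in rows:
    print(row)
```

The second row, which had no value, gets a 1 only in the `native-country_not_available` column.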

Dealing With Untrained Levels

Though you may not run into it often in routine academic data science work, it can become a real headache when the test data contains a level that was not in the training data set. A complete stranger to the model! To illustrate, say that while predicting or evaluating on the test data set, ‘native-country’ has a level “Afghanistan” that was not present in the training data set in the first place. If we do not take care of it, the model will typically fail to predict and crash your entire pipeline. Not a very pleasant sight!

Since it is a deal-breaker, the default behaviour of PreProcess1 is to totally ignore any such examples in the test data (even if you don't ask it to do so). So, if Afghanistan appears in the test data but was not part of the training data, PreProcess1 will simply ignore that level (this is also true if one or more new columns appear in the test data set). This is the least we can do to keep things running.
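The article does not spell out the mechanism behind "ignoring" a level, so the following is only one plausible reading for a one-hot encoded feature: encode the test data against the levels remembered from training, so an unseen level creates no new column and its row is all zeros. The function names `fit_levels` and `encode` are hypothetical.

```python
def fit_levels(train_values):
    """Remember the levels seen during training, in a fixed order."""
    return sorted(set(train_values))


def encode(values, levels):
    """One-hot encode against the training levels only.

    A level that was not seen in training matches none of the columns,
    so its row is all zeros and no new column is created.
    """
    return [[int(v == level) for level in levels] for v in values]


train_levels = fit_levels(["Canada", "Mexico", "Canada", "India"])
test_rows = encode(["Mexico", "Afghanistan"], train_levels)
print(train_levels)
print(test_rows)
```

The encoded test data keeps exactly the column layout the model was trained on, which is what stops the pipeline from crashing.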

Alternatively, you can turn on the feature to treat untrained levels. This will merge the unknown level into either the least frequent or the most frequent level (user-specified) that was available in the training data set. The default setting is to club it with the least frequent level.
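A minimal sketch of this treatment, under stated assumptions: the helper `treat_untrained_levels`, its `method` strings, and the sample counts are all invented for illustration and are not taken from PreProcess1's source. The idea is simply to look up the least (or most) frequent level in the training data and substitute it for any test value the training data never contained.

```python
from collections import Counter


def treat_untrained_levels(train_values, test_values, method="least frequent"):
    """Replace test levels unseen in training with a known training level.

    Hypothetical helper for illustration, not part of PreProcess1.
    """
    counts = Counter(train_values)
    ordered = counts.most_common()  # sorted from most to least frequent
    if method == "least frequent":
        replacement = ordered[-1][0]
    elif method == "most frequent":
        replacement = ordered[0][0]
    else:
        raise ValueError("method must be 'least frequent' or 'most frequent'")
    return [v if v in counts else replacement for v in test_values]


# Made-up counts: 'Never-worked' is the rarest training level,
# and 'social_assistance' appears only in the test data.
train = ["Private"] * 50 + ["State-gov"] * 10 + ["Never-worked"] * 7
test = ["Private"] * 4 + ["Never-worked"] * 2 + ["social_assistance"] * 3
treated = treat_untrained_levels(train, test)
print(Counter(treated))
```

With the default option, the three unseen rows are folded into 'Never-worked', which grows from 2 to 5 test samples. Passing `method="most frequent"` would fold them into 'Private' instead.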

In the future, we will add more advanced criteria based on some sort of similarity measure. For now, let's see it in action:

Above is our training data set. Zooming in on the ‘workclass’ column, we can see that there are 8 levels, and ‘Never-worked’ is the least frequent level, appearing only 7 times in the data set.

Now assume that we have the test data available. Let's look at the ‘workclass’ column in it:

A closer examination of the ‘workclass’ feature reveals that there are now 9 levels. ‘social_assistance’ is a new level that was not a part of the original training data set.

Let's apply the untrained level treatment in the path with the ‘least frequent’ option. We can work out the expected behaviour of the treatment beforehand: ‘social_assistance’ should be added to the ‘Never-worked’ level, since that was the least frequent level in the training data set. As a result, in the test data set, ‘Never-worked’ samples will increase from 2 to 5. Keep in mind that after transformation, the ‘workclass’ feature will be split into 8 features due to one-hot encoding.

As you can see, there is no column for ‘social_assistance’ and level ‘Never-worked’ now has 5 samples. This means our result is according to our expectations.

That's it for today, stay tuned for more exciting options!