Data PreProcessing for Machine Learning Made Easy. Part 2

Label Encoding & Dealing with Zero / Near Zero Variance

This is the second tutorial in the PreProcess1 series. The first part can be found here:

Today we will talk about label encoding, and about features that don’t carry enough information for a model to learn from.

Label Encoding:

We often find text features in our data sets. Sometimes they are plain nominal features, such as color (green, black, yellow, etc.), but occasionally we come across text features that are more than just nominal. They have some sort of sequence or hierarchy that yields information useful for machine learning. A simple example is a rating such as “good”, “bad” & “average”. We can sense right away that although the feature isn’t numeric, we can still quantify it: “good” is better than “average”, and “average” is better than “bad”. A subtle point is that we don’t really know how much better “good” is than “average”, and so on.

Nominal text features should simply be one-hot encoded. This type of feature, however, should be encoded in a way that preserves the sequential/hierarchical information. The solution is label encoding. In PreProcess1, the user simply provides the name of the feature to encode and the levels in the feature, ordered by their sequence/hierarchy. This should be provided in dictionary format (you can provide more than one column at once).

There may very well be a case where the features you selected have missing values. The beauty of the PreProcess1 pipeline (path) is that the label encoder comes after missing value imputation, so it seamlessly takes care of missing values according to the imputation strategy defined by the user.

Below is the code:
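The original code block did not survive extraction, so here is a minimal pandas sketch of the idea. Note that this is illustrative, not the actual PreProcess1 API: the user passes a dictionary mapping each feature name to its ordered levels, missing values are imputed first (here, with the most frequent level, standing in for whatever strategy the user defined), and then each level is mapped to its rank.

```python
import pandas as pd

# Toy data: an ordinal text feature with a missing value
df = pd.DataFrame({"rating": ["good", "bad", "average", None, "good"]})

# The user supplies the ordering as a dictionary: feature name -> levels
# listed from lowest to highest in the hierarchy
ordinal_levels = {"rating": ["bad", "average", "good"]}

for col, levels in ordinal_levels.items():
    # Impute missing values first (here: the most frequent level),
    # mirroring the pipeline order described above
    df[col] = df[col].fillna(df[col].mode()[0])
    # Map each level to its rank so the hierarchy is preserved
    mapping = {level: rank for rank, level in enumerate(levels)}
    df[col] = df[col].map(mapping)

print(df["rating"].tolist())  # [2, 0, 1, 2, 2]
```

Because the levels are supplied by the user rather than inferred, the encoder never has to guess the ordering, and the missing value ends up encoded consistently with the imputed level.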

Zero or Near Zero Variance

Say you have a data set with 100 samples/rows and one categorical feature, “city”. When you did an EDA (exploratory data analysis), you realized that “city” contains only one value, Toronto. In machine learning language, this is referred to as “zero variance”. The information in the feature does not change at all and does not warrant any learning from such a feature. We can safely remove these features.

You can extend the same concept to “near-zero variance”. Say in the above example, Toronto appeared 97 times and New York appeared three times. In this case there is variance, but still not enough.

PreProcess1 will take care of it for you. For those who are more curious, I will explain how it works under the hood, but you can safely skip this section if you want to.

**Near zero variance is determined by:**

1. The count of unique values divided by the total length of the feature is lower than a user-specified threshold. In the path, it is set to 10%

2. The count of the most common value divided by the count of the second most common value is greater than a user-specified threshold. In the path, this threshold is set to 20

Once both conditions are met, the feature is dropped. All we need is to enable the option in the path. It will automatically pick up the columns that meet the criteria.

Let’s try some coding!
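Since the coding section did not survive extraction, here is a small sketch of how the two conditions above can be checked. This is an illustration of the logic, not the PreProcess1 implementation; the function name, `unique_thresh`, and `freq_thresh` are assumptions matching the thresholds described above.

```python
import pandas as pd

def near_zero_variance(series, unique_thresh=0.10, freq_thresh=20):
    """Flag a feature whose information content is too low to learn from."""
    counts = series.value_counts()
    # Zero variance: only one unique value, drop unconditionally
    if len(counts) < 2:
        return True
    # Condition 1: ratio of unique values to sample size below threshold
    unique_ratio = series.nunique() / len(series)
    # Condition 2: most common count / second most common count above threshold
    freq_ratio = counts.iloc[0] / counts.iloc[1]
    return unique_ratio < unique_thresh and freq_ratio > freq_thresh

# 100 rows: "Toronto" 97 times, "New York" 3 times
city = pd.Series(["Toronto"] * 97 + ["New York"] * 3)
print(near_zero_variance(city))  # True: 2/100 < 0.10 and 97/3 ≈ 32.3 > 20
```

Running this on a balanced feature (say, 50/50 between two cities) returns False, because the frequency ratio is 1: both conditions must hold before a feature is dropped.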

That's it for today. Stay tuned for more exciting preprocessing options in the upcoming tutorials!

Below is Google’s Colab link:




Fahad Akbar

I practice, learn and teach Data Science