Data PreProcessing for Machine Learning Made Easy. Part 2

Label Encoding & Dealing with Zero / Near-Zero Variance
This is the second tutorial in the PreProcess1 series. The first part can be found here:
Today we will talk about label encoding and about features that do not contain enough information for a model to learn from.
Label Encoding:
We often find text features in our data sets. Sometimes they are plain nominal features, such as color (green, black, yellow, etc.), but occasionally we come across text features that are more than just nominal. They have some sort of sequence or hierarchy that carries information useful for machine learning. A simple example is a rating with levels "good", "average", and "bad". We can sense right away that although the feature isn't numeric, we can still quantify it: "good" is better than "average", and "average" is better than "bad". A subtle point is that we don't really know how much better "good" is than "average", and so on.
Nominal text features should simply be one-hot encoded. This type of feature, however, should be encoded in a way that preserves the sequential/hierarchical information. The solution is label encoding. In PreProcess1, the user simply provides the name of the feature to encode and its levels in order of their sequence/hierarchy. This is provided in a dictionary format (you can provide more than one column at once).
There may very well be a case where the features you selected have missing values. The beauty of the PreProcess1 path is that the label encoder comes after missing value imputation, so it seamlessly takes care of missing values according to the imputation strategy defined by the user.
Below is the code:
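(The original code embed is not shown here, and PreProcess1 wraps this step internally, so the following is only a minimal sketch of what ordinal label encoding looks like with plain pandas. The "rating" column, the tiny data frame, and the mode-based imputation are illustrative assumptions, not the library's actual API.)

```python
import pandas as pd

# Toy data: an ordinal text feature with a missing value.
# The "rating" column and this tiny frame are made up for illustration.
df = pd.DataFrame({
    "rating": ["good", "bad", "average", None, "good"],
    "price": [10, 4, 7, 6, 9],
})

# Dictionary format: feature name -> levels listed in their hierarchy,
# lowest first. More than one column can be provided at once.
ordinal_levels = {"rating": ["bad", "average", "good"]}

for col, levels in ordinal_levels.items():
    # PreProcess1 imputes before encoding; mode imputation here is just
    # a stand-in for whatever strategy the user configured.
    df[col] = df[col].fillna(df[col].mode()[0])
    # Map each level to its position in the user-defined order.
    df[col] = df[col].map({level: i for i, level in enumerate(levels)})

print(df)  # "rating" becomes 2, 0, 1, 2, 2
```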



Zero or Near-Zero Variance
Say you have a data set with 100 samples/rows, and one categorical feature is "city". When you did EDA (exploratory data analysis), you realized that "city" contains only one value, Toronto. In machine learning terms this is referred to as "zero variance": the information in the feature does not change at all, so there is nothing for a model to learn from it. We can safely remove such features.
You can extend the same concept to "near-zero variance". Say that in the above example, Toronto appeared 97 times and New York appeared three times. In this case there is some variance, but still not enough to be useful.
PreProcess1 will take care of this for you. For those who are curious, I will explain how it works under the hood, but you can safely skip this part if you want to.
Near-zero variance is determined by two conditions:
1. The count of unique values divided by the total length of the feature is lower than a user-specified threshold. In the path, this is set to 10%.
2. The count of the most common value divided by the count of the second most common value is greater than a user-specified threshold. In the path, this is set to 20.
When both conditions are met, the feature is dropped. For the example above: 2 unique values / 100 rows = 2%, which is below 10%, and 97 / 3 ≈ 32, which is above 20, so "city" would be dropped. All we need to do is enable the option in the path; it will automatically pick up the columns that meet the criteria.
Let’s try some coding!
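(Again, the original embed is missing. PreProcess1 runs this check for you, so the code below is only a hand-rolled sketch of the two conditions, using the 10% and 20 thresholds described above. The `near_zero_variance` function and the sample "city"/"sales" columns are illustrative, not part of the library.)

```python
import pandas as pd

def near_zero_variance(df, unique_ratio_threshold=0.10, freq_ratio_threshold=20):
    """Return the columns that satisfy both near-zero-variance conditions."""
    drop_cols = []
    for col in df.columns:
        counts = df[col].value_counts()  # sorted, most common first
        # Condition 1: unique values / total length below the threshold.
        unique_ratio = counts.size / len(df)
        # Condition 2: most common count / second most common count above
        # the threshold. A single unique value means zero variance outright.
        if counts.size == 1:
            freq_ratio = float("inf")
        else:
            freq_ratio = counts.iloc[0] / counts.iloc[1]
        if unique_ratio < unique_ratio_threshold and freq_ratio > freq_ratio_threshold:
            drop_cols.append(col)
    return drop_cols

# The 100-row example from the text: Toronto 97 times, New York 3 times.
df = pd.DataFrame({
    "city": ["Toronto"] * 97 + ["New York"] * 3,
    "sales": range(100),
})

to_drop = near_zero_variance(df)
print(to_drop)  # ['city'] -- 2/100 = 2% < 10%, and 97/3 ≈ 32 > 20
df = df.drop(columns=to_drop)
```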



That's it for today. Stay tuned for more exciting preprocessing options in the upcoming tutorials!
Below is Google’s Colab link:
https://colab.research.google.com/drive/14xACMqH1eYsqGkcHMWv_CzlaWVfzQQid