Data PreProcessing for Machine Learning Made Easy. Part 4

Fahad Akbar
5 min read · Mar 15, 2020

Dealing with High-Cardinality Data & Extracting Features from Time

This is the fourth tutorial for the Python package PreProcess1. The third tutorial can be found here:

Dealing with High-Cardinality Data

This is one of my favourites, and I enjoyed solving it quite a bit. High cardinality simply means that a categorical feature has too many levels. Imagine you have data from a wholesaler who supplies finished products to, say, 2,500 retail stores. The data you have is store-wise weekly sales. This means there will be a column (feature) for store names/IDs, with 2,500 distinct values. Now imagine applying one-hot encoding to it: this single feature will explode into 2,500 columns. Needless to say, this is an issue. At a minimum, it will be very expensive to fit a model, and even if you manage to fit one, the results may not be impressive.

PreProcess1 can handle this for you in a neat and clean fashion. There are two ways you can approach this issue within PreProcess1 (of course there are many others, and I would love to hear about them from you).

First, we can replace each level with its frequency in the data. Say store “ABCD” appears 250 times in the data; we replace the level “ABCD” with the integer 250. This way, you convert your text feature into a numeric feature with some degree of relevance to the actual level. Let's call it the “count” approach.
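To make the idea concrete, here is a minimal pandas sketch of the count approach (illustrative only — this is not PreProcess1's internal code, and the `store_id` column name is made up):

```python
import pandas as pd

# Toy data: a high-cardinality categorical column (column name is made up)
df = pd.DataFrame({"store_id": ["ABCD", "ABCD", "EFGH", "ABCD", "EFGH", "IJKL"]})

# Replace each level with its frequency (count) in the data
counts = df["store_id"].value_counts()
df["store_id"] = df["store_id"].map(counts)

print(df["store_id"].tolist())  # [3, 3, 2, 3, 2, 1]
```

The text feature is now numeric, and levels that occur with similar frequency end up with similar values.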

The second method is a bit more interesting. We first reduce the dimensionality of the data by breaking it down into two components (using PLS). This essentially tries to “explain” the entire data through two newly calculated features (which are numerical in nature). We then calculate the mean, median, min, max & standard deviation of these components, grouped by the levels of the feature. Once applied, we have a dataset where every row represents an original level, and the features are the min, max, mean, median and std of the PLS components. Lastly, we apply K-means clustering to this dataset to group levels that are similar. The algorithm automatically tries to determine the optimum number of clusters. All that is left is to replace the levels in the data with their respective cluster. Once done, the original feature that had 2,500 levels will be reduced to a much lower number. In simple words, we reduce cardinality by grouping/clustering the levels based on similarities between them. We can call it the “cluster” approach.

To apply this treatment, we need to turn the option on, select a strategy (count or cluster) and specify the column/columns we want to treat. Let's see that in action:

There are 42 levels (countries). We applied the cardinality treatment to this feature, and below is the result (remember, the end result is always one-hot encoded).

‘native-country’ has been transformed into only two levels (clusters), so 42 levels got converted to 2 clusters. We can actually see the details of which level was mapped to which cluster:

Now let's apply the ‘count’ strategy to the ‘occupation’ column:

Since this converts levels into their respective frequencies in the data, the occupation feature will be transformed into a numerical feature. Let's see how it looks:

Extracting Sub-Features From a Time Feature

The concept here is simple. If the data has date/time features, this option will generate sub-features like month, weekday, month-end, month-start & hour (if time is given). If the path was able to detect the date/time feature automatically (or you pre-specified the feature as a date type in the path), it will create the sub-features and drop the original time feature. Let's see it in action:
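The extraction can be approximated in plain pandas (illustrative only — this is not the path's actual code, and the `date` column name is made up):

```python
import pandas as pd

# Toy data with a single date/time column (column name is made up)
df = pd.DataFrame({"date": pd.to_datetime([
    "2020-01-31 10:15", "2020-02-01 23:05", "2020-03-15 08:30",
])})

# Generate the kinds of sub-features described above
df["month"] = df["date"].dt.month
df["weekday"] = df["date"].dt.weekday
df["is_month_end"] = df["date"].dt.is_month_end.astype(int)
df["is_month_start"] = df["date"].dt.is_month_start.astype(int)
df["hour"] = df["date"].dt.hour

# Drop the original time feature, as the path does
df = df.drop(columns=["date"])

print(df["month"].tolist())  # [1, 2, 3]
print(df["hour"].tolist())   # [10, 23, 8]
```

Each raw timestamp is now a handful of model-friendly numeric columns.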

There is a date feature, so let's apply the path:

The path was able to detect the date column as a date type, so there is nothing more to do; sub-features will be created automatically. Let's check the transformed data.

As we can see, the ‘date’ feature has been broken down into month, weekday, month-end and month-start features.

That's it for today, stay tuned for more exciting features !
