This article describes data preprocessing, the most important step in making sure a dataset is clean and ready for machine learning, so that models can produce predictions that are as accurate as possible.
Data preprocessing is an integral step in machine learning because the quality of the data, and of the useful information that can be derived from it, directly affects how well our models learn. It is therefore essential to clean and preprocess the data before feeding it into a machine learning model.
Obviously AI uses two major data preprocessing pipelines, which take a dataset from upload to training data that is ready to be fed into the model. Each pipeline involves several steps that clean the input data and structure it into standard formats. All of this happens quickly, typically within a couple of minutes, regardless of how big the dataset is.
Preprocessing Pipeline 1 consists of the following steps:
Defining column data types: A real-world dataset can contain several types of feature columns, such as text/categorical columns with discrete values (Gender - male/female, Color - blue/red), numeric columns (Price, Age, Revenue, etc.) and date/time columns (Birth date, Transaction dates). Each column must be labeled with its data type so that it is preprocessed correctly.
Removing duplicate rows/columns: Having rows or columns with the same values is not useful for prediction because they do not add any new information. Thus it is crucial to remove them beforehand.
Converting numerical columns to a single float type: For numeric columns it is best to have a single float type because it is easier to work with. We use the float32 data type.
Converting date/time columns to timestamps in seconds
Converting categorical columns to the standard pandas category data type and storing a list of the unique categories in each column
Calculating the total number of rows, columns and the percentage of empty cells in each column
Updating data types (if required) - A user can change a column's data type manually in the review section of the platform. For example, numerical columns that contain only discrete values such as 0, 1, 2 are sometimes treated as text columns by default, but they can be changed back to numeric using the dropdown in the Overview section.
Calculating the categorical frequency and numerical histogram distribution of each feature column - these are used to generate the distribution graphs and to show each feature's impact on the prediction column
Labeling each column - A column is labeled as not fit for use if any one of the following conditions holds: the column is more than 30% empty, the column has only one unique value, the column has all unique values, or the column is a text column whose number of categories is more than 2% of the total number of rows in the dataset.
The same conditions decide whether a column is not fit for prediction, except that the maximum number of categories allowed for prediction is 20.
Labeling the entire dataset as fit for prediction or not - this depends on whether the basic requirements are met when the dataset is uploaded to the platform. For example, if a dataset has fewer than 1000 rows and 5 columns, it is not fit for prediction.
Choosing the prediction column (advanced view off), or choosing the id, prediction and feature columns with the toggle buttons (advanced view on)
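As an illustration, several of the Pipeline 1 steps above can be sketched in Python with pandas. This is a minimal sketch under stated assumptions, not the platform's actual implementation: the function names standardize_types and unfit_reasons, the for_prediction flag and the reason strings are hypothetical, date/time columns are assumed to be timezone-naive, and only duplicate rows (not duplicate columns) are removed. The thresholds (30% empty, 2% of rows, 20 categories) are the ones listed above.

```python
import pandas as pd


def standardize_types(df):
    """Drop duplicate rows and cast every column to one standard type."""
    out = df.drop_duplicates().copy()
    categories = {}  # column name -> list of its unique categories
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_datetime64_any_dtype(col):
            # Date/time -> seconds since 1970-01-01 (assumes timezone-naive values).
            out[name] = (col - pd.Timestamp("1970-01-01")) / pd.Timedelta(seconds=1)
        elif pd.api.types.is_numeric_dtype(col):
            out[name] = col.astype("float32")
        else:
            out[name] = col.astype("category")
            categories[name] = list(out[name].cat.categories)
    return out, categories


def unfit_reasons(col, is_text, for_prediction=False):
    """Return the reasons (possibly none) why a column would be labeled not fit."""
    n_rows = len(col)
    n_unique = col.nunique()  # missing values are not counted
    reasons = []
    if col.isna().mean() > 0.30:
        reasons.append("more than 30% empty")
    if n_unique == 1:
        reasons.append("one unique value")
    if n_unique == n_rows:
        reasons.append("all values unique")
    # Category limit: 2% of the rows for use, a fixed 20 for prediction.
    limit = 20 if for_prediction else 0.02 * n_rows
    if is_text and n_unique > limit:
        reasons.append("too many categories")
    return reasons
```

Calling unfit_reasons on each column, once with for_prediction=False and once with for_prediction=True, gives the two labels described above; an empty list means the column passed every check. The percentage of empty cells per column can be obtained in the same spirit with df.isna().mean() * 100.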
Preprocessing Pipeline 2 consists of the following steps:
Removing outliers from feature columns
Dropping rows that contain any null/NaN/empty value, as well as columns that are more than 30% empty
Dropping columns whose number of categories is more than 2% of the number of rows in the dataset
Stripping units like $, lbs, etc.
Determining whether the minority class needs to be upsampled for classification tasks
Label encoding feature columns, replacing each category with its index
Min-max normalization of each feature column, so that every value in a column lies between 0 and 1
Splitting data into train and test sets - If the target column needs upsampling, the platform performs a stratified train-test split and upsamples only the training data; otherwise the data is shuffled and then split into train and test sets. The training data is then ready to be fit to the model.
We use the standard 80-20 train-test split of the data, where 80% of the data is used for training the models and 20% of the data is kept for testing purposes.
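The later Pipeline 2 steps can be sketched with pandas alone. Again, this is a hedged illustration rather than the platform's code: the function names are hypothetical, outlier removal is left out because the rule used for it is not stated here, and the imbalance_ratio threshold that decides when upsampling is needed is purely an assumption (the article does not give one). The 80-20 split and the "upsample only the training data" behavior follow the description above.

```python
import pandas as pd


def strip_units(col):
    """Turn strings such as "$1,200" or "15 lbs" into numbers."""
    digits = col.astype(str).str.replace(r"[^0-9.\-]", "", regex=True)
    return pd.to_numeric(digits, errors="coerce")


def encode_and_scale(features):
    """Label-encode non-numeric columns, then min-max scale each column to [0, 1]."""
    out = pd.DataFrame(index=features.index)
    for name in features.columns:
        col = features[name]
        if not pd.api.types.is_numeric_dtype(col):
            # Replace each category with its integer index.
            col = col.astype("category").cat.codes
        col = col.astype("float32")
        span = col.max() - col.min()
        # A constant column has zero span; map it to 0 instead of dividing by zero.
        out[name] = (col - col.min()) / span if span > 0 else 0.0
    return out


def split_train_test(df, target, test_frac=0.2, imbalance_ratio=0.5, seed=0):
    """80-20 split; stratify and upsample the training set if classes are imbalanced."""
    df = df.reset_index(drop=True)
    counts = df[target].value_counts()
    # Assumed rule: upsample when the smallest class is under half the largest.
    needs_upsampling = bool(counts.min() < imbalance_ratio * counts.max())
    if needs_upsampling:
        # Stratified split: draw the test fraction from each class separately.
        test = df.groupby(target).sample(frac=test_frac, random_state=seed)
        train = df.drop(test.index)
        # Upsample only the training data, up to the size of the largest class.
        largest = train[target].value_counts().max()
        train = pd.concat(
            [group.sample(n=largest, replace=True, random_state=seed)
             for _, group in train.groupby(target)]
        )
    else:
        shuffled = df.sample(frac=1.0, random_state=seed)
        n_test = int(round(len(shuffled) * test_frac))
        test, train = shuffled.iloc[:n_test], shuffled.iloc[n_test:]
    return train, test, needs_upsampling
```

Upsampling after the split, rather than before it, keeps copies of the same minority-class row from landing in both the training and the test set, which is why only the training data is resampled.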