This article takes a detailed look at the Advanced View section of our platform, which focuses on filtering and transforming the uploaded data. After you upload a dataset and before you start predicting, you get the option to choose the prediction column; the Advanced View toggle is off by default.
Toggling Advanced View on gives you a detailed view of all the columns and of the default pre-processing steps that happen in the backend when prediction starts. This section provides more manual control over the elements of the dataset and how they are structured. We recommend sticking with the defaults, as they follow standard machine learning best practices for further cleaning your data and ensuring that it is machine learning ready.
First, check that the Identifier (ID) column is correct. By default, the platform chooses the first column in a dataset as the ID column. If your dataset's ID column is not the first column, you can drag and drop the correct ID column into place.
Similarly, check that the Prediction column is correct. By default, the platform chooses the last column in a dataset as the prediction column. As before, you can always drag and drop any column throughout these sections.
The Date Column to Expand section gives you the option to select any date column and expand it into corresponding day, month, and year columns.
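Conceptually, the expansion works like the following sketch in plain Python. The function name, the sample column name, and the date format are illustrative assumptions, not the platform's actual implementation:

```python
from datetime import datetime

def expand_date_column(rows, date_key, fmt="%Y-%m-%d"):
    """Expand a date column into separate day, month, and year columns.

    `rows` is a list of dicts; `date_key` names the date column.
    The format string is an assumption -- adjust it to match your data.
    """
    expanded = []
    for row in rows:
        parsed = datetime.strptime(row[date_key], fmt)
        new_row = dict(row)
        new_row[f"{date_key}_day"] = parsed.day
        new_row[f"{date_key}_month"] = parsed.month
        new_row[f"{date_key}_year"] = parsed.year
        expanded.append(new_row)
    return expanded
```

For example, expanding a hypothetical `order_date` column turns a value like `"2023-05-17"` into three new columns holding 17, 5, and 2023, which gives the model access to seasonal and yearly patterns.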
Next, in the “Columns to use” section we have all the feature columns (as shown in the Overview section earlier).
All columns that are not fit for use are automatically placed under the “Columns that won’t be used” section. Those columns will not be used for training and will be dropped by the platform. In general, you can drag and drop any feature column(s) that you don’t want to use for training the model.
Every column has filters associated with it, i.e., you have the flexibility to use only a certain range of values or number of rows in a column by choosing the corresponding column filter and specifying the values in the filters section.
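A numeric range filter of this kind can be sketched as follows; the function and column names are hypothetical and only illustrate the effect of a filter on the training rows:

```python
def filter_rows(rows, column, low, high):
    """Keep only rows whose value in `column` falls within [low, high].

    `rows` is a list of dicts, one per dataset row; this mirrors what a
    numeric range filter does before training (illustrative sketch only).
    """
    return [row for row in rows if low <= row[column] <= high]
```

For instance, `filter_rows(data, "price", 0, 1000)` would keep only rows whose `price` value lies between 0 and 1000 inclusive.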
Across the top of the section are toggles for basic preprocessing of the data. All of them are on by default except Auto-Impute values.
Remove outliers: helps remove unwanted spikes in the data that would badly affect prediction accuracy. For example, if a numeric column has most of its values within the range of 500 to 1000 and just a few values in the 10000s, those values (datapoints) are considered outliers.
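The platform's exact outlier rule isn't documented here; a common heuristic is the 1.5 × IQR rule, sketched below as one plausible approach:

```python
import statistics

def remove_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR].

    This is the standard interquartile-range heuristic, shown only as
    an illustration -- the platform may use a different rule.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]
```

Applied to the example above, the handful of values in the 10000s fall far outside the upper fence and are dropped, while the bulk of values between 500 and 1000 are kept.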
Normalization: ensures that all numeric columns are on the same scale. Having different scales for different numeric columns hurts prediction accuracy, since we cannot compare the contributions of feature columns that are on different scales.
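Whether the platform uses min-max or z-score scaling isn't stated; min-max scaling is shown below as one common choice that maps every numeric column onto the same [0, 1] range:

```python
def min_max_normalize(values):
    """Rescale a numeric column to the [0, 1] range (min-max scaling).

    Illustrative sketch only -- the platform may use a different scaler.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: every value maps to 0 to avoid division by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

After this step, a column of house prices in the hundreds of thousands and a column of room counts in the single digits both live on [0, 1], so neither dominates the model purely because of its units.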
Upsample/Downsample: this feature applies to classification tasks only. It is applied to the prediction column to achieve a fair distribution of all classes and reduce bias when the numbers of examples in the majority and minority classes are imbalanced; otherwise, there are too few examples for the model to learn an effective decision boundary for the minority class. We specifically perform only upsampling (also known as oversampling) of the minority class to handle this imbalance. There are two upsampling methods: Simple (the default) and SMOTE (toggled on automatically when the Simple toggle is turned off).
In the Simple method, minority-class observations are randomly duplicated with replacement until they match the number of observations in the majority class.
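Random duplication with replacement can be sketched as follows; the function name and the label key are illustrative assumptions, not the platform's internals:

```python
import random

def simple_upsample(rows, label_key, seed=0):
    """Randomly duplicate minority-class rows (with replacement) until
    every class has as many rows as the majority class.

    `rows` is a list of dicts; `label_key` names the prediction column.
    """
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sampling with replacement fills the gap up to the majority count.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced
```

Because the new rows are exact copies, this method adds no new information; it only rebalances how often the model sees each class during training.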
SMOTE (Synthetic Minority Oversampling Technique) is an advanced method in which new, synthetic datapoints are generated from existing datapoints. This data augmentation technique is designed for tabular data and can be highly effective in achieving the required balance between the target classes.
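The core idea behind SMOTE is interpolation between existing minority-class points. The simplified sketch below captures that idea but is not a faithful SMOTE implementation: real SMOTE interpolates toward one of the k nearest neighbours (as in the `SMOTE` class of the imbalanced-learn library), whereas here a random same-class point stands in for the neighbour:

```python
import random

def smote_like_sample(minority_points, n_new, seed=0):
    """Generate synthetic minority-class points by linear interpolation.

    Simplified SMOTE-style sketch: each new point lies on the segment
    between two randomly chosen existing points (real SMOTE picks the
    second point from the k nearest neighbours). Each point is a tuple
    of numeric feature values.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority_points)
        b = rng.choice(minority_points)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic
```

Unlike simple duplication, the synthetic points are new examples that fill in the region around the minority class, which can help the model learn a smoother decision boundary.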