This article describes the prerequisites a dataset must meet before it is uploaded to Obviously AI for AutoML predictions. The platform enforces these requirements to ensure that the dataset is clean and machine learning ready, which improves the quality of the generated predictions.
In general, the bigger the dataset, the better the prediction results: with more examples, the models can learn more of the patterns present in the data and are better equipped to generate meaningful and accurate predictions.
Structuring the data in the correct format is therefore an important step before making any predictions. In this article, we discuss best practices and dataset requirements for achieving high quality data, so that you can start building cutting edge machine learning models in a few clicks.
The must-have (basic) dataset requirements are as follows:
File size: The file size limit depends on how the data is stored and which data source is used to upload it to the platform. For example, a CSV file must be no larger than 25 MB.
Dataset has at least 1000 rows: This ensures that there are enough rows of historical data to generate meaningful predictions. In general, the more rows the better, but the dataset needs at least 1000 rows.
Dataset has minimum 5 and maximum 100 columns: This ensures there are enough columns to generate meaningful predictions. Currently, our platform allows a minimum of 5 and a maximum of 100 columns in a dataset. Most real world datasets have fewer than 100 columns, so 100 was chosen as a reasonable upper limit.
First row of the dataset must be column names: The feature/column names must appear as the first row of the dataset so that the data in each column can be identified correctly.
First column of the dataset must be an identifier column: It is important to uniquely identify each row of the dataset to separate one customer/datapoint from another, for example by name, customer/user ID, etc. If the dataset does not already contain an ID column, one must be added (preferably as the first column). An ID column can be as simple as a column with the numbers 1, 2, 3, 4, ...
Data should be aggregated into a single file or table: In case data is present across multiple files/tables, it must be merged into one file/table and then uploaded into the platform so that all the required feature columns are present in the training data.
Data should have as few missing values as possible: Even one missing value in a row causes the entire row to be dropped from the dataset. In addition, columns that are >30% empty will be labelled as not fit for use and automatically dropped by the platform. Missing values can be filled with 0/-1 (for numeric columns) and ‘Unknown’ (for text columns).
No Personally Identifiable Information (PII): Personal identifiers such as email, phone, address, etc. are not useful for prediction, since they are different for every customer/user. Similarly, any text column in which every value is unique will be labelled as not fit for use and automatically dropped by the platform. It is therefore best to remove such columns from the data beforehand.
No long text phrases, use discrete values instead: Text columns containing long text, feedback, comments, etc. are not useful for prediction. Use discrete values instead of long phrases. Discrete values are values drawn from a fixed set of options, for example, Churn (Yes or No), Flight status (on-time, delayed, cancelled), etc.
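Several of the basic requirements above can be checked locally before uploading. The following is a minimal sketch using only the Python standard library; it is not the platform's own validation. The function name `check_basic_requirements` is hypothetical, the thresholds are copied from the list above, and reading "25 MB" as 25 * 1024 * 1024 bytes is an assumption.

```python
import csv
import io

# Thresholds taken from the requirements listed above.
MIN_ROWS = 1000
MIN_COLUMNS = 5
MAX_COLUMNS = 100
MAX_EMPTY_FRACTION = 0.30
MAX_CSV_BYTES = 25 * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes


def check_basic_requirements(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    if len(csv_text.encode("utf-8")) > MAX_CSV_BYTES:
        problems.append("file is larger than 25 MB")

    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # the first row must hold the column names
    rows = [row for row in reader if row]  # skip completely blank lines

    if len(rows) < MIN_ROWS:
        problems.append(f"only {len(rows)} rows, need at least {MIN_ROWS}")
    if not MIN_COLUMNS <= len(header) <= MAX_COLUMNS:
        problems.append(
            f"{len(header)} columns, need {MIN_COLUMNS} to {MAX_COLUMNS}"
        )

    for index, name in enumerate(header):
        # Treat short rows as having an empty cell in the missing position.
        values = [row[index].strip() if index < len(row) else "" for row in rows]
        if not values:
            continue
        empty_fraction = values.count("") / len(values)
        if empty_fraction > MAX_EMPTY_FRACTION:
            problems.append(f"column '{name}' is {empty_fraction:.0%} empty")
        # The first column is expected to be a unique identifier.
        if index == 0 and len(set(values)) != len(values):
            problems.append(f"identifier column '{name}' has duplicate values")

    return problems
```

An empty list means none of these particular checks failed; it does not guarantee that the platform will accept the file.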
Some of the advanced dataset requirements are detailed below:
Remove highly correlated columns: Columns that are similar to each other, such as Revenue and Profits, do not provide any new information. We recommend keeping only one of them, preferably the one that you think is more closely related to the prediction column. Currently, the platform automatically selects columns that have >80% correlation and gives you the option to keep both columns or only one of them.
Remove outliers: The presence of outliers adversely affects prediction performance. You can look at the distribution graph for each column on the platform (under “My Datasets”) to identify outliers, and then remove them using the corresponding column filter in the “Advanced View” section. For example, as shown on the screenshot here, the 5 datapoints (3+1+1) constitute outliers.
Date column format: All date columns in a dataset should follow the standard YYYY-MM-DD or YYYY/MM/DD format.
Strip units and special characters from columns: Avoid using special (non-English) characters such as ä, é, etc. in column names. Also remove units such as lbs, %, $, etc. from feature columns.
Derive additional columns from a single column: Sometimes it is helpful to derive additional columns from a particular column to improve the prediction accuracy. The original column then needs to be removed, so that only the derived columns remain and there is no redundant information.
For example, if there are two columns for transaction start and end dates, a new column can be derived that has the number of days the transaction was active. Currently, Day, Year and Month can be derived from the Date column on our platform.
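The start/end date example can be sketched in a few lines of Python. This is only an illustration of the idea, assuming dates are already in YYYY-MM-DD format and each row is a dictionary; the column names `transaction_start`, `transaction_end` and `days_active` are hypothetical.

```python
from datetime import date


def days_between(start, end):
    """Number of days between two YYYY-MM-DD date strings."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


def add_duration_column(rows, start_col, end_col, new_col):
    """Replace the two date columns with one derived duration column."""
    derived = []
    for row in rows:
        # Drop the original date columns so no redundant information remains.
        new_row = {k: v for k, v in row.items() if k not in (start_col, end_col)}
        new_row[new_col] = days_between(row[start_col], row[end_col])
        derived.append(new_row)
    return derived


rows = [
    {"id": "1", "transaction_start": "2023-01-01", "transaction_end": "2023-01-31"},
    {"id": "2", "transaction_start": "2023-02-10", "transaction_end": "2023-02-12"},
]
result = add_duration_column(
    rows, "transaction_start", "transaction_end", "days_active"
)
```

Removing the two original columns follows the recommendation above that only the derived column should remain.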
Create additional columns from comma separated values: Comma separated values in a column must be split into individual values; otherwise the system will treat them as a single piece of text.
For such scenarios, expand the values to individual columns of yes/no values or 0/1 numbers as shown in the example below.
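One way to do this expansion outside the platform is sketched here in plain Python. It assumes each row is a dictionary and produces one 0/1 column per distinct value; the function name `expand_multi_value_column`, the column name `addons` and the sample values are hypothetical.

```python
def expand_multi_value_column(rows, column, sep=","):
    """Replace `column` with one 0/1 column per distinct value it contains."""
    # Split each cell into a set of trimmed, non-empty values.
    split_values = [
        {part.strip() for part in row[column].split(sep) if part.strip()}
        for row in rows
    ]
    # Every distinct value seen in the column becomes its own new column.
    categories = sorted(set().union(*split_values))

    expanded = []
    for row, present in zip(rows, split_values):
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if category in present else 0
        expanded.append(new_row)
    return expanded


rows = [
    {"id": "1", "addons": "wifi, tv"},
    {"id": "2", "addons": "tv"},
    {"id": "3", "addons": ""},
]
expanded = expand_multi_value_column(rows, "addons")
```

In this sketch the original `addons` column is dropped after expansion, so the text column does not remain alongside the new 0/1 columns.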
Maintain a balance among different labels in the prediction column: For highly imbalanced data (applicable for classification tasks only), the platform automatically oversamples the minority class. We recommend maintaining a fair balance of all labels in the prediction column if possible so that we already have a balanced dataset to start with.
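Two of the advanced checks above, correlated columns and label balance, can also be inspected before upload. The sketch below is a rough approximation, not the platform's method: it assumes that ">80% correlation" means an absolute Pearson correlation above 0.8, and the function names and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; report 0
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.8):
    """Return pairs of column names whose absolute correlation exceeds threshold.

    `columns` maps a column name to its list of numeric values.
    """
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]


def label_shares(labels):
    """Fraction of rows carrying each label in the prediction column."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}


columns = {
    "revenue": [10, 20, 30, 40],
    "profit": [1, 2, 3, 4.5],
    "age": [30, 25, 40, 31],
}
pairs = highly_correlated_pairs(columns)
shares = label_shares(["yes", "no", "no", "no"])
```

A flagged pair is a candidate for keeping only one column, and a label share far below the others suggests that the prediction column is imbalanced.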