This article details how you may prepare your data before uploading it to the Obviously AI platform.
If you're an Obviously AI Pro customer, you get access to a Dedicated Data Scientist who will do data prep for you, including: Merging, Enrichment, Statistical Work and Adding Business Logic. To explore our offerings, request a demo today.
Structuring your data in the correct format before making any predictions is an important step toward high model performance. Below are some best practices for prepping your data, so you can achieve high-quality data and start building cutting-edge machine learning models in a few simple clicks.
Data Prep for AutoML
These requirements must be met to get the best ROI.
Dataset size: at least 1000 rows and 5 columns
First row must be column names
At least one identifier column (e.g., customer id, name, etc.)
Data must be aggregated into one single file or table
As few missing or empty values as possible
Personally Identifiable Information (PII) columns (e.g., phone, email, address) are not required
Categorize long phrases to discrete values (e.g., Flight status: on-time, delayed or canceled; Churn: yes or no)
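A quick pre-upload check of the size, header, and missing-value requirements above could look like the following. This is a minimal sketch using only the Python standard library; the function name `check_basic_requirements` and the sample data are hypothetical, and it is not the platform's own validation.

```python
import csv
import io

def check_basic_requirements(csv_text, min_rows=1000, min_cols=5):
    """Return a list of human-readable problems found in the CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])      # first row must hold the column names
    rows = list(reader)
    problems = []
    if len(rows) < min_rows:
        problems.append(f"only {len(rows)} rows, need at least {min_rows}")
    if len(header) < min_cols:
        problems.append(f"only {len(header)} columns, need at least {min_cols}")
    if any(not name.strip() for name in header):
        problems.append("header row contains an empty column name")
    # Count cells that are empty or contain only whitespace.
    empty = sum(1 for row in rows for cell in row if not cell.strip())
    if empty:
        problems.append(f"{empty} empty values found")
    return problems

# Hypothetical two-row sample with one missing age.
sample = "customer_id,plan,age,country,churn\n1,basic,34,US,no\n2,pro,,DE,yes\n"
problems = check_basic_requirements(sample)
```

For this tiny sample, `problems` reports that there are too few rows and that one value is empty.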
These requirements are nice to have and will further enhance your model.
A single missing value will cause the entire row to be dropped
For numeric columns (price, salary, age, etc.) impute data with 0 or -1
For text/categorical columns (gender, country, etc.) impute data with “Unknown”
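The imputation rules above can be sketched in a few lines of Python. This sketch assumes rows are dictionaries of strings, as `csv.DictReader` would produce; the function name `impute_missing` and the column names are hypothetical.

```python
def impute_missing(rows, numeric_cols, text_cols, numeric_fill="0", text_fill="Unknown"):
    """Return new rows with empty cells replaced by a placeholder value."""
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows stay untouched
        for col in numeric_cols:
            if not (new_row.get(col) or "").strip():
                new_row[col] = numeric_fill
        for col in text_cols:
            if not (new_row.get(col) or "").strip():
                new_row[col] = text_fill
        cleaned.append(new_row)
    return cleaned

# Hypothetical rows: one missing age, one missing country.
rows = [
    {"customer_id": "1", "age": "34", "country": "US"},
    {"customer_id": "2", "age": "", "country": "DE"},
    {"customer_id": "3", "age": "51", "country": ""},
]
filled = impute_missing(rows, numeric_cols=["age"], text_cols=["country"])
```

Pass `numeric_fill="-1"` instead if 0 is a meaningful value in the column, so that imputed cells remain distinguishable from real ones.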
Outliers are those instances in your data that appear rarely, i.e., have very few rows
For example, in a 1000-row dataset, a categorical color column with the values red, blue, and green might have 900, 97, and 3 instances respectively. Green is then an outlier and will not be considered when training the model
Outliers are managed automatically on the platform
If possible, maintain at least 10 instances of each class in a categorical column to avoid losing data
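To see which classes fall below the suggested 10 instances before uploading, you could count the values in each categorical column. The sketch below reproduces the color example; `rare_classes` is a hypothetical helper, not a platform feature.

```python
from collections import Counter

def rare_classes(rows, column, min_count=10):
    """Return the classes in `column` that appear fewer than `min_count` times."""
    counts = Counter(row[column] for row in rows)
    return {value: n for value, n in counts.items() if n < min_count}

# The 1000-row color example: 900 red, 97 blue, 3 green.
rows = [{"color": "red"}] * 900 + [{"color": "blue"}] * 97 + [{"color": "green"}] * 3
rare = rare_classes(rows, "color")
```

Here `rare` contains only green, with its count of 3.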
All dates must be of the standard YYYY-MM-DD or YYYY/MM/DD format
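If your dates arrive in another layout, they can be normalized before upload. The sketch below is one possible approach: the list of accepted input formats is an assumption (including reading `03/05/2021` as month first), so adjust it to match your own data.

```python
from datetime import datetime

# Assumed input formats; extend or reorder this list to match your data.
# "%m/%d/%Y" treats 03/05/2021 as March 5 (US-style, month first).
KNOWN_FORMATS = ["%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y"]

def to_iso_date(value):
    """Return the date as YYYY-MM-DD, or raise ValueError if no format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    raise ValueError(f"unrecognized date: {value!r}")

dates = [to_iso_date(d) for d in ["2021-03-05", "2021/03/05", "03/05/2021"]]
```

All three inputs are converted to the same `2021-03-05` string.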
Managing special characters:
Remove units and special characters: 1st > 1, $100 > 100, 10 lbs > 10, etc.
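Stripping units and symbols can be automated with a regular expression. This is a rough sketch with stated assumptions: commas are thousands separators, values are non-negative, and the first number in the text is the one you want. `to_number` is a hypothetical helper.

```python
import re

def to_number(text):
    """Extract the first number from text such as '$100', '10 lbs' or '1st'."""
    # Assumption: commas are thousands separators, so drop them first.
    match = re.search(r"\d+(?:\.\d+)?", text.replace(",", ""))
    if match is None:
        return None  # no digits found at all
    value = float(match.group())
    return int(value) if value.is_integer() else value

cleaned = [to_number(v) for v in ["1st", "$100", "10 lbs", "$1,200.50"]]
```

The first three inputs become the integers 1, 100, and 10, and the last becomes 1200.5.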
These need to be set up by someone slightly technical. If you need help, ping your dedicated data scientist support.
Create new columns:
The quality of your dataset is often enhanced by deriving new columns from existing columns
For example, deriving age from date of birth, deriving duration from start and end dates of customer subscription or employment period, etc.
Once new columns are created, the original columns they were derived from should be excluded from training, since they carry redundant information
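The two derivations mentioned above, age from date of birth and duration from start and end dates, can be sketched as follows. The column names and the fixed reference day are assumptions made for illustration; in practice you would likely use the date the data was exported.

```python
from datetime import date

def age_in_years(date_of_birth, today):
    """Whole years elapsed between date_of_birth and today."""
    had_birthday = (today.month, today.day) >= (date_of_birth.month, date_of_birth.day)
    return today.year - date_of_birth.year - (0 if had_birthday else 1)

def duration_in_days(start, end):
    """Number of days between two dates."""
    return (end - start).days

# Hypothetical row with dates already in YYYY-MM-DD format.
row = {"date_of_birth": "1990-06-15", "start_date": "2021-01-01", "end_date": "2021-03-01"}
reference_day = date(2024, 1, 1)  # fixed so the example is reproducible

derived = {
    "age": age_in_years(date.fromisoformat(row["date_of_birth"]), reference_day),
    "subscription_days": duration_in_days(
        date.fromisoformat(row["start_date"]), date.fromisoformat(row["end_date"])
    ),
}
```

The derived `age` and `subscription_days` values would then replace the three original date columns in the training data.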
Create additional columns from comma separated values:
Columns with values in comma-separated format are treated as one long piece of text instead of as different values
For example, “Google, Apple, Facebook” are 3 separate values but will be treated as one
Separate columns can be created for each value and populated with 0/1 depending on their existence for a particular row
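Splitting a comma-separated column into 0/1 columns could be done along these lines. This is a sketch only: the function name `split_to_indicator_columns`, the `employers` column, and the `employer_` prefix are all hypothetical.

```python
def split_to_indicator_columns(rows, column, prefix):
    """Replace a comma-separated column with one 0/1 column per distinct value."""
    parsed = [
        [item.strip() for item in row[column].split(",") if item.strip()]
        for row in rows
    ]
    # Every distinct value seen anywhere in the column becomes its own column.
    all_values = sorted({item for items in parsed for item in items})
    result = []
    for row, items in zip(rows, parsed):
        new_row = {k: v for k, v in row.items() if k != column}
        for value in all_values:
            new_row[f"{prefix}_{value}"] = 1 if value in items else 0
        result.append(new_row)
    return result

rows = [
    {"id": "1", "employers": "Google, Apple, Facebook"},
    {"id": "2", "employers": "Apple"},
]
expanded = split_to_indicator_columns(rows, "employers", prefix="employer")
```

Each output row drops the original `employers` column and gains `employer_Apple`, `employer_Facebook`, and `employer_Google`, set to 1 where the value was present and 0 otherwise.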
When the prediction column of your data does not have enough data points for every class it contains, the dataset is considered imbalanced
This applies to classification tasks only
Imbalanced datasets induce bias in the prediction results
Our platform automatically handles data imbalance, but it is highly recommended to have well-balanced data
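A simple way to gauge balance before uploading is to compute each class's share of the prediction column. The sketch below uses a hypothetical `churn` column; it only reports the shares and does not apply any threshold, since what counts as acceptable depends on your use case.

```python
from collections import Counter

def class_shares(rows, target):
    """Return each class's fraction of the rows in the prediction column."""
    counts = Counter(row[target] for row in rows)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical churn data: 950 "no" rows and 50 "yes" rows.
rows = [{"churn": "no"}] * 950 + [{"churn": "yes"}] * 50
shares = class_shares(rows, "churn")
minority = min(shares, key=shares.get)  # the least represented class
```

In this example the minority class is `yes`, with 5% of the rows.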
Data prep for Time Series
Time series forecasting is an important area of machine learning because so many prediction problems involve a time component. A time series adds an explicit order dependence between observations. In univariate time series forecasting, only one variable varies over time. It is therefore crucial to tailor the dataset accordingly to generate accurate predictions.
Time series data must meet the following basic requirements:
Dataset must have at least one date column
Dates must be sequential at a regular interval, such as hourly, daily, or monthly
Dates must be formatted as YYYY-MM-DD or YYYY/MM/DD
The dataset can have multiple columns, but the prediction column must be numeric
For example, tracking car sales over each month
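The time series requirements above can be checked before upload with a short script. This is a minimal sketch under stated assumptions: dates are already in YYYY-MM-DD or YYYY/MM/DD format, and the function and column names (`check_time_series`, `month`, `cars_sold`) are hypothetical. It checks ordering, duplicate dates, and that the prediction column is numeric, but not whether the intervals are evenly spaced.

```python
from datetime import date

def check_time_series(rows, date_col, target_col):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    # Accept both YYYY-MM-DD and YYYY/MM/DD by normalizing the separator.
    dates = [date.fromisoformat(row[date_col].replace("/", "-")) for row in rows]
    if dates != sorted(dates):
        problems.append("dates are not in chronological order")
    if len(set(dates)) != len(dates):
        problems.append("duplicate dates found")
    for row in rows:
        try:
            float(row[target_col])  # the prediction column must be numeric
        except ValueError:
            problems.append(f"non-numeric value {row[target_col]!r} on {row[date_col]}")
    return problems

# Hypothetical monthly car sales with one bad value.
rows = [
    {"month": "2023-01-01", "cars_sold": "120"},
    {"month": "2023-02-01", "cars_sold": "135"},
    {"month": "2023-03-01", "cars_sold": "n/a"},
]
issues = check_time_series(rows, "month", "cars_sold")
```

Here `issues` flags the `n/a` entry for March, which would need to be corrected or imputed before forecasting.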