Preparing your data for AI
Written by Obviously AI
Updated over a week ago

This article explains how to prepare your data before uploading it to the Obviously AI platform.

If you're an Obviously AI Pro customer, you get access to a Dedicated Data Scientist who will do data prep for you, including merging, enrichment, statistical work, and adding business logic. To explore our offerings, request a demo today.

Structuring your data in the correct format before you make any predictions is an important step toward high model performance. The best practices below help you produce high-quality data so that you can start building machine learning models in a few clicks.

Data Prep for AutoML

Basic requirements

These requirements must be met to get the best ROI.

  • Dataset size: at least 1000 rows and 5 columns

  • First row must be column names

  • At least one identifier column (e.g., customer id, name, etc.)

  • Data must be aggregated into one single file or table

  • As few missing or empty values as possible

  • Personally Identifiable Information (PII) columns (e.g., phone, email, address) are not required

  • Categorize long phrases into discrete values (e.g., Flight status: on-time, delayed or canceled; Churn: yes or no)
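
The checklist above can be verified before uploading. The following is a minimal sketch, assuming the data has been loaded into a pandas DataFrame; pandas is not part of the Obviously AI platform, and the function name `check_basic_requirements`, the column names, and the sample values are hypothetical.

```python
import pandas as pd


def check_basic_requirements(df, id_column, min_rows=1000, min_cols=5):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    n_rows, n_cols = df.shape
    if n_rows < min_rows:
        problems.append(f"only {n_rows} rows (need at least {min_rows})")
    if n_cols < min_cols:
        problems.append(f"only {n_cols} columns (need at least {min_cols})")
    if id_column not in df.columns:
        problems.append(f"identifier column '{id_column}' not found")
    # Report the share of missing cells per column so gaps can be fixed first.
    missing_share = df.isna().mean()
    for col, share in missing_share[missing_share > 0].items():
        problems.append(f"column '{col}' has {share:.0%} missing values")
    return problems


# Tiny illustrative table; a real dataset would be read with pd.read_csv(...).
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "plan": ["basic", "pro", None, "pro"],
    "churn": ["yes", "no", "no", "yes"],
})
problems = check_basic_requirements(df, id_column="customer_id")
for problem in problems:
    print(problem)
```

On this deliberately small table the check reports too few rows, too few columns, and the missing value in `plan`.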

Intermediate requirements

These requirements are nice to have; they will further enhance your model.

  • Data Imputation

    1. A single missing value will cause the entire row to be dropped

    2. For numeric columns (price, salary, age, etc.) impute data with 0 or -1

    3. For text/categorical columns (gender, country, etc.) impute data with “Unknown”

  • Outlier Management:

    1. Outliers are values that appear rarely in your data, i.e., in very few rows

    2. For example, in a 1000-row dataset, a categorical color column with the values red, blue, and green might have 900, 97, and 3 instances respectively. Green is then an outlier and will not be used for training

    3. Outliers are managed automatically on the platform

    4. If possible, maintain at least 10 instances of each class in a categorical column to avoid losing data

  • Date Format

    1. All dates must use the standard YYYY-MM-DD or YYYY/MM/DD format

  • Managing special characters:

    1. Remove units and special characters: 1st > 1, $100 > 100, 10 lbs > 10, etc.
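
The intermediate steps above can be sketched in a few lines of pandas. This is an illustration under stated assumptions, not the platform's own processing: the column names, the sample values, the MM/DD/YYYY source date format, and the helper `rare_classes` are all hypothetical.

```python
import pandas as pd

# Hypothetical raw data; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "price": ["$100", "$250", None],
    "weight": ["10 lbs", None, "7.5 lbs"],
    "country": ["US", None, "DE"],
    "signup_date": ["03/15/2023", "04/01/2023", "12/31/2023"],
})

# Special characters: keep only digits, the decimal point, and a minus sign,
# then convert to numbers ("$100" -> 100.0, "10 lbs" -> 10.0).
for col in ["price", "weight"]:
    digits_only = df[col].str.replace(r"[^0-9.\-]", "", regex=True)
    df[col] = pd.to_numeric(digits_only, errors="coerce")
    # Imputation: fill numeric gaps with -1 so the row is not dropped.
    df[col] = df[col].fillna(-1)

# Imputation: fill categorical gaps with "Unknown".
df["country"] = df["country"].fillna("Unknown")

# Date format: rewrite MM/DD/YYYY (the assumed source format) as YYYY-MM-DD.
parsed = pd.to_datetime(df["signup_date"], format="%m/%d/%Y")
df["signup_date"] = parsed.dt.strftime("%Y-%m-%d")


def rare_classes(column, min_count=10):
    """Return the values of a categorical column seen fewer than min_count times."""
    counts = column.value_counts()
    return counts[counts < min_count].index.tolist()


# The color example from the text: 900 red, 97 blue, 3 green.
colors = pd.Series(["red"] * 900 + ["blue"] * 97 + ["green"] * 3)
print(rare_classes(colors))
print(df)
```

The default `min_count=10` mirrors the recommendation to keep at least 10 instances of each class; the platform's actual outlier threshold is not stated in this article.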

Advanced requirements

These need to be set up by someone slightly technical. If you need help, contact your dedicated data scientist.

Data Enrichment

  • Create new columns:

    1. The quality of your dataset is often enhanced by deriving new columns from existing columns

    2. For example, deriving age from date of birth, deriving duration from start and end dates of customer subscription or employment period, etc.

    3. Once new columns are created, exclude the original columns from training, since they now carry redundant information

  • Create additional columns from comma-separated values:

    1. A column whose values are in comma-separated format is treated as one long piece of text instead of as separate values

    2. For example, “Google, Apple, Facebook” are 3 separate values but will be treated as one

    3. Separate columns can be created for each value and populated with 0/1 depending on whether the value is present in a particular row
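
Both enrichment steps can be sketched with pandas. The column names (`start_date`, `end_date`, `vendors`) and the `vendor_` prefix are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical subscription data; column names are illustrative only.
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "start_date": ["2022-01-01", "2022-06-15", "2023-03-01"],
    "end_date": ["2022-12-31", "2023-06-15", "2023-03-31"],
    "vendors": ["Google, Apple, Facebook", "Apple", "Google, Facebook"],
})

# Derive a new column: subscription duration in days from start and end dates.
start = pd.to_datetime(df["start_date"], format="%Y-%m-%d")
end = pd.to_datetime(df["end_date"], format="%Y-%m-%d")
df["subscription_days"] = (end - start).dt.days

# Split comma-separated values into one 0/1 column per distinct value.
# Spaces around commas are removed first so "Apple" and " Apple" match.
normalized = df["vendors"].str.replace(r"\s*,\s*", ",", regex=True)
flags = normalized.str.get_dummies(sep=",").astype(int).add_prefix("vendor_")
df = pd.concat([df, flags], axis=1)

# Drop the source columns, which are now redundant for training.
df = df.drop(columns=["start_date", "end_date", "vendors"])
print(df)
```

Deriving age from a date of birth follows the same pattern as the duration column, with today's date in place of the end date.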

Data imbalance

  • When the prediction column does not have enough data points for every class it contains, the dataset is considered imbalanced

  • This applies to classification tasks only

  • Imbalanced datasets induce bias in the prediction results

  • Our platform handles data imbalance automatically, but well-balanced data is highly recommended
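
Class balance in the prediction column is easy to inspect before uploading. The sketch below uses pandas; the function `class_balance` and its `min_share=0.2` cutoff are assumptions made for illustration, since the article does not state the threshold at which the platform treats data as imbalanced.

```python
import pandas as pd


def class_balance(target, min_share=0.2):
    """Summarize row count and share per class, flagging classes below min_share."""
    counts = target.value_counts()
    report = pd.DataFrame({"rows": counts, "share": counts / counts.sum()})
    # min_share is an illustrative cutoff, not a documented platform setting.
    report["under_represented"] = report["share"] < min_share
    return report


# Hypothetical churn column: 90 "no" rows and 10 "yes" rows.
churn = pd.Series(["no"] * 90 + ["yes"] * 10, name="churn")
report = class_balance(churn)
print(report)
```

Here the "yes" class makes up 10% of the rows and is flagged, which suggests collecting more examples of that class if possible.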

Data prep for Time Series

  • Time series forecasting is an important area of machine learning because there are so many prediction problems that involve a time component. Time series adds an explicit order dependence between observations. In univariate time series forecasting, only one variable varies over time. Thus, it is crucial that the dataset is tailored accordingly to generate accurate predictions

  • Time series data must meet the following basic requirements:

    1. Date Format:

      1. Dataset must have at least one date column

      2. Dates must be sequential, such as hourly, daily, or monthly

      3. Dates must be formatted as YYYY-MM-DD or YYYY/MM/DD

    2. Prediction column:

      1. The dataset can have multiple columns, but the prediction column must be numeric

      2. For example, car sales tracked each month
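
The time series requirements above can be checked with a short pandas sketch. It reads "sequential" as "evenly spaced", which is an interpretation rather than something the article states, and the function `check_time_series` and the column names are hypothetical.

```python
import pandas as pd


def check_time_series(df, date_column, target_column):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    # Accept both YYYY-MM-DD and YYYY/MM/DD by normalizing the separator first.
    raw = df[date_column].astype(str).str.replace("/", "-", regex=False)
    dates = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
    if dates.isna().any():
        problems.append("some dates are not in YYYY-MM-DD or YYYY/MM/DD format")
    elif not dates.is_monotonic_increasing:
        problems.append("dates are not in ascending order")
    elif len(dates) >= 3 and pd.infer_freq(pd.DatetimeIndex(dates)) is None:
        # infer_freq needs at least three dates and returns None when no
        # regular frequency (hourly, daily, monthly, ...) fits the data.
        problems.append("dates do not follow a regular interval")
    if not pd.api.types.is_numeric_dtype(df[target_column]):
        problems.append("prediction column is not numeric")
    return problems


# Monthly car sales, as in the example from the text (values are made up).
good = pd.DataFrame({
    "month": ["2023-01-01", "2023-02-01", "2023-03-01", "2023-04-01"],
    "car_sales": [120, 135, 128, 150],
})
# Same shape, but with an irregular date and a non-numeric sales column.
bad = pd.DataFrame({
    "month": ["2023-01-01", "2023-02-01", "2023-02-17", "2023-04-01"],
    "car_sales": ["120", "135", "n/a", "150"],
})
print(check_time_series(good, "month", "car_sales"))
print(check_time_series(bad, "month", "car_sales"))
```

The first table passes every check, while the second is reported for its irregular dates and its text-valued prediction column.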

To see Obviously AI in action, check out this demo video or enroll in the No-Code AI University for free to become a certified no-code AI expert.
