This article describes the dataset prerequisites for univariate time series forecasting that you should check before uploading your data to Obviously AI. Meeting them ensures that your dataset is clean and machine learning ready, which ultimately leads to high-quality predictions.
Time series forecasting is an important area of machine learning because so many prediction problems involve a time component. A time series adds an explicit order dependence between observations. In univariate time series forecasting, only one variable varies over time. It is therefore crucial to tailor the dataset accordingly, so that accurate predictions can be generated within just a few minutes.
The data requirements for time series are as follows:
File size: The size limit depends on the data source. Currently, we support CSV files (size limit of 25 MB), integrations with different databases (MySQL, PostgreSQL, SQL Server, etc.), and file uploads from Dropbox (size limit of 1 GB).
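As a quick pre-upload check, the CSV limit above can be tested locally. This is only a minimal sketch: the helper names are hypothetical, and it assumes that 1 MB means 1024 * 1024 bytes, which the platform may or may not use.

```python
import os

# Assumption: 1 MB = 1024 * 1024 bytes. The platform may count differently.
CSV_LIMIT_BYTES = 25 * 1024 * 1024


def fits_csv_limit(size_bytes, limit_bytes=CSV_LIMIT_BYTES):
    """Return True if a file of size_bytes is within the CSV upload limit."""
    return size_bytes <= limit_bytes


def csv_file_fits(path):
    """Check the size of a file on disk against the CSV upload limit."""
    return fits_csv_limit(os.path.getsize(path))
```

If a file is close to the limit, it is safer to trim unused columns or rows before uploading than to rely on this check.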
Date column: The date column must be the first column in the dataset. The recommended date formats are YYYY-MM-DD or YYYY/MM/DD. There is no restriction on the column name; by default, the platform treats the first column as the date column. Most real-world time series datasets are at the daily, weekly, monthly, or quarterly level, and all of these levels are supported by our platform.
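If your raw data does not yet meet the date column requirements, a small script can fix that before upload. The sketch below uses only the Python standard library; the function names and the list of accepted input formats are assumptions made for illustration. In particular, it assumes that slash-separated dates starting with a two-digit number are month-first (MM/DD/YYYY), so adjust the list to match your data.

```python
from datetime import datetime

# Input formats to try, in order. These are assumptions about the raw data.
KNOWN_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%m/%d/%Y")


def to_iso_date(text):
    """Convert a date string in one of KNOWN_FORMATS to YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date: {text!r}")


def date_column_first(rows, date_col):
    """Reorder columns so date_col comes first and normalize its values.

    rows is a list of dicts (for example, from csv.DictReader).
    Returns a (header, body) pair, where body is a list of row lists.
    """
    if not rows:
        return [], []
    others = [col for col in rows[0] if col != date_col]
    header = [date_col] + others
    body = [[to_iso_date(r[date_col])] + [r[c] for c in others] for r in rows]
    return header, body
```

The returned header and body can be written back to a CSV file with csv.writer.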
Prediction column: The dataset needs one time-dependent variable to serve as the prediction column. In general, it is recommended to have only the date and prediction columns in the dataset. However, if the dataset has other columns, the platform lets you choose which column to use for time series prediction. This is also helpful when there are multiple time-dependent variables, since each of them can be used as the prediction column in turn, if required.
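To follow the recommendation of keeping only the date and prediction columns, you can trim a wider CSV before uploading it. This is a minimal sketch using the standard library; the function name and the column names in the usage example are hypothetical.

```python
import csv
import io


def keep_two_columns(csv_text, date_col, target_col):
    """Return CSV text containing only the date and prediction columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([date_col, target_col])  # date column goes first
    for row in reader:
        writer.writerow([row[date_col], row[target_col]])
    return out.getvalue()


# Usage: a dataset with extra columns, reduced to date and sales.
raw = "store,date,sales,visits\nA,2023-01-01,10,100\nA,2023-01-02,12,90\n"
trimmed = keep_two_columns(raw, "date", "sales")
```

Calling the function once per time-dependent variable produces one two-column file for each candidate prediction column.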
Additionally, you need to choose the data level, aggregate function, and seasonality of your data. For example, if your data is at the daily level, choose “Day” as the Data Level, and the “Seasonality” is automatically set to 7 (it can be changed to 30 or 365; 365 is recommended). Similarly, for weekly, monthly, or quarterly levels, the seasonality is automatically set to the corresponding value.
The aggregate function determines whether the data is summed or averaged over the specified data level. You can choose either sum or mean from the dropdown (sum is the default).
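To see what the choice between sum and mean does to the data, the following sketch rolls daily observations up to the monthly level. It only illustrates the idea and is not the platform's implementation; the function name and the sample values are made up for the example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def aggregate_monthly(points, how="sum"):
    """Aggregate (date, value) pairs to one value per (year, month)."""
    if how not in ("sum", "mean"):
        raise ValueError("how must be 'sum' or 'mean'")
    buckets = defaultdict(list)
    for day, value in points:
        buckets[(day.year, day.month)].append(value)
    func = sum if how == "sum" else mean
    return {month: func(values) for month, values in sorted(buckets.items())}


# Usage: three daily observations spread over two months.
points = [(date(2023, 1, 1), 10), (date(2023, 1, 2), 20), (date(2023, 2, 1), 5)]
monthly_sum = aggregate_monthly(points, "sum")
monthly_mean = aggregate_monthly(points, "mean")
```

Sum tends to suit quantities that accumulate, such as sales or counts, while mean tends to suit levels, such as prices or temperatures.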
Data should have as few missing/empty values as possible: Ideally, the dataset should have few or no missing data points. This keeps the prediction results robust and accurate. There are two types of missing data:
Missing values: This means that there is no corresponding value in the prediction column for a particular date. Data points with missing values are currently dropped by the platform.
Gaps in data: This means that entire periods are absent from the dataset, for example, monthly sales data in which some months are missing entirely. Currently, the platform does not impute gaps in the data; it simply does not consider them.
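Both kinds of missing data can be detected before upload. The sketch below is one possible way to do it for monthly data using the standard library; the function names are hypothetical, and it assumes that an empty string or None marks a missing value.

```python
from datetime import date


def missing_values(rows):
    """Return the dates whose prediction value is empty or None."""
    return [d for d, v in rows if v is None or str(v).strip() == ""]


def monthly_gaps(dates):
    """Return the (year, month) pairs absent between the first and last month."""
    present = {(d.year, d.month) for d in dates}
    if not present:
        return []
    (year, month), last = min(present), max(present)
    gaps = []
    while (year, month) < last:
        if (year, month) not in present:
            gaps.append((year, month))
        # Step to the next calendar month, rolling over the year in December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return gaps


# Usage: December has no value, and January and February are absent entirely.
rows = [(date(2022, 11, 1), 5), (date(2022, 12, 1), ""), (date(2023, 3, 1), 7)]
empty_dates = missing_values(rows)
gap_months = monthly_gaps([d for d, _ in rows])
```

Knowing where the missing values and gaps are lets you decide whether to fill them at the source before uploading.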