This article introduces popular machine learning models. Machine learning is broadly divided into supervised and unsupervised learning. Supervised learning covers classification, regression, and time series forecasting.
Classification is the process of categorizing a given set of data into classes. It is a supervised learning technique in which a model learns from a labeled dataset so that it can predict the labels of future data. The classes are also known as targets, labels, or categories. The main goal of a classification problem is to identify the category a new data point falls under. A large part of machine learning deals with classification, that is, determining which group a particular observation belongs to. The ability to classify observations accurately is extremely valuable for business applications such as predicting whether a particular user will buy a product or forecasting whether a given loan will default.
Typical classification scenarios include predicting whether a customer will churn, whether a patient has a disease, or whether an email is spam.
Types of classification models
Binary classification - where the dataset has only 2 classes, as in spam detection or churn prediction
Multi-class classification - where the dataset has more than 2 classes, such as predicting T-shirt sizes (Large, Medium, Small)
Multi-label classification - where the dataset has two or more classes and one or more of them may be predicted for each example; for instance, a news article can be about sports, entertainment, and location at the same time
Imbalanced classification - where the number of instances per class is unequally distributed; for example, in a 1,000-row churn dataset there might be 900 churn datapoints and only 100 non-churn
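One common way to compensate for class imbalance like the churn example above is inverse-frequency class weighting, so the minority class counts proportionally more during training. Here is a minimal pure-Python sketch (the function name is our own; the formula matches the widely used n_samples / (n_classes * class_count) weighting):

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so the minority class receives a proportionally larger weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy churn dataset matching the article's example: 900 churn vs 100 non-churn rows
labels = ["churn"] * 900 + ["non_churn"] * 100
weights = inverse_frequency_weights(labels)
print(weights)  # churn ≈ 0.56, non_churn = 5.0
```

These weights can then be passed to most classifiers (e.g., via a class_weight parameter) so misclassifying a minority-class example is penalized more heavily.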
Why are classification problems hard?
An imbalanced dataset - the model tends to be biased towards the majority class
Scarcity of data - the model does not have enough examples to learn patterns from
Poor data quality - you may have enough data, but the data itself is not a good representation of the problem (e.g., completely unrelated feature columns)
Popular classification algorithms
Random Forest Classifier
eXtreme Gradient Boost (XGBoost) classifier
Light Gradient Boosting Machine (LightGBM) classifier
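As a quick illustration of a classification workflow with one of these algorithms, here is a minimal sketch using scikit-learn's RandomForestClassifier on a synthetic binary dataset (scikit-learn is assumed to be installed; the dataset size and parameters are arbitrary choices for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic binary dataset standing in for e.g. churn / non-churn
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)          # learn from the labeled training split
preds = model.predict(X_test)        # predict labels for unseen data
acc = accuracy_score(y_test, preds)
print("accuracy:", acc)
```

The XGBoost and LightGBM classifiers expose a very similar fit/predict interface, so swapping algorithms is usually a one-line change.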
Regression is a supervised learning method for modeling the relationship between a dependent variable (target) and one or more independent variables (features). Specifically, regression analysis helps in understanding how the value of the dependent variable changes with respect to one independent variable while the other independent variables are held constant. Regression predicts real/continuous values such as price, weight, age, temperature, or salary. An important term in regression is the residual, the difference between the actual and predicted values (i.e., the error); the main goal of regression is to find the best-fit line that minimizes this error across the observations in the data.
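The idea of a residual can be made concrete with a tiny least-squares example (pure NumPy; the numbers are made up for illustration):

```python
import numpy as np

# Toy data: e.g. salary (target) vs years of experience (feature)
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([30, 35, 41, 44, 52], dtype=float)

# Ordinary least squares fit of a straight line y = slope*x + intercept
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
residuals = y - predicted  # actual minus predicted, one per observation

# For an OLS line with an intercept, the residuals sum to (numerically) zero
print("residuals:", residuals.round(2))
print("sum:", round(float(residuals.sum()), 10))
```

The fitted line is the one that minimizes the sum of squared residuals; individual residuals are generally nonzero even though their total vanishes.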
Types of regression models
Simple regression where we have only one independent variable. Simple regression can be further divided into linear and non-linear regression analyses
Multiple regression where we have 2 or more independent variables. Multiple regression is further divided into linear and non-linear regression analyses
In linear regression, the relationship between the independent and dependent variables is described by a straight line, whereas in non-linear regression it is described by a curve.
Since regression predicts real numbers, it is highly useful for real-world problems that require numeric forecasts, such as weather conditions, marketing trends, or sales. By performing regression we can find trends in the data and determine which factors matter most, which matter least, and how the factors influence one another.
Why are regression problems hard?
Non-constant variance (heteroscedasticity) - if the spread of the residuals grows with the fitted values, prediction intervals will tend to be wider than they should be at low fitted values and narrower than they should be at high fitted values
Multicollinearity - two or more independent variables are correlated with each other (and with the dependent variable), so the dataset carries redundant information; this tends to understate the statistical significance of the affected variables
Outliers - an outlier is an observation with a large residual compared to the rest of the data, which can hurt the model's predictive performance; it is therefore important to detect outliers and handle (or remove) them
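A simple first check for multicollinearity is the pairwise correlation matrix of the features. The sketch below (pure NumPy; the feature names and synthetic data are invented for illustration) shows how a redundant column stands out:

```python
import numpy as np

rng = np.random.default_rng(0)
income = rng.normal(50, 10, size=200)          # e.g. annual income (thousands)
monthly = income / 12 + rng.normal(0, 0.01, size=200)  # redundant: income / 12
age = rng.normal(40, 5, size=200)              # unrelated feature

X = np.column_stack([income, monthly, age])
corr = np.corrcoef(X, rowvar=False)            # 3x3 correlation matrix
print(corr.round(2))
# An off-diagonal |correlation| close to 1 flags redundant columns;
# in practice one of the pair is usually dropped or combined
```

For a more rigorous diagnosis, variance inflation factors (VIF) are commonly used, but the correlation matrix catches the most obvious cases.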
Popular regression algorithms
Random forest regressor
Gradient boosting regressor
eXtreme Gradient Boost (XGBoost) regressor
Light Gradient Boosting Machine (LightGBM) regressor
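A regression workflow with one of these algorithms mirrors the classification sketch shown earlier. Here is a minimal example using scikit-learn's GradientBoostingRegressor on synthetic data (scikit-learn assumed installed; sizes and noise level are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Synthetic continuous target, e.g. standing in for a sales figure
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))  # 1.0 is a perfect fit
print("R^2:", r2)
```

The random forest, XGBoost, and LightGBM regressors follow the same fit/predict pattern.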
Time series analysis is a specific way of analyzing a sequence of observations collected over an interval of time. Depending on the frequency, a time series can be yearly (e.g., annual budget), quarterly (e.g., expenses), monthly (e.g., air traffic), weekly (e.g., sales), or daily (e.g., weather). The image below represents monthly car sales over a period of two years.
Types of time series models
Univariate - a time series consisting of a single variable recorded sequentially over equal time increments; only one variable varies over time, for example, recording the temperature of a room every hour
Multivariate - a time series with multiple time-dependent variables, such as a sensor recording the room temperature, humidity, and pressure every hour
In many cases, a multivariate time series problem can be decomposed into multiple univariate problems, and univariate models often produce more reliable results than their multivariate counterparts
Why are time series problems hard?
Dependence on time - the time component itself makes time series hard to model; the observations cannot be treated as a simple labeled dataset
In time series, observations are not mutually independent; a single chance event may affect all later observations
Presence of anomalies in the data, i.e., when the data does not follow a particular trend, has a frequency too high to model, or is unevenly spaced through time
Presence of outliers, that is, corrupt or extreme out of range values that need to be identified and handled
Lack of uniformity/continuity in the data, i.e., when there are abrupt missing values in the data
Time series components
Trend: whether observations increase or decrease over time
Seasonality: a pattern that repeats at a fixed period, e.g., observations rise, drop off, and then repeat from one period to the next
Cycles: rises and falls without a fixed period, e.g., business cycles such as recessions
Noise: random variation that gives time series plots an irregular, zigzag appearance
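These components can be teased apart with a classical additive decomposition. The sketch below (pure NumPy; the synthetic series, 12-month window, and alignment offset are illustrative assumptions) estimates the trend with a centered moving average and the seasonality by averaging the detrended values per month:

```python
import numpy as np

# Synthetic monthly series: linear trend plus a period-12 seasonal wave
t = np.arange(48)
trend = 10 + 0.5 * t
season = 5 * np.sin(2 * np.pi * t / 12)
series = trend + season

# Estimate the trend with a 12-month moving average; averaging over a
# full period cancels the seasonal component
kernel = np.ones(12) / 12
est_trend = np.convolve(series, kernel, mode="valid")

# Detrend (offset of 6 roughly centers the even-length window), then
# average per month-of-year to estimate the seasonal pattern
detrended = series[6:6 + len(est_trend)] - est_trend
seasonal_est = [detrended[m::12].mean() for m in range(12)]
print(np.round(seasonal_est, 2))
```

Whatever is left after removing the estimated trend and seasonality is the noise component; libraries such as statsmodels provide ready-made decomposition routines along these lines.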
Popular time series algorithms
Holt-Winters Exponential Smoothing (additive, multiplicative)
Double Exponential Smoothing
Autoregressive Integrated Moving Average (ARIMA)
Seasonal Autoregressive Integrated Moving Average (SARIMA)
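To give a feel for the smoothing family, here is a minimal pure-Python sketch of double exponential smoothing (Holt's linear method), which maintains a smoothed level and a smoothed trend; the function names and toy data are our own:

```python
def double_exponential_smoothing(series, alpha, beta):
    """Holt's linear (double exponential) smoothing: update a smoothed
    level and a smoothed trend at every step of the series."""
    level, trend = series[0], series[1] - series[0]
    for value in series[1:]:
        last_level = level
        level = alpha * value + (1 - alpha) * (level + trend)
        trend = beta * (level - last_level) + (1 - beta) * trend
    return level, trend

def forecast(level, trend, steps):
    """Extrapolate the final level and trend h steps ahead."""
    return [level + (h + 1) * trend for h in range(steps)]

# On a perfectly linear series the method recovers the slope exactly
sales = [100, 110, 120, 130, 140, 150]
level, trend = double_exponential_smoothing(sales, alpha=0.5, beta=0.5)
print(forecast(level, trend, 3))  # → [160.0, 170.0, 180.0]
```

Holt-Winters adds a third seasonal equation on top of this level/trend pair, while ARIMA and SARIMA model the series through autoregressive and moving-average terms instead.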
To see Obviously AI in action, check out this demo video or enroll in the No-Code AI University for free to become a certified no-code AI expert.