Apteo's platform allows you to automatically train machine learning models on your data and use those models to create predictions. You can create multiple different models, store them, and serve them, all from your Apteo workspace.
While you don't need any data science expertise to use the platform's predictive tools, it helps to understand how your data is expected to be structured and what you need to do to create predictions once your models are created. This document walks through how to use the predictor and offers suggestions on how best to structure your data to make the most of your models.
Structuring Your Data To Optimize Your Results
By default, the system will use every column in every record (outside of your objective column) to train your models. The system will perform some basic transformations on your data to ensure it's structured in a way that it can understand. For example, it might turn the values stored in a text column into numbers, or it might take a date column and separately extract the month or week.
But in order to maximize the robustness and accuracy of your models, you should put in the work up front to structure your data in a way that will make it easy for a machine to find the most important patterns in your data. This type of work is known as feature engineering in the world of data science. Enterprise customers can contact us to receive assistance with this process to maximize their model accuracy.
Examples: How to Structure Your Data
Here are a few examples of changes you can make to your data to help your models learn:
- Predicting Likelihood of Purchase: If you have a dataset that you would like to use to predict how likely a customer is to purchase a product, it could be helpful to have several additional columns that count how many purchases that customer has made in the previous day, week, month, etc.
- Predicting Retail Sales and Accounting for the Holidays: If you'd like to predict retail sales based on historical data, it could be helpful to have a column to indicate when sales are recorded during the time period leading up to the holidays. The value of the column could be True or False. This is known as an indicator column.
- Predicting Fraud: If you want to predict an unlikely event, like a fraudulent credit card transaction, it generally helps to upsample (artificially add more instances of fraudulent transactions to the dataset) or downsample (artificially reduce the number of instances where there was no fraud) your data. Because fraud cases are rare, they are normally hard to pick up on, and rebalancing the dataset makes it easier for the machine to find the patterns behind the case you truly care about.
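The three examples above can be sketched in plain Python. This is a minimal illustration, not Apteo's implementation: the function names, the column names, and the holiday window (November 20 through December 31) are all assumptions chosen for the example.

```python
import random
from datetime import date, timedelta

def purchases_in_window(purchase_dates, as_of, days):
    """Count purchases made in the `days` days before `as_of` (as_of excluded)."""
    start = as_of - timedelta(days=days)
    return sum(1 for d in purchase_dates if start <= d < as_of)

def is_holiday_period(day, start=(11, 20), end=(12, 31)):
    """Indicator column: True when `day` falls inside the assumed holiday window."""
    return start <= (day.month, day.day) <= end

def downsample(rows, label_key, majority_label, keep_fraction, seed=0):
    """Keep every minority row and a random fraction of the majority rows."""
    rng = random.Random(seed)
    return [
        row for row in rows
        if row[label_key] != majority_label or rng.random() < keep_fraction
    ]

# Hypothetical purchase history for one customer.
history = [date(2023, 11, 1), date(2023, 11, 25), date(2023, 11, 28)]
as_of = date(2023, 12, 1)
features = {
    "purchases_last_7_days": purchases_in_window(history, as_of, 7),
    "purchases_last_30_days": purchases_in_window(history, as_of, 30),
    "is_holiday_period": is_holiday_period(as_of),
}

# Hypothetical transactions: 100 legitimate rows and 5 fraudulent rows.
transactions = [{"is_fraud": False}] * 100 + [{"is_fraud": True}] * 5
balanced = downsample(transactions, "is_fraud", False, keep_fraction=0.1)
```

Columns like these would be added to the dataset before it is uploaded, so that the models can train on them like any other column.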
Warning: Predicting Time Series Data
When you want to predict a future value from historical patterns, the objective column in each row should hold the value that was observed in the future, while every other column holds values observed at the earlier point in time. For example, if you want to use today's interest rates and trucking miles to predict the stock price of a trucking company next week, each row should pair the interest rates and trucking miles recorded on a given day with the stock price observed 7 days later. Structuring your data this way lets the machine learn the patterns in your current data that predict future values.
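One way to build such a dataset is to shift the objective column relative to the other columns. The sketch below assumes the rows are sorted by date with exactly one row per day and no gaps; the function name and column names are hypothetical.

```python
def align_with_future_target(rows, target_key, horizon):
    """Pair each row's features with the target observed `horizon` rows later.

    Assumes `rows` is sorted by date, one row per day, with no missing days.
    The last `horizon` rows are dropped because their future value is unknown.
    """
    aligned = []
    for i in range(len(rows) - horizon):
        # Copy every column except the same-day target value.
        features = {k: v for k, v in rows[i].items() if k != target_key}
        # The objective column holds the value observed `horizon` days later.
        features["future_" + target_key] = rows[i + horizon][target_key]
        aligned.append(features)
    return aligned

# Hypothetical daily observations for 10 consecutive days.
rows = [
    {
        "day": d,
        "interest_rate": 5.0 + 0.01 * d,
        "trucking_miles": 1000 + 10 * d,
        "stock_price": 50.0 + d,
    }
    for d in range(10)
]
training_rows = align_with_future_target(rows, "stock_price", horizon=7)
```

Here `future_stock_price` would be selected as the objective column. The same-day stock price is dropped in this sketch, but it could also be kept as an ordinary input column.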
Standard Apteo Transformations
The Apteo system performs some basic transformations on your data to ensure it can be processed by machine learning algorithms, as well as to optimize their accuracy. We implement the following transformations on your data:
- Date Transformations: When your dataset has a date column, the system transforms that column into multiple other columns, accounting for the week of year, month, seasonality, and cyclicality
- Text Columns: Transforms text columns into binary variables, known as one-hot encoding
- Outliers: Removes values that are more than 6 standard deviations away from their mean
- Data Scaling: Scales the range of each column in a way that attempts to minimize the impact of outliers
- Data Imputation: Imputes missing numerical data with the median value of a data column
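To make these transformations concrete, here is a rough sketch of what several of them could look like in plain Python. It is not Apteo's code: the function names are hypothetical, the sine and cosine encoding of the month is one common way to represent cyclicality, and the use of the population standard deviation for the outlier rule is an assumption.

```python
import math
import statistics
from datetime import date

def date_features(day):
    """Expand a date into month, ISO week of year, and a cyclical month encoding."""
    angle = 2 * math.pi * day.month / 12
    return {
        "month": day.month,
        "week_of_year": day.isocalendar()[1],
        "month_sin": math.sin(angle),
        "month_cos": math.cos(angle),
    }

def one_hot(values):
    """Turn a text column into one binary column per distinct category."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]

def drop_outliers(values, max_std=6.0):
    """Drop values more than `max_std` standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumed)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) <= max_std * std]

encoded = one_hot(["red", "blue", "red"])
filled = impute_median([1.0, None, 3.0, 10.0])
cleaned = drop_outliers([10.0] * 99 + [1000.0])
expanded = date_features(date(2024, 3, 15))
```

Because the platform applies these steps automatically, a sketch like this is mainly useful for understanding what the models see after your data is uploaded.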
Apteo trains multiple learning models on your data, evaluates them using 10-fold cross-validation, and then marks the model with the best accuracy as the model to use for your data. The section below details the specifics of this process.
Regression vs. Classification
When you select your key metric, the Apteo platform determines whether to run a regression model (predict numerical values) or a classification model (predict a category) based on the following rules:
- If your objective column is a number, and more than 10% of the values in that column are unique, the system uses regression
- Otherwise, it will use classification
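The rule above can be expressed as a short function. This is a sketch of the stated rule only; the function name is hypothetical, and how the platform treats missing or mixed-type values is not covered here.

```python
from numbers import Number

def choose_task(values, unique_threshold=0.10):
    """Return "regression" or "classification" for an objective column."""
    if not values:
        raise ValueError("objective column is empty")
    # Booleans are Numbers in Python, so exclude them explicitly.
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    unique_ratio = len(set(values)) / len(values)
    if is_numeric and unique_ratio > unique_threshold:
        return "regression"
    return "classification"

prices = [19.99 + i for i in range(20)]   # numeric, every value unique
churned = [0, 1] * 50                     # numeric, only 2% unique
segments = ["gold", "silver", "bronze"]   # text labels
tasks = [choose_task(prices), choose_task(churned), choose_task(segments)]
```

Note that a numeric column with few distinct values, such as a 0/1 flag, is treated as a category under this rule.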
We use the following learning algorithms by default:
- Regression: linear regression, random forest, gradient boosting, 2-layer neural network
- Classification: 2-layer neural network, random forest, decision tree
By default, the system uses 10-fold cross-validation to evaluate the accuracy of each model. Regression models are evaluated on root mean squared error, and classification models are evaluated on the Jaccard score. We also record the following metrics for each cross-validation fold where applicable:
- Number of training records
- Number of testing records
- Mean absolute error
- Mean squared error
- Jaccard score
- The standard deviation of numerical predictions
- The standard deviation of label values
- Log loss
- F1 score
- Confusion matrix
- Root mean squared error
- Training time
- R-squared of the predictions vs. the labels
- Correlation coefficient of the predictions vs. the labels
- P-value of the above correlation
- Median absolute error
- Mean squared log error
- Max error
- Mean absolute percentage error
- Median absolute percentage error
- Mean percentage error
- Feature importance values
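For readers unfamiliar with these terms, the sketch below shows how 10-fold cross-validation splits a dataset and how the two headline metrics are computed. It is an illustration under stated assumptions, not the platform's code: the folds here are contiguous and unshuffled, and the Jaccard score is shown for the binary case only, since the document does not say how multi-class scores are averaged.

```python
import math

def k_fold_indices(n_records, k=10):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    # Spread the remainder so fold sizes differ by at most one record.
    sizes = [n_records // k + (1 if i < n_records % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_records))
        yield train, test
        start += size

def rmse(y_true, y_pred):
    """Root mean squared error: lower is better."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def jaccard(y_true, y_pred, positive=1):
    """Binary Jaccard score: true positives / (true pos. + false pos. + false neg.)."""
    pairs = list(zip(y_true, y_pred))
    both = sum(1 for t, p in pairs if t == positive and p == positive)
    either = sum(1 for t, p in pairs if t == positive or p == positive)
    return both / either if either else 0.0

folds = list(k_fold_indices(25, k=10))
regression_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
classification_score = jaccard([1, 1, 0, 0], [1, 0, 1, 0])
```

Each model is trained on the training indices of a fold and scored on the held-out test indices, so every record is used for testing exactly once across the 10 folds.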