“I love it when a plan comes together!” – Col. John “Hannibal” Smith, The A-Team (George Peppard, 1928–1994)
Before a plan can come together, you have to make one. Here are the key steps to consider when developing a data mining study.
- Defining the aims
  - Start by choosing the subject and the relevant aspects (customers, widgets, or potatoes)
  - Give thought to the essential criteria and the phenomenon you will be predicting
  - The aims must lead to actions and should be very precise
- Listing the existing data
  - Start kicking over the rocks
  - Look everywhere: IT systems (CRM/ERP), spreadsheets, archived data (past marketing campaigns)
  - Make like a salmon and swim upstream: drill into aggregated data and find the source (enhanced interrogation, maybe?)
- Collecting the data
  - Now you make like a seagull: “Mine… mine… mine!”
  - Work down the list from the previous step and get the data someplace useful (a.k.a. a database); see the sketch below
  - Remember: one record (row) for each statistical unit (one record, one potato)
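Here is a minimal sketch of this step in Python using pandas and SQLite. It assumes the collected sources were first exported to a CSV file; the file name ("potatoes.csv"), the table name, and the columns are invented for illustration.

```python
# A minimal sketch: land the collected records in a database.
import sqlite3

import pandas as pd

# Each row of the (hypothetical) CSV is one statistical unit:
# one record, one potato.
df = pd.read_csv("potatoes.csv")

# Get it someplace useful: a single table in a SQLite database.
with sqlite3.connect("study.db") as conn:
    df.to_sql("potatoes", conn, if_exists="replace", index=False)

print(len(df), "records loaded")
```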
- Exploring and preparing the data (welcome back… keep reading, you’ll understand)
  - Cleanup on aisle three: check where each value came from, and replace missing or incorrect data
  - Hunt down the aberrant values, the outliers that sit too far from the normally accepted values: the 600-pound potato… unless it’s an Idaho spud
  - Consider reducing the number of categories for categorical data (see the sketch below)
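A minimal sketch of the cleanup in pandas: the columns ("weight_lb", "variety") and the thresholds are invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only one.

```python
import pandas as pd

df = pd.read_csv("potatoes.csv")

# Replace missing numeric values; the median is one defensible default.
df["weight_lb"] = df["weight_lb"].fillna(df["weight_lb"].median())

# Flag aberrant values with the 1.5 * IQR rule of thumb.
q1, q3 = df["weight_lb"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = (df["weight_lb"] < q1 - 1.5 * iqr) | (df["weight_lb"] > q3 + 1.5 * iqr)
print("suspect records:", outliers.sum())  # the 600-pound potatoes

# Reduce the number of categories: lump rare varieties into "other".
counts = df["variety"].value_counts()
rare = counts[counts < 30].index
df.loc[df["variety"].isin(rare), "variety"] = "other"
```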
- Population segmentation
  - Depending on the aims of your study, you may need to segment the data set
  - Check out cluster analysis for more info; a sketch follows this list
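As a sketch of what segmentation can look like, here is k-means (one common cluster-analysis technique) via scikit-learn; the feature columns and the choice of four segments are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("potatoes.csv").dropna(subset=["weight_lb", "length_in"])

# Standardize first so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(df[["weight_lb", "length_in"]])

# Assign every record to one of four segments (k is a modelling choice).
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(df["segment"].value_counts())
```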
- Drawing up and validating the predictive models
  - Now here’s the meat and potatoes: using techniques such as
    - Calculation of a score: logistic regression, linear discriminant analysis, and decision trees
    - Building and comparing models: checking error rates in a confusion matrix (yeah, it’s a thing) or superimposing their lift curves or receiver operating characteristic (ROC) curves (see the sketch below)
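A minimal sketch of one path through this step: a logistic-regression score, a confusion matrix, and the area under the ROC curve, all via scikit-learn. The target column ("is_rotten"), the features, and the 0.5 cutoff are invented for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("potatoes.csv").dropna()
X, y = df[["weight_lb", "length_in"]], df["is_rotten"]

# Hold out a test set so error rates are measured on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LogisticRegression().fit(X_train, y_train)

# The score: a probability between 0 and 1 for each record.
scores = model.predict_proba(X_test)[:, 1]

# Yeah, it's a thing: rows are actual classes, columns are predictions.
print(confusion_matrix(y_test, (scores > 0.5).astype(int)))

# Area under the ROC curve: compare this figure across competing models.
print("AUC:", roc_auc_score(y_test, scores))
```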
- Synthesizing the predictive models of different segments
  - This step is only necessary if the data set was segmented
  - Compare the scores of the different segments (a sketch follows this list)
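A minimal sketch of the comparison, continuing the invented columns from the earlier sketches ("segment", "is_rotten"): fit one model per segment, then line up their scores on held-out data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Assumes a file carrying the segment labels from the clustering sketch.
df = pd.read_csv("potatoes_segmented.csv").dropna()

for seg, part in df.groupby("segment"):
    X, y = part[["weight_lb", "length_in"]], part["is_rotten"]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=0
    )
    scores = LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    # A segment whose score is much weaker may need its own treatment.
    print(f"segment {seg}: AUC = {roc_auc_score(y_te, scores):.3f}")
```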
- Rinse and repeat: go back to exploring and preparing the data, refine the methods, and repeat until the results are medium rare… er, satisfactory
- Deploying the models
  - Implement the models on a computer system so that users can… well, use them (a persistence sketch follows this list)
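One minimal way to sketch deployment, assuming a fitted scikit-learn model from the earlier sketches: persist it with joblib (one common choice, not the only one) so the system can score new records on demand. The file and column names are invented here.

```python
import joblib
import pandas as pd

def save_model(model, path: str = "potato_model.joblib") -> None:
    """Run once, at the end of the modelling phase."""
    joblib.dump(model, path)

def score_new_records(csv_path: str,
                      path: str = "potato_model.joblib") -> pd.Series:
    """What the deployed system runs: load the model, score new records."""
    model = joblib.load(path)
    new = pd.read_csv(csv_path)
    scores = model.predict_proba(new[["weight_lb", "length_in"]])[:, 1]
    return pd.Series(scores, index=new.index, name="score")
```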
- Training the model users
  - Users should know the aim of the study, the principles of the tools, the how (without drinking from the techno-jargon fire hose), and the limitations, such as: “these are decision support tools, not tools for automatic decision-making”
- Monitoring the models
  - Some results only need a one-off analysis
  - Other tools, such as the algorithms used to calculate credit scores, should be monitored continuously for both their correct operation and their use (see the sketch below)
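As one sketch of continuous monitoring, here is the population stability index (PSI), a metric commonly used to watch a credit-style score for drift; the technique is my example rather than anything prescribed above, and the 0.1/0.25 thresholds are a widely used rule of thumb, not a law.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Compare today's score distribution against the development sample."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch scores outside old range
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Rule of thumb: PSI < 0.1 is stable; 0.1-0.25 means keep an eye on it;
# above 0.25, investigate before trusting the scores.
```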
- Enriching the models
  - Continuous improvement: sharpen the saw