Development of a Data Mining Study

“I love it when a plan comes together!” – John Smith 1928 – 1994

Before a plan can come together, you have to make one. Here are some key points to consider during development of a data mining study.

  • Defining the aims

    • Start by choosing the subject and relevant aspects (customers, widgets or potatoes)

    • Give thought to essential criteria and the phenomenon you will be predicting

    • The aims must lead to actions and should be very precise

  • Listing the existing data

    • Start kicking over the rocks

    • Look everywhere: IT systems (CRM/ERP), spreadsheets, archived data (past marketing campaigns)

    • Make like a salmon and swim upstream: drill into aggregated data and find the source (enhanced interrogation, maybe?)

  • Collecting the data

    • Now you make like a seagull. “Mine… mine… mine!”

    • Work down the list above and get the data someplace useful (aka a database)

    • Remember, one record (row) for each statistical event (one record, one potato)

  • Exploring and preparing the data (welcome back… keep reading, you’ll understand)

    • Cleanup on aisle three – check the origin of the data and replace missing or incorrect values

    • You will also need to find aberrant values, or outliers, that sit too far from the normally accepted range – the 600-pound potato… unless it’s an Idaho spud

    • Consider reducing the number of categories for categorical data

  • Population segmentation

    • Depending on the aims of your study you may need to segment the data-set

    • Check out Cluster Analysis for more info

  • Drawing up and validating the predictive models

    • Now here’s the meat and potatoes – using techniques such as:

    • Calculation of a score – logistic regression, linear discriminant analysis and decision trees

    • Building and comparing models – checking error rates in a confusion matrix (yeah, it’s a thing) or by superimposing their lift curves or receiver operating characteristic (ROC) curves

  • Synthesizing predictive models of different segments

    • This step is only necessary if the data-set was segmented

    • Compare the scores of the different segments

  • Rinse and repeat – go back to exploring and preparing the data, refine the methods and repeat until the results are medium rare… er, satisfactory

  • Deploying the models

    • Implement the models on a computer system so that users can… well, use them

  • Training the model users

    • Users should know the aim, the principles of the tools, the how (without drinking from the techno-jargon fire-hose) and the limitations, such as, “these are decision support tools, not tools for automatic decision-making.”

  • Monitoring the models

    • Make a one-off analysis of the results

    • Some tools – like the algorithms used to calculate credit scores – should be monitored continuously for both their correct operation and their use

  • Enriching the models

    • Continuous improvement – sharpen the saw
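
For the hands-on crowd, here is a minimal sketch of three of the steps above: spotting the 600-pound potato, folding rare categories together, and comparing two models by the error rate in their confusion matrices. Everything in it is an assumption made for illustration: the function names, the 1.5 × IQR outlier rule, the `min_count` threshold and the potato data are all made up, and a real study would pick its own rules and tools.

```python
from collections import Counter
from statistics import quantiles


def flag_outliers(values, k=1.5):
    """Return the values outside the Tukey fences (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def merge_rare_categories(labels, min_count=2, other="other"):
    """Replace categories seen fewer than min_count times with one label."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other for lab in labels]


def confusion_matrix(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    cells = Counter(zip(actual, predicted))
    return {
        "tp": cells[(1, 1)],
        "fn": cells[(1, 0)],
        "fp": cells[(0, 1)],
        "tn": cells[(0, 0)],
    }


def error_rate(matrix):
    """Share of records the model got wrong."""
    return (matrix["fp"] + matrix["fn"]) / sum(matrix.values())


# Hypothetical data: one record, one potato.
weights_lb = [0.4, 0.5, 0.45, 0.6, 0.55, 0.5, 600.0]
varieties = ["russet", "russet", "red", "red", "fingerling", "russet", "purple"]

# Hypothetical outcomes and the predictions of two candidate models.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
model_a = [1, 0, 1, 0, 0, 0, 1, 1]
model_b = [1, 1, 0, 0, 0, 1, 1, 1]

print(flag_outliers(weights_lb))
print(merge_rare_categories(varieties))
for name, predicted in (("A", model_a), ("B", model_b)):
    matrix = confusion_matrix(actual, predicted)
    print(name, matrix, error_rate(matrix))
```

On this toy data the lower error rate points to model A, but error rate is only one yardstick; lift and ROC curves compare models across every score threshold rather than a single cut-off.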
