As I go through various data analytics books, I slowly discover data mining tasks such as classification, regression, clustering, causal modelling and so on. But the same question keeps popping up in my mind, “How do I know when to do what task?”
So in this post, join me as I explore 9 popular data mining tasks, together with some common real-life applications. By the end of the post, I will also attempt to create a quick guide for framing business questions as data mining tasks, to address my earlier question. Let’s go!
Classification
Among a limited number of mutually exclusive classes, which class can we put an individual into?
- Churn analysis: Considering each customer, who is likely to switch to a competitor? The 2 mutually exclusive classes here are "will churn" and "will not churn".
- Credit application assessment: Given an application for credit cards, do we approve, reject or request further human evaluation based on personal details such as annual income and historical debts? 3 mutually exclusive classes here are approve, reject and flag for review.
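To make the churn example concrete, here is a minimal sketch of a classifier. It uses a nearest-centroid rule, which is only one of many possible classification methods, and the customer features (monthly spend, number of support calls) and all the numbers are made up for illustration.

```python
from math import dist

# Made-up training data: (monthly_spend, support_calls) -> known class
training = [
    ((80.0, 0.0), "will not churn"),
    ((75.0, 1.0), "will not churn"),
    ((90.0, 0.0), "will not churn"),
    ((30.0, 5.0), "will churn"),
    ((25.0, 4.0), "will churn"),
    ((35.0, 6.0), "will churn"),
]

def class_centroids(rows):
    """Average the feature vectors of each pre-defined class."""
    grouped = {}
    for features, label in rows:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*points))
        for label, points in grouped.items()
    }

def classify(features, centroids):
    """Assign the class whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(features, centroids[label]))

centroids = class_centroids(training)
print(classify((28.0, 5.0), centroids))
print(classify((85.0, 0.0), centroids))
```

The key point is that the classes are fixed before we start: every new customer must land in exactly one of them.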
Regression
Given a variable of interest, what is the expected value of that variable?
- Sales forecast: Given historical monthly sales numbers for the past 3 years and other macroeconomic data, predict monthly sales of the next financial year.
- Pandemic prediction: Considering all information about a pandemic outbreak, calculate the projected death rate in Country A for the next 3 months.
- Electricity usage: Taking into account the city's historical electricity usage, predict the average electricity usage for each geographical area next summer.
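The sales forecast example can be sketched with the simplest possible regression: a straight line fitted by ordinary least squares. The monthly sales figures below are invented, and a real forecast would likely need seasonality and the macroeconomic inputs mentioned above, so treat this only as an illustration of predicting a number rather than a class.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data: month number and sales (in thousands)
months = [1, 2, 3, 4, 5, 6]
sales = [100.0, 112.0, 119.0, 131.0, 140.0, 152.0]

slope, intercept = fit_line(months, sales)
forecast_month_7 = slope * 7 + intercept
print(f"Expected sales in month 7: {forecast_month_7:.1f}")
```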
Similarity matching
Which customers or products are similar to a given target?
- Product recommendation: Ever seen Netflix’s recommendations for similar movies? How about major online stores showing a section of “You may also like this…” The underlying assumption is people who like one product will likely enjoy a similar offering.
- Targeted ads: Based on Web browsing activities and online purchasing history, online advertisers identify targeted users (who share similar profiles with existing customers) to show them specific advertisements.
Clustering
At first glance, how does the entire population organise itself into different groups?
So what’s the difference between clustering and classification? Classification requires pre-defined classes whereas clustering doesn’t start with any existing grouping. If you are doing exploratory data analysis to understand similarity among things, then clustering will more likely be used. On the other hand, if you have a specific purpose in mind, which is to sort an individual into one of the pre-defined buckets, then classification is the way to go.
How about clustering versus similarity matching? Both tasks look at the similarity among things, but the purpose is different. Clustering looks for the different groups or segments that the data naturally organises itself into, whereas similarity matching identifies individuals similar to a given target.
- Customer segmentation: How many different types of customers do I have?
- Employee training plan: Considering all current employees, how many groups of roles or pathways should we plan for in career development and professional training?
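Here is a minimal k-means sketch for the customer segmentation example. k-means is just one common clustering method, and the customer points (two made-up features per customer) and the starting centroids are assumptions for illustration. Notice that, unlike classification, no labels are supplied: the groups emerge from the data.

```python
from math import dist

def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(col) / len(col) for col in zip(*cluster))
            if cluster else centroids[i]  # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up customers: (visits per month, average basket size)
customers = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.2),
             (8.0, 8.0), (8.5, 7.5), (9.0, 8.2)]

# Assumption: start from two of the data points and ask for 2 segments
centroids, clusters = k_means(customers, [customers[0], customers[3]])
for i, cluster in enumerate(clusters):
    print(f"Segment {i}: {cluster}")
```

In practice the number of segments is itself a choice you have to make, which is why "how many types of customers do I have?" is an exploratory question.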
Association rule discovery
What items or events usually occur together?
- Market basket analysis: Supermarkets analyse past transactions to understand which products are usually purchased together. The goal is to improve their store layout, conduct in-store promotion activity for cross-selling or create an online product catalog.
- Bioinformatics – Protein Sequences: By observing the sequence of different amino acids present in a protein, researchers can better understand the composition of protein sequences to facilitate the synthesis of artificial proteins.
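For market basket analysis, two standard measures are support (how often two items appear together across all baskets) and confidence (how often item B appears in baskets that contain item A). The sketch below computes both for pairs of items; the baskets are made up, and real association rule mining would also handle larger item sets.

```python
from collections import Counter
from itertools import combinations

# Made-up supermarket transactions
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

# Count how many baskets contain each item and each pair of items
item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    frozenset(pair)
    for basket in baskets
    for pair in combinations(sorted(basket), 2)
)

def support(a, b):
    """Fraction of all baskets that contain both a and b."""
    return pair_counts[frozenset((a, b))] / len(baskets)

def confidence(a, b):
    """Fraction of baskets containing a that also contain b."""
    return pair_counts[frozenset((a, b))] / item_counts[a]

print(support("bread", "butter"))
print(confidence("bread", "butter"))
print(confidence("butter", "bread"))
```

Confidence is directional: in this toy data everyone who buys butter also buys bread, but not the other way round.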
Profiling
What is the typical behaviour of this specific individual or group?
- Fraud detection: The bank usually holds a profile of your typical spending behaviour. When someone makes transactions on your credit card without you knowing, the bank can automatically cancel your credit card and inform you about the suspicious transaction for potential refunds.
- Cybersecurity alert: Recently received a ‘suspicious sign-in prevented’ email from Google? Based on your historical log-in behaviour, Google has noticed the unusual activity and flagged it for your attention.
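One simple way to picture the fraud detection example: summarise a cardholder's typical spending as a mean and standard deviation, then flag any transaction that falls too far from that profile. This is a sketch under my own assumptions (made-up amounts, a threshold of 3 standard deviations), not how any particular bank actually does it.

```python
from statistics import mean, stdev

def is_unusual(history, amount, threshold=3.0):
    """Flag an amount that lies more than `threshold` standard
    deviations away from the cardholder's usual spending."""
    typical = mean(history)
    spread = stdev(history)
    return abs(amount - typical) > threshold * spread

# Made-up past transaction amounts for one cardholder
history = [42.0, 55.0, 38.0, 61.0, 47.0, 50.0, 44.0, 58.0]

print(is_unusual(history, 52.0))   # within the usual range
print(is_unusual(history, 900.0))  # far outside the usual range
```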
Link prediction
Based on the existing relationships, what are the missing links that are likely to exist?
- Social media’s friend suggestion: The logic goes something like this. “Since you and Julie have 15 mutual friends on Facebook, maybe you would like to add Julie as your friend?”
- Criminal intelligence analysis: Given the relationships between known terrorists and their social network, police can identify possible missing links to new suspects in order to detect and prevent potential terror attacks.
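The mutual-friends logic can be sketched as a "common neighbours" score: for every person you are not yet connected to, count how many friends you share, then suggest the highest scorers. The tiny friendship graph below is invented, and real platforms almost certainly combine many more signals than this.

```python
# Made-up friendship graph: person -> set of current friends
friends = {
    "you": {"ann", "bob", "cat"},
    "julie": {"ann", "bob", "cat", "dan"},
    "ann": {"you", "julie"},
    "bob": {"you", "julie"},
    "cat": {"you", "julie"},
    "dan": {"julie", "eve"},
    "eve": {"dan"},
}

def suggest_friends(graph, person):
    """Rank non-friends by the number of mutual friends (common neighbours)."""
    scores = {}
    for other in graph:
        if other == person or other in graph[person]:
            continue  # skip yourself and existing friends
        mutual = len(graph[person] & graph[other])
        if mutual:
            scores[other] = mutual
    # Most mutual friends first, names alphabetical on ties
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

print(suggest_friends(friends, "you"))
```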
Data reduction
From a huge set of data, what is the main gist or what are the key points?
- Sentiment analysis: By analysing all existing posts on social media, companies can determine the key topics related to the brands and the products.
Causal modelling
What factors can actually influence the outcome?
- Product pricing: A company offers different service plans at different price points to determine the best price point for new service offerings. The key here is to determine how different price points affect the decision to subscribe to the service.
- Predictive maintenance: Does missing preventive maintenance for your car in the past 12 months lead to early breakdown of the car?
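For the pricing example, here is a minimal sketch of how the results of a price experiment might be compared. It assumes visitors were randomly assigned to each price point, because that random assignment is what lets us read the difference in subscription rates as an effect of price rather than of who happened to see which price. All figures are made up.

```python
# Made-up experiment results: price -> (visitors shown that price, subscribers)
# Assumption: visitors were randomly assigned to a price point.
trial = {
    9.99: (1000, 120),
    14.99: (1000, 90),
    19.99: (1000, 45),
}

def conversion_rates(results):
    """Share of visitors who subscribed at each price point."""
    return {price: subs / visitors
            for price, (visitors, subs) in results.items()}

def revenue_per_visitor(results):
    """Expected revenue from one visitor at each price point."""
    return {price: price * subs / visitors
            for price, (visitors, subs) in results.items()}

rates = conversion_rates(trial)
revenue = revenue_per_visitor(trial)
best_price = max(revenue, key=revenue.get)

for price in trial:
    print(f"{price}: {rates[price]:.1%} subscribe, "
          f"{revenue[price]:.2f} revenue per visitor")
print("Best price point in this toy data:", best_price)
```

Without the random assignment, the same table would only show a correlation, which is exactly the gap causal modelling tries to close.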
Wrapping Up: How do we frame our business questions into data mining tasks?
Here is a quick summary of how to do it.