Searching for an effective way to label and group all the data points in your data set? K-means clustering is a simple and elegant approach that has been used for partitioning a data set into K number of distinct and non-over-lapping clusters in order to reveal hidden patterns.
In Unsupervised Machine Learning algorithms, data set is comprised of observations with p number of features X1,X2,..,Xp and the target variable y is missing from the data set. Clustering methods serve as a powerful tool for finding the target variable. Target variable in an unsupervised machine learning problem is called data label.
So, what exactly is Unsupervised Machine Learning?
Suppose that you are working on a Customer Segmentation project in order to build business strategies and predict customer value. You have a very large number of customers with different purchase habits. In order to use company’s limited resources (budget and man-hour) wisely, these patterns should be revealed and strategies should be developed according to these patterns. Such cases are Unsupervised Machine Learning algorithms.
I have recently published a Kaggle notebook where I used K-means clustering approach with Python for Customer Segmentation. The Kaggle notebook is here.
How does K-means work?
The main goal is to reveal the hidden patterns within the observations, find the optimum number of clusters that the dataset should be divided into and finally put each observation in the most convenient cluster.
User picks a random number for K, that is the number of clusters that the data set would be divided into . The algorithm randomly picks K number of data points and calculates the distance between each observation and these data points.
In the above graphics, the dataset has been divided into 3 clusters. The distances between the observations and the 3 randomly picked data points have been calculated and each observation has been assigned to the nearest cluster. Each cluster and the observations that fall into the cluster are plotted in different colors for better visualization (blue, green and black). The center points of the clusters are called cluster centroids and they are being found by calculating the mean value of all the observations that fall within the cluster.
Then a second assignment is being processed and each observation is being assigned to the cluster whose centroid is the closest. The Euclidean distance formula is being used for finding the closest distance.
Is this K value the optimum number for clusters?
The main goal here is to find the optimum number of clusters. If lower than the optimum number, then the model would fail to represent all the patterns within the dataset. If the number is higher than the optimum value, this means that data points with similarities are being represented within several groups.
K-means is an iterative process and in order to find the optimum number of clusters, a range of K values are being used and the optimum one is being picked thanks to Elbow Method.
Within a range of k values, cost function is being calculated. Cost Function of K-means clustering is being defined by Sum of Squared Distance (SSD) between observations and respective centroid of cluster to which the observation belongs. Each K number and the corresponding SSD value are being visualized in a graphic. As K value increases, the SSD value tends to zero. The elbow of the curve is the optimum number of clusters.
In my Kaggle project, I used Scikit-learn for K-means calculations and Yellowbrick for the visualization of Elbow Method. By default, distortion metric (SSD) is being used for the Elbow Method.
When these overall metrics for each model are plotted, it is possible to visually determine the best value for k. If the line chart looks like an arm, then the elbow (the point of inflection on the curve) is the optimum value for k. In the above graphics, the optimum number of clusters is 6.
In the final step, the model is being built using the optimum K value that has been found and the model that would cluster the data set in optimum number of clusters is being built.
K-means is a powerful tool and it’s easy to implement in various industries. But it is very prone to outliers and this approach shouldn’t be used without business knowledge.
- Data Science and Machine Learning Bootcamp Lectures https://www.veribilimiokulu.com/bootcamp-programlari/veri-bilimci-yetistirme-programi/
- Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani. An Introduction to Statistical Learning : with Applications in R. New York :Springer, 2013.