
Clustering


Interpretation of the clusters is the key objective. We can use summary statistics to describe the clusters, or refit a tree-based model to predict the cluster labels from the features and look at the tree splits to interpret the clusters. ref

Evaluation

Methods

DBSCAN

DBSCAN does clustering and outlier flagging at the same time. discussion

It has two parameters: eps and min_samples. eps is the radius searched around each point, i.e. the maximum distance between two samples for one to be considered in the neighborhood of the other. min_samples is the number of samples required in a neighborhood for a point to be considered a core point.

DBSCAN starts from an arbitrary point and checks whether there are at least min_samples points within eps distance of it. If so, it expands the cluster by adding those points, then picks an unvisited point within the cluster and repeats the process, looping until all points in the cluster have been visited. It then moves to the next arbitrary unvisited point and repeats.

A point with at least min_samples points within eps distance is a core point. A point that is not a core point but is within eps distance of a core point is a border point. A point that is neither a core point nor within eps distance of a core point is an outlier.
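
A minimal sklearn sketch of the above; the eps and min_samples values are illustrative and need tuning per dataset:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# two interleaving half-moons; DBSCAN handles this non-convex shape well
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

# eps = neighborhood radius, min_samples = points required to form a core point
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

labels = db.labels_                  # -1 marks outliers (noise)
core_idx = db.core_sample_indices_   # indices of core points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, outliers: {list(labels).count(-1)}")
```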

K-means

Analytics Vidhya | An Introduction to K-Means Clustering

Typically, it uses the Euclidean distance, which may not be suitable for all data types.

Steps:

  1. Decide the number of clusters k
  2. Randomly pick k points to place the initial centroids
  3. Assign each data point to the nearest centroid
  4. After all data points are assigned, recalculate the centroids of the clusters by taking the mean of all data points in each cluster
  5. Repeat assigning and recalculating until the centroids do not change significantly or the maximum number of iterations is reached

As a result, it:

  • Groups similar data points together
  • Minimizes the distance between data points and their centroid
  • Maximizes the distance between centroids
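
A minimal sklearn sketch of these steps; k=3 and the synthetic blobs are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = StandardScaler().fit_transform(X)   # Euclidean distance is scale-sensitive

# step 1: choose k; steps 2-5 (init, assign, recompute, repeat) happen inside fit
km = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
labels = km.fit_predict(X)

print(km.cluster_centers_)   # final centroids
print(km.inertia_)           # within-cluster sum of squared distances being minimized
```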

K-means is an application of Expectation-Maximisation

K-means is not KNN

KNN and K-means are often confused. KNN is a classification algorithm: it takes the K nearest neighbors of an unknown point and estimates its class from those neighbors. K-means computes K centroids, ties each point to whichever centroid it is closest to, and labels it accordingly.

k-prototypes works well with mixed data types, including categorical variables
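
A small sketch using the third-party kmodes package; the toy columns and n_clusters=2 are assumptions:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes   # third-party: pip install kmodes

# toy mixed-type data: numeric age and spend, categorical channel
X = np.array([
    [25, 1200.0, "mobile"],
    [31,  450.0, "desktop"],
    [47, 3100.0, "mobile"],
    [52, 2900.0, "tablet"],
    [38,  800.0, "desktop"],
    [29,  150.0, "mobile"],
], dtype=object)

kp = KPrototypes(n_clusters=2, init="Cao", random_state=0)
labels = kp.fit_predict(X, categorical=[2])   # column index 2 holds the categorical feature
print(labels)
```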

Hierarchical clustering

Analytics Vidhya | What is Hierarchical Clustering in Python?

At each step it tells you which two things are most similar. It does not require pre-specifying the number of clusters, but it is computationally expensive and not suitable for large datasets. However, it provides a visual representation of the clusters in a dendrogram.

It can identify nested clusters, meaning it can find clusters within clusters. This is useful for datasets with a natural hierarchical structure (e.g., taxonomy of biological species).

It also offers flexibility in distance metrics (e.g., Euclidean, Manhattan) and linkage criteria (e.g., single, complete, average). This flexibility can improve clustering performance on different types of data.

It could be agglomerative or divisive. The former is bottom-up and the latter is top-down. Agglomerative starts with each data point as a cluster and then merges the closest clusters until a stopping criterion is met. Divisive starts with all data points in one cluster and then splits the cluster until a stopping criterion is met.
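
A minimal agglomerative (bottom-up) sketch with scipy; Ward linkage and the synthetic blobs are assumptions:

```python
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# agglomerative: start with 50 singleton clusters and merge the closest pair at each step
Z = linkage(X, method="ward")

dendrogram(Z)   # the visual representation of the merge order
plt.show()

labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 flat clusters
```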

We could apply hierarchical clustering to data points by measuring the distance between them, if the rows are data points. We could also apply it to features: build the correlation matrix between features and turn each correlation into a distance via $d_{ij} = 1 - |\rho_{ij}|$. sklearn implementation
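
A sketch of clustering features this way; the random data and average linkage are assumptions:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# rows = observations, columns = features (random data just for illustration)
df = pd.DataFrame(np.random.rand(200, 6), columns=[f"f{i}" for i in range(6)])

corr = df.corr()                  # pairwise correlation between features
dist = 1 - corr.abs()             # d_ij = 1 - |rho_ij|
np.fill_diagonal(dist.values, 0)  # guard against floating-point noise on the diagonal

# condensed distance vector -> hierarchical clustering of the features themselves
Z = linkage(squareform(dist.values, checks=False), method="average")
dendrogram(Z, labels=corr.columns.tolist())
plt.show()
```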

It can improve interpretability

  • Handle permutation importance multicollinearity? sklearn example Spoiler alert: seems like it makes no difference from just removing highly correlated features.
    • The problem is that the importances of highly correlated features are largely underestimated, since they carry similar predictive power: permuting one barely hurts the model while the other is intact. In short, hierarchical clustering groups any pair of highly correlated features into a node that can be grouped further, and keeps grouping until a threshold is reached; whenever two features are grouped, one of them is dropped, so at the end only weakly correlated features remain. This way the importances of the remaining features can be reported accurately. (See the sketch after this list.)
    • How does it differ from just removing highly correlated features based on a correlation threshold? Removing all the features with correlation >0.56 in the upper triangle of the correlation matrix gives similar results to hierarchical clustering. So maybe simply removing highly correlated features is enough? code
  • Handle observational causal inference multicollinearity airbnb paper
    • “While common solutions such as shrinkage estimators, principal component regressions, and partial linear regression are helpful in prediction problems, a crucial limitation hinders their applicability to causal inference problems - they cannot provide the original causal relationships.”
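
A rough sketch of the grouping-then-keep-one idea, loosely following the linked sklearn example; the Spearman correlation and the 0.5 threshold are assumptions to tune:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def select_uncorrelated_features(X, threshold=0.5):
    """Keep one feature per group of correlated features; X is (n_samples, n_features)."""
    corr = spearmanr(X).correlation   # rank correlation between feature columns
    corr = (corr + corr.T) / 2        # enforce symmetry
    np.fill_diagonal(corr, 1)
    dist = 1 - np.abs(corr)           # same d_ij = 1 - |rho_ij| transform as above
    Z = linkage(squareform(dist, checks=False), method="average")
    groups = fcluster(Z, t=threshold, criterion="distance")
    # keep the first feature of each correlated group
    return sorted(np.where(groups == g)[0][0] for g in np.unique(groups))

# permutation importance would then be computed on the reduced feature set as usual
```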

Industries

  • Customer Segmentation with RFM Analysis and Clustering Algorithms. - Create RFM scores using quantiles on recency, frequency, and monetary value respectively, concatenating the digits instead of adding them, and then cluster on the raw RFM features (not the scores) with KMeans. As a result, clustering segments customers into groups, whereas RFM scoring helps us interpret the clusters. (A rough sketch follows this list.)
  • How often you get to use unsupervised ML models? What is your go to model?
  • Has anyone here been successful applying clustering algorithms on real data?
  • Has Anyone Actually Used Clustering to Solve an Industry Problem?
    • Yes. I had to categorize new products based on a set of features so that we could accurately price them for the market.
    • Also a retail data scientist, we’ve clustered transactions to understand customer motivations too.
    • Clustered to learn important member features, characterize target populations for a new health program.
    • In IT Ops we cluster tickets so we can identify common problems and find opportunities for automation.
    • Recsys, I approach those as a combo of clustering, then regression. First you cluster your current data, then you regress new samples to fit them into your clusters. Recommend stuff according to the cluster. I’m currently implementing a system like this for music recommendation, using agglomerative clustering and random forest regression.
    • Had a project to determine the location that would minimize the shipping distance from a distribution center to stores in its region. Clustering is great for that.
    • Yeah, it’s one of the most important/bang-for-buck tools. I use it extensively in recommendation systems, and clustering made the difference between a non-viable product and a viable-product for us.
      • We use clustering as a form of lossy compression to summarize large usage histories.
        • The typical single-user-vector approach from collaborative filtering didn’t work for us because of what our problem-space looks like. Not viable.
        • Looking at a user as a collection of N weighted item vectors, one per item that they ever interacted with worked perfectly, but it’s too expensive. Not viable.
        • Clustering is used to summarize a large history to a much smaller set of virtual item vectors that provide higher-resolution information about the user, but which don’t scale boundlessly as the usage histories get larger.
    • Clustering for topic detection, which powers an internal knowledge base. Clustering to understand ‘user activity profiles’ (eg content producer, curator, lurker, etc). Cluster centroids were used to map each user to the closest cluster and then this is used as a feature for the recommended system.
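
A rough sketch of the RFM-scoring-plus-KMeans recipe from the first bullet above; the column names, quartile scores, and n_clusters=4 are assumptions:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def rfm_segment(df, n_clusters=4):
    """df has one row per customer with recency, frequency, monetary columns (assumed names)."""
    df = df.copy()
    # quartile scores 1-4; recency is reversed so that more recent = higher score
    df["R"] = pd.qcut(df["recency"], 4, labels=[4, 3, 2, 1]).astype(int)
    df["F"] = pd.qcut(df["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    df["M"] = pd.qcut(df["monetary"], 4, labels=[1, 2, 3, 4]).astype(int)

    # concatenate (not add) the digits, e.g. R=4, F=1, M=2 -> "412", for interpretation
    df["rfm_score"] = df["R"].astype(str) + df["F"].astype(str) + df["M"].astype(str)

    # cluster on the raw RFM features (not the scores) for segmentation
    X = StandardScaler().fit_transform(df[["recency", "frequency", "monetary"]])
    df["cluster"] = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    return df
```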

Customer Segmentation - Mixed Data Types

When working with mixed data types for unsupervised learning, I’ve found that using models like k-prototypes (which handles both categorical and numerical data) or Gower distance (a measure that works well for mixed data) can be effective for clustering. Preprocessing is crucial; I typically standardize numerical features and encode categorical variables using methods like one-hot encoding or label encoding, depending on the model.
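
A sketch of the Gower-distance route using the third-party gower package; the toy columns and the two-cluster cut are assumptions:

```python
import pandas as pd
import gower   # third-party: pip install gower
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# toy mixed-type customer table with made-up columns
df = pd.DataFrame({
    "age":     [25, 31, 47, 52, 38, 29],
    "spend":   [1200.0, 450.0, 3100.0, 2900.0, 800.0, 150.0],
    "channel": ["mobile", "desktop", "mobile", "tablet", "desktop", "mobile"],
})

gm = gower.gower_matrix(df)                       # pairwise Gower distances handle mixed types
Z = linkage(squareform(gm, checks=False), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 clusters
print(labels)
```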

For more nuanced approaches, some practitioners run separate cluster analyses for categorical and numerical variables, then combine the results using ensemble methods or apply dimensionality reduction techniques (e.g., PCA for numerical data, MCA for categorical data) before clustering.

In terms of adding value, using ML for customer segmentation has allowed my team to uncover unexpected customer personas that didn’t surface with RFM analysis alone. These insights have led to more personalized, behavior-based marketing strategies, improving customer engagement and conversion rates. ref

It’s more informative to do the clustering on RFM features and then look at what demographic stratifications there are in the clusters after the fact.

I would create the segments with RFM first, then use a clustering method. You can run a campaign targeting customers from both methods and A/B test to see which one did better.

At my company, we used RFM a lot and got faster and more profitable results when targeting customers in marketing campaigns. The stakeholders want simple solutions and need to understand how customers were segmented; RFM is a lot easier to explain than clusters.

Readings