Bootstrapping K-Means: Getting a sense of centroid, boundary, and label uncertainties

Intro

K‑Means, like many Machine Learning algorithms, can be highly sensitive to the particular realization of the training data. This sensitivity is especially pronounced in small or noisy datasets, where different samples from the same underlying population may lead to noticeably different centroid locations, cluster boundaries, and label assignments. As a result, the predictions of a K‑Means model inherently carry uncertainty - an important consideration in applications where misclassification has practical consequences (e.g., assigning a customer to the wrong market segment).

A principled and widely used approach to quantify this uncertainty is bootstrap resampling. The bootstrap generates B replicated datasets by sampling, with replacement, n points from the original n. Fitting a K‑Means model to each replica produces a distribution of centroid positions and cluster assignments. These distributions can then be used to derive uncertainty measures such as confidence intervals for the centroids (not implemented here) and cluster‑membership probabilities for individual observations (implemented in this work).
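
To make the idea concrete, here is a minimal sketch of the core resampling loop, using plain NumPy and scikit-learn (an illustration of the idea, not the actual KMeansBoot internals):

import numpy as np
from sklearn.cluster import KMeans

def bootstrap_centroids(X, n_clusters, n_boot, random_state=None):
    '''Fit KMeans on n_boot bootstrap resamples and collect the centroids.'''
    rng = np.random.default_rng(random_state)
    n = X.shape[0]
    centroids = np.empty((n_clusters, X.shape[1], n_boot))
    for bb in range(n_boot):
        idx = rng.integers(0, n, size = n)  # draw n indices with replacement
        km = KMeans(n_clusters = n_clusters, n_init = 10).fit(X[idx])
        centroids[:, :, bb] = km.cluster_centers_
    return centroids

Note that cluster indices are arbitrary across replicas, so any per-cluster aggregation (such as membership probabilities) requires first aligning each replica's labels with a reference fit, e.g., by matching centroids.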

KMeansBoot is a scikit‑learn-compatible estimator that integrates bootstrap resampling with K‑Means clustering to provide uncertainty estimates for cluster memberships, centroids, and decision boundaries. Because it adheres to the scikit‑learn API, it can be seamlessly integrated into pipelines that include preprocessing steps such as scaling or dimensionality reduction. This notebook demonstrates the use of KMeansBoot on several illustrative toy datasets.

Example of Usage

In [1]:
### DO THIS BEFORE IMPORTING NUMPY OR SKLEARN TO AVOID A KNOWN KMEANS MEMORY LEAK ON WINDOWS
import os
os.environ["OMP_NUM_THREADS"] = "1"   # a single OpenMP thread avoids the leak entirely
os.environ["MKL_NUM_THREADS"] = "2"   # optional, often helps too

import sklearn
print('The scikit-learn version is {}.'.format(sklearn.__version__))
The scikit-learn version is 1.4.2.

Load libraries and set random seed

In [2]:
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.datasets import load_digits, load_iris, load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from BootstrapKMeans import KMeansBoot

random_state = 217

First we define helper functions to plot the results and return useful information about the instances with membership uncertainty:

In [3]:
## ==========================
## functions to plot kmeans uncertainty
    
def get_color_code(cmaplist, labels_prob):
    '''Blend the cluster colors: each instance gets the average of the cluster
    colors, weighted by its membership probabilities.'''
    n_clusters = labels_prob.shape[1]
    # weighted average of the RGB colors: (n_samples, n_clusters) @ (n_clusters, 3)
    cols = labels_prob @ np.asarray(cmaplist)[:n_clusters, 0:3]
    # clip all values between 0 and 1 to avoid floating-point rounding problems
    return np.clip(cols, 0.0, 1.0, out=cols)
    
def plot_pred(X_tr, X_ts, y_tr, kmeans_mod):
    ''' Represent the KMeans results (labels plus centroids)
    together with the bootstrap uncertainty of the labels and centroids.
    Only the first two dimensions are plotted.
    
    To visualize the level of membership/label uncertainty, the color of each
    instance ('col') is the average of all available colors (cmaplist), weighted
    by the label probabilities given by labels_prob.
    
    In addition, we store the indices ('id_unc') of the instances that have some
    level of uncertainty: if none of the values in labels_prob is 1, the
    assignment is uncertain.
    
    Input:
    -------
    X_tr: {array-like, sparse matrix} of shape (n_samples_tr, n_features)
        Training instances to cluster.
    X_ts: {array-like, sparse matrix} of shape (n_samples_ts, n_features)
        Test instances to assign labels to.
    y_tr: true labels of the training instances (not used for plotting)
    kmeans_mod: KMeansBoot model fitted on X_tr
    
    Returns:
    labels_prob_tr: ndarray of shape (n_samples_tr, n_clusters)
        Probability of each point in the train set to belong to each cluster.
    id_unc_tr: ndarray of variable shape
        Indices of the observations with label uncertainty in the train set.
    labels_prob_ts: ndarray of shape (n_samples_ts, n_clusters)
        Probability of each point in the test set to belong to each cluster.
    id_unc_ts: ndarray of variable shape
        Indices of the observations with label uncertainty in the test set.
    
    '''
    # convert datasets to numpy arrays
    if not isinstance(X_tr, np.ndarray):
        X_tr = X_tr.to_numpy()
    if not isinstance(X_ts, np.ndarray):
        X_ts = X_ts.to_numpy()
        
    # get params from KMeansBoot model
    centroids = kmeans_mod.cluster_centers_
    labels_tr = kmeans_mod.labels_
    labels_prob_tr = kmeans_mod.labels_boot_prob_
    centroids_boot = kmeans_mod.cluster_centers_boot_
    id_unc_tr = kmeans_mod.uncertain_ins_
    
    # predict label probabilities for the test set
    labels_ts, labels_prob_ts, id_unc_ts = kmeans_mod.predict(X_ts)
    
    ## Find uncertain labels and assign colors to probabilities
    # build one colormap color per cluster (derived from labels_prob_tr, not a global)
    n_clusters = labels_prob_tr.shape[1]
    cmap = plt.cm.gist_rainbow
    cmaplist = cmap(np.linspace(0.0, 1.0, n_clusters, endpoint = False))

    # for the train set
    cols_tr = get_color_code(cmaplist, labels_prob_tr)
    
    # for the test set
    cols_ts = get_color_code(cmaplist, labels_prob_ts)      
       
    ### plot the results
    fig, ax = plt.subplots(1, 3, figsize = (15,4.5))
    
    ## left plot (conventional KMeans results)
    # plot all observations; color according to their label/membership
    ax[0].scatter(X_tr[:,0],X_tr[:,1], c = cmaplist[labels_tr], s = 15)
    # Add centroids
    ax[0].scatter(centroids[:,0], centroids[:,1], c = 'white', edgecolor = 'black', marker = 'p', s = 80)
    ax[0].set_title('KMeans centroids and labels \n(Train set)')

    # create a custom legend (one entry per cluster, colored like its points)
    leg_handles = []
    for jj in range(n_clusters):
        idx = np.where(labels_tr == jj)[0][0]
        legend = mlines.Line2D([], [], color = cmaplist[labels_tr[idx]], marker = 'o', ls = '', label = str(labels_tr[idx]))
        leg_handles.append(legend)
    legend1 = ax[0].legend(handles = leg_handles, loc = 'best', fontsize = 10, handletextpad = 0.4, handlelength = 0.75,
                        labelspacing = 0.2, columnspacing = 1, frameon=True, ncols = 3, labelcolor = 'linecolor')
    ax[0].add_artist(legend1)
    
    ## center plot (bootstrapped KMeans results for train set)
    ax[1].scatter(X_tr[id_unc_tr,0], X_tr[id_unc_tr,1], c = cols_tr[id_unc_tr], s = 15)
    # Add bootstrapped centroids (one faint dot per replica)
    for bb in range(centroids_boot.shape[2]):
        ax[1].scatter(centroids_boot[:,0,bb], centroids_boot[:,1,bb], c = 'black', marker = '.', alpha = 0.1)
    ax[1].set_title('KMeansBoot bootstrapped centroids \nand uncertain labels (Train set)')
    
    ## right plot (label uncertainties for test set)
    # plot only instances with uncertainty
    ax[2].scatter(X_ts[id_unc_ts,0], X_ts[id_unc_ts,1], c = cols_ts[id_unc_ts], s = 15)
    ax[2].set_title('KMeansBoot uncertain labels \n(Test set)')
    
    # match scales
    ax[1].set_xlim(ax[0].get_xlim())
    ax[1].set_ylim(ax[0].get_ylim())
    ax[2].set_xlim(ax[0].get_xlim())
    ax[2].set_ylim(ax[0].get_ylim())
    
    return labels_prob_tr, id_unc_tr, labels_prob_ts, id_unc_ts

Example with an artificial dataset (Blobs)

We generate three clusters (blobs) and split the dataset into train and test sets of equal size (an arbitrary choice for this demo).

In [4]:
n_samples = 1000
# different variances 
X_blobs, y = make_blobs(n_samples = n_samples, cluster_std = [1.0, 2.5, 1.5], random_state = random_state)
# split in train and test sets
X_blobs_tr, X_blobs_ts, y_blobs_tr, _ = train_test_split(pd.DataFrame(X_blobs), y, test_size=0.5, random_state = random_state)

In a real scenario we would first have to determine the number of clusters to use, resorting to one of the well-known criteria (elbow plot, silhouette, gap statistic, etc.). Unfortunately for real-world practitioners, that number is itself affected by the particular train set (sampling bias), but it can in turn also be estimated with the bootstrap, as is nicely explained here. I will skip that process and set n_clusters to 3. In addition, the number of bootstrap replicas (n_boot) is set to a value large enough to be useful without being too time consuming.
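
For completeness, here is a minimal sketch of one such criterion, the silhouette score, using plain scikit-learn (this scan is not part of KMeansBoot; one would pick the k with the highest score):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# scan a small range of candidate cluster counts on the train set
for k in range(2, 7):
    labels = KMeans(n_clusters = k, n_init = 10, random_state = random_state).fit_predict(X_blobs_tr)
    print('k = {}: silhouette = {:.3f}'.format(k, silhouette_score(X_blobs_tr, labels)))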

In [5]:
# set number of bootstrap replicas and number of clusters.
n_boot = 299
n_clusters = 3
# instantiate and fit clustering
kmeans_boot_blobs = KMeansBoot(n_clusters = n_clusters, 
                               n_boot = n_boot,
                               random_state = random_state, 
                               verbose = 0).fit(X_blobs_tr)

Once fitted, we can access the parameters of the KMeansBoot model. For example:

In [6]:
# Coordinates of the cluster centers (for the train set)
centroids = kmeans_boot_blobs.cluster_centers_ 
# Labels of each point (for the train set)
labels_tr = kmeans_boot_blobs.labels_
# Probability of each point to belong to each cluster
labels_prob_tr = kmeans_boot_blobs.labels_boot_prob_
# Indices of the instances with label uncertainty
id_unc_tr = kmeans_boot_blobs.uncertain_ins_

# show membership probabilities for a few uncertain observations
df_prob = pd.DataFrame(labels_prob_tr[id_unc_tr,:], index = X_blobs_tr.index[id_unc_tr])
display(HTML("<h5>Cluster membership probabilities:</h5>"))
df_prob.head(10)
Cluster membership probabilities:
Out[6]:
0 1 2
746 0.591973 0.408027 0.000000
488 0.872910 0.127090 0.000000
672 0.073579 0.926421 0.000000
260 0.575251 0.000000 0.424749
510 0.856187 0.123746 0.020067
592 0.066890 0.933110 0.000000
32 0.090301 0.000000 0.909699
453 0.030100 0.969900 0.000000
547 0.030100 0.969900 0.000000
975 0.528428 0.000000 0.471572
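
Conceptually, each of these probabilities is just the fraction of bootstrap replicas that assign the instance to a given cluster, after the replica labels have been aligned to a reference clustering (as noted in the intro). A minimal sketch of that computation (an illustration, not necessarily KMeansBoot's exact internals), assuming a hypothetical array boot_labels of shape (n_samples, n_boot) that holds the aligned cluster index of each instance in each replica:

def membership_probabilities(boot_labels, n_clusters):
    '''Fraction of bootstrap replicas assigning each instance to each cluster.'''
    probs = np.zeros((boot_labels.shape[0], n_clusters))
    for jj in range(n_clusters):
        probs[:, jj] = (boot_labels == jj).mean(axis = 1)
    return probs  # rows sum to 1 by construction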

And make predictions of memberships and uncertainties for the test set:

In [7]:
# predict labels, its probabilities and uncertain assignments for test set
labels_ts, labels_prob_ts, id_unc_ts = kmeans_boot_blobs.predict(X_blobs_ts)

# show membership probabilities for a few uncertain observations
df_prob = pd.DataFrame(labels_prob_ts[id_unc_ts,:], index = X_blobs_ts.index[id_unc_ts])
display(HTML("<h5>Cluster membership probabilities:</h5>"))
df_prob.head(10)
Cluster membership probabilities:
Out[7]:
0 1 2
38 0.026756 0.973244 0.000000
969 0.026756 0.973244 0.000000
617 0.936455 0.000000 0.063545
399 0.013378 0.986622 0.000000
552 0.545151 0.454849 0.000000
286 0.872910 0.000000 0.127090
586 0.986622 0.010033 0.003344
644 0.501672 0.498328 0.000000
397 0.311037 0.000000 0.688963
352 0.852843 0.147157 0.000000
In [8]:
print('The percentage of uncertain train instances is {:.1f}.'.format(len(id_unc_tr)/X_blobs_tr.shape[0]*100))
print('The percentage of uncertain test instances is {:.1f}.'.format(len(id_unc_ts)/X_blobs_ts.shape[0]*100))
The percentage of uncertain train instances is 9.2.
The percentage of uncertain test instances is 8.6.

Finally, we can plot the results using the functions defined above. In particular, we can plot:

  • The conventional KMeans clustering results on the train set (left panel)
  • The instances that KMeansBoot has identified as uncertain in the train set and the bootstrapped positions of the centroids (middle panel)
  • The instances uncertainly assigned to a cluster in the test set (right panel)
In [9]:
labels_prob_tr, id_unc_tr, labels_prob_ts, id_unc_ts = plot_pred(X_blobs_tr, X_blobs_ts, y_blobs_tr, kmeans_boot_blobs)
[Figure: KMeans centroids and labels (train set); KMeansBoot bootstrapped centroids and uncertain labels (train set); KMeansBoot uncertain labels (test set)]

Example with a toy dataset (Iris)

In [10]:
# load data
X_iris, y = load_iris(return_X_y = True)
# split in train and test sets
X_iris_tr, X_iris_ts, y_iris_tr, _ = train_test_split(pd.DataFrame(X_iris), y, test_size=0.3, random_state = random_state)
# select number of clusters
n_clusters = 3
# instantiate and fit clustering
kmeans_boot_iris = KMeansBoot(n_clusters = n_clusters,  
                              n_boot = n_boot,
                              random_state = random_state, 
                              verbose = 0).fit(X_iris_tr)
In [11]:
labels_prob_tr, id_unc_tr, labels_prob_ts, id_unc_ts = plot_pred(X_iris_tr, X_iris_ts, y_iris_tr, kmeans_boot_iris)
[Figure: KMeans centroids and labels (train set); KMeansBoot bootstrapped centroids and uncertain labels (train set); KMeansBoot uncertain labels (test set)]
In [12]:
df_prob = pd.DataFrame(labels_prob_ts[id_unc_ts,:], index = X_iris_ts.index[id_unc_ts])
display(HTML("<h5>Cluster membership probabilities:</h5>"))
df_prob.head(10)
Cluster membership probabilities:
Out[12]:
0 1 2
142 0.939799 0.0 0.060201
83 0.953177 0.0 0.046823
121 0.979933 0.0 0.020067
50 0.658863 0.0 0.341137
76 0.913043 0.0 0.086957
58 0.989967 0.0 0.010033
101 0.939799 0.0 0.060201
147 0.026756 0.0 0.973244
In [13]:
print('The percentage of uncertain train instances is {:.1f}.'.format(len(id_unc_tr)/X_iris_tr.shape[0]*100))
print('The percentage of uncertain test instances is {:.1f}.'.format(len(id_unc_ts)/X_iris_ts.shape[0]*100))
The percentage of uncertain train instances is 18.1.
The percentage of uncertain test instances is 17.8.

Example with the wine dataset (Pipeline: Standardization + PCA + K-Means)

On the effect of changing the threshold parameter on the uncertainty determination.

In [14]:
# load data
X_wine, y = load_wine(return_X_y = True)
# split in train and test sets
X_wine_tr, X_wine_ts, y_wine_tr, _ = train_test_split(pd.DataFrame(X_wine), y, test_size=0.2, random_state = random_state)
# select number of PCA components and clusters
n_components = 3 
n_clusters = 3
# instantiate and fit pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('PCA', PCA(n_components = n_components, whiten = True, random_state=random_state)), 
                 ('clustering', KMeansBoot(n_clusters = n_clusters,  
                                           n_boot = n_boot,
                                           random_state = random_state, 
                                           verbose = 0))]).fit(X_wine_tr)
In [15]:
# transform the data with the same preprocessing (only the first two PCs are plotted)
pipe_pca = Pipeline([('scaler', StandardScaler()), 
                  ('PCA', PCA(n_components = n_components, whiten = True, random_state = random_state))])

X_pca_tr = pipe_pca.fit_transform(X_wine_tr)

X_pca_ts = pipe_pca.transform(X_wine_ts)

# plot results
labels_prob_tr, id_unc_tr, labels_prob_ts, id_unc_ts = plot_pred(X_pca_tr, X_pca_ts, y_wine_tr, pipe['clustering'])
[Figure: KMeans centroids and labels (train set); KMeansBoot bootstrapped centroids and uncertain labels (train set); KMeansBoot uncertain labels (test set)]

Notice how this dataset shows very large uncertainty: the cluster memberships and centroids are ill-defined.

In [16]:
print('The percentage of uncertain train instances is {:.1f}.'.format(len(id_unc_tr)/X_wine_tr.shape[0]*100))
print('The percentage of uncertain test instances is {:.1f}.'.format(len(id_unc_ts)/X_wine_ts.shape[0]*100))
The percentage of uncertain train instances is 21.1.
The percentage of uncertain test instances is 27.8.

We can set the threshold parameter to a smaller value, so that fewer instances are flagged as having membership uncertainty.
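
Assuming threshold acts as a cutoff on the maximum membership probability, i.e., an instance is flagged as uncertain when no single cluster reaches that probability (this matches the observed behaviour, but it is an assumption about the internals), the flagging logic would look like:

def flag_uncertain(labels_prob, threshold = 1.0):
    # indices of the instances whose largest membership probability is below threshold
    return np.where(labels_prob.max(axis = 1) < threshold)[0]

# with threshold = 0.9, instances with a >= 90% majority vote across replicas count as certain
id_unc_example = flag_uncertain(labels_prob_tr, threshold = 0.9)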

In [17]:
# change threshold value
threshold = 0.9 # WE CHANGE THIS
# instantiate and fit pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('PCA', PCA(n_components = n_components, whiten = True, random_state = random_state)), 
                 ('clustering', KMeansBoot(n_clusters = n_clusters,  
                                           n_boot = n_boot,
                                           threshold = threshold, #WE CHANGE THIS
                                           random_state = random_state, 
                                           verbose = 0))]).fit(X_wine_tr)
In [18]:
# plot results
labels_prob_tr, id_unc_tr, labels_prob_ts, id_unc_ts = plot_pred(X_pca_tr, X_pca_ts, y_wine_tr, pipe['clustering'])
[Figure: KMeans centroids and labels (train set); KMeansBoot bootstrapped centroids and uncertain labels (train set); KMeansBoot uncertain labels (test set)]

Notice how the number of uncertain instances has been reduced, although considerable uncertainty remains:

In [19]:
print('The percentage of uncertain train instances is {:.1f}.'.format(len(id_unc_tr)/X_wine_tr.shape[0]*100))
print('The percentage of uncertain test instances is {:.1f}.'.format(len(id_unc_ts)/X_wine_ts.shape[0]*100))
The percentage of uncertain train instances is 7.0.
The percentage of uncertain test instances is 16.7.

Example with the digits dataset (Pipeline: Standardization + PCA + K-Means)

On the effect of changing the number of PCA components on the uncertainty determination.

Let's start by using only the first 2 PCs, which already have enough discriminative power (visually, the clusters appear quite distinct in 2D):

In [20]:
# load data
X_digits, y = load_digits(return_X_y = True)
# split in train and test sets
X_digits_tr, X_digits_ts, y_digits_tr, _ = train_test_split(pd.DataFrame(X_digits), y, test_size=0.2, random_state = random_state)
# select number of PCA components and clusters
n_clusters = 10
n_components = 2
# instantiate and fit pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('PCA', PCA(n_components = n_components, whiten = True, random_state = random_state)), 
                 ('clustering', KMeansBoot(n_clusters = n_clusters,  
                                           n_boot = n_boot,
                                           random_state = random_state, 
                                           verbose = 0))]).fit(X_digits_tr)
In [21]:
# transform data to represent first two PCs
pipe_pca = Pipeline([('scaler', StandardScaler()), 
                  ('PCA', PCA(n_components = n_components, whiten = True, random_state = random_state))])

X_pca_tr = pipe_pca.fit_transform(X_digits_tr)

X_pca_ts = pipe_pca.transform(X_digits_ts)

# plot results
labels_prob_tr, id_unc_tr, labels_prob_ts, id_unc_ts = plot_pred(X_pca_tr, X_pca_ts, y_digits_tr, pipe['clustering'])
[Figure: KMeans centroids and labels (train set); KMeansBoot bootstrapped centroids and uncertain labels (train set); KMeansBoot uncertain labels (test set)]

KMeansBoot does a nice job clustering this dataset. Notice also that, as one might expect, the digit 4 is the one showing the largest dispersion in its bootstrapped centroids. Even so, the uncertainty is large:

In [23]:
print('The percentage of uncertain train instances is {:.1f}.'.format(len(id_unc_tr)/X_digits_tr.shape[0]*100))
print('The percentage of uncertain test instances is {:.1f}.'.format(len(id_unc_ts)/X_digits_ts.shape[0]*100))
The percentage of uncertain train instances is 44.9.
The percentage of uncertain test instances is 43.3.

We can try to improve this by increasing the number of PCs used. Keep a few things in mind about the digits dataset and KMeans:

  • The first 2 PCs already capture very clear digit‑separating directions
  • Most of the variance (~90-95%) is captured by the first ~10 PCs (see the check after this list)
  • The later PCs (PC10+, and especially PC20-PC63) mostly contain noise, which certainly hurts KMeans, especially when bootstrapping
  • KMeans struggles in high dimensions ("curse of dimensionality")
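
The variance claim above is easy to check directly (a quick sanity check with plain scikit-learn; the exact percentages depend on the split):

# cumulative explained variance of the standardized digits features
evr = PCA(random_state = random_state).fit(StandardScaler().fit_transform(X_digits_tr)).explained_variance_ratio_
print('Cumulative variance of the first 10 PCs: {:.1%}'.format(evr[:10].sum()))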

Then, let's try with 10 PCs:

In [24]:
# change the number of PCA components
n_components = 10
# instantiate and fit pipeline
pipe = Pipeline([('scaler', StandardScaler()),
                 ('PCA', PCA(n_components = n_components, whiten = True, random_state = random_state)), 
                 ('clustering', KMeansBoot(n_clusters = n_clusters,  
                                           n_boot = n_boot,
                                           random_state = random_state, 
                                           verbose = 0))]).fit(X_digits_tr)
In [25]:
# transform the data with the same preprocessing (only the first two PCs are plotted)
pipe_pca = Pipeline([('scaler', StandardScaler()), 
                  ('PCA', PCA(n_components = n_components, whiten = True, random_state = random_state))])

X_pca_tr = pipe_pca.fit_transform(X_digits_tr)

X_pca_ts = pipe_pca.transform(X_digits_ts)

# plot results
labels_prob_tr, id_unc_tr, labels_prob_ts, id_unc_ts = plot_pred(X_pca_tr, X_pca_ts, y_digits_tr, pipe['clustering'])
[Figure: KMeans centroids and labels (train set); KMeansBoot bootstrapped centroids and uncertain labels (train set); KMeansBoot uncertain labels (test set)]

Visually, the clustering looks terrible, but that is only because we are plotting the first 2 PCs of a clustering performed in 10 dimensions. If we check the numbers, we see that the clustering is actually more reliable, since the uncertainty has been reduced:

In [27]:
print('The percentage of uncertain train instances is {:.1f}.'.format(len(id_unc_tr)/X_digits_tr.shape[0]*100))
print('The percentage of uncertain test instances is {:.1f}.'.format(len(id_unc_ts)/X_digits_ts.shape[0]*100))
The percentage of uncertain train instances is 21.9.
The percentage of uncertain test instances is 23.9.
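
Because the true digit labels are available in this toy setting, the improvement can also be corroborated with an external metric such as the adjusted Rand index (adjusted_rand_score was imported above but not used until now); a quick sanity check:

# compare the cluster assignments against the true digit labels (train set)
print('Adjusted Rand index (train set): {:.3f}'.format(adjusted_rand_score(y_digits_tr, pipe['clustering'].labels_)))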

You can check for yourself that the best results (in terms of uncertainty) are obtained with n_components = 9.
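
A minimal sketch of such a scan, re-fitting the pipeline for each candidate number of components and tracking the fraction of uncertain train instances (runtime grows linearly with n_boot):

# scan the number of PCA components and record the train-set uncertainty rate
for nc in range(2, 16):
    p = Pipeline([('scaler', StandardScaler()),
                  ('PCA', PCA(n_components = nc, whiten = True, random_state = random_state)),
                  ('clustering', KMeansBoot(n_clusters = n_clusters, n_boot = n_boot,
                                            random_state = random_state, verbose = 0))]).fit(X_digits_tr)
    frac = len(p['clustering'].uncertain_ins_) / X_digits_tr.shape[0]
    print('n_components = {:2d}: {:.1f}% uncertain'.format(nc, frac * 100))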