Clustering with the K-means algorithm

July 7, 2020

Clustering is an unsupervised machine learning method: the data carry no labels while the algorithm learns the patterns. K-means is a common choice when you want a non-hierarchical method. For my thesis, I may need to perform a cluster analysis, so I started learning how to do one, and I thought that sharing this process would serve the purpose of my blog. As a first step, I will generate random data, try out the packages in Python, and try to understand the results. Real data is usually much messier, which is why I prefer randomly generated data during the learning phase.

For clustering, a few packages are necessary: NumPy, scikit-learn (imported as sklearn), and matplotlib. After installing them, I found a website where I can practise clustering. I started with an easy example (I should not scare myself, right? :)

# k-means clustering
from numpy import unique
from numpy import where
from sklearn.datasets import make_classification
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=6)
# define the model
model = KMeans(n_clusters=2)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
	# get row indexes for samples with this cluster
	row_ix = where(yhat == cluster)
	# create scatter of these samples
	pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
# show the plot
pyplot.show()

The resulting plot from this code:

The plot above shows two different clusters, though the boundary between them is not so clear… (the blue and the orange dots seem to be merged 🙁 ) As a note, this happened when I set random_state to 6. random_state: “Determines random number generation for dataset creation.” Now, let us see whether 3 clusters can divide our current data better. To do that, I will change only one line of code (see below):

#model = KMeans(n_clusters=2)
model = KMeans(n_clusters=3)

The resulting plot is:

It looks like having 3 clusters was not a very good idea for this data (the boundary between the green and the blue clusters is unclear). But the point is that we are free to try different numbers of clusters. Now let us create a different data set by setting random_state to 3. To do that, we have to change one line of the code (below). Let us also change the number of clusters back to 2.

#X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=6)
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, random_state=3)
model = KMeans(n_clusters=2)

The resulting plot is:

I think this is by far the best result. The two clusters do not merge at the boundary, and there really do seem to be two clusters (though the blue points are a bit more spread out).
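
So far the three results have been compared only by eye. One way to put a number on that impression could be the silhouette score from sklearn.metrics, which ranges from -1 to 1, with higher values suggesting tighter and better separated clusters. Below is a minimal sketch, not a definitive evaluation. The helper name silhouette_for and the n_init=10 and random_state=0 settings on KMeans are assumptions added here for illustration (they only make reruns give the same labels); the data sets and cluster counts are the ones used above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.metrics import silhouette_score


def silhouette_for(data_seed, n_clusters):
    """Fit K-means on the synthetic data from this post and return its silhouette score.

    silhouette_for is a hypothetical helper written for this sketch.
    """
    # same data set definition as above, only the random_state changes
    X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2,
                               n_redundant=0, n_clusters_per_class=1,
                               random_state=data_seed)
    # assumption: fixing n_init and random_state on KMeans for repeatable labels
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(X)
    return silhouette_score(X, labels)


# the three (random_state, n_clusters) combinations tried in this post
scores = {
    (seed, k): silhouette_for(seed, k)
    for seed, k in [(6, 2), (6, 3), (3, 2)]
}
for (seed, k), score in scores.items():
    print(f"random_state={seed}, n_clusters={k}: silhouette={score:.3f}")
```

If the visual impression holds, the combination with the clearest boundary should come out with the highest score, but the score is only one of several possible checks and is worth reading together with the plots.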

To be continued with more examples:)
