Fuzzy clustering with Python

July 10, 2020

Ready to implement fuzzy clustering in Python? For now, I prefer to work through an example with random data, because real data is usually messier and might not be ideal for a self-learning curve.

First, the libraries that we will need: skfuzzy, numpy, matplotlib and seaborn. Concerning “skfuzzy”, watch out for another, very similarly named library called sk_learn_fuzzy. Initially, I installed that one instead of “skfuzzy”, and “cluster” (as in fuzz.cluster) did not work in the code. Also, if you have several Python versions on your computer, you might run into problems installing “skfuzzy” (I had this problem, and on a number of forums I saw that quite a few people experienced it too). Note that the package is published on PyPI as scikit-fuzzy but imported as skfuzzy. The way I managed to install it was to type the command below into the terminal:

pip install -U scikit-fuzzy

Since I use PyCharm as my editor, I ran the command in its built-in terminal. Now, if you recall from my previous blog post, you need to know in advance how many clusters you plan to have for your data. Assume that we want 3 clusters. The first step of the algorithm was the random initialization of the data points into the pre-planned number of clusters. To get there, we need our data points (which we will generate in a later step), so we define 3 cluster centres and their sigmas (the standard deviations in x and y, which control how widely the points scatter around each centre). First, let us import the libraries after installing them.

import skfuzzy as fuzz
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
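
If you are unsure whether the right package ended up in the interpreter you are running, a quick check like the one below can help. This is only a minimal sketch using the standard library; the helper name has_skfuzzy_cluster is made up for this post and is not part of any library.

```python
import importlib.util


def has_skfuzzy_cluster():
    """Return True if 'skfuzzy' and its 'cluster' submodule can be found.

    Hypothetical helper for illustration; it only looks the modules up
    in the interpreter that runs this script.
    """
    try:
        if importlib.util.find_spec("skfuzzy") is None:
            return False
        # Looking up a submodule imports the parent package first.
        return importlib.util.find_spec("skfuzzy.cluster") is not None
    except ImportError:
        return False


print("skfuzzy.cluster available:", has_skfuzzy_cluster())
```

If this prints False inside PyCharm but pip reports the package as installed, the terminal and the project are probably using different Python interpreters.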

Then, let us define the cluster centres and the sigmas:

# Define three cluster centers
centers = [[1, 3],
           [2, 2],
           [3, 8]]

# Define three cluster sigmas in x and y, respectively
sigmas = [[0.3, 0.5],
          [0.5, 0.3],
          [0.5, 0.3]]

For the data visualization, both the matplotlib and seaborn libraries will be used. Seaborn is built on top of matplotlib, and its purpose is to produce visually appealing figures and plots. If you are an R user, you can compare it with ggplot2. Now, let us set up the general styling for the data visualization:

plt.rcParams['font.size'] = 12
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['ytick.labelsize'] = 15
plt.rcParams['legend.fontsize'] = 15
plt.rcParams['figure.titlesize'] = 20
plt.rcParams['figure.figsize'] = (8,7)
colors = ['b', 'orange', 'g', 'r', 'c', 'm', 'y', 'k', 'Brown', 'ForestGreen']

After that, we need to generate our random data. To do that, the numpy library will be used. To understand the code below, it is useful to look up a few functions first: np.zeros(), np.hstack(), np.vstack(), np.random.standard_normal(), and the use of enumerate together with zip.

# Start from empty arrays; 200 points per cluster are appended in the loop
xpts = np.zeros(0)
ypts = np.zeros(0)
labels = np.zeros(0)
for i, ((xmu, ymu), (xsigma, ysigma)) in enumerate(zip(centers, sigmas)):
    xpts = np.hstack((xpts, np.random.standard_normal(200) * xsigma + xmu))
    ypts = np.hstack((ypts, np.random.standard_normal(200) * ysigma + ymu))
    labels = np.hstack((labels, np.ones(200) * i))
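
If the loop above looks dense, the small sketch below shows each helper on its own. It uses tiny toy arrays that I made up purely for illustration (demo_centers, demo_sigmas and the other names are not part of the clustering example).

```python
import numpy as np

# Toy inputs, for illustration only
demo_centers = [[1, 3], [2, 2]]
demo_sigmas = [[0.3, 0.5], [0.5, 0.3]]

# zip pairs each centre with its sigmas; enumerate adds a running index i.
for i, ((xmu, ymu), (xsigma, ysigma)) in enumerate(zip(demo_centers, demo_sigmas)):
    print(i, xmu, ymu, xsigma, ysigma)

a = np.zeros(2)                             # two zeros
b = np.ones(3) * 5                          # three fives
joined = np.hstack((a, b))                  # 1-D arrays joined end to end
stacked = np.vstack((joined, joined * 2))   # arrays stacked as rows

# Scaling standard-normal samples by sigma and shifting by mu gives
# samples centred on mu with standard deviation sigma.
samples = np.random.standard_normal(200) * 0.5 + 2.0

print(joined.shape, stacked.shape, samples.shape)
```

The same pattern is what builds xpts, ypts and labels: each pass through the loop draws 200 samples around one centre and appends them with np.hstack().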

To visualize what we have created, use the code below:

# Visualize the test data
fig0, ax0 = plt.subplots()
for label in range(3):
    ax0.plot(xpts[labels == label], ypts[labels == label], '.')
ax0.set_title('Test data: 200 points per cluster.')

Before moving on to the code below, check out the arguments of fuzz.cluster.cmeans. One of them is m, the fuzziness parameter mentioned before, which is generally set to 2; it is set to 2 here as well (that is the 2 passed just before “error”). Our purpose is to try 2 to 10 cluster centres and see which number fits best, so we run the clustering in a for loop, once per number of centres.

# Set up the loop and plot
# The for loop fits one model per subplot, with 2 to 10 centers
fig1, axes1 = plt.subplots(3, 3, figsize=(10, 10))
alldata = np.vstack((xpts, ypts))
fpcs = []

for ncenters, ax in enumerate(axes1.reshape(-1), 2):
    cntr, u, u0, d, jm, p, fpc = fuzz.cluster.cmeans(
        alldata, ncenters, 2, error=0.005, maxiter=1000, init=None)

    # Store fpc values for later
    fpcs.append(fpc)

    # Plot assigned clusters, for each data point in training set
    cluster_membership = np.argmax(u, axis=0)
    for j in range(ncenters):
        ax.plot(xpts[cluster_membership == j],
                ypts[cluster_membership == j], '.', color=colors[j])

    # Mark the center of each fuzzy cluster
    for pt in cntr:
        ax.plot(pt[0], pt[1], 'rs')

    ax.set_title('Centers = {0}; FPC = {1:.2f}'.format(ncenters, fpc), size=12)
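
To see what the fuzziness parameter m does to the membership matrix u, here is a minimal sketch of the standard fuzzy c-means membership formula written with numpy only. It is my own illustration, not the code that skfuzzy runs internally; the function name fcm_memberships and the toy points are assumptions made for this example. The array shapes follow the convention used above: data is (features, points) and u is (clusters, points).

```python
import numpy as np


def fcm_memberships(data, centers, m=2.0):
    """Memberships u[i, j] of point j in cluster i for fixed centres.

    data:    array of shape (n_features, n_points)
    centers: array of shape (n_clusters, n_features)
    Illustrative helper, not part of skfuzzy.
    """
    # Euclidean distance from every centre to every point: (n_clusters, n_points)
    d = np.linalg.norm(data.T[None, :, :] - centers[:, None, :], axis=2)
    d = np.fmax(d, np.finfo(np.float64).eps)   # avoid division by zero
    inv = d ** (-2.0 / (m - 1.0))
    # Normalise so the memberships of each point sum to 1
    return inv / inv.sum(axis=0, keepdims=True)


# Toy data: the points (1, 3), (2, 2) and (3, 8)
toy_data = np.array([[1.0, 2.0, 3.0],
                     [3.0, 2.0, 8.0]])
toy_centers = np.array([[1.0, 3.0],
                        [3.0, 8.0]])

u_m2 = fcm_memberships(toy_data, toy_centers, m=2.0)
u_m3 = fcm_memberships(toy_data, toy_centers, m=3.0)

# Hard assignment, as in the plotting loop: the cluster with the largest membership
hard = np.argmax(u_m2, axis=0)
print(hard)
print(u_m2[0, 1], u_m3[0, 1])
```

A larger m pulls the memberships towards each other, so the point (2, 2) belongs less exclusively to the first centre with m=3 than with m=2.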


fig2, ax2 = plt.subplots()
# np.r_[2:11] gives 2, 3, ..., 10: the number of centers
# used in each of the 9 subplots.
ax2.plot(np.r_[2:11], fpcs, color='#731810')
ax2.set_title("How does the number of clusters change the FPC?")
ax2.set_xlabel("Number of centers")
ax2.set_ylabel("Fuzzy partition coefficient")
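
For intuition about what the fuzzy partition coefficient measures, the sketch below computes it directly from a membership matrix: the average of the squared memberships summed over clusters, which is 1 for a completely crisp partition and 1/c for a completely fuzzy one with c clusters. The function name partition_coefficient and all the numbers are made up for illustration; in particular, illustrative_fpcs is not the output of the run above.

```python
import numpy as np


def partition_coefficient(u):
    """Fuzzy partition coefficient of a (n_clusters, n_points) membership matrix."""
    n_points = u.shape[1]
    return np.trace(u @ u.T) / n_points


# Every point belongs fully to one cluster
crisp = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
# Every point is split evenly between two clusters
fuzzy = np.full((2, 3), 0.5)

fpc_crisp = partition_coefficient(crisp)
fpc_fuzzy = partition_coefficient(fuzzy)
print(fpc_crisp, fpc_fuzzy)

# Picking the number of centres with the highest FPC.
# These values are placeholders, NOT results from this post.
illustrative_fpcs = [0.91, 0.74, 0.66]      # for 2, 3 and 4 centres
best_ncenters = 2 + int(np.argmax(illustrative_fpcs))
print(best_ncenters)
```

The same argmax applied to the fpcs list collected in the loop gives the number of centres that the FPC favours.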

The resulting figures are:

In short, we started with the assumption of 3 clusters; however, judging by the FPC, it turns out that the data is better described by only 2 clusters.
