The following post was contributed by Sam Triolo, system security architect and data scientist
In Data Science, there are both supervised and unsupervised machine learning algorithms.
In this analysis, we will use the unsupervised K-means machine learning algorithm. The advantage of using the K-means clustering algorithm is that it’s conceptually simple and useful in a number of scenarios. The advantage of its unsupervised form is that it’s easy to learn and will tell you something about your data (as opposed to a supervised learning algorithm, where you first teach the algorithm something about your data).
Much of Data Science requires some knowledge of Statistics, some knowledge of other Mathematics, and some knowledge of Computers and programming. I put reference links at the end to a number of great learning resources for these topics.
As for the logic of the K-means algorithm, an oversimplified, step-by-step example is located here. I recommend taking a look at it after you finish reading here if it would help reinforce the concepts.
Conceptualizing the K-means Clustering Algorithm
The idea is based on a few basic concepts. As an over-simplified example, let’s say you have two groups of people – Group A and Group B.
Group A has people in it that are clearly taller and weigh more than those in Group B. If you were measuring the height and weight of each group of people, then your “features” are “height” and “weight.” The average (“mean”) weight and the average height of all the people in Group A will be larger than those in Group B. Each person who has their height and weight measured is called a “sample.”
If you were to graph this, you could plot the height measurement of each person as the x-value and the weight of each person as the y-value. Since we mentioned before that there was a clear separation, you would expect to see one “cluster” or grouping of points closely together (Group A), separated clearly from another set of points grouped closely together (Group B).
The average or mean of all the height measurements is the mean x-value, and the average or mean of all the weight measurements is the mean y-value. If you plot the mean x-value and the mean y-value, that point is the “center” (or centroid) of your cluster. This cluster center then serves as a reference point for comparison: if the groups are clearly separated, you would expect a small person’s height and weight to be closer to the mean of the small people’s group.
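The centroid calculation described above can be sketched in a few lines of NumPy (the heights and weights here are made up for illustration):

```python
import numpy as np

# Hypothetical Group A: each row is a person (sample);
# the columns are height (inches) and weight (pounds)
group_a = np.array([[72, 215],
                    [73, 185],
                    [71, 200]])

# The centroid is simply the mean of each feature (column):
# the mean height becomes the x-value, the mean weight the y-value
centroid = group_a.mean(axis=0)
print(centroid)  # [ 72. 200.]
```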
The K-means clustering algorithm attempts to show which group each person belongs to. In this case, we’ve already established there is a clear grouping of people, but in other situations, and with more complex data, the associations will not be so clear. You can tell the algorithm how many groups you want it to have and also how you want it to calculate the groups, but we will not cover that here.
In our example, the K-means algorithm would attempt to group those people by height and weight, and when it is done you should see the clustering mentioned above.
The K-means clustering algorithm does this by calculating the distance between a point and the current group average of each feature. If you start with one person (sample), then the average height is their height and the average weight is their weight. The algorithm then evaluates another sample (person). If you asked for two groups, then after two steps person one and person two would each be their own group. The algorithm then takes another person (person 3) and measures the distance on a graph between that person’s height and weight (x- and y-values) and the current averages for Group A vs. Group B. Whichever group the person is closest to, they are added to that group, and a new mean for that group is calculated. It then does this with every other sample / person, adding each one to whichever group their measurements are closest to.

To make the data easier to understand, you can plot it on a graph and see who belongs to which group. This is usually done by coloring the points of each group, so you can clearly see that (for example) person / sample 1 belongs to Group B (or Group A) because all the points in that group are the same color.
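A single assignment step from the description above can be sketched directly. The numbers are made up for illustration, and this shows only the nearest-mean logic, not the full algorithm:

```python
import numpy as np

# Current group means (height, weight) for two hypothetical groups
mean_a = np.array([72.0, 200.0])  # Group A: taller / heavier
mean_b = np.array([63.0, 130.0])  # Group B: shorter / lighter

# A new person (sample) to assign
person = np.array([64.0, 140.0])

# Euclidean distance on the graph from the person to each group's current mean
dist_a = np.linalg.norm(person - mean_a)
dist_b = np.linalg.norm(person - mean_b)

# The person joins whichever group is closer; that group's mean
# would then be recalculated to include them
group = "A" if dist_a < dist_b else "B"
print(group)  # B
```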
Performing the analysis yourself
Roughly this is broken down into a few high-level steps:
- acquiring your data
- cleaning your data
- running the algorithm
- plotting your data (data visualization)
- analyzing your findings
I performed the analysis on a remote server hosted in the cloud. This lets me use a Linux server specifically configured for the task, avoids taxing the resources on my local machine, and means I don’t have to worry whether it has enough resources for larger analyses. If you want to perform this on a local machine, I recommend downloading Enthought Canopy or Anaconda (for Windows / Mac OS X hosts).
I used IPython to visualize the graph plots on the remote server. I used the Bokeh visualization libraries because they provide controls within IPython to pan and zoom. If you use Matplotlib on a remote server within IPython, the graph displayed is a static image, which can limit the value of the visualization on more complex analyses.
One other thing I should mention – PCA (principal components analysis) is a common operation performed for dimensionality reduction. Briefly, if you have two measurements / features (height and weight) or even three, you can visualize them on a 2D or 3D graph. Each “feature” is a dimension. If you have (for example) six things you are measuring, to display them on a 2D or 3D graph you will need to perform dimensionality reduction to accurately depict them. PCA is often used for this, but is not covered here.
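As a brief sketch of what that dimensionality reduction looks like with scikit-learn’s PCA (the six-feature data here is random, just to show the shape change from six features down to two):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
data = rng.rand(100, 6)  # 100 samples, each with 6 features

# Project the 6 features down to 2 so the samples can be shown on a 2D graph
pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print(reduced.shape)  # (100, 2)
```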
As for obtaining and cleaning your data, this depends on what data you are getting and how you will use it, so I will not go into the specifics of that. Scikit-learn has some great, already cleaned datasets that come with it. I include an example below (with code) using the Iris dataset. It’s straightforward and small, with many tutorials on using it in different algorithms.
For simplicity, I recommend having only two features when you start, such as height and weight.
- make sure you have Python set up somewhere with the following libraries / packages
- IPython notebook, numpy, bokeh, scikit-learn (sklearn)
- get some data – by making up some heights and weights or using the provided scikit-learn datasets
- run the algorithm and visualize the results – Scikit-learn K-means
- this tells you which group each person / data point belongs to, and
- what the average / mean / cluster center is for each group
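The steps above can be sketched end to end with scikit-learn, using made-up heights and weights (the parameter values here, such as asking for two clusters, are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (height, weight) samples: three tall / heavy people,
# three short / light people
data = np.array([[72, 215], [73, 185], [71, 200],
                 [63, 130], [62, 120], [64, 140]])

# Ask the algorithm for two groups
kmean = KMeans(n_clusters=2, n_init=10, random_state=0)
kmean.fit(data)

print(kmean.labels_)           # which group each person / data point belongs to
print(kmean.cluster_centers_)  # the average / mean (height, weight) of each group
```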
Taking a step back, some of the main values this algorithm provides (in its unsupervised form) are:
- shows you which groups / clusters a sample (in this case, a person) belongs to
- this can be valuable with more complex data, such as when a person you expected to be in one group turns out to measure closer to another group
- identify outliers / anomalies within a group
- this is valuable if you want to know which sample / person may be an abnormality within a group – they are extra tall, or otherwise unusual within their group for a reason you might want to investigate
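One common way to flag such outliers is to measure how far each sample sits from the center of its own cluster. This is a sketch (the distance check is not built into scikit-learn’s KMeans itself, and the data is made up, with the last person unusually tall for their group):

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (height, weight) data; the last person is light like Group B
# but much taller than the rest of that group
data = np.array([[72, 200], [73, 202], [71, 198],
                 [62, 125], [63, 130], [64, 128], [71, 127]])

kmean = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Look up the center of the cluster each sample was assigned to,
# then measure each sample's distance from that center
centers = kmean.cluster_centers_[kmean.labels_]
distances = np.linalg.norm(data - centers, axis=1)

# The sample farthest from its own cluster center is the most unusual
print(np.argmax(distances))  # 6 -- the extra-tall person in the light group
```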
Essentially what the algorithm is doing is coloring your data points so you know which group each person or sample belongs to.
See the pseudo-code below for an example based on above. This is meant to show you the high-level logic of how this is coded.
#the below assumes a data set that looks something like this; height would be
#your x coordinate, and weight your y coordinate
#Name   Height  Weight
#Bill   6       215
#Jeff   6.1     185
#Susan  5       135

#you would use a numpy array, which looks a little like a "list of lists":
height_weight_data = np.array([[6, 215], [6.1, 185], [5, 135]])

#perform k-means analysis on our data, so we know which
#group / color each person belongs to
kmean = KMeans(n_clusters=2)
kmean.fit(height_weight_data)

#plot the centers of our clusters
plot.circle_cross(kmean.cluster_centers_)

#now plot each person, colored according to group membership
for i, person in enumerate(height_weight_data):
    #"labels_" tells us which cluster each point is a member of
    if kmean.labels_[i] == 0:
        plot.circle(x=person[0], y=person[1], size=15, color="blue")
    if kmean.labels_[i] == 1:
        plot.circle(x=person[0], y=person[1], size=15, color="red")

show(plot)
Here’s an actual code example using the Iris dataset, which is included with the Scikit-learn package. The Iris dataset has 150 samples (flowers that were picked), with each flower having four measurements (features). We will only use two for simplicity: petal length and petal width. Each flower belongs to one of three species: setosa (smallest petals), versicolor (medium), or virginica (largest). After running the code below, you’ll see which group the algorithm determined each flower belonged to, based on which group’s average measurement it was closest to. The plots were done inside IPython, using the Bokeh visualization libraries. The Bokeh libraries provide the controls at the top-right of the image that allow me to re-size / pan / zoom on the plot.
#!/usr/bin/python
from sklearn.cluster import KMeans
import numpy as np
import bokeh.plotting
from bokeh.plotting import figure
from sklearn import datasets

bokeh.plotting.output_notebook()  #initialize bokeh in ipython

#the iris dataset is 150 samples, each with four features
#we only want petal length and petal width
iris = datasets.load_iris()
#get only petal features, which are the third and fourth values in each sample
petal_data = iris.data[:, 2:]

#perform k-means analysis on iris data
#there are only 3 iris flower groups: 'setosa', 'versicolor', 'virginica'
kmean = KMeans(n_clusters=3)  #n_clusters asks for only 3 groupings
kmean.fit(petal_data)

#initialize our bokeh plot
plot = figure(width=500, height=500, title='Iris Petals',
              x_axis_label="Petal Length", y_axis_label="Petal Width")

#plot centroid / cluster center / group mean for each group
clus_xs = []
clus_ys = []

#we get the cluster x / y values from the k-means algorithm
for entry in kmean.cluster_centers_:
    clus_xs.append(entry[0])
    clus_ys.append(entry[1])

#the cluster center is marked by a circle, with a cross in it
plot.circle_cross(x=clus_xs, y=clus_ys, size=40, fill_alpha=0,
                  line_width=2, color=['red', 'blue', 'purple'])
plot.text(text=['setosa', 'versicolor', 'virginica'],
          x=clus_xs, y=clus_ys, text_font_size='30pt')

#begin plotting each petal length / width
#We get our x / y values from the original plot data.
#The k-means algorithm tells us which 'color' each plot point is,
#and therefore which group it is a member of.
for i, sample in enumerate(petal_data):
    #"labels_" tells us which cluster each plot point is a member of
    if kmean.labels_[i] == 0:
        plot.circle(x=sample[0], y=sample[1], size=15, color="red")
    if kmean.labels_[i] == 1:
        plot.circle(x=sample[0], y=sample[1], size=15, color="blue")
    if kmean.labels_[i] == 2:
        plot.circle(x=sample[0], y=sample[1], size=15, color="purple")

bokeh.plotting.show(plot)
If you’re interested in the Iris dataset, here are additional plots of the four Iris features: petal length, petal width, as well as sepal length and sepal width:
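If you want to try those extra plots yourself, here is a sketch of pulling the sepal features out of the same dataset (sepal length and sepal width are the first two values in each Iris sample):

```python
from sklearn import datasets

iris = datasets.load_iris()
sepal_data = iris.data[:, :2]  #sepal length and sepal width
petal_data = iris.data[:, 2:]  #petal length and petal width

print(sepal_data.shape, petal_data.shape)  # (150, 2) (150, 2)
```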
- Good scikit-learn introduction to machine learning: http://www.astroml.org/sklearn_tutorial/general_concepts.html
- IPython: http://ipython.org/notebook.html
- K-means and PCA on the iris dataset: http://www.dummies.com/how-to/content/how-to-visualize-the-clusters-in-a-kmeans-unsuperv.html