Co-clustering Document-term Matrices on Azure ML

Francois Role – June 2015

My research department at Paris-Descartes University has recently been awarded an “Azure Grant” from Microsoft Research (see http://research.microsoft.com/en-us/projects/azure/). As part of this grant, I experimented with building machine learning experiments on Azure ML. More specifically, I wanted to investigate how to deploy experiments involving Python language modules in the field of unsupervised data mining. It was also an opportunity to test how the Scikit-learn library (http://scikit-learn.org/) could be used in combination with Azure ML.

I started by creating the very simple experiment shown in Figure 1.


Fig. 1 Our first experiment

To have a dataset to work on, I first imported a matrix file from my disk using the “New Dataset from local file” feature of Azure ML Studio. I then connected the newly created dataset to the first (left-most) input of a Python module whose content is described below.

In fact, the experiment basically boils down to invoking the Scikit-learn implementation of a well-known co-clustering algorithm, “Spectral Co-clustering” (http://scikit-learn.org/stable/modules/biclustering.html#biclustering), and feeding it the dataset (a matrix file in CSV format) we imported earlier into our workspace.

But what is co-clustering? Let’s assume we have a set of objects, each object having a set of features (for example, a user with the list of films he/she likes). We can organize this data as a matrix where the objects are the rows and the features are the columns. Co-clustering will create “co-clusters”: a co-cluster is a group of objects together with the set of features that best characterizes them. To continue with our example, each co-cluster would contain a set of users along with the set of films most appreciated by those users. We would typically have a co-cluster grouping children and cartoon movies, a co-cluster grouping sports fans and sports documentaries, and so on.
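To make this more concrete, here is a minimal, self-contained sketch (not part of the Azure ML experiment) that co-clusters a tiny, made-up user-film matrix with Scikit-learn; the values and the choice of two co-clusters are purely illustrative.

import numpy as np
from sklearn.cluster.bicluster import SpectralCoclustering

# Rows are users, columns are films; each (made-up) cell counts how much a user likes a film.
ratings = np.array([
    [5, 4, 0, 0],   # child 1: likes the two cartoons
    [4, 5, 0, 0],   # child 2
    [0, 0, 5, 4],   # sports fan 1: likes the two documentaries
    [0, 0, 4, 5],   # sports fan 2
])

model = SpectralCoclustering(n_clusters=2, random_state=0)
model.fit(ratings)

# row_labels_[i] gives the co-cluster of user i, column_labels_[j] that of film j.
print(model.row_labels_)     # e.g. [0 0 1 1]
print(model.column_labels_)  # e.g. [0 0 1 1]

The two children and the two cartoons end up in one co-cluster, the two sports fans and the two documentaries in the other, which is exactly the intuition described above.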

Here is the first line of the module:

from sklearn.cluster.bicluster import SpectralCoclustering

This line imports the relevant Scikit-learn class. The nice thing about using Python with Azure ML is that the most recent version of Scikit-learn is already integrated into the platform, so we do not have to struggle with installation problems. On the second line we start defining the azureml_main function, where the actual logic takes place. The function expects to be passed a dataframe, which raises a question: our data, as is usual in many machine learning settings, is in the form of a matrix, more specifically a document-term matrix where each row represents a document, each column represents a term, and each cell (i, j) indicates how many times term j occurs in document i. Figure 2 shows a small excerpt from this matrix.

Actually, this is not a problem: when we connect our CSV-formatted matrix to the Python module, it gets automatically converted into a Pandas dataframe. Figure 3 shows how Azure ML sees the small toy input matrix we will use as an example.

0.,1.,0.,0.,1. …
0.,1.,1.,0.,1. …

Fig. 2 The document-term matrix to be co-clustered


Fig. 3 How Azure ML sees our matrix after loading it
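Outside Azure ML, this conversion is easy to reproduce locally, which is handy for debugging. A quick sketch, assuming a hypothetical local copy of the matrix named doc_term.csv with no header row (as in Fig. 2):

import pandas as pd

# Load the CSV matrix into a Pandas dataframe, roughly what azureml_main receives as dataframe1.
dataframe1 = pd.read_csv("doc_term.csv", header=None)

print(dataframe1.shape)   # (number of documents, number of terms)
print(dataframe1.head())  # first few rows, comparable to Fig. 3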

The first real input problem occurs at a later stage. Scikit-learn’s implementation of Spectral Co-clustering accepts a dense Numpy array or a sparse Scipy matrix, whereas a Python Azure module is passed a Pandas dataframe. We thus have to first convert our dataframe into a Numpy array (using the .values attribute) before creating the model and using it to co-cluster our array (model.fit(a)). The code so far is given in Figure 4.

import numpy as np, sklearn as sk, pandas as pd
from sklearn.cluster.bicluster import SpectralCoclustering

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Convert the input Pandas dataframe into a Numpy array ...
    a = dataframe1.values
    # ... then create the co-clustering model and fit it to the array
    model = SpectralCoclustering(n_clusters=3, random_state=0)
    model.fit(a)

Fig. 4 The code so far. The input dataframe has to be converted into a Numpy array

You may already have noticed a problem with this code: the number of desired co-clusters is set statically (n_clusters=3), whereas our matrix may well contain a different number of co-clusters! We will come back to this important issue. Before that, assuming we can put up with a static setting, let us examine how we can return the co-clustering results to the outside world. Just as it accepts dataframes as input, a Python Azure module outputs a dataframe as its result (more precisely, a sequence of dataframes containing a single element). This means that we have to package the results returned by the co-clustering algorithm into a dataframe. Depending on the algorithm used, a co-clustering program can produce a lot of information. In the case of the Spectral Co-clustering algorithm, the most important results are the vectors indicating which co-cluster a given row (or column) belongs to. These vectors can be retrieved from the model using the following code:

row_labels = model.row_labels_
col_labels = model.column_labels_
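As an aside, these label vectors are easy to interpret once you have them. A quick local sketch (not part of the Azure ML module) that lists the size of each co-cluster, assuming row_labels and col_labels were obtained as above with n_clusters=3:

import numpy as np

for k in range(3):
    docs_in_k = np.where(row_labels == k)[0]    # indices of the documents in co-cluster k
    terms_in_k = np.where(col_labels == k)[0]   # indices of the terms in co-cluster k
    print("co-cluster %d: %d documents, %d terms" % (k, docs_in_k.size, terms_in_k.size))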

To output these vectors, we need to package them in a dataframe. The problem is that our matrix, like all non-square matrices, does not have the same number of rows and columns, whereas all the columns of a dataframe must have the same length. We thus have to perform some padding before returning our result dataframe. This is what the lines of code at the end of the module do. The complete code is given in Fig. 5 and the output dataframe for a small toy matrix is shown in Fig. 6.

import numpy as np, sklearn as sk, pandas as pd
from sklearn.cluster.bicluster import SpectralCoclustering

def azureml_main(dataframe1 = None, dataframe2 = None):
    # Convert the input dataframe into a Numpy array and co-cluster it
    a = dataframe1.values
    model = SpectralCoclustering(n_clusters=3, random_state=0)
    model.fit(a)
    # Vectors giving the co-cluster of each row (document) and each column (term)
    row_labels = model.row_labels_
    col_labels = model.column_labels_
    # Pad the shorter label vector with -1 so that both fit into one dataframe
    maxi = max(col_labels.size, row_labels.size)
    cols = col_labels
    rows = row_labels
    nb_cols = [cols.size] * maxi
    nb_rows = [rows.size] * maxi
    if col_labels.size != row_labels.size:
        if row_labels.size == maxi:
            cols = cols.tolist()
            cols = cols + [-1] * (maxi - col_labels.size)
        else:
            rows = rows.tolist()
            rows = rows + [-1] * (maxi - row_labels.size)
    df = pd.DataFrame({"nb_rows": nb_rows, "nb_cols": nb_cols, "rows": rows, "cols": cols})
    return df,

Fig. 5 The complete code for the Python module
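Although Azure ML normally calls azureml_main for us, nothing prevents us from exercising the function locally. Here is a small, hypothetical driver; the toy matrix below is made up, so the labels will not match Fig. 6 exactly, but the shape of the output dataframe is the same.

import pandas as pd

# A made-up 7x5 document-term matrix (7 documents, 5 terms).
toy = pd.DataFrame([
    [1, 0, 2, 0, 0],
    [2, 0, 1, 0, 0],
    [0, 3, 0, 1, 0],
    [0, 1, 0, 2, 0],
    [0, 2, 0, 1, 0],
    [0, 0, 0, 0, 3],
    [1, 0, 0, 0, 2],
])

# azureml_main returns a one-element tuple of dataframes, hence the unpacking.
result, = azureml_main(toy)
print(result)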

   cols  nb_cols  nb_rows  rows
0     0        5        7     2
1     2        5        7     2
2     2        5        7     1
3     1        5        7     1
4     1        5        7     1
5    -1        5        7     0
6    -1        5        7     0

Fig. 6 The output dataframe

However, we still have to fix a problem: in addition to an input file, the Python module requires some parameters to be specified, such as the number of co-clusters. In the current version of the module this number is set statically. How can we pass this parameter dynamically? I recalled from the documentation that the entry point function can take up to two input arguments. I therefore added a second input named “params.csv”: a dataframe specifying the desired number of co-clusters, along with another column indicating the name of the algorithm (just to demonstrate that other parameters can be passed in this way). We just have to add two lines at the very beginning of the function in order to extract the number of clusters (the n_clusters parameter) from this second input dataframe. These two lines are as follows:

def azureml_main(dataframe1 = None, dataframe2 = None):
    parameters = dataframe2  # dataframe2 corresponds to params.csv
    n_clusters = parameters.ix[0,'K']
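To make the change concrete, here is a sketch of the full module once the parameter is wired in. Apart from the two new lines and the replacement of the hard-coded 3 by the extracted value, it is identical to the code of Fig. 5 (the int() cast is a small precaution of mine in case the value comes back from the CSV as a float).

import numpy as np, sklearn as sk, pandas as pd
from sklearn.cluster.bicluster import SpectralCoclustering

def azureml_main(dataframe1 = None, dataframe2 = None):
    # New: read the requested number of co-clusters from the second input (params.csv)
    parameters = dataframe2
    n_clusters = int(parameters.ix[0, 'K'])
    a = dataframe1.values
    # The hard-coded value 3 is replaced by the parameter read from params.csv
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=0)
    model.fit(a)
    # Everything below is unchanged from Fig. 5
    row_labels = model.row_labels_
    col_labels = model.column_labels_
    maxi = max(col_labels.size, row_labels.size)
    cols = col_labels
    rows = row_labels
    nb_cols = [cols.size] * maxi
    nb_rows = [rows.size] * maxi
    if col_labels.size != row_labels.size:
        if row_labels.size == maxi:
            cols = cols.tolist() + [-1] * (maxi - col_labels.size)
        else:
            rows = rows.tolist() + [-1] * (maxi - row_labels.size)
    df = pd.DataFrame({"nb_rows": nb_rows, "nb_cols": nb_cols, "rows": rows, "cols": cols})
    return df,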



Fig. 7 New version of the experiment. In addition to the main input dataframe (the matrix), a second input dataframe is devoted to specifying the additional parameters (for example, the number of requested co-clusters).

Now recall that the output dataframe shown in Figure 6 has just been printed to the Python console. What if we want to save it to a file? The simplest way to do that is to add a Writer module; our experiment then looks like Figure 8. When creating your Writer module, in addition to specifying location and authentication details, pay attention to the “Azure blob storage write” option; otherwise, you will get an error when rerunning your experiment.


Fig. 8 The experiment after adding a Writer module.

After running the experiment, we can check that the results have been properly written to our Storage Account. In our case, the results are stored in a file named ‘classic3-results.csv’. Using the Azure SDK for Python (https://github.com/Azure/azure-sdk-for-python), I can quickly check whether the result file has indeed been written:

from azure.storage import BlobService
import os
import re

# Read the storage account name, account key and container name from a local config file
with open(os.getenv("HOME") + '/nb/azure.conf') as f:
    acc_name, acc_key, container_name = f.read().split()

blob_service = BlobService(account_name=acc_name, account_key=acc_key)
blobs = blob_service.list_blobs(container_name)

# Print the name and URL of every blob whose name matches the results file
for blob in blobs:
    if re.match(r".*resul*", blob.name):
        print(blob.name)
        print(blob.url)
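For reference, the azure.conf file read above is nothing more than the storage account name, the account key and the container name separated by whitespace, in that order. A hypothetical example (with placeholder values):

mystorageaccount  <account-key>  myazuremlcontainer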