"Many real-world phenomena are described by pairwise proximity data modeling the interactions between the entities of a system. This is in contrast to the more common situation where each data sample is given as a feature vector. Even though clustering of proximity data may be performed directly on the data matrix, there are some advantages to embedding the data into a vector space. For example, it enables the use of standard preprocessing techniques such as denoising or dimensionality reduction. In this coding exercise, we will explore the technique called _Constant Shift Embedding_ for restating pairwise clustering problems in vector spaces [1] while preserving the cluster structure. We will apply the algorithm described in [1] to cluster groups of research community members based on the email correspondence matrix. The data and its description are given in [2].\n",

"\n",

"### References \n",

"\n",

"[1] [Optimal cluster preserving embedding of nonmetric proximity data](https://ieeexplore.ieee.org/document/1251147)\n",
"\n",

"[2] [email-Eu-core network dataset (SNAP)](https://snap.stanford.edu/data/email-Eu-core.html)\n",

"The number of nodes is hardcoded for simplicity (taken from [2]):"

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"NUM_NODES = 1005"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"We load the file containing the list of interactions between the community members (nodes). Our data matrix represents an undirected graph that connects two nodes if at least one email was sent between the two corresponding community members. Thus our data matrix is essentially an adjacency matrix."

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"import numpy as np\n",

"\n",

"# Initialize the data matrix, which will be an adjacency matrix\n",

"DATA = np.zeros((NUM_NODES, NUM_NODES))\n",

"\n",

"# fill out the symmetric adjacency matrix\n",

"with open(\"email-Eu-core.txt\") as file:\n",

" for line in file:\n",

" pair = [int(x) for x in line.split()]\n",

" DATA[pair[0], pair[1]] = 1\n",

" DATA[pair[1], pair[0]] = 1"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"Note that DATA is an adjacency matrix of the email graph; it is not claimed to be a proper dissimilarity matrix as required by the CSE algorithm. You are therefore free to perform any manipulations needed to construct a suitable (dis)similarity matrix for the further analysis."

]

},
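One common conversion between a similarity matrix and a (squared) dissimilarity matrix is $D_{ij} = S_{ii} + S_{jj} - 2S_{ij}$, which is compatible with CSE's centering step (see [1] for details). A minimal sketch, where the helper name `similarity_to_dissimilarity` and the toy graph are our own illustrative choices:

```python
import numpy as np

def similarity_to_dissimilarity(S):
    # D_ij = S_ii + S_jj - 2*S_ij: symmetric with zero diagonal.
    # Off-diagonal entries may be negative; CSE handles that via the shift.
    d = np.diag(S)
    return d[:, None] + d[None, :] - 2.0 * S

# Toy adjacency matrix of a 4-node graph (two disjoint edges).
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)

D = similarity_to_dissimilarity(A)
```

Since an adjacency matrix has a zero diagonal, this yields `D = -2 * A`; the negative entries are unproblematic because CSE only requires symmetry.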

{

"cell_type": "markdown",

"metadata": {},

"source": [

"Next we define a class which contains the main functionality - TO BE IMPLEMENTED."

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"class ConstantShiftEmbedding:\n",

" \"\"\"Template class for Constant Shift Embedding (CSE)\n",

" \n",

" Attributes:\n",

" PMAT (np.ndarray): Proximity matrix used for calculating the embeddings.\n",

" S (np.ndarray): Similarity matrix.\n",

" D (np.ndarray): Dissimilarity matrix.\n",

" \n",

" \"\"\"\n",

" \n",

" def __init__(self):\n",

" self.PMAT = None\n",

" self.S = None\n",

" self.D = None\n",

" # Add/change parameters, if necessary.\n",

" \n",

" def fit(self, PMAT):\n",

" \"\"\"Calculate the similarity/dissimilarity matrix and all\n",

" the necessary variables for calculating the embeddings.\n",

" \n",

" Args:\n",

" PMAT (np.ndarray): proximity matrix\n",

" \"\"\"\n",

"\n",

" # Save data\n",

" self.PMAT = PMAT\n",

" \n",

" ## IMPLEMENT THIS METHOD\n",

" \n",

" \n",

" def get_embedded_vectors(self, p):\n",

" \"\"\"Return embeddings\n",

" \n",

" Args:\n",

" p (int): cut-off value in the eigenspectrum\n",

" \n",

" Returns:\n",

" Xp (np.ndarray): embedded vectors\n",

" \n",

" \"\"\"\n",

" \n",

" ## IMPLEMENT THIS METHOD"

]

},
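For reference, the core of CSE can be sketched in a few lines: center the dissimilarity matrix, shift its spectrum so it becomes positive semidefinite, and read coordinates off the top-$p$ eigenpairs. This standalone sketch (the function name `cse_embed` is our own) is not a substitute for the template class above, just the underlying math:

```python
import numpy as np

def cse_embed(D, p):
    """Constant Shift Embedding sketch: returns n x p coordinates whose
    pairwise squared distances equal D up to a constant off-diagonal shift."""
    n = D.shape[0]
    Q = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    Sc = -0.5 * Q @ D @ Q                      # centered similarity
    vals, vecs = np.linalg.eigh(Sc)            # ascending eigenvalues
    shifted = vals - vals[0]                   # shift by lambda_min, now >= 0
    idx = np.argsort(shifted)[::-1][:p]        # top-p eigenpairs
    return vecs[:, idx] * np.sqrt(shifted[idx])

# Sanity check: for a genuine squared-Euclidean D the shift is ~0 and
# the embedding reproduces the original distances.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
Xp = cse_embed(D, 2)
D_rec = np.sum((Xp[:, None, :] - Xp[None, :, :]) ** 2, axis=2)
```

The eigenvalue shift only adds the same constant to every off-diagonal dissimilarity, which is exactly why the cluster structure is preserved.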

{

"cell_type": "markdown",

"metadata": {},

"source": [

"\n",

"<h2 style=\"background-color:#f0b375;\">\n",

"Section 4.0 \n",

"<span style=font-size:50%> Complete all problems in this section to get a pass on this exercise. </span>\n",

"</h2>\n",

"\n",

"<p style=\"background-color:#adebad;\">Describe briefly and concisely the model given in [1]. Explain the main steps of the <i>Constant Shift Embedding</i> algorithm. See <a href=\"https://medium.com/ibm-data-science-experience/markdown-for-jupyter-notebooks-cheatsheet-386c05aeebed\">markdown cheatsheet</a> for text editing.</p>\n",

"\n",

"- **Constant Shift Embedding**: (Put your answer here)\n"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"<p style=\"background-color:#adebad;\">\n",

" Implement Constant Shift Embedding. We start off by making an instance of the corresponding class.\n",

"</p> "

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"CSE = ConstantShiftEmbedding()"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"<p style=\"background-color:#adebad;\">\n",

" Fit the data matrix. The _fit(...)_ method computes the necessary variables, which are later used to produce the embeddings [1].\n",

"</p> "

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"CSE.fit(DATA)"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"<h2 style=\"background-color:#f0b375;\">\n",

"Section 4.5 \n",

"<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 4.5 </span>\n",

"</h2>\n",

"\n",

"<p style=\"background-color:#adebad;\">\n",

" Next, try to find an approximately optimal $p = p^*$, a cut-off value which removes noise from the data. To do that, produce an eigenspectrum plot as shown in figure 4a of [1] and briefly explain your choice of $p^*$.\n",

"</p>"

]

},
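As a sketch of what this step might look like (the random toy graph, the 90%-of-spectrum-mass heuristic, and all variable names are our own illustrative choices, not the method prescribed in [1]):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Toy symmetric dissimilarity matrix built from a random graph.
rng = np.random.default_rng(0)
A = np.triu(rng.random((30, 30)) < 0.2, 1)
A = (A | A.T).astype(float)
D = -A                                   # any symmetric matrix works for CSE

n = D.shape[0]
Q = np.eye(n) - np.ones((n, n)) / n
vals = np.linalg.eigvalsh(-0.5 * Q @ D @ Q)
spectrum = np.sort(vals - vals.min())[::-1]     # shifted, descending

# One possible heuristic: keep eigenvalues covering 90% of the total mass.
cum = np.cumsum(spectrum) / spectrum.sum()
p_star = int(np.searchsorted(cum, 0.90)) + 1

plt.plot(spectrum, marker="o")
plt.axvline(p_star - 1, color="red", linestyle="--", label=f"p* = {p_star}")
plt.xlabel("eigenvalue index")
plt.ylabel("shifted eigenvalue")
plt.legend()
plt.savefig("eigenspectrum.png")
```

In practice you would look for an elbow in the plotted spectrum rather than rely on a fixed threshold.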

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## Compute eigen-spectrum"

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## Determine a good cut-off value (and write some lines to explain your choice)\n",

"p_opt = 1 ## change accordingly\n",

"print(\"Chosen cut-off value is: \", p_opt)"

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## Plot spectrum and indicate the cut-off value on the spectrum"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"<h2 style=\"background-color:#f0b375;\">\n",

"Section 5.0 \n",

"<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 5.0 </span>\n",

"</h2>\n",

"\n",

"<p style=\"background-color:#adebad;\">\n",

" Plot the distance matrices for both the denoised ($p = p^*$ from the previous step) and the original versions, as shown in figure 5 in [1]. Note that the distance matrix contains the pairwise distances between every two points in the dataset ($d_{ij} = dist(x_i, x_j)$).<br>\n",

" Run the K-MEANS algorithm for a varying number of clusters K on the embedded vectors derived from CSE. You may use the sklearn implementation of K-MEANS. To make the aforementioned plots meaningful, sort the nodes according to their cluster assignments for every number of clusters K (see figure 5). For now, there is no need to include the actual ground-truth labels given in [2].\n",

"</p>"

]

},
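Sorting a distance matrix by cluster labels might look like this (a toy sketch with our own helper name; in the exercise `X` would be the CSE embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

def sorted_distance_matrix(X, k, seed=0):
    # Cluster, then reorder rows/columns so same-cluster points are adjacent,
    # which makes the block structure visible when the matrix is plotted.
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    order = np.argsort(labels, kind="stable")
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    return dist[np.ix_(order, order)], labels[order]

# Toy data: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)),
               rng.normal(5.0, 0.1, (10, 2))])
D_sorted, sorted_labels = sorted_distance_matrix(X, 2)
```

Plotting `D_sorted` with `plt.imshow` should then show dark diagonal blocks for compact clusters.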

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## Distance matrices"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"<h2 style=\"background-color:#f0b375;\">\n",

"Section 5.5 \n",

"<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 5.5 </span>\n",

"</h2>\n",

"\n",

"<p style=\"background-color:#adebad;\">\n",

" Producing 2D and 3D embeddings allows us to nicely visualize the generated clusters. Now calculate the embeddings for $p = 2$ (2D case) and $p = 3$ (3D case) and plot the clusterings for a few values of K. Alternatively, you could use $p = p^*$ for gentler denoising, cluster the denoised embeddings, and only then apply a dimensionality reduction technique (PCA, LLE, t-SNE, etc.; figure out what works for you) to obtain a plot in 2- or 3-dimensional space. As an example, see figure 6 (b) from [1], where CSE is combined with PCA.\n",

"</p>"

]

},
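A sketch of the second route (cluster in the full $p$-dimensional space, then project with PCA only for plotting); the toy blobs here stand in for the output of `CSE.get_embedded_vectors(p_opt)`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Toy 5-dimensional "embeddings" standing in for the CSE output.
rng = np.random.default_rng(2)
Xp = np.vstack([rng.normal(0.0, 0.2, (15, 5)),
                rng.normal(3.0, 0.2, (15, 5))])

# Cluster in the original p-dimensional space...
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Xp)

# ...and reduce to 2D only for visualization.
X2 = PCA(n_components=2).fit_transform(Xp)
# In a notebook: plt.scatter(X2[:, 0], X2[:, 1], c=labels)
```

Clustering before the projection avoids basing the cluster assignments on coordinates that were discarded purely for display purposes.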

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## Get embeddings, run K-MEANS and generate plots"

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## p = 2"

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## p = 3"

]

},

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## Choose p > 3, for example p = p_opt, to compute CSE embeddings\n",

"## First, cluster the computed p-dimensional embeddings and then project them onto 2-dimensional space\n",

"## for visualization using PCA, LLE, t-SNE or something else"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"<h2 style=\"background-color:#f0b375;\">\n",

"Section 6.0 \n",

"<span style=font-size:50%> Complete all problems in this and previous sections to get a grade of 6.0 </span>\n",

"</h2>\n",

"\n",

"<p style=\"background-color:#adebad;\">\n",

"Finally, to evaluate the quality of the clusters derived above, let's compare our predictions with the ground truth. We will use the actual member-institution mappings given in [2]. You can reuse code from the previous coding exercises to align the cluster labels with the ground truth.\n",

"</p>"

]

},
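One common way to align predicted cluster ids with ground-truth ids is the Hungarian algorithm applied to the confusion matrix. A sketch (the helper name `align_labels` is our own; it assumes both label arrays use ids 0..K-1 with the same K):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_labels(pred, true):
    # Build the confusion matrix C[p, t] = number of points with
    # predicted label p and true label t, then find the label
    # permutation that maximizes the total overlap.
    K = int(max(pred.max(), true.max())) + 1
    C = np.zeros((K, K), dtype=int)
    for p_lbl, t_lbl in zip(pred, true):
        C[p_lbl, t_lbl] += 1
    rows, cols = linear_sum_assignment(-C)     # negate to maximize
    mapping = dict(zip(rows, cols))
    return np.array([mapping[p_lbl] for p_lbl in pred])

pred = np.array([1, 1, 0, 0, 2, 2])
true = np.array([0, 0, 2, 2, 1, 1])
aligned = align_labels(pred, true)
```

After alignment, a plain accuracy (fraction of matching labels) becomes meaningful.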

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"# Initialize the community members' affiliation array\n",

"AFFILIATIONS = np.zeros((NUM_NODES, ))\n",

"\n",

"# Fill out the affiliation array\n",

"with open(\"email-Eu-core-department-labels.txt\") as file:\n",

" for line in file:\n",

" pair = [int(x) for x in line.split()]\n",

" AFFILIATIONS[pair[0]] = pair[1]\n",

"\n",

"# Print the true number of departments (ground-truth clusters)\n",

"print(\"The true number of clusters (departments) is: \", len(np.unique(AFFILIATIONS)))"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"<p style=\"background-color:#adebad;\">\n",

"Visually or quantitatively, in a clever and convincing way, show that the K-MEANS generated clusters overlap with the ground-truth clusters (member affiliations). How can we measure the overlap between the predicted and true clusters?\n",

"</p>"

]

},
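Two standard overlap measures are the Adjusted Rand Index and Normalized Mutual Information, both available in sklearn. They are permutation-invariant, so no label alignment is needed; a toy sketch:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# The same partition with swapped label names scores a perfect 1.0.
true = np.array([0, 0, 0, 1, 1, 1])
pred = np.array([1, 1, 1, 0, 0, 0])

ari = adjusted_rand_score(true, pred)
nmi = normalized_mutual_info_score(true, pred)
```

ARI is corrected for chance (random labelings score near 0), which makes it a robust single-number summary for this comparison.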

{

"cell_type": "code",

"execution_count": null,

"metadata": {},

"outputs": [],

"source": [

"## Here you can provide plots and calculations"

]

},

{

"cell_type": "markdown",

"metadata": {},

"source": [

"Please write here your explanations, observations and thoughts about the results of the experiments above."

]

},

{

"cell_type": "markdown",

"metadata": {

"collapsed": true

},

"source": [

"## Comments\n",

"\n",

"We hope you found this exercise instructive.\n",

"\n",

"Feel free to leave comments below, we will read them carefully."