2010-11-08

Directory Color Embedding

Do you also store your experiment data in folders whose names encode
the settings? I do. With many experiments, however, this strategy
makes comparison difficult, because it becomes hard to put everything
in one plot. The default color cycles in plotting programs such as
gnuplot or matplotlib do not offer enough distinct colors, and even
when they do, it is hard to assign the colors in a meaningful way for
exploratory data analysis. The script here might come to the rescue.



Reading and pre-processing directory names


We first zero-pad every number in the directory names to six
characters, such that test-paramA_0.1 becomes test-paramA_0000.1.
Numbers then line up character by character, which makes string
comparison easier.

import Levenshtein as L
import numpy as np
from colormath.color_objects import LabColor
import os, re

def fillfunc(mo):
    """ returns the matched number, padded with zeros at the front """
    return mo.group(1).zfill(6)

def get_unified_names(fns):
    """ returns a copy of fns with every number zero-padded to 6 characters """
    # a "number" is a run of digits and dots that does not touch another digit
    return [re.sub(r'(?<!\d)([\d.]+)(?!\d)', fillfunc, x) for x in fns]
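
To see what the padding does, here is a minimal standalone sketch using
the same regular expression. The helper name pad_numbers, the width
parameter and the example directory names are made up for illustration
and are not part of the original script.

```python
import re

# Same pattern as above: a run of digits and dots not adjacent to another digit.
NUMBER = re.compile(r'(?<!\d)([\d.]+)(?!\d)')

def pad_numbers(name, width=6):
    """Left-pad every number in `name` with zeros up to `width` characters."""
    return NUMBER.sub(lambda mo: mo.group(1).zfill(width), name)

# Hypothetical directory names.
names = ["test-paramA_0.1", "test-paramA_10", "test-paramA_0.25"]
padded = [pad_numbers(n) for n in names]
print(padded)
# ['test-paramA_0000.1', 'test-paramA_000010', 'test-paramA_000.25']
```

Note that the pattern also matches a lone dot between two non-digits,
so a name like "a.b" would be padded as well.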

Creating the distance matrix based on Levenshtein string distance

The Levenshtein string distance counts how many single-character
substitutions, insertions, and deletions are needed to transform one
string into the other. With this we automatically derive a
dissimilarity measure for the unified directory strings:

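# (the function name is a placeholder; the post only shows the body)
def get_matrix(strs):
    """ pairwise Levenshtein distances d, returned as exp(-d) """
    n = len(strs)
    mat = np.zeros((n,n))
    for i in range(n):
        for j in range(i+1, n):
            mat[i,j] = L.distance(strs[i],strs[j])
            mat[j,i] = mat[i,j]  # the distance is symmetric
    # note: exp(-d) is 1 for identical strings and decays towards 0
    return np.exp(-mat)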
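
If the Levenshtein package is not at hand, the distance itself is easy
to sketch in pure Python with the usual dynamic programming recurrence.
This is only an illustration of what is being computed; the function
names levenshtein and distance_matrix and the example names are
assumptions, and the C implementation used above will be much faster.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        cur = [i]                       # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]

def distance_matrix(strs):
    """Symmetric matrix of raw pairwise edit distances."""
    n = len(strs)
    mat = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            mat[i][j] = mat[j][i] = levenshtein(strs[i], strs[j])
    return mat

# Hypothetical, already zero-padded directory names.
names = ["run-lr_0000.1", "run-lr_0000.2", "run-lr_000010"]
print(distance_matrix(names))
# [[0, 1, 2], [1, 0, 2], [2, 2, 0]]
```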

Embedding the dissimilarity matrix using Multidimensional Scaling (MDS) in R


The pairwise distances can in general not be reproduced exactly by
points in 3D, so we need an embedding algorithm that works from the
distance matrix alone and finds coordinates which approximate it.


For simplicity, we call R from Python and pass the dissimilarity
matrix via a text file.

def emb(mat):
    np.savetxt("file.dat", mat)
    os.system("./analyse/emb.R")
    res = np.loadtxt("mds.dat")
    return res

Now to the part written in R, which only does the embedding and saves
the result in a text file:

#!/usr/bin/Rscript
library(MASS)
library(vegan)
tab <- read.table("file.dat", header = FALSE, sep=" ")
data.m    <- as.matrix(tab)
data.mds <- vegan::metaMDS(data.m, k=3, trymax=50)
write.table(data.mds$points, file="mds.dat", quote=F, row.names=F, col.names=F)
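
If the round trip through R is inconvenient, a rough alternative is
classical (Torgerson) MDS, which needs nothing but numpy. Be aware that
this is a different technique from the non-metric metaMDS used above
and will generally give different coordinates. The sketch below is
written under that assumption; the name classical_mds and the toy
matrix are made up, and the function expects plain distances (for
example the raw Levenshtein counts).

```python
import numpy as np

def classical_mds(dist, k=3):
    """Embed points in k dimensions from a symmetric distance matrix
    using classical (Torgerson) multidimensional scaling."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering   # double-centred Gram matrix
    eigval, eigvec = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:k]       # indices of the k largest
    # Negative eigenvalues (non-Euclidean distances) are clipped to zero.
    scale = np.sqrt(np.clip(eigval[order], 0.0, None))
    return eigvec[:, order] * scale

# Toy example: three points on a line at positions 0, 1 and 3.
dist = np.array([[0.0, 1.0, 3.0],
                 [1.0, 0.0, 2.0],
                 [3.0, 2.0, 0.0]])
coords = classical_mds(dist, k=3)
print(coords.shape)   # (3, 3)
```

For distances that are exactly Euclidean, as in the toy example, the
pairwise distances between the rows of coords reproduce the input
matrix. Edit distances usually are not, so the result is only an
approximation.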

Transforming embedded coordinates into colors

The coordinates lie in a seemingly arbitrary range and therefore need
to be mapped to colors. We want names with a small edit distance to
receive similar colors, so we map our coordinates directly to the Lab
color space, in which distances roughly match perceived color
differences, and then transform the Lab color coordinates to RGB:

def getcolor(embedded,i):
    #lab = LabColor(0.8,*embedded[i,:])
    lab = LabColor(*embedded[i,:])
    rgb = lab.convert_to('RGB', debug=False).get_rgb_hex()
    return rgb
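
Since the range of the embedded coordinates is arbitrary, one option is
to rescale each axis into a fixed box of Lab values before building the
colors. The following is a minimal sketch of that idea, not what the
script above does; the function name rescale_to_lab and the chosen
ranges (L in [30, 90], a and b in [-80, 80]) are assumptions picked for
illustration.

```python
import numpy as np

def rescale_to_lab(embedded, l_range=(30.0, 90.0), ab_range=(-80.0, 80.0)):
    """Min-max rescale the three embedding axes to (L, a, b) ranges."""
    emb = np.asarray(embedded, dtype=float)
    lo = emb.min(axis=0)
    span = emb.max(axis=0) - lo
    span[span == 0] = 1.0            # constant axis: avoid division by zero
    unit = (emb - lo) / span         # every column now lies in [0, 1]
    targets = np.array([l_range, ab_range, ab_range])
    return targets[:, 0] + unit * (targets[:, 1] - targets[:, 0])

# Hypothetical embedding of three directories.
embedded = [[-2.0, 0.0, 5.0],
            [ 0.0, 1.0, 5.0],
            [ 2.0, 4.0, 5.0]]
lab = rescale_to_lab(embedded)
print(lab)
```

Each row of the result could then be passed to LabColor as in getcolor.
Scaling every axis separately changes the relative weight of the axes;
a single common scale factor would preserve the geometry of the
embedding at the cost of using less of the color range.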

The resulting hex string can be used directly as the color
specification of a matplotlib plot line. It is best combined with
picking to find out more about a particularly interesting line.
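
A minimal sketch of how color and picking could be combined in
matplotlib follows. The directory names, hex colors and plotted data
are placeholders; in practice the colors would come from getcolor.

```python
import matplotlib.pyplot as plt

# Hypothetical inputs: directory names and the hex colors computed for them.
names = ["run-lr_0000.1", "run-lr_0000.2", "run-lr_000010"]
colors = ["#1b9e77", "#d95f02", "#7570b3"]

fig, ax = plt.subplots()
for k, (name, color) in enumerate(zip(names, colors)):
    ys = [k + 0.1 * x for x in range(10)]   # placeholder data
    # picker=True makes the line clickable, pickradius is the tolerance in points
    ax.plot(range(10), ys, color=color, label=name, picker=True, pickradius=5)

def on_pick(event):
    # event.artist is the Line2D object that was clicked
    print("picked:", event.artist.get_label())

cid = fig.canvas.mpl_connect("pick_event", on_pick)
```

Calling plt.show() afterwards opens the interactive window; clicking on
a line then prints the directory it belongs to.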

Example Embedding of directory names
