aerial.utils.math.clustering
Various data clustering algorithms, techniques, functions.
Generally applicable, but typically used for sequence clustering of
various sorts.
center-dist-expect
(center-dist-expect distfn clustering & {:keys [avgfn], :or {avgfn mean}})
The expectation of all center distances of clusters in CLUSTERING.
Distances are computed by DISTFN. Expectation is computed by AVGFN
which defaults to the mean.
center-distances
(center-distances distfn clustering)
Pairwise distances of centers of clusters Ci in CLUSTERING by means
of distance function DISTFN.
centers
(centers avgfn clusters)
Compute a new set of centers from CLUSTERS using AVGFN as a 'mean'
for the points in each cluster Ci of clusters. Returns an 'eager'
seq of these new centers.
cluster-distances
(cluster-distances distfn clusters)
Given a clustering CLUSTERS and its distance function DISTFN,
returns a set of scores/measures of the quality of the clustering,
a map with keys [:global, :each, :inter-m-xs, inter-xs, ...].
Where,
* global gives a global intra cluster cohesion, uses SSE
* each gives intra cluster cohesion for each cluster, uses SSE
* inter a global inter cluster cohesion (sum of sqrs of center
distances)
* inter-ms gives pairwise inter cluster cohesion (center distances)
* inter-m-xs gives pairwise inter cluster cohesion by min distances
over points xis in Ci to center mj of Cj and vice versa.
* inter-xs gives pairwise inter cluster cohesion by min distance
over all [xi, xj] pairs, xi in Ci, xj in Cj
cluster-stdev
(cluster-stdev distfn avgfn clustering)
Compute the average standard deviation of the spread of clusters in
CLUSTERING, the result of a (clusters ...) call. This is not as
obvious as it may seem, and is basically (avg-std-deviation
clustering) except we cheat a bit as we already have the
means (centers).
clusters
(clusters distfn coll centers)
Form and return a set of clusters. Each cluster is the set of
points from COLL closest to a point ci in CENTERS as determined by
distance function distfn (see nearest). Result is a map with keys
ci and values the cluster points for ci.
davies-bouldin-index
(davies-bouldin-index distfn clustering & {:keys [avgfn], :or {avgfn mean}})
Computes the Davies Bouldin Index of cluster validity. This is a
cluster validity measure which is the average of the maximal Rij
over all not equal pairwise clusters (see DBI-Rij).
DBI-Rij
(DBI-Rij distfn clustering & {:keys [avgfn], :or {avgfn mean}})
Rij is a similarity measure between two clusters of a clustering
based on DBI-Si compactness and center point distance dij. The
important aspects are that it is postive definite and symmetric.
Additionally, 'triangle like' aspects hold:
* if Sj > Sk and dij = dik, then Rij > Rik
* if Sj = Sk and dij < dik, then Rij > Rik
Rij = (/ (+ Si Sj) (distfn ci cj))
Returns the collection of Rij for all pairwise clusters of
clustering as a map indexed by i in (range n), n = (count
clusterings). The value of an entry is the n-1 set of Ri,i/=j for
cluster Ci. Note since Rij=Rji there are duplicate values across
the entries, but they are computed only once.
DBI-Si
(DBI-Si distfn cluster & {:keys [avgfn], :or {avgfn mean}})
The Davies Bouldin Index of the compactness (aka 'scatter' or Si)
of cluster CLUSTER.
Si is the expectation of pairwise distances of points in CLUSTER
with its center ci. Distances are given by DISTFN. The expectation
is AVGFN of these distances. AVGFN defaults to mean:
Si = (avgnf (sum (fn[xj] (distfn ci xj)) (val cluster)))
CLUSTER must be an element of a result of (clusters ...), i.e., a
map entry with center key and points val.
density
(density distfn stdev u coll)
Computes the 'density' of points in COLL relative to u (typically a
'center' or 'mid point' of some sort) as determined by counts
within 1 STDEV of u. Distances are given by distance function
DISTFN.
dist-matrix
(dist-matrix distfn coll & {:keys [sym keyfn], :or {sym true, keyfn identity}})
Computes and returns the pairwise distance matrix for items in
COLL. The distances between items is given by distfn. SYM
indicates that distfn is symmetric (default) and keyfn returns a
key suitable for map entries for an item and defaults to identity.
The idea behind keyfn is that some items may be large or otherwise
complicated and thus expensive to compare but a unique key may be
generated for them beforehand. A typical scenario would be to
'Goedel number' them.
Returns {[k v] | k=[(kefn i) (kefn j)] v =(distfn i j)}
Uses reducers and vfold to parallelize computation with auto
computed queue granularity (see vfold)
edist
(edist x y)
Simple Euclidean distance between points x and y. x and y are
points from the same dimensional space (1-d, 2-d, 3-d, ... n-d)
extreme-pd
(extreme-pd x distfn comp coll)
For point X, compute its 'extreme' point and its corresponding
distance from all the points in COLL. The 'extremeness' is
determined by COMP a comparison predicate. For example, comp = <
indicates 'nearest'.
DISTFN is the distance function used to compute the distances of x
from any p in coll.
If coll is a map, uses (vals coll) for the candidate collection of
points.
farthest
(farthest x distfn coll)
Return the farthest point p in coll to x. Distance is given by
distfn. If coll is a map uses (vals coll).
farthest-pd
(farthest-pd x distfn coll)
Return the farthest point p in coll to x along with its distance.
Distance is given by distfn. If coll is a map uses (vals coll).
Returns [p,d], where p is the farthest point and d its distance.
find-clusters
multimethod
(find-clusters coll & {:keys [distfn avgfn algo vindex], :or {algo kmeans++, vindex S-Dbw-index, distfn edist, avgfn mean}})
(find-clusters _ coll & {:keys [distfn avgfn algo vindex], :or {algo kmeans++, vindex S-Dbw-index, distfn edist, avgfn mean}})
Computes the 'best' clustering of the data in collection COLL whose
distances are given by DISTFN and means by AVGFN, as produced by
ALGO and measured by VINDEX. ALGO is the clustering algorithm,
defaults to kmeans++, and VINDEX is the validity index measure,
defaults to S-Dbw-index.
'Best', here is as determined by vindex. If the data has
convex (natural/true) clusters, and is not skewed, the default
kmeans++ with S-Dbw-index will return the optimal
clustering (indeed, it will be or be extremely close to the
natural/true clustering).
If the data are not convex, kmeans++ is invalid (can only find
convex clusters...). If the data is heavily skewed, kmeans will
not find the optimal (true) clusters (as it always tends to find
'equal area' clusters.
intercluster-density
(intercluster-density distfn avgfn clustering & {:keys [stdev]})
Compute the inter cluster point density. This is the density
between clusters as determined by point counts 1 cluster-stdev from
midpoints between cluster centers.
Note: if (= 1 (count clustering)), simply returns the density of
the single cluster.
Returns floating point density score of separation - the smaller
the better.
intra-dist-expect
(intra-dist-expect distfn cluster & {:keys [avgfn], :or {avgfn mean}})
The expectation of all pairwise point distances in cluster CLUSTER.
Distances are computed by DISTFN and expectation by AVGFN, which
defaults to mean. CLUSTER is either an element of a result
of (clusters ...), i.e., a map entry with center key and points
val, or the point collection of such cluster.
intra-distances
(intra-distances distfn cluster)
Pairwise distances of points in a cluster CLUSTER as given by
DISTFN. CLUSTER is either an element of a result of (clusters
...), i.e., a map entry with center key and points val, or the
point collection of such cluster.
ith-sum-sqr-err
(ith-sum-sqr-err distfn Ci)
Sum of Squares Error for cluster Ci. Ci is a map entry with key
the mean mi of Ci and val the points assigned to Ci. distfn is the
distance function for the err (and must be the same as the distance
function assigning the points to Ci).
kmeans
(kmeans initial-centers coll & {:keys [distfn avgfn], :or {distfn vecdist, avgfn vecmean}})
Computes kmeans clustering over COLL starting with initial set of
centers INITIAL-CENTERS, using DISTFN as the distance function
between points and AVGFN as the 'means' function for a cluster.
let k = count initial-centers
repeat
form k clusters by grouping all p in coll with nearest center
compute new k centers from clusters
until centers-i = centers-i+1
Return [Cs sse], where Cs is the set of clusters asssociated with
final centers and sse is the sum-sqr-err of Cs
kmeans++
(kmeans++ k coll & {:keys [distfn avgfn], :or {distfn vecdist, avgfn vecmean}})
kmeans++ clustering of COLL into K clusters. Picks initial-centers
as (k++init k distfn coll) and then proceeds via kmeans. As for
kmeans, DISTFN is the distance function between points and AVGFN is
the 'means' function for a cluster.
knn
(knn k distfn p coll)
Compute K nearest neighbors of point P in collection COLL.
Distance is given by means of distfn, preferably a metric, but can
be a 'similarity measure' such as relative entropy. O(nlogn)
complexity.
knn-graph
(knn-graph k distfn coll)
Compute the k nearest neighbors graph over the items in COLL as
determined by distance function distfn, preferably a metric, but
can be a 'similarity measure' such as relative entropy. Returns
the graph encoded as a map (sparse edge set matrix), with keys
items in COLL and values k-sets of items in COLL.
krnn-clust
(krnn-clust k distfn coll & {:keys [keyfn], :or {keyfn identity}})
krnn-graph
(krnn-graph knngraph coll)
(krnn-graph k distfn coll)
Compute the k reverse nearest neighbors graph over the items in
COLL as determined by the k nearest neighbors graph over COLL (see
knn-graph). Distance of knn graph is determined by distance
function distfn, preferably a metric, but can be a 'similarity
measure' such as relative entropy.
Alternately, the signature with KNNGRAPH, provides the k nearest
neighbor graph as a direct input.
Returns the graph encoded as a map (sparse edge set matrix), with
keys items in COLL and values sets of items in COLL.
NOTE: the size of the value sets for krnn need not be k, typically
isn't k, and can range from 0 to (count coll).
loyd-step
(loyd-step distfn avgfn data incenters)
Primary step in kmeans algorithm. Recluster data to the 'new'
centers incenters. Take new clusters and compute and return newer
centers for all new clusters. That's the basic Loyd step, but here
there is an extra wrinkle:
If (count new-clusters) is not equal (count incenters), some input
clusters have coalesced and need to be resplit (see
split-worst-cluster). The result is then 'restepped' until the
count of new clusters is count of incenters.
nearest
(nearest x distfn coll)
Return the nearest point p in coll to x. Distance is given by
distfn. If coll is a map uses (vals coll).
nearest-pd
(nearest-pd x distfn coll)
Return the nearest point p in coll to x along with its distance.
Distance is given by distfn. If coll is a map uses (vals coll).
Returns [p,d], where p is the nearest point and d its distance.
refoldin-outliers
(refoldin-outliers krnngrph rnnG>k-sccs rnnG<k-sccs)
(refoldin-outliers krnngrph rnnG>k-sccs rnnG<k-sccs knngrph & {:keys [minsize], :or {minsize 1}})
Takes a krnn graph and the SCC 'clusters' corresponding to the s
S-Dbw-index
(S-Dbw-index distfn clustering & {:keys [avgfn], :or {avgfn mean}})
Compute the 'S_Dbw' cluster validity index: Scatter plus Density
between. By several accounts (IEEE 2010 ICDM paper 'Understanding
Internal Clustering Validity Measures', in particular) this is the
most robust general internal validity measure across both data sets
and clustering algorithms. It is the default used by
FIND-CLUSTERS.
It accounts for both compactness of clusters and cluster
separation. It does this with a dual density measure:
1. Scattering (see SCATT), which computes the average variance in
clusters to the overall variance of the data set.
2. Intercluster density (so called 'Dens_bw', see
INTERCLUSTER-DENSITY), which computes an averaged density in the
space between all cluster pairs.
Note in particular that for convex sets, it is proved that the
clustering which minimizes Scatt + Dens_bw, is the optimal
clustering for the data set and algorithm pair (see Halkidi &
Vazirgiannis, Clustering Validity Assesment: Finding Optimal
partitioning of a Data Set, Proc ICDM 2001, pp187-194).
DISTFN is the distance function for the data and AVGFN the 'mean'
for the data.
Returns a floating point number as score. The smaller the better.
scatt
(scatt distfn avgfn clustering)
Compute the compactness of a clustering by means of 'density'
measure. This is the averaged ratio of variance of clusters in
clustering to the overall variance of the data set:
let S all data points
n (count clustering)
(* 1/n (sum (fn[Ci] (/ (variance Ci) (variance S))) clustering))
Returns a floating point density score of compactness - the smaller
the better.
split-krnn
(split-krnn k krnngrph rnncntM knngrph)
Takes a krnn graph (as built by krnn-graph) for value k, with point
counts given by rnncntM (as built by krnn-graph) and the
corresponding knngrph that is the basis of the krnn graph, and
returns two new graphs [rnnG>k rnnG<k]:
rnnG>k is the subgraph of krnngrph whose nodes have >= k edges
rnnG<k is the subgraph of krnngrph whose nodes have < k edges
Additionally, all edge set nodes in both subgraphs that do not
appear as nodes in the subgraphs are removed (another choice would
have been to include them with null edge sets, but that would have
violated the constraints on rnnG>k).
These two graphs form the basis for initial sets of clusters based
on the set of strongly connected components (SCC) in them. These
SCC form the basis of both the Chameleon and RECORD clustering
algorithms.
split-worst-cluster
(split-worst-cluster distfn avgfn clusters & {:keys [wradius], :or {wradius 0.8}})
Clusters a cluster map as given by function CLUSTER (for which
see). The functions distfn and avgfn are the distance and 'mean'
functions used to build clusters.
Clusters is assumed to be smaller in count than it 'should' be.
Using an error measure, which here is SSE (see ith-sum-sqr-err and
sum-sqr-err), clusters has a worst cluster Ci. Split Ci by first
sorting its points by their distance from cmi, the center of Ci.
Then remove the points within WRADIUS percent of the furthest point
and place in new cluster wCi. Remove Ci from clusters and add the
pair of clusters from splitting Ci:
new-clusters = clusters - Ci + (Ci - wCi) + wCi
Return new-clusters.
sum-sqr-err
(sum-sqr-err distfn clusters)
Sum of Squares Error for a set of clusters (result of single loyd
step or end of k-means or ...). distfn is the distance function
used to form the clusters (in a loyd step) and is also the distance
function for computing the errors.