aerial.utils.math.infoth
Various Information Theory functions and measures rooted in the
Shannon entropy measure and targeting a variety of sequence and
string data.
bi-tri-grams
(bi-tri-grams s)
combin-joint-entropy
(combin-joint-entropy coll1 coll2)
(combin-joint-entropy coll1 coll2 & colls)
Given a set of collections c1, c2, c3, .. cn, return the joint
entropy: - (sum (* px1..xn (log2 px1..xn)) all-pairs-over {ci}).
Where all-pairs-over is an exhaustive combination of elements of
{ci} taken n at a time, where each n-tuple has exactly one element
from each ci (i.e., the cross product of {ci}).
Reports in bits (logfn = log2) and treats [x y] and [y x] elements
as the same.
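A usage sketch; the value is worked from the formula above rather
than recorded from a run:
Ex: (combin-joint-entropy [:a :a :b :b] [0 1 0 1])
=> 2.0 ; the cross product yields four distinct pairs, each with p = 1/4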
cond-entropy
(cond-entropy PXY PY)
(cond-entropy combinator opts coll1 coll2)
(cond-entropy combinator opts coll1 coll2 & colls)
Given the joint probability distribution PXY and the distribution
PY, return the conditional entropy of X given Y: H(X|Y) = H(X,Y) - H(Y).
Alternatively, given a set of collections c1, c2, c3, .. cn, and
combinator, a function of n variables which generates joint
occurrences from {ci}, return the multivariate conditional entropy
induced from the joint probability distributions.
OPTS is a map of options, currently sym? and logfn. The defaults
are false and log2. If sym? is true, treat [x y] and [y x] as
equal. logfn can be used to provide a log of a different base.
log2, the default, reports in bits. If no options are required,
the empty map must be passed: (cond-entropy transpose {} coll1 coll2)
For the case of n > 2 collections, uses the recursive chain rule for
conditional entropy (bottoming out in the two collection case):
H(X1,..Xn-1|Xn) = (sum H(Xi|Xn,X1..Xi-1) (range 1 (inc n)))
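A sketch using the PXY/PY arity (the map shapes here are assumed;
the value follows from H(X|Y) = H(X,Y) - H(Y), not a recorded run):
Ex: (cond-entropy {[:x :y] 0.5, [:x :z] 0.5} {:y 0.5, :z 0.5})
=> 0.0 ; X is always :x, so H(X,Y) = H(Y) = 1 bit and nothing remains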
CREl
(CREl l sq & {:keys [limit alpha], :or {limit 15}})
dice-coeff
(dice-coeff s1 s2)
diff-fn
(diff-fn f)
Return the function that computes 1 - F of its args: (- 1 (apply f
args)). Intended for normalized distance metrics.
Ex: (let [dice-diff (diff-fn dice-coeff) ...]
(dice-diff some-set1 some-set2))
DLX||Y
(DLX||Y l Pdist Qdist)
Synonym for lambda-divergence.
DX||Y
(DX||Y & args)
Synonym for relative-entropy.
args is [pd1 pd2 & {:keys [logfn] :or {logfn log2}}]
entropy
(entropy dist & {logfn :logfn, :or {logfn log2}})
Entropy calculation for the probability distribution dist.
Typically dist is a map giving the PMF of some sample space. If it
is a string or vector, this calls shannon-entropy on dist.
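For example (value worked from the definition above, not a recorded
run):
Ex: (entropy {:a 0.25 :b 0.25 :c 0.25 :d 0.25})
=> 2.0 ; four equally likely outcomes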
expected-qdict
(expected-qdict q-1 q-2 & {:keys [alpha], :or {alpha ["A" "U" "G" "C"]}})
freq-jaccard-index
(freq-jaccard-index s1 s2)
freq-xdict-dict
(freq-xdict-dict q sq)
hamming
(hamming s1 s2)
Compute the Hamming distance between sequences S1 and S2. If both
S1 and S2 are strings, an optimized string implementation is used.
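Illustrative cases (counts follow directly from the definition):
(hamming "AAGCCU" "AACCGU") ==> 2 ; positions 3 and 5 differ
(hamming [1 2 3 4] [1 0 3 0]) ==> 2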
HXY
(HXY & args)
Synonym for joint-entropy
hybrid-dictionary
(hybrid-dictionary l sqs)
Compute the 'hybrid', aka centroid, dictionary or Feature Frequency
Profile (FFP) for sqs. SQS is either a collection of already
computed FFPs (probability maps) of sequences, or a collection of
sequences, or a string denoting a sequence file (sto, fasta, aln,
...) giving a collection of sequences. In the latter two cases, the
sequences will have their FFPs computed based on word/feature
length L (resolution size). In all cases the FFPs are combined,
using the minimum entropy principle, into a joint ('hybrid' /
centroid) FFP.
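A usage sketch (the word length 3 and toy sequences are illustrative
only):
Ex: (hybrid-dictionary 3 ["AAUUGGCC" "AAGGUUCC" "AAUUCCGG"])
; should yield the combined (centroid) FFP over length-3 words of
; the three per-sequence FFPs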
II
(II combinator opts collx colly collz & colls)
Synonym for interaction-information
IXY
(IXY combinator opts coll1 coll2)
Synonym for mutual information
IXY|Z
(IXY|Z combinator opts collx colly collz)
Synonym for conditional mutual information
jaccard-dist
(jaccard-dist s1 s2)
Named version of ((diff-fn jaccard-index) s1 s2). Unlike the raw
similarity, this difference function is a proper _distance_ metric
(hence usable in metric trees like bk-trees).
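For example, assuming set arguments and the standard
intersection-over-union index (value derived, not a recorded run):
Ex: (jaccard-dist #{1 2 3} #{2 3 4})
=> 0.5 ; jaccard-index is 2/4, so the distance is 1 - 1/2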
jaccard-index
(jaccard-index s1 s2)
jensen-shannon
(jensen-shannon Pdist Qdist)
Computes Jensen-Shannon Divergence of the two distributions Pdist
and Qdist. Pdist and Qdist _must_ be over the same sample space!
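For example, identical distributions give zero divergence
(mathematically forced); differing ones give a positive, symmetric
value:
Ex: (jensen-shannon {:a 0.5 :b 0.5} {:a 0.5 :b 0.5})
=> 0.0
(jensen-shannon {:a 0.5 :b 0.5} {:a 0.9 :b 0.1})
; > 0.0, and the same with the arguments swapped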
joint-entropy
(joint-entropy combinator opts coll)
(joint-entropy combinator opts coll & colls)
Given a set of collections c1, c2, c3, .. cn, and combinator, a
function of n variables which generates joint occurrences from {ci},
returns the joint entropy over all the set:
-sum(* px1..xn (log2 px1..xn))
OPTS is a map of options, currently sym? and logfn. The defaults
are false and log2. If sym? is true, treat [x y] and [y x] as
equal. logfn can be used to provide a log of a different base.
log2, the default, reports in bits. If no options are required,
the empty map must be passed: (joint-entropy transpose {} my-coll)
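A sketch using transpose as the combinator (assuming it pairs
corresponding positions, consistent with the total-correlation
examples further below); the value is worked from the formula above:
Ex: (joint-entropy transpose {} "AABB" "ABAB")
=> 2.0 ; pairs [A A] [A B] [B A] [B B], each with p = 1/4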
KLD
(KLD & args)
Synonym for relative-entropy.
args is [pd1 pd2 & {:keys [logfn] :or {logfn log2}}]
lambda-divergence
(lambda-divergence lambda Pdist Qdist)
Computes a symmetrized KLD variant based on a probability
parameter, typically notated lambda, in [0..1], which weights each
distribution:
(+ (* lambda (DX||Y Pdist M)) (* (- 1 lambda) (DX||Y Qdist M)))
Where M = (+ (* lambda Pdist) (* (- 1 lambda) Qdist))
= (merge-with (fn[pi qi] (+ (* lambda pi) (* (- 1 lambda) qi)))
Pdist Qdist)
For lambda = 1/2, this reduces to
M = 1/2 (merge-with (fn[pi qi] (+ pi qi)) P Q)
and (/ (+ (DX||Y Pdist M) (DX||Y Qdist M)) 2) = jensen-shannon
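So, per that reduction, the lambda = 1/2 case should agree (up to
floating point) with jensen-shannon:
Ex: (lambda-divergence 0.5 {:a 0.5 :b 0.5} {:a 0.9 :b 0.1})
; same value as (jensen-shannon {:a 0.5 :b 0.5} {:a 0.9 :b 0.1})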
levenshtein
(levenshtein s t)
Compute the Levenshtein (edit) distance between S and T, where S
and T are either sequences or strings.
Examples: (levenshtein [1 2 3 4] [1 1 3]) ==> 2
(levenshtein "abcde" "bcdea") ==> 2
limit-entropy
(limit-entropy q|q-dict sq|q-1dict & {:keys [alpha NA], :or {alpha ["A" "U" "G" "C"], NA -1.0}})
lod-score
(lod-score qij pi pj)
log-odds
(log-odds frq1 frq2)
max-qdict-entropy
(max-qdict-entropy q & {:keys [alpha], :or {alpha ["A" "U" "G" "C"]}})
ngram-compare
(ngram-compare s1 s2 & {uc? :uc?, n :n, scfn :scfn, ngfn :ngfn, :or {n 2, uc? false, scfn dice-coeff, ngfn word-letter-pairs}})
ngram-vec
(ngram-vec s & {n :n, :or {n 2}})
normed-codepoints
(normed-codepoints s)
q-1-dict
(q-1-dict q-xdict)
(q-1-dict q sq)
q1-xdict-dict
(q1-xdict-dict q sq & {:keys [ffn], :or {ffn probs}})
raw-lod-score
(raw-lod-score qij pi pj & {scaling :scaling, :or {scaling 1.0}})
reconstruct-dict
(reconstruct-dict l sq & {:keys [alpha], :or {alpha ["A" "U" "G" "C"]}})
relative-entropy
(relative-entropy pdist1 pdist2 & {:keys [logfn], :or {logfn log2}})
Take two distributions (that must be over the same sample space) and
compute the expectation of their log ratio: Let px be the PMF of
pdist1 and py be the PMF of pdist2, return
(sum (fn[px py] (* px (log2 (/ px py)))) xs ys)
Here, pdist(1|2) are maps giving the probability distributions (and
implicitly the PMFs), as provided by freqs-probs, probs,
cc-freqs-probs, combins-freqs-probs, cc-combins-freqs-probs,
et al. Or any map where the values are the probabilities of the
occurrence of the keys over some sample space. Any summation term
where (or (= px 0) (= py 0)) is taken as 0.0.
NOTE: maps should have same keys! If this is violated it is likely
you will get a :negRE exception or worse, bogus results. However,
as long as the maps reflect distributions _over the same sample
space_, they do not need to be a complete sampling (a key/value for
all sample space items) - missing keys will be included as 0.0
values.
Also known as Kullback-Leibler Divergence (KLD).
KLD >= 0.0 in all cases.
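For example (values worked from the formula above, not recorded
runs):
Ex: (relative-entropy {:a 0.5 :b 0.5} {:a 0.5 :b 0.5})
=> 0.0
(relative-entropy {:a 0.75 :b 0.25} {:a 0.5 :b 0.5})
; ~0.1887, i.e. (+ (* 0.75 (log2 1.5)) (* 0.25 (log2 0.5)))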
seq-joint-entropy
(seq-joint-entropy s & {:keys [sym? logfn], :or {sym? false, logfn log2}})
Returns the joint entropy of a sequence with itself: -sum(* pi (log
pi)), where probabilities pi are of combinations of elements of S
taken 2 at a time. If sym?, treat [x y] and [y x] as equal.
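A usage sketch:
Ex: (seq-joint-entropy "AAGGUUCC")
(seq-joint-entropy "AAGGUUCC" :sym? true) ; [x y] and [y x] counted as one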
shannon-entropy
(shannon-entropy s & {logfn :logfn, :or {logfn log2}})
Returns the Shannon entropy of a sequence: -sum(* pi (log pi)),
where i ranges over the unique elements of S and pi is the
probability of i in S: (freq i s)/(count s)
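For example (worked from the definition; see also the longer string
in the total-correlation examples below):
Ex: (shannon-entropy "AABB")
=> 1.0 ; two symbols, each with probability 1/2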
TCI
(TCI combinator opts coll1 coll2 & colls)
Synonym for total-correlation information
total-correlation
(total-correlation combinator opts coll1 coll2)
(total-correlation combinator opts coll1 coll2 & colls)
One of two forms of multivariate mutual information provided here.
The other is "interaction information". Total correlation
computes what is effectively the _total redundancy_ of the
information in the provided content - here the information in coll1
.. colln. As such it can give somewhat unexpected answers in
certain situations.
Information content "measure" is based on the distributions
arising out of the frequencies over coll1 .. colln _individually_,
and jointly over the result of combinator applied to coll1 .. colln
collectively.
OPTS is a map of options, currently sym? and logfn. The defaults
are false and log2. If sym? is true, treat [x y] and [y x] as
equal. logfn can be used to provide a log of a different base.
log2, the default, reports in bits. If no options are required,
the empty map must be passed: (total-correlation transpose {}
my-coll)
NOTE: the "degenerate" case of only two colls is simply mutual
information.
Let C be (combinator coll1 coll2 .. colln), so xi1..xin in C is an
element in the joint sample space, and xi in colli is an element in
a "marginal" space . Computes
sum (* px1..xn (log2 (/ px1..xn (* px1 px2 .. pxn)))) x1s x2s .. xns =
Hx1 + Hx2 + .. + Hxn - Hx1x2..xn
(<= 0.0
TC(X1,..,Xn)
(min|i (sum Hx1 .. Hxi Hxi+2 .. Hxn, i = 0..n-1, Hx0=Hxn+1=0)))
Ex:
(shannon-entropy "AAAUUUGGGGCCCUUUAAA")
=> 1.9440097497163569
(total-correlation transpose {}
"AAAUUUGGGGCCCUUUAAA" "AAAUUUGGGGCCCUUUAAA")
=> 1.9440097497163569 ; not surprising
(total-correlation transpose {}
"AAAUUUGGGGCCCUUUAAA" "AAAUUUGGGGCCCUUUAAA"
"AAAUUUGGGGCCCUUUAAA" "AAAUUUGGGGCCCUUUAAA")
=> 5.832029249149071 ; possibly surprising if not noting tripled redundancy
tversky-index
(tversky-index s1 s2 alpha beta)
Tversky index of two sets S1 and S2. A generalized NON-metric
similarity 'measure'. Generalization is through the ALPHA and BETA
coefficients:
TI(S1,S2) = (/ |S1^S2| (+ |S1^S2| (* ALPHA |S1-S2|) (* BETA |S2-S1|)))
For example, with alpha = beta = 1, TI is the jaccard-index;
with alpha = beta = 1/2, TI is the dice-coeff.
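So, per those special cases, the following should agree with
jaccard-index and dice-coeff respectively (a sketch assuming set
arguments):
Ex: (tversky-index #{1 2 3} #{2 3 4} 1 1)     ; = (jaccard-index #{1 2 3} #{2 3 4})
(tversky-index #{1 2 3} #{2 3 4} 0.5 0.5) ; = (dice-coeff #{1 2 3} #{2 3 4})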