aerial.utils.math.infoth

Various Information Theory functions and measures rooted in the
Shannon entropy measure and targeting a variety of sequence and
string data.

all-grams

(all-grams s)

bi-tri-grams

(bi-tri-grams s)

combin-joint-entropy

(combin-joint-entropy coll1 coll2)(combin-joint-entropy coll1 coll2 & colls)
Given a set of collections c1, c2, c3, .. cn, return the joint
entropy: - (sum (* px1..xn (log2 px1..xn)) all-pairs-over {ci}).
Where all-pairs-over is an exhaustive combination of elements of
{ci} taken n at a time, where each n-tuple has exactly one element
from each ci (i.e., the cross product of {ci}).

Reports in bits (logfn = log2) and treats [x y] and [y x] elements as
the same.
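
As a hand-worked sketch of the formula (not a claim about the exact
printed output): for colls [0 1] and [0 1] the cross product gives the
pairs [0 0], [0 1], [1 0], [1 1]; treating [0 1] and [1 0] as the same
yields probabilities 1/4, 1/2, 1/4, i.e. 1.5 bits, as plain Clojure
confirms:

(let [ps [0.25 0.5 0.25]]
  (- (reduce + (map #(* % (/ (Math/log %) (Math/log 2))) ps))))
=> 1.5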

cond-entropy

(cond-entropy PXY PY)(cond-entropy combinator opts coll1 coll2)(cond-entropy combinator opts coll1 coll2 & colls)
Given the joint probability distribution PXY and the distribution
PY, return the conditional entropy of X given Y = H(X,Y) - H(Y).

Alternatively, given a set of collections c1, c2, c3, .. cn, and
combinator, a function of n variables which generates joint
occurrences from {ci}, returns the multivariate conditional entropy
induced from the joint probability distributions.

OPTS is a map of options, currently sym? and logfn.  The defaults
are false and log2.  If sym? is true, treat [x y] and [y x] as
equal.  logfn can be used to provide a log of a different base.
log2, the default, reports in bits.  If no options are required,
the empty map must be passed: (cond-entropy transpose {} coll1 coll2)

For the case of more than two collections, uses the recursive chain
rule for
conditional entropy (bottoming out in the two collection case):

H(X1,..Xn-1|Xn) = (sum H(Xi|Xn,X1..Xi-1) (range 1 (inc n)))
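
A minimal usage sketch (illustrative sequences and the transpose
combinator, as in the other examples in this namespace).  The
H(X|Y) = H(X,Y) - H(Y) identity can be cross-checked with the
documented joint-entropy and shannon-entropy:

(let [xs "AAAGGGUUUU"
      ys "AAAGGGUGGG"]
  [(cond-entropy transpose {} xs ys)
   ;; should agree with H(X,Y) - H(Y)
   (- (joint-entropy transpose {} xs ys) (shannon-entropy ys))])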

conditional-mutual-information

(conditional-mutual-information combinator opts collx colly collz)
Conditional mutual information I(X;Y|Z).  Conditional mutual
information is the relative entropy (KLD) of the joint conditional
distribution (of two random variables given a third) with respect to
the product of their conditional distributions.

Here these arise out of frequencies for the information in collz
individually, collx & collz and colly & collz jointly as the result
of combinator, and collx & colly & collz jointly as the result of
combinator as well.  Hence, combinator needs to be able to handle
the cases of 2 and 3 collections passed to it (collectively or as
variadic).

OPTS is a map of options, currently sym? and logfn.  The defaults
are false and log2.  If sym? is true, treat [x y] and [y x] as
equal.  logfn can be used to provide a log of a different base.
log2, the default, reports in bits.  If no options are required,
the empty map must be passed: (conditional-mutual-information
transpose {} collx colly collz)

Other interpretations/meanings: Conditional mutual information is
the mutual information of two random variables conditioned on a
third.  Or, put another way, it is the expected value of the mutual
information of two RV over the values of a third.  It measures the
amount of information shared by X & Y beyond that provided by a
common "mediator".

Let XYZ be (combinator collx colly collz)
    XZ  be (combinator collx collz)
    YZ  be (combinator colly collz)

Computes

sum (* pxy|z (log2 (/ pxy|z (* px|z py|z)))) xs ys zs =

 ... (some algebra and applications of Bayes) ...

sum (* pxyz (log2 (/ (* pz pxyz) (* pxz pyz)))) xyzs xzs yzs

 ... (some more algebra and entropy identities) ...

(+ H(X,Z) H(Y,Z) -H(X,Y,Z) -H(Z))

which is easier to work with...
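
A sketch of that last identity in terms of the entropy functions
documented here (illustrative sequences; transpose as combinator):

(let [xs "AAAGGG" ys "AAGGGG" zs "AAAAGG"]
  ;; I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
  (+ (joint-entropy transpose {} xs zs)
     (joint-entropy transpose {} ys zs)
     (- (joint-entropy transpose {} xs ys zs))
     (- (shannon-entropy zs))))
;; should agree with (conditional-mutual-information transpose {} xs ys zs)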

CREl

(CREl l sq & {:keys [limit alpha], :or {limit 15}})

dice-coeff

(dice-coeff s1 s2)

diff-fn

(diff-fn f)
Return the function that computes 1 - F of its args: (- 1 (apply f
args)).  Intended for turning normalized similarity measures into
distance metrics.

Ex: (let [dice-diff (diff-fn dice-coeff) ...]
      (dice-diff some-set1 some-set2))

DLX||Y

(DLX||Y l Pdist Qdist)
Synonym for lambda-divergence.

DX||Y

(DX||Y & args)
Synonym for relative-entropy.
args is [pd1 pd2 & {:keys [logfn] :or {logfn log2}}]

entropy

(entropy dist & {logfn :logfn, :or {logfn log2}})
Entropy calculation for the probability distribution dist.
Typically dist is a map giving the PMF of some sample space.  If it
is a string or vector, this calls shannon-entropy on dist.
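
A small sketch of the PMF-map case; a fair two-outcome distribution
carries exactly one bit, which the plain-Clojure definition below
confirms:

(entropy {:heads 0.5 :tails 0.5})   ; should be 1.0 (one bit)

;; the same value from the definition, without the library
(- (reduce + (map #(* % (/ (Math/log %) (Math/log 2))) [0.5 0.5])))
=> 1.0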

expected-qdict

(expected-qdict q-1 q-2 & {:keys [alpha], :or {alpha ["A" "U" "G" "C"]}})

freq-jaccard-index

(freq-jaccard-index s1 s2)

freq-xdict-dict

(freq-xdict-dict q sq)

hamming

(hamming s1 s2)
Compute the Hamming distance between sequences S1 and S2.  If both
S1 and S2 are strings, an optimized string version is used.
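
Examples (assuming the standard positional definition - the number of
positions at which equal-length sequences differ):

(hamming "karolin" "kathrin")  => 3   ; r/t, o/h, l/r differ
(hamming [1 2 3 4] [1 9 3 0])  => 2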

HXY

(HXY & args)
Synonym for joint-entropy

HX|Y

(HX|Y & args)
Synonym for cond-entropy

hybrid-dictionary

(hybrid-dictionary l sqs)
Compute the 'hybrid', aka centroid, dictionary or Feature Frequency
Profile (FFP) for sqs.  SQS is either a collection of already
computed FFPs (probability maps) of sequences, or a collection of
sequences, or a string denoting a sequence file (sto, fasta, aln,
...) giving a collection of sequences.  In the latter cases, the
sequences will have their FFPs computed based on word/feature
length L (resolution size).  In all cases the FFPs are combined,
using the minimum entropy principle, into a joint ('hybrid' /
centroid) FFP.

II

(II combinator opts collx colly collz & colls)
Synonym for interaction-information

information-capacity

(information-capacity q sq & {:keys [cmpfn], :or {cmpfn jensen-shannon}})

informativity

(informativity q sq)(informativity q-dict)

interaction-information

(interaction-information combinator opts collx colly collz & colls)
One of two forms of multivariate mutual information provided here.
 The other is "total correlation".  Interaction information
 computes the amount of information bound up in a set of variables,
 beyond that present in _any subset_ of those variables. Well, that
 seems to be the most widely held view, but there are disputes. This
 can either indicate inhibition/redundancy or facilitation/synergy
 between the variables.  It helps to look at the 3 variable case, as
 it at least lends itself to 'information' (or info venn) diagrams.

 II(X,Y,Z) = I(X,Y|Z) - I(X,Y)

 II measures the influence of Z on the information connection
 between X and Y.  Since I(X,Y|Z) < I(X,Y) is possible, II can be
 negative, as well as 0 and positive.  For example, if Z completely
 determines the information connection between X & Y, then I(X,Y|Z)
 would be 0 (the overlap contributed by X & Y alone is totally
 redundant given Z) while I(X,Y) could still be > 0, making II
 negative.  A _negative_ interaction indicates Z inhibits (accounts for,
 causes to be irrelevant/redundant) the connection between X & Y.  A
 _positive_ interaction indicates Z promotes, is synergistic with,
 or enhances the connection.  The case of 0 is obvious...

 If we use the typical info theory venn diagrams, then II is
 represented as the area in all the circles. It is possible (and
 typically so done) to draw the Hxi (xi in {X,Y,Z}) circles so that
 all 3 have an overlap.  In this case the information between any
 two (the mutual information...) is partially accounted for by the
 third.  The extra overlap accounts for the negative value
 "correction" of II in this case.

 However, it is also possible to draw the circles so that any two
 overlap but all three have null intersect.  In this case, the third
 variable provides an extra connection between the other two (beyond
 their MI), something like a "mediator".  This extra connection
 accounts for the positive value "facilitation" of II in this
 case.

 It should be reasonably clear at this point that negative II is the
 much more intuitive/typical case, and that positive cases are
 surprising and typically more interesting and informative (so to
 speak...).  A positive example would be the case of two mutually
 exclusive causes for a common effect (third variable).  In such a
 case, knowing any two completely determines the other.

 All of this can be bumped up to N variables via typical chain rule
 recurrence.

 In this implementation, the information content "measure" is
 based on the distributions arising out of the frequencies over
 coll1 .. colln _individually_, and jointly over the result of
 combinator applied collectively to all subsets of coll1 .. colln.

 OPTS is a map of options, currently sym? and logfn.  The defaults
 are false and log2.  If sym? is true, treat [x y] and [y x] as
 equal.  logfn can be used to provide a log of a different base.
 log2, the default, reports in bits.  If no options are required,
 the empty map must be passed: (interaction-information transpose {}
 collx colly collz)

 For collections coll1..colln and variadic combinator over any
 subset of collections, computes:

 (sum (fn[ss] (* (expt -1 (- n |ss|))
                 (HXY combinator sym? ss)))
      (subsets all-colls))

 |ss| = cardinality/count of subset ss.  ss is a subset of
 coll1..colln.  If sym? is true, treat reversible combined items as
 equal (see HXY/joint-entropy).

 For n=3, this reduces to (- (IXY|Z combinator sym? collx colly collz)
                             (IXY combinator sym? collx colly))

 Ex:

 (interaction-information transpose {}
   "AAAGGGUUUUAAUAUUAAAAUUU"
   "AAAGGGUGGGAAUAUUCCCAUUU"
   "AAAGGGUGGGAAUAUUCCCAUUU")
 => -1.1673250256261127

 Ex:

 (interaction-information transpose {}
   "AAAGGGUUUUAAUAUUAAAAUUU"
   "AAAGGGUGGGAAUAUUCCCAUUU"
   (reverse-compliment "AAAGGGUGGGAAUAUUCCCAUUU"))
 => -0.282614135781623

 Ex: Shows synergy (positive value)

 (let [X1 [0 1] ; 8 independent binary vars (X8 unused below)
       X2 [0 1]
       X3 [0 1]
       X4 [0 1]
       X5 [0 1]
       X6 [0 1]
       X7 [0 1]
       X8 [0 1]
       Y1 [X1 X2 X3 X4 X5 X6 X7]
       Y2          [X4 X5 X6 X7]
       Y3             [X5 X6 X7]
       Y4                   [X7]
       bitval #(loop [bits (reverse %) n 0 v 0]
                 (if (empty? bits)
                   v
                   (recur (rest bits)
                          (inc n)
                          (+ v (* (first bits) (math/expt 2 n))))))
       rfn #(map bitval (apply reducem (fn[& args] [args]) concat %))
       ;; Turn to (partially redundant) value sets - each Yi is the set
       ;; of all numbers for all (count Yi) bit patterns.  So, Y1, Y2,
       ;; and Y3 share all values of 3 bits, while the group with Y4
       ;; added shares just all 1 bit patterns
       Y1 (rfn Y1)
       Y2 (rfn Y2)
       Y3 (rfn Y3)
       Y4 (rfn Y4)]
   [(interaction-information concat {} Y1 Y2 Y3)
    (interaction-information concat {} Y1 Y2 Y3 Y4)])
=> [-3.056592722891465   ; Intuitive, since 3 way sharing
     1.0803297840536592] ; Unintuitive, since 1 way sharing induces synergy!?

IXY

(IXY combinator opts coll1 coll2)
Synonym for mutual-information.

IXY|Z

(IXY|Z combinator opts collx colly collz)
Synonym for conditional-mutual-information.

jaccard-dist

(jaccard-dist s1 s2)
Named version of ((diff-fn jaccard-index) s1 s2).  Unlike the
underlying similarity, this difference is a proper _distance_ metric
(hence usable in metric trees like bk-trees).
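
A small sketch, assuming the standard Jaccard definitions
(index = |S1^S2| / |S1 union S2|, distance = 1 - index):

(jaccard-index #{1 2 3} #{2 3 4})  ; |{2 3}| / |{1 2 3 4}| = 2/4 = 0.5
(jaccard-dist  #{1 2 3} #{2 3 4})  ; 1 - 0.5 = 0.5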

jaccard-index

(jaccard-index s1 s2)

jensen-shannon

(jensen-shannon Pdist Qdist)
Computes Jensen-Shannon Divergence of the two distributions Pdist
and Qdist.  Pdist and Qdist _must_ be over the same sample space!
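
A sketch of the lambda = 1/2 relation described under
lambda-divergence, using relative-entropy and the midpoint
distribution M (distribution values are illustrative):

(let [P {:a 0.5  :b 0.5}
      Q {:a 0.25 :b 0.75}
      M (merge-with (fn [pi qi] (/ (+ pi qi) 2)) P Q)]
  (/ (+ (relative-entropy P M) (relative-entropy Q M)) 2))
;; should equal (jensen-shannon P Q)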

joint-entropy

(joint-entropy combinator opts coll)(joint-entropy combinator opts coll & colls)
Given a set of collections c1, c2, c3, .. cn, and combinator, a
function of n variables which generates joint occurrences from {ci},
returns the joint entropy over all the set:

-sum(* px1..xn (log2 px1..xn))

OPTS is a map of options, currently sym? and logfn.  The defaults
are false and log2.  If sym? is true, treat [x y] and [y x] as
equal.  logfn can be used to provide a log of a different base.
log2, the default, reports in bits.  If no options are required,
the empty map must be passed: (joint-entropy transpose {} my-coll)
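
A minimal usage sketch with transpose as the combinator (as in the
other examples here, presumably pairing positional elements).  For two
identical collections the joint distribution collapses to the
marginal, so:

(joint-entropy transpose {} "AAAUUUGGGGCCCUUUAAA" "AAAUUUGGGGCCCUUUAAA")
;; should equal (shannon-entropy "AAAUUUGGGGCCCUUUAAA") => 1.9440097497163569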

KLD

(KLD & args)
Synonym for relative-entropy.
args is [pd1 pd2 & {:keys [logfn] :or {logfn log2}}]

lambda-divergence

(lambda-divergence lambda Pdist Qdist)
Computes a symmetrized KLD variant based on a probability
parameter, typically notated lambda, in [0..1] for each
distribution:

(+ (* lambda (DX||Y Pdist M)) (* (- 1 lambda) (DX||Y Qdist M)))

Where M = (+ (* lambda Pdist) (* (- 1 lambda) Qdist))
        = (merge-with (fn[pi qi] (+ (* lambda pi) (* (- 1 lambda) qi)))
                      Pdist Qdist)

For lambda = 1/2, this reduces to

M = 1/2 (merge-with (fn[pi qi] (+ pi qi)) P Q)

and (/ (+ (DX||Y Pdist M) (DX||Y Qdist M)) 2) = jensen shannon

levenshtein

(levenshtein s t)
Compute the Levenshtein (edit) distance between S and T, where S
and T are either sequences or strings.

Examples:  (levenshtein [1 2 3 4] [1 1 3]) ==> 2
           (levenshtein "abcde" "bcdea")   ==> 2

limit-entropy

(limit-entropy q|q-dict sq|q-1dict & {:keys [alpha NA], :or {alpha ["A" "U" "G" "C"], NA -1.0}})

limit-informativity

(limit-informativity q sq)(limit-informativity q-dict)

lod-score

(lod-score qij pi pj)

log-odds

(log-odds frq1 frq2)

max-qdict-entropy

(max-qdict-entropy q & {:keys [alpha], :or {alpha ["A" "U" "G" "C"]}})

mutual-information

(mutual-information combinator opts coll1 coll2)
Mutual information between the content of coll1 and coll2 as
combined by combinator, a function of two variables, presumed to be
over coll1 and coll2 returning a collection of combined elements.

OPTS is a map of options, currently sym? and logfn.  The defaults
are false and log2.  If sym? is true, treat [x y] and [y x] as
equal.  logfn can be used to provide a log of a different base.
log2, the default, reports in bits.  If no options are required,
the empty map must be passed: (mutual-information transpose {} coll1 coll2)

Mutual information is the relative entropy (KLD) of the joint
probability distribution to the product distribution.  Here the
distributions arise out of the frequencies over coll1 and coll2
individually and jointly over the result of combinator on coll1 and
coll2.

Let C be (combinator coll1 coll2).  Computes

(sum (* pxy (log2 (/ pxy (* px py)))) xs ys) =

(+ (shannon-entropy coll1) (shannon-entropy coll2) (- (joint-entropy C))) =

Hx + Hy - Hxy = I(X,Y)

(<= 0.0 I(X,Y) (min [Hx Hy]))
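
A sketch of the Hx + Hy - Hxy identity.  Per the total-correlation
note below, the two-collection case of total-correlation is just
mutual information, so for identical collections the value should be
the plain Shannon entropy:

(let [c1 "AAAUUUGGGGCCCUUUAAA"
      c2 "AAAUUUGGGGCCCUUUAAA"]
  [(mutual-information transpose {} c1 c2)
   (- (+ (shannon-entropy c1) (shannon-entropy c2))
      (joint-entropy transpose {} c1 c2))])
;; both values should be 1.9440097497163569 (cf. the total-correlation examples)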

ngram-compare

(ngram-compare s1 s2 & {uc? :uc?, n :n, scfn :scfn, ngfn :ngfn, :or {n 2, uc? false, scfn dice-coeff, ngfn word-letter-pairs}})

ngram-vec

(ngram-vec s & {n :n, :or {n 2}})

normed-codepoints

(normed-codepoints s)

q-1-dict

(q-1-dict q-xdict)(q-1-dict q sq)

q1-xdict-dict

(q1-xdict-dict q sq & {:keys [ffn], :or {ffn probs}})

raw-lod-score

(raw-lod-score qij pi pj & {scaling :scaling, :or {scaling 1.0}})

reconstruct-dict

(reconstruct-dict l sq & {:keys [alpha], :or {alpha ["A" "U" "G" "C"]}})

relative-entropy

(relative-entropy pdist1 pdist2 & {:keys [logfn], :or {logfn log2}})
Take two distributions (that must be over the same space) and
compute the expectation of their log ratio: Let px be the PMF of
pdist1 and py be the PMF of pdist2, return

(sum (fn[px py] (* px (log2 (/ px py)))) xs ys)

Here, pdist(1|2) are maps giving the probability distributions (and
implicitly the pmfs), as provided by freqs-probs, probs,
cc-freqs-probs, combins-freqs-probs, cc-combins-freqs-probs,
et. al.  Or any map where the values are the probabilities of the
occurrence of the keys over some sample space.  Any summation term
where (or (= px 0) (= py 0)) is taken as 0.0.

NOTE: maps should have same keys! If this is violated it is likely
you will get a :negRE exception or worse, bogus results.  However,
as long as the maps reflect distributions _over the same sample
space_, they do not need to be a complete sampling (a key/value for
all sample space items) - missing keys will be included as 0.0
values.

Also known as Kullback-Leibler Divergence (KLD)

KLD >= 0.0 in all cases.
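
A small worked sketch with two PMF maps over the same two-element
space (values illustrative):

(relative-entropy {:a 0.5 :b 0.5} {:a 0.25 :b 0.75})
;; = (+ (* 0.5 (log2 (/ 0.5 0.25))) (* 0.5 (log2 (/ 0.5 0.75))))
;; = (+ 0.5 (* 0.5 (log2 2/3)))
;; which is roughly 0.2075 bits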

seq-joint-entropy

(seq-joint-entropy s & {:keys [sym? logfn], :or {sym? false, logfn log2}})
Returns the joint entropy of a sequence with itself: -sum(* pi (log
pi)), where probabilities pi are of combinations of elements of S
taken 2 at a time.  If sym?, treat [x y] and [y x] as equal.

shannon-entropy

(shannon-entropy s & {logfn :logfn, :or {logfn log2}})
Returns the Shannon entropy of a sequence: -sum(* pi (log pi)),
where i ranges over the unique elements of S and pi is the
probability of i in S: (freq i s)/(count s)
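
Examples (the first result also appears in the total-correlation
examples below):

(shannon-entropy "AAAUUUGGGGCCCUUUAAA")
=> 1.9440097497163569

(shannon-entropy "AUAU")  ; two equiprobable symbols
=> 1.0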

TCI

(TCI combinator opts coll1 coll2 & colls)
Synonym for total-correlation.

total-correlation

(total-correlation combinator opts coll1 coll2)(total-correlation combinator opts coll1 coll2 & colls)
One of two forms of multivariate mutual information provided here.
The other is "interaction information".  Total correlation
computes what is effectively the _total redundancy_ of the
information in the provided content - here the information in coll1
.. colln.  As such it can give somewhat unexpected answers in
certain situations.

Information content "measure" is based on the distributions
arising out of the frequencies over coll1 .. colln _individually_,
and jointly over the result of combinator applied to coll1 .. colln
collectively.

OPTS is a map of options, currently sym? and logfn.  The defaults
are false and log2.  If sym? is true, treat [x y] and [y x] as
equal.  logfn can be used to provide a log of a different base.
log2, the default, reports in bits.  If no options are required,
the empty map must be passed: (total-correlation transpose {}
coll1 coll2)

NOTE: the "degenerate" case of only two colls is simply mutual
information.

Let C be (combinator coll1 coll2 .. colln), so xi1..xin in C is an
element in the joint sample space, and xi in colli is an element in
a "marginal" space . Computes

sum (* px1..xn (log2 (/ px1..xn (* px1 px2 .. pxn)))) x1s x2s .. xns =

   Hx1 + Hx2 + .. + Hxn - Hx1x2..xn

(<= 0.0
    TC(X1,..,Xn)
    (min|i (sum Hx1 .. Hxi Hxi+2 .. Hxn, i = 0..n-1, Hx0=Hxn+1=0)))

Ex:

(shannon-entropy "AAAUUUGGGGCCCUUUAAA")
=> 1.9440097497163569

(total-correlation transpose {}
  "AAAUUUGGGGCCCUUUAAA" "AAAUUUGGGGCCCUUUAAA")
=> 1.9440097497163569 ; not surprising

(total-correlation transpose {}
  "AAAUUUGGGGCCCUUUAAA" "AAAUUUGGGGCCCUUUAAA"
  "AAAUUUGGGGCCCUUUAAA" "AAAUUUGGGGCCCUUUAAA")
=> 5.832029249149071 ; possibly surprising if not noting tripled redundancy

tversky-index

(tversky-index s1 s2 alpha beta)
Tversky index of two sets S1 and S2.  A generalized, NON-metric
similarity 'measure'.  The generalization is through the ALPHA and
BETA coefficients:

TI(S1,S2) = (/ |S1^S2| (+ |S1^S2| (* ALPHA |S1-S2|) (* BETA |S2-S1|)))

For example, with alpha=beta=1,  TI is jaccard-index
             with alpha=beta=1/2 TI is dice-coeff
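
A hand-worked sketch of the formula in plain Clojure (ti-sketch is a
hypothetical helper, not part of this namespace): for S1 = #{1 2 3}
and S2 = #{2 3 4}, |S1^S2| = 2 and |S1-S2| = |S2-S1| = 1, so
alpha=beta=1 gives 2/4 (the jaccard-index) and alpha=beta=1/2 gives
2/3 (the dice-coeff):

(require '[clojure.set :as set])

(defn ti-sketch [s1 s2 alpha beta]
  (let [inter (count (set/intersection s1 s2))
        d12   (count (set/difference s1 s2))
        d21   (count (set/difference s2 s1))]
    (/ inter (+ inter (* alpha d12) (* beta d21)))))

(ti-sketch #{1 2 3} #{2 3 4} 1 1)      ;=> 1/2
(ti-sketch #{1 2 3} #{2 3 4} 1/2 1/2)  ;=> 2/3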

variation-information

(variation-information combinator opts coll1 coll2)