aerial.utils.math.probs-stats

Various frequency, combinatorial, probability, statistical,
measure, and metrics for a variety of sequence and string data.

alphabet2

(alphabet2 coll & {n :n, :or {n 1}})
Return the alphabet for coll.  If coll is a map, return it's keys.
If a set, return coll, otherwise return alphabet generated over
coll by freqn with size n.

avg-std-deviation

(avg-std-deviation sample-set)(avg-std-deviation sample-set {:keys [distfn avgfn v], :or {distfn -, avgfn mean}})
Compute the average of the n standard deviations for the n samples
in SAMPLE-SET.  This is not as obvious as it may at first seem.  It
is _not_ the mean of the stdevs of each sample.  The correct
averaging is the sqrt of the average of the weighted variances of
each sample, where the weighting is the sample size.  This latter
is given by avg-variance (see its documentation for details).  So,
we return:

(sqrt (avg-variance sample-set))

avg-variance

(avg-variance sample-set)(avg-variance sample-set & {:keys [distfn avgfn m], :or {distfn -, avgfn mean}})
Compute the average of the n variances for the n samples in
SAMPLE-SET.  This is not as obvious as it may at first seem.  It is
_not_ the mean of the variances of each sample.  The correct
averaging is the average of the weighted variances of each sample,
where the weighting is the sample size:

(/ (sum (fn[Si] (dec (count Si)) (variance Si)) sample-set)
   (- (sum count sample-set) (count sample-set)))

This does reduce to the mean of the variances if the samples are
all of the same size, but this is often (typically?) not the case.

DISTFN and AVGFN are as for the function variance (for which, see)

binomial-dist

(binomial-dist N p)
Binomial probability distribution. P is the probability of a hit
and N is the number of trials.  Returns a seq of pairs [k pr],
where k in (range 1 (inc N)) and pr is probability of getting k
successes in N trials.

binomial-pdf

(binomial-pdf p)
Return the binomial probability mass/distribution _function_
corresponding to P, the probability of a hit in some Bernoulli
trial.  The resulting function returned, call it bnpdf, is a
function of two parameters, the number of hits k (successes), and
the number of trials N.  So, if the result of (binomial-pdf 1/4) is
bound to bnpdf, then (bnpdf 3 10) would be the probability of
getting 3 hits in 10 trials for sample probability 1/4.  A
distribution of bnpdf can then be generated via some technique
like (map #(bnpdf % 1000) (range 1 (inc 1000))).

cc-combins-freqn

(cc-combins-freqn n colls & {par :par, :or {par 1}})
combins-freqn over all collections C in colls.  Basically cc-freqs
with freq-fn equal combins-freqn

cc-combins-freqs-probs

(cc-combins-freqs-probs n colls & {par :par, :or {par 1}})
cc-freqs&probs with freqs&probs-fn equal to combins-freqs-probs.
So, computes freqs and probs for each C in colls based on
combinations of coll items taken n at a time (the keys for the
maps).  Returns [ccfsps allfs allps cnt], where ccfsps is the seq
of triples for each C in colls, allfs is the map for all freqs,
allps is corresponding map for all probs and cnt is total count of
all elements.

cc-freqn

(cc-freqn n colls & {par :par, :or {par 1}})
cc-freqs for freqn - use freqn over colls

cc-freqs

(cc-freqs n colls freq-fn & {:keys [par], :or {par false}})
Frequencies of items in each collection in the collection of
collections colls.  Uses freq-fn to generate and count items.  Set
PAR to parallelize when (count colls) is large and each (freq-fn
c), c in colls, is expensive.  Returns a seq of frequency maps, one
for each collection in colls.

cc-freqs&probs

(cc-freqs&probs n colls freqs&probs-fn & {:keys [par], :or {par false}})
For each C in colls (a collection of collecions), compute the
frequency and probability map for C and its total item count as
given by freqs&probs-fn (for example, combins-freqs-probs).  Using
these maps, compute the total item frequency and probablity maps
and total count over all collections in colls.  Low level
functional base for public functions such as cc-freqs-probs,
et. al.

For large (count colls) with expensive freqs&probs-fn, set par to
parallelize computation via vfold (and fork/join) using auto
partitioning

Return [ccfs&ps allfs allps tcount], where

ccfs&ps is a seq of triples [fs ps cnt], for each C in colls,
allfs is the map of freqs over all Cs,
allps is the map of probs over all Cs and
tcount is the total items over coll.

cc-freqs-probs

(cc-freqs-probs n colls & {par :par, :or {par 1}})
cc-freqs&probs using freqn as base frequency function

cc-tfreqn

(cc-tfreqn n colls & {par :par, :or {par 1}})
cc-tfreqs for freqn, cc-freqs with freqn plus overall totals.

cc-tfreqs

(cc-tfreqs n colls freq-fn & {par :par, :or {par 1}})
Same as cc-freqs, but also compute total item frequencies reduced
over all collections in colls.  Returns pair [ccfreqs cctfreqs],
where ccfreqs is the seq of frequency maps per cc-freqs and
cctfreqs is the single map for all items.

choose-k-freqn

(choose-k-freqn n coll)
Synonym for combins-freqn

combin-count-reduction

(combin-count-reduction item-coll sym)
Performs (freqn 1 item-coll), giving result M.  If sym is true,
coalesces items in M whose keys are reversible by retaining one key
with the sum of the values of previouis two.  See coalesce-xy-yx
for complete description.  This was originally a one off
specifically for coalescing combination sets (see combins), hence
the (now bogus) name.

combins-freqn

(combins-freqn n coll & {sym :sym, :or {sym true}})
Compute frequency map for coll where items are formed by all
choices of coll things taken n at a time: (combins n coll).  If SYM
is true (the default), treat vec/seq keys as symmetric.  Examples:

(combins-freqn
  2 "UUAAAAAACAAAAAAAAAAAACAAAACACAAAAACAAAUUCUACAAAAAAUAAAAACA")
{[\C \C] 28, [\A \C] 352, [\A \A] 946, [\U \C] 48,
 [\U \A] 264, [\U \U] 15}

(combins-freqn
  2 "UUAAAAAACAAAAAAAAAAAACAAAACACAAAAACAAAUUCUACAAAAAAUAAAAACA" :sym false
{[\C \U] 23, [\C \C] 28, [\C \A] 149, [\A \U] 131,
 [\A \C] 203, [\A \A] 946, [\U \C] 25, [\U \A] 133, [\U \U] 15}

combins-freqs-probs

(combins-freqs-probs n coll)
freqs&probs with freq-fn equal to combins-freqn.  So, triple with
fs and ps maps based on combinations of coll items taken n at time,
i.e., keys are such items, with cnt element (nCk (count coll) n).

cond-probability

(cond-probability PXY PY occurs?)(cond-probability combinator sym? occurs? coll1 coll2)
Given collections coll1 and coll2, and combinator, a function of 2
variables which generates joint occurances from coll1 & coll2,
returns the conditional probability distributions for coll1
conditioned on elements of coll2.  If sym? is true coalesce
reversable element occurances during joint occurance
computation.

Alternatively, in the distribution version, PXY is a joint
distribution in map form, and PY is a distribution in map form.
That is, in this variant, the probability distributions are
explicitly provided.

OCCURS? is a function of two variables, pkx and pkxy, where pkx is
a key derived from coll2/PY and pkxy is a key derived from the
combinator map of coll1 with coll2 or a key in PXY.  It returns
whether pkx occurs in pkxy.

cond-probability returns a map of maps, where each inner map is a
distribution of coll1 given ei in coll2:

{{ei {P(COLL1)|ei}} ei in COLL2.

Use pXY|y to obtain any given distribution, where y = some ei.

Examples:

(cond-probability
  transpose false #(= %1 (.charAt %2 1)) "defdeef" "abcaabb")
=> {\a {"ea" 0.3333333333333333, "da" 0.6666666666666666},
    \b {"fb" 0.3333333333333333, "eb" 0.6666666666666666},
    \c {"fc" 0.9999999999999997}}

(cond-probability
  transpose false #(= %1 (.charAt %2 1)) "abcaabb" "defdeef")
=> {\d {"ad" 1.0},
    \e {"ae" 0.3333333333333333, "be" 0.6666666666666666},
    \f {"bf" 0.5, "cf" 0.5}}

correlation

(correlation X Y)
For population/sample sets X and Y, compute the correlation between
X and Y, cor(X,Y).  The size of X and Y must be the same.  Provides
a 'measure' of the strength and connection of a linear relationship
beween X and Y.  Bounded by -1 (anti correlation) and 1 (perfect
correlation), i.e., cor(X,Y) in [-1, 1].  A value of zero indicates
no linear correlation, but does not mean independence, however
independence will produce a zero value.

  cor(X,Y) = (/ cov(X,Y) (* (std-deviation X) (std-deviation Y)))

Defined only if both standard deviations are nonzero.

Also known as Pearson correlation coefficient, Pearson
product-moment correlation, or simply Pearson correlation.

Correlation is a normalized covariance.

covariance

(covariance X Y)
For populations/sample sets X and Y, compute the covariance of X and Y,
cov(X,Y). The size of X and Y must be the same. Provides a
'measure' of how X and Y 'follow' one another or change together.
Do they both go up and down together, vice versa, or something in
between.

  let N (count X)
      mux (mean X)
      muy (mean Y)
   (sum (fn[xi yi] (/ (* (- xi mux) (- yi muy)) (dec N))) X Y)

   (the (dec N) would be just N, if using true populations)

   which, by some algebra, is E(XY) - E(X)E(Y).

Note: If X and Y are independent, E(XY) = E(X)E(Y) and so
covariance must be 0.  If X=Y, cov(X,X) = E(X^2)-E(X)^2 = var(X).

Note: a particular useful application is for the variance of a
linear combination of X & Y:

 var(aX + bY) = a^2*E(X^2)+2ab*E(XY)+b^2*E(Y^2) -
                a^2*E(X)^2+2ab*E(X)E(Y)+b^2*E(Y)^2

              = a^2*(E(X^2)-E(X)^2) +
                b^2*(E(Y^2)-E(Y)^2) +
                2ab*(E(XY)-E(X)E(Y))

              = a^2*var(X)+b^2*var(Y)+2ab*cov(X,Y)

Note: cov is very dependent on the sample matching of X and Y, so
be sure X and Y are in the correct order on call:

 (covariance [1 2 3 4 5] [1 2 3 4 5])
 => 2.0
 (covariance [1 2 3 4 5] [2 1 4 3 5])
 => 1.6
 (covariance [1 2 3 4 5] [5 4 3 2 1])
 => -2.0

Bugs: not an unbiased estimator of population variance.

flatten-pair-coll

(flatten-pair-coll coll)
COLL is a collection of pairs [v w] where v is a value and w its
weight. For maps, keys are vs and values are ws.  Returns a lazy
seq of of the expansions as a single collection, with w repetitions
of v for each such [v w] pair.

freqn

(freqn n coll)
Frequencies of n-grams (aka kmers, "features", et. al.) in
collection COLL treated as a sequence.  n is the "window" or
resolution width.  Slide is always fixed at 1 (one) position.

 Ex: (freqn 2 "acagtcaacctggagcctggt")
 =>
 {"aa" 1, "cc" 2, "gg" 2, "ac" 2, "ag" 2, "gt" 2,
 "tc" 1, "ct" 2, "tg" 2, "ga" 1, "gc" 1, "ca" 2}

freqs&probs

(freqs&probs n coll freq-fn)
Generate frequencies of items in coll by means (freq-fn n coll),
take those and compute the items' corresponding probabilities.
Return triple: [freqs probs cnt], where freqs and probs are maps
keyed by item and cnt is the total count of all items.  Lower level
functional base for more public functions freqs-probs, et.al.

freqs-probs

(freqs-probs n coll)
freqs&probs for n and coll with freq-fn equal freqn

geometric-dist

(geometric-dist N p)
Geometric probability distribution.  P is the probability of an
event.  Return seq of pairs [k pr], where k in (range N) and pr is
probability of success after k failures.

geometric-pdf

(geometric-pdf p)
Return the geometric probability mass/distribution _function_
corresponding to p, the probability of a hit in some Bernoulli
trial.

The resulting function returned, call it gmpdf, is a function of
one parameter, the number of occurances K.  NOTE: gmpdf is only
defined for integer values.

So, if the result of (geometric-pdf 1/4) is bound to gmpdf,
then (gmpdf 3) would be the probability of getting a hit after 3
tries.  A distribution of gmpdf can then be generated via some
technique like:

(map pspdf (range 10))

joint-prob-x

(joint-prob-x Omega X Y & {constr :constr, :or {constr true}})
Xperimental.  Not sure how to make this work in general while still
being useful.  Intent would be to supply sample space, random
variables X & Y (as functions - including maps & vectors), that
obey and/or define a constraint CONSTR, which may involve further
random variables.  Generate all joint occurances and from there the
joint probabilities.

joint-probability

(joint-probability combinator sym? coll)(joint-probability combinator sym? coll1 coll2)(joint-probability combinator sym? coll1 coll2 & colls)
Given a set of collections c1, c2, c3, .. cn, and combinator, a
function of n variables which generates joint occurances from {ci},
returns the joint probability distribution.  If sym? is true
coalesce reversable element occurances.

Ex: (apply joint-probability
           transpose true (take 2 (seq/rotations "GGCGGAAGACCGCCUCGA")))
=> {[\U \C] 0.11111111, [\A \G] 0.2777778, [\C \G] 0.2777778,
    [\A \C] 0.055555556, [\A \A] 0.055555556, [\G \G] 0.11111111,
    [\C \C] 0.11111111}

Ex: (joint-probability #(combins 2 %) true "GGCGGAAGACCGCCUCGA")
=> {[\C \C] 0.09803921568627451, [\G \G] 0.13725490196078433,
    [\A \A] 0.0392156862745098, [\A \C] 0.1568627450980392,
    [\C \G] 0.27450980392156865, [\A \G] 0.1830065359477124,
    [\U \A] 0.026143790849673203, [\G \U] 0.0457516339869281,
    [\U \C] 0.0392156862745098}

Ex: (joint-probability
      (fn[X Y]
        (for [x X]
          (if (even? x)
              [:e (.charAt Y 0)]
              [(second (div x 3)) (rand-nth (rest Y))])))
      false
      (take 100 (iterate inc 0)) "abcde")
=> {[2 \b] 0.04, [1 \b] 0.05, [2 \c] 0.04, [0 \b] 0.07, [1 \c] 0.05,
    [2 \d] 0.04, [0 \c] 0.05, [1 \d] 0.05, [2 \e] 0.04, [0 \d] 0.04,
    [1 \e] 0.02, [0 \e] 0.01, [:e \a] 0.5}

JPSxy

(JPSxy combinator coll)(JPSxy combinator coll & colls)
Joint Probability Symmetric: joint-probability with sym? true

JPxy

(JPxy combinator coll)(JPxy combinator coll & colls)
Joint Probability non symmetric: joint-probability with sym? false

keysort

(keysort map)

letter-pairs

(letter-pairs n s)

mean

(mean coll)(mean x1 x2 & xs)
Compute the expectaion of the collection COLL.  If coll is a number
keyed frequency map, computes as if seq of expansion of keys
weighted by frequency.  If coll is a simple map use (vals coll).
Also coll can be a collection of pairs [n w], where w is the weight
to assign to n.  Lastly, if single numbers are given, their weight
is taken as simply 1.  Effectively returns:

  let c* (flatten-pair-coll coll)
      cnt (count c*)
    (/ (sum c*) cnt)

But without computing actual weighted sequence c*.

median

(median coll)(median x & xs)
Compute the median of the given collection COLL.  If coll is a
collection of number pairs [v w], where w is the weight for v,
effectively computes its median based on expansion of keys weighted
by frequency.  This includes frequency maps where keys are numbers
and values are their weights.  If coll is a simple map computes the
median of (vals coll).

num-key-freq-map?

(num-key-freq-map? x)
Returns true if X is a number keyed frequency map

p-value

(p-value d hits trials)
Compute p-value from distribution D and sample statistics x y

pair-coll?

(pair-coll? x)
Returns true if X is a collection of number pairs.

pdsum

(pdsum f pdf1 pdf2 & pdfs)
NOT CURRENTLY USED.  REVISIT AND POSSIBLY ELIMINATE

pearson-correlation

(pearson-correlation X Y)
Often used synonym of correlation.  See correlation for details.

poisson-cdf

(poisson-cdf mu)
Return the Poisson cumulative probability _function_ corresponding
to mu, the mean or expected value of a random variable X over some
time/space interval.  In the Poisson case, the mean is also the
variance of X.

The resulting function returned, call it pscdf, is a function of
two parameters: lower limit l, and upper limit u, both >= 0.  l and
u-1 are the minimum and maximum number of occurances, i.e., the
interval is the half open [0, u) interval.

Ex: If 2 bugs/week are introduced on average in statware, what is
the probability there will be less than 5 bugs introduced next
month.  So, mu = 2/week * 4 week = 8, l = 0 and u = 5:

((poisson-cdf 8) 0 5) => 0.0996 or about 10% chance.  Of course,
those dudes aren't using Clojure! :)

poisson-dist

(poisson-dist N mu)
Poisson probability distribution. 

poisson-pdf

(poisson-pdf mu)
Return the Poisson probability mass/distribution _function_
corresponding to mu, the mean or expected value of a random
variable X over some time/space interval. In the Poisson case the
mean is also the variance of X.

The resulting function returned, call it pspdf, is a function of
one parameter, the number of occurances K.  NOTE: pspdf is only
defined for integer values.

So, if the result of (poisson-pdf 1.0) is bound to pspdf,
then (pspdf 3) would be the probability of getting 3 hits.  A
distribution of pspdf can then be generated via some technique
like: (map pspdf (range 1 (inc 100))).

poisson-sample

(poisson-sample mu)(poisson-sample N mu)
Generate a "random" sample from a Poisson distribution
characterized by mean/variance mu.  In the two parameter case,
generate N such samples.

probs

(probs n coll)(probs freq-dist-map)
Probabilities for items from coll taken n at a time (see freqn) or
as determined by the frequency dist map freq-dist-map.

pXY|y

(pXY|y XY y)
Given a total conditional distribution XY (see cond-probability),
 return the distribution of XY with respect to y.  XY should be a
 map of maps of the form returned from cond-probability.  Typically,
 this would be used in conjunction with cond-probability, but need
 not be as long as the form of the map of maps is the same, where
 each inner map is the joint distribution conditioned on an element.

 Ex:

 (pXY|y (cond-probability transpose false #(= %1 (.charAt %2 1))
                          "defdeef" "abcaabb")
        \a)
=> {"ea" 0.3333333333333333, "da" 0.6666666666666666}

sampling

(sampling cnt dist)
First small step toward 'generic' sampler.  CNT is the size of the
sample to create.  DIST is some distribution over some event
space (aka omega).

std-deviation

(std-deviation coll)(std-deviation coll & {:keys [distfn avgfn v], :or {distfn -, avgfn mean}})
Compute the standard deviation of collection COLL.  If coll is a
map uses (vals coll).  Returns the sqrt of the variance of
coll. The functions distfn and avgfn are used for distances of
points in coll and the averaging function for the mean of
coll. These are used to compute the variance, or if V is given it
is taken as the variance and variance computation is skipped.  For
the single parameter case, distfn is '-' and avgfn is 'mean'.

variance

(variance coll)(variance coll & {:keys [distfn avgfn m rfn], :or {distfn -, avgfn mean, rfn map}})
Compute the variance of the collection COLL.  If coll is a map
use (vals coll).  Returns average of squared differences of xs in
coll with 'mean' of coll.  The functions distfn and avgfn are used
for distances of points in coll and the averaging function.  Or,
the 'mean' can be given explicitly as M.

The rfn function is the 'reducer' function to use for calculating
the set of squared differences.  It defaults to 'map', but the user
can change this to some variant of fold, most typically vfold, to
parallelize the computation for large collections and/or expensive
distfns (such as various RE functions).

The single parameter case uses '-' as distfn and 'mean' as avgfn
and 'map' as the reducer.

word-letter-pairs

(word-letter-pairs n s)