Thursday, December 8, 2016

Probability of a subsequence

I am a little surprised to find no quick online answer for the question: what is the probability of finding a fixed 'search for' sequence of length K within a longer 'search in' sequence of length N? 
Assume a common alphabet - say - {0,1}. The total number of sequences of length N is 2^N [actually more like (2^N - K)]. The number of these that begin with the 'search for' sequence is 2^(N-K). Call that set "D"

Consider applying the shift operator to the elements of D and how D has an orbit, within the whole space of sequences, that is the same as the set of sub sequences of interest.

The total number of things you get that way divided by 2^N is the desired probability. I suppose it is complicated because not all sequences have the same orbit size, so depending on where you start in D you get a different orbit length. So actually, rather than caring how to count these things, we should be more interested in how they fit together geometrically. You are looking at the points of a discrete surface within a discrete volume, so counting points may not be as interesting as other geometric properties of the sets. Not going to figure it out though. I see why now.

However we can say that the maximum orbit size for a point in D, under shifts, is N - cuz that is how many possible shifts are available. So the probability is < N*2^(N-K) / 2^N . So an estimate is:
probability < N / 2^K
Unfortunately this is a lousy estimate. It misses the subtler point that periodicity of 'search for' within 'search in' must bring the numerator down a lot.
Update:  If N is a prime number then only a constant sequence like 0000000 can be self similar and have an orbit under the shift operator that is less than 2^N - K long.
Update: The proof is that is shift^n has a fixed point, then the subgroup with this  fixed point would have order dividing a prime.

1 comment:

  1. Ultimately I concluded that the problem was as hard to solve as the usual unsolved problems in number theory - it is just a different formulation with its own convenience(s)

    ReplyDelete