Sunday, July 31, 2016

Goodness of fit metric for text matching

I believe it involves the quantity:

g = (num words read)^2 / (last - first + 1)
where a pattern is used to match text, and 'first' and 'last' are the first and last indices of words read in the text.
Here is why. Increasingly I think pattern matching should require every part of the pattern to be filled in some way, so the number of words consumed will just be a function of how many words are in the pattern. Hence (num words read) / (last - first + 1) simply measures how spread out those words are in the incoming text: it is 1 when they are contiguous and shrinks as gaps open up between them. The additional factor of (num words read) in the numerator gives greater weight to longer patterns. You might want to consider G = g / (num words in text), so that the quantity is never more than 1. This works because (last - first + 1) is at least (num words read), so g is at most (num words read), which in turn is at most (num words in text).
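
To make the arithmetic concrete, here is a minimal Python sketch of g and G. The function name goodness_of_fit and its arguments are my own for illustration, and it assumes the matcher reports the (distinct) indices of the text words it consumed; it is not tied to any particular matcher.

```python
def goodness_of_fit(matched_indices, num_words_in_text):
    """Return (g, G) for one pattern match.

    matched_indices: positions in the text of the words the pattern read
                     (hypothetical input; assumed to come from the matcher).
    num_words_in_text: total number of words in the incoming text.
    """
    if num_words_in_text <= 0:
        raise ValueError("text must contain at least one word")

    indices = set(matched_indices)  # count each text position once
    if not indices:
        return 0.0, 0.0  # nothing was read, so there is no fit to score

    num_read = len(indices)
    first, last = min(indices), max(indices)
    span = last - first + 1  # width of the window the match occupies

    g = num_read ** 2 / span   # density (num_read / span) weighted by num_read
    G = g / num_words_in_text  # normalised so it never exceeds 1
    return g, G


# Three words read from a ten-word text, first contiguous, then spread out.
tight_g, tight_G = goodness_of_fit([2, 3, 4], 10)   # span 3, g = 9/3
loose_g, loose_G = goodness_of_fit([2, 5, 8], 10)   # span 7, g = 9/7
```

In this example the contiguous match scores g = 3 and the spread-out one scores g = 9/7, so the same number of words read is rewarded more when the words sit close together.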
