Friday, August 5, 2016

GOF - the goodness of fit formula [a work in progress]

I think this is right, and took the time to draw it. (Actually it is not; see below.)

We consider all possible assignments of words of text to slots of a narrative pattern, including not using some of the slots. We are interested in assignments that use as many slots as possible. Let p_used be the number of slots used in a given assignment and let |text| be the number of words of text. Let "delta i" be the difference between the last and first indices of words that are used, plus one. Then a measure of goodness of fit (or "GOF") for the assignment is:

GOF = (p_used / |text|) * (p_used / delta i)

This rewards having extra slots that match but does not penalize extra slots that go un-matched. To do that, pre-multiply by the pattern length divided by the text length.
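To make the drawn formula concrete, here is a minimal Python sketch; the function name gof_v1 and the assumption that an assignment is passed as a list of matched text word indices are mine, not from the actual code.

def gof_v1(matched_indices, text_len):
    # matched_indices: text word indices consumed by the assignment (hypothetical input format)
    # text_len: |text|, the number of words in the text
    if not matched_indices or text_len == 0:
        return 0.0
    p_used = len(matched_indices)                              # words of text used
    delta_i = max(matched_indices) - min(matched_indices) + 1  # span of the used words
    return (p_used / text_len) * (p_used / delta_i)

# Example: words 2, 3, and 5 of a 10-word text are matched:
# (3/10) * (3/4) = 0.225
print(gof_v1([2, 3, 5], 10))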
Update: as usual, there is a bit of confusion as something like this starts to finalize. In fact, the formula confuses the number of slots used (out of the total available in the narrative pattern) with the number of indices used (in the many-to-many mapping of slots to indices) out of the available indices in the text. If we take p_used in the above formula to be the number of text token indices used in pattern matching, then the missing piece that penalizes un-used slots is the factor (u/n), where u is the number of slots of the pattern that are used and n is the total number of slots available. So, less elegantly but more correctly, we can let
|p_used|= num pattern slots used 
|p| = num pattern slots available
|text_used| = num text token indices used
|text| = num text token indices available
di = (last index used - first index used + 1)
then define
GOF = ( |p_used|/|p| )  *  (|text_used|/|text|) * (|text_used|/ di)
The first factor measures how much of the pattern is used, the second measures how much of the text is consumed, and the third measures how tightly clustered that use is. But note that the "use" of the pattern may be smeared out over the text.
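Here is a minimal sketch of the corrected GOF in Python; the function name and the representation of an assignment as (slot count, list of used text indices) are my assumptions, not the actual implementation.

def gof(num_slots_used, num_slots, used_indices, text_len):
    # first factor  (|p_used|/|p|):       how much of the pattern is used
    # second factor (|text_used|/|text|): how much of the text is consumed
    # third factor  (|text_used|/di):     how tightly clustered the use is
    if not used_indices or num_slots == 0 or text_len == 0:
        return 0.0
    text_used = len(used_indices)
    di = max(used_indices) - min(used_indices) + 1
    return (num_slots_used / num_slots) * (text_used / text_len) * (text_used / di)

# Example: 3 of 4 pattern slots used, matching words 2, 3, and 5 of a 10-word text:
# (3/4) * (3/10) * (3/4) = 0.16875
print(gof(3, 4, [2, 3, 5], 10))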

Currently (in September): I am favoring one optimization involving u = numSlotsUsed() of a narrative, n = slotsUsed() (the total number of slots available), r = the number of words read, and f = lastwordread - firstwordread + 1. The formula for goodness of fit is gof = (u/n)*(r/f). A different version, used in recursion, is simply u, along with trying to read as many words as possible.
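A sketch of that September variant, under the assumption that n is the total slot count and r counts the words actually consumed; note it drops the (text consumed / text length) factor of the earlier GOF.

def gof_sept(u, n, r, firstwordread, lastwordread):
    # u = number of narrative slots used, n = total slots available
    # r = words read, f = lastwordread - firstwordread + 1
    f = lastwordread - firstwordread + 1
    if n == 0 or f <= 0:
        return 0.0
    return (u / n) * (r / f)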
