|Next we move from a description of the data to a description of the analysis. The method used to test for a relationship between the categories requires some more advanced statistical knowledge. A mathematical reference is provided below, and I will attempt to go through the spreadsheet in some detail as well. However, I'd first like to give an overview of what it is doing.
First some definitions. The word "word" is somewhat ambiguous for our purposes here. So to refer to the rows in the spreadsheet, I will use the term "vocabulary item". To refer to a single occurrence of the vocabulary item, I'll say "occurrence of the word or vocabulary item".
"Absolute frequency" is a count of the number of times a given vocabulary item appears in a given synoptic category.
"Relative frequency" is the number of times a given vocabulary item occurs in a category divided by the total number of occurrences of all vocabulary items in the category. So, if the category "222", "triple agreement" contains 2000 individual occurrences of words, and there are 2 occurrences of "cat" in category "222", then the relative frequency of "cat" in the category "222" is 2/2000.
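The arithmetic of the definition above can be sketched in a few lines of Python; the counts here are the hypothetical "cat" numbers from the example, not data from the study.

```python
# Hypothetical counts, mirroring the example above.
total_occurrences = 2000   # all word occurrences in category "222"
count_cat = 2              # occurrences of "cat" in category "222"

# Relative frequency: occurrences of the item divided by all occurrences
# of all vocabulary items in the category.
relative_frequency = count_cat / total_occurrences
print(relative_frequency)  # 0.001, i.e. 2/2000
```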
The goal is to determine whether the relative frequency of each vocabulary item in each of two categories, say A and B, is significantly more like the relative frequency of that item in the other category than it is like its relative frequency in the synoptics as a whole. For example, if the relative frequency of "cat" were 1/1000 in category "222", 1/900 in category "221", and 1/500 in the synoptics overall, then we could say that in terms of the relative frequency of the vocabulary item "cat", categories "222" and "221" are more like each other than they are like the synoptics overall. The argument is that categories that are similar in this way, across a large number of vocabulary items, were probably originally authored by the same hand. We will be able to give, quantitatively, the probability that the observed similarity is due to random chance alone.
Specifically, the study first compared the relative frequency of each vocabulary item in each category to the relative frequency of that item in all categories combined. Then, for each possible pair of categories, the study determined whether information about the relative frequency of words in category B could help predict the relative frequency of words in category A. The method used is known as the maximum likelihood method, with a likelihood ratio test. It involves first testing how well the data fit one distribution ("Poisson" in this case), and then testing another distribution with an extra free parameter, to see if the fit is significantly better.
The process is exactly the same as the one used in "logistic regression", except that a Poisson distribution (for non-negative integer counts) is used instead of a Bernoulli distribution (for binary, "yes/no" outcomes).
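To make the likelihood-ratio idea concrete, here is a minimal stdlib-only sketch. The per-block counts and the null rate are made-up numbers, and the "extra free parameter" is simplified to refitting the rate from category A's own data; the study's actual models are in the spreadsheet.

```python
import math

def poisson_loglik(counts, lam):
    """Sum of Poisson log-probabilities: k*log(lam) - lam - log(k!)."""
    return sum(k * math.log(lam) - lam - math.lgamma(k + 1) for k in counts)

# Hypothetical per-block counts of one vocabulary item in category A.
counts_a = [0, 1, 0, 2, 1, 0, 0, 1]

# Null model: rate fixed from all categories combined (assumed value).
lam_null = 0.3

# Alternative model: rate refit with the extra free parameter
# (here, simply the mean of category A's own counts).
lam_alt = sum(counts_a) / len(counts_a)

# Likelihood-ratio statistic: twice the gain in log-likelihood.
lr_stat = 2 * (poisson_loglik(counts_a, lam_alt) - poisson_loglik(counts_a, lam_null))
```

A larger `lr_stat` means the extra parameter improved the fit more than the null model could manage, which is exactly the "significantly better fit" question the text describes.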
I'll try to give an idea of what these statistical methods do here.
But on this page, I'll just continue through to the final result.
In the first case, the distribution was a Poisson distribution, with its gamma (rate) parameter determined by the data from all categories. The likelihood that the observed relative frequencies found in category A would occur under that distribution was then calculated. In the second case the distribution was also Poisson, but its gamma (rate) parameter was fit using both information about the relative frequency of vocabulary items in all the categories together and the relative frequency of vocabulary items in a second category, B.
The end result is that if the relative frequencies of vocabulary items in categories A and B are more like each other than they are like the relative frequencies of vocabulary items in all categories combined, then the test will tell us this, and will also tell us how likely it is that the observed relationship is just random chance.
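The final "how likely is this just chance" number comes from comparing the likelihood-ratio statistic to a chi-square distribution with one degree of freedom (one extra free parameter). A stdlib-only sketch, where the 3.9 is a made-up statistic rather than a result from the study:

```python
import math

def chi2_sf_1df(x):
    """P(chi-square with 1 df > x): the chance a likelihood-ratio statistic
    at least this large would arise by random chance alone, via Wilks'
    theorem for one extra free parameter."""
    return math.erfc(math.sqrt(x / 2))

lr_stat = 3.9                    # hypothetical likelihood-ratio statistic
p_value = chi2_sf_1df(lr_stat)   # just under 0.05: unlikely to be chance alone
```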
Again, for my attempt at explaining the math, go here.
|A good reference text that gives details of the mathematics used here is "Econometric Analysis". Relevant pages can be found here.
A reference with similar methods used for similar purposes is -
M.P. Oakes, "Statistics for Corpus Linguistics" (Edinburgh: Edinburgh University Press, 1998)
see section 5.2.5 for "distinctiveness ratios", and the section covering work with a Poisson distribution and a likelihood ratio test. Relevant pages can be found here.
|Back to synoptic main page|