The math
The Poisson distribution is similar in many ways to the more familiar normal distribution. The normal distribution has a mean. The rough equivalent of the mean for a Poisson distribution is its rate parameter, called the "gamma" here (it is more conventionally written as lambda). If you know the mean of a normal distribution, you know what outcome is most likely. The "gamma" of the Poisson distribution gives us roughly the same information.

The main difference between a Poisson and a normal distribution is that the Poisson only allows non-negative integer values (0, 1, 2, and so on). For example, if we describe temperature with a normal distribution, we could have a temperature of -10, or of 22.5. The Poisson distribution allows neither negative nor fractional values, so it would not be appropriate for describing temperatures.

Suppose we wanted to study ATM usage, and we wanted a distribution to describe how many customers arrive in a given 5-minute period. The Poisson distribution is ideal for this. We can never have a negative or fractional number of arrivals. If arrivals are random and independent, and the average rate does not change, then a Poisson distribution will describe them. Say we find that on average 3 customers arrive in a 5-minute period; then our "gamma" will be 3. The distribution can then tell us the probability that no customers will arrive in the next 5 minutes, or the probability that 10 will arrive. Counts near 3 are the most likely, and the greater the departure from this average, the lower the probability of occurrence.
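
To make this concrete, here is a minimal sketch (in Python, my choice of language rather than anything used in the study) of the Poisson probability formula, P(k) = e^(-gamma) × gamma^k / k!, applied to the ATM example with a gamma of 3:

    import math

    def poisson_prob(k, gamma):
        # Probability of seeing exactly k events when the average rate is gamma.
        return math.exp(-gamma) * gamma**k / math.factorial(k)

    gamma = 3  # an average of 3 customers per 5-minute period
    print(poisson_prob(0, gamma))   # no arrivals: about 0.05
    print(poisson_prob(3, gamma))   # exactly 3 arrivals: about 0.22
    print(poisson_prob(10, gamma))  # 10 arrivals: about 0.0008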

In the study we are looking at the number of occurrences of vocabulary items. These cannot be negative or fractional, so a Poisson distribution is appropriate. Suppose we know that the average absolute frequency of the word "cat" per 1000 words of ancient Greek text is 3. This would be our gamma. We could then calculate the probability that any given 1000 words of Greek text would actually contain, say, 10 occurrences of the word "cat".
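Using the same formula as above, that probability would be e^(-3) × 3^10 / 10!, which works out to roughly 0.0008, or less than one chance in a thousand.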

In the maximum likelihood procedure, we turn this process around. We are in a situation where we know we have counts that should follow a given distribution (Poisson in this case), and we know the actual counts. We want to find the best "gamma" to describe what we actually see. We do this by finding the gamma that maximizes the probability of the actual results occurring. In our ATM example, if we observe 10 customers, then 12, then 8, it is more likely that our average rate of arrival (our "gamma") is 10 than, say, 2. If our average rate of arrival were 2, it would be highly unlikely to see 10, 12, 8. But if our average rate of arrival is 10, then 10, 12, 8 is much more likely.
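
A small sketch of this idea, again in Python and with the made-up arrival counts from the example: it evaluates the (log of the) probability of seeing 10, 12, and 8 under several candidate gammas, and the best candidate turns out to be the sample average, 10.

    import math

    def log_likelihood(counts, gamma):
        # Log of the probability of seeing all of these counts,
        # if each one follows a Poisson distribution with this gamma.
        return sum(-gamma + k * math.log(gamma) - math.log(math.factorial(k))
                   for k in counts)

    observed = [10, 12, 8]
    for gamma in (2, 5, 10, 12):
        print(gamma, log_likelihood(observed, gamma))
    # The value peaks at gamma = 10, the average of the observed counts,
    # which is exactly the maximum likelihood estimate for a Poisson.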

In the study we first use the relative frequency of each vocabulary item in all the synoptic categories combined as our gammas. We then calculate how likely the actual observed frequency of each vocabulary item in a given test category is. In other words, based on an average of 2 cats per 1000 words in the synoptics as a whole, we should expect about 2 cats per 1000 words in the test category, and on that basis we can calculate the probability of the actual count we do see.
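
A minimal sketch of this first step, with a tiny made-up word list; the words, rates, and category size are hypothetical, not figures from the study. Each word's gamma is its overall synoptic rate scaled to the size of the test category, and the log-probabilities are summed over the whole vocabulary:

    import math

    def poisson_log_prob(k, gamma):
        # Log-probability of exactly k occurrences given an expected count gamma.
        return -gamma + k * math.log(gamma) - math.log(math.factorial(k))

    # Hypothetical relative frequencies (occurrences per 1000 words) in the
    # synoptics as a whole, and hypothetical observed counts in a test
    # category that happens to be 2000 words long.
    synoptic_rate = {"cat": 2.0, "dog": 0.5, "house": 4.0}
    observed_in_test = {"cat": 5, "dog": 0, "house": 9}
    test_category_words = 2000

    baseline_log_likelihood = 0.0
    for word, rate in synoptic_rate.items():
        gamma = rate * test_category_words / 1000  # expected count in this category
        baseline_log_likelihood += poisson_log_prob(observed_in_test[word], gamma)

    print(baseline_log_likelihood)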

Next we ask, "If we allow information from a predictor category to be used, can we get a better distribution?" That is, can we find a distribution which makes the observed results more likely? We allow ourselves to adjust only one parameter, beta, which is a weighting factor. We will combine the relative frequency of each word in the synoptics as a whole with its relative frequency in a predictor category, and beta sets how much weight each factor gets. For example, say the synoptics overall have 2 cats per 1000 words, and the predictor category has 6 cats per 1000 words. If our test category has 4 cats per 1000 words, then a 50% weighting of the relative frequency in the synoptics and a 50% weighting of the relative frequency in the predictor category would give us the best estimate of the relative frequency in the test category.

The catch is that we have only one adjustable factor, and over 800 vocabulary items. The ideal weighting will vary by vocabulary item, but we want to find the one weighting that maximizes the probability of the actual observed number of occurrences across all 800+ vocabulary items at once. For example, we might find that giving the predictor category a 20% weight was better than a 21% weight, or a 19% weight, where "better" is measured by calculating the probability of getting all 800+ actual results in the test category under the different possible weightings.
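
Here is my own simplified sketch of how such a one-parameter search might look. The straight linear blend of the two rates, and all of the numbers, are assumptions for illustration; the study's actual model may combine them differently.

    import math

    def poisson_log_prob(k, gamma):
        return -gamma + k * math.log(gamma) - math.log(math.factorial(k))

    # Hypothetical rates per 1000 words and hypothetical observed counts.
    synoptic_rate = {"cat": 2.0, "dog": 0.5, "house": 4.0}
    predictor_rate = {"cat": 6.0, "dog": 0.25, "house": 3.0}
    observed_in_test = {"cat": 5, "dog": 0, "house": 9}
    test_category_words = 2000

    def total_log_likelihood(beta):
        # Log-likelihood of the whole test category when every word's rate is a
        # beta-weighted blend of the predictor rate and the overall synoptic rate.
        total = 0.0
        for word in synoptic_rate:
            blended = beta * predictor_rate[word] + (1 - beta) * synoptic_rate[word]
            gamma = blended * test_category_words / 1000
            total += poisson_log_prob(observed_in_test[word], gamma)
        return total

    # Try every weighting from 0% to 100% in 1% steps and keep the best one.
    best_beta = max((b / 100 for b in range(101)), key=total_log_likelihood)
    print(best_beta, total_log_likelihood(best_beta))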

To summarize, we have three pieces of information.
1) The relative frequency of 800+ vocabulary items in all the synoptic categories combined.
2) The relative and absolute frequency of 800+ vocabulary items in a test category.
3) The relative frequency of 800+ vocabulary items in a predictor category.
We know that making use of #1 can give us a distribution that makes predictions about #2.
What we want to know is whether a weighting of #1 and #3 can give us a distribution whose predictions about #2 are closer to reality.

The way we quantify how much better the predictions are when using both #1 and #3, compared with #1 alone, is the likelihood ratio test. It looks at the number of parameters we adjusted (one), and at the probability of the actual results occurring under each set of assumptions. It tells us the probability that any observed improvement was just random chance. I won't try to describe the rationale for this test; I'll instead direct the truly interested to mathematical references or texts.
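
As a rough sketch of how that comparison could be computed (the two log-likelihood values here are hypothetical, and the chi-square form with one degree of freedom is simply the standard shape of the test when exactly one extra parameter was adjusted):

    import math

    # Hypothetical log-likelihoods, not figures from the study:
    # LL0 uses the synoptic rates alone (#1); LL1 uses the best
    # beta-weighted blend of #1 and #3.
    LL0 = -2510.4
    LL1 = -2498.7

    # Twice the improvement in log-likelihood is compared against a chi-square
    # distribution with one degree of freedom, because the richer model has
    # exactly one extra adjustable parameter (beta).
    statistic = 2 * (LL1 - LL0)

    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))

    print(statistic, p_value)  # a small p-value means the observed improvement
                               # is unlikely to be random chance alone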