Complex Sentences

© Wayne Paul Amsbury             31 May 2002

 

An English equivalent of a proper name uses an article or preposition, as in: A boy. This adds one count, together with an additional convention. The basis for the convention is the intuitive belief that fragments such as the long road are fused as they are encountered, no matter where they occur in a sentence, and are then placed in the context of a rule path when a marker is encountered. (Consider: The long road on the ridge …, where on is a marker.) Hence three cache entries are encountered as tokens, but the first token of the next fragment (on) triggers the fusing of the previous fragment, and finally the fused unit is cached.

Convention 3: A fragment is typically processed by fusing words as they are encountered within the fragment.
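Convention 3 can be illustrated with a minimal sketch. It assumes that a small set of marker words closes whatever fragment is currently open; only on is named as a marker in the text, so the rest of the MARKERS set, and the function name fuse_fragments, are hypothetical. The sketch shows segmentation only and makes no attempt to assign cache counts.

```python
# Hypothetical marker set: the essay names "on" as a marker; the other
# entries are assumptions added so the sketch has something to work with.
MARKERS = {"on", "of", "at", "in", "near"}


def fuse_fragments(words, markers=MARKERS):
    """Fuse words as they are encountered; a marker closes the open fragment."""
    cached = []   # fused units and markers, in order of completion
    pending = []  # words of the fragment currently being fused
    for word in words:
        if word.lower() in markers:
            if pending:
                cached.append(" ".join(pending))  # fuse, then cache the unit
                pending = []
            cached.append(word)  # the marker itself is cached separately
        else:
            pending.append(word)
    if pending:
        cached.append(" ".join(pending))  # end of input closes the last fragment
    return cached


units = fuse_fragments("The long road on the ridge".split())
print(units)  # ['The long road', 'on', 'the ridge']
```

The fragment the long road is held open until on arrives, which is the behavior the convention describes.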

A very simple example is an English sentence that contains multi-word fragments, such as: cap-the boy hit a girl period.

         cap-the 2

         (The boy) 1

[At this point, the next token encountered, after its recognition as a verb, signals the end of the subject phrase and causes the phrase to be uncached in order for it to be fused.]

         (The boy)-SU                                         1

         (The boy)-SU │ hit-V                            1

         (The boy)-SU │ hit-V │a                       1

         (The boy)-SU │ hit-V │ (a girl)-OB       2         [verb retrieval]

         The boy hit a girl.                                    1

This has seven tokens, nine counts, and a burden of 9/7 = 1.29. The expressive precision bought by articles is paid for in counts, but lowers the burden, just as did the extra tokens in Japanese, Navaho and Mohawk.
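The arithmetic behind this figure can be checked with a short sketch. The per-step counts are copied from the trace above; the names trace, tokens, and burden are illustrative only.

```python
# Per-step cache counts copied from the trace of "cap-the boy hit a girl period".
trace = [
    ("cap-the", 2),
    ("(The boy)", 1),
    ("(The boy)-SU", 1),
    ("(The boy)-SU | hit-V", 1),
    ("(The boy)-SU | hit-V | a", 1),
    ("(The boy)-SU | hit-V | (a girl)-OB", 2),  # includes verb retrieval
    ("The boy hit a girl.", 1),
]
tokens = ["cap", "the", "boy", "hit", "a", "girl", "period"]


def burden(counts, n_tokens):
    """Caching burden: cache counts divided by token count."""
    return counts / n_tokens


total_counts = sum(count for _, count in trace)
print(total_counts, len(tokens), round(burden(total_counts, len(tokens)), 2))
# 9 7 1.29
```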

Hypothesis: Language structures that increase precision increase the cache count but lower the caching burden.

Delays for vocabulary searches are not taken into account by the burden, and neither is the search for structure; an encounter with a noun is not necessarily an encounter with a subject (in English at least). For comparison:

         Cats eat mice. [7/5 = 1.4]

         The boy hit a girl. [9/7 = 1.29]

         John hit Mary. [8/6 = 1.33]

However, other features of particular languages impact these measures.

In English, verb forms are efficient in that they can be unilaterally complete: was stunned is three tokens (was stun -ed) and requires three counts to discern as a unit, fused as encountered. An alternative view is to cache was separately until stunned is complete, then put the verb phrase together, for a total of four counts. However, the suffix is not what signals that was and some form of stun- go together; that is known as soon as the verb stem is encountered. We give it three counts.

Complex sentences. The effect of parameters is to regulate the formation of complex structures, and these tend to have a higher burden, particularly if they omit commas, connectives, and the like. Care must be taken over what is fused as encountered.

Prepositional phrases in English require a different treatment from verb phrases and noun phrases. Consider: of a flute. It might seem that this prepositional phrase should be cached as: │ of a flute │ [3/3 = 1.00 plus agreement], but when it is put in the context of (of a flute near at hand), it is clear that near at hand modifies simply a flute, not of a flute. Hence the preposition needs to be cached separately from its object. In the interior of a sentence:

         (of │ (a flute)-P.OB)-tag │ (near │ (at hand)-P.OB)-(a flute) │     [10/7 = 1.43]

English provides for great flexibility with respect to word order. Consider the sentences:

         1. We met the dawn on the road at first light.

         2. On the road at first light we met the dawn.

An abbreviated tracing of them is:

         Sentence 1                                          Sentence 2

         cap-we-SU │ met-V         4                         (cap-on │ the road)      5

         (the dawn)-OB             4                         (at │ first light)       4

         (on │ the road)-met       5                         (we)-SU                  2

         (at │ first light)-met    5                         met-V                    1

         sentence                  1                         (On the road)-met        1

                                                             (at first light)-met     1

                                                             (the dawn)-OB            4

                                                             sentence                 1

The caching occurs in rather different ways in these two sentences, but both have 12 tokens and a burden of 19/12 = 1.58. Sentences that appear to be more complex than this, however, have a lower burden. If commas are added to these sentences, (On the road, at first light, we met the dawn.), the commas invoke a simpler vocabulary search than some markers, and they do not need to be cached. Thus commas increase the token count and decrease the burden, in this case to [19/14 = 1.36].
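The effect of the commas can be sketched numerically. The per-step counts are taken from the abbreviated tracing of the two sentences; the uncached parameter is an illustrative name for tokens (here, the two commas) that add to the token count without adding cache counts.

```python
# Per-step counts from the abbreviated tracing of each sentence.
first = [4, 4, 5, 5, 1]            # "We met the dawn on the road at first light."
second = [5, 4, 2, 1, 1, 1, 4, 1]  # "On the road at first light we met the dawn."


def burden(counts, n_tokens, uncached=0):
    """Cache counts per token; uncached tokens enlarge only the denominator."""
    return counts / (n_tokens + uncached)


print(sum(first), sum(second))                         # 19 19
print(round(burden(sum(second), 12), 2))               # 1.58
print(round(burden(sum(second), 12, uncached=2), 2))   # 1.36
```

Since the numerator is unchanged, every uncached token can only lower the ratio, which is the sense in which commas reduce the burden here.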

This example is in accord with my own experience that liberally placed commas speed up a sentence scan although they may annoy the cognoscenti, who do not feel the need.

An easily missed point is that a conjunction can cause the initial half of X and Y to be recognized as a whole; it acts like a period. For instance: We went to the store and then we went to a movie. This is cached essentially as would be two sentences, but the convention applied is that the and is cached, whereas a period is not.

It is not the intent here to define a formal system, but with some additional conventions, the count and burden of a complex sentence can be determined. Articles and leading modifiers fuse with the next token, commas are not cached, and various connectives signal the end of a fragment. One complex sentence used above was:

At one extreme, a parallel search for three components of a token path finds the token meanings simultaneously in one unit of time and infers their rule path in a single additional step.

This is a sequence of 36 tokens, composed into component fragments that are fused dynamically and then recognized as units when other tokens (such as: comma, a, for, finds, in, and, in, period) are encountered, with the help of one or more rule paths. Counting includes the recognition of the completion of a fragment as a cache operation when the termination invokes agreement; otherwise the termination merely begins a new cache partition. The most complex portion of this sentence is actually: in one unit of time, which must be broken down into:

         in │ ((one unit) │ (of │ time-P.OB)-unit)-in [8 counts]

Thus we have:

         (cap-at │ one extreme)                          5

         (a parallel search)                                   3

         (for │ three components)-search            4+1

         (of │ a token path)-search                     5+1

         (a parallel search)-finds                          1

         finds                                                      1

         (the token meanings)-finds                     3+1

         (simultaneously)-finds                             2+1

         (in one unit of time)-finds                        8+1

         and                                                        1+1

         infers                                                      1

         (their │ rule path)-infers                          4+1

         (in a single additional step)-infers             6+1

         period                                                     1

This gives a burden of 53/36 = 1.47.
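The total can be reproduced with a short sketch. It assumes that each "+1" in the listing is the extra count charged when the end of a fragment invokes agreement, which is my reading of the counting rule stated above. The tokenize helper is hypothetical: it treats the sentence-initial capital as one cap token and each punctuation mark as one token.

```python
import re

SENTENCE = ("At one extreme, a parallel search for three components of a token "
            "path finds the token meanings simultaneously in one unit of time "
            "and infers their rule path in a single additional step.")


def tokenize(sentence):
    """Hypothetical tokenizer: words, plus cap, comma, and period tokens."""
    tokens = []
    for piece in re.findall(r"[A-Za-z]+|[.,]", sentence):
        if piece == ",":
            tokens.append("comma")
        elif piece == ".":
            tokens.append("period")
        else:
            if piece[0].isupper():
                tokens.append("cap")
            tokens.append(piece.lower())
    return tokens


# (fragment, base count, agreement count), copied from the listing above.
trace = [
    ("(cap-at | one extreme)", 5, 0),
    ("(a parallel search)", 3, 0),
    ("(for | three components)-search", 4, 1),
    ("(of | a token path)-search", 5, 1),
    ("(a parallel search)-finds", 1, 0),
    ("finds", 1, 0),
    ("(the token meanings)-finds", 3, 1),
    ("(simultaneously)-finds", 2, 1),
    ("(in one unit of time)-finds", 8, 1),
    ("and", 1, 1),
    ("infers", 1, 0),
    ("(their | rule path)-infers", 4, 1),
    ("(in a single additional step)-infers", 6, 1),
    ("period", 1, 0),
]

tokens = tokenize(SENTENCE)
total = sum(base + agreement for _, base, agreement in trace)
print(len(tokens), total, round(total / len(tokens), 2))  # 36 53 1.47
```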

The considerations given above are involved in diagramming an English sentence as well as in determining its cache count. A sentence diagram may be taken as data about the mental process of interpretation of the sentence diagrammed.

A burden for English sentences of not much above 1.40 appears to hold for complex sentences as well. Consider these:

         The truth will set you free. [12/8 = 1.50]

 

         God save the Queen! [9/6 = 1.50]

 

         The sound of a flute sifted through the soughing of the wind in the trees from some bower deep in the woods, achingly, hauntingly, familiar. [46/31 = 1.48]

 

          For want of a shoe a horse was lost. [16/11 = 1.45]

 

          Ask not for whom the bell tolls, it tolls for thee. [20/14 = 1.43]

 

          The contralto cadence of her speech gave him a deep thrill. [18/13 = 1.38]

 

          ’Twas brillig and the slithy toves did gyre and gimble in the wabe. [23/16 = 1.44]

The last of these is a bit tricky because ’Twas is really (It)-was and the verb is compound. The verb gimble is tagged by did, and both components are tagged by (or do tag) in the wabe. It also shows that it is the rules of grammar, guided by parameters, that determine the caching burden, not vocabulary.
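The ratios quoted for these seven sentences can be tabulated with a short sketch. The counts and token totals are copied from the bracketed figures; the short labels are mine, added only to keep the listing narrow.

```python
# (label, cache counts, tokens), copied from the bracketed figures above.
examples = [
    ("truth", 12, 8),
    ("queen", 9, 6),
    ("flute", 46, 31),
    ("shoe", 16, 11),
    ("bell", 20, 14),
    ("contralto", 18, 13),
    ("twas", 23, 16),
]

burdens = {label: counts / tokens for label, counts, tokens in examples}
for label, counts, tokens in examples:
    print(f"{label:10s} {counts}/{tokens} = {burdens[label]:.2f}")

low, high = min(burdens.values()), max(burdens.values())
print(f"range: {low:.2f} to {high:.2f}")  # range: 1.38 to 1.50
```

The 31-token sentence and the 6-token sentence land within 0.02 of each other, which is the point being made about sentence length.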

Intuitively, at least, we might expect a much smaller caching burden for the reader of Ernest Hemingway than for the reader of Charles Darwin, and even more so for the reader of Darwin’s disciple in both style and substance, Stephen Jay Gould, binder of a dozen counterpoised aspects of a thought into a single extended sentence, perhaps one that claims an entire paragraph. The examples above, however, cast doubt on that, and raise the possibility that the parameters allow extensive sentences in English without increasing the burden significantly. Punctuation, however, does have a significant effect:

         That that is, is; that that is not, is not. [18/15 = 1.2]

It should be clear that juggling several potential rule paths would require building a cache, then deconstructing it when it did not work, then building anew. The hunt for the right rule is a significant cost in computer searches that are not so tightly constrained. The elimination of potential rule paths when complex sentences are built from simple ones has a direct effect on the burden of language.

Intuitively, the parameters that govern how component fragments can be fused into a larger fragment prevent confusion and the unnecessary use of short-term memory.

It must be accepted that languages do differ in the ease with which they handle this problem. Consider the Warlpiri examples of Baker [page 36]. The following, and many other permutations, say roughly the same thing, except that the word order carries additional information that does not show up in the English transliteration: These small children chased those big dogs.

          These-SU big-OB children-SU chased those-OB small-SU dogs-OB.

          These-SU big-OB small-SU chased children-SU dogs-OB those-OB.

          Dogs-OB big-OB chased children-SU small-SU those-OB these-SU.

It is somewhat arbitrary just when to retrieve from the cache, but suppose that every time an x-OB (or an x-SU) is encountered, things in the same category are uncached, fused, and cached again. The result is surprising:

         cap-these-SU │ big-OB │                                                                5

         (These children)-SU │ big-OB                                                         2

         (These children)-SU │ big-OB │ chased-V                                       1

         (These children)-SU │ (those big)-OB │ chased-V                            2

         (These small children)-SU │ (those big)-OB │ chased-V                  2

         (These small children)-SU │ (those big dogs)-OB │ chased-V           2

         These small children chased those big dogs.                                        1

This is 15 cache counts for 15 tokens. A caching burden of 1.0! To be sure, the equivalent English sentence has only 9 tokens, but it has 13 counts and a burden of 1.44. As noted for simple Japanese/English/Navaho examples above, the cache counts do not seem to differ greatly for a transliteration from one language to another, but burdens do.
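The reason the tags make word order nearly free can be seen in a small sketch. It assumes only that each gloss word carries its role as a suffix, as in the three permutations listed above, and that the untagged word is the verb. The helper names parse and group_by_tag are hypothetical. Grouping by tag alone recovers the same subject and object word sets from every permutation.

```python
from collections import defaultdict


def parse(gloss):
    """Split 'These-SU big-OB chased ...' into (word, tag) pairs."""
    tagged = []
    for item in gloss.rstrip(".").split():
        word, _, tag = item.partition("-")
        tagged.append((word.lower(), tag or "V"))  # untagged word: assumed verb
    return tagged


def group_by_tag(tagged):
    """Collect the words of each category, ignoring their order."""
    groups = defaultdict(set)
    for word, tag in tagged:
        groups[tag].add(word)
    return dict(groups)


permutations = [
    "These-SU big-OB children-SU chased those-OB small-SU dogs-OB.",
    "These-SU big-OB small-SU chased children-SU dogs-OB those-OB.",
    "Dogs-OB big-OB chased children-SU small-SU those-OB these-SU.",
]

grouped = [group_by_tag(parse(p)) for p in permutations]
print(all(g == grouped[0] for g in grouped))  # True
print(sorted(grouped[0]["SU"]))               # ['children', 'small', 'these']
print(sorted(grouped[0]["OB"]))               # ['big', 'dogs', 'those']
```

The sketch discards order entirely, so it does not capture the additional information that the text says word order carries.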

It remains to be seen if the parameters that govern the construction of complex sentences determine a characteristic range of burdens for each language. The twelve complex English sentences analyzed above have an average burden of 1.43, which is not so much above the 1.40 of the basic SVO form. A lot of the burden above that of the SVO form occurs when a sentence is heavily loaded with three-word prepositional phrases, which tend to add five cache counts for three words. When separated by commas (and enclosed in parentheses), such phrases would tend to generate five cache counts for seven tokens.

Conclusion. This essay has explored a view of language as a set of intertwined concurrent search processes, based on the interpretation and the construction of coherent fragments of language. Both the producer and the consumer of language take part in forming the language stream, which results from an appropriate search through tokens and their relationships.

The language search space involves probability, multi-indexing, and dictionaries, and it is huge. It was argued that the search processes of language cannot reasonably be linear; they must be concurrent, and at some level they involve pattern matching, but it is untenable that every conceivable language construct is matched as a whole.

To support the discussion of the search processes, a particular form of directed graph, the dig, was used to model the equivalent mental inference processes. A dig contains a token path that depicts the language stream. It also contains a rule path that depicts the rules required for the particular fragment being modeled. The links of a dig depict position, specificity, rule, inference, or agreement. It appears that any language construct can be represented by a dig.
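As a rough illustration only, the description in this paragraph can be mirrored in a small data structure. The dig is defined earlier in the essay, and nothing below should be taken as that definition: the class Dig, its fields, the method names, and the way rule links are attached are all assumptions. Only the five link kinds and the two paths come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class LinkKind(Enum):
    """The five link kinds named in the text."""
    POSITION = "position"
    SPECIFICITY = "specificity"
    RULE = "rule"
    INFERENCE = "inference"
    AGREEMENT = "agreement"


@dataclass
class Dig:
    """Hypothetical container: a token path, a rule path, and typed links."""
    token_path: List[str] = field(default_factory=list)
    rule_path: List[str] = field(default_factory=list)
    links: List[Tuple[str, str, LinkKind]] = field(default_factory=list)

    def link(self, source: str, target: str, kind: LinkKind) -> None:
        """Record a directed link of the given kind."""
        self.links.append((source, target, kind))

    def links_of(self, kind: LinkKind) -> List[Tuple[str, str]]:
        """Return the (source, target) pairs of one link kind."""
        return [(s, t) for s, t, k in self.links if k is kind]


dig = Dig(token_path=["the", "boy", "hit", "a", "girl"],
          rule_path=["SU", "V", "OB"])

# Position links chain consecutive tokens of the stream.
for left, right in zip(dig.token_path, dig.token_path[1:]):
    dig.link(left, right, LinkKind.POSITION)

# Rule links (an assumed reading) attach a token to its place in the rule path.
dig.link("boy", "SU", LinkKind.RULE)
dig.link("hit", "V", LinkKind.RULE)
dig.link("girl", "OB", LinkKind.RULE)

print(len(dig.links_of(LinkKind.POSITION)))  # 4
print(dig.links_of(LinkKind.RULE))
```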

Languages differ, and there are rules that appear to govern which digs can occur as landigs, those digs specific to a given language. The parameters of Baker generally partition the space of possible digs for a token stream in a particular language into those that are landigs and those that are not, providing a combinatorial reduction of the possibilities with which any writer, reader, speaker, listener, signer, or sign viewer must contend.

There are measures of the apparent speed with which searches can occur, the cache count and the caching burden, derived from the token stream and the relationships displayed by an appropriate dig. The tentative conclusion drawn from applying these measures to examples is much like the one to be taken from parameters.

There are differences between languages, but any apparent advantage of one over another by these measures is clearly a trade-off for other attributes. In particular, affixes in some languages reduce the caching burden at the expense of token counts. Finally, in English, complex sentences often appear to have a burden in the range of 1.45 to 1.50.

 
