A Process Model
© Wayne P Amsbury 31 May 2002
It is not possible at this time to model the metasearch processes in any detail, simply because too little is known about the component searches. They vary from one individual to another, and there are many subtle effects to be taken into account. Consider the following examples:

    Ham and Eggs is a good breakfast.
    Ham and Eggs are good for breakfast.
    Mitsool and Grimba are good at breakfast.

The typical English user will straighten out the use of Ham and Eggs as either a combination or as individual items when the verb has been recognized, but may puzzle over which middle-European what or who is involved in the third sentence, even after it is absorbed as a whole.

In the face of these difficulties, and perhaps because of them, it is useful to devise two very simple measures, called the cache count and the caching burden, that correspond to the complexity or the effective speed with which a token stream is interpreted. These are arbitrary but intuitive, and they do allow languages and sentence complexity, which subsumes both writing style and parametric guidance, to be compared. These are measures of speed in some sense. The cache count measures how much work is required to understand a fragment and to place it in its proper context. The caching burden is a measure of the mean amount of work the tokens of a fragment must do. These measures also support a convenient shorthand description of digs.

The most basic sentence forms are those with no explicit subject in null-subject languages, such as Italian: Piove. (In English: It is raining.) Even these have an implicit subject that must be inferred, so they are a special case of a rule path such as subject → verb. The heart of discourse, however, has another element: subject → verb → object in some permutation, of which there are six.
These are set forth in Baker as Table 5.1 at page 128:

    Subject – verb – object
    Subject – object – verb
    Verb – subject – object
    Verb – object – subject
    Object – verb – subject
    Object – subject – verb

Call these Simple forms. Many languages use one of these largely, or even exclusively, but some mix simple forms, and some use other equivalents. These forms are so basic to understanding discourse that any language that mixes them must tag parts of speech in some way other than by position. It is striking that the parameters discussed by Baker tend to be focused on the way in which these structures are expanded into more complex statements. A natural question is: Are there processing advantages for one simple form over another? We approach this question with a thought experiment.

Simple Form Thought Experiment. At one extreme, a parallel search for three components of a token path finds the token meanings simultaneously in one unit of time and infers their meaning from a rule path in a single additional step. A parallel search ignores the differences between simple forms, which lie in the number of tokens required to express them, and the number of feedback loops and agreements required to complete the search. These take their clearest form in the other extreme, a linear search. Consider the linear search for the English word quickly, with the dig:

    quick  →  -ly
      ↑        ↓
             adverb
               ↓
      └ ← ← ←  ┘

This has the shorthand form: quick → -ly. We can think of the stem quick as being placed in a cache, and then retrieved when -ly is encountered in order to fuse the two tokens into the fully understood word quickly. Then the fused word is generally cached again (recached). When it is retrieved from the cache to be fused into a larger fragment, it is treated as a unit. This idea can be extended to fragments of any size and complexity with a convention:

Convention 1.
A token T2 that modifies a previous token T1 causes T1 to be retrieved from the cache (uncached) so that the pair (T1-T2) can be fused, and then the fused pair is treated as a unit. When it is recognized that the last token in a fragment has been processed, the prior tokens in the fragment are uncached and then recached as a fused unit.

Cache counts. The search cost of this process is taken to be the number of times that items are cached. In the case of quickly, that is twice: once for the stem quick and once for the fused word, so the cache count of this word is two. It is tempting to apply mechanisms from Computer Science here, such as placing items awaiting their proper context on a stack. However, the mind is here given full credit for retrieving items independently, as needed, from temporary storage. This corresponds most closely to an associative memory rather than to a stack, hash table, linked list or the like. Precisely what is cached, and when it is cached, in a mental process is unknown and thus somewhat arbitrary, but liberal amounts of common sense and examples are applied here to provide a reasonable system of cache counting.

A simple English fragment that takes the SVO form is: John hit Mary (fragment). The cache count can be traced as follows, using the tags -SU, -OB and -V for subject, object and verb, respectively, as an indication of token roles. The cache contents are represented as being partitioned into fragments:
    cache                           count   next token
    John                              1     hit
    John-SU                           1     hit
      [When hit is encountered and recognized as a verb, the English rule path: subject → action is invoked and John is recognized as a subject. John is retrieved, tagged by -SU, and recached.]
    John-SU │ hit-V                   1     Mary
    John-SU │ hit-V │ Mary-OB         2
      [When Mary is encountered and recognized as a noun, it is potentially at least part of an object, which would invoke the English rule path: subject → action → object. However, the word following a verb is not necessarily an object word, which may be encountered later on the token path, and so the rule invocation requires that the verb be retrieved in order to apply it to Mary, and the verb is then recached.]

This is rather misleading, however, since the subject and the object are singletons here, rather than fragments that need to be fused, and no credit has been given to the need for an initiation and a termination of the fragment. This also ignores the information that Mary is a proper name, recognized by its capitalization. This term is properly treated with a virtual token as cap-mary. In contrast, the recognition that cap-john is a proper name as well as the first word in a sentence is absorbed into a vocabulary search, concurrent with the recognition of it as the first word in a sentence. After all, this situation is fundamentally ambiguous.

Now consider this fragment as a sentence: John hits Mary. (sentence), and in more detail: cap-john hits cap-mary period, which has the cache count:

    cache                             count
    cap                                 1
    cap-John                            1
    John-SU │ hit-V                     2
    John-SU │ hit-V │ cap-mary-OB       3
    John hit Mary.                      1

There are six tokens and eight cache counts. It might seem reasonable to forget the last count, on the basis that this sentence is complete, but in the context of an enclosing sentence or of an extended discourse such as a paragraph, it must be cached as a whole for later retrieval.
On reviewing the fragment John hit Mary above, it can be seen that the two extra counts are really needed for that fragment also: for an initiation and a termination, and for the caching of the fragment as a unit within a larger language fragment. It appears that the amount of caching imposed by initiation and termination becomes relatively less as the number of tokens involved in a fragment grows, but in a complex sentence partitioned into fragments, each fragment is initiated and terminated in some manner, and it may be retrieved or it may cause a retrieval that is needed for fragments to become properly linked.

The analysis can be sharpened: The caching burden (burden) is the cache count divided by the number of tokens.

The use of proper names in this example is an artifact of the language examples in Baker; a more general example would be: Cats eat mice. In more detail: cap-cats eat mice period. [Five tokens] The abstract form of the example sentence above (without proper names) is: cap-w1 w2 w3 period, which is to be understood as the simple form: SVO. Clearly, the rule path of the corresponding dig comes into play in the recognition of parts of speech. Fragments such as the subject can be rather complex, so we adopt another convention:

Convention 2: Parts of speech are recognized by a marker of some kind, perhaps by punctuation, or when a token of the next fragment is recognized.

With this convention applied to English, a verb form that follows a fragment can signal that the fragment is a subject. This is really the application of a (potential) rule path. The marker is considered to take the fragment that it marks from the cache and then to recache it, tagged as a part of speech. In this context, then:

    cache                 count
    cap-w1                  2
    W1-SU                   1
    S │ w2-V                1
    S │ V │ w3-OB           2
    SVO                     1

This sentence has five tokens, a cache count of seven, and a burden of 7/5, which we write as [7/5 = 1.40].
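The arithmetic of the abstract SVO trace can be checked with a short Python sketch. The function name `caching_burden` and the representation of a trace as (cache contents, count) rows are assumptions introduced here for illustration only; the counts themselves are copied from the trace above.

```python
from fractions import Fraction

def caching_burden(trace, n_tokens):
    """Return (cache count, burden) for a trace of (cache contents, count) rows.

    Hypothetical helper: the cache count is the sum of the per-row counts,
    and the burden is that sum divided by the number of tokens.
    """
    cache_count = sum(count for _, count in trace)
    return cache_count, Fraction(cache_count, n_tokens)

# Abstract English SVO sentence: cap-w1 w2 w3 period (five tokens).
svo_trace = [
    ("cap-w1", 2),
    ("W1-SU", 1),
    ("S | w2-V", 1),
    ("S | V | w3-OB", 2),
    ("SVO", 1),
]

count, burden = caching_burden(svo_trace, n_tokens=5)
print(f"[{count}/5 = {float(burden):.2f}]")  # [7/5 = 1.40]
```

Keeping the burden as an exact fraction avoids rounding until the value is printed, which makes it easy to compare traces that differ by a single count.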
This analysis can be generalized to three-word sentences of the form: cap-w1 w2 w3 period without regard to which simple form is to be the structure. In the SVO form in English, the verb determines the part of speech of both the subject and the object. In the case of the SOV form the verb is encountered last, so that the first two words need to be retrieved in order to tag them:

    cache                 count
    cap-w1                  2
    W1 │ w2                 1
    W1-SU │ w2-OB           2
    S │ O │ V               1
    SOV                     1

It is easily verified that the other four simple form possibilities abstractly have the same cache count and caching burden [7/5 = 1.40]. There is no apparent advantage to one simple form over another, so long as the rule path is strict.

Variations. Some languages, such as Japanese and Navaho, allow both SOV and OSV. The flexibility is paid for with suffix or prefix tags. Consider a Japanese example from Baker: John-ga Mary-o butta. (sentence). Here we again use -SU and -OB for subject and object tags:

    cache                           count
    cap-John                          2
    John-SU                           1
    John-SU │ cap-mary                2
    John-SU │ Mary-OB                 1
    John-SU │ Mary-OB │ butta         1
    John-SU Mary-OB butta.            1

This has eight tokens and eight cache counts, and so it has what appears to be an optimum caching burden of 1.00, basically because it is not necessary to retrieve anything from the cache in order to determine parts of speech. Without proper names this becomes [7/7 = 1.00]. The extra two tokens buy the option of reversing the order of subject and object (SOV and OSV) without losing the precision of the statement. Adding an indirect object (with a -ni suffix) to the Japanese example adds two counts and does not change the burden.

Caution. A burden of 1.0 can be interpreted as indicating that the search process is nearly linear, but nearly linear is not the same as almost costless, which is what a parallel search is in this system of analysis. On the contrary, the conclusion is that essentially every token is delayed at least one unit for fusion.
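The burdens quoted so far can be recomputed side by side. The following Python sketch is illustrative only: the helper `burden`, the dictionary `examples`, and the splitting of the Japanese sentence into eight tokens (with -ga and -o counted as separate suffix tokens) are assumptions made here, chosen to match the token and count totals stated in the text.

```python
def burden(tokens, counts):
    """Caching burden: total cache count divided by the number of tokens."""
    return sum(counts) / len(tokens)

# Each entry pairs a token list with the per-row cache counts from its trace.
examples = {
    "English SVO (abstract)": (
        ["cap", "w1", "w2", "w3", "."],
        [2, 1, 1, 2, 1],
    ),
    "Strict SOV (abstract)": (
        ["cap", "w1", "w2", "w3", "."],
        [2, 1, 2, 1, 1],
    ),
    "Japanese (John-ga Mary-o butta.)": (
        ["cap", "John", "-ga", "cap", "Mary", "-o", "butta", "."],
        [2, 1, 2, 1, 1, 1],
    ),
}

for name, (tokens, counts) in examples.items():
    print(f"{name}: [{sum(counts)}/{len(tokens)} = {burden(tokens, counts):.2f}]")
```

The two strict word orders distribute their counts differently across the trace but reach the same total, while the Japanese example adds tokens without adding retrievals.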
Navaho also allows SOV and OSV, but does this with one prefix on the verb rather than suffixes on the subject and object. The abstract form is either w1 w2 yi-w3, meaning: S → O → yi-V, or w1 w2 bi-w3, meaning: O → S → bi-V; the verb prefix places the subject as the first or second precursor of the verb. In either case, both subject and object must be uncached in order to tag them properly.

    cache                 count
    cap-w1                  2
    W1 │ w2                 1
      [At this point, either bi- or yi- is encountered, and both words are uncached and fused appropriately.]
    W1-SU │ w2-OB           2     [or: O │ S]
    S │ O │ V               1
    SOV                     1

This has a cache count of seven for six tokens and a caching burden of 7/6 = 1.17. A comparison of the examples without proper names is:

    language    form                        tokens   counts   burden
    English     cap w1 w2 w3 .                 5        7      1.40
    Navaho      cap w1 w2 prefix-w3 .          6        7      1.17
    Japanese    cap w1-SU w2-OB w3 .           7        7      1.00

If caching burden is important, English pays a high burden price for minimizing tokens. If tokens are important, Japanese pays a high token price for minimizing burden. Navaho falls between. Navaho and Japanese provide for, and must sort out, two potential rule paths, but English only one.

Some languages do not fit into the simple form patterns, even for simple sentences. Consider the Mohawk Shakonuhwe’s. (Shako- → nuhwe’s) meaning: He likes her.

    cache                 count
    cap-shako-              2
    Shako │ nuhwe’s         1
    Shakonuhwe’s            1

This has four tokens and four counts for a burden of 1.0. However, while there are fewer tokens, there is a cost involved in memorizing 58 prefix combinations, of which he/her is one. When a closer equivalent to the English and Japanese sentences above is considered (given my shaky Mohawk!), we have John shakonuhwe’s Mary. (sentence).

    cache                                  count
    cap-John                                 2     [shako-]
    John                                     1
    John-SU │ shako-                         1     [shako- must be used again]
    John-SU │ shako- │ nuhwe’s               1
      [At this point, Mary is encountered. The pair prefix shako- must be uncached and fused with Mary, but it does not need to be put back into the cache because its links are now resolved.]
    John-SU │ nuhwe’s │ cap-mary-OB          2
    John shakonuhwe’s Mary.                  1

In the form without proper names, this would have a count of 7 for 6 tokens, giving a burden of 7/6 = 1.17, the same as for Navaho. There is a cost for adding explicit information that was assumed in the pronoun form of the sentence.

It appears that the null-subject languages have an advantage, as in the Italian Piove, in comparison to the French Il pleut, and the English It is raining.
As sentences, these can be analyzed as:

    cap-piove-V period                      [3/3 = 1.00]
    cap-il-SU pleut-V period                [5/4 = 1.25]
    cap-it-SU (is raining)-V period         [6/5 = 1.20]

However, the English interjection: Quickly! corresponds to (cap-quick-ly)-ADV exclamation [4/4 = 1.00]. Of course, this may evoke: Do what quickly? It relies on outside context.

Similarly, the distinction between: He was running. and He was Running Dog. is not so straightforward. They can be analyzed as:

    cap-he-SU (was running)-V period                    [6/5 = 1.20]
    cap-he-SU was-V (cap-running cap-dog)-OB period     [10/8 = 1.25]

This analysis considers that was is recognized as a verb, and running uncaches it and recaches the verb phrase was running, whereas the capital of Running marks it as (part of) a proper name and leaves the verb alone.

No analysis, however, can catch the nuances of speech that make computational linguistics and language translation so difficult. Consider the innuendo of: The contralto cadence of her speech gave him a deep thrill.