1: .\" @(#)rm1 6.1 (Berkeley) 5/22/86
2: .\"
3: .EQ
4: delim $$
5: .EN
6: .NH 1
7: Introduction
8: .PP
9: Computers have become important
10: in the document preparation process, with programs
11: to check for spelling errors and to format documents.
12: As the amount of text stored on line increases, it becomes
13: feasible and attractive to study writing
14: style and to attempt to help the writer in producing readable
15: documents.
16: The system of writing tools described here is a first step toward such help.
17: The system includes programs and a data base to
18: analyze writing style at the word and sentence level.
19: We use the term ``style'' in this paper to describe the
20: results of a writer's particular choices among individual words and
21: sentence forms.
22: Although many judgements of style are subjective,
23: particularly those of word choice,
24: there are some objective measures that experts
25: agree lead to good style.
26: Three programs have been written to measure some of
27: the objectively definable characteristics of writing style
28: and to identify some commonly misused or unnecessary phrases.
29: Although a document that conforms to the stylistic rules
30: is not guaranteed to be coherent and readable, one that
31: violates all of the rules is likely to be
32: difficult or tedious to read.
33: The program STYLE calculates readability, sentence length variability,
34: sentence type, word usage and sentence openers at a rate of about 400 words per second
35: on a PDP11/70 running the
36: .UX
37: Operating System.
38: It assumes that the sentences are well-formed, i.e., that
39: each sentence has a verb and that the subject and verb agree in number.
40: DICTION identifies phrases that are either bad usage or unnecessarily wordy.
41: EXPLAIN acts as a thesaurus for the phrases found by DICTION.
42: Sections 2, 3, and 4 describe the programs; Section 5 gives the results
43: on a cross-section of technical documents; Section 6 discusses
44: accuracy and problems; Section 7 gives implementation details.
45: .NH 1
46: STYLE
47: .PP
48: The program STYLE reads a document and prints a summary of
49: readability indices, sentence length and type, word usage,
50: and sentence openers.
51: It may also be used to locate all sentences in a document that are
52: longer than a given length, have a readability index at or above a given
53: number, contain a passive verb, or begin with an expletive.
54: STYLE
55: is based on the system for finding English word classes or parts of speech, PARTS [1].
56: PARTS is a set of programs that uses a small dictionary (about 350 words)
57: and suffix rules to partially assign word classes to
58: English text.
59: It then uses experimentally derived rules of word order to assign
60: word classes to all words in the text with an accuracy of about 95%.
61: Because PARTS uses only a small dictionary and general rules, it works
62: on text about any subject, from physics to psychology.
63: Style measures have been built into the output phase
64: of the programs that make up PARTS.
65: Some of the measures are simple counters of the word classes
66: found by PARTS; many are more complicated.
67: For example, the verb count is the total number of verb phrases.
68: This includes phrases like:
69: .DS
70: has been going
71: was only going
72: to go
73: .DE
74: each of which counts as one verb.
75: Figure 1 shows the output of STYLE run on a paper by Kernighan and Mashey
76: about the
77: .UX
78: programming environment [2].
79: .KF
80: .sp 2
81: .TS
82: box;
83: l1l.
84: programming environment
85: readability grades:
86: (Kincaid) 12.3 (auto) 12.8 (Coleman-Liau) 11.8 (Flesch) 13.5 (46.3)
87: sentence info:
88: no. sent 335 no. wds 7419
89: av sent leng 22.1 av word leng 4.91
90: no. questions 0 no. imperatives 0
91: no. nonfunc wds 4362 58.8% av leng 6.38
92: short sent (<17) 35% (118) long sent (>32) 16% (55)
93: longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117
94: sentence types:
95: simple 34% (114) complex 32% (108)
96: compound 12% (41) compound-complex 21% (72)
97: word usage:
98: verb types as % of total verbs
99: tobe 45% (373) aux 16% (133) inf 14% (114)
100: passives as % of non-inf verbs 20% (144)
101: types as % of total
102: prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)
103: noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)
104: nominalizations 2 % (155)
105: sentence beginnings:
106: subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot 67%
107: prep 12% (39) adv 9% (31)
108: verb 0% (1) sub_conj 6% (20) conj 1% (5)
109: expletives 4% (13)
110: .TE
111: .sp
112: .ce
113: Figure 1
114: .sp
115: .KE
116: As the example shows, STYLE output is in five parts.
117: After a brief discussion of sentences, we will describe the parts in order.
118: .NH 2
119: What is a sentence?
120: .PP
121: Readers of documents have little
122: trouble deciding where the sentences end.
123: People don't even have to stop and think about uses of the
124: character ``.'' in constructions like
125: 1.25, A. J. Jones, Ph.D., i. e., or etc. .
126: When a computer reads a document,
127: finding the ends of sentences is not as easy.
128: First we must throw away the printer's marks and formatting
129: commands that litter the text in computer form.
130: Then STYLE
131: defines a sentence
132: as a string of words ending in one of:
133: .DS
134: . ! ? /.
135: .DE
136: The end marker ``/.'' may be used to indicate an imperative sentence.
137: Imperative sentences that are not so marked are not identified as imperative.
138: STYLE properly handles numbers with embedded decimal points and commas,
139: strings of letters and numbers with embedded decimal points used as
140: computer file names, and
141: the common
142: abbreviations listed in Appendix 1.
143: Numbers that end sentences, as in the preceding sentence, cause
144: a sentence break if the next word begins with a capital letter.
145: Initials cause a sentence break only if the next word begins with
146: a capital and is found in the dictionary of function words used by PARTS.
147: So the string
148: .DS
149: J. D. JONES
150: .DE
151: does not cause a break, but the string
152: .DS
153: ... system H. The ...
154: .DE
155: does.
156: With these rules most sentences are broken at the proper place,
157: although occasionally
158: either two sentences are called one or a fragment is called
159: a sentence.
160: Section 6 gives counts of both kinds of error.
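.PP
The rules above can be sketched in a few lines of Python.
This is an illustration only, not the STYLE source:
the abbreviation set and the function-word set are small hypothetical
stand-ins for Appendix 1 and the PARTS dictionary, and the names
ends_sentence and split_sentences are ours.

```python
# Hypothetical stand-ins: the real lists are Appendix 1 and the PARTS dictionary.
ABBREVIATIONS = {"ph.d.", "i.e.", "e.g.", "etc.", "dr.", "mr."}
FUNCTION_WORDS = {"the", "a", "an", "it", "this", "in", "on", "of"}


def is_initial(token):
    """A single capital letter followed by a period, as in 'J.'."""
    return len(token) == 2 and token[0].isupper() and token[1] == "."


def is_number(token):
    """Digits with optional embedded commas and decimal points."""
    return token[0].isdigit() and all(c.isdigit() or c in ",." for c in token)


def ends_sentence(token, next_token):
    """Decide whether token closes a sentence, given the token after it."""
    if token.endswith(("!", "?")):
        return True
    if not token.endswith("."):      # also covers the '/.' imperative marker
        return False
    if next_token is None:           # end of the text
        return True
    if token.lower() in ABBREVIATIONS:
        return False
    next_is_capitalized = next_token[0].isupper()
    if is_initial(token):
        return next_is_capitalized and next_token.lower() in FUNCTION_WORDS
    if is_number(token):
        return next_is_capitalized
    return True


def split_sentences(text):
    """Split text into sentences on whitespace-separated tokens."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if ends_sentence(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("See J. D. JONES for details. Use system H. The rest is easy."))
```

With these stand-in lists the initials in ``J. D. JONES'' do not break the sentence, while ``H.'' followed by ``The'' does.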
161: .NH 2
162: Readability Grades
163: .PP
164: The first section of STYLE output consists of four readability indices.
165: As Klare points out in [3], readability indices may be used to
166: estimate the reading skills needed by the reader to understand a document.
167: The readability indices reported by STYLE are based on
168: measures of sentence and word lengths.
169: Although the indices
170: may not measure whether the document is coherent
171: and well organized,
172: experience has shown that high indices seem to be indicators of stylistic
173: difficulty.
174: Documents with short sentences and short words have low scores;
175: those with long sentences and many polysyllabic words have high scores.
176: The four formulae reported are the Kincaid Formula [4], the Automated Readability Index [5],
177: the Coleman-Liau Formula [6],
178: and a normalized version of the Flesch Reading Ease Score [7].
179: The formulae differ because they were experimentally derived using different texts
180: and subject groups.
181: We will discuss each of the formulae briefly; for a more
182: detailed discussion the reader should see [3].
183: .PP
184: The Kincaid Formula, given by:
185: .EQ
186: Reading_Grade = 11.8 * syl_per_wd + .39 * wds_per_sent - 15.59
187: .EN
188: .br
189: was based on Navy training manuals that ranged in difficulty
190: from 5.5 to 16.3 in reading grade level.
191: The score reported by this formula tends to be in the mid-range of the
192: 4 scores.
193: Because it is based on adult training manuals rather than
194: school book text, this formula is probably the best
195: one to apply to technical documents.
196: .PP
197: The Automated Readability Index (ARI), based on text from
198: grades 0 to 7, was derived to be easy to automate.
199: The formula is:
200: .EQ
201: Reading_Grade = 4.71 * let_per_wd + .5 * wds_per_sent - 21.43
202: .EN
203: .br
204: ARI tends to produce scores that are higher than Kincaid and
205: Coleman-Liau but are usually slightly lower than Flesch.
206: .PP
207: The Coleman-Liau Formula, based on text ranging in
208: difficulty from .4 to 16.3, is:
209: .EQ
210: Reading_Grade = 5.89 * let_per_wd - .3 * sent_per_100_wds - 15.8
211: .EN
212: .br
213: Of the four formulae this one usually gives the lowest
214: grade when applied to technical documents.
215: .PP
216: The last formula, the Flesch Reading Ease Score, is based
217: on grade school text covering grades 3 to 12.
218: The formula, given by:
219: .EQ
220: Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent
221: .EN
222: .br
223: is usually reported in the range 0 (very difficult) to 100 (very easy).
224: The score reported by STYLE is scaled to be comparable to
225: the other formulas,
226: except that the maximum grade level reported is set to 17.
227: The Flesch score is usually the highest of the 4 scores
228: on technical documents.
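.PP
As a worked check, the four formulae can be written as Python functions
and applied to the counts printed in Figure 1.
This is a sketch under two assumptions of ours: that ``av word leng''
in Figure 1 is letters per word, and that syllables per word, which the
figure does not print, can be backed out of the reported Kincaid grade.
The function names are ours, and the scaling STYLE applies to the Flesch
score is not reproduced.

```python
def kincaid(syl_per_wd, wds_per_sent):
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59


def ari(let_per_wd, wds_per_sent):
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43


def coleman_liau(let_per_wd, sent_per_100_wds):
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8


def flesch(syl_per_wd, wds_per_sent):
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent


# Counts printed in Figure 1.
SENTENCES, WORDS, AV_WORD_LENG = 335, 7419, 4.91
wds_per_sent = WORDS / SENTENCES
sent_per_100_wds = 100 * SENTENCES / WORDS

ari_grade = ari(AV_WORD_LENG, wds_per_sent)
cl_grade = coleman_liau(AV_WORD_LENG, sent_per_100_wds)

# Assumption: recover syllables per word from the reported Kincaid grade (12.3).
syl_per_wd = (12.3 + 15.59 - 0.39 * wds_per_sent) / 11.8
flesch_score = flesch(syl_per_wd, wds_per_sent)

print(round(ari_grade, 1), round(cl_grade, 1), round(flesch_score, 1))
# prints: 12.8 11.8 46.3
```

These agree with the (auto), (Coleman-Liau), and parenthesized Flesch values shown in Figure 1.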
229: .PP
230: Coke [8] found that the Kincaid Formula is probably the best predictor for
231: technical documents;
232: both ARI and Flesch tend to overestimate
233: the difficulty; Coleman-Liau tends to underestimate it.
234: On text in the range of grades 7 to 9
235: the four formulas tend to be about the same.
236: On easy text the Coleman-Liau formula is probably
237: preferred since it is reasonably accurate at the lower
238: grades and it is safer to present text that is a little too
239: easy than a little too hard.
240: .PP
241: If a document has particularly difficult technical content, especially if
242: it includes a lot of mathematics,
243: it is probably best to make the text very easy to read, i.e., to lower the
244: readability index by shortening the sentences and words.
245: This will allow the reader to concentrate on the technical
246: content and not the long sentences.
247: The user should remember that these indices are estimators;
248: they should not be taken as absolute numbers.
249: STYLE called with ``\-r number'' will print all sentences with
250: an Automated Readability Index equal to or greater than ``number''.
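.PP
The selection made by this flag might be sketched as follows.
The sketch assumes that each sentence is scored on its own, with its
word count used as words per sentence and only alphabetic characters
counted as letters; STYLE's own counting rules may differ, and the
names sentence_ari and at_or_above are ours.

```python
def sentence_ari(sentence):
    """Automated Readability Index of one non-empty sentence."""
    words = sentence.split()
    letters = sum(ch.isalpha() for word in words for ch in word)
    return 4.71 * letters / len(words) + 0.5 * len(words) - 21.43


def at_or_above(sentences, number):
    """Keep the sentences whose index is equal to or greater than number."""
    return [s for s in sentences if sentence_ari(s) >= number]


sample = [
    "The cat sat.",
    "Computational documentation preparation environments necessitate "
    "sophisticated typographical considerations throughout.",
]
for sentence in at_or_above(sample, 10):
    print(sentence)
```

Only the second sample sentence is printed; the first scores well below 10.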
251: .NH 2
252: Sentence length and structure
253: .PP
254: The next two sections of STYLE output deal with sentence length and structure.
255: Almost all books on writing style or effective writing emphasize
256: the importance of variety in sentence length and structure for good writing.
257: Ewing's first rule in discussing style in the book
258: .I
259: Writing for Results
260: .R
261: [9] is:
262: .DS
263: ``Vary the sentence structure and length of your sentences.''
264: .DE
265: Leggett, Mead and Charvat break this rule into 3 in
266: .I
267: Prentice-Hall Handbook for Writers
268: .R
269: [10] as follows:
270: .DS
271: ``34a. Avoid the overuse of short simple sentences.''
272: ``34b. Avoid the overuse of long compound sentences.''
273: ``34c. Use various sentence structures to avoid monotony and increase effectiveness.''
274: .DE
275: Although experts agree that these rules are important, not all writers
276: follow them.
277: Sample technical documents have been found with almost no
278: sentence length or type variability.
279: One document had 90% of its sentences about the same
280: length as the average;
281: another was made up almost entirely of simple sentences (80%).
282: .PP
283: The output sections labeled ``sentence info'' and ``sentence types'' give
284: both length and structure measures.
285: STYLE reports on the number and average length of both
286: sentences and words,
287: and number of questions and imperative sentences (those ending in ``/.'').
288: The measures of non-function words are an attempt to look at the content
289: words in the document.
290: In English
291: non-function words are nouns, adjectives, adverbs, and non-auxiliary verbs;
292: function words are prepositions, conjunctions, articles, and auxiliary
293: verbs.
294: Since most function words are short, they tend to lower the average
295: word length.
296: The average length of non-function words may be a more useful measure for comparing
297: word choice of different writers than the total average word length.
298: The percentages of short and long sentences measure sentence
299: length variability.
300: Short sentences are those at least 5 words shorter than the
301: average; long sentences are those at least 10 words longer than the average.
302: Last in the sentence information section is the
303: length and location of the longest and shortest sentences.
304: If the flag ``\-l number'' is used, STYLE will print all sentences
305: longer than ``number''.
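.PP
The short and long sentence measures can be sketched directly from
these definitions.
The sketch applies the rule as worded in the text to a list of
sentence lengths in words; whether STYLE rounds the average before
comparing is not stated here, so the comparison against the unrounded
average is our assumption, as is the name length_variability.

```python
def length_variability(lengths):
    """Percentages of short and long sentences, given lengths in words."""
    average = sum(lengths) / len(lengths)
    short = [n for n in lengths if n <= average - 5]    # at least 5 words shorter
    long_ = [n for n in lengths if n >= average + 10]   # at least 10 words longer
    return {
        "average": average,
        "short_count": len(short),
        "long_count": len(long_),
        "short_pct": 100 * len(short) / len(lengths),
        "long_pct": 100 * len(long_) / len(lengths),
    }


report = length_variability([5, 10, 20, 22, 25, 40])
print(report["short_count"], report["long_count"])
# prints: 2 1
```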
306: .PP
307: Because of the difficulties in dealing with the many uses of commas and conjunctions
308: in English, sentence type definitions
309: vary slightly from those of standard textbooks, but still measure
310: the same constructional activity.
311: .IP 1.
312: A simple sentence has one verb and no dependent clause.
313: .IP 2.
314: A complex sentence has one independent
315: clause and one dependent clause, each with one verb.
316: Complex sentences are found by identifying sentences that contain either
317: a subordinate conjunction or a clause beginning with words like ``that''
318: or ``who''.
319: The preceding sentence has such a clause.
320: .IP 3.
321: A compound sentence has more than one verb and no dependent
322: clause.
323: Sentences joined by ``;'' are also counted as compound.
324: .IP 4.
325: A compound-complex sentence has either several dependent clauses
326: or one dependent clause and a compound verb in either
327: the dependent or independent clause.
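.PP
The four definitions above reduce to a small decision rule.
The sketch below assumes that the number of verb phrases and the
number of dependent clauses in a sentence are already known, which in
STYLE is the work of PARTS; the function name and its parameters are
hypothetical.

```python
def sentence_type(verbs, dependent_clauses, joined_by_semicolon=False):
    """Classify a sentence from its verb and dependent-clause counts."""
    if dependent_clauses == 0:
        # No dependent clause: one verb is simple, more than one is compound.
        if verbs > 1 or joined_by_semicolon:
            return "compound"
        return "simple"
    if dependent_clauses == 1 and verbs <= 2:
        # One independent and one dependent clause, each with one verb.
        return "complex"
    # Several dependent clauses, or one dependent clause plus a compound verb.
    return "compound-complex"


for counts in [(1, 0), (2, 0), (2, 1), (3, 1), (3, 2)]:
    print(counts, sentence_type(*counts))
```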
328: .PP
329: Even using these broader definitions, simple
330: sentences dominate many of the technical documents that
331: have been tested,
332: but the example in Figure 1 shows variety in both sentence structure and
333: sentence length.
334: .NH 2
335: Word Usage
336: .PP
337: The word usage measures are an attempt to identify
338: some other constructional features of writing style.
339: There are many different ways in English to
340: say the same thing.
341: The constructions differ from one another
342: in the form of the words used.
343: The following sentences all convey approximately the
344: same meaning but differ in word usage:
345: .DS
346: The cxio program is used to perform all communication between the systems.
347: The cxio program performs all communications between the systems.
348: The cxio program is used to communicate between the systems.
349: The cxio program communicates between the systems.
350: All communication between the systems is performed by the cxio program.
351: .DE
352: The distribution of the parts of speech and verb constructions
353: helps identify overuse of particular constructions.
354: Although the measures used by STYLE are crude, they do point out
355: problem areas.
356: For each category, STYLE reports a percentage and a raw count.
357: In addition to looking at the percentage, the user
358: may find it useful to compare the raw count with the number of sentences.
359: If, for example, the number of infinitives is almost equal to the number
360: of sentences, then many of the sentences in the document are constructed
361: like the first and third in the preceding example.
362: The user may want to transform some of these sentences into another form.
363: Some of the implications of the word usage measures are discussed below.
364: .IP "\fIVerbs\fR "
365: are measured in several different ways to
366: try to determine what types of verb constructions are
367: most frequent in the document.
368: Technical writing tends to contain many
369: passive verb constructions and other uses of the verb ``to be''.
370: The category of verbs labeled ``tobe'' measures both passives and sentences of
371: the form:
372: .DS
373: .I
374: subject tobe predicate
375: .R
376: .DE
377: In counting verbs, whole verb phrases are counted as one verb.
378: Verb phrases containing auxiliary verbs are counted in the category
379: ``aux''.
380: The verb phrases counted here are those whose tense is not
381: simple present or simple past.
382: It might eventually be useful to do more detailed measures
383: of verb tense or mood.
384: Infinitives are listed as ``inf''.
385: The percentages reported for these three categories are based on
386: the total number of verb phrases found.
387: These categories are not mutually exclusive;
388: they cannot be added, since, for example,
389: ``to be going'' counts as both ``tobe'' and ``inf''.
390: Use of these three types of verb constructions varies significantly among authors.
391: .sp 2
392: STYLE reports passive verbs as a percentage of the finite verbs in the
393: document.
394: Most style books warn against the overuse of passive verbs.
395: Coleman [11] has shown that sentences with
396: active verbs are easier to learn than those
397: with passive verbs.
398: Although the inverted object-subject order of the passive
399: voice seems to emphasize the object, Coleman's experiments
400: showed that there is little difference in retention
401: by word position. He also showed that the direct object of an active verb
402: is retained better than the subject of a passive verb.
403: These experiments support the advice of the style books suggesting
404: that writers should try to use active verbs wherever possible.
405: The flag ``\-p'' causes STYLE to print all sentences containing passive verbs.
406: .PP
407: .IP "\fIPronouns\fR "
408: add cohesiveness and connectivity to a document
409: by providing back-reference.
410: They are often a short-hand notation for something
411: previously mentioned, and therefore connect the sentence containing the pronoun with the
412: word to which the pronoun refers.
413: Although there are other mechanisms for such connections, documents
414: with no pronouns tend to be wordy and to have little connectivity.
415: .IP "\fIAdverbs\fR "
416: can provide transition between sentences and order
417: in time and space.
418: In performing these functions, adverbs, like pronouns, provide
419: connectivity and cohesiveness.
420: .IP "\fIConjunctions\fR "
421: provide parallelism in a document by connecting two or more
422: equal units.
423: These units may be whole sentences, verb phrases, nouns, adjectives, or
424: prepositional phrases.
425: The compound and compound-complex sentences reported under
426: sentence type are parallel structures.
427: Other uses of parallel structures are indicated by the degree that the
428: number of conjunctions reported under word usage exceeds the
429: compound sentence measures.
430: .IP "\fINouns and Adjectives.\fR "
431: A ratio of nouns to adjectives near unity may indicate the over-use of modifiers.
432: Some technical writers qualify every noun with one or more
433: adjectives.
434: Qualifiers in phrases like ``simple linear single-link network model''
435: often lend more obscurity than precision to a text.
436: .IP "\fINominalizations\fR "
437: are verbs that are changed to nouns by adding one of the suffixes
438: ``ment'', ``ance'', ``ence'', or ``ion''.
439: Examples are accomplishment, admittance, adherence, and abbreviation.
440: When a writer transforms a nominalized sentence to a non-nominalized
441: sentence, she/he increases the effectiveness of the sentence in
442: several ways.
443: The noun becomes an active verb and frequently one complicated clause
444: becomes two shorter clauses.
445: For example,
446: .DS
447: Their inclusion of this provision is admission of the importance of the system.
448: When they included this provision, they admitted the importance of the system.
449: .DE
450: Coleman found that the transformed sentences were easier to
451: learn, even when the transformation produced sentences that were
452: slightly longer, provided the transformation broke one clause into two.
453: Writers who find their document contains many
454: nominalizations may want to transform some of the sentences
455: to use active verbs.
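.PP
A suffix test of this kind is easy to sketch.
The version below looks only at the ending of each word, so it also
flags nouns such as ``station'' that were never verbs; it is an
illustration of the suffix rule, not of how STYLE combines it with the
word classes from PARTS.
The names are ours.

```python
SUFFIXES = ("ment", "ance", "ence", "ion")


def looks_nominalized(word):
    """True if the word ends in one of the nominalization suffixes."""
    return word.lower().endswith(SUFFIXES)


def nominalization_pct(words):
    """Flagged words as a percentage of all words."""
    flagged = sum(looks_nominalized(w) for w in words)
    return 100 * flagged / len(words)


examples = ["accomplishment", "admittance", "adherence", "abbreviation", "system"]
print([w for w in examples if looks_nominalized(w)])
print(nominalization_pct(examples))
# prints: 80.0
```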
456: .NH 2
457: Sentence openers
458: .PP
459: Another agreed-upon principle of style is variety in sentence openers.
460: Because STYLE determines the type of sentence opener by
461: looking at the part of speech of the first word in the sentence,
462: the sentences counted under the heading ``subject opener'' may not
463: all really begin with the subject.
464: However, a large percentage of sentences in this category
465: still indicates lack of variety in sentence openers.
466: Other sentence opener measures help the user determine
467: if there are transitions between sentences and where
468: the subordination occurs.
469: Adverbs and conjunctions at the beginning of sentences are mechanisms for
470: transition between sentences.
471: A pronoun at the beginning shows a link to something previously mentioned
472: and indicates connectivity.
473: .PP
474: The location of subordination can be determined by comparing
475: the number of sentences that begin with a subordinator with
476: the number of sentences with complex clauses.
477: If few sentences start with subordinate conjunctions then
478: the subordination is embedded or at the end of the complex sentences.
479: For variety the writer may want to transform some sentences
480: to have leading subordination.
481: .PP
482: The last category of openers, expletives, is commonly
483: overworked in technical writing.
484: Expletives are the words ``it'' and ``there'', usually with the verb ``to be'',
485: in constructions where the subject follows the verb.
486: For example,
487: .DS
488: There are three streets used by the traffic.
489: There are too many users on this system.
490: .DE
491: This construction tends to emphasize the object rather than the
492: subject of the sentence.
493: The flag ``\-e'' will cause STYLE to print all
494: sentences that begin with an expletive.
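.PP
A rough test for an expletive opener can be sketched as follows.
It assumes a narrower rule than the one described above: the sentence
must begin with ``it'' or ``there'' followed immediately by a form of
``to be''.
The word list and the function name are ours, and STYLE itself works
from the word classes assigned by PARTS rather than from a fixed list.

```python
TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being"}


def opens_with_expletive(sentence):
    """True if the sentence starts with 'it' or 'there' plus a form of 'to be'."""
    words = sentence.lower().split()
    return len(words) >= 2 and words[0] in {"it", "there"} and words[1] in TO_BE


for s in [
    "There are three streets used by the traffic.",
    "There are too many users on this system.",
    "The system is slow.",
]:
    print(opens_with_expletive(s), s)
```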
495: .NH 1
496: DICTION
497: .PP
498: The program DICTION prints all sentences in a document containing
499: phrases that are either frequently misused or indicate wordiness.
500: The program, an extension of Aho's FGREP [12] string
501: matching program,
502: takes as input a file of phrases or patterns to be matched and a file
503: of text to be searched.
504: A data base of about 450 phrases has been compiled as a default
505: pattern file for DICTION.
506: Before attempting to locate phrases, the program maps
507: upper case letters to lower case and substitutes blanks for
508: punctuation.
509: Sentence boundaries were deemed less critical in DICTION than
510: in STYLE, so abbreviations and other uses of the character
511: ``.'' are not treated specially.
512: DICTION brackets all pattern matches in a sentence with the characters
513: ``['' and ``]''.
514: Although many of the phrases in the default data base are correct
515: in some contexts, in others they indicate wordiness.
516: Some examples of the phrases and suggested alternatives are:
517: .DS
518: .TS
519: cc
520: ll.
521: Phrase	Alternative
522: a large number of	many
523: arrive at a decision	decide
524: collect together	collect
525: for this reason	so
526: pertaining to	about
527: through the use of	by or with
528: utilize	use
529: with the exception of	except
530: .TE
531: .DE
532: Appendix 2 contains a complete listing of the default file.
533: Some of the entries are short forms of problem phrases.
534: For example, the phrase ``the fact'' is found in all of the following
535: and is sufficient to point out the wordiness to the user:
536: .DS
537: .TS
538: cc
539: ll.
540: Phrase	Alternative
541: accounted for by the fact that	caused by
542: an example of this is the fact that	thus
543: based on the fact that	because
544: despite the fact that	although
545: due to the fact that	because
546: in light of the fact that	because
547: in view of the fact that	since
548: notwithstanding the fact that	although
549: .TE
550: .DE
551: Entries in Appendix 2 preceded by ``~'' are not matched.
552: See Section 7 for details on the use of ``~''.
553: .PP
554: The user may supply her/his own pattern file with the flag ``\-f patfile''.
555: In this case the default file will be loaded first, followed by the user file.
556: This mechanism allows users to suppress
557: patterns contained in the default file or to include their own pet peeves that are not in the default file.
558: The flag ``\-n'' will exclude the default file altogether.
559: In constructing a pattern file, blanks should be used before and after each
560: phrase to avoid matching substrings in words.
561: For example, to find all occurrences of the word ``the'', the pattern
562: `` the '' should be used.
563: The blanks cause only the word ``the'' to be matched and not the
564: string ``the'' in words like there, other, and therefore.
565: One side effect of surrounding the words with blanks is that
566: when two phrases occur without intervening words, only the
567: first will be matched.
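.PP
The matching steps described above can be sketched in Python.
This is not the FGREP-based implementation: it is a plain left-to-right
scan, it assumes ASCII text and non-empty patterns, and preferring the
longest pattern at a given position is our choice.
The names normalize, find_phrases, and bracket are ours.

```python
import string

# Replace each punctuation character with one blank so offsets are preserved.
PUNCT_TO_BLANK = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize(text):
    """Lower-case, blank out punctuation, and pad with a blank at each end."""
    return " " + text.lower().translate(PUNCT_TO_BLANK) + " "


def find_phrases(text, patterns):
    """Return (start, end) offsets into text for each matched phrase."""
    padded = normalize(text)
    spans, pos = [], 0
    while pos < len(padded):
        hits = [p for p in patterns if padded.startswith(p, pos)]
        if not hits:
            pos += 1
            continue
        hit = max(hits, key=len)          # assumption: prefer the longest pattern
        lead = len(hit) - len(hit.lstrip(" "))
        trail = len(hit) - len(hit.rstrip(" "))
        # Subtract 1 to undo the blank that normalize() put in front.
        spans.append((pos + lead - 1, pos + len(hit) - trail - 1))
        pos += len(hit)                   # the trailing blank is consumed too
    return spans


def bracket(text, patterns):
    """Copy text, enclosing every matched phrase in [ and ]."""
    out, last = [], 0
    for start, end in find_phrases(text, patterns):
        out.append(text[last:start] + "[" + text[start:end] + "]")
        last = end
    return "".join(out) + text[last:]


print(bracket("We utilize the tool due to the fact that it works.",
              [" utilize ", " the fact "]))
# prints: We [utilize] the tool due to [the fact] that it works.
print(bracket("collect together collect together now", [" collect together "]))
# prints: [collect together] collect together now
```

The second call shows the side effect noted above: the blank that ends the first match is consumed, so the adjacent second occurrence has no leading blank left to match.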
568: .NH 1
569: EXPLAIN
570: .PP
571: The last program, EXPLAIN, is an interactive thesaurus for
572: phrases found by DICTION.
573: The user types one of the phrases bracketed by DICTION
574: and EXPLAIN responds with suggested substitutions for the phrase
575: that will improve the diction of the document.
576: .KF
577: .DS C
578: Table 1
579: Text Statistics on 20 Technical Documents
580: .TS
581: cccccc
582: llnnnn.
583: 	variable	minimum	maximum	mean	standard deviation
584: _
585: Readability	Kincaid	9.5	16.9	13.3	2.2
586: 	automated	9.0	17.4	13.3	2.5
587: 	Coleman-Liau	10.0	16.0	12.7	1.8
588: 	Flesch	8.9	17.0	14.4	2.2
589: _
590: sentence info.	av sent length	15.5	30.3	21.6	4.0
591: 	av word length	4.61	5.63	5.08	.29
592: 	av nonfunction length	5.72	7.30	6.52	.45
593: 	short sent	23%	46%	33%	5.9
594: 	long sent	7%	20%	14%	2.9
595: _
596: sentence types	simple	31%	71%	49%	11.4
597: 	complex	19%	50%	33%	8.3
598: 	compound	2%	14%	7%	3.3
599: 	compound-complex	2%	19%	10%	4.8
600: _
601: verb types	tobe	26%	64%	44.7%	10.3
602: 	auxiliary	10%	40%	21%	8.7
603: 	infinitives	8%	24%	15.1%	4.8
604: 	passives	12%	50%	29%	9.3
605: _
606: word usage	prepositions	10.1%	15.0%	12.3%	1.6
607: 	conjunction	1.8%	4.8%	3.4%	.9
608: 	adverbs	1.2%	5.0%	3.4%	1.0
609: 	nouns	23.6%	31.6%	27.8%	1.7
610: 	adjectives	15.4%	27.1%	21.1%	3.4
611: 	pronouns	1.2%	8.4%	2.5%	1.1
612: 	nominalizations	2%	5%	3.3%	.8
613: _
614: sentence openers	prepositions	6%	19%	12%	3.4
615: 	adverbs	0%	20%	9%	4.6
616: 	subject	56%	85%	70%	8.0
617: 	verbs	0%	4%	1%	1.0
618: 	subordinating conj	1%	12%	5%	2.7
619: 	conjunctions	0%	4%	0%	1.5
620: 	expletives	0%	6%	2%	1.7
621: .TE
622: .DE
623: .KE
624: .NH 1
625: Results
626: .NH 2
627: STYLE
628: .PP
629: To get baseline statistics and check the program's accuracy,
630: we ran STYLE on 20 technical documents.
631: There were a total of 3287 sentences in the sample.
632: The shortest document was 67 sentences long; the longest 339 sentences.
633: The documents covered a wide range of subject matter, including
634: theoretical computing, physics, psychology, engineering, and
635: affirmative action.
636: Table 1 gives the range, median, and standard deviation of the various style measures.
637: As you will note, most of the measurements have a fairly wide range of values
638: across the sample documents.
639: .PP
640: As a comparison, Table 2 gives the median results
641: for two different technical authors, a sample of instructional material, and a sample of the
642: Federalist Papers.
643: The two authors show similar styles, although author 2
644: uses somewhat shorter sentences and longer words than author 1.
645: Author 1 uses all types of sentences, while author 2 prefers
646: simple and complex sentences, using few compound or compound-complex sentences.
647: The other major difference in the styles of these authors is the location
648: of subordination.
649: Author 1 seems to prefer embedded or trailing subordination, while
650: author 2 begins many sentences with the subordinate clause.
651: The documents tested for both authors 1 and 2 were technical documents,
652: written for a technical audience.
653: The instructional documents, which are written for craftspeople,
654: vary surprisingly little from the two technical samples.
655: The sentences and words are a little longer,
656: and they contain many passive and auxiliary verbs, few adverbs, and almost
657: no pronouns.
658: The instructional documents contain many imperative sentences, so there are
659: many sentences with verb openers.
660: The sample of Federalist Papers contrasts with the other
661: samples in almost every way.
662: .KF
663: .DS C
664: Table 2
665: Text Statistics on Single Authors
666: .TS
667: cccccc
668: llnnnn.
669: 	variable	author 1	author 2	inst.	FED
670: _
671: readability	Kincaid	11.0	10.3	10.8	16.3
672: 	automated	11.0	10.3	11.9	17.8
673: 	Coleman-Liau	9.3	10.1	10.2	12.3
674: 	Flesch	10.3	10.7	10.1	15.0
675: _
676: sentence info	av sent length	22.64	19.61	22.78	31.85
677: 	av word length	4.47	4.66	4.65	4.95
678: 	av nonfunction length	5.64	5.92	6.04	6.87
679: 	short sent	35%	43%	35%	40%
680: 	long sent	18%	15%	16%	21%
681: _
682: sentence types	simple	36%	43%	40%	31%
683: 	complex	34%	41%	37%	34%
684: 	compound	13%	7%	4%	10%
685: 	compound-complex	16%	8%	14%	25%
686: _
687: verb type	tobe	42%	43%	45%	37%
688: 	auxiliary	17%	19%	32%	32%
689: 	infinitives	17%	15%	12%	21%
690: 	passives	20%	19%	36%	20%
691: _
692: word usage	prepositions	10.0%	10.8%	12.3%	15.9%
693: 	conjunctions	3.2%	2.4%	3.9%	3.4%
694: 	adverbs	5.05%	4.6%	3.5%	3.7%
695: 	nouns	27.7%	26.5%	29.1%	24.9%
696: 	adjectives	17.0%	19.0%	15.4%	12.4%
697: 	pronouns	5.3%	4.3%	2.1%	6.5%
698: 	nominalizations	1%	2%	2%	3%
699: _
700: sentence openers	prepositions	11%	14%	6%	5%
701: 	adverbs	9%	9%	6%	4%
702: 	subject	65%	59%	54%	66%
703: 	verb	3%	2%	14%	2%
704: 	subordinating conj	8%	14%	11%	3%
705: 	conjunction	1%	0%	0%	3%
706: 	expletives	3%	3%	0%	3%
707: .TE
708: .DE
709: .KE
710: .NH 2
711: DICTION
712: .PP
713: In the few weeks that DICTION has been available
714: to users
715: about 35,000 sentences have been run with about
716: 5,000 string matches.
717: The authors using the program seem to make
718: the suggested changes about 50-75% of the time.
719: To date, almost 200 of the 450 strings in the default
720: file have been matched.
721: Although most of these phrases are valid and correct
722: in some contexts, the 50-75% change rate seems to
723: show that the phrases are used much more often than
724: concise diction warrants.
725: .NH 1
726: Accuracy
727: .NH 2
728: Sentence Identification
729: .PP
730: The correctness of the STYLE output on the 20 document sample was checked
731: in detail.
732: STYLE misidentified
733: 129 sentence fragments as sentences
734: and incorrectly joined two or more sentences 75 times
735: in the 3287 sentence sample.
736: The problems were usually caused by nonstandard formatting
737: commands, unknown abbreviations, or lists of non-sentences.
738: An impossibly long sentence found as the longest sentence in
739: the document usually is the result of a long list
740: of non-sentences.
741: .NH 2
742: Sentence Types
743: .PP
744: STYLE correctly identified sentence type on 86.5% of
745: the sentences in the sample.
746: The type distribution of the sentences was
747: 52.5% simple, 29.9% complex, 8.5% compound and
748: 9% compound-complex.
749: The program reported 49.5% simple, 31.9% complex,
750: 8% compound and 10.4% compound-complex.
751: On the individual
752: documents, the number of simple sentences was
753: under-reported by about 4% and the complex and compound-complex
754: sentences were over-reported by 3% and 2%, respectively.
755: The following matrix shows the program's output
756: vs. the actual sentence type.
757: .DS C
758: .TS
759: csssss
760: cccccc
761: clnnnn.
762: Program Results
763: 		simple	complex	compound	comp-complex
764: Actual	simple	1566	132	49	17
765: Sentence	complex	47	892	6	65
766: Type	compound	40	6	207	23
767: 	comp-complex	0	52	5	249
768: .TE
769: .DE
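.PP
The summary percentages can be rechecked directly from the matrix.
The following is a minimal sketch in Python, not part of the STYLE system;
the names TYPES, MATRIX and summarize are introduced here only for illustration.
The totals are computed from the matrix as printed, so they may differ
slightly from the rounded figures quoted in the text.

```python
# Rows: actual sentence type; columns: type reported by the program.
# The counts are copied from the matrix above.
TYPES = ["simple", "complex", "compound", "comp-complex"]
MATRIX = [
    [1566, 132, 49, 17],
    [47, 892, 6, 65],
    [40, 6, 207, 23],
    [0, 52, 5, 249],
]


def summarize(matrix):
    """Return (total, correct, actual row sums, reported column sums)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))  # diagonal
    actual = [sum(row) for row in matrix]
    reported = [sum(col) for col in zip(*matrix)]
    return total, correct, actual, reported


total, correct, actual, reported = summarize(MATRIX)
print(f"agreement: {correct}/{total} = {100 * correct / total:.1f}%")
for name, a, r in zip(TYPES, actual, reported):
    print(f"{name:13s} actual {100 * a / total:5.1f}%"
          f"  reported {100 * r / total:5.1f}%")
```

The off-diagonal cells show where the disagreements fall; the largest single
one is actual simple sentences reported as complex.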
770: .PP
771: The system's inability to find imperative sentences seems to
772: have little effect on most of the style statistics.
773: A document with half of its sentences imperative was run, with and
774: without the imperative end marker.
775: The results were identical except for the expected errors of not finding
776: verbs as sentence openers, not counting the imperative sentences,
777: and a slight difference (1%) in the number of nouns
778: and adjectives reported.
779: .NH 2
780: Word Usage
781: .PP
782: The accuracy of identifying word types reflects
783: that of PARTS, which is about 95% correct.
784: The largest source of confusion is between nouns and
785: adjectives.
786: The verb counts were checked on about 20 sentences from each
787: document and found to be about 98% correct.
788: .NH 1
789: Technical Details
790: .NH 2
791: Finding Sentences
792: .PP
793: The formatting commands embedded in the text increase the difficulty
794: of finding sentences.
795: Not all text in a document is in sentence form; there are headings,
796: tables, equations and lists, for example.
797: Headings like ``Finding Sentences'' above should be discarded, not
798: attached to the next sentence.
799: However, since many of the documents are formatted to be phototypeset,
800: and contain font changes, which usually operate on the
801: most important words in the document,
802: discarding all formatting commands is not correct.
803: To improve the programs' ability to find sentence boundaries, the deformatting program, DEROFF [13],
804: has been given some knowledge of the formatting packages used on the
805: .UX
806: operating system.
807: DEROFF will now do the following:
808: .IP 1.
809: Suppress all formatting macros that
810: are used for titles, headings, author's name, etc.
811: .IP 2.
812: Suppress the arguments to the macros for titles, headings, author's name, etc.
813: .IP 3.
814: Suppress displays, tables, footnotes and text that is centered or in no-fill mode.
815: .IP 4.
816: Substitute a place holder for equations and check
817: for hidden end markers.
818: The place holder is necessary because many typists and authors use
819: the equation setter to change fonts on important words.
820: For this reason, header files containing the definition of
821: the EQN delimiters must also be included as input to STYLE.
822: End markers are often hidden when an equation ends a sentence
823: and the period is typed
824: inside the EQN delimiters.
825: .IP 5.
826: Add a "." after lists.
827: If the flag \-ml is also used, all lists are suppressed.
828: This is a separate flag because of the variety of ways the
829: list macros are used.
830: Often, lists are sentences that should be included in the analysis.
831: The user must determine how lists are used in the document to be analyzed.
832: .PP
833: Both STYLE and DICTION call DEROFF before they look at the text.
834: The user should supply the \-ml flag if the document contains
835: many lists of non-sentences that should be skipped.
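.PP
The first four steps above can be illustrated with a much-simplified filter.
The sketch below is written in Python and is not the DEROFF source;
it assumes the ms macro names shown in the tables, handles only
displayed equations between .EQ and .EN (not in-line EQN delimiters),
and does not treat lists at all.
The placeholder token and the function name strip_formatting are hypothetical.

```python
# Hypothetical, much-simplified stand-in for DEROFF's sentence-oriented mode.
HEADING_MACROS = {".TL", ".AU", ".AI", ".NH", ".SH"}  # heading text follows
PARAGRAPH_MACROS = {".PP", ".LP", ".IP"}              # these end a heading
BLOCKS = {".DS": ".DE", ".TS": ".TE", ".FS": ".FE"}   # suppressed entirely
PLACEHOLDER = "x"  # assumed placeholder token for an equation


def strip_formatting(lines):
    """Return only the lines that should be scanned for sentences."""
    out = []
    skip_until = None    # closing macro of the block being suppressed
    in_heading = False   # True while heading or title text is being skipped
    in_equation = False
    eq_text = []
    for line in lines:
        parts = line.split()
        macro = parts[0] if line.startswith(".") and parts else None
        if skip_until:
            if macro == skip_until:
                skip_until = None
            continue
        if in_equation:
            if macro == ".EN":
                in_equation = False
                # A period typed inside the equation is a hidden end marker.
                hidden = bool(eq_text) and eq_text[-1].rstrip().endswith(".")
                out.append(PLACEHOLDER + ("." if hidden else ""))
            else:
                eq_text.append(line)
            continue
        if macro in BLOCKS:
            skip_until = BLOCKS[macro]
        elif macro == ".EQ":
            in_equation, eq_text = True, []
        elif macro in HEADING_MACROS:
            in_heading = True
        elif macro in PARAGRAPH_MACROS:
            in_heading = False
        elif macro is None and not in_heading:
            out.append(line)
        # Any other formatting request is dropped.
    return out
```

Given a heading, a paragraph that ends in a displayed equation, and a display,
the filter keeps only the paragraph text and the placeholder with its recovered period.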
836: .NH 2
837: Details of DICTION
838: .PP
839: The program DICTION is based on the string matching program FGREP.
840: FGREP takes as input a file of patterns to be matched and a file
841: to be searched and outputs each line that contains
842: any of the patterns
843: with no indication of which pattern was matched.
844: The following changes have been made to FGREP:
845: .IP 1.
846: The basic unit that DICTION operates on is a sentence rather than a line.
847: Each sentence that contains one of the patterns is output.
848: .IP 2.
849: Upper case letters are mapped to lower case.
850: .IP 3.
851: Punctuation is replaced by blanks.
852: .IP 4.
853: All pattern matches in the sentence are found and surrounded with
854: ``['' and ``]''.
855: .IP 5.
856: A method for suppressing a string match has been added.
857: Any pattern that begins with ``~'' will not be matched.
858: Because the matching algorithm finds the longest
859: substring, suppressing a match allows a word in some
860: correct contexts not to be matched while allowing
861: the same word in another context to be found.
862: For example, the word ``which'' is often incorrectly used
863: instead of ``that'' in restrictive clauses.
864: However, ``which'' is usually correct when preceded by a preposition
865: or ``,''.
866: The default pattern file suppresses the match
867: of the common prepositions or a double
868: blank followed by ``which'' and therefore matches only
869: the suspect uses.
870: The double blank accounts for the replaced comma.
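.PP
The matching rules above can be sketched as follows.
This is an illustrative Python fragment, not the DICTION source:
it uses a simple left-to-right scan rather than FGREP's matching algorithm,
and the three patterns are made-up examples, not entries copied from the
default pattern file.
At each position the longest matching pattern wins, and a pattern that
begins with ``~'' consumes its text without flagging it.

```python
import string

# Every punctuation mark is replaced by a blank.
_PUNCT_TO_BLANK = str.maketrans({ch: " " for ch in string.punctuation})


def normalize(sentence):
    """Lower-case the sentence, blank out punctuation, pad with blanks."""
    return " " + sentence.lower().translate(_PUNCT_TO_BLANK) + " "


def mark(sentence, patterns):
    """Return the normalized sentence with flagged phrases inside [ ]."""
    # Split each pattern into (text, suppressed?).
    rules = [(p[1:], True) if p.startswith("~") else (p, False)
             for p in patterns]
    text = normalize(sentence)
    out, i = [], 0
    while i < len(text):
        hits = [r for r in rules if r[0] and text.startswith(r[0], i)]
        if not hits:
            out.append(text[i])
            i += 1
            continue
        body, suppressed = max(hits, key=lambda r: len(r[0]))  # longest wins
        out.append(body if suppressed else "[" + body + "]")
        i += len(body)
    return "".join(out)


# Hypothetical patterns: flag "which" unless it follows "of" or a comma.
PATTERNS = [" which ", "~ of which ", "~  which "]
print(mark("The file which he read.", PATTERNS))
print(mark("The file, which he read.", PATTERNS))
```

In the second call the comma has become a blank, so the text contains a
double blank before ``which'' and the suppressing pattern takes precedence.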
871: .NH 1
872: Conclusions
873: .PP
874: A system of writing tools that measure some of the
875: objective characteristics of writing style has been developed.
876: The tools are sufficiently general that they may be applied to
877: documents on any subject with equal accuracy.
878: Although the measurements are only of the surface
879: structure of the text, they do point out problem areas.
880: In addition to helping writers produce better documents,
881: these programs may be useful for studying
882: the writing process and finding other formulae for measuring
883: readability.