.\" @(#)rm1 6.1 (Berkeley) 5/22/86
.\"
.EQ
delim $$
.EN
.NH 1
Introduction
.PP
Computers have become important
in the document preparation process, with programs
to check for spelling errors and to format documents.
As the amount of text stored on line increases, it becomes
feasible and attractive to study writing
style and to attempt to help the writer in producing readable
documents.
The system of writing tools described here is a first step toward such help.
The system includes programs and a data base to
analyze writing style at the word and sentence level.
We use the term ``style'' in this paper to describe the
results of a writer's particular choices among individual words and
sentence forms.
Although many judgements of style are subjective,
particularly those of word choice,
there are some objective measures that experts
agree lead to good style.
Three programs have been written to measure some of
the objectively definable characteristics of writing style
and to identify some commonly misused or unnecessary phrases.
Although a document that conforms to the stylistic rules
is not guaranteed to be coherent and readable, one that
violates all of the rules is likely to be
difficult or tedious to read.
The program STYLE calculates readability, sentence length variability,
sentence type, word usage and sentence openers at a rate of about 400 words per second
on a PDP11/70 running the
.UX
Operating System.
It assumes that the sentences are well-formed, i. e. that
each sentence has a verb and that the subject and verb agree in number.
DICTION identifies phrases that are either bad usage or unnecessarily wordy.
EXPLAIN acts as a thesaurus for the phrases found by DICTION.
Sections 2, 3, and 4 describe the programs; Section 5 gives the results
on a cross-section of technical documents; Section 6 discusses
accuracy and problems; Section 7 gives implementation details.
.NH 1
STYLE
.PP
The program STYLE reads a document and prints a summary of
readability indices, sentence length and type, word usage,
and sentence openers.
It may also be used to locate all sentences in a document
that are longer than a given length, have a readability index higher than a given
number, contain a passive verb, or begin with an expletive.
STYLE
is based on the system for finding English word classes or parts of speech, PARTS [1].
PARTS is a set of programs that uses a small dictionary (about 350 words)
and suffix rules to partially assign word classes to
English text.
It then uses experimentally derived rules of word order to assign
word classes to all words in the text with an accuracy of about 95%.
Because PARTS uses only a small dictionary and general rules, it works
on text about any subject, from physics to psychology.
Style measures have been built into the output phase
of the programs that make up PARTS.
Some of the measures are simple counters of the word classes
found by PARTS; many are more complicated.
For example, the verb count is the total number of verb phrases.
This includes phrases like:
.DS
has been going
was only going
to go
.DE
each of which counts as one verb.
Figure 1 shows the output of STYLE run on a paper by Kernighan and Mashey
about the
.UX
programming environment [2].
.KF
.sp 2
.TS
box;
l1l.
programming environment
readability grades:
	(Kincaid) 12.3 (auto) 12.8 (Coleman-Liau) 11.8 (Flesch) 13.5 (46.3)
sentence info:
	no. sent 335 no. wds 7419
	av sent leng 22.1 av word leng 4.91
	no. questions 0 no. imperatives 0
	no. nonfunc wds 4362 58.8% av leng 6.38
	short sent (<17) 35% (118) long sent (>32) 16% (55)
	longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117
sentence types:
	simple 34% (114) complex 32% (108)
	compound 12% (41) compound-complex 21% (72)
word usage:
	verb types as % of total verbs
	tobe 45% (373) aux 16% (133) inf 14% (114)
	passives as % of non-inf verbs 20% (144)
	types as % of total
	prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)
	noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)
	nominalizations 2 % (155)
sentence beginnings:
	subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot 67%
	prep 12% (39) adv 9% (31)
	verb 0% (1) sub_conj 6% (20) conj 1% (5)
	expletives 4% (13)
.TE
.sp
.ce
Figure 1
.sp
.KE
As the example shows, STYLE output is in five parts.
After a brief discussion of sentences, we will describe the parts in order.
.NH 2
What is a sentence?
.PP
Readers of documents have little
trouble deciding where the sentences end.
People don't even have to stop and think about uses of the
character ``.'' in constructions like
1.25, A. J. Jones, Ph.D., i. e., or etc. .
When a computer reads a document,
finding the end of sentences is not as easy.
First we must throw away the printer's marks and formatting
commands that litter the text in computer form.
Then STYLE
defines a sentence
as a string of words ending in one of:
.DS
\&. ! ? /.
.DE
The end marker ``/.'' may be used to indicate an imperative sentence.
Imperative sentences that are not so marked are not identified as imperative.
STYLE properly handles numbers with embedded decimal points and commas,
strings of letters and numbers with embedded decimal points used for
computer file names, and
the common
abbreviations listed in Appendix 1.
Numbers that end sentences, like the preceding sentence, cause
a sentence break if the next word begins with a capital letter.
Initials cause a sentence break only if the next word begins with
a capital and is found in the dictionary of function words used by PARTS.
So the string
.DS
J. D. JONES
.DE
does not cause a break, but the string
.DS
\&... system H. The ...
.DE
does.
With these rules most sentences are broken at the proper place,
although occasionally
either two sentences are called one or a fragment is called
a sentence.
Section 6 reports how often this happens.
.NH 2
Readability Grades
.PP
The first section of STYLE output consists of four readability indices.
As Klare points out in [3], readability indices may be used to
estimate the reading skills needed by the reader to understand a document.
The readability indices reported by STYLE are based on
measures of sentence and word lengths.
Although the indices
may not measure whether the document is coherent
and well organized,
experience has shown that high indices seem to be indicators of stylistic
difficulty.
Documents with short sentences and short words have low scores;
those with long sentences and many polysyllabic words have high scores.
The four formulae reported are the Kincaid Formula [4], the Automated Readability Index [5],
the Coleman-Liau Formula [6],
and a normalized version of the Flesch Reading Ease Score [7].
The formulae differ because they were experimentally derived using different texts
and subject groups.
We will discuss each of the formulae briefly; for a more
detailed discussion the reader should see [3].
.PP
The Kincaid Formula, given by:
.EQ
Reading_Grade = 11.8 * syl_per_wd + .39 * wds_per_sent - 15.59
.EN
.br
was based on Navy training manuals that ranged in difficulty
from 5.5 to 16.3 in reading grade level.
The score reported by this formula tends to be in the mid-range of the
4 scores.
Because it is based on adult training manuals rather than
school book text, this formula is probably the best
one to apply to technical documents.
.PP
The Automated Readability Index (ARI), based on text from
grades 0 to 7, was derived to be easy to automate.
The formula is:
.EQ
Reading_Grade = 4.71 * let_per_wd + .5 * wds_per_sent - 21.43
.EN
.br
ARI tends to produce scores that are higher than Kincaid and
Coleman-Liau but are usually slightly lower than Flesch.
.PP
The Coleman-Liau Formula, based on text ranging in
difficulty from .4 to 16.3, is:
.EQ
Reading_Grade = 5.89 * let_per_wd - .3 * sent_per_100_wds - 15.8
.EN
.br
Of the four formulae this one usually gives the lowest
grade when applied to technical documents.
.PP
The last formula, the Flesch Reading Ease Score, is based
on grade school text covering grades 3 to 12.
The formula, given by:
.EQ
Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent
.EN
.br
is usually reported in the range 0 (very difficult) to 100 (very easy).
The score reported by STYLE is scaled to be comparable to
the other formulas,
except that the maximum grade level reported is set to 17.
The Flesch score is usually the highest of the 4 scores
on technical documents.
.PP
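Taken together, the four formulae above can be sketched in a few lines of Python.
This is an illustration only, not the code STYLE uses: the function names are
hypothetical, and the inputs are the counts printed in Figure 1.
Figure 1 does not report syllables per word, so the sketch recovers that
value by inverting the Flesch formula on the reported score of 46.3;
this is an assumption made only to check the formulae against each other.

```python
# Sketch of the four readability formulae as given in the text.
# All function and variable names are hypothetical, chosen for illustration.

def kincaid(syl_per_wd, wds_per_sent):
    """Kincaid reading grade."""
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59

def ari(let_per_wd, wds_per_sent):
    """Automated Readability Index reading grade."""
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43

def coleman_liau(let_per_wd, sent_per_100_wds):
    """Coleman-Liau reading grade."""
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8

def flesch_reading_ease(syl_per_wd, wds_per_sent):
    """Flesch Reading Ease score (0 = very difficult, 100 = very easy)."""
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent

def syllables_from_flesch(score, wds_per_sent):
    """Invert the Flesch formula to recover syllables per word (assumption)."""
    return (206.835 - 1.015 * wds_per_sent - score) / 84.6

# Values read off Figure 1 (already rounded by STYLE).
SENTENCES = 335
WORDS = 7419
LET_PER_WD = 4.91
FLESCH_SCORE = 46.3

wds_per_sent = WORDS / SENTENCES
sent_per_100_wds = 100.0 * SENTENCES / WORDS
syl_per_wd = syllables_from_flesch(FLESCH_SCORE, wds_per_sent)

print("Kincaid      %.1f" % kincaid(syl_per_wd, wds_per_sent))
print("ARI          %.1f" % ari(LET_PER_WD, wds_per_sent))
print("Coleman-Liau %.1f" % coleman_liau(LET_PER_WD, sent_per_100_wds))
```

With the rounded inputs from Figure 1, the sketch gives grades close to the
Kincaid, ARI, and Coleman-Liau values printed there.
The normalized Flesch grade is left out because the text does not give the scaling STYLE applies.
.PP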
Coke [8] found that the Kincaid Formula is probably the best predictor for
technical documents;
both ARI and Flesch tend to overestimate
the difficulty; Coleman-Liau tends to underestimate.
On text in the range of grades 7 to 9
the four formulas tend to be about the same.
On easy text the Coleman-Liau formula is probably
preferred since it is reasonably accurate at the lower
grades and it is safer to present text that is a little too
easy than a little too hard.
.PP
If a document has particularly difficult technical content, especially if
it includes a lot of mathematics,
it is probably best to make the text very easy to read, i.e., to lower the
readability index by shortening the sentences and words.
This will allow the reader to concentrate on the technical
content and not the long sentences.
The user should remember that these indices are estimators;
they should not be taken as absolute numbers.
STYLE called with ``\-r number'' will print all sentences with
an Automated Readability Index equal to or greater than ``number''.
.NH 2
Sentence length and structure
.PP
The next two sections of STYLE output deal with sentence length and structure.
Almost all books on writing style or effective writing emphasize
the importance of variety in sentence length and structure for good writing.
Ewing's first rule in discussing style in the book
.I
Writing for Results
.R
[9] is:
.DS
``Vary the sentence structure and length of your sentences.''
.DE
Leggett, Mead and Charvat break this rule into 3 in
.I
Prentice-Hall Handbook for Writers
.R
[10] as follows:
.DS
``34a. Avoid the overuse of short simple sentences.''
``34b. Avoid the overuse of long compound sentences.''
``34c. Use various sentence structures to avoid monotony and increase effectiveness.''
.DE
Although experts agree that these rules are important, not all writers
follow them.
Sample technical documents have been found with almost no
sentence length or type variability.
One document had 90% of its sentences about the same
length as the average;
another was made up almost entirely of simple sentences (80%).
.PP
The output sections labeled ``sentence info'' and ``sentence types'' give
both length and structure measures.
STYLE reports on the number and average length of both
sentences and words,
and number of questions and imperative sentences (those ending in ``/.'').
The measures of non-function words are an attempt to look at the content
words in the document.
In English
non-function words are nouns, adjectives, adverbs, and non-auxiliary verbs;
function words are prepositions, conjunctions, articles, and auxiliary
verbs.
Since most function words are short, they tend to lower the average
word length.
The average length of non-function words may be a more useful measure for comparing
word choice of different writers than the total average word length.
The percentages of short and long sentences measure sentence
length variability.
Short sentences are those at least 5 words shorter than the
average; long sentences are those at least 10 words longer than the average.
Last in the sentence information section are the
length and location of the longest and shortest sentences.
If the flag ``\-l number'' is used, STYLE will print all sentences
longer than ``number''.
.PP
Because of the difficulties in dealing with the many uses of commas and conjunctions
in English, sentence type definitions
vary slightly from those of standard textbooks, but still measure
the same constructional activity.
.IP 1.
A simple sentence has one verb and no dependent clause.
.IP 2.
A complex sentence has one independent
clause and one dependent clause, each with one verb.
Complex sentences are found by identifying sentences that contain either
a subordinate conjunction or a clause beginning with words like ``that''
or ``who''.
The preceding sentence has such a clause.
.IP 3.
A compound sentence has more than one verb and no dependent
clause.
Sentences joined by ``;'' are also counted as compound.
.IP 4.
A compound-complex sentence has either several dependent clauses
or one dependent clause and a compound verb in either
the dependent or independent clause.
.PP
Even using these broader definitions, simple
sentences dominate many of the technical documents that
have been tested,
but the example in Figure 1 shows variety in both sentence structure and
sentence length.
.NH 2
Word Usage
.PP
The word usage measures are an attempt to identify
some other constructional features of writing style.
There are many different ways in English to
say the same thing.
The constructions differ from one another
in the form of the words used.
The following sentences all convey approximately the
same meaning but differ in word usage:
.DS
The cxio program is used to perform all communication between the systems.
The cxio program performs all communications between the systems.
The cxio program is used to communicate between the systems.
The cxio program communicates between the systems.
All communication between the systems is performed by the cxio program.
.DE
The distribution of the parts of speech and verb constructions
helps identify overuse of particular constructions.
Although the measures used by STYLE are crude, they do point out
problem areas.
For each category, STYLE reports a percentage and a raw count.
In addition to looking at the percentage, the user
may find it useful to compare the raw count with the number of sentences.
If, for example, the number of infinitives is almost equal to the number
of sentences, then many of the sentences in the document are constructed
like the first and third in the preceding example.
The user may want to transform some of these sentences into another form.
Some of the implications of the word usage measures are discussed below.
.IP "\fIVerbs\fR "
are measured in several different ways to
try to determine what types of verb constructions are
most frequent in the document.
Technical writing tends to contain many
passive verb constructions and other usage of the verb ``to be''.
The category of verbs labeled ``tobe'' measures both passives and sentences of
the form:
.DS
.I
subject tobe predicate
.R
.DE
In counting verbs, whole verb phrases are counted as one verb.
Verb phrases containing auxiliary verbs are counted in the category
``aux''.
The verb phrases counted here are those whose tense is not
simple present or simple past.
It might eventually be useful to do more detailed measures
of verb tense or mood.
Infinitives are listed as ``inf''.
The percentages reported for these three categories are based on
the total number of verb phrases found.
These categories are not mutually exclusive;
they cannot be added, since, for example,
``to be going'' counts as both ``tobe'' and ``inf''.
Use of these three types of verb constructions varies significantly among authors.
.sp 2
STYLE reports passive verbs as a percentage of the finite verbs in the
document.
Most style books warn against the overuse of passive verbs.
Coleman [11] has shown that sentences with
active verbs are easier to learn than those
with passive verbs.
Although the inverted object-subject order of the passive
voice seems to emphasize the object, Coleman's experiments
showed that there is little difference in retention
by word position.
He also showed that the direct object of an active verb
is retained better than the subject of a passive verb.
These experiments support the advice of the style books suggesting
that writers should try to use active verbs wherever possible.
The flag ``\-p'' causes STYLE to print all sentences containing passive verbs.
.PP
.IP "\fIPronouns\fR "
add cohesiveness and connectivity to a document
by providing back-reference.
They are often a short-hand notation for something
previously mentioned, and therefore connect the sentence containing the pronoun with the
word to which the pronoun refers.
Although there are other mechanisms for such connections, documents
with no pronouns tend to be wordy and to have little connectivity.
.IP "\fIAdverbs\fR "
can provide transition between sentences and order
in time and space.
In performing these functions, adverbs, like pronouns, provide
connectivity and cohesiveness.
.IP "\fIConjunctions\fR "
provide parallelism in a document by connecting two or more
equal units.
These units may be whole sentences, verb phrases, nouns, adjectives, or
prepositional phrases.
The compound and compound-complex sentences reported under
sentence type are parallel structures.
Other uses of parallel structures are indicated by the degree to which the
number of conjunctions reported under word usage exceeds the
compound sentence measures.
.IP "\fINouns and Adjectives.\fR "
A ratio of nouns to adjectives near unity may indicate the over-use of modifiers.
Some technical writers qualify every noun with one or more
adjectives.
Qualifiers in phrases like ``simple linear single-link network model''
often lend more obscurity than precision to a text.
.IP "\fINominalizations\fR "
are verbs that are changed to nouns by adding one of the suffixes
``ment'', ``ance'', ``ence'', or ``ion''.
Examples are accomplishment, admittance, adherence, and abbreviation.
When a writer transforms a nominalized sentence to a non-nominalized
sentence, she/he increases the effectiveness of the sentence in
several ways.
The noun becomes an active verb and frequently one complicated clause
becomes two shorter clauses.
For example,
.DS
Their inclusion of this provision is admission of the importance of the system.
When they included this provision, they admitted the importance of the system.
.DE
Coleman found that the transformed sentences were easier to
learn, even when the transformation produced sentences that were
slightly longer, provided the transformation broke one clause into two.
Writers who find their document contains many
nominalizations may want to transform some of the sentences
to use active verbs.
.NH 2
Sentence openers
.PP
Another agreed-upon principle of style is variety in sentence openers.
Because STYLE determines the type of sentence opener by
looking at the part of speech of the first word in the sentence,
the sentences counted under the heading ``subject opener'' may not
all really begin with the subject.
However, a large percentage of sentences in this category
still indicates lack of variety in sentence openers.
Other sentence opener measures help the user determine
if there are transitions between sentences and where
the subordination occurs.
Adverbs and conjunctions at the beginning of sentences are mechanisms for
transition between sentences.
A pronoun at the beginning shows a link to something previously mentioned
and indicates connectivity.
.PP
The location of subordination can be determined by comparing
the number of sentences that begin with a subordinator with
the number of sentences with complex clauses.
If few sentences start with subordinate conjunctions then
the subordination is embedded or at the end of the complex sentences.
For variety the writer may want to transform some sentences
to have leading subordination.
.PP
The last category of openers, expletives, is commonly
overworked in technical writing.
Expletives are the words ``it'' and ``there'', usually with the verb ``to be'',
in constructions where the subject follows the verb.
For example,
.DS
There are three streets used by the traffic.
There are too many users on this system.
.DE
This construction tends to emphasize the object rather than the
subject of the sentence.
The flag ``\-e'' will cause STYLE to print all
sentences that begin with an expletive.
.NH 1
DICTION
.PP
The program DICTION prints all sentences in a document containing
phrases that are either frequently misused or indicate wordiness.
The program, an extension of Aho's FGREP [12] string
matching program,
takes as input a file of phrases or patterns to be matched and a file
of text to be searched.
A data base of about 450 phrases has been compiled as a default
pattern file for DICTION.
Before attempting to locate phrases, the program maps
upper case letters to lower case and substitutes blanks for
punctuation.
Sentence boundaries were deemed less critical in DICTION than
in STYLE, so abbreviations and other uses of the character
``.'' are not treated specially.
DICTION brackets all pattern matches in a sentence with the characters
``['' and ``]''.
Although many of the phrases in the default data base are correct
in some contexts, in others they indicate wordiness.
Some examples of the phrases and suggested alternatives are:
.DS
.TS
cc
ll.
Phrase	Alternative
a large number of	many
arrive at a decision	decide
collect together	collect
for this reason	so
pertaining to	about
through the use of	by or with
utilize	use
with the exception of	except
.TE
.DE
Appendix 2 contains a complete list of the default file.
Some of the entries are short forms of problem phrases.
For example, the phrase ``the fact'' is found in all of the following
and is sufficient to point out the wordiness to the user:
.DS
.TS
cc
ll.
Phrase	Alternative
accounted for by the fact that	caused by
an example of this is the fact that	thus
based on the fact that	because
despite the fact that	although
due to the fact that	because
in light of the fact that	because
in view of the fact that	since
notwithstanding the fact that	although
.TE
.DE
Entries in Appendix 2 preceded by ``~'' are not matched.
See Section 7 for details on the use of ``~''.
.PP
The user may supply her/his own pattern file with the flag ``\-f patfile''.
In this case the default file will be loaded first, followed by the user file.
This mechanism allows users to suppress
patterns contained in the default file or to include their own pet peeves that are not in the default file.
The flag ``\-n'' will exclude the default file altogether.
In constructing a pattern file, blanks should be used before and after each
phrase to avoid matching substrings in words.
For example, to find all occurrences of the word ``the'', the pattern
`` the '' should be used.
The blanks cause only the word ``the'' to be matched and not the
string ``the'' in words like there, other, and therefore.
One side effect of surrounding the words with blanks is that
when two phrases occur without intervening words, only the
first will be matched.
.NH 1
EXPLAIN
.PP
The last program, EXPLAIN, is an interactive thesaurus for
phrases found by DICTION.
The user types one of the phrases bracketed by DICTION
and EXPLAIN responds with suggested substitutions for the phrase
that will improve the diction of the document.
.KF
.DS C
Table 1
Text Statistics on 20 Technical Documents
.TS
cccccc
llnnnn.
variable		minimum	maximum	mean	standard deviation
_
Readability	Kincaid	9.5	16.9	13.3	2.2
	automated	9.0	17.4	13.3	2.5
	Coleman-Liau	10.0	16.0	12.7	1.8
	Flesch	8.9	17.0	14.4	2.2
_
sentence info.	av sent length	15.5	30.3	21.6	4.0
	av word length	4.61	5.63	5.08	.29
	av nonfunction length	5.72	7.30	6.52	.45
	short sent	23%	46%	33%	5.9
	long sent	7%	20%	14%	2.9
_
sentence types	simple	31%	71%	49%	11.4
	complex	19%	50%	33%	8.3
	compound	2%	14%	7%	3.3
	compound-complex	2%	19%	10%	4.8
_
verb types	tobe	26%	64%	44.7%	10.3
	auxiliary	10%	40%	21%	8.7
	infinitives	8%	24%	15.1%	4.8
	passives	12%	50%	29%	9.3
_
word usage	prepositions	10.1%	15.0%	12.3%	1.6
	conjunction	1.8%	4.8%	3.4%	.9
	adverbs	1.2%	5.0%	3.4%	1.0
	nouns	23.6%	31.6%	27.8%	1.7
	adjectives	15.4%	27.1%	21.1%	3.4
	pronouns	1.2%	8.4%	2.5%	1.1
	nominalizations	2%	5%	3.3%	.8
_
sentence openers	prepositions	6%	19%	12%	3.4
	adverbs	0%	20%	9%	4.6
	subject	56%	85%	70%	8.0
	verbs	0%	4%	1%	1.0
	subordinating conj	1%	12%	5%	2.7
	conjunctions	0%	4%	0%	1.5
	expletives	0%	6%	2%	1.7
.TE
.DE
.KE
.NH 1
Results
.NH 2
STYLE
.PP
To get baseline statistics and check the program's accuracy,
we ran STYLE on 20 technical documents.
There were a total of 3287 sentences in the sample.
The shortest document was 67 sentences long; the longest 339 sentences.
The documents covered a wide range of subject matter, including
theoretical computing, physics, psychology, engineering, and
affirmative action.
Table 1 gives the range, median, and standard deviation of the various style measures.
As you will note, most of the measurements have a fairly wide range of values
across the sample documents.
.PP
As a comparison, Table 2 gives the median results
for two different technical authors, a sample of instructional material, and a sample of the
Federalist Papers.
The two authors show similar styles, although author 2
uses somewhat shorter sentences and longer words than author 1.
Author 1 uses all types of sentences, while author 2 prefers
simple and complex sentences, using few compound or compound-complex sentences.
The other major difference in the styles of these authors is the location
of subordination.
Author 1 seems to prefer embedded or trailing subordination, while
author 2 begins many sentences with the subordinate clause.
The documents tested for both authors 1 and 2 were technical documents,
written for a technical audience.
The instructional documents, which are written for craftspeople,
vary surprisingly little from the two technical samples.
The sentences and words are a little longer,
and they contain many passive and auxiliary verbs, few adverbs, and almost
no pronouns.
The instructional documents contain many imperative sentences, so there are
many sentences with verb openers.
The sample of Federalist Papers contrasts with the other
samples in almost every way.
.KF
.DS C
Table 2
Text Statistics on Single Authors
.TS
cccccc
llnnnn.
variable		author 1	author 2	inst.	FED
_
readability	Kincaid	11.0	10.3	10.8	16.3
	automated	11.0	10.3	11.9	17.8
	Coleman-Liau	9.3	10.1	10.2	12.3
	Flesch	10.3	10.7	10.1	15.0
_
sentence info	av sent length	22.64	19.61	22.78	31.85
	av word length	4.47	4.66	4.65	4.95
	av nonfunction length	5.64	5.92	6.04	6.87
	short sent	35%	43%	35%	40%
	long sent	18%	15%	16%	21%
_
sentence types	simple	36%	43%	40%	31%
	complex	34%	41%	37%	34%
	compound	13%	7%	4%	10%
	compound-complex	16%	8%	14%	25%
_
verb type	tobe	42%	43%	45%	37%
	auxiliary	17%	19%	32%	32%
	infinitives	17%	15%	12%	21%
	passives	20%	19%	36%	20%
_
word usage	prepositions	10.0%	10.8%	12.3%	15.9%
	conjunctions	3.2%	2.4%	3.9%	3.4%
	adverbs	5.05%	4.6%	3.5%	3.7%
	nouns	27.7%	26.5%	29.1%	24.9%
	adjectives	17.0%	19.0%	15.4%	12.4%
	pronouns	5.3%	4.3%	2.1%	6.5%
	nominalizations	1%	2%	2%	3%
_
sentence openers	prepositions	11%	14%	6%	5%
	adverbs	9%	9%	6%	4%
	subject	65%	59%	54%	66%
	verb	3%	2%	14%	2%
	subordinating conj	8%	14%	11%	3%
	conjunction	1%	0%	0%	3%
	expletives	3%	3%	0%	3%
.TE
.DE
.KE
.NH 2
DICTION
.PP
In the few weeks that DICTION has been available
to users
about 35,000 sentences have been run with about
5,000 string matches.
The authors using the program seem to make
the suggested changes about 50-75% of the time.
To date, almost 200 of the 450 strings in the default
file have been matched.
Although most of these phrases are valid and correct
in some contexts, the 50-75% change rate seems to
show that the phrases are used much more often than
concise diction warrants.
.NH 1
Accuracy
.NH 2
Sentence Identification
.PP
The correctness of the STYLE output on the 20 document sample was checked
in detail.
STYLE misidentified
129 sentence fragments as sentences
and incorrectly joined two or more sentences 75 times
in the 3287 sentence sample.
The problems were usually caused by nonstandard formatting
commands, unknown abbreviations, or lists of non-sentences.
An impossibly long sentence found as the longest sentence in
the document usually is the result of a long list
of non-sentences.
.NH 2
Sentence Types
.PP
STYLE correctly identified sentence type on 86.5% of
the sentences in the sample.
The type distribution of the sentences was
52.5% simple, 29.9% complex, 8.5% compound and
9% compound-complex.
The program reported 49.5% simple, 31.9% complex,
8% compound and 10.4% compound-complex.
Looking at the errors on the individual
documents, the number of simple sentences was
under-reported by about 4% and the complex and compound-complex
were over-reported by 3% and 2%, respectively.
The following matrix shows the program's output
vs. the actual sentence type.
.DS C
.TS
csssss
cccccc
clnnnn.
Program Results
		simple	complex	compound	comp-complex
Actual	simple	1566	132	49	17
Sentence	complex	47	892	6	65
Type	compound	40	6	207	23
	comp-complex	0	52	5	249
.TE
.DE
.PP
The system's inability to find imperative sentences seems to
have little effect on most of the style statistics.
A document with half of its sentences imperative was run, with and
without the imperative end marker.
The results were identical except for the expected errors of not finding
verbs as sentence openers, not counting the imperative sentences,
and a slight difference (1%) in the number of nouns
and adjectives reported.
.NH 2
Word Usage
.PP
The accuracy of identifying word types reflects
that of PARTS, which is about 95% correct.
The largest source of confusion is between nouns and
adjectives.
The verb counts were checked on about 20 sentences from each
document and found to be about 98% correct.
.NH 1
Technical Details
.NH 2
Finding Sentences
.PP
The formatting commands embedded in the text increase the difficulty
of finding sentences.
Not all text in a document is in sentence form; there are headings,
tables, equations and lists, for example.
Headings like ``Finding Sentences'' above should be discarded, not
attached to the next sentence.
However, since many of the documents are formatted to be phototypeset,
and contain font changes, which usually operate on the
most important words in the document,
discarding all formatting commands is not correct.
To improve the programs' ability to find sentence boundaries, the deformatting program, DEROFF [13],
has been given some knowledge of the formatting packages used on the
.UX
operating system.
DEROFF will now do the following:
.IP 1.
Suppress all formatting macros that
are used for titles, headings, author's name, etc.
.IP 2.
Suppress the arguments to the macros for titles, headings, author's name, etc.
.IP 3.
Suppress displays, tables, footnotes and text that is centered or in no-fill mode.
.IP 4.
Substitute a place holder for equations and check
for hidden end markers.
The place holder is necessary because many typists and authors use
the equation setter to change fonts on important words.
For this reason, header files containing the definition of
the EQN delimiters must also be included as input to STYLE.
End markers are often hidden when an equation ends a sentence
and the period is typed
inside the EQN delimiters.
.IP 5.
Add a "." after lists.
If the flag \-ml is also used, all lists are suppressed.
This is a separate flag because of the variety of ways the
list macros are used.
Often, lists are sentences that should be included in the analysis.
The user must determine how lists are used in the document to be analyzed.
.PP
Both STYLE and DICTION call DEROFF before they look at the text.
The user should supply the \-ml flag if the document contains
many lists of non-sentences that should be skipped.
.NH 2
Details of DICTION
.PP
The program DICTION is based on the string matching program FGREP.
FGREP takes as input a file of patterns to be matched and a file
to be searched and outputs each line that contains
any of the patterns
with no indication of which pattern was matched.
The following changes have been made to FGREP:
.IP 1.
The basic unit that DICTION operates on is a sentence rather than a line.
Each sentence that contains one of the patterns is output.
.IP 2.
Upper case letters are mapped to lower case.
.IP 3.
Punctuation is replaced by blanks.
.IP 4.
All pattern matches in the sentence are found and surrounded with
``['' and ``]''.
.IP 5.
A method for suppressing a string match has been added.
Any pattern that begins with ``~'' will not be matched.
Because the matching algorithm finds the longest
substring, the suppression of a match allows words in some
correct contexts not to be matched while allowing
the word in another context to be found.
For example, the word ``which'' is often incorrectly used
instead of ``that'' in restrictive clauses.
However, ``which'' is usually correct when preceded by a preposition
or ``,''.
The default pattern file suppresses the match
of the common prepositions or a double
blank followed by ``which'' and therefore matches only
the suspect uses.
The double blank accounts for the replaced comma.
.NH 1
Conclusions
.PP
A system of writing tools that measure some of the
objective characteristics of writing style has been developed.
The tools are sufficiently general that they may be applied to
documents on any subject with equal accuracy.
Although the measurements are only of the surface
structure of the text, they do point out problem areas.
In addition to helping writers produce better documents,
these programs may be useful for studying
the writing process and finding other formulae for measuring
readability.
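.PP
The matching rules listed under Details of DICTION can be illustrated
with a minimal Python sketch.
It is not the FGREP-based implementation; the function name, the
pattern list, and the choice to place brackets in the original
(unmapped) text are assumptions made only for illustration.
At each position the longest pattern is chosen, and a pattern
that begins with ``~'' consumes its text without marking it.
.DS L
```python
import string

# Length-preserving mapping: upper case -> lower case, punctuation -> blank.
_NORMALIZE = str.maketrans(
    string.ascii_uppercase + string.punctuation,
    string.ascii_lowercase + " " * len(string.punctuation),
)

# Hypothetical pattern list; the real default file is not reproduced here.
PATTERNS = ["which", "~  which", "~of which", "~in which"]


def mark_sentence(sentence, patterns=PATTERNS):
    """Return the sentence with unsuppressed pattern matches in [brackets]."""
    flagged = [p for p in patterns if not p.startswith("~")]
    suppressed = [p[1:] for p in patterns if p.startswith("~")]
    text = sentence.translate(_NORMALIZE)
    out = []
    i = 0
    while i < len(text):
        # Find the longest pattern of either kind that starts at position i.
        # Suppressing patterns are tried first, so they win ties in length.
        best, is_flagged = "", False
        for group, flag in ((suppressed, False), (flagged, True)):
            for p in group:
                if len(p) > len(best) and text.startswith(p, i):
                    best, is_flagged = p, flag
        if not best:
            out.append(sentence[i])
            i += 1
            continue
        span = sentence[i:i + len(best)]
        # A suppressing pattern consumes the text without marking it.
        out.append("[" + span + "]" if is_flagged else span)
        i += len(best)
    return "".join(out)


if __name__ == "__main__":
    print(mark_sentence("The program which failed was fixed."))
    print(mark_sentence("The program, which failed, was fixed."))
```
.DE
.PP
In this sketch the comma before ``which'' becomes a blank, so the
double-blank suppressing pattern matches first and the word is left unmarked,
while ``which'' after an ordinary word is bracketed.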