43BSDReno/share/doc/usd/32.diction/rm1 - annotate

Annotation of 43BSDReno/share/doc/usd/32.diction/rm1, revision 1.1.1.1

1.1       root        1: .\"    @(#)rm1 6.1 (Berkeley) 5/22/86
                      2: .\"
                      3: .EQ
                      4: delim $$
                      5: .EN
                      6: .NH 1
                      7: Introduction
                      8: .PP
                      9: Computers have become important
                     10: in the document preparation process, with programs
                     11: to check for spelling errors and to format documents.
                     12: As the amount of text stored on line increases, it becomes
                     13: feasible and attractive to study writing
                     14: style and to attempt to help the writer in producing readable
                     15: documents.
                     16: The system of writing tools described here is a first step toward such help.
                     17: The system includes programs and a data base to
                     18: analyze writing style at the word and sentence level.
                     19: We use the term ``style'' in this paper to describe the
                     20: results of a writer's particular choices among individual words and
                     21: sentence forms.
                     22: Although many judgements of style are subjective,
                     23: particularly those of word choice,
                     24: there are some objective measures that experts
                     25: agree lead to good style.
                     26: Three programs have been written to measure some of
                     27: the objectively definable characteristics of writing style
                     28: and to identify some commonly misused or unnecessary phrases.
                     29: Although a document that conforms to the stylistic rules
                     30: is not guaranteed to be coherent and readable, one that
                     31: violates all of the rules is likely to be
                     32: difficult or tedious to read.
                     33: The program STYLE calculates readability, sentence length variability,
                     34: sentence type, word usage and sentence openers at a rate of about 400 words per second
                     35: on a PDP11/70 running the
                     36: .UX
                     37: Operating System.
                     38: It assumes that the sentences are well-formed, i.e., that
                     39: each sentence has a verb and that the subject and verb agree in number.
                     40: DICTION identifies phrases that are either bad usage or unnecessarily wordy.
                     41: EXPLAIN acts as a thesaurus for the phrases found by DICTION.
                     42: Sections 2, 3, and 4 describe the programs; Section 5 gives the results
                     43: on a cross-section of technical documents; Section 6 discusses
                     44: accuracy and problems; Section 7 gives implementation details.
                     45: .NH 1
                     46: STYLE
                     47: .PP
                     48: The program STYLE reads a document and prints a summary of
                     49: readability indices, sentence length and type, word usage,
                     50: and sentence openers.
                     51: It may also be used to locate all sentences in a document that are
                     52: longer than a given length, that have a readability index higher than a given
                     53: number, that contain a passive verb, or that begin with an expletive.
                     54: STYLE
                     55: is based on the system for finding English word classes or parts of speech, PARTS [1].
                     56: PARTS is a set of programs that uses a small dictionary (about 350 words)
                     57: and suffix rules to partially assign word classes to
                     58: English text.
                     59: It then uses experimentally derived rules of word order to assign
                     60: word classes to all words in the text with an accuracy of about 95%.
                     61: Because PARTS uses only a small dictionary and general rules, it works
                     62: on text about any subject, from physics to psychology.
                     63: Style measures have been built into the output phase
                     64: of the programs that make up PARTS.
                     65: Some of the measures are simple counters of the word classes
                     66: found by PARTS; many are more complicated.
                     67: For example, the verb count is the total number of verb phrases.
                     68: This includes phrases like:
                     69: .DS
                     70: has been going
                     71: was only going
                     72: to go
                     73: .DE
                     74: each of which counts as one verb.
                     75: Figure 1 shows the output of STYLE run on a paper by Kernighan and Mashey
                     76: about the
                     77: .UX
                     78: programming environment [2].
                     79: .KF
                     80: .sp 2
                     81: .TS
                     82: box;
                     83: l1l.
                     84: programming environment
                     85: readability grades:
                     86:        (Kincaid) 12.3  (auto) 12.8  (Coleman-Liau) 11.8  (Flesch) 13.5 (46.3)
                     87: sentence info:
                     88:        no. sent 335 no. wds 7419
                     89:        av sent leng 22.1 av word leng 4.91
                     90:        no. questions 0 no. imperatives 0
                     91:        no. nonfunc wds 4362  58.8%   av leng 6.38
                     92:        short sent (<17) 35% (118) long sent (>32)  16% (55)
                     93:        longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117
                     94: sentence types:
                     95:        simple  34% (114) complex  32% (108)
                     96:        compound  12% (41) compound-complex  21% (72)
                     97: word usage:
                     98:        verb types as % of total verbs
                     99:        tobe  45% (373) aux  16% (133) inf  14% (114)
                    100:        passives as % of non-inf verbs  20% (144)
                    101:        types as % of total
                    102:        prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)
                    103:        noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)
                    104:        nominalizations   2 % (155)
                    105: sentence beginnings:
                    106:        subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot  67%
                    107:        prep  12% (39) adv   9% (31) 
                    108:        verb   0% (1)  sub_conj   6% (20) conj   1% (5)
                    109:        expletives   4% (13)
                    110: .TE
                    111: .sp
                    112: .ce
                    113: Figure 1
                    114: .sp
                    115: .KE
                    116: As the example shows, STYLE output is in five parts.
                    117: After a brief discussion of sentences, we will describe the parts in order.
                    118: .NH 2
                    119: What is a sentence?
                    120: .PP
                    121: Readers of documents have little
                    122: trouble deciding where the sentences end.
                    123: People don't even have to stop and think about uses of the
                    124: character ``.'' in constructions like
                    125: 1.25, A. J. Jones, Ph.D., i. e., or etc. .
                    126: When a computer reads a document,
                    127: finding the end of sentences is not as easy.
                    128: First we must throw away the printer's marks and formatting
                    129: commands that litter the text in computer form.
                    130: Then STYLE
                    131: defines a sentence
                    132: as a string of words ending in one of:
                    133: .DS
                    134:  . ! ? /.
                    135: .DE
                    136: The end marker ``/.'' may be used to indicate an imperative sentence.
                    137: Imperative sentences that are not so marked are not identified as imperative.
                    138: STYLE properly handles numbers with embedded decimal points and commas,
                    139: strings of letters and numbers with embedded decimal points used for
                    140: computer file names, and
                    141: the common
                    142: abbreviations listed in Appendix 1.
                    143: Numbers that end sentences, like the preceding sentence, cause
                    144: a sentence break if the next word begins with a capital letter.
                    145: Initials only cause a sentence break if the next word begins with
                    146: a capital and is found in the dictionary of function words used by PARTS.
                    147: So the string
                    148: .DS
                    149: J. D. JONES
                    150: .DE
                    151: does not cause a break, but the string
                    152: .DS
                    153:  ... system H.  The ...
                    154: .DE
                    155: does.
                    156: With these rules most sentences are broken at the proper place,
                    157: although occasionally
                    158: either two sentences are called one or a fragment is called
                    159: a sentence.
                    160: More on this later.
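The sentence-break rules above can be sketched in a few lines of Python. This is an illustration only, not the STYLE implementation: the names ABBREVIATIONS, FUNCTION_WORDS, ends_sentence and split_sentences are hypothetical, and the two word sets are small stand-ins for Appendix 1 and for the PARTS function-word dictionary, neither of which is reproduced here.

```python
import re

# Hypothetical stand-ins: the real lists are Appendix 1 and the PARTS dictionary.
ABBREVIATIONS = {"ph.d.", "etc.", "e.g.", "i.e."}
FUNCTION_WORDS = {"the", "a", "an", "it", "this", "in"}


def ends_sentence(token, next_token):
    """Decide whether a whitespace-delimited token closes a sentence."""
    if token.endswith(("!", "?", "/.")):
        return True
    if not token.endswith("."):
        return False          # covers embedded points such as 1.25 or a.out
    if next_token is None:
        return True
    if token.lower() in ABBREVIATIONS:
        return False
    body = token[:-1]
    next_is_capitalized = next_token[:1].isupper()
    if re.fullmatch(r"[A-Z]", body):
        # An initial breaks only before a capitalized function word.
        next_word = next_token.lower().strip(".,;:")
        return next_is_capitalized and next_word in FUNCTION_WORDS
    if re.fullmatch(r"[\d.,]*\d", body):
        # A number breaks only if the next word is capitalized.
        return next_is_capitalized
    return True


def split_sentences(text):
    """Group tokens into sentences using ends_sentence."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if ends_sentence(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these assumptions, the initials in "J. D. JONES" do not split the text, while "system H. The" does, because "The" is capitalized and is a function word. Like the rules it imitates, the sketch will occasionally join two sentences or emit a fragment.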
                    161: .NH 2
                    162: Readability Grades
                    163: .PP
                    164: The first section of STYLE output consists of four readability indices.
                    165: As Klare points out in [3], readability indices may be used to
                    166: estimate the reading skills needed by the reader to understand a document.
                    167: The readability indices reported by STYLE are based on
                    168: measures of sentence and word lengths.
                    169: Although the indices
                    170: may not measure whether the document is coherent
                    171: and well organized,
                    172: experience has shown that high indices seem to be indicators of stylistic
                    173: difficulty.
                    174: Documents with short sentences and short words have low scores;
                    175: those with long sentences and many polysyllabic words have high scores.
                    176: The 4 formulae reported are Kincaid Formula [4], Automated Readability Index [5],
                    177: Coleman-Liau Formula [6]
                    178: and a normalized version of Flesch Reading Ease Score [7].
                    179: The formulae differ because they were experimentally derived using different texts
                    180: and subject groups.
                    181: We will discuss each of the formulae briefly; for a more
                    182: detailed discussion the reader should see [3].
                    183: .PP
                    184: The Kincaid Formula, given by:
                    185: .EQ
                    186: Reading_Grade = 11.8 * syl_per_wd + .39 * wds_per_sent - 15.59
                    187: .EN
                    188: .br
                    189: was based on Navy training manuals that ranged in difficulty
                    190: from 5.5 to 16.3 in reading grade level.
                    191: The score reported by this formula tends to be in the mid-range of the
                    192: 4 scores.
                    193: Because it is based on adult training manuals rather than
                    194: school book text, this formula is probably the best
                    195: one to apply to technical documents.
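As a minimal sketch, the Kincaid Formula can be evaluated directly from the two averages it uses. The function name kincaid_grade and the sample values (1.5 syllables per word, 20 words per sentence) are hypothetical and chosen only for illustration.

```python
def kincaid_grade(syl_per_wd, wds_per_sent):
    """Kincaid reading grade from average syllables per word and words per sentence."""
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59


# Hypothetical document: 1.5 syllables per word, 20 words per sentence.
example_grade = kincaid_grade(1.5, 20.0)
print(round(example_grade, 2))
```

Because both coefficients are positive, shortening either the words or the sentences lowers the grade.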
                    196: .PP
                    197: The Automated Readability Index (ARI), based on text from
                    198: grades 0 to 7, was derived to be easy to automate.
                    199: The formula is:
                    200: .EQ
                    201: Reading_Grade = 4.71 * let_per_wd + .5 * wds_per_sent - 21.43
                    202: .EN
                    203: .br
                    204: ARI tends to produce scores that are higher than Kincaid and
                    205: Coleman-Liau but are usually slightly lower than Flesch.
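The ARI can be checked against Figure 1 with a short sketch. The function name ari_grade is hypothetical; the inputs are the sentence count, word count, and average word length printed in the figure, so the result agrees with the reported grade only up to the rounding of those printed values.

```python
def ari_grade(let_per_wd, wds_per_sent):
    """Automated Readability Index from letters per word and words per sentence."""
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43


# Values printed in Figure 1: 335 sentences, 7419 words, average word length 4.91.
figure1_ari = ari_grade(4.91, 7419 / 335)
print(round(figure1_ari, 1))
```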
                    206: .PP
                    207: The Coleman-Liau Formula, based on text ranging in
                    208: difficulty from .4 to 16.3, is:
                    209: .EQ
                    210: Reading_Grade = 5.89 * let_per_wd - .3 * sent_per_100_wds - 15.8
                    211: .EN
                    212: .br
                    213: Of the four formulae this one usually gives the lowest
                    214: grade when applied to technical documents.
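The same check works for the Coleman-Liau Formula, which uses sentences per 100 words rather than words per sentence. The function name coleman_liau_grade is hypothetical, and the inputs are again the rounded values printed in Figure 1.

```python
def coleman_liau_grade(let_per_wd, sent_per_100_wds):
    """Coleman-Liau reading grade from letters per word and sentences per 100 words."""
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8


# Values printed in Figure 1: 335 sentences in 7419 words, average word length 4.91.
sentences_per_100_words = 100 * 335 / 7419
figure1_cl = coleman_liau_grade(4.91, sentences_per_100_words)
print(round(figure1_cl, 1))
```

Note the negative coefficient: more sentences per 100 words means shorter sentences, which lowers the grade.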
                    215: .PP
                    216: The last formula, the Flesch Reading Ease Score, is based
                    217: on grade school text covering grades 3 to 12.
                    218: The formula, given by:
                    219: .EQ
                    220: Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent
                    221: .EN
                    222: .br
                    223: is usually reported in the range 0 (very difficult) to 100 (very easy).
                    224: The score reported by STYLE is scaled to be comparable to
                    225: the other formulas,
                    226: except that the maximum grade level reported is set to 17.
                    227: The Flesch score is usually the highest of the 4 scores
                    228: on technical documents.
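The raw Flesch score shown in parentheses in Figure 1 can be approximated with the sketch below. Figure 1 does not print syllables per word, so the sketch back-derives it from the reported Kincaid grade; that step is an assumption made for illustration, as are the function names. The scaling that STYLE applies to turn the raw score into a grade is not given here and is not reproduced.

```python
def flesch_reading_ease(syl_per_wd, wds_per_sent):
    """Raw Flesch Reading Ease score (higher means easier)."""
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent


def syllables_from_kincaid(kincaid_grade, wds_per_sent):
    """Solve the Kincaid Formula for syllables per word (illustrative back-derivation)."""
    return (kincaid_grade + 15.59 - 0.39 * wds_per_sent) / 11.8


# Figure 1 reports 335 sentences, 7419 words, and a Kincaid grade of 12.3.
words_per_sentence = 7419 / 335
syllables_per_word = syllables_from_kincaid(12.3, words_per_sentence)
figure1_flesch = flesch_reading_ease(syllables_per_word, words_per_sentence)
print(round(figure1_flesch, 1))
```

The result lands within rounding error of the 46.3 shown in the figure, which suggests that the Kincaid and Flesch values there were computed from the same syllable count.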
                    229: .PP
                    230: Coke [8] found that the Kincaid Formula is probably the best predictor for
                    231: technical documents;
                    232: both ARI and Flesch tend to overestimate
                    233: the difficulty; Coleman-Liau tends to underestimate it.
                    234: On text in the range of grades 7 to 9
                    235: the four formulas tend to be about the same.
                    236: On easy text the Coleman-Liau formula is probably
                    237: preferred since it is reasonably accurate at the lower
                    238: grades and it is safer to present text that is a little too
                    239: easy than a little too hard.
                    240: .PP
                    241: If a document has particularly difficult technical content, especially if
                    242: it includes a lot of mathematics,
                    243: it is probably best to make the text very easy to read, i.e., to lower the
                    244: readability index by shortening the sentences and words.
                    245: This will allow the reader to concentrate on the technical
                    246: content and not the long sentences.
                    247: The user should remember that these indices are estimators;
                    248: they should not be taken as absolute numbers.
                    249: STYLE called with ``\-r number'' will print all sentences with
                    250: an Automated Readability Index equal to or greater than ``number''.
                    251: .NH 2
                    252: Sentence length and structure
                    253: .PP
                    254: The next two sections of STYLE output deal with sentence length and structure.
                    255: Almost all books on writing style or effective writing emphasize
                    256: the importance of variety in sentence length and structure for good writing.
                    257: Ewing's first rule in discussing style in the book
                    258: .I
                    259: Writing for Results
                    260: .R
                    261: [9] is:
                    262: .DS
                    263: ``Vary the sentence structure and length of your sentences.''
                    264: .DE
                    265: Leggett, Mead and Charvat break this rule into 3 in
                    266: .I
                    267: Prentice-Hall Handbook for Writers
                    268: .R
                    269: [10] as follows:
                    270: .DS
                    271: ``34a. Avoid the overuse of short simple sentences.''
                    272: ``34b. Avoid the overuse of long compound sentences.''
                    273: ``34c. Use various sentence structures to avoid monotony and increase effectiveness.''
                    274: .DE
                    275: Although experts agree that these rules are important, not all writers
                    276: follow them.
                    277: Sample technical documents have been found with almost no
                    278: sentence length or type variability.
                    279: One document had 90% of its sentences about the same
                    280: length as the average;
                    281: another was made up almost entirely of simple sentences (80%).
                    282: .PP
                    283: The output sections labeled ``sentence info'' and ``sentence types'' give
                    284: both length and structure measures.
                    285: STYLE reports on the number and average length of both
                    286: sentences and words,
                    287: and number of questions and imperative sentences (those ending in ``/.'').
                    288: The measures of non-function words are an attempt to look at the content
                    289: words in the document.
                    290: In English
                    291: non-function words are nouns, adjectives, adverbs, and non-auxiliary verbs;
                    292: function words are prepositions, conjunctions, articles, and auxiliary
                    293: verbs.
                    294: Since most function words are short, they tend to lower the average
                    295: word length.
                    296: The average length of non-function words may be a more useful measure for comparing
                    297: word choice of different writers than the total average word length.
                    298: The percentages of short and long sentences measure sentence
                    299: length variability.
                    300: Short sentences are those at least 5 words shorter than the
                    301: average; long sentences are those at least 10 words longer than the average.
                    302: Last in the sentence information section is the
                    303: length and location of the longest and shortest sentences.
                    304: If the flag ``\-l number'' is used, STYLE will print all sentences
                    305: longer than ``number''.
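The short and long sentence measures can be sketched as follows, assuming the sentence lengths in words are already known. The function name length_variability and the sample lengths are hypothetical, and the exact boundary handling in STYLE may differ from the "at least" reading used here.

```python
def length_variability(sentence_lengths):
    """Count sentences at least 5 words under or 10 words over the average length."""
    count = len(sentence_lengths)
    average = sum(sentence_lengths) / count
    short = sum(1 for n in sentence_lengths if n <= average - 5)
    long_ = sum(1 for n in sentence_lengths if n >= average + 10)
    return {
        "average": average,
        "short": short,
        "short_pct": round(100 * short / count),
        "long": long_,
        "long_pct": round(100 * long_ / count),
    }


# Hypothetical document of five sentences.
report = length_variability([5, 20, 20, 20, 35])
print(report)
```

A document whose short and long percentages are both near zero has the uniform sentence length described earlier.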
                    306: .PP
                    307: Because of the difficulties in dealing with the many uses of commas and conjunctions
                    308: in English, sentence type definitions
                    309: vary slightly from those of standard textbooks, but still measure
                    310: the same constructional activity.
                    311: .IP 1.
                    312: A simple sentence has one verb and no dependent clause.
                    313: .IP 2.
                    314: A complex sentence has one independent
                    315: clause and one dependent clause, each with one verb.
                    316: Complex sentences are found by identifying sentences that contain either
                    317: a subordinate conjunction or a clause beginning with words like ``that''
                    318: or ``who''.
                    319: The preceding sentence has such a clause.
                    320: .IP 3.
                    321: A compound sentence has more than one verb and no dependent
                    322: clause.
                    323: Sentences joined by ``;'' are also counted as compound.
                    324: .IP 4.
                    325: A compound-complex sentence has either several dependent clauses
                    326: or one dependent clause and a compound verb in either
                    327: the dependent or independent clause.
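The four definitions above reduce to a decision on two counts per sentence: verb phrases and dependent clauses. The sketch below assumes both counts are already available (STYLE gets them from PARTS); the function name sentence_type is hypothetical and the edge cases are a simplification.

```python
def sentence_type(n_verbs, n_dependent_clauses):
    """Classify a sentence from its verb-phrase and dependent-clause counts."""
    if n_dependent_clauses == 0:
        # No subordination: one verb is simple, several are compound.
        return "simple" if n_verbs <= 1 else "compound"
    if n_dependent_clauses == 1 and n_verbs <= 2:
        # One independent and one dependent clause, one verb each.
        return "complex"
    # Several dependent clauses, or one plus a compound verb.
    return "compound-complex"


print(sentence_type(1, 0), sentence_type(3, 1))
```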
                    328: .PP
                    329: Even using these broader definitions, simple
                    330: sentences dominate many of the technical documents that
                    331: have been tested,
                    332: but the example in Figure 1 shows variety in both sentence structure and
                    333: sentence length.
                    334: .NH 2
                    335: Word Usage
                    336: .PP
                    337: The word usage measures are an attempt to identify
                    338: some other constructional features of writing style.
                    339: There are many different ways in English to
                    340: say the same thing.
                    341: The constructions differ from one another
                    342: in the form of the words used.
                    343: The following sentences all convey approximately the
                    344: same meaning but differ in word usage:
                    345: .DS
                    346: The cxio program is used to perform all communication between the systems.
                    347: The cxio program performs all communications between the systems.
                    348: The cxio program is used to communicate between the systems.
                    349: The cxio program communicates between the systems.
                    350: All communication between the systems is performed by the cxio program.
                    351: .DE
                    352: The distribution of the parts of speech and verb constructions
                    353: helps identify overuse of particular constructions.
                    354: Although the measures used by STYLE are crude, they do point out
                    355: problem areas.
                    356: For each category, STYLE reports a percentage and a raw count.
                    357: In addition to looking at the percentage, the user
                    358: may find it useful to compare the raw count with the number of sentences.
                    359: If, for example, the number of infinitives is almost equal to the number
                    360: of sentences, then many of the sentences in the document are constructed
                    361: like the first and third in the preceding example.
                    362: The user may want to transform some of these sentences into another form.
                    363: Some of the implications of the word usage measures are discussed below.
                    364: .IP "\fIVerbs\fR "
                    365: are measured in several different ways to
                    366: try to determine what types of verb constructions are
                    367: most frequent in the document.
                    368: Technical writing tends to contain many
                    369: passive verb constructions and other uses of the verb ``to be''.
                    370: The category of verbs labeled ``tobe'' measures both passives and sentences of
                    371: the form:
                    372: .DS
                    373: .I
                    374: subject tobe predicate
                    375: .R
                    376: .DE
                    377: In counting verbs, whole verb phrases are counted as one verb.
                    378: Verb phrases containing auxiliary verbs are counted in the category
                    379: ``aux''.
                    380: The verb phrases counted here are those whose tense is not
                    381: simple present or simple past.
                    382: It might eventually be useful to do more detailed measures
                    383: of verb tense or mood.
                    384: Infinitives are listed as ``inf''.
                    385: The percentages reported for these three categories are based on
                    386: the total number of verb phrases found.
                    387: These categories are not mutually exclusive;
                    388: they cannot be added, since, for example,
                    389: ``to be going'' counts as both ``tobe'' and ``inf''.
                    390: Use of these three types of verb constructions varies significantly among authors.
                    391: .sp 2
                    392: STYLE reports passive verbs as a percentage of the finite verbs in the
                    393: document.
                    394: Most style books warn against the overuse of passive verbs.
                    395: Coleman [11] has shown that sentences with
                    396: active verbs are easier to learn than those
                    397: with passive verbs.
                    398: Although the inverted object-subject order of the passive
                    399: voice seems to emphasize the object, Coleman's experiments
                    400: showed that there is little difference in retention
                    401: by word position. He also showed that the direct object of an active verb
                    402: is retained better than the subject of a passive verb.
                    403: These experiments support the advice of the style books suggesting
                    404: that writers should try to use active verbs wherever possible.
                    405: The flag ``\-p'' causes STYLE to print all sentences containing passive verbs.
                    406: .PP
                    407: .IP "\fIPronouns\fR "
                    408: add cohesiveness and connectivity to a document
                    409: by providing back-reference.
                    410: They are often a short-hand notation for something
                    411: previously mentioned, and therefore connect the sentence containing the pronoun with the
                    412: word to which the pronoun refers.
                    413: Although there are other mechanisms for such connections, documents
                    414: with no pronouns tend to be wordy and to have little connectivity.
                    415: .IP "\fIAdverbs\fR "
                    416: can provide transition between sentences and order
                    417: in time and space.
                    418: In performing these functions, adverbs, like pronouns, provide
                    419: connectivity and cohesiveness.
                    420: .IP "\fIConjunctions\fR "
                    421: provide parallelism in a document by connecting two or more
                    422: equal units.
                    423: These units may be whole sentences, verb phrases, nouns, adjectives, or
                    424: prepositional phrases.
                    425: The compound and compound-complex sentences reported under
                    426: sentence type are parallel structures.
                    427: Other uses of parallel structures are indicated by the degree that the
                    428: number of conjunctions reported under word usage exceeds the
                    429: compound sentence measures.
                    430: .IP "\fINouns and Adjectives.\fR "
                    431: A ratio of nouns to adjectives near unity may indicate the over-use of modifiers.
                    432: Some technical writers qualify every noun with one or more
                    433: adjectives.
                    434: Qualifiers in phrases like ``simple linear single-link network model''
                    435: often lend more obscurity than precision to a text.
                    436: .IP "\fINominalizations\fR "
                    437: are verbs that are changed to nouns by adding one of the suffixes
                    438: ``ment'', ``ance'', ``ence'', or ``ion''.
                    439: Examples are accomplishment, admittance, adherence, and abbreviation.
                    440: When a writer transforms a nominalized sentence to a non-nominalized
                    441: sentence, she/he increases the effectiveness of the sentence in
                    442: several ways.
                    443: The noun becomes an active verb and frequently one complicated clause
                    444: becomes two shorter clauses.
                    445: For example,
                    446: .DS
                    447: Their inclusion of this provision is admission of the importance of the system.
                    448: When they included this provision, they admitted the importance of the system.
                    449: .DE
                    450: Coleman found that the transformed sentences were easier to
                    451: learn, even when the transformation produced sentences that were
                    452: slightly longer, provided the transformation broke one clause into two.
                    453: Writers who find their document contains many
                    454: nominalizations may want to transform some of the sentences 
                    455: to use active verbs.
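Several of the word usage measures can be reproduced from the raw counts in Figure 1. In the sketch below all names are hypothetical. The total of 829 verb phrases is an assumption: Figure 1 does not print it, and the value was chosen because it is consistent with the printed percentages. The suffix test is only a rough stand-in for the way STYLE finds nominalizations.

```python
NOMINAL_SUFFIXES = ("ment", "ance", "ence", "ion")


def looks_nominalized(word):
    """Suffix test only; it would also accept non-derived words such as 'lion'."""
    return word.lower().endswith(NOMINAL_SUFFIXES)


def percent(count, total):
    """Whole-number percentage, as printed in the STYLE summary."""
    return round(100 * count / total)


# Raw counts printed in Figure 1.
tobe, aux, inf, passives = 373, 133, 114, 144
nouns, adjectives = 1983, 1388
total_verbs = 829  # assumed, not printed in Figure 1

verb_report = {
    "tobe": percent(tobe, total_verbs),
    "aux": percent(aux, total_verbs),
    "inf": percent(inf, total_verbs),
    # Passives are reported against non-infinitive verbs only.
    "passive": percent(passives, total_verbs - inf),
}
noun_adj_ratio = nouns / adjectives
print(verb_report, round(noun_adj_ratio, 2))
```

The first three percentages share one denominator but overlap in membership, so they need not sum to 100. The passive figure uses a smaller denominator because infinitives are excluded.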
                    456: .NH 2
                    457: Sentence openers
                    458: .PP
                    459: Another agreed-upon principle of style is variety in sentence openers.
                    460: Because STYLE determines the type of sentence opener by
                    461: looking at the part of speech of the first word in the sentence,
                    462: the sentences counted under the heading ``subject opener'' may not
                    463: all really begin with the subject.
                    464: However, a large percentage of sentences in this category
                    465: still indicates lack of variety in sentence openers.
                    466: Other sentence opener measures help the user determine
                    467: if there are transitions between sentences and where
                    468: the subordination occurs.
                    469: Adverbs and conjunctions at the beginning of sentences are mechanisms for
                    470: transition between sentences.
                    471: A pronoun at the beginning shows a link to something previously mentioned
                    472: and indicates connectivity.
                    473: .PP
                    474: The location of subordination can be determined by comparing
                    475: the number of sentences that begin with a subordinator with
                    476: the number of sentences with complex clauses.
                    477: If few sentences start with subordinate conjunctions then
                    478: the subordination is embedded or at the end of the complex sentences.
                    479: For variety the writer may want to transform some sentences
                    480: to have leading subordination.
                    481: .PP
                    482: The last category of openers, expletives, is commonly
                    483: overworked in technical writing.
                    484: Expletives are the words ``it'' and ``there'', usually with the verb ``to be'',
                    485: in constructions where the subject follows the verb.
                    486: For example,
                    487: .DS
                    488: There are three streets used by the traffic.
                    489: There are too many users on this system.
                    490: .DE
                    491: This construction tends to emphasize the object rather than the
                    492: subject of the sentence.
                    493: The flag ``\-e'' will cause STYLE to print all
                    494: sentences that begin with an expletive.
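.PP
As a rough illustration of the expletive category, the following Python sketch flags sentences that open with ``it'' or ``there'' followed by a form of ``to be''.
It is not STYLE's method, which works from the part-of-speech assignments; the word list TO_BE and the function name begins_with_expletive are assumptions made for this example.
A sentence such as ``It is broken'', where ``it'' is an ordinary pronoun, would be flagged incorrectly by so simple a test.

```python
import re

# Assumed list of "to be" forms; STYLE itself relies on part-of-speech tags.
TO_BE = {"am", "is", "are", "was", "were", "be", "been", "being"}
EXPLETIVES = {"it", "there"}

def begins_with_expletive(sentence):
    """Return True if the sentence opens with "it"/"there" + a form of "to be"."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return len(words) >= 2 and words[0] in EXPLETIVES and words[1] in TO_BE

sentences = [
    "There are three streets used by the traffic.",
    "There are too many users on this system.",
    "The traffic uses three streets.",
]

# Collect the sentences that would be printed by a flag like \-e.
flagged = [s for s in sentences if begins_with_expletive(s)]
for s in flagged:
    print(s)
```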
                    495: .NH 1
                    496: DICTION
                    497: .PP
                    498: The program DICTION prints all sentences in a document containing
                    499: phrases that are either frequently misused or indicate wordiness.
                    500: The program, an extension of Aho's FGREP [12] string
                    501: matching program,
                    502: takes as input a file of phrases or patterns to be matched and a file
                    503: of text to be searched.
                    504: A data base of about 450 phrases has been compiled as a default
                    505: pattern file for DICTION.
                    506: Before attempting to locate phrases, the program maps
                    507: upper case letters to lower case and substitutes blanks for
                    508: punctuation.
                    509: Sentence boundaries were deemed less critical in DICTION than
                    510: in STYLE, so abbreviations and other uses of the character
                    511: ``.'' are not treated specially.
                    512: DICTION brackets all pattern matches in a sentence with the characters
                    513: ``['' and ``]''.
                    514: Although many of the phrases in the default data base are correct
                    515: in some contexts, in others they indicate wordiness.
                    516: Some examples of the phrases and suggested alternatives are:
                    517: .DS
                    518: .TS
                    519: cc
                    520: ll.
                    521: Phrase Alternative
                    522: a large number of      many
                    523: arrive at a decision   decide
                    524: collect together       collect
                    525: for this reason        so
                    526: pertaining to  about
                    527: through the use of     by or with
                    528: utilize        use
                    529: with the exception of  except
                    530: .TE
                    531: .DE
                    532: Appendix 2 contains a complete listing of the default file.
                    533: Some of the entries are short forms of problem phrases.
                    534: For example, the phrase ``the fact'' is found in all of the following
                    535: and is sufficient to point out the wordiness to the user:
                    536: .DS
                    537: .TS
                    538: cc
                    539: ll.
                    540: Phrase Alternative
                    541: accounted for by the fact that caused by
                    542: an example of this is the fact that    thus
                    543: based on the fact that because
                    544: despite the fact that  although
                    545: due to the fact that   because
                    546: in light of the fact that      because
                    547: in view of the fact that       since
                    548: notwithstanding the fact that  although
                    549: .TE
                    550: .DE
                    551: Entries in Appendix 2 preceded by ``~'' are not matched.
                    552: See Section 7 for details on the use of ``~''.
                    553: .PP
                    554: The user may supply her/his own pattern file with the flag ``\-f patfile''.
                    555: In this case the default file will be loaded first, followed by the user file.
                    556: This mechanism allows users to suppress
                    557: patterns contained in the default file or to include their own pet peeves that are not in the default file.
                    558: The flag ``\-n'' will exclude the default file altogether.
                    559: In constructing a pattern file, blanks should be used before and after each
                    560: phrase to avoid matching substrings in words.
                    561: For example, to find all occurrences of the word ``the'', the pattern
                    562: `` the '' should be used.
                    563: The blanks cause only the word ``the'' to be matched and not the
                    564: string ``the'' in words like there, other, and therefore.
                    565: One side effect of surrounding the words with blanks is that
                    566: when two phrases occur without intervening words, only the
                    567: first will be matched.
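.PP
The preprocessing and the blank-padded patterns described above can be sketched in Python as follows.
This is a minimal sketch, not the DICTION source: the function names normalize and bracket_matches and the simple left-to-right scan are assumptions standing in for the FGREP matching machinery.
Because each match consumes its trailing blank, the sketch also shows the side effect just mentioned: a second phrase that follows immediately is not found.

```python
import string

def normalize(sentence):
    """Map upper case to lower case and substitute blanks for punctuation."""
    table = str.maketrans({ch: " " for ch in string.punctuation})
    # Pad both ends so that phrases at the sentence edges can still match.
    return " " + sentence.lower().translate(table) + " "

def bracket_matches(sentence, patterns):
    """Return the normalized sentence with every pattern match bracketed."""
    text = normalize(sentence)
    out = []
    i = 0
    while i < len(text):
        # Patterns that match starting at position i; prefer the longest.
        hits = [p for p in patterns if text.startswith(p, i)]
        if hits:
            p = max(hits, key=len)
            out.append(" [" + p.strip() + "] ")
            i += len(p)  # the trailing blank of the pattern is consumed here
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

# Hypothetical two-entry pattern file; note the blanks around each phrase.
patterns = [" utilize ", " the "]
print(bracket_matches("Use the tool, not there.", patterns))
print(bracket_matches("We utilize the tool.", patterns))
```

In the second call ``the'' follows ``utilize'' with no intervening word, so only ``utilize'' is bracketed.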
                    568: .NH 1
                    569: EXPLAIN
                    570: .PP
                    571: The last program, EXPLAIN, is an interactive thesaurus for
                    572: phrases found by DICTION.
                    573: The user types one of the phrases bracketed by DICTION
                    574: and EXPLAIN responds with suggested substitutions for the phrase
                    575: that will improve the diction of the document.
                    576: .KF
                    577: .DS C
                    578: Table 1
                    579: Text Statistics on 20 Technical Documents
                    580: .TS
                    581: cccccc
                    582: llnnnn.
                    583:        variable        minimum maximum mean    standard deviation
                    584: _
                    585: Readability    Kincaid 9.5     16.9    13.3    2.2
                    586:        automated       9.0     17.4    13.3    2.5
                    587:        Coleman-Liau    10.0    16.0    12.7    1.8
                    588:        Flesch  8.9     17.0    14.4    2.2
                    589: _
                    590: sentence info. av sent length  15.5    30.3    21.6    4.0
                    591:        av word length  4.61    5.63    5.08    .29
                    592:        av nonfunction length   5.72    7.30    6.52    .45
                    593:        short sent      23%     46%     33%     5.9
                    594:        long sent       7%      20%     14%     2.9
                    595: _
                    596: sentence types simple  31%     71%     49%     11.4
                    597:        complex 19%     50%     33%     8.3
                    598:        compound        2%      14%     7%      3.3
                    599:        compound-complex        2%      19%     10%     4.8
                    600: _
                    601: verb types     tobe    26%     64%     44.7%   10.3
                    602:        auxiliary       10%     40%     21%     8.7
                    603:        infinitives     8%      24%     15.1%   4.8
                    604:        passives        12%     50%     29%     9.3
                    605: _
                    606: word usage     prepositions    10.1%   15.0%   12.3%   1.6
                    607:        conjunction     1.8%    4.8%    3.4%    .9
                    608:        adverbs 1.2%    5.0%    3.4%    1.0
                    609:        nouns   23.6%   31.6%   27.8%   1.7
                    610:        adjectives      15.4%   27.1%   21.1%   3.4
                    611:        pronouns        1.2%    8.4%    2.5%    1.1
                    612:        nominalizations 2%      5%      3.3%    .8
                    613: _
                    614: sentence openers       prepositions    6%      19%     12%     3.4
                    615:        adverbs 0%      20%     9%      4.6
                    616:        subject 56%     85%     70%     8.0
                    617:        verbs   0%      4%      1%      1.0
                    618:        subordinating conj      1%      12%     5%      2.7
                    619:        conjunctions    0%      4%      0%      1.5
                    620:        expletives      0%      6%      2%      1.7
                    621: .TE
                    622: .DE
                    623: .KE
                    624: .NH 1
                    625: Results
                    626: .NH 2
                    627: STYLE
                    628: .PP
                    629: To get baseline statistics and check the program's accuracy,
                    630: we ran STYLE on 20 technical documents.
                    631: There were a total of 3287 sentences in the sample.
                    632: The shortest document was 67 sentences long; the longest 339 sentences.
                    633: The documents covered a wide range of subject matter, including
                    634: theoretical computing, physics, psychology, engineering, and
                    635: affirmative action.
                    636: Table 1 gives the range, mean, and standard deviation of the various style measures.
                    637: As you will note, most of the measurements have a fairly wide range of values
                    638: across the sample documents.
                    639: .PP
                    640: As a comparison, Table 2 gives the median results
                    641: for two different technical authors, a sample of instructional material, and a sample of the
                    642: Federalist Papers.
                    643: The two authors show similar styles, although author 2
                    644: uses somewhat shorter sentences and longer words than author 1.
                    645: Author 1 uses all types of sentences, while author 2 prefers
                    646: simple and complex sentences, using few compound or compound-complex sentences.
                    647: The other major difference in the styles of these authors is the location
                    648: of subordination.
                    649: Author 1 seems to prefer embedded or trailing subordination, while
                    650: author 2 begins many sentences with the subordinate clause.
                    651: The documents tested for both authors 1 and 2 were technical documents,
                    652: written for a technical audience.
                    653: The instructional documents, which are written for craftspeople,
                    654: vary surprisingly little from the two technical samples.
                    655: The sentences and words are a little longer,
                    656: and they contain many passive and auxiliary verbs, few adverbs, and almost
                    657: no pronouns.
                    658: The instructional documents contain many imperative sentences, so there are
                    659: many sentences with verb openers.
                    660: The sample of Federalist Papers contrasts with the other
                    661: samples in almost every way.
                    662: .KF
                    663: .DS C
                    664: Table 2
                    665: Text Statistics on Single Authors
                    666: .TS
                    667: cccccc
                    668: llnnnn.
                    669:        variable        author 1        author 2        inst.   FED
                    670: _
                    671: readability    Kincaid 11.0    10.3    10.8    16.3
                    672:        automated       11.0    10.3    11.9    17.8
                    673:        Coleman-Liau    9.3     10.1    10.2    12.3
                    674:        Flesch  10.3    10.7    10.1    15.0
                    675: _
                    676: sentence info  av sent length  22.64   19.61   22.78   31.85
                    677:        av word length  4.47    4.66    4.65    4.95
                    678:        av nonfunction length   5.64    5.92    6.04    6.87
                    679:        short sent      35%     43%     35%     40%
                    680:        long sent       18%     15%     16%     21%
                    681: _
                    682: sentence types simple  36%     43%     40%     31%
                    683:        complex 34%     41%     37%     34%
                    684:        compound        13%     7%      4%      10%
                    685:        compound-complex        16%     8%      14%     25%
                    686: _
                    687: verb type      tobe    42%     43%     45%     37%
                    688:        auxiliary       17%     19%     32%     32%
                    689:        infinitives     17%     15%     12%     21%
                    690:        passives        20%     19%     36%     20%
                    691: _
                    692: word usage     prepositions    10.0%   10.8%   12.3%   15.9%
                    693:        conjunctions    3.2%    2.4%    3.9%    3.4%
                    694:        adverbs 5.05%   4.6%    3.5%    3.7%
                    695:        nouns   27.7%   26.5%   29.1%   24.9%
                    696:        adjectives      17.0%   19.0%   15.4%   12.4%
                    697:        pronouns        5.3%    4.3%    2.1%    6.5%
                    698:        nominalizations 1%      2%      2%      3%
                    699: _
                    700: sentence openers       prepositions    11%     14%     6%      5%
                    701:        adverbs 9%      9%      6%      4%
                    702:        subject 65%     59%     54%     66%
                    703:        verb    3%      2%      14%     2%
                    704:        subordinating conj      8%      14%     11%     3%
                    705:        conjunction     1%      0%      0%      3%
                    706:        expletives      3%      3%      0%      3%
                    707: .TE
                    708: .DE
                    709: .KE
                    710: .NH 2
                    711: DICTION
                    712: .PP
                    713: In the few weeks that DICTION has been available
                    714: to users,
                    715: about 35,000 sentences have been run with about
                    716: 5,000 string matches.
                    717: The authors using the program seem to make
                    718: the suggested changes about 50-75% of the time.
                    719: To date, almost 200 of the 450 strings in the default
                    720: file have been matched.
                    721: Although most of these phrases are valid and correct
                    722: in some contexts, the 50-75% change rate seems to
                    723: show that the phrases are used much more often than
                    724: concise diction warrants.
                    725: .NH 1
                    726: Accuracy
                    727: .NH 2
                    728: Sentence Identification
                    729: .PP
                    730: The correctness of the STYLE output on the 20 document sample was checked
                    731: in detail.
                    732: STYLE misidentified
                    733: 129 sentence fragments as sentences
                    734: and incorrectly joined two or more sentences 75 times
                    735: in the 3287 sentence sample.
                    736: The problems were usually caused by nonstandard formatting
                    737: commands, unknown abbreviations, or lists of non-sentences.
                    738: An impossibly long sentence found as the longest sentence in
                    739: the document usually is the result of a long list
                    740: of non-sentences.
                    741: .NH 2
                    742: Sentence Types
                    743: .PP
                    744: STYLE correctly identified sentence type on 86.5% of
                    745: the sentences in the sample.
                    746: The type distribution of the sentences was
                    747: 52.5% simple, 29.9% complex, 8.5% compound and
                    748: 9% compound-complex.
                    749: The program reported 49.5% simple, 31.9% complex,
                    750: 8% compound and 10.4% compound-complex.
                    751: Looking at the errors on the individual
                    752: documents, the number of simple sentences was
                    753: under-reported by about 4% and the complex and compound-complex
                    754: were over-reported by 3% and 2%, respectively.
                    755: The following matrix shows the program's output
                    756: vs. the actual sentence type.
                    757: .DS C
                    758: .TS
                    759: csssss
                    760: cccccc
                    761: clnnnn.
                    762: Program Results
                    763:                simple  complex compound        comp-complex
                    764: Actual simple  1566    132     49      17
                    765: Sentence       complex 47      892     6       65
                    766: Type   compound        40      6       207     23
                    767:        comp-complex    0       52      5       249
                    768: .TE
                    769: .DE
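.PP
The summary percentages can be recomputed from the matrix with a few lines of Python; the variable names below are chosen for this sketch only.
The entries as printed sum to 3356 rather than the 3287 sentences reported for the sample, so the figures computed this way need not agree exactly with those quoted in the text.

```python
# Rows: actual sentence type; columns: type reported by the program.
TYPES = ["simple", "complex", "compound", "comp-complex"]
MATRIX = [
    [1566, 132, 49, 17],
    [47, 892, 6, 65],
    [40, 6, 207, 23],
    [0, 52, 5, 249],
]

total = sum(sum(row) for row in MATRIX)

# Sentences on the diagonal are those whose type was reported correctly.
correct = sum(MATRIX[i][i] for i in range(len(TYPES)))
agreement = 100.0 * correct / total

# Row sums give the actual distribution, column sums the reported one.
actual_pct = [100.0 * sum(row) / total for row in MATRIX]
reported_pct = [
    100.0 * sum(row[j] for row in MATRIX) / total for j in range(len(TYPES))
]

print(f"agreement: {agreement:.1f}%")
for name, a, r in zip(TYPES, actual_pct, reported_pct):
    print(f"{name:13s} actual {a:5.1f}%  reported {r:5.1f}%")
```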
                    770: .PP
                    771: The system's inability to find imperative sentences seems to
                    772: have little effect on most of the style statistics.
                    773: A document with half of its sentences imperative was run, with and
                    774: without the imperative end marker.
                    775: The results were identical except for the expected errors of not finding
                    776: verbs as sentence openers, not counting the imperative sentences,
                    777: and a slight difference (1%) in the number of nouns
                    778: and adjectives reported.
                    779: .NH 2
                    780: Word Usage
                    781: .PP
                    782: The accuracy of identifying word types reflects
                    783: that of PARTS, which is about 95% correct.
                    784: The largest source of confusion is between nouns and
                    785: adjectives.
                    786: The verb counts were checked on about 20 sentences from each
                    787: document and found to be about 98% correct.
                    788: .NH 1
                    789: Technical Details
                    790: .NH 2
                    791: Finding Sentences
                    792: .PP
                    793: The formatting commands embedded in the text increase the difficulty
                    794: of finding sentences.
                    795: Not all text in a document is in sentence form; there are headings,
                    796: tables, equations and lists, for example.
                    797: Headings like ``Finding Sentences'' above should be discarded, not
                    798: attached to the next sentence.
                    799: However, since many of the documents are formatted to be phototypeset,
                    800: and contain font changes, which usually operate on the
                    801: most important words in the document,
                    802: discarding all formatting commands is not correct.
                    803: To improve the programs' ability to find sentence boundaries, the deformatting program, DEROFF [13],
                    804: has been given some knowledge of the formatting packages used on the
                    805: .UX
                    806: operating system.
                    807: DEROFF will now do the following:
                    808: .IP 1.
                    809: Suppress all formatting macros that
                    810: are used for titles, headings, author's name, etc.
                    811: .IP 2.
                    812: Suppress the arguments to the macros for titles, headings, author's name, etc.
                    813: .IP 3.
                    814: Suppress displays, tables, footnotes and text that is centered or in no-fill mode.
                    815: .IP 4.
                    816: Substitute a place holder for equations and check
                    817: for hidden end markers.
                    818: The place holder is necessary because many typists and authors use
                    819: the equation setter to change fonts on important words.
                    820: For this reason, header files containing the definition of
                    821: the EQN delimiters must also be included as input to STYLE.
                    822: End markers are often hidden when an equation ends a sentence
                    823: and the period is typed
                    824: inside the EQN delimiters.
                    825: .IP 5.
                    826: Add a "." after lists.
                    827: If the flag \-ml is also used, all lists are suppressed.
                    828: This is a separate flag because of the variety of ways the
                    829: list macros are used.
                    830: Often, lists are sentences that should be included in the analysis.
                    831: The user must determine how lists are used in the document to be analyzed.
                    832: .PP
                    833: Both STYLE and DICTION call DEROFF before they look at the text.
                    834: The user should supply the \-ml flag if the document contains
                    835: many lists of non-sentences that should be skipped.
                    836: .NH 2
                    837: Details of DICTION
                    838: .PP
                    839: The program DICTION is based on the string matching program FGREP.
                    840: FGREP takes as input a file of patterns to be matched and a file
                    841: to be searched and outputs each line that contains
                    842: any of the patterns
                    843: with no indication of which pattern was matched.
                    844: The following changes have been added to FGREP:
                    845: .IP 1.
                    846: The basic unit that DICTION operates on is a sentence rather than a line.
                    847: Each sentence that contains one of the patterns is output.
                    848: .IP 2.
                    849: Upper case letters are mapped to lower case.
                    850: .IP 3.
                    851: Punctuation is replaced by blanks.
                    852: .IP 4.
                    853: All pattern matches in the sentence are found and surrounded with
                    854: ``['' and ``]''.
                    855: .IP 5.
                    856: A method for suppressing a string match has been added.
                    857: Any pattern that begins with ``~'' will not be matched.
                    858: Because the matching algorithm finds the longest
                    859: substring, the suppression of a match allows words in some
                    860: correct contexts not to be matched while allowing
                    861: the word in another context to be found.
                    862: For example, the word ``which'' is often incorrectly used
                    863: instead of ``that'' in restrictive clauses.
                    864: However, ``which'' is usually correct when preceded by a preposition
                    865: or ``,''.
                    866: The default pattern file suppresses the match
                    867: of the common prepositions or a double
                    868: blank followed by ``which'' and therefore matches only
                    869: the suspect uses.
                    870: The double blank accounts for the replaced comma.
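.PP
The suppression mechanism can be sketched in Python under some stated assumptions.
The layout of a pattern line (``~'' immediately followed by the blank-padded phrase), the function names load_patterns and find_matches, and the left-to-right scan are all hypothetical; they are not taken from the DICTION source.
The input text is assumed to be already mapped to lower case, with punctuation replaced by blanks.

```python
def load_patterns(lines):
    """Split pattern lines into (pattern, suppressed) pairs; '~' marks suppression."""
    pairs = []
    for line in lines:
        if line.startswith("~"):
            pairs.append((line[1:], True))
        else:
            pairs.append((line, False))
    return pairs

def find_matches(text, pairs):
    """Scan left to right, taking the longest pattern that fits at each position.

    A suppressed pattern consumes its text but is not reported, which keeps
    the shorter pattern inside it from being matched.
    """
    found = []
    i = 0
    while i < len(text):
        hits = [(p, s) for p, s in pairs if text.startswith(p, i)]
        if hits:
            pattern, suppressed = max(hits, key=lambda h: len(h[0]))
            if not suppressed:
                found.append(pattern.strip())
            i += len(pattern)
        else:
            i += 1
    return found

# Hypothetical excerpt of a pattern file: flag "which", except after the
# preposition "of" or after a double blank (a replaced comma).
pairs = load_patterns([" which ", "~ of which ", "~  which "])

print(find_matches(" the file which we read ", pairs))     # suspect use
print(find_matches(" the file of which we read ", pairs))  # after a preposition
print(find_matches(" the file  which we read ", pairs))    # after a replaced comma
```

Only the first of the three sentences reports a match; in the other two the longer suppressed pattern takes precedence.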
                    871: .NH 1
                    872: Conclusions
                    873: .PP
                    874: A system of writing tools that measure some of the
                    875: objective characteristics of writing style has been developed.
                    876: The tools are sufficiently general that they may be applied to
                    877: documents on any subject with equal accuracy.
                    878: Although the measurements are only of the surface
                    879: structure of the text, they do point out problem areas.
                    880: In addition to helping writers produce better documents,
                    881: these programs may be useful for studying
                    882: the writing process and finding other formulae for measuring
                    883: readability.