Annotation of 43BSDReno/share/doc/usd/32.diction/rm1, revision 1.1

1.1     ! root        1: .\"    @(#)rm1 6.1 (Berkeley) 5/22/86
        !             2: .\"
        !             3: .EQ
        !             4: delim $$
        !             5: .EN
        !             6: .NH 1
        !             7: Introduction
        !             8: .PP
        !             9: Computers have become important
        !            10: in the document preparation process, with programs
        !            11: to check for spelling errors and to format documents.
        !            12: As the amount of text stored on line increases, it becomes
        !            13: feasible and attractive to study writing
        !            14: style and to attempt to help the writer in producing readable
        !            15: documents.
        !            16: The system of writing tools described here is a first step toward such help.
        !            17: The system includes programs and a data base to
        !            18: analyze writing style at the word and sentence level.
        !            19: We use the term ``style'' in this paper to describe the
        !            20: results of a writer's particular choices among individual words and
        !            21: sentence forms.
        !            22: Although many judgements of style are subjective,
        !            23: particularly those of word choice,
        !            24: there are some objective measures that experts
        !            25: agree lead to good style.
        !            26: Three programs have been written to measure some of
        !            27: the objectively definable characteristics of writing style
        !            28: and to identify some commonly misused or unnecessary phrases.
        !            29: Although a document that conforms to the stylistic rules
        !            30: is not guaranteed to be coherent and readable, one that
        !            31: violates all of the rules is likely to be
        !            32: difficult or tedious to read.
        !            33: The program STYLE calculates readability, sentence length variability,
        !            34: sentence type, word usage and sentence openers at a rate of about 400 words per second
        !            35: on a PDP11/70 running the
        !            36: .UX
        !            37: Operating System.
        !            38: It assumes that the sentences are well-formed, i.e., that
        !            39: each sentence has a verb and that the subject and verb agree in number.
        !            40: DICTION identifies phrases that are either bad usage or unnecessarily wordy.
        !            41: EXPLAIN acts as a thesaurus for the phrases found by DICTION.
        !            42: Sections 2, 3, and 4 describe the programs; Section 5 gives the results
        !            43: on a cross-section of technical documents; Section 6 discusses
        !            44: accuracy and problems; Section 7 gives implementation details.
        !            45: .NH 1
        !            46: STYLE
        !            47: .PP
        !            48: The program STYLE reads a document and prints a summary of
        !            49: readability indices, sentence length and type, word usage,
        !            50: and sentence openers.
        !            51: It may also be used to locate all sentences in a document
        !            52: longer than a given length, of readability index higher than a given
        !            53: number, those containing a passive verb, or those beginning with an expletive.
        !            54: STYLE
        !            55: is based on the system for finding English word classes or parts of speech, PARTS [1].
        !            56: PARTS is a set of programs that uses a small dictionary (about 350 words)
        !            57: and suffix rules to partially assign word classes to
        !            58: English text.
        !            59: It then uses experimentally derived rules of word order to assign
        !            60: word classes to all words in the text with an accuracy of about 95%.
        !            61: Because PARTS uses only a small dictionary and general rules, it works
        !            62: on text about any subject, from physics to psychology.
        !            63: Style measures have been built into the output phase
        !            64: of the programs that make up PARTS.
        !            65: Some of the measures are simple counters of the word classes
        !            66: found by PARTS; many are more complicated.
        !            67: For example, the verb count is the total number of verb phrases.
        !            68: This includes phrases like:
        !            69: .DS
        !            70: has been going
        !            71: was only going
        !            72: to go
        !            73: .DE
        !            74: each of which counts as one verb.
        !            75: Figure 1 shows the output of STYLE run on a paper by Kernighan and Mashey
        !            76: about the
        !            77: .UX
        !            78: programming environment [2].
        !            79: .KF
        !            80: .sp 2
        !            81: .TS
        !            82: box;
        !            83: l1l.
        !            84: programming environment
        !            85: readability grades:
        !            86:        (Kincaid) 12.3  (auto) 12.8  (Coleman-Liau) 11.8  (Flesch) 13.5 (46.3)
        !            87: sentence info:
        !            88:        no. sent 335 no. wds 7419
        !            89:        av sent leng 22.1 av word leng 4.91
        !            90:        no. questions 0 no. imperatives 0
        !            91:        no. nonfunc wds 4362  58.8%   av leng 6.38
        !            92:        short sent (<17) 35% (118) long sent (>32)  16% (55)
        !            93:        longest sent 82 wds at sent 174; shortest sent 1 wds at sent 117
        !            94: sentence types:
        !            95:        simple  34% (114) complex  32% (108)
        !            96:        compound  12% (41) compound-complex  21% (72)
        !            97: word usage:
        !            98:        verb types as % of total verbs
        !            99:        tobe  45% (373) aux  16% (133) inf  14% (114)
        !           100:        passives as % of non-inf verbs  20% (144)
        !           101:        types as % of total
        !           102:        prep 10.8% (804) conj 3.5% (262) adv 4.8% (354)
        !           103:        noun 26.7% (1983) adj 18.7% (1388) pron 5.3% (393)
        !           104:        nominalizations   2 % (155)
        !           105: sentence beginnings:
        !           106:        subject opener: noun (63) pron (43) pos (0) adj (58) art (62) tot  67%
        !           107:        prep  12% (39) adv   9% (31) 
        !           108:        verb   0% (1)  sub_conj   6% (20) conj   1% (5)
        !           109:        expletives   4% (13)
        !           110: .TE
        !           111: .sp
        !           112: .ce
        !           113: Figure 1
        !           114: .sp
        !           115: .KE
        !           116: As the example shows, STYLE output is in five parts.
        !           117: After a brief discussion of sentences, we will describe the parts in order.
        !           118: .NH 2
        !           119: What is a sentence?
        !           120: .PP
        !           121: Readers of documents have little
        !           122: trouble deciding where the sentences end.
        !           123: People don't even have to stop and think about uses of the
        !           124: character ``.'' in constructions like
        !           125: 1.25, A. J. Jones, Ph.D., i. e., or etc. .
        !           126: When a computer reads a document,
        !           127: finding the end of sentences is not as easy.
        !           128: First we must throw away the printer's marks and formatting
        !           129: commands that litter the text in computer form.
        !           130: Then STYLE
        !           131: defines a sentence
        !           132: as a string of words ending in one of:
        !           133: .DS
        !           134:  . ! ? /.
        !           135: .DE
        !           136: The end marker ``/.'' may be used to indicate an imperative sentence.
        !           137: Imperative sentences that are not so marked are not identified as imperative.
        !           138: STYLE properly handles numbers with embedded decimal points and commas,
        !           139: strings of letters and numbers with embedded decimal points used for
        !           140: computer file names, and
        !           141: the common
        !           142: abbreviations listed in Appendix 1.
        !           143: Numbers that end sentences, like the preceding sentence, cause
        !           144: a sentence break if the next word begins with a capital letter.
        !           145: Initials only cause a sentence break if the next word begins with
        !           146: a capital and is found in the dictionary of function words used by PARTS.
        !           147: So the string
        !           148: .DS
        !           149: J. D. JONES
        !           150: .DE
        !           151: does not cause a break, but the string
        !           152: .DS
        !           153:  ... system H.  The ...
        !           154: .DE
        !           155: does.
        !           156: With these rules most sentences are broken at the proper place,
        !           157: although occasionally
        !           158: either two sentences are called one or a fragment is called
        !           159: a sentence.
        !           160: More on this later.
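The rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STYLE implementation: the two word sets are tiny hypothetical stand-ins for the abbreviation list of Appendix 1 and for the PARTS function-word dictionary, the names ends_sentence and split_sentences are invented here, and the sketch assumes that an abbreviation never ends a sentence. Numbers and file names with embedded periods survive because the text is split on white space only.

```python
import string

# Hypothetical stand-ins; the real lists (Appendix 1, PARTS) are not shown here.
ABBREVIATIONS = {"e.g.", "i.e.", "etc.", "ph.d."}
FUNCTION_WORDS = {"the", "a", "an", "it", "this", "that", "in", "on", "of"}


def ends_sentence(token, next_token):
    """Decide whether a white-space delimited token closes a sentence."""
    if token.endswith(("!", "?")):
        return True
    if not token.endswith("."):          # also covers 1.25 and rm1.c
        return False
    if next_token is None:               # end of text
        return True
    if token.lower() in ABBREVIATIONS:   # assumption: never a sentence end
        return False
    body = token[:-1]                    # for the marker "/." this ends in "/"
    next_is_capitalized = next_token[:1].isupper()
    next_word = next_token.strip(string.punctuation).lower()
    if len(body) == 1 and body.isalpha() and body.isupper():
        # An initial: break only before a capitalized function word.
        return next_is_capitalized and next_word in FUNCTION_WORDS
    if body[-1:].isdigit():
        # A number at the end: break only before a capitalized word.
        return next_is_capitalized
    return True


def split_sentences(text):
    """Split text into sentences using the rules sketched above."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        if ends_sentence(token, next_token):
            sentences.append(" ".join(current))
            current = []
    if current:                          # trailing fragment without an end mark
        sentences.append(" ".join(current))
    return sentences


parts = split_sentences("J. D. Jones wrote system H. The cost was 1.25 dollars.")
```

With these stand-in sets the sample text is split after "H." only; the initials "J." and "D." and the number 1.25 do not cause a break.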
        !           161: .NH 2
        !           162: Readability Grades
        !           163: .PP
        !           164: The first section of STYLE output consists of four readability indices.
        !           165: As Klare points out in [3], readability indices may be used to
        !           166: estimate the reading skills needed by the reader to understand a document.
        !           167: The readability indices reported by STYLE are based on
        !           168: measures of sentence and word lengths.
        !           169: Although the indices
        !           170: may not measure whether the document is coherent
        !           171: and well organized,
        !           172: experience has shown that high indices seem to be indicators of stylistic
        !           173: difficulty.
        !           174: Documents with short sentences and short words have low scores;
        !           175: those with long sentences and many polysyllabic words have high scores.
        !           176: The four formulae reported are the Kincaid Formula [4], the Automated Readability Index [5],
        !           177: the Coleman-Liau Formula [6],
        !           178: and a normalized version of the Flesch Reading Ease Score [7].
        !           179: The formulae differ because they were experimentally derived using different texts
        !           180: and subject groups.
        !           181: We will discuss each of the formulae briefly; for a more
        !           182: detailed discussion the reader should see [3].
        !           183: .PP
        !           184: The Kincaid Formula, given by:
        !           185: .EQ
        !           186: Reading_Grade = 11.8 * syl_per_wd + .39 * wds_per_sent - 15.59
        !           187: .EN
        !           188: .br
        !           189: was based on Navy training manuals that ranged in difficulty
        !           190: from 5.5 to 16.3 in reading grade level.
        !           191: The score reported by this formula tends to be in the mid-range of the
        !           192: 4 scores.
        !           193: Because it is based on adult training manuals rather than
        !           194: school book text, this formula is probably the best
        !           195: one to apply to technical documents.
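As a minimal sketch, the Kincaid Formula can be evaluated from three raw counts. The function name and the sample counts below are made up for illustration; how STYLE counts syllables is not described in this section, so the syllable total is simply taken as an input.

```python
def kincaid_grade(syllables, words, sentences):
    """Kincaid reading grade from raw counts of syllables, words and sentences."""
    syl_per_wd = syllables / words
    wds_per_sent = words / sentences
    return 11.8 * syl_per_wd + 0.39 * wds_per_sent - 15.59


# Hypothetical text: 100 words, 5 sentences, 150 syllables.
grade = kincaid_grade(syllables=150, words=100, sentences=5)
```

For the hypothetical counts above, 1.5 syllables per word and 20 words per sentence give a grade of about 9.9.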
        !           196: .PP
        !           197: The Automated Readability Index (ARI), based on text from
        !           198: grades 0 to 7, was derived to be easy to automate.
        !           199: The formula is:
        !           200: .EQ
        !           201: Reading_Grade = 4.71 * let_per_wd + .5 * wds_per_sent - 21.43
        !           202: .EN
        !           203: .br
        !           204: ARI tends to produce scores that are higher than Kincaid and
        !           205: Coleman-Liau but are usually slightly lower than Flesch.
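The ARI formula can be checked against the counts printed in Figure 1. The sketch below is an illustration only, and the function name is an assumption; it takes the average word length (4.91) as the letters-per-word figure and derives words per sentence from the word and sentence counts (7419 and 335).

```python
def ari_grade(let_per_wd, wds_per_sent):
    """Automated Readability Index from average word and sentence length."""
    return 4.71 * let_per_wd + 0.5 * wds_per_sent - 21.43


# Values read from Figure 1: av word leng 4.91, 7419 words, 335 sentences.
figure1_ari = ari_grade(let_per_wd=4.91, wds_per_sent=7419 / 335)
```

The result is close to 12.77, which agrees with the 12.8 printed as "(auto)" in Figure 1 once rounded to one decimal place.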
        !           206: .PP
        !           207: The Coleman-Liau Formula, based on text ranging in
        !           208: difficulty from .4 to 16.3, is:
        !           209: .EQ
        !           210: Reading_Grade = 5.89 * let_per_wd - .3 * sent_per_100_wds - 15.8
        !           211: .EN
        !           212: .br
        !           213: Of the four formulae this one usually gives the lowest
        !           214: grade when applied to technical documents.
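The same check can be made for the Coleman-Liau Formula, again as a sketch with an invented function name. Sentences per 100 words is derived from the Figure 1 counts of 335 sentences and 7419 words, and the average word length of 4.91 is used as letters per word.

```python
def coleman_liau_grade(let_per_wd, sent_per_100_wds):
    """Coleman-Liau reading grade in the form given in the text."""
    return 5.89 * let_per_wd - 0.3 * sent_per_100_wds - 15.8


# Values read from Figure 1: av word leng 4.91, 335 sentences, 7419 words.
sent_per_100_wds = 100 * 335 / 7419
figure1_cl = coleman_liau_grade(4.91, sent_per_100_wds)
```

The result is about 11.77, in line with the 11.8 shown for Coleman-Liau in Figure 1.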
        !           215: .PP
        !           216: The last formula, the Flesch Reading Ease Score, is based
        !           217: on grade school text covering grades 3 to 12.
        !           218: The formula, given by:
        !           219: .EQ
        !           220: Reading_Score = 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent
        !           221: .EN
        !           222: .br
        !           223: is usually reported in the range 0 (very difficult) to 100 (very easy).
        !           224: The score reported by STYLE is scaled to be comparable to
        !           225: the other formulas,
        !           226: except that the maximum grade level reported is set to 17.
        !           227: The Flesch score is usually the highest of the 4 scores
        !           228: on technical documents.
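A minimal sketch of the unscaled Flesch Reading Ease Score follows. The function name and the sample averages are hypothetical. The rescaling that STYLE applies to turn this score into a grade level is not given in the text, so it is deliberately left out rather than guessed.

```python
def flesch_reading_ease(syl_per_wd, wds_per_sent):
    """Raw Flesch Reading Ease Score: higher values mean easier text."""
    return 206.835 - 84.6 * syl_per_wd - 1.015 * wds_per_sent


# Hypothetical averages: 1.5 syllables per word, 20 words per sentence.
score = flesch_reading_ease(1.5, 20)
```

Unlike the three grade formulas, this score falls as the text gets harder: longer words or longer sentences both reduce it.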
        !           229: .PP
        !           230: Coke [8] found that the Kincaid Formula is probably the best predictor for
        !           231: technical documents;
        !           232: both ARI and Flesch tend to overestimate
        !           233: the difficulty; Coleman-Liau tends to underestimate.
        !           234: On text in the range of grades 7 to 9
        !           235: the four formulas tend to be about the same.
        !           236: On easy text the Coleman-Liau formula is probably
        !           237: preferred since it is reasonably accurate at the lower
        !           238: grades and it is safer to present text that is a little too
        !           239: easy than a little too hard.
        !           240: .PP
        !           241: If a document has particularly difficult technical content, especially if
        !           242: it includes a lot of mathematics,
        !           243: it is probably best to make the text very easy to read, i.e., to lower the
        !           244: readability index by shortening the sentences and words.
        !           245: This will allow the reader to concentrate on the technical
        !           246: content and not the long sentences.
        !           247: The user should remember that these indices are estimators;
        !           248: they should not be taken as absolute numbers.
        !           249: STYLE called with ``\-r number'' will print all sentences with
        !           250: an Automated Readability Index equal to or greater than ``number''.
        !           251: .NH 2
        !           252: Sentence length and structure
        !           253: .PP
        !           254: The next two sections of STYLE output deal with sentence length and structure.
        !           255: Almost all books on writing style or effective writing emphasize
        !           256: the importance of variety in sentence length and structure for good writing.
        !           257: Ewing's first rule in discussing style in the book
        !           258: .I
        !           259: Writing for Results
        !           260: .R
        !           261: [9] is:
        !           262: .DS
        !           263: ``Vary the sentence structure and length of your sentences.''
        !           264: .DE
        !           265: Leggett, Mead and Charvat break this rule into 3 in
        !           266: .I
        !           267: Prentice-Hall Handbook for Writers
        !           268: .R
        !           269: [10] as follows:
        !           270: .DS
        !           271: ``34a. Avoid the overuse of short simple sentences.''
        !           272: ``34b. Avoid the overuse of long compound sentences.''
        !           273: ``34c. Use various sentence structures to avoid monotony and increase effectiveness.''
        !           274: .DE
        !           275: Although experts agree that these rules are important, not all writers
        !           276: follow them.
        !           277: Sample technical documents have been found with almost no
        !           278: sentence length or type variability.
        !           279: One document had 90% of its sentences about the same
        !           280: length as the average;
        !           281: another was made up almost entirely of simple sentences (80%).
        !           282: .PP
        !           283: The output sections labeled ``sentence info'' and ``sentence types'' give
        !           284: both length and structure measures.
        !           285: STYLE reports on the number and average length of both
        !           286: sentences and words,
        !           287: and number of questions and imperative sentences (those ending in ``/.'').
        !           288: The measures of non-function words are an attempt to look at the content
        !           289: words in the document.
        !           290: In English
        !           291: non-function words are nouns, adjectives, adverbs, and non-auxiliary verbs;
        !           292: function words are prepositions, conjunctions, articles, and auxiliary
        !           293: verbs.
        !           294: Since most function words are short, they tend to lower the average
        !           295: word length.
        !           296: The average length of non-function words may be a more useful measure for comparing
        !           297: word choice of different writers than the total average word length.
        !           298: The percentages of short and long sentences measure sentence
        !           299: length variability.
        !           300: Short sentences are those at least 5 words shorter than the
        !           301: average; long sentences are those at least 10 words longer than the average.
        !           302: Last in the sentence information section is the
        !           303: length and location of the longest and shortest sentences.
        !           304: If the flag ``\-l number'' is used, STYLE will print all sentences
        !           305: longer than ``number''.
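The short and long sentence measures can be sketched as follows, assuming a non-empty list of sentence lengths in words. The function name and the sample lengths are invented, and the handling of sentences that fall exactly on a threshold follows the wording above ("at least"); STYLE itself may treat the boundary differently.

```python
def length_variability(sentence_lengths):
    """Average length plus counts and percentages of short and long sentences."""
    total = len(sentence_lengths)
    average = sum(sentence_lengths) / total
    # Short: at least 5 words below the average.
    short = sum(1 for n in sentence_lengths if n <= average - 5)
    # Long: at least 10 words above the average.
    long_ = sum(1 for n in sentence_lengths if n >= average + 10)
    return {
        "average": average,
        "short": short,
        "short_pct": 100.0 * short / total,
        "long": long_,
        "long_pct": 100.0 * long_ / total,
    }


# Hypothetical document with six sentences.
report = length_variability([5, 10, 20, 20, 20, 45])
```

Here the average is 20 words, so the sentences of 5 and 10 words count as short and the sentence of 45 words counts as long.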
        !           306: .PP
        !           307: Because of the difficulties in dealing with the many uses of commas and conjunctions
        !           308: in English, sentence type definitions
        !           309: vary slightly from those of standard textbooks, but still measure
        !           310: the same constructional activity.
        !           311: .IP 1.
        !           312: A simple sentence has one verb and no dependent clause.
        !           313: .IP 2.
        !           314: A complex sentence has one independent
        !           315: clause and one dependent clause, each with one verb.
        !           316: Complex sentences are found by identifying sentences that contain either
        !           317: a subordinate conjunction or a clause beginning with words like ``that''
        !           318: or ``who''.
        !           319: The preceding sentence has such a clause.
        !           320: .IP 3.
        !           321: A compound sentence has more than one verb and no dependent
        !           322: clause.
        !           323: Sentences joined by ``;'' are also counted as compound.
        !           324: .IP 4.
        !           325: A compound-complex sentence has either several dependent clauses
        !           326: or one dependent clause and a compound verb in either
        !           327: the dependent or independent clause.
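The four definitions can be read as a small decision procedure over two counts. The sketch below assumes that the number of verb phrases and the number of dependent clauses have already been found by a tagger such as PARTS; the function name is invented, a sentence with no verb is lumped in with simple sentences, and the rule that sentences joined by ``;'' count as compound is not modeled.

```python
def sentence_type(verbs, dependent_clauses):
    """Classify a sentence from its verb-phrase and dependent-clause counts."""
    if dependent_clauses == 0:
        # No subordination: one verb is simple, several are compound.
        return "simple" if verbs <= 1 else "compound"
    if dependent_clauses == 1 and verbs <= 2:
        # One independent and one dependent clause, each with one verb.
        return "complex"
    # Several dependent clauses, or one dependent clause plus a compound verb.
    return "compound-complex"


examples = [sentence_type(1, 0), sentence_type(3, 0),
            sentence_type(2, 1), sentence_type(3, 1), sentence_type(3, 2)]
```

The five sample calls come out as simple, compound, complex, compound-complex and compound-complex, in that order.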
        !           328: .PP
        !           329: Even using these broader definitions, simple
        !           330: sentences dominate many of the technical documents that
        !           331: have been tested,
        !           332: but the example in Figure 1 shows variety in both sentence structure and
        !           333: sentence length.
        !           334: .NH 2
        !           335: Word Usage
        !           336: .PP
        !           337: The word usage measures are an attempt to identify
        !           338: some other constructional features of writing style.
        !           339: There are many different ways in English to
        !           340: say the same thing.
        !           341: The constructions differ from one another
        !           342: in the form of the words used.
        !           343: The following sentences all convey approximately the
        !           344: same meaning but differ in word usage:
        !           345: .DS
        !           346: The cxio program is used to perform all communication between the systems.
        !           347: The cxio program performs all communications between the systems.
        !           348: The cxio program is used to communicate between the systems.
        !           349: The cxio program communicates between the systems.
        !           350: All communication between the systems is performed by the cxio program.
        !           351: .DE
        !           352: The distribution of the parts of speech and verb constructions
        !           353: helps identify overuse of particular constructions.
        !           354: Although the measures used by STYLE are crude, they do point out
        !           355: problem areas.
        !           356: For each category, STYLE reports a percentage and a raw count.
        !           357: In addition to looking at the percentage, the user
        !           358: may find it useful to compare the raw count with the number of sentences.
        !           359: If, for example, the number of infinitives is almost equal to the number
        !           360: of sentences, then many of the sentences in the document are constructed
        !           361: like the first and third in the preceding example.
        !           362: The user may want to transform some of these sentences into another form.
        !           363: Some of the implications of the word usage measures are discussed below.
        !           364: .IP "\fIVerbs\fR "
        !           365: are measured in several different ways to
        !           366: try to determine what types of verb constructions are
        !           367: most frequent in the document.
        !           368: Technical writing tends to contain many
        !           369: passive verb constructions and other usage of the verb ``to be''.
        !           370: The category of verbs labeled ``tobe'' measures both passives and sentences of
        !           371: the form:
        !           372: .DS
        !           373: .I
        !           374: subject tobe predicate
        !           375: .R
        !           376: .DE
        !           377: In counting verbs, whole verb phrases are counted as one verb.
        !           378: Verb phrases containing auxiliary verbs are counted in the category
        !           379: ``aux''.
        !           380: The verb phrases counted here are those whose tense is not
        !           381: simple present or simple past.
        !           382: It might eventually be useful to do more detailed measures
        !           383: of verb tense or mood.
        !           384: Infinitives are listed as ``inf''.
        !           385: The percentages reported for these three categories are based on
        !           386: the total number of verb phrases found.
        !           387: These categories are not mutually exclusive;
        !           388: they cannot be added, since, for example,
        !           389: ``to be going'' counts as both ``tobe'' and ``inf''.
        !           390: Use of these three types of verb constructions varies significantly among authors.
        !           391: .sp 2
        !           392: STYLE reports passive verbs as a percentage of the finite verbs in the
        !           393: document.
        !           394: Most style books warn against the overuse of passive verbs.
        !           395: Coleman [11] has shown that sentences with
        !           396: active verbs are easier to learn than those
        !           397: with passive verbs.
        !           398: Although the inverted object-subject order of the passive
        !           399: voice seems to emphasize the object, Coleman's experiments
        !           400: showed that there is little difference in retention
        !           401: by word position. He also showed that the direct object of an active verb
        !           402: is retained better than the subject of a passive verb.
        !           403: These experiments support the advice of the style books suggesting
        !           404: that writers should try to use active verbs wherever possible.
        !           405: The flag ``\-p'' causes STYLE to print all sentences containing passive verbs.
        !           406: .PP
        !           407: .IP "\fIPronouns\fR "
        !           408: add cohesiveness and connectivity to a document
        !           409: by providing back-reference.
        !           410: They are often a short-hand notation for something
        !           411: previously mentioned, and therefore connect the sentence containing the pronoun with the
        !           412: word to which the pronoun refers.
        !           413: Although there are other mechanisms for such connections, documents
        !           414: with no pronouns tend to be wordy and to have little connectivity.
        !           415: .IP "\fIAdverbs\fR "
        !           416: can provide transition between sentences and order
        !           417: in time and space.
        !           418: In performing these functions, adverbs, like pronouns, provide
        !           419: connectivity and cohesiveness.
        !           420: .IP "\fIConjunctions\fR "
        !           421: provide parallelism in a document by connecting two or more
        !           422: equal units.
        !           423: These units may be whole sentences, verb phrases, nouns, adjectives, or
        !           424: prepositional phrases.
        !           425: The compound and compound-complex sentences reported under
        !           426: sentence type are parallel structures.
        !           427: Other uses of parallel structures are indicated by the degree to which the
        !           428: number of conjunctions reported under word usage exceeds the
        !           429: compound sentence measures.
        !           430: .IP "\fINouns and Adjectives.\fR "
        !           431: A ratio of nouns to adjectives near unity may indicate the over-use of modifiers.
        !           432: Some technical writers qualify every noun with one or more
        !           433: adjectives.
        !           434: Qualifiers in phrases like ``simple linear single-link network model''
        !           435: often lend more obscurity than precision to a text.
        !           436: .IP "\fINominalizations\fR "
        !           437: are verbs that are changed to nouns by adding one of the suffixes
        !           438: ``ment'', ``ance'', ``ence'', or ``ion''.
        !           439: Examples are accomplishment, admittance, adherence, and abbreviation.
        !           440: When a writer transforms a nominalized sentence to a non-nominalized
        !           441: sentence, she/he increases the effectiveness of the sentence in
        !           442: several ways.
        !           443: The noun becomes an active verb and frequently one complicated clause
        !           444: becomes two shorter clauses.
        !           445: For example,
        !           446: .DS
        !           447: Their inclusion of this provision is admission of the importance of the system.
        !           448: When they included this provision, they admitted the importance of the system.
        !           449: .DE
        !           450: Coleman found that the transformed sentences were easier to
        !           451: learn, even when the transformation produced sentences that were
        !           452: slightly longer, provided the transformation broke one clause into two.
        !           453: Writers who find their document contains many
        !           454: nominalizations may want to transform some of the sentences 
        !           455: to use active verbs.
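A crude way to spot candidate nominalizations is to test for the four suffixes named above. This sketch is not how STYLE does it: STYLE works from the word classes assigned by PARTS, while a bare suffix test also flags words such as ``moment'' that are not derived from verbs. The function names are invented for the illustration.

```python
import string

NOMINAL_SUFFIXES = ("ment", "ance", "ence", "ion")


def looks_like_nominalization(word):
    """True if the word ends in one of the four suffixes (suffix test only)."""
    return word.lower().endswith(NOMINAL_SUFFIXES)


def count_candidates(sentence):
    """Count suffix matches among the words of a sentence."""
    words = (w.strip(string.punctuation) for w in sentence.split())
    return sum(1 for w in words if looks_like_nominalization(w))


before = count_candidates(
    "Their inclusion of this provision is admission of the importance of the system.")
after = count_candidates(
    "When they included this provision, they admitted the importance of the system.")
```

On the pair of sentences from the example, the suffix test finds four candidates in the first sentence (inclusion, provision, admission, importance) and two in the transformed one (provision, importance).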
        !           456: .NH 2
        !           457: Sentence openers
        !           458: .PP
        !           459: Another agreed-upon principle of style is variety in sentence openers.
        !           460: Because STYLE determines the type of sentence opener by
        !           461: looking at the part of speech of the first word in the sentence,
        !           462: the sentences counted under the heading ``subject opener'' may not
        !           463: all really begin with the subject.
        !           464: However, a large percentage of sentences in this category
        !           465: still indicates lack of variety in sentence openers.
        !           466: Other sentence opener measures help the user determine
        !           467: if there are transitions between sentences and where
        !           468: the subordination occurs.
        !           469: Adverbs and conjunctions at the beginning of sentences are mechanisms for
        !           470: transition between sentences.
        !           471: A pronoun at the beginning shows a link to something previously mentioned
        !           472: and indicates connectivity.
        !           473: .PP
        !           474: The location of subordination can be determined by comparing
        !           475: the number of sentences that begin with a subordinator with
        !           476: the number of sentences with complex clauses.
        !           477: If few sentences start with subordinate conjunctions, then
        !           478: the subordination is embedded or at the end of the complex sentences.
        !           479: For variety the writer may want to transform some sentences
        !           480: to have leading subordination.
        !           481: .PP
        !           482: The last category of openers, expletives, is commonly
        !           483: overworked in technical writing.
        !           484: Expletives are the words ``it'' and ``there'', usually with the verb ``to be'',
        !           485: in constructions where the subject follows the verb.
        !           486: For example,
        !           487: .DS
        !           488: There are three streets used by the traffic.
        !           489: There are too many users on this system.
        !           490: .DE
        !           491: This construction tends to emphasize the object rather than the
        !           492: subject of the sentence.
        !           493: The flag ``\-e'' will cause STYLE to print all
        !           494: sentences that begin with an expletive.
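The expletive check described above can be approximated with a few lines of Python. This is only a sketch under stated assumptions: STYLE classifies openers from the part of speech of the first word, whereas the hypothetical function opens_with_expletive below simply tests the first two words against a small hand-made word list (TO_BE). It will therefore also flag sentences in which ``it'' is a genuine pronoun.

```python
import re

# Forms of "to be" that may follow an expletive "it" or "there".
# This word list is an assumption made for illustration only.
TO_BE = {"is", "are", "was", "were", "be", "been", "being"}


def opens_with_expletive(sentence):
    """Return True if the sentence starts with 'it' or 'there' plus a form of 'to be'."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return len(words) >= 2 and words[0] in ("it", "there") and words[1] in TO_BE


sentences = [
    "There are three streets used by the traffic.",
    "There are too many users on this system.",
    "The traffic uses three streets.",
]

# Collect and print the sentences that open with an expletive.
flagged = [s for s in sentences if opens_with_expletive(s)]
for s in flagged:
    print(s)
```

Run on the two example sentences from the text plus one rewritten with the subject first, the sketch prints only the two expletive openers.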
        !           495: .NH 1
        !           496: DICTION
        !           497: .PP
        !           498: The program DICTION prints all sentences in a document containing
        !           499: phrases that are either frequently misused or indicate wordiness.
        !           500: The program, an extension of Aho's FGREP [12] string
        !           501: matching program,
        !           502: takes as input a file of phrases or patterns to be matched and a file
        !           503: of text to be searched.
        !           504: A data base of about 450 phrases has been compiled as a default
        !           505: pattern file for DICTION.
        !           506: Before attempting to locate phrases, the program maps
        !           507: upper case letters to lower case and substitutes blanks for
        !           508: punctuation.
        !           509: Sentence boundaries were deemed less critical in DICTION than
        !           510: in STYLE, so abbreviations and other uses of the character
        !           511: ``.'' are not treated specially.
        !           512: DICTION brackets all pattern matches in a sentence with the characters
        !           513: ``['' and ``]''.
        !           514: Although many of the phrases in the default data base are correct
        !           515: in some contexts, in others they indicate wordiness.
        !           516: Some examples of the phrases and suggested alternatives are:
        !           517: .DS
        !           518: .TS
        !           519: cc
        !           520: ll.
        !           521: Phrase Alternative
        !           522: a large number of      many
        !           523: arrive at a decision   decide
        !           524: collect together       collect
        !           525: for this reason        so
        !           526: pertaining to  about
        !           527: through the use of     by or with
        !           528: utilize        use
        !           529: with the exception of  except
        !           530: .TE
        !           531: .DE
        !           532: Appendix 2 contains a complete list of the default file.
        !           533: Some of the entries are short forms of problem phrases.
        !           534: For example, the phrase ``the fact'' is found in all of the following
        !           535: and is sufficient to point out the wordiness to the user:
        !           536: .DS
        !           537: .TS
        !           538: cc
        !           539: ll.
        !           540: Phrase Alternative
        !           541: accounted for by the fact that caused by
        !           542: an example of this is the fact that    thus
        !           543: based on the fact that because
        !           544: despite the fact that  although
        !           545: due to the fact that   because
        !           546: in light of the fact that      because
        !           547: in view of the fact that       since
        !           548: notwithstanding the fact that  although
        !           549: .TE
        !           550: .DE
        !           551: Entries in Appendix 2 preceded by ``~'' are not matched.
        !           552: See Section 7 for details on the use of ``~''.
        !           553: .PP
        !           554: The user may supply her/his own pattern file with the flag ``\-f patfile''.
        !           555: In this case the default file will be loaded first, followed by the user file.
        !           556: This mechanism allows users to suppress
        !           557: patterns contained in the default file or to include their own pet peeves that are not in the default file.
        !           558: The flag ``\-n'' will exclude the default file altogether.
        !           559: In constructing a pattern file, blanks should be used before and after each
        !           560: phrase to avoid matching substrings in words.
        !           561: For example, to find all occurrences of the word ``the'', the pattern
        !           562: `` the '' should be used.
        !           563: The blanks cause only the word ``the'' to be matched and not the
        !           564: string ``the'' in words like there, other, and therefore.
        !           565: One side effect of surrounding the words with blanks is that
        !           566: when two phrases occur without intervening words, only the
        !           567: first will be matched.
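The matching behavior described in this section can be sketched in Python as follows. The sketch is not the DICTION implementation: it assumes a plain left-to-right, non-overlapping scan that takes the longest pattern at each position, it pads the sentence with a blank at each end so that phrases at the boundaries can match, and it brackets the normalized text rather than the original. The function names (normalize, find_matches, bracket) are hypothetical.

```python
import re


def normalize(sentence):
    """Lower-case the text and replace each punctuation character with a blank."""
    blanked = re.sub(r"[^a-z0-9\s]", " ", sentence.lower())
    # Pad with blanks (an assumption) so phrases at either end can still match.
    return " " + blanked + " "


def find_matches(text, patterns):
    """Scan left to right, taking the longest pattern that matches at each position."""
    matches = []
    i = 0
    while i < len(text):
        hit = max((p for p in patterns if text.startswith(p, i)), key=len, default=None)
        if hit is None:
            i += 1
        else:
            matches.append((i, hit))
            i += len(hit)  # the trailing blank is consumed along with the match
    return matches


def bracket(sentence, patterns):
    """Return the normalized sentence with each matched phrase wrapped in [ ]."""
    text = normalize(sentence)
    out = []
    pos = 0
    for start, pat in find_matches(text, patterns):
        out.append(text[pos:start])
        out.append(" [" + pat.strip() + "] ")
        pos = start + len(pat)
    out.append(text[pos:])
    return "".join(out).strip()


# Each pattern carries a leading and a trailing blank, as the text recommends.
patterns = [" the ", " utilize ", " a large number of "]

print(bracket("There are other things.", patterns))
print(bracket("We utilize a large number of tools.", patterns))
```

The first sentence produces no brackets, because the blanks keep `` the '' from matching inside ``there'' and ``other''. The second shows the side effect noted above: `` utilize '' consumes the blank that `` a large number of '' needs as its leading character, so only the first phrase is bracketed.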
        !           568: .NH 1
        !           569: EXPLAIN
        !           570: .PP
        !           571: The last program, EXPLAIN, is an interactive thesaurus for
        !           572: phrases found by DICTION.
        !           573: The user types one of the phrases bracketed by DICTION
        !           574: and EXPLAIN responds with suggested substitutions for the phrase
        !           575: that will improve the diction of the document.
        !           576: .KF
        !           577: .DS C
        !           578: Table 1
        !           579: Text Statistics on 20 Technical Documents
        !           580: .TS
        !           581: cccccc
        !           582: llnnnn.
        !           583:        variable        minimum maximum mean    standard deviation
        !           584: _
        !           585: Readability    Kincaid 9.5     16.9    13.3    2.2
        !           586:        automated       9.0     17.4    13.3    2.5
        !           587:        Coleman-Liau    10.0    16.0    12.7    1.8
        !           588:        Flesch  8.9     17.0    14.4    2.2
        !           589: _
        !           590: sentence info. av sent length  15.5    30.3    21.6    4.0
        !           591:        av word length  4.61    5.63    5.08    .29
        !           592:        av nonfunction length   5.72    7.30    6.52    .45
        !           593:        short sent      23%     46%     33%     5.9
        !           594:        long sent       7%      20%     14%     2.9
        !           595: _
        !           596: sentence types simple  31%     71%     49%     11.4
        !           597:        complex 19%     50%     33%     8.3
        !           598:        compound        2%      14%     7%      3.3
        !           599:        compound-complex        2%      19%     10%     4.8
        !           600: _
        !           601: verb types     tobe    26%     64%     44.7%   10.3
        !           602:        auxiliary       10%     40%     21%     8.7
        !           603:        infinitives     8%      24%     15.1%   4.8
        !           604:        passives        12%     50%     29%     9.3
        !           605: _
        !           606: word usage     prepositions    10.1%   15.0%   12.3%   1.6
        !           607:        conjunction     1.8%    4.8%    3.4%    .9
        !           608:        adverbs 1.2%    5.0%    3.4%    1.0
        !           609:        nouns   23.6%   31.6%   27.8%   1.7
        !           610:        adjectives      15.4%   27.1%   21.1%   3.4
        !           611:        pronouns        1.2%    8.4%    2.5%    1.1
        !           612:        nominalizations 2%      5%      3.3%    .8
        !           613: _
        !           614: sentence openers       prepositions    6%      19%     12%     3.4
        !           615:        adverbs 0%      20%     9%      4.6
        !           616:        subject 56%     85%     70%     8.0
        !           617:        verbs   0%      4%      1%      1.0
        !           618:        subordinating conj      1%      12%     5%      2.7
        !           619:        conjunctions    0%      4%      0%      1.5
        !           620:        expletives      0%      6%      2%      1.7
        !           621: .TE
        !           622: .DE
        !           623: .KE
        !           624: .NH 1
        !           625: Results
        !           626: .NH 2
        !           627: STYLE
        !           628: .PP
        !           629: To get baseline statistics and check the program's accuracy,
        !           630: we ran STYLE on 20 technical documents.
        !           631: There were a total of 3287 sentences in the sample.
        !           632: The shortest document was 67 sentences long; the longest 339 sentences.
        !           633: The documents covered a wide range of subject matter, including
        !           634: theoretical computing, physics, psychology, engineering, and
        !           635: affirmative action.
        !           636: Table 1 gives the range, median, and standard deviation of the various style measures.
        !           637: As you will note, most of the measurements have a fairly wide range of values
        !           638: across the sample documents.
        !           639: .PP
        !           640: As a comparison, Table 2 gives the median results
        !           641: for two different technical authors, a sample of instructional material, and a sample of the
        !           642: Federalist Papers.
        !           643: The two authors show similar styles, although author 2
        !           644: uses somewhat shorter sentences and longer words than author 1.
        !           645: Author 1 uses all types of sentences, while author 2 prefers
        !           646: simple and complex sentences, using few compound or compound-complex sentences.
        !           647: The other major difference in the styles of these authors is the location
        !           648: of subordination.
        !           649: Author 1 seems to prefer embedded or trailing subordination, while
        !           650: author 2 begins many sentences with the subordinate clause.
        !           651: The documents tested for both authors 1 and 2 were technical documents,
        !           652: written for a technical audience.
        !           653: The instructional documents, which are written for craftspeople,
        !           654: vary surprisingly little from the two technical samples.
        !           655: The sentences and words are a little longer,
        !           656: and they contain many passive and auxiliary verbs, few adverbs, and almost
        !           657: no pronouns.
        !           658: The instructional documents contain many imperative sentences, so there are
        !           659: many sentences with verb openers.
        !           660: The sample of Federalist Papers contrasts with the other
        !           661: samples in almost every way.
        !           662: .KF
        !           663: .DS C
        !           664: Table 2
        !           665: Text Statistics on Single Authors
        !           666: .TS
        !           667: cccccc
        !           668: llnnnn.
        !           669:        variable        author 1        author 2        inst.   FED
        !           670: _
        !           671: readability    Kincaid 11.0    10.3    10.8    16.3
        !           672:        automated       11.0    10.3    11.9    17.8
        !           673:        Coleman-Liau    9.3     10.1    10.2    12.3
        !           674:        Flesch  10.3    10.7    10.1    15.0
        !           675: _
        !           676: sentence info  av sent length  22.64   19.61   22.78   31.85
        !           677:        av word length  4.47    4.66    4.65    4.95
        !           678:        av nonfunction length   5.64    5.92    6.04    6.87
        !           679:        short sent      35%     43%     35%     40%
        !           680:        long sent       18%     15%     16%     21%
        !           681: _
        !           682: sentence types simple  36%     43%     40%     31%
        !           683:        complex 34%     41%     37%     34%
        !           684:        compound        13%     7%      4%      10%
        !           685:        compound-complex        16%     8%      14%     25%
        !           686: _
        !           687: verb type      tobe    42%     43%     45%     37%
        !           688:        auxiliary       17%     19%     32%     32%
        !           689:        infinitives     17%     15%     12%     21%
        !           690:        passives        20%     19%     36%     20%
        !           691: _
        !           692: word usage     prepositions    10.0%   10.8%   12.3%   15.9%
        !           693:        conjunctions    3.2%    2.4%    3.9%    3.4%
        !           694:        adverbs 5.05%   4.6%    3.5%    3.7%
        !           695:        nouns   27.7%   26.5%   29.1%   24.9%
        !           696:        adjectives      17.0%   19.0%   15.4%   12.4%
        !           697:        pronouns        5.3%    4.3%    2.1%    6.5%
        !           698:        nominalizations 1%      2%      2%      3%
        !           699: _
        !           700: sentence openers       prepositions    11%     14%     6%      5%
        !           701:        adverbs 9%      9%      6%      4%
        !           702:        subject 65%     59%     54%     66%
        !           703:        verb    3%      2%      14%     2%
        !           704:        subordinating conj      8%      14%     11%     3%
        !           705:        conjunction     1%      0%      0%      3%
        !           706:        expletives      3%      3%      0%      3%
        !           707: .TE
        !           708: .DE
        !           709: .KE
        !           710: .NH 2
        !           711: DICTION
        !           712: .PP
        !           713: In the few weeks that DICTION has been available
        !           714: to users,
        !           715: about 35,000 sentences have been run with about
        !           716: 5,000 string matches.
        !           717: The authors using the program seem to make
        !           718: the suggested changes about 50-75% of the time.
        !           719: To date, almost 200 of the 450 strings in the default
        !           720: file have been matched.
        !           721: Although most of these phrases are valid and correct
        !           722: in some contexts, the 50-75% change rate seems to
        !           723: show that the phrases are used much more often than
        !           724: concise diction warrants.
        !           725: .NH 1
        !           726: Accuracy
        !           727: .NH 2
        !           728: Sentence Identification
        !           729: .PP
        !           730: The correctness of the STYLE output on the 20 document sample was checked
        !           731: in detail.
        !           732: STYLE misidentified
        !           733: 129 sentence fragments as sentences
        !           734: and incorrectly joined two or more sentences 75 times
        !           735: in the 3287 sentence sample.
        !           736: The problems were usually caused by nonstandard formatting
        !           737: commands, unknown abbreviations, or lists of non-sentences.
        !           738: An impossibly long sentence found as the longest sentence in
        !           739: the document usually is the result of a long list
        !           740: of non-sentences.
        !           741: .NH 2
        !           742: Sentence Types
        !           743: .PP
        !           744: STYLE correctly identified sentence type on 86.5% of
        !           745: the sentences in the sample.
        !           746: The type distribution of the sentences was
        !           747: 52.5% simple, 29.9% complex, 8.5% compound and
        !           748: 9% compound-complex.
        !           749: The program reported 49.5% simple, 31.9% complex,
        !           750: 8% compound and 10.4% compound-complex.
        !           751: Looking at the errors on the individual
        !           752: documents, the number of simple sentences was
        !           753: under-reported by about 4% and the complex and compound-complex
        !           754: were over-reported by 3% and 2%, respectively.
        !           755: The following matrix shows the program's output
        !           756: vs. the actual sentence type.
        !           757: .DS C
        !           758: .TS
        !           759: csssss
        !           760: cccccc
        !           761: clnnnn.
        !           762: Program Results
        !           763:                simple  complex compound        comp-complex
        !           764: Actual simple  1566    132     49      17
        !           765: Sentence       complex 47      892     6       65
        !           766: Type   compound        40      6       207     23
        !           767:        comp-complex    0       52      5       249
        !           768: .TE
        !           769: .DE
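The summary percentages can be recomputed from the matrix, as in the following Python sketch. It assumes that rows are the actual sentence types and columns are the types the program reported, which is how the row and column labels read. The entries as printed sum to 3356 rather than the 3287 sentences quoted for the sample, so the computed figures land near, but not exactly on, the percentages given in the text.

```python
# Rows are the actual sentence types, columns are the types STYLE reported,
# both in the order: simple, complex, compound, compound-complex.
LABELS = ["simple", "complex", "compound", "comp-complex"]
MATRIX = [
    [1566, 132, 49, 17],
    [47, 892, 6, 65],
    [40, 6, 207, 23],
    [0, 52, 5, 249],
]

# Total number of sentences in the matrix and the number on the diagonal.
total = sum(sum(row) for row in MATRIX)
correct = sum(MATRIX[i][i] for i in range(len(LABELS)))
accuracy = 100.0 * correct / total

# Row sums give the actual distribution; column sums give the reported one.
actual_pct = [100.0 * sum(row) / total for row in MATRIX]
reported_pct = [100.0 * sum(col) / total for col in zip(*MATRIX)]

print(f"agreement: {accuracy:.1f}%")
for label, a, r in zip(LABELS, actual_pct, reported_pct):
    print(f"{label:13s} actual {a:5.1f}%  reported {r:5.1f}%")
```

The diagonal holds the sentences whose type was identified correctly; the off-diagonal entries in the first row show where simple sentences were reported as some other type.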
        !           770: .PP
        !           771: The system's inability to find imperative sentences seems to
        !           772: have little effect on most of the style statistics.
        !           773: A document with half of its sentences imperative was run, with and
        !           774: without the imperative end marker.
        !           775: The results were identical except for the expected errors of not finding
        !           776: verbs as sentence openers, not counting the imperative sentences,
        !           777: and a slight difference (1%) in the number of nouns
        !           778: and adjectives reported.
        !           779: .NH 2
        !           780: Word Usage
        !           781: .PP
        !           782: The accuracy of identifying word types reflects
        !           783: that of PARTS, which is about 95% correct.
        !           784: The largest source of confusion is between nouns and
        !           785: adjectives.
        !           786: The verb counts were checked on about 20 sentences from each
        !           787: document and found to be about 98% correct.
        !           788: .NH 1
        !           789: Technical Details
        !           790: .NH 2
        !           791: Finding Sentences
        !           792: .PP
        !           793: The formatting commands embedded in the text increase the difficulty
        !           794: of finding sentences.
        !           795: Not all text in a document is in sentence form; there are headings,
        !           796: tables, equations and lists, for example.
        !           797: Headings like ``Finding Sentences'' above should be discarded, not
        !           798: attached to the next sentence.
        !           799: However, since many of the documents are formatted to be phototypeset,
        !           800: and contain font changes, which usually operate on the
        !           801: most important words in the document,
        !           802: discarding all formatting commands is not correct.
        !           803: To improve the programs' ability to find sentence boundaries, the deformatting program, DEROFF [13],
        !           804: has been given some knowledge of the formatting packages used on the
        !           805: .UX
        !           806: operating system.
        !           807: DEROFF will now do the following:
        !           808: .IP 1.
        !           809: Suppress all formatting macros that
        !           810: are used for titles, headings, author's name, etc.
        !           811: .IP 2.
        !           812: Suppress the arguments to the macros for titles, headings, author's name, etc.
        !           813: .IP 3.
        !           814: Suppress displays, tables, footnotes and text that is centered or in no-fill mode.
        !           815: .IP 4.
        !           816: Substitute a place holder for equations and check
        !           817: for hidden end markers.
        !           818: The place holder is necessary because many typists and authors use
        !           819: the equation setter to change fonts on important words.
        !           820: For this reason, header files containing the definition of
        !           821: the EQN delimiters must also be included as input to STYLE.
        !           822: End markers are often hidden when an equation ends a sentence
        !           823: and the period is typed
        !           824: inside the EQN delimiters.
        !           825: .IP 5.
        !           826: Add a ``.'' after lists.
        !           827: If the flag \-ml is also used, all lists are suppressed.
        !           828: This is a separate flag because of the variety of ways the
        !           829: list macros are used.
        !           830: Often, lists are sentences that should be included in the analysis.
        !           831: The user must determine how lists are used in the document to be analyzed.
        !           832: .PP
        !           833: Both STYLE and DICTION call DEROFF before they look at the text.
        !           834: The user should supply the \-ml flag if the document contains
        !           835: many lists of non-sentences that should be skipped.
        !           836: .NH 2
        !           837: Details of DICTION
        !           838: .PP
        !           839: The program DICTION is based on the string matching program FGREP.
        !           840: FGREP takes as input a file of patterns to be matched and a file
        !           841: to be searched and outputs each line that contains
        !           842: any of the patterns
        !           843: with no indication of which pattern was matched.
        !           844: The following changes have been added to FGREP:
        !           845: .IP 1.
        !           846: The basic unit that DICTION operates on is a sentence rather than a line.
        !           847: Each sentence that contains one of the patterns is output.
        !           848: .IP 2.
        !           849: Upper case letters are mapped to lower case.
        !           850: .IP 3.
        !           851: Punctuation is replaced by blanks.
        !           852: .IP 4.
        !           853: All pattern matches in the sentence are found and surrounded with
        !           854: ``['' and ``]''.
        !           855: .IP 5.
        !           856: A method for suppressing a string match has been added.
        !           857: Any pattern that begins with ``~'' will not be matched.
        !           858: Because the matching algorithm finds the longest
        !           859: substring, the suppression of a match allows words in some
        !           860: correct contexts not to be matched while allowing
        !           861: the word in another context to be found.
        !           862: For example, the word ``which'' is often incorrectly used
        !           863: instead of ``that'' in restrictive clauses.
        !           864: However, ``which'' is usually correct when preceded by a preposition
        !           865: or ``,''.
        !           866: The default pattern file suppresses the match
        !           867: of the common prepositions or a double
        !           868: blank followed by ``which'' and therefore matches only
        !           869: the suspect uses.
        !           870: The double blank accounts for the replaced comma.
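The interaction between longest-match selection and ``~'' suppression can be illustrated with the Python sketch below. It is an approximation, not the FGREP-based code: it assumes a left-to-right, non-overlapping scan, it expects text that has already been lower-cased with punctuation replaced by blanks, and the specific entries in the patterns list are invented for the example rather than copied from the default file. The function name suspect_matches is hypothetical.

```python
def suspect_matches(text, patterns):
    """Return the phrases reported in text, honoring entries that begin with '~'."""
    # Map each phrase to True when it is a suppressed ('~') entry.
    table = {}
    for p in patterns:
        if p.startswith("~"):
            table[p[1:]] = True
        else:
            table[p] = False

    found = []
    i = 0
    while i < len(text):
        # Take the longest phrase, suppressed or not, that matches at this position.
        hit = max((p for p in table if text.startswith(p, i)), key=len, default=None)
        if hit is None:
            i += 1
            continue
        if not table[hit]:
            found.append(hit.strip())
        i += len(hit)  # a suppressed phrase still consumes the text it covers
    return found


# Illustrative entries: flag "which" unless a preposition or a double blank
# (left behind by a replaced comma) comes before it.
patterns = [" which ", "~ of which ", "~ in which ", "~  which "]

print(suspect_matches(" the file which we read ", patterns))
print(suspect_matches(" the file of which we spoke ", patterns))
print(suspect_matches(" the file  which we read ", patterns))
```

Only the first sentence is reported. In the other two, a longer suppressed entry matches first and consumes the word, so the shorter `` which '' pattern never gets a chance to fire.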
        !           871: .NH 1
        !           872: Conclusions
        !           873: .PP
        !           874: A system of writing tools that measure some of the
        !           875: objective characteristics of writing style has been developed.
        !           876: The tools are sufficiently general that they may be applied to
        !           877: documents on any subject with equal accuracy.
        !           878: Although the measurements are only of the surface
        !           879: structure of the text, they do point out problem areas.
        !           880: In addition to helping writers produce better documents,
        !           881: these programs may be useful for studying
        !           882: the writing process and finding other formulae for measuring
        !           883: readability.