1.1 root 1: .TH OCR 1 coma,pipe,crab
2: .CT 1 graphics
3: .SH NAME
4: ocr \- optical character recognition
5: .SH SYNOPSIS
6: .B ocr
7: [
8: .I option ...
9: ]
10: [
11: .I file
12: ]
13: .SH DESCRIPTION
14: .I Ocr
15: reads a black-and-white image of a page from
16: .IR file ,
17: and writes ASCII to the standard output.
18: If no
19: .I file
20: is specified, it reads from the standard input.
21: .PP
22: The input is a
23: .IR picfile (5)
24: image of one column of machine-printed text.
25: Fonts, sizes, and line-spacings may vary within the column,
26: but each line should have a constant size and baseline.
27: Lines should be parallel and roughly horizontal.
28: .PP
29: In the output, white space approximates the original page layout.
30: Words are checked and corrected by reference to the
31: .IR spell (1)
32: dictionary, and hyphenations across lines are recombined.
33: .PP
34: The options are:
35: .nr xx \w'\fL-pn,m\ \ '
36: .TP \n(xxu
37: .BI -a s
38: The alphabet is the union of symbol sets selected by characters in string
39: .IR s ,
40: from among:
41: .RS
42: .PD
43: .nr yy \w'\fLA\ \ '
44: .TP \n(yyu
45: .B A
46: ABCDEFGHIJKLMNOPQRSTUVWXYZ
47: .PD 0
48: .TP
49: .B a
50: abcdefghijklmnopqrstuvwxyz
51: .PD 0
52: .TP
53: .B 0
54: 0123456789
55: .PD 0
56: .TP
57: .B .
58: \&.\^,\|-\^:\^;\|*\^'\|\^"\|?\^!\|/\|&\|$\^(\^)\^[\|\^]\|#\|@\|% \0\0\0\0\0\0\0 \kz(basic punctuation)
59: .ig
60: should include ` /(em + ???
61: shouldn't include []#@% ???
62: ..
63: .PD 0
64: .TP
65: .B ^
66: ^\|\f(CW~\fR\^`\|\^\\\||\|\^{\|}\|_ \h'|\nzu'(extended punctuation)
67: .ig
68: should include []#@% ???
69: shouldn't include ` ???
70: ..
71: .PD 0
72: .TP
73: .B +
74: +\^\-\^*\|/\|<\^>\^=\^.\^E\|e\|[\|] \h'|\nzu'(for numerical tables)
75: .PD 0
76: .TP
77: .B s
78: .ie t \(sc\^\(dg\^\(dd\^\(ct\|\(bu\|\(rg\|\(co\|\(de\^\(fm\^\(en\|\^\(mi\|\(em \h'|\nzu'(selected non-ASCII)
79: .el \\(sc\\(dg\\(dd\\(ct\\(bu\\(rg\\(co\\(de\\(fm\\(en\\(mi\\(em (selected non-ASCII)
80: .PD 0
81: .TP
82: .B l
83: .ie t \(fi\|\(fl\|f\h'-.1m'f\|f\h'-.1m'\(fi\|f\h'-.1m'\(fl\|\N'114'\|\N'115'\|\N'105'\|\N'106' \h'|\nzu'(ligatures and digraphs)
84: .el fi fl ff ffi ffl ae AE oe OE \h'|\nzu'(ligatures and digraphs)
85: .PD
86: .PP
87: The default is
88: .BR -aAa0.+^ ,
89: the full printable-ASCII set, which may be abbreviated as
90: .BR -ap .
91: Thus,
92: .B -apsl
93: selects all of the above.
94: .RE
95: .PD
96: .TP \n(xxu
97: .BI -m l[,r]
98: Trim the left and right margins of the image by
99: .I l
100: and
101: .I r
102: inches, respectively, before looking for columns.
103: If
104: .I r
105: is omitted, it is assumed to equal
106: .IR l .
107: .TP
108: .BI -n n
109: Find the
110: .I n
111: largest columns.
112: Each column should be compactly printed
113: and separated from the others by at least 5 ems of horizontal white space.
114: .TP
115: .BI -p n,m
116: Point sizes lie in the range [
117: .I n, m
118: ]; other sizes are discarded.
119: The default is
120: .BR -p6,24 .
121: .TP
122: .B -t
123: Write
124: .IR troff (1)
125: format.
126: Each column is shown on a separate page, left- and top-justified.
127: Lines are placed at their original height in the column,
128: and each word starts at its original horizontal location in the line.
129: Characters are printed at approximately their original size in Times Roman.
130: Hyphenated words are not recombined.
131: .TP
132: .B -u
133: Unspellable words are prefixed with `?' or, if
134: .B -t
135: is specified, printed boldface.
136: .TP
137: .BI -w w
138: Find the largest column of width
139: .I w
140: inches.
141: .SS Fonts
142: Times, Helvetica, Palatino, Constant Width, Printout, Baskerville, Memphis,
143: Caslon Old, Zapf, Optima, Futura, Euro, Spartan, Garamond, Breughel, Textype,
144: Bembo, Souvenir and similar fonts are recognized in roman,
145: italic, bold, condensed, and expanded styles.
146: Also Tibetan, on request.
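.SH EXAMPLES
These invocations are sketches assembled only from the options described above;
the file names are hypothetical.
.PP
Read a page containing only letters, digits, and basic punctuation,
set in 9 to 12 point type:
.EX
ocr -aAa0. -p9,12 page.pic >page.txt
.EE
.PP
Trim one inch from each margin, find the two largest columns,
mark unspellable words, and write
.IR troff (1)
format:
.EX
ocr -m1 -n2 -u -t page.pic >page.t
.EE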
147: .SH SEE ALSO
148: .IR bcp (1),
149: .IR cscan (1),
150: .IR font (6),
151: .IR picfile (5),
152: .IR spell (1),
153: .IR troff (1)
154: .SH BUGS
155: For best results, use images of high-contrast, cleanly printed original
156: documents digitized at a resolution of 400 pixels/inch or higher.
157: It sometimes helps to restrict the alphabet and point sizes to those actually present on the page.
158: Multiple-column finding is chancy; if it goes wrong, runtimes may be excessive.
159: .ig
160: 8.7 CPU minutes on pipe to read this page, September 1989.
161: ..