1.1 root 1: .TH CRASH 8 VAX-11
2: .UC 4
3: .SH NAME
4: crash \- what happens when the system crashes
5: .SH DESCRIPTION
6: This section explains what happens when the system crashes and how
7: you can get a crash dump for analysis of non-transient problems.
8: .PP
9: When the system crashes voluntarily it prints a message of the form
10: .IP
11: panic: why i gave up the ghost
12: .LP
13: on the console, and then invokes an automatic reboot procedure as
14: described in
15: .IR reboot (8).
16: If the auto-reboot switch is off on the console, then the processor
17: will simply halt at this point.
18: Otherwise the registers and the top few locations of the stack will
19: be printed on the console, and then the system will check the disks
20: and (unless some unexpected inconsistency is encountered) resume
21: multi-user operations.
22: .PP
23: The system has a large number of internal consistency checks; if one
24: of these fails, then it will panic with a very short message indicating
25: which one failed. In the absence of a dump, little can be done about
26: one of these. If the problem recurs, you should arrange to get a dump
27: for further analysis by running with auto-reboot disabled during normal
28: working hours and then following the procedure described below.
29: .PP
30: The most common cause of system failures is hardware failure, which
31: can reflect itself in different ways. Here are the messages which
32: you are likely to encounter, with some hints as to causes.
33: Left unstated in all cases is the possibility that hardware or software
34: error produced the message in some unexpected way.
35: .TP
36: IO err in push
37: .ns
38: .TP
39: hard IO err in swap
40: The system encountered an error trying to write to the paging device
41: or an error in reading critical information from a disk drive.
42: You should fix your disk if it is broken or unreliable.
43: .TP
44: Timeout table overflow
45: .ns
46: .TP
47: ran out of bdp's
48: .ns
49: .TP
50: ran out of uba map
51: These really shouldn't be panics, but until we fix up the data structures
52: involved, running out of entries causes a crash. If the timeout table
53: overflows, you should make it bigger. If you run out of bdp's or uba map
54: you probably have a buggy device driver in your system, allocating and
55: not releasing UNIBUS resources.
56: .TP
57: KSP not valid
58: .ns
59: .TP
60: SBI fault
61: .ns
62: .TP
63: Machine check
64: .ns
65: .TP
66: CHM? in kernel
67: These indicate either a serious bug in the system or, more often,
68: a glitch or failing hardware. For the machine check, the top part of
69: the resulting stack frame gives more information. You can refer to a
70: VAX 11/780 System Maintenance Guide for information on machine checks.
71: If machine checks or SBI faults recur, check out the hardware or call
72: field service. If the other faults recur, there is likely a bug somewhere
73: in the system, although these can be caused by a flaky processor.
74: Run processor microdiagnostics.
75: .TP
76: trap type %d, code=%d
77: An unexpected trap has occurred within the system; the trap types are:
78: .RS
79: .TP 10
80: 0
81: reserved addressing mode
82: .br
83: .ns
84: .TP 10
85: 1
86: privileged instruction
87: .br
88: .ns
89: .TP 10
90: 2
91: BPT
92: .br
93: .ns
94: .TP 10
95: 3
96: XFC
97: .br
98: .ns
99: .TP 10
100: 4
101: reserved operand
102: .br
103: .ns
104: .TP 10
105: 5
106: CHMK (system call)
107: .br
108: .ns
109: .TP 10
110: 6
111: arithmetic trap
112: .br
113: .ns
114: .TP 10
115: 7
116: reschedule trap (software level 3)
117: .br
118: .ns
119: .TP 10
120: 8
121: segmentation fault
122: .br
123: .ns
124: .TP 10
125: 9
126: protection fault
127: .br
128: .ns
129: .TP 10
130: 10
131: trace pending (TP bit)
132: .RE
133: .IP
134: The favorite trap type in system crashes is trap type 9, indicating
135: a wild reference. The code is the referenced address. If you look
136: down the stack, just after the trap type and the code are the pc and
137: the ps of the processor when it trapped, showing you where in the
138: system the problem occurred. These problems tend to be easy to track
139: down if they are kernel bugs since the processor stops cold, but random
140: flakiness seems to cause this sometimes, e.g. we have trapped with
141: code 80000800 three times in six months as an instruction fetch went across
142: this page boundary in the kernel but have been unable to find any reason
143: for this to have happened.
144: .TP
145: init died
146: The system initialization process has exited. This is bad news, as no new
147: users will then be able to log in. Rebooting is the only fix, so the
148: system just does it right away.
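.PP
When reading through saved console logs, it can help to translate the trap
type numbers listed above into their names mechanically.
The following is a minimal sketch, not part of the system: the names
TRAP_TYPES, PANIC_LINE and describe_trap are hypothetical, and the pattern
assumes that the console line looks exactly like the format shown above.
The code field is kept as text because the log may show it in hexadecimal.
.DS
.nf
```python
import re

# Trap type numbers and names, copied from the table in this manual page.
TRAP_TYPES = {
    0: "reserved addressing mode",
    1: "privileged instruction",
    2: "BPT",
    3: "XFC",
    4: "reserved operand",
    5: "CHMK (system call)",
    6: "arithmetic trap",
    7: "reschedule trap (software level 3)",
    8: "segmentation fault",
    9: "protection fault",
    10: "trace pending (TP bit)",
}

# Hypothetical pattern for a console line "trap type N, code=C".
# The code is captured as text; its number base is not assumed here.
PANIC_LINE = re.compile(
    r"trap type (?P<type>[0-9]+), code=(?P<code>[0-9a-fA-F]+)"
)


def describe_trap(line):
    """Return (type number, type name, code text), or None if no match."""
    match = PANIC_LINE.search(line)
    if match is None:
        return None
    number = int(match.group("type"))
    name = TRAP_TYPES.get(number, "unknown trap type")
    return number, name, match.group("code")
```
.fi
.DE
Messages that are not trap reports, such as the init died panic, simply
yield no result, so the function can be run over a whole log.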
149: .PP
150: That completes the list of panic types you are likely to see.
151: Now for the crash dump procedure:
152: .PP
153: At the moment a dump can be taken only on magnetic tape.
154: If you plan to make a dump, first be sure that a clean tape with its
155: write ring in is mounted on the tape drive.
156: .PP
157: Write the date and time on the console log.
158: Use the console commands to examine the registers, processor status longword,
159: and the top several locations on the stack.
160: A suggested command sequence, which is executed by the \*(lq@DUMP\*(rq
161: console command script, is:
162: .DS
163: .nf
164: E PSL<return>
165: E R0/NE:F<return>
166: E SP<return>
167: E/V @ /NE:40<return>
168: .fi
169: .DE
170: If hardware problems dictate a special set of commands be executed when
171: the system crashes, a sequence of commands can be saved using the console
172: command \*(lqLINK\*(rq to be reexecuted with \*(lqPERFORM\*(rq (which can be
173: abbreviated \*(lqP\*(rq).
174: If a dump is to be taken on magnetic tape (this is a good idea
175: in almost any case where the cause of the crash is not immediately obvious)
176: then the following commands will (should) be executed:
177: .DS
178: .nf
179: D PSL 0<return>
180: D PC 80000200<return>
181: C<return>
182: .fi
183: .DE
184: These commands are actually part of the standard \*(lq@DUMP\*(rq script.
185: This should write a copy of all of memory
186: on the tape, followed by two EOF marks.
187: Caution:
188: Any error is taken to mean the end of memory has been reached.
189: This means that you must be sure the ring is in,
190: the tape is ready, and the tape is clean and new.
191: .PP
192: If there are not 40(hex) locations active on the kernel stack when the
193: procedure is begun, then the console may begin to print error diagnostics.
194: You can stop this by hitting \*(lq^C\*(rq (control-C), and then give the
195: last three commands above.
196: .PP
197: If the dump fails, you can try again,
198: but some of the registers will be lost.
199: See below for what to do with the tape.
200: .PP
201: To restart after a crash, follow the directions in
202: .IR reboot (8);
203: if the virtual memory subsystem is suspected as the cause of the crash,
204: then a version of the system other than \*(lqvmunix\*(rq should be booted
205: which will leave the paging areas temporarily intact
206: for use by the post-mortem analysis program
207: .I analyze.
208: After checking your root file system consistency with
209: .IR fsck (8),
210: you can read the core dump tape into the file /vmcore with
211: .IP
212: dd if=/dev/rmt0 of=/vmcore bs=20b
213: .LP
214: It does not work to use just
215: .IR cp (1),
216: as the tape is blocked.
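.PP
In the dd command above, the b suffix multiplies by 512 bytes, so bs=20b
asks for reads of 10240 bytes each.
The sketch below shows the same idea of copying in fixed-size reads.
It is an illustration only, under the assumption that each read of the
tape device returns one tape block; the names BLOCK_UNIT, BLOCK_SIZE and
copy_blocked are hypothetical and are not part of any system utility.
.DS
.nf
```python
BLOCK_UNIT = 512               # dd's "b" suffix means 512-byte units
BLOCK_SIZE = 20 * BLOCK_UNIT   # bs=20b, that is 10240 bytes per read


def copy_blocked(src, dst, block_size=BLOCK_SIZE):
    """Copy src to dst one block-sized read at a time.

    src and dst are binary file objects.  Returns the number of
    bytes copied.  Copying stops at the first empty read.
    """
    total = 0
    while True:
        chunk = src.read(block_size)
        if not chunk:
            break
        dst.write(chunk)
        total += len(chunk)
    return total
```
.fi
.DE
To read a real tape, the source would be opened unbuffered in binary mode,
so that each read request reaches the device with the chosen block size.
.LP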
217: With the system still in single-user mode, run the analysis program
218: .I analyze,
219: e.g.:
220: .IP
221: analyze \-s /dev/drum /vmcore /vmunix
222: .LP
223: and save the output.
224: Then boot up
225: \*(lqvmunix\*(rq
226: and let it do the automatic reboot; e.g., to boot multi-user from
227: an RM03/RM05/RP06 on the MASSBUS, type
228: .IP
229: >>> BOOT RPM
230: .PP
231: After rebooting, to analyze a dump you should execute
232: .I "ps \-alxk"
233: to print the process table at the time of the crash.
234: Use
235: .IR adb (1)
236: to examine
237: .IR /vmcore .
238: The location
239: .I dumpstack\-80000000
240: is the bottom of a stack onto which were pushed the stack pointer
241: .BR sp ,
242: .B PCBB
243: (containing the physical address of a
244: .IR u_area ),
245: .BR MAPEN ,
246: .BR IPL ,
247: and registers
248: .BR r13 \- r0
249: (in that order).
250: .BR r13 (fp)
251: is the system frame pointer and the stack is used in standard
252: .B calls
253: format. Use
254: .IR adb (1)
255: to get a reverse calling order.
256: In most cases this procedure will give
257: an idea of what is wrong.
258: A more complete discussion
259: of system debugging is impossible here.
260: See, however,
261: .IR analyze (8)
262: for some more hints.
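.PP
The expression dumpstack-80000000 above is an address calculation:
a system-space address less the system base gives a position in the dump.
The sketch below spells out that arithmetic and reads one longword.
It assumes that the dump file is a linear image of physical memory starting
at location zero, and that system-space addresses map onto it directly, as
the expression in the text suggests.
The names SYSTEM_BASE, dump_offset and read_longword are hypothetical, and
the address of dumpstack itself must be taken from the kernel symbol table.
.DS
.nf
```python
import struct

SYSTEM_BASE = 0x80000000   # as in the expression dumpstack-80000000


def dump_offset(kernel_address):
    """Translate a system-space address into a byte offset in the dump."""
    if kernel_address < SYSTEM_BASE:
        raise ValueError("not a system-space address")
    return kernel_address - SYSTEM_BASE


def read_longword(dump, kernel_address):
    """Read one 32-bit little-endian longword from an open dump file."""
    dump.seek(dump_offset(kernel_address))
    data = dump.read(4)
    if len(data) != 4:
        raise EOFError("address lies beyond the end of the dump")
    return struct.unpack("<L", data)[0]
```
.fi
.DE
This only locates and reads raw words; interpreting the saved registers
and walking the stack frames is still best done with adb.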
263: .SH "SEE ALSO"
264: analyze(8), reboot(8)
265: .br
266: .I "VAX 11/780 System Maintenance Guide"
267: for more information about machine checks.
268: .SH BUGS