1: .\" Copyright (c) 1980 Regents of the University of California.
2: .\" All rights reserved. The Berkeley software License Agreement
3: .\" specifies the terms and conditions for redistribution.
4: .\"
5: .\" @(#)crash.8 6.3 (Berkeley) 6/24/90
6: .\"
7: .TH CRASH 8V "June 24, 1990"
8: .UC 4
9: .SH NAME
10: crash \- what happens when the system crashes
11: .SH DESCRIPTION
12: This section explains what happens when the system crashes
13: and (very briefly) how to analyze crash dumps.
14: .PP
15: When the system crashes voluntarily it prints a message of the form
16: .IP
17: panic: why i gave up the ghost
18: .LP
19: on the console, takes a dump on a mass storage peripheral,
20: and then invokes an automatic reboot procedure as
21: described in
22: .IR reboot (8).
23: (If auto-reboot is disabled on the front panel of the machine the system
24: will simply halt at this point.)
25: Unless some unexpected inconsistency is encountered in the state
26: of the file systems due to hardware or software failure, the system
27: will then resume multi-user operations.
28: .PP
29: The system has a large number of internal consistency checks; if one
30: of these fails, then it will panic with a very short message indicating
31: which one failed.
32: In many instances, this will be the name of the routine which detected
33: the error, or a two-word description of the inconsistency.
34: A full understanding of most panic messages requires perusal of the
35: source code for the system.
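.PP
As a rough illustration of such a check, here is a minimal user-space sketch in C.
It is not the kernel's code: the names toy_panic, toy_free, free_map and NBLOCKS
are hypothetical, and the stand-in panic routine only records and prints the
message, where a real kernel would go on to dump memory and reboot.
The message text is taken from the list of filesystem panics in this section.

```c
#include <stdio.h>
#include <string.h>

#define NBLOCKS 64                       /* hypothetical size of the toy free map */

static unsigned char free_map[NBLOCKS];  /* 1 = block is free, 0 = allocated */
static char panic_msg[64];               /* last panic message, kept for inspection */

/* Stand-in for the kernel panic routine: record and print the message.
 * A real kernel would take a dump and reboot at this point. */
static void toy_panic(const char *why)
{
    strncpy(panic_msg, why, sizeof(panic_msg) - 1);
    panic_msg[sizeof(panic_msg) - 1] = '\0';
    fprintf(stderr, "panic: %s\n", why);
}

/* Free a block, first checking that it is not already free.
 * Returns 0 on success, -1 if a consistency check failed. */
static int toy_free(int blk)
{
    if (blk < 0 || blk >= NBLOCKS) {
        toy_panic("free: bad block");    /* hypothetical message, for illustration only */
        return -1;
    }
    if (free_map[blk]) {
        /* The routine name plus a short description, as described above. */
        toy_panic("free: freeing free block");
        return -1;
    }
    free_map[blk] = 1;
    return 0;
}
```

Calling toy_free(3) twice in a row succeeds the first time and reports
"panic: free: freeing free block" the second time, which is the shape of
message an operator would see on the console.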
36: .PP
37: The most common cause of system failures is hardware failure, which
38: can manifest itself in different ways. Here are the messages which
39: are most likely, with some hints as to causes.
40: Left unstated in all cases is the possibility that hardware or software
41: error produced the message in some unexpected way.
42: .TP
43: .B iinit
44: This cryptic panic message results from a failure to mount the root filesystem
45: during the bootstrap process.
46: Either the root filesystem has been corrupted,
47: or the system is attempting to use the wrong device as root filesystem.
48: Usually, an alternate copy of the system binary or an alternate root
49: filesystem can be used to bring up the system to investigate.
50: .TP
51: .B Can't exec /sbin/init
52: This is not a panic message, as reboots are likely to be futile.
53: Late in the bootstrap procedure, the system was unable to locate
54: and execute the initialization process,
55: .IR init (8).
56: The root filesystem is incorrect or has been corrupted, or the mode
57: or type of /sbin/init forbids execution.
58: .TP
59: .B IO err in push
60: .ns
61: .TP
62: .B hard IO err in swap
63: The system encountered an error trying to write to the paging device
64: or an error in reading critical information from a disk drive.
65: The offending disk should be fixed if it is broken or unreliable.
66: .TP
67: .B realloccg: bad optim
68: .ns
69: .TP
70: .B ialloc: dup alloc
71: .ns
72: .TP
73: .B alloccgblk: cyl groups corrupted
74: .ns
75: .TP
76: .B ialloccg: map corrupted
77: .ns
78: .TP
79: .B free: freeing free block
80: .ns
81: .TP
82: .B free: freeing free frag
83: .ns
84: .TP
85: .B ifree: freeing free inode
86: .ns
87: .TP
88: .B alloccg: map corrupted
89: These panic messages are among those that may be produced
90: when filesystem inconsistencies are detected.
91: The problem generally results from a failure to repair damaged filesystems
92: after a crash, hardware failures, or some other condition that should not
93: normally occur.
94: A filesystem check will normally correct the problem.
95: .TP
96: .B timeout table overflow
97: .ns
98: This really shouldn't be a panic, but until the data structure
99: involved is made to be extensible, running out of entries causes a crash.
100: If this happens, make the timeout table bigger.
101: .TP
102: .B KSP not valid
103: .ns
104: .TP
105: .B SBI fault
106: .ns
107: .TP
108: .B CHM? in kernel
109: These indicate either a serious bug in the system or, more often,
110: a glitch or failing hardware.
111: If SBI faults recur, check out the hardware or call
112: field service. If the other faults recur, there is likely a bug somewhere
113: in the system, although these can be caused by a flaky processor.
114: Run processor microdiagnostics.
115: .TP
116: .B machine check %x:
117: .I description
118: .ns
119: .TP
120: .I \0\0\0machine dependent machine-check information
121: .ns
122: Machine checks are different on each type of CPU.
123: Most of the internal processor registers are saved at the time of the fault
124: and are printed on the console.
125: For most processors, there is one line that summarizes the type of machine
126: check.
127: Often, the nature of the problem is apparent from this message
128: and/or the contents of key registers.
129: The VAX Hardware Handbook should be consulted,
130: and, if necessary, your friendly field service people should be informed
131: of the problem.
132: .TP
133: .B trap type %d, code=%x, pc=%x
134: An unexpected trap has occurred within the system; the trap types are:
135: .sp
136: .nf
137: 0	reserved addressing fault
138: 1	privileged instruction fault
139: 2	reserved operand fault
140: 3	bpt instruction fault
141: 4	xfc instruction fault
142: 5	system call trap
143: 6	arithmetic trap
144: 7	ast delivery trap
145: 8	segmentation fault
146: 9	protection fault
147: 10	trace trap
148: 11	compatibility mode fault
149: 12	page fault
150: 13	page table fault
151: .fi
152: .sp
153: The favorite trap types in system crashes are trap types 8 and 9,
154: indicating
155: a wild reference. The code is the referenced address, and the pc at the
156: time of the fault is printed. These problems tend to be easy to track
157: down if they are kernel bugs since the processor stops cold, but random
158: flakiness seems to cause this sometimes.
159: The debugger can be used to locate the instruction and subroutine
160: corresponding to the PC value.
161: If that is insufficient to suggest the nature of the problem,
162: more detailed examination of the system status at the time of the trap
163: can usually produce an explanation.
164: .TP
165: .B init died
166: The system initialization process has exited. This is bad news, as no new
167: users will then be able to log in. Rebooting is the only fix, so the
168: system just does it right away.
169: .TP
170: .B out of mbufs: map full
171: The network has exhausted its private page map for network buffers.
172: This usually indicates that buffers are being lost, and rather than
173: allow the system to slowly degrade, it reboots immediately.
174: The map may be made larger if necessary.
175: .PP
176: That completes the list of panic types you are likely to see.
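.PP
When reading console logs, the trap type number in a
"trap type %d, code=%x, pc=%x" message has to be looked up in the table given earlier.
The following C sketch shows one way to do that lookup mechanically.
The function names trap_name and parse_trap_line are hypothetical helpers,
not part of the system, and the sketch assumes the console line has exactly
the format shown in this section.

```c
#include <stdio.h>

/* Trap type names, indexed by trap type number, copied from the table
 * in this manual page. */
static const char *const trap_names[] = {
    "reserved addressing fault",     /* 0 */
    "privileged instruction fault",  /* 1 */
    "reserved operand fault",        /* 2 */
    "bpt instruction fault",         /* 3 */
    "xfc instruction fault",         /* 4 */
    "system call trap",              /* 5 */
    "arithmetic trap",               /* 6 */
    "ast delivery trap",             /* 7 */
    "segmentation fault",            /* 8 */
    "protection fault",              /* 9 */
    "trace trap",                    /* 10 */
    "compatibility mode fault",      /* 11 */
    "page fault",                    /* 12 */
    "page table fault",              /* 13 */
};
#define NTRAPS (sizeof(trap_names) / sizeof(trap_names[0]))

/* Return the name for a trap type, or a fallback string if out of range. */
static const char *trap_name(int type)
{
    if (type < 0 || (size_t)type >= NTRAPS)
        return "unknown trap type";
    return trap_names[type];
}

/* Parse a line of the form "trap type %d, code=%x, pc=%x".
 * Returns 1 if all three fields were read, 0 otherwise. */
static int parse_trap_line(const char *line, int *type,
                           unsigned *code, unsigned *pc)
{
    return sscanf(line, "trap type %d, code=%x, pc=%x", type, code, pc) == 3;
}
```

For a line such as "trap type 9, code=80000004, pc=8000a1c2" (the values here
are made up for illustration), parse_trap_line fills in the three fields and
trap_name(9) yields "protection fault"; the code field is then the wild
address and the pc field is the value to look up in the debugger.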
177: .PP
178: When the system crashes it writes (or at least attempts to write)
179: an image of memory into the back end of the dump device,
180: usually the same as the primary swap
181: area. After the system is rebooted, the program
182: .IR savecore (8)
183: runs and preserves a copy of this core image and the current
184: system in a specified directory for later perusal. See
185: .IR savecore (8)
186: for details.
187: .PP
188: To analyze a dump you should begin by running
189: .IR adb (1)
190: with the
191: .B \-k
192: flag on the system load image and core dump.
193: If the core image is the result of a panic,
194: the panic message is printed.
195: Normally the command
196: ``$c''
197: will provide a stack trace from the point of
198: the crash and this will provide a clue as to
199: what went wrong.
200: A more complete discussion
201: of system debugging is impossible here.
202: See, however,
203: ``Using ADB to Debug the UNIX Kernel''.
204: .SH "SEE ALSO"
205: adb(1),
206: reboot(8)
207: .br
208: .I "VAX 11/780 System Maintenance Guide"
209: and
210: .I "VAX Hardware Handbook"
211: for more information about machine checks.
212: .br
213: .I "Using ADB to Debug the UNIX Kernel"