.TH CRASH 8 VAX-11
.UC 4
.SH NAME
crash \- what happens when the system crashes
.SH DESCRIPTION
This section explains what happens when the system crashes and how
you can get a crash dump for analysis of non-transient problems.
.PP
When the system crashes voluntarily it prints a message of the form
.IP
panic: why i gave up the ghost
.LP
on the console, and then invokes an automatic reboot procedure as
described in
.IR reboot (8).
If the auto-reboot switch is off on the console, then the processor
will simply halt at this point.
Otherwise the registers and the top few locations of the stack will
be printed on the console, and then the system will check the disks
and (unless some unexpected inconsistency is encountered) resume
multi-user operations.
.PP
The system has a large number of internal consistency checks; if one
of these fails, then it will panic with a very short message indicating
which one failed. In the absence of a dump, little can be done about
such a panic. If the problem recurs, you should arrange to get a dump
for further analysis by running with auto-reboot disabled during normal
working hours and then following the procedure described below.
.PP
The most common cause of system failures is hardware failure, which
can show itself in different ways. Here are the messages which
you are likely to encounter, with some hints as to causes.
Left unstated in all cases is the possibility that hardware or software
error produced the message in some unexpected way.
.TP
IO err in push
.ns
.TP
hard IO err in swap
The system encountered an error trying to write to the paging device
or an error in reading critical information from a disk drive.
You should fix your disk if it is broken or unreliable.
.TP
Timeout table overflow
.ns
.TP
ran out of bdp's
.ns
.TP
ran out of uba map
These really shouldn't be panics, but until we fix up the data structures
involved, running out of entries causes a crash. If the timeout table
overflows, you should make it bigger. If you run out of bdp's or uba map
you probably have a buggy device driver in your system, allocating and
not releasing UNIBUS resources.
.TP
KSP not valid
.ns
.TP
SBI fault
.ns
.TP
Machine check
.ns
.TP
CHM? in kernel
These indicate either a serious bug in the system or, more often,
a glitch or failing hardware. For the machine check, the top part of
the resulting stack frame gives more information. You can refer to the
VAX 11/780 System Maintenance Guide for information on machine checks.
If machine checks or SBI faults recur, check out the hardware or call
field service. If the other faults recur, there is likely a bug somewhere
in the system, although these can be caused by a flaky processor.
Run processor microdiagnostics.
.TP
trap type %d, code=%d
An unexpected trap has occurred within the system; the trap types are:
.RS
.TP 10
0
reserved addressing mode
.br
.ns
.TP 10
1
privileged instruction
.br
.ns
.TP 10
2
BPT
.br
.ns
.TP 10
3
XFC
.br
.ns
.TP 10
4
reserved operand
.br
.ns
.TP 10
5
CHMK (system call)
.br
.ns
.TP 10
6
arithmetic trap
.br
.ns
.TP 10
7
reschedule trap (software level 3)
.br
.ns
.TP 10
8
segmentation fault
.br
.ns
.TP 10
9
protection fault
.br
.ns
.TP 10
10
trace pending (TP bit)
.RE
.IP
The most common trap type in system crashes is trap type 9, indicating
a wild reference. The code is the referenced address. If you look
down the stack, just after the trap type and the code are the pc and
the ps of the processor when it trapped, showing you where in the
system the problem occurred. These problems tend to be easy to track
down if they are kernel bugs since the processor stops cold, but random
flakiness seems to cause this sometimes, e.g. we have trapped with
code 80000800 three times in six months as an instruction fetch went across
this page boundary in the kernel but have been unable to find any reason
for this to have happened.
.TP
init died
The system initialization process has exited. This is bad news, as no new
users will then be able to log in. Rebooting is the only fix, so the
system just does it right away.
.PP
That completes the list of panic types you are likely to see.
Now for the crash dump procedure:
.PP
At the moment a dump can be taken only on magnetic tape.
If you plan to make a dump, then before you do anything else be sure
that a clean tape is mounted on the tape drive with the write ring in.
.PP
Write the date and time on the console log.
Use the console commands to examine the registers, the processor status
longword (PSL), and the top several locations on the stack.
A suggested command sequence, which is executed by the \*(lq@DUMP\*(rq
console command script, is:
.DS
.nf
E PSL<return>
E R0/NE:F<return>
E SP<return>
E/V @ /NE:40<return>
.fi
.DE
If hardware problems dictate a special set of commands be executed when
the system crashes, a sequence of commands can be saved using the console
command \*(lqLINK\*(rq to be reexecuted with \*(lqPERFORM\*(rq (which can be
abbreviated \*(lqP\*(rq).
If a dump is to be taken on magnetic tape (this is a good idea
in almost any case where the cause of the crash is not immediately obvious)
then the following commands should be executed:
.DS
.nf
D PSL 0<return>
D PC 80000200<return>
C<return>
.fi
.DE
These commands are actually part of the standard \*(lq@DUMP\*(rq script.
This should write a copy of all of memory
on the tape, followed by two EOF marks.
Caution:
Any error is taken to mean the end of memory has been reached.
This means that you must be sure the ring is in,
the tape is ready, and the tape is clean and new.
.PP
If there are not 40(hex) locations active on the kernel stack when the
procedure is begun, then the console may begin to print error diagnostics.
You can stop this by hitting \*(lq^C\*(rq (control-C), and then give the
last three commands above.
.PP
If the dump fails, you can try again,
but some of the registers will be lost.
See below for what to do with the tape.
.PP
To restart after a crash, follow the directions in
.IR reboot (8);
if the virtual memory subsystem is suspected as the cause of the crash,
then a version of the system other than \*(lqvmunix\*(rq should be booted
which will leave the paging areas temporarily intact
for use by the post-mortem analysis program
.IR analyze .
After checking your root file system consistency with
.IR fsck (8),
you can read the core dump tape into the file /vmcore with
.IP
dd if=/dev/rmt0 of=/vmcore bs=20b
.LP
It does not work to use just
.IR cp (1),
as the tape is blocked.
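.PP
As a rough illustration of the block size in the dd command above:
the \*(lqb\*(rq suffix multiplies by 512, so bs=20b asks dd to transfer
10240 bytes at a time.
The sketch below only works out that arithmetic.
The 1 megabyte memory size is a hypothetical figure chosen for the example,
not a property of any particular machine, and the assumption that the dump
divides into records of exactly this size is not confirmed here.

```shell
# dd's "b" suffix means 512-byte blocks, so bs=20b is 20 * 512 bytes.
blocks_per_record=20
bytes_per_block=512
record_bytes=$((blocks_per_record * bytes_per_block))

# Hypothetical memory size, for illustration only: 1 megabyte.
memory_bytes=$((1024 * 1024))

# Transfers needed to cover that much memory, rounding up so that a
# partial final record is still counted.
records=$(( (memory_bytes + record_bytes - 1) / record_bytes ))

echo "transfer size: $record_bytes bytes"
echo "transfers for $memory_bytes bytes of memory: $records"
```

Comparing the size of /vmcore against the amount of memory on the machine
is one plausible way to notice a dump that ended early, since any tape
error is taken to mean the end of memory.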
With the system still in single-user mode, run the analysis program
.IR analyze ,
e.g.:
.IP
analyze \-s /dev/drum /vmcore /vmunix
.LP
and save the output.
Then boot up
\*(lqvmunix\*(rq
and let it do the automatic reboot, e.g. to boot multi-user from
an RM03/RM05/RP06 on the MASSBUS
.IP
>>> BOOT RPM
.PP
After rebooting, to analyze a dump you should execute
.I "ps \-alxk"
to print the process table at the time of the crash.
Use
.IR adb (1)
to examine
.IR /vmcore .
The location
.I dumpstack\-80000000
is the bottom of a stack onto which were pushed the stack pointer
.BR sp ,
.B PCBB
(containing the physical address of a
.IR u_area ),
.BR MAPEN ,
.BR IPL ,
and registers
.BR r13 \- r0
(in that order).
.BR r13 (fp)
is the system frame pointer and the stack is used in standard
.B calls
format. Use
.IR adb (1)
to get a reverse calling order.
In most cases this procedure will give
an idea of what is wrong.
A more complete discussion
of system debugging is impossible here.
See, however,
.IR analyze (8)
for some more hints.
.SH "SEE ALSO"
analyze(8), reboot(8)
.br
.I "VAX 11/780 System Maintenance Guide"
for more information about machine checks.
.SH BUGS