Annotation of researchv10dc/man/manb/crash.8, revision 1.1.1.1

1.1       root        1: .TH CRASH 8 VAX-11
                      2: .UC 4
                      3: .SH NAME
                      4: crash \- what happens when the system crashes
                      5: .SH DESCRIPTION
                      6: This section explains what happens when the system crashes and how
                      7: you can get a crash dump for analysis of non-transient problems.
                      8: .PP
                      9: When the system crashes voluntarily it prints a message of the form
                     10: .IP
                     11: panic: why i gave up the ghost
                     12: .LP
                     13: on the console, and then invokes an automatic reboot procedure as
                     14: described in
                     15: .IR reboot (8).
                     16: If the auto-reboot switch is off on the console, then the processor
                     17: will simply halt at this point.
                     18: Otherwise the registers and the top few locations of the stack will
                     19: be printed on the console, and then the system will check the disks
                     20: and (unless some unexpected inconsistency is encountered) resume
                     21: multi-user operations.
                     22: .PP
                     23: The system has a large number of internal consistency checks; if one
                     24: of these fails, then it will panic with a very short message indicating
                     25: which one failed.  In the absence of a dump, little can be done about
                     26: one of these.  If the problem recurs, you should arrange to get a dump
                     27: for further analysis by running with auto-reboot disabled during normal
                     28: working hours and then following the procedure described below.
                     29: .PP
                     30: The most common cause of system failures is hardware failure, which
                     31: can reflect itself in different ways.  Here are the messages which
                     32: you are likely to encounter, with some hints as to causes.
                     33: Left unstated in all cases is the possibility that hardware or software
                     34: error produced the message in some unexpected way.
                     35: .TP
                     36: IO err in push
                     37: .ns
                     38: .TP
                     39: hard IO err in swap
                     40: The system encountered an error trying to write to the paging device
                     41: or an error in reading critical information from a disk drive.
                     42: You should fix your disk if it is broken or unreliable.
                     43: .TP
                     44: Timeout table overflow
                     45: .ns
                     46: .TP
                     47: ran out of bdp's
                     48: .ns
                     49: .TP
                     50: ran out of uba map
                     51: These really shouldn't be panics, but until we fix up the data structures
                     52: involved, running out of entries causes a crash.  If the timeout table
                     53: overflows, you should make it bigger.  If you run out of bdp's or uba map
                     54: you probably have a buggy device driver in your system, allocating and
                     55: not releasing UNIBUS resources.
                     56: .TP
                     57: KSP not valid
                     58: .ns
                     59: .TP
                     60: SBI fault
                     61: .ns
                     62: .TP
                     63: Machine check
                     64: .ns
                     65: .TP
                     66: CHM? in kernel
                     67: These indicate either a serious bug in the system or, more often,
                     68: a glitch or failing hardware.  For the machine check, the top part of
                     69: the resulting stack frame gives more information.  You can refer to a
                     70: VAX 11/780 System Maintenance Guide for information on machine checks.
                     71: If machine checks or SBI faults recur, check out the hardware or call
                     72: field service.  If the other faults recur, there is likely a bug somewhere
                     73: in the system, although these can be caused by a flaky processor.
                     74: Run processor microdiagnostics.
                     75: .TP
                     76: trap type %d, code=%d
                     77: An unexpected trap has occurred within the system; the trap types are:
                     78: .RS
                     79: .TP 10
                     80: 0
                     81: reserved addressing mode
                     82: .br
                     83: .ns
                     84: .TP 10
                     85: 1
                     86: privileged instruction
                     87: .br
                     88: .ns
                     89: .TP 10
                     90: 2
                     91: BPT
                     92: .br
                     93: .ns
                     94: .TP 10
                     95: 3
                     96: XFC
                     97: .br
                     98: .ns
                     99: .TP 10
                    100: 4
                    101: reserved operand
                    102: .br
                    103: .ns
                    104: .TP 10
                    105: 5
                    106: CHMK (system call)
                    107: .br
                    108: .ns
                    109: .TP 10
                    110: 6
                    111: arithmetic trap
                    112: .br
                    113: .ns
                    114: .TP 10
                    115: 7
                    116: reschedule trap (software level 3)
                    117: .br
                    118: .ns
                    119: .TP 10
                    120: 8
                    121: segmentation fault
                    122: .br
                    123: .ns
                    124: .TP 10
                    125: 9
                    126: protection fault
                    127: .br
                    128: .ns
                    129: .TP 10
                    130: 10
                    131: trace pending (TP bit)
                    132: .RE
                    133: .IP
                    134: The favorite trap type in system crashes is trap type 9, indicating
                    135: a wild reference.  The code is the referenced address.  If you look
                    136: down the stack, just after the trap type and the code are the pc and
                    137: the ps of the processor when it trapped, showing you where in the
                    138: system the problem occurred.  These problems tend to be easy to track
                    139: down if they are kernel bugs since the processor stops cold, but random
                    140: flakiness seems to cause this sometimes, e.g. we have trapped with
                    141: code 80000800 three times in six months as an instruction fetch went across
                    142: this page boundary in the kernel but have been unable to find any reason
                    143: for this to have happened.
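The trap type table above can be turned into a small lookup for reading console logs. The following is a minimal sketch, not part of the system: the names TRAP_TYPES, trap_name and on_page_boundary are hypothetical, the code value is assumed to be printed in hexadecimal (as the 80000800 example suggests), and the 512-byte VAX page size is used to check whether a referenced address sits on a page boundary.

```python
# Trap type numbers and descriptions, copied from the table above.
TRAP_TYPES = {
    0: "reserved addressing mode",
    1: "privileged instruction",
    2: "BPT",
    3: "XFC",
    4: "reserved operand",
    5: "CHMK (system call)",
    6: "arithmetic trap",
    7: "reschedule trap (software level 3)",
    8: "segmentation fault",
    9: "protection fault",
    10: "trace pending (TP bit)",
}

VAX_PAGE_SIZE = 512  # bytes per VAX page


def trap_name(trap_type):
    """Return the description for a trap type number (hypothetical helper)."""
    return TRAP_TYPES.get(trap_type, "unknown trap type")


def on_page_boundary(address):
    """True if the address is the first byte of a VAX page."""
    return address % VAX_PAGE_SIZE == 0


if __name__ == "__main__":
    # Assumption: the code from the panic message is read as hexadecimal.
    code = int("80000800", 16)
    print(trap_name(9), hex(code), on_page_boundary(code))
```

For the example in the text, trap type 9 maps to a protection fault and the code 80000800 does fall on a page boundary, which is consistent with an instruction fetch crossing into a new page.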
                    144: .TP
                    145: init died
                    146: The system initialization process has exited.  This is bad news, as no new
                    147: users will then be able to log in.  Rebooting is the only fix, so the
                    148: system just does it right away.
                    149: .PP
                    150: That completes the list of panic types you are likely to see.
                    151: Now for the crash dump procedure:
                    152: .PP
                    153: At the moment a dump can be taken only on magnetic tape.
                    154: Before you do anything, be sure that a clean tape, with its write ring in,
                    155: is mounted on the tape drive if you plan to make a dump.
                    156: .PP
                    157: Write the date and time on the console log.
                    158: Use the console commands to examine the registers, processor status longword,
                    159: and the top several locations on the stack.
                    160: A suggested command sequence, which is executed by the \*(lq@DUMP\*(rq
                    161: console command script, is:
                    162: .DS
                    163: .nf
                    164:        E PSL<return>
                    165:        E R0/NE:F<return>
                    166:        E SP<return>
                    167:        E/V @ /NE:40<return>
                    168: .fi
                    169: .DE
                    170: If hardware problems dictate a special set of commands be executed when
                    171: the system crashes, a sequence of commands can be saved using the console
                    172: command \*(lqLINK\*(rq to be reexecuted with \*(lqPERFORM\*(rq (which can be
                    173: abbreviated \*(lqP\*(rq).
                    174: If a dump is to be taken on magnetic tape (this is a good idea
                    175: in most any case where the cause of the crash is not immediately obvious)
                    176: then the following commands will (should) be executed:
                    177: .DS
                    178: .nf
                    179:        D PSL 0<return>
                    180:        D PC 80000200<return>
                    181:        C<return>
                    182: .fi
                    183: .DE
                    184: These commands are actually part of the standard \*(lq@DUMP\*(rq script.
                    185: This should write a copy of all of memory
                    186: on the tape, followed by two EOF marks.
                    187: Caution:
                    188: Any error is taken to mean the end of memory has been reached.
                    189: This means that you must be sure the ring is in,
                    190: the tape is ready, and the tape is clean and new.
                    191: .PP
                    192: If there are not 40(hex) locations active on the kernel stack when the
                    193: procedure is begun, then the console may begin to print error diagnostics.
                    194: You can stop this by hitting \*(lq^C\*(rq (control-C), and then give the
                    195: last three commands above.
                    196: .PP
                    197: If the dump fails, you can try again,
                    198: but some of the registers will be lost.
                    199: See below for what to do with the tape.
                    200: .PP
                    201: To restart after a crash, follow the directions in
                    202: .IR reboot (8);
                    203: if the virtual memory subsystem is suspected as the cause of the crash,
                    204: then a version of the system other than \*(lqvmunix\*(rq should be booted
                    205: which will leave the paging areas temporarily intact
                    206: for use by the post-mortem analysis program
                    207: .IR analyze .
                    208: After checking your root file system consistency with
                    209: .IR fsck (8),
                    210: you can read the core dump tape into the file /vmcore with
                    211: .IP
                    212: dd if=/dev/rmt0 of=/vmcore bs=20b
                    213: .LP
                    214: It does not work to use just
                    215: .IR cp (1),
                    216: as the tape is blocked.
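The bs=20b argument asks dd for records of 20 blocks of 512 bytes, that is 10240 bytes per read. The sketch below illustrates the same blocked copy; it is only an illustration, the name copy_blocked is hypothetical, and it assumes the source is a file object whose read returns at most one record per call (for a raw tape device this would mean opening it unbuffered).

```python
import io

DD_BLOCK = 512              # size of one "b" unit in dd
BLOCK_SIZE = 20 * DD_BLOCK  # bs=20b, i.e. 10240 bytes


def copy_blocked(src, dst, block_size=BLOCK_SIZE):
    """Copy src to dst one record at a time; return the byte count.

    Hypothetical helper: each read requests a full record, which is
    what a blocked tape needs and what a plain byte copy does not do.
    """
    total = 0
    while True:
        record = src.read(block_size)
        if not record:          # end of file (tape mark) reached
            break
        dst.write(record)
        total += len(record)
    return total


if __name__ == "__main__":
    # Stand-in for the tape: two and a half records of in-memory data.
    fake_tape = io.BytesIO(b"\0" * (BLOCK_SIZE * 2 + BLOCK_SIZE // 2))
    core = io.BytesIO()
    print(copy_blocked(fake_tape, core), "bytes copied")
```

In practice dd remains the tool to use; the sketch only shows why the record size matters.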
                    217: With the system still in single-user mode, run the analysis program
                    218: .IR analyze ,
                    219: e.g.:
                    220: .IP
                    221: analyze \-s /dev/drum /vmcore /vmunix
                    222: .LP
                    223: and save the output.
                    224: Then boot up
                    225: \*(lqvmunix\*(rq
                    226: and let it do the automatic reboot, i.e. to boot multi-user from
                    227: an RM03/RM05/RP06 on the MASSBUS:
                    228: .IP
                    229: >>> BOOT RPM
                    230: .PP
                    231: After rebooting, to analyze a dump you should execute
                    232: .I "ps \-alxk"
                    233: to print the process table at the time of the crash.
                    234: Use
                    235: .IR adb (1)
                    236: to examine
                    237: .IR /vmcore .
                    238: The location
                    239: .I dumpstack\-80000000
                    240: is the bottom of a stack onto which were pushed the stack pointer
                    241: .BR sp ,
                    242: .B PCBB
                    243: (containing the physical address of a
                    244: .IR u_area ),
                    245: .BR MAPEN ,
                    246: .BR IPL ,
                    247: and registers
                    248: .BR r13 \- r0
                    249: (in that order).
                    250: .BR r13 (fp)
                    251: is the system frame pointer and the stack is used in standard
                    252: .B calls
                    253: format.  Use
                    254: .IR  adb (1)
                    255: to get a reverse calling order.
                    256: In most cases this procedure will give
                    257: an idea of what is wrong.
                    258: A more complete discussion
                    259: of system debugging is impossible here.
                    260: See, however,
                    261: .IR analyze (8)
                    262: for some more hints.
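The expression dumpstack\-80000000 above converts a kernel virtual address into an offset in the dump, since the dump is a copy of physical memory and system space begins at 80000000 (hex). A minimal sketch of that arithmetic follows. The helper names are hypothetical, the value of dumpstack must be taken from the namelist of the system that crashed, and the exact position of each saved value relative to dumpstack is not assumed here: the helper simply returns the longwords stored below a given address, nearest first.

```python
import io
import struct

SYSTEM_BASE = 0x80000000  # start of VAX system (kernel) virtual space


def core_offset(kernel_va):
    """Translate a kernel virtual address to a byte offset in the dump."""
    if kernel_va < SYSTEM_BASE:
        raise ValueError("not a system-space address")
    return kernel_va - SYSTEM_BASE


def longwords_below(core, kernel_va, count):
    """Return `count` 32-bit longwords stored just below kernel_va.

    The first element is the longword at kernel_va - 4, the next at
    kernel_va - 8, and so on. VAX longwords are little-endian.
    """
    start = core_offset(kernel_va) - 4 * count
    if start < 0:
        raise ValueError("range starts before the beginning of the dump")
    core.seek(start)
    data = core.read(4 * count)
    if len(data) != 4 * count:
        raise EOFError("dump is shorter than the requested range")
    words = struct.unpack("<%dI" % count, data)
    return list(reversed(words))


if __name__ == "__main__":
    # Stand-in for /vmcore: sixteen longwords holding the values 0..15.
    fake_core = io.BytesIO(struct.pack("<16I", *range(16)))
    print(longwords_below(fake_core, SYSTEM_BASE + 32, 3))
```

With a real dump the file object would come from opening /vmcore in binary mode, and the address would be the value of the dumpstack symbol.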
                    263: .SH "SEE ALSO"
                    264: analyze(8), reboot(8)
                    265: .br
                    266: .I "VAX 11/780 System Maintenance Guide"
                    267: for more information about machine checks.
                    268: .SH BUGS
