Annotation of researchv10dc/man/manb/crash.8, revision 1.1

1.1     ! root        1: .TH CRASH 8 VAX-11
        !             2: .UC 4
        !             3: .SH NAME
        !             4: crash \- what happens when the system crashes
        !             5: .SH DESCRIPTION
        !             6: This section explains what happens when the system crashes and how
        !             7: you can get a crash dump for analysis of non-transient problems.
        !             8: .PP
        !             9: When the system crashes voluntarily it prints a message of the form
        !            10: .IP
        !            11: panic: why i gave up the ghost
        !            12: .LP
        !            13: on the console, and then invokes an automatic reboot procedure as
        !            14: described in
        !            15: .IR reboot (8).
        !            16: If the auto-reboot switch is off on the console, then the processor
        !            17: will simply halt at this point.
        !            18: Otherwise the registers and the top few locations of the stack will
        !            19: be printed on the console, and then the system will check the disks
        !            20: and (unless some unexpected inconsistency is encountered) resume
        !            21: multi-user operations.
        !            22: .PP
        !            23: The system has a large number of internal consistency checks; if one
        !            24: of these fails, then it will panic with a very short message indicating
        !            25: which one failed.  In the absence of a dump, little can be done about
        !            26: one of these.  If the problem recurs, you should arrange to get a dump
        !            27: for further analysis by running with auto-reboot disabled during normal
        !            28: working hours and then following the procedure described below.
        !            29: .PP
        !            30: The most common cause of system failures is hardware failure, which
        !            31: can manifest itself in different ways.  Here are the messages which
        !            32: you are likely to encounter, with some hints as to causes.
        !            33: Left unstated in all cases is the possibility that hardware or software
        !            34: error produced the message in some unexpected way.
        !            35: .TP
        !            36: IO err in push
        !            37: .ns
        !            38: .TP
        !            39: hard IO err in swap
        !            40: The system encountered an error trying to write to the paging device
        !            41: or an error in reading critical information from a disk drive.
        !            42: You should fix your disk if it is broken or unreliable.
        !            43: .TP
        !            44: Timeout table overflow
        !            45: .ns
        !            46: .TP
        !            47: ran out of bdp's
        !            48: .ns
        !            49: .TP
        !            50: ran out of uba map
        !            51: These really shouldn't be panics, but until we fix up the data structures
        !            52: involved, running out of entries causes a crash.  If the timeout table
        !            53: overflows, you should make it bigger.  If you run out of bdp's or uba map
        !            54: you probably have a buggy device driver in your system, allocating and
        !            55: not releasing UNIBUS resources.
        !            56: .TP
        !            57: KSP not valid
        !            58: .ns
        !            59: .TP
        !            60: SBI fault
        !            61: .ns
        !            62: .TP
        !            63: Machine check
        !            64: .ns
        !            65: .TP
        !            66: CHM? in kernel
        !            67: These indicate either a serious bug in the system or, more often,
        !            68: a glitch or failing hardware.  For the machine check, the top part of
        !            69: the resulting stack frame gives more information.  You can refer to a
        !            70: VAX 11/780 System Maintenance Guide for information on machine checks.
        !            71: If machine checks or SBI faults recur, check out the hardware or call
        !            72: field service.  If the other faults recur, there is likely a bug somewhere
        !            73: in the system, although these can be caused by a flaky processor.
        !            74: Run processor microdiagnostics.
        !            75: .TP
        !            76: trap type %d, code=%d
        !            77: An unexpected trap has occurred within the system; the trap types are:
        !            78: .RS
        !            79: .TP 10
        !            80: 0
        !            81: reserved addressing mode
        !            82: .br
        !            83: .ns
        !            84: .TP 10
        !            85: 1
        !            86: privileged instruction
        !            87: .br
        !            88: .ns
        !            89: .TP 10
        !            90: 2
        !            91: BPT
        !            92: .br
        !            93: .ns
        !            94: .TP 10
        !            95: 3
        !            96: XFC
        !            97: .br
        !            98: .ns
        !            99: .TP 10
        !           100: 4
        !           101: reserved operand
        !           102: .br
        !           103: .ns
        !           104: .TP 10
        !           105: 5
        !           106: CHMK (system call)
        !           107: .br
        !           108: .ns
        !           109: .TP 10
        !           110: 6
        !           111: arithmetic trap
        !           112: .br
        !           113: .ns
        !           114: .TP 10
        !           115: 7
        !           116: reschedule trap (software level 3)
        !           117: .br
        !           118: .ns
        !           119: .TP 10
        !           120: 8
        !           121: segmentation fault
        !           122: .br
        !           123: .ns
        !           124: .TP 10
        !           125: 9
        !           126: protection fault
        !           127: .br
        !           128: .ns
        !           129: .TP 10
        !           130: 10
        !           131: trace pending (TP bit)
        !           132: .RE
        !           133: .IP
        !           134: The favorite trap type in system crashes is trap type 9, indicating
        !           135: a wild reference.  The code is the referenced address.  If you look
        !           136: down the stack, just after the trap type and the code are the pc and
        !           137: the ps of the processor when it trapped, showing you where in the
        !           138: system the problem occurred.  These problems tend to be easy to track
        !           139: down if they are kernel bugs since the processor stops cold, but random
        !           140: flakiness seems to cause this sometimes, e.g. we have trapped with
        !           141: code 80000800 three times in six months as an instruction fetch went across
        !           142: this page boundary in the kernel but have been unable to find any reason
        !           143: for this to have happened.
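As a convenience when reading console logs, the trap type table above can be kept as a small lookup. The following is only a sketch: the function names are hypothetical, and the sample console line in the usage note is made up for illustration rather than taken from a real crash.

```python
import re

# Trap type numbers and names, copied from the table above.
TRAP_TYPES = {
    0: "reserved addressing mode",
    1: "privileged instruction",
    2: "BPT",
    3: "XFC",
    4: "reserved operand",
    5: "CHMK (system call)",
    6: "arithmetic trap",
    7: "reschedule trap (software level 3)",
    8: "segmentation fault",
    9: "protection fault",
    10: "trace pending (TP bit)",
}


def describe_trap(trap_type):
    """Return the name of a trap type, or a placeholder if it is not listed."""
    return TRAP_TYPES.get(trap_type, "unknown trap type %d" % trap_type)


def trap_type_from_panic(line):
    """Pull the trap type number out of a 'trap type N, code=...' line.

    Returns None if the line does not look like a trap panic message.
    The code field is deliberately left unparsed.
    """
    match = re.search(r"trap type (\d+),", line)
    return int(match.group(1)) if match else None
```

For example, describe_trap(trap_type_from_panic("trap type 9, code=80000800")) gives "protection fault" for that hypothetical line.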
        !           144: .TP
        !           145: init died
        !           146: The system initialization process has exited.  This is bad news, as no new
        !           147: users will then be able to log in.  Rebooting is the only fix, so the
        !           148: system just does it right away.
        !           149: .PP
        !           150: That completes the list of panic types you are likely to see.
        !           151: Now for the crash dump procedure:
        !           152: .PP
        !           153: At the moment a dump can be taken only on magnetic tape.
        !           154: If you plan to make a dump, first be sure that a clean tape is mounted
        !           155: on the tape drive with the write ring in.
        !           156: .PP
        !           157: Write the date and time on the console log.
        !           158: Use the console commands to examine the registers, program status long word,
        !           159: and the top several locations on the stack.
        !           160: A suggested command sequence, which is executed by the \*(lq@DUMP\*(rq
        !           161: console command script, is:
        !           162: .DS
        !           163: .nf
        !           164:        E PSL<return>
        !           165:        E R0/NE:F<return>
        !           166:        E SP<return>
        !           167:        E/V @ /NE:40<return>
        !           168: .fi
        !           169: .DE
        !           170: If hardware problems dictate a special set of commands be executed when
        !           171: the system crashes, a sequence of commands can be saved using the console
        !           172: command \*(lqLINK\*(rq to be reexecuted with \*(lqPERFORM\*(rq (which can be
        !           173: abbreviated \*(lqP\*(rq).
        !           174: If a dump is to be taken on magnetic tape (this is a good idea
        !           175: in most any case where the cause of the crash is not immediately obvious)
        !           176: then the following commands will (should) be executed:
        !           177: .DS
        !           178: .nf
        !           179:        D PSL 0<return>
        !           180:        D PC 80000200<return>
        !           181:        C<return>
        !           182: .fi
        !           183: .DE
        !           184: These commands are actually part of the standard \*(lq@DUMP\*(rq script.
        !           185: This should write a copy of all of memory
        !           186: on the tape, followed by two EOF marks.
        !           187: Caution:
        !           188: Any error is taken to mean the end of memory has been reached.
        !           189: This means that you must be sure the ring is in,
        !           190: the tape is ready, and the tape is clean and new.
        !           191: .PP
        !           192: If there are not 40(hex) locations active on the kernel stack when the
        !           193: procedure is begun, then the console may begin to print error diagnostics.
        !           194: You can stop this by hitting \*(lq^C\*(rq (control-C), and then give the
        !           195: last three commands above.
        !           196: .PP
        !           197: If the dump fails, you can try again,
        !           198: but some of the registers will be lost.
        !           199: See below for what to do with the tape.
        !           200: .PP
        !           201: To restart after a crash, follow the directions in
        !           202: .IR reboot (8);
        !           203: if the virtual memory subsystem is suspected as the cause of the crash,
        !           204: then a version of the system other than \*(lqvmunix\*(rq should be booted
        !           205: which will leave the paging areas temporarily intact
        !           206: for use by the post-mortem analysis program
        !           207: .I analyze.
        !           208: After checking your root file system consistency with
        !           209: .IR fsck (8),
        !           210: you can read the core dump tape into the file /vmcore with
        !           211: .IP
        !           212: dd if=/dev/rmt0 of=/vmcore bs=20b
        !           213: .LP
        !           214: It does not work to use just
        !           215: .IR cp (1),
        !           216: as the tape is blocked.
        !           217: With the system still in single-user mode, run the analysis program
        !           218: .I analyze,
        !           219: e.g.:
        !           220: .IP
        !           221: analyze \-s /dev/drum /vmcore /vmunix
        !           222: .LP
        !           223: and save the output.
        !           224: Then boot up
        !           225: \*(lqvmunix\*(rq
        !           226: and let it do the automatic reboot, e.g. to boot multi-user from
        !           227: an RM03/RM05/RP06 on the MASSBUS:
        !           228: .IP
        !           229: >>> BOOT RPM
        !           230: .PP
        !           231: After rebooting, to analyze a dump you should execute
        !           232: .I "ps \-alxk"
        !           233: to print the process table at the time of the crash.
        !           234: Use
        !           235: .IR adb (1)
        !           236: to examine
        !           237: .IR /vmcore .
        !           238: The location
        !           239: .I dumpstack\-80000000
        !           240: is the bottom of a stack onto which were pushed the stack pointer
        !           241: .BR sp ,
        !           242: .B PCBB
        !           243: (containing the physical address of a
        !           244: .IR u_area ),
        !           245: .BR MAPEN ,
        !           246: .BR IPL ,
        !           247: and registers
        !           248: .BR r13 \- r0
        !           249: (in that order).
        !           250: .BR r13 (fp)
        !           251: is the system frame pointer and the stack is used in standard
        !           252: .B calls
        !           253: format.  Use
        !           254: .IR  adb (1)
        !           255: to get a reverse calling order.
        !           256: In most cases this procedure will give
        !           257: an idea of what is wrong.
        !           258: A more complete discussion
        !           259: of system debugging is impossible here.
        !           260: See, however,
        !           261: .IR analyze (8)
        !           262: for some more hints.
        !           263: .SH "SEE ALSO"
        !           264: analyze(8), reboot(8)
        !           265: .br
        !           266: .I "VAX 11/780 System Maintenance Guide"
        !           267: for more information about machine checks.
        !           268: .SH BUGS
