| Copyright (c) 1988 Regents of the University of California.
| All rights reserved.
|
| Redistribution and use in source and binary forms are permitted
| provided that: (1) source distributions retain this entire copyright
| notice and comment, and (2) distributions including binaries display
| the following acknowledgement:  ``This product includes software
| developed by the University of California, Berkeley and its contributors''
| in the documentation or other materials provided with the distribution
| and in all advertising materials mentioning features or use of this
| software. Neither the name of the University nor the names of its
| contributors may be used to endorse or promote products derived
| from this software without specific prior written permission.
| THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
| IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
| WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
|	@(#)oc_cksum.s	7.1 (Berkeley) 5/8/90
|
|
| oc_cksum: ones complement 16 bit checksum for MC68020.
|
| oc_cksum (buffer, count, strtval)
|
| Do a 16 bit one's complement sum of 'count' bytes from 'buffer'.
| 'strtval' is the starting value of the sum (usually zero).
|
| It simplifies life in in_cksum if strtval can be >= 2^16.
| This routine will work as long as strtval is < 2^31.
|
| Performance
| -----------
| This routine is intended for MC 68020s but should also work
| for 68030s.  It (deliberately) doesn't worry about the alignment
| of the buffer so will only work on a 68010 if the buffer is
| aligned on an even address.  (Also, a routine written to use
| 68010 "loop mode" would almost certainly be faster than this
| code on a 68010).
|
| We don't worry about alignment because this routine is frequently
| called with small counts: 20 bytes for IP header checksums and 40
| bytes for TCP ack checksums.  For these small counts, testing for
| bad alignment adds ~10% to the per-call cost.  Since, by the nature
| of the kernel's allocator, the data we're called with is almost
| always longword aligned, there is no benefit to this added cost
| and we're better off letting the loop take a big performance hit
| in the rare cases where we're handed an unaligned buffer.
|
| Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
| tested on random data on four different types of processors (see
| list below -- 64 was the largest unrolling because anything more
| overflows the 68020 Icache).  On all the processors, the
| throughput asymptote was located between 8 and 16 (closer to 8).
| However, 16 was substantially better than 8 for small counts.
| (It's clear why this happens for a count of 40: unroll-8 pays a
| loop branch cost and unroll-16 doesn't.  But the tests also showed
| that 16 was better than 8 for a count of 20.  It's not obvious to
| me why.)  So, since 16 was good for both large and small counts,
| the loop below is unrolled 16 times.
|
| The processors tested and their average time to checksum 1024 bytes
| of random data were:
|	Sun 3/50  (15MHz)	190 us/KB
|	Sun 3/180 (16.6MHz)	175 us/KB
|	Sun 3/60  (20MHz)	134 us/KB
|	Sun 3/280 (25MHz)	 95 us/KB
|
| The cost of calling this routine was typically 10% of the per-
| kilobyte cost.  E.g., checksumming zero bytes on a 3/60 cost 9us
| and each additional byte cost 125ns.  With the high fixed cost,
| it would clearly be a gain to "inline" this routine -- the
| subroutine call adds 400% overhead to an IP header checksum.
| However, in absolute terms, inlining would only gain 10us per
| packet -- a 1% effect for a 1ms ethernet packet.  This is not
| enough gain to be worth the effort.

	.data
	.asciz	"@(#)$Header: oc_cksum.s,v 1.1 89/08/23 12:53:20 mike Exp $"
	.even
	.text

	.globl	_oc_cksum
_oc_cksum:
	movl	sp@(4),a0	| get buffer ptr
	movl	sp@(8),d1	| get byte count
	movl	sp@(12),d0	| get starting value
	movl	d2,sp@-		| free a reg

	| test for possible 1, 2 or 3 bytes of excess at end
	| of buffer.  The usual case is no excess (the usual
	| case is header checksums) so we give that the faster
	| 'not taken' leg of the compare.  (We do the excess
	| first because we're about to trash the low order
	| bits of the count in d1.)

	btst	#0,d1
	jne	L5		| if one or three bytes excess
	btst	#1,d1
	jne	L7		| if two bytes excess
L1:
	movl	d1,d2
	lsrl	#6,d1		| make cnt into # of 64 byte chunks
	andl	#0x3c,d2	| then find fractions of a chunk
	negl	d2
	andb	#0xf,cc		| clear X
	jmp	pc@(L3-.-2:b,d2)
L2:
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
L3:
	dbra	d1,L2		| (NB- dbra doesn't affect X)

	movl	d0,d1		| fold 32 bit sum to 16 bits
	swap	d1		| (NB- swap doesn't affect X)
	addxw	d1,d0
	jcc	L4
	addw	#1,d0
L4:
	andl	#0xffff,d0
	movl	sp@+,d2
	rts

L5:	| deal with 1 or 3 excess bytes at the end of the buffer.
	btst	#1,d1
	jeq	L6		| if 1 excess

	| 3 bytes excess
	clrl	d2
	movw	a0@(-3,d1:l),d2	| add in last full word then drop
	addl	d2,d0		|  through to pick up last byte

L6:	| 1 byte excess
	clrl	d2
	movb	a0@(-1,d1:l),d2
	lsll	#8,d2
	addl	d2,d0
	jra	L1

L7:	| 2 bytes excess
	clrl	d2
	movw	a0@(-2,d1:l),d2
	addl	d2,d0
	jra	L1
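
| Worked example (editorial sketch, not part of the original source;
| the input values are chosen purely for illustration).
|
| Dispatch at L1, assuming each movl/addxl pair assembles to 4 bytes
| of code, which the unscaled use of d2 as a jump offset implies:
|	count = 20:	d1 = 20 >> 6 = 0, d2 = 20 & 0x3c = 20
|			jump to L3-20: 5 pairs sum 20 bytes, then
|			dbra falls through
|	count = 100:	d1 = 100 >> 6 = 1, d2 = 100 & 0x3c = 36
|			jump to L3-36: 9 pairs sum 36 bytes, then
|			dbra loops once through L2 for 16 more
|			pairs (64 bytes)
|
| Fold after L3, assuming d0 = 0x8001ffff and X clear on exit
| from the loop:
|	movl d0,d1	d1 = 0x8001ffff
|	swap d1		d1 = 0xffff8001
|	addxw d1,d0	0xffff + 0x8001 = 0x18000, so the low word
|			of d0 is 0x8000 and carry is set
|	addw #1,d0	end-around carry: low word becomes 0x8001
|	andl #0xffff,d0	result returned in d0 = 0x00008001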