|
|
| Copyright (c) 1988 Regents of the University of California.
| All rights reserved.
|
| Redistribution and use in source and binary forms are permitted
| provided that: (1) source distributions retain this entire copyright
| notice and comment, and (2) distributions including binaries display
| the following acknowledgement: ``This product includes software
| developed by the University of California, Berkeley and its contributors''
| in the documentation or other materials provided with the distribution
| and in all advertising materials mentioning features or use of this
| software. Neither the name of the University nor the names of its
| contributors may be used to endorse or promote products derived
| from this software without specific prior written permission.
| THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
| IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
| WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
|
| @(#)oc_cksum.s 7.1 (Berkeley) 5/8/90
|
|
| oc_cksum: ones complement 16 bit checksum for MC68020.
|
| oc_cksum (buffer, count, strtval)
|
| Do a 16 bit one's complement sum of 'count' bytes from 'buffer'.
| 'strtval' is the starting value of the sum (usually zero).
|
| It simplifies life in in_cksum if strtval can be >= 2^16.
| This routine will work as long as strtval is < 2^31.
|
| Performance
| -----------
| This routine is intended for MC 68020s but should also work
| for 68030s. It (deliberately) doesn't worry about the alignment
| of the buffer so will only work on a 68010 if the buffer is
| aligned on an even address. (Also, a routine written to use
| 68010 "loop mode" would almost certainly be faster than this
| code on a 68010).
|
| We don't worry about alignment because this routine is frequently
| called with small counts: 20 bytes for IP header checksums and 40
| bytes for TCP ack checksums. For these small counts, testing for
| bad alignment adds ~10% to the per-call cost. Since, by the nature
| of the kernel's allocator, the data we're called with is almost
| always longword aligned, there is no benefit to this added cost
| and we're better off letting the loop take a big performance hit
| in the rare cases where we're handed an unaligned buffer.
|
| Loop unrolling constants of 2, 4, 8, 16, 32 and 64 times were
| tested on random data on four different types of processors (see
| list below -- 64 was the largest unrolling because anything more
| overflows the 68020 Icache). On all the processors, the
| throughput asymptote was located between 8 and 16 (closer to 8).
| However, 16 was substantially better than 8 for small counts.
| (It's clear why this happens for a count of 40: unroll-8 pays a
| loop branch cost and unroll-16 doesn't. But the tests also showed
| that 16 was better than 8 for a count of 20. It's not obvious to
| me why.) So, since 16 was good for both large and small counts,
| the loop below is unrolled 16 times.
|
| The processors tested and their average time to checksum 1024 bytes
| of random data were:
|	Sun 3/50  (15MHz)	190 us/KB
|	Sun 3/180 (16.6MHz)	175 us/KB
|	Sun 3/60  (20MHz)	134 us/KB
|	Sun 3/280 (25MHz)	 95 us/KB
|
| The cost of calling this routine was typically 10% of the
| per-kilobyte cost. E.g., checksumming zero bytes on a 3/60 cost 9us
| and each additional byte cost 125ns. With the high fixed cost,
| it would clearly be a gain to "inline" this routine -- the
| subroutine call adds 400% overhead to an IP header checksum.
| However, in absolute terms, inlining would only gain 10us per
| packet -- a 1% effect for a 1ms ethernet packet. This is not
| enough gain to be worth the effort.
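
As a rough check on the figures in the comment above, the Sun 3/60 numbers suggest a simple linear cost model: about 9 us of fixed call cost plus 125 ns per byte. The C sketch below is only an illustration. The function and macro names are hypothetical, and the constants are the ones quoted in the comment, not new measurements.

```c
/* Linear cost model built from the Sun 3/60 figures quoted in the
 * header comment. All names here are hypothetical. */

#define FIXED_COST_NS     9000L	/* cost of checksumming zero bytes */
#define PER_BYTE_COST_NS   125L	/* cost of each additional byte */

/* Estimated time, in nanoseconds, to checksum 'count' bytes. */
long cksum_cost_ns(long count)
{
	return FIXED_COST_NS + PER_BYTE_COST_NS * count;
}

/* Fixed call cost as a percentage of the summing work alone.
 * 'count' must be greater than zero. */
long call_overhead_pct(long count)
{
	return FIXED_COST_NS * 100 / (PER_BYTE_COST_NS * count);
}
```

Under this model a 20-byte IP header costs 11.5 us, of which only 2.5 us is summing, so the call overhead is 360% of the useful work, in line with the roughly 400% quoted. For 1024 bytes the model gives 137 us, close to the measured 134 us/KB.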

	.data
	.asciz	"@(#)$Header: oc_cksum.s,v 1.1 89/08/23 12:53:20 mike Exp $"
	.even
	.text

	.globl	_oc_cksum
_oc_cksum:
	movl	sp@(4),a0	| get buffer ptr
	movl	sp@(8),d1	| get byte count
	movl	sp@(12),d0	| get starting value
	movl	d2,sp@-		| free a reg

	| test for possible 1, 2 or 3 bytes of excess at end
	| of buffer. The usual case is no excess (the usual
	| case is header checksums) so we give that the faster
	| 'not taken' leg of the compare. (We do the excess
	| first because we're about to trash the low order
	| bits of the count in d1.)

	btst	#0,d1
	jne	L5		| if one or three bytes excess
	btst	#1,d1
	jne	L7		| if two bytes excess
L1:
	movl	d1,d2
	lsrl	#6,d1		| make cnt into # of 64 byte chunks
	andl	#0x3c,d2	| then find fractions of a chunk
	negl	d2
	andb	#0xf,cc		| clear X
	| each movl/addxl pair below is 4 bytes of code and sums
	| 4 bytes of data, so -d2 is also the code offset back from L3
	jmp	pc@(L3-.-2:b,d2)
L2:
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
	movl	a0@+,d2
	addxl	d2,d0
L3:
	dbra	d1,L2		| (NB- dbra doesn't affect X)

	movl	d0,d1		| fold 32 bit sum to 16 bits
	swap	d1		| (NB- swap doesn't affect X)
	addxw	d1,d0
	jcc	L4
	addw	#1,d0		| add back the carry out of the fold
L4:
	andl	#0xffff,d0
	movl	sp@+,d2
	rts

L5:	| deal with 1 or 3 excess bytes at the end of the buffer.
	btst	#1,d1
	jeq	L6		| if 1 excess

	| 3 bytes excess
	clrl	d2
	movw	a0@(-3,d1:l),d2	| add in last full word then drop
	addl	d2,d0		| through to pick up last byte

L6:	| 1 byte excess
	clrl	d2
	movb	a0@(-1,d1:l),d2
	lsll	#8,d2
	addl	d2,d0
	jra	L1

L7:	| 2 bytes excess
	clrl	d2
	movw	a0@(-2,d1:l),d2
	addl	d2,d0
	jra	L1
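
For readers who do not read 68020 assembly, the arithmetic of the routine can be modelled in portable C. This is a minimal sketch, not a transcription of the code above. The name oc_cksum_model is hypothetical. It sums 16-bit big-endian words into a wide accumulator instead of summing longwords with addxl, and it may differ from the assembly in whether a sum congruent to zero is reported as 0 or 0xffff.

```c
#include <stddef.h>
#include <stdint.h>

/* Portable model of a 16-bit one's complement sum (hypothetical name).
 * Bytes are paired big-endian, as on the 68020. A trailing odd byte is
 * treated as the high byte of a final word, like the L6 case above. */
uint16_t oc_cksum_model(const uint8_t *buffer, size_t count, uint32_t strtval)
{
	uint64_t sum = strtval;	/* wide accumulator; carries are folded later */
	size_t i;

	/* add each complete 16-bit word */
	for (i = 0; i + 1 < count; i += 2)
		sum += ((uint32_t)buffer[i] << 8) | buffer[i + 1];

	/* odd byte count: the last byte is the high half of a final word */
	if (count & 1)
		sum += (uint32_t)buffer[count - 1] << 8;

	/* fold carries back into the low 16 bits (end-around carry) */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

As a check, the eight bytes 00 01 f2 03 f4 f5 f6 f7 used as the worked example in RFC 1071 sum to 0xddf2 with a starting value of zero. Like the assembly routine, the model returns the sum itself; the caller still has to complement it to get the Internet checksum.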