From [email protected] (Chuq Von Rospach) Thu Jun 6 20:36:39 1985
Relay-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site seismo.UUCP
Posting-Version: version B 2.10.2 9/17/84 chuqui version 1.7 9/23/84; site nsc.UUCP
Path: seismo!nsc!chuqui
From: [email protected] (Chuq Von Rospach)
Newsgroups: net.sources
Subject: YA News Archiver
Message-ID: <[email protected]>
Date: 7 Jun 85 00:36:39 GMT
Date-Received: 7 Jun 85 06:25:58 GMT
Distribution: net
Organization: The Blue Parrot
Lines: 566

Here is a netnews archiver similar to the recently posted keepnews but
designed to work with much larger archives where the wonderful quadratic
search time feature of Unix (Unix is a trademark of AT&T Bell Labs,
quadratic search times are a feature of Unix) becomes a real problem. This
archiver also knows how to walk through a directory tree so you can simply
set it on /usr/spool/oldnews and let it do its work. There are lots of
other nifty things I call features (and you might, too) that make it a lot
easier to use than anything else I've seen set up to work on archives. Mine
simply outgrew any capability to do anything with it at about the same time
I got a request for information out of it. I found out (the hard way) that
keepnews wasn't terribly reliable working under 2.10.2, so I finally
decided to hack together my own.

Comments, enhancements, bug fixes, etc... are welcome, but I can only work
on them on a time-available basis...

chuq
-------
# This is a shell archive.
# Remove everything above and including the cut line.
# Then run the rest of the file through sh.
#-----cut here-----cut here-----cut here-----cut here-----
#!/bin/sh
# shar: Shell Archiver
# Run the following text with /bin/sh to create:
#	README
#	Makefile
#	savenews.c
# This archive created: Thu Jun 6 17:28:50 1985
# By: Chuq Von Rospach (The Blue Parrot)
cat << \SHAR_EOF > README
Savenews --

Savenews is a short program designed to make handling of usenet archives
generated by 'expire -a' easier, and to make it possible to find stuff in
the archive once it is there.

It was created by me when I had to get something out of my archives and
realized that there was no way I was going to find anything in 70 megabytes
of random data. It keeps a set of logs of the Subject lines of the articles
and stores the articles themselves in a hashed subdirectory format designed
to minimize the quadratic lookup hassles of the unix directory system
(This, of course, is a feature).

It has been put into the public domain by national semiconductor, and
neither I nor national guarantees that this code even exists, much
less that it does anything useful. This, BTW, is a disclaimer.

chuq von rospach
national semiconductor
nsc!chuqui
SHAR_EOF
cat << \SHAR_EOF > Makefile
#
# Makefile for savenews
#
CFLAGS = -g

savenews: savenews.c
	${CC} ${CFLAGS} savenews.c -o savenews

clean:
	rm -f savenews

lint:
	lint -hx savenews.c
SHAR_EOF
cat << \SHAR_EOF > savenews.c
/*
 * savenews filename [filename ...]
 *
 * Savenews is a program designed to clean up and compact a
 * usenet archive. It will take the filename(s) given to it as arguments
 * and save them in a netnews archive (defined by SAVENEWS, default is
 * /usr/spool/savenews).
 *
 * This program was set up to do two main things:
 *
 * 1) compact out the useless parts of the message, specifically the lines
 *    in the header that don't serve a useful purpose in an archive. This
 *    is done by removing all but the following header lines: From, Date,
 *    Newsgroups, Subject, and Message-ID, and seems to save an average of
 *    500 bytes an article.
 *
 * 2) keep the quadratic nature of unix (TM AT&T Bell Labs) directory
 *    searches from making your life miserable. Storing a raw archive of
 *    net.unix-wizards is a silly thing to do, for example. What I do is
 *    create a one level subdirectory set to keep any one directory from
 *    getting too large; this program is currently set so that there
 *    are enough directories to keep the total number of files in any one
 *    directory below about 150 in the largest parts of my archive. The
 *    algorithm I use is abs(atoi(Message-ID) % HASHVAL), with HASHVAL being
 *    prime. This quick and dirty hash gives you directories with the
 *    numbers 0 to HASHVAL-1, and about the same number of files in each
 *    given a random distribution of Message-ID numbers (not bad, in
 *    reality).
 *
 * The program will add the name of the file and the subject line of the
 * article to a logfile in subdirectory LOGS, the filename being the
 * newsgroup.
 *
 * As currently written, an article will be saved only to the first
 * newsgroup in the Newsgroups header line. This means that something
 * posted to 'net.sources,net.flame' will end up in net.sources, but that
 * something posted to 'net.flame,net.sources' will end up in net.flame.
 * I consider this a feature. Others may disagree.
 *
 * If an article is saved that has a duplicate message-ID of one already
 * in the archive, then it will be saved by adding the character '_' and
 * some small integer needed to make the filename unique.
 * You can then use ls or find to look for these and see if they are
 * duplicates (and remove them) or if they are simply botches by some other
 * site (it does happen, unfortunately).
 *
 * This program will do intelligent things if given a non-news article,
 * such as nothing. Don't push it, though -- I haven't tried it on
 * special devices, symbolic links, and other weirdies and it is likely
 * to throw up on some of them since I didn't feel like protecting someone
 * from trying to archive /dev (if tar can consider this a feature, so can
 * I...)
 *
 * This program uses the 4.2 Directory routines (libndir). If you don't
 * run 4.2, get ahold of a copy of the compatibility library for your
 * system and use it, or hack up do_dir and is_dir to get around it
 * if you believe in messing around with primitive hacks (I LIKE libndir)
 *
 * General usage: every so often run the program with
 * 'savenews /usr/spool/oldnews'. Look through /usr/spool/savenews
 * for duplicated articles and remove them, and then copy all of the
 * stuff to tape. Remove everything except the LOGS directory, so that
 * people can use grep to look for things in the archive. It should be
 * easy to get things back off of tape and make the archive useful this
 * way. Thinking about it, if you can't use the archive, you might as well
 * not have it, which is why this program got written (I needed something
 * out of my archive, and it took me a week to find it).
 *
 * This program is designed to run under 2.10.2, but should work under any
 * B news system. Anyone else is on their own. This is in
 * the public domain by the kindness of my employer, national
 * semiconductor, but neither I nor national make any guarantee that it
 * will work, that we will support this program, or even admit that it
 * exists.
 * This is called a disclaimer, and means that if you use this
 * program, you are on your own. It DOES, however, pass lint cleanly, which
 * is more than I can say for most stuff posted to the net. Feel free to
 * fix, break, enhance, change, or do anything to this program except
 * claim it to be your own (unless, of course, you break it...). Passing
 * enhancements back to me would be nice, too.
 *
 * chuq von rospach, national semiconductor (nsc!chuqui)
 *
 */

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/dir.h>
#include <ctype.h>

#define FALSE       0
#define TRUE        1
#define HASHVAL     37      /* hash value for sub-dirs. Prime number! */
#define NUMDIRS     1024    /* number of dirs that can be pushed */
#define SAVENEWS    "/usr/spool/savenews"   /* home of the archive */
#define LOGFILE     "LOGS"      /* subdir in SAVENEWS to save logs in */
#define JOBLOG      "joblog"    /* where log of this job is put */
#define DIRMODE     0755        /* mkdir with this mode */
#define COPYBUF     8192        /* block read/write buffer size */

char *Progname;         /* name of the program for Eprintf */
char line[BUFSIZ];      /* general purpose line buffer */

#define NUM_HEADERS     5   /* number of headers we are saving */
#define GROUP_HEADER    1   /* where Newsgroup will be found */
#define SUBJECT_HEADER  2   /* where Subject will be found */
#define MESSAGE_HEADER  3   /* where Message-ID will be found */
char header_data[NUM_HEADERS][BUFSIZ];
char *headers[NUM_HEADERS] =
{
    "From:",
    "Newsgroups:",
    "Subject:",
    "Message-ID:",
    "Date:"
};

long num_saved = 0;     /* number of articles saved */
FILE *logfp;            /* file pointer to joblog file */

char *rindex(), *strcat(), *pop_dir(), *strcpy(), *strsave(), *index();

main(argc,argv)
int argc;
char *argv[];
{
    register int i;
    char joblogfile[BUFSIZ];
    char *dirname;

    /*
     * This removes any preceding pathname so that
     * anything printed out by Eprintf has just the
     * program name and not where it came from
     */
    if ((Progname = rindex(argv[0],'/')) == NULL)
        Progname = argv[0];
    else
        Progname++;

    if (argc == 1) {
        fprintf(stderr,"Usage: %s file [file ...]\n",Progname);
        exit(1);
    }

    sprintf(joblogfile,"%s/%s",SAVENEWS,JOBLOG);
    if ((logfp = fopen(joblogfile,"w")) == NULL)
        fprintf(stderr,"Can't open %s, logging suspended\n",joblogfile);

    for (i = 1 ; i < argc; i++) {   /* process each parameter */
        register int rc;
        if ((rc = is_dir(argv[i])) == -1)
            continue;
        else if (rc == TRUE)
            do_dir(argv[i]);
        else
            save_file(argv[i]);
    }
    while((dirname = pop_dir()) != NULL) {
        do_dir(dirname);    /* process whatever is left on dirstack */
    }
    /* num_saved is a long, so it needs %ld rather than %d */
    printf("Total articles saved was %ld\n",num_saved);
    exit(0);
}

do_dir(dname)   /* process a directory, push other directories on stack */
                /* to be handled recursively later */
char *dname;
{
    DIR *dirp;
    struct direct *dp;
    char fullname[BUFSIZ];

    if ((dirp = opendir(dname)) == NULL) {
        Eprintf("can't opendir %s\n",dname);
        return;
    }

    for (dp = readdir(dirp); dp != NULL; dp = readdir(dirp)) {
        register int rc;

        if ((dp->d_namlen == 2 && !strcmp(dp->d_name,".."))
            || (dp->d_namlen == 1 && !strcmp(dp->d_name,".")))
            continue;   /* skip . and .. */

        sprintf(fullname,"%s/%s",dname,dp->d_name);
        if((rc = is_dir(fullname)) == -1)
            continue;
        else if (rc == TRUE)
            push_dir(fullname);
        else
            save_file(fullname);
    }
    closedir(dirp);
}

is_dir(name)
char *name;
{
    struct stat sbuf;

    if (stat(name,&sbuf) == -1) {
        Eprintf("can't stat '%s'\n",name);
        return(-1);
    }
    /*
     * Compare the whole file-type field: testing the S_IFDIR bit alone
     * also matches block devices, whose type shares that bit.
     */
    return(((sbuf.st_mode & S_IFMT) == S_IFDIR) ? TRUE : FALSE);
}

/* VARARGS */
Eprintf(s1,s2,s3,s4,s5,s6,s7,s8,s9)
char *s1,*s2,*s3,*s4,*s5,*s6,*s7,*s8,*s9;
{
    if (logfp == NULL)
        return;
    fprintf(logfp,"%s: ",Progname);
    fprintf(logfp,s1,s2,s3,s4,s5,s6,s7,s8,s9);
    fflush(logfp);
}

/*
 * quick and dirty stack routines.
 *
 * push_dir(name) char *name;
 *      stores the given string in the stack
 * char *pop_dir()
 *      returns a string from the stack, or NULL if none.
 */

static char *dirstack[NUMDIRS];
static int lastdir = 0;
static char pop_name[BUFSIZ];

push_dir(name)
char *name;
{
    if (lastdir >= NUMDIRS) {
        Eprintf("push_dir overflow!\n");
        return;
    }
    dirstack[lastdir] = strsave(name);
    if (dirstack[lastdir] == NULL)
    {
        Eprintf("malloc failed!\n");
        return;
    }
    lastdir++;
}

char *pop_dir()
{
    if(lastdir == 0)
        return(NULL);
    lastdir--;
    strcpy(pop_name,dirstack[lastdir]);
    free(dirstack[lastdir]);    /* free first, then clear the slot */
    dirstack[lastdir] = NULL;
    return(pop_name);
}

char *strsave(s)
char *s;
{
    char *p, *malloc();

    if ((p = malloc((unsigned)strlen(s)+1)) != NULL)
        strcpy(p,s);
    return(p);
}

save_file(name)     /* save the article in the archive */
char *name;
{
    FILE *fp, *ofp, *fopen(), *output_file();
    register int i, nc;
    char diskbuf[COPYBUF];

    Eprintf("saving '%s'\n",name);
    if ((fp = fopen(name,"r")) == NULL) {
        Eprintf("can't open\n");
        return;
    }

    if ((fgets(line,BUFSIZ,fp) == NULL)) {
        Eprintf("0 length file\n");
        fclose(fp);
        return;
    }
    if (!start_header(line)) {
        Eprintf("not a news article\n");
        fclose(fp);
        return;
    }
    read_header(fp);
    if ((ofp = output_file()) == NULL) {
        Eprintf("Can't save\n");
        fclose(fp);
        return;
    }

    for (i = 0; i < NUM_HEADERS; i++)
        fprintf(ofp,"%s\n",header_data[i]);
    fputc('\n',ofp);

    while ((nc = fread(diskbuf,sizeof(char),COPYBUF,fp)) != 0)
        fwrite(diskbuf,sizeof(char),nc,ofp);    /* copy body of article */
    fclose(ofp);
    fclose(fp);
    num_saved++;
    return;
}

start_header(s)     /* see if this is the start of a news article */
char *s;
{
    /*
     * If this is coming from B news, the first line will 'always' be
     * Relay-Version (at least, on my system). Your mileage may vary.
     */
    if (!strncmp(s,"Relay-Version:",14))
        return(TRUE);
    /*
     * If you are copying a section of archive already archived by
     * savenews, then the first line will be From (unless you changed
     * the headers data structure, then it's up to you...)
     */
    if (!strncmp(s,"From:",5))
        return(TRUE);
    return(FALSE);
}

/*
 * By the time we get here, the first line will already be read in and
 * checked by start_header(). If we are re-copying a savenews archive
 * (which happens when you decide to play with HASHVAL, trust me) then
 * we need to save the From line, so we can't just throw it away.
 * Hence the funky looking do-while setup instead of something a bit more
 * straightforward.
 */
read_header(fp)
FILE *fp;
{
    register int i;

    for (i = 0; i < NUM_HEADERS; i++)
        header_data[i][0] = '\0';   /* remove last article's data */

    do {
        char *cp;

        if (line[0] == '\n')    /* always be a blank line after the header */
            return;

        for (i = 0 ; i < NUM_HEADERS; i++) {
            if (!strncmp(headers[i],line,strlen(headers[i]))) {
                strcpy(header_data[i],line);
                if (cp = index(header_data[i],'\n'))
                    *cp = '\0';     /* eat newlines */
            }
        }
    } while (fgets(line,BUFSIZ,fp) != NULL);
}

FILE *output_file()     /* generate the name in the archive */
{
    int hashval, copy = 0;
    FILE *fp, *fopen();
    char *p, newsgroup[BUFSIZ], message_id[BUFSIZ];
    char shortname[BUFSIZ], filename[BUFSIZ], filename2[BUFSIZ];

    /* get the first newsgroup */
    p = index(header_data[GROUP_HEADER],':');   /* move past Newsgroups */
    if (!p) {
        Eprintf("Invalid newsgroups\n");
        return(NULL);
    }
    p++;    /* skip the colon */
    while (isspace(*p))
        p++;    /* skip whitespace */
    strcpy(newsgroup,p);
    if (p = index(newsgroup,','))
        *p= '\0';   /* newsgroup now only has one name in it */

    /* get the message-id */
    p = index(header_data[MESSAGE_HEADER],':');
    if (!p) {
        Eprintf("Invalid message-id\n");
        return(NULL);
    }
    p++;    /* skip the colon */
    while (isspace(*p))
        p++;    /* skip whitespace */
    if (*p == '<' || *p == '(')
        p++;
    if (*p == '-')  /* make negative article id numbers positive (hack) */
        p++;
    strcpy(message_id,p);
    if (p = index(message_id,'.'))      /* trim off the .UUCP if any */
        *p = '\0';
    else if (p = index(message_id,'>')) /* or get the closing bracket */
        *p = '\0';
    else if (p = index(message_id,')')) /* or get the closing paren */
        *p = '\0';
    if (p = index(message_id,'@'))  /* change nnn@site */
        *p = '.';                   /* to nnn.site */

    /* generate the hash value for the subdirectory */
    hashval = atoi(message_id) % HASHVAL;

    /* setup the filename to save to */
    sprintf(shortname,"%s/%d/%s",newsgroup,hashval,message_id);
    sprintf(filename,"%s/%s",SAVENEWS,shortname);
    while (exists(filename)) {  /* make it unique if necessary */
        sprintf(shortname,"%s/%d/%s_%d",newsgroup,hashval,message_id,++copy);
        sprintf(filename,"%s/%s",SAVENEWS,shortname);
    }

    strcpy(filename2,filename);     /* chop off the filename, since */
    if (p = rindex(filename2,'/'))  /* makeparents should only be */
        *p = '\0';                  /* given the directory part */
    makeparents(filename2);

    if ((fp = fopen(filename,"w")) == NULL) {
        Eprintf("Can't open %s for output\n",filename);
        return(NULL);
    }
    log(newsgroup,shortname);
    return(fp);
}

exists(name)
char *name;
{
    struct stat sbuf;

    if (stat(name,&sbuf) == -1) {
        return(FALSE);
    }
    return(TRUE);
}

makeparents(name)   /* recursively make parent directories */
char *name;
{
    char *p, buf[BUFSIZ];

    if (exists(name))
        return;
    strcpy(buf,name);
    if (!(p = rindex(buf,'/'))) {
        Eprintf("makeparents failed!\n");
        return;
    }
    *p = '\0';
    makeparents(buf);
    mkdir(name,DIRMODE);
}

log(group,name)     /* write to the logfile */
char *group, *name;
{
    char *subject, logfile[BUFSIZ];
    FILE *ofp, *fopen();

    /* get the subject */
    subject = index(header_data[SUBJECT_HEADER],':');
    if (!subject) {
        Eprintf("Invalid subject, no log entry\n");
        return;
    }
    subject++;  /* skip the colon */
    while (isspace(*subject))
        subject++;  /* skip whitespace */

    /* generate the place where it goes */
    sprintf(logfile,"%s/%s",SAVENEWS,LOGFILE);
    makeparents(logfile);
    strcat(logfile,"/");
    strcat(logfile,group);

    if ((ofp = fopen(logfile,"a")) == NULL)
    {
        Eprintf("open failed on %s\n",logfile);
        return;
    }
    fprintf(ofp,"%s\t%s\n", name, subject);
    fclose(ofp);
}

SHAR_EOF
# End of shell archive
exit 0
--
:From the misfiring synapses of: Chuq Von Rospach
{cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!chuqui [email protected]

The offices were very nice, and the clients were only raping the land, and
then, of course, there was the money...
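
The header comment in savenews.c describes the subdirectory scheme as
abs(atoi(Message-ID) % HASHVAL). The sketch below isolates that idea in
present-day C so it can be tried outside the archiver. It is not part of the
posted program: bucket_for and archive_path are hypothetical helper names, the
Message-ID used in the note afterwards is made up, and strtol stands in for
the atoi call that the original uses.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASHVAL 37  /* same prime bucket count that savenews.c defines */

/*
 * Hypothetical helper, not in savenews.c: map a Message-ID to a bucket
 * in the range 0..HASHVAL-1 using its leading article number.
 */
int bucket_for(const char *message_id)
{
    const char *p = message_id;
    long r;

    assert(message_id != NULL);
    if (*p == '<' || *p == '(')
        p++;                            /* skip an opening bracket or paren */
    r = strtol(p, NULL, 10) % HASHVAL;  /* 0 when there are no leading digits */
    if (r < 0)
        r = -r;                         /* fold negative article numbers */
    return (int)r;
}

/*
 * Hypothetical helper: build "group/bucket/id" into buf, mirroring the
 * shortname layout of the archive. Returns what snprintf returns.
 */
int archive_path(char *buf, size_t len, const char *group, const char *id)
{
    return snprintf(buf, len, "%s/%d/%s", group, bucket_for(id), id);
}
```

With these assumptions, a made-up article number of 1234 lands in bucket 13
(1234 = 33 * 37 + 13), so archive_path for group net.sources yields
net.sources/13/1234. Taking the remainder before the sign flip keeps the
result in range even for the most negative value strtol can return.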