1.1 ! root 1: From [email protected] (Chuq Von Rospach) Thu Jun 6 20:36:39 1985 ! 2: Relay-Version: version B 2.10.3 4.3bsd-beta 6/6/85; site seismo.UUCP ! 3: Posting-Version: version B 2.10.2 9/17/84 chuqui version 1.7 9/23/84; site nsc.UUCP ! 4: Path: seismo!nsc!chuqui ! 5: From: [email protected] (Chuq Von Rospach) ! 6: Newsgroups: net.sources ! 7: Subject: YA News Archiver ! 8: Message-ID: <[email protected]> ! 9: Date: 7 Jun 85 00:36:39 GMT ! 10: Date-Received: 7 Jun 85 06:25:58 GMT ! 11: Distribution: net ! 12: Organization: The Blue Parrot ! 13: Lines: 566 ! 14: ! 15: Here is a netnews archiver similar to the recently posted keepnews but ! 16: designed to work with much larger archives where the wonderful quadratic ! 17: search time feature of the Unix (Unix is a trademark of AT&T Bell Labs, ! 18: quadratic search times are a feature of Unix) becomes a real problem. This ! 19: archive also knows how to walk through a directory tree so you can simply ! 20: set it on /usr/spool/oldnews and let it do its work. There are lots of ! 21: other nifty things I call features (and you might, too) that make it a lot ! 22: easier to use than anything else I've seen set up to work on archives. Mine ! 23: simply outgrew any capability to do anything with about the same time I got ! 24: a request for information out of it. I found out (the hard way) that ! 25: keepnews wasn't terribly reliable working under 2.10.2, so I finally ! 26: decided to hack together my own. ! 27: ! 28: Comments, enhancements, bug fixes, etc... are welcome, but I can only work ! 29: on them on a time available basis... ! 30: ! 31: chuq ! 32: ------- ! 33: # This is a shell archive. ! 34: # Remove everything above and including the cut line. ! 35: # Then run the rest of the file through sh. ! 36: #-----cut here-----cut here-----cut here-----cut here----- ! 37: #!/bin/sh ! 38: # shar: Shell Archiver ! 39: # Run the following text with /bin/sh to create: ! 40: # README ! 41: # Makefile ! 
42: # savenews.c ! 43: # This archive created: Thu Jun 6 17:28:50 1985 ! 44: # By: Chuq Von Rospach (The Blue Parrot) ! 45: cat << \SHAR_EOF > README ! 46: Savenews -- ! 47: ! 48: Savenews is a short program designed to make handling of usenet archives ! 49: generated by 'expire -a' easier, and to make it possible to find stuff in ! 50: the archive once it is there. ! 51: ! 52: It was created by me when I had to get something out of my archives and ! 53: realized that there was no way I was going to find anything in 70 megabytes ! 54: of random data. It keeps a set of logs of the Subject lines of the articles ! 55: and stores the articles themselves in a hashed subdirectory format designed ! 56: to minimize the quadratic lookup hassles of the unix directory system ! 57: (This, of course, is a feature). ! 58: ! 59: It has been put into the public domain by national semiconductor, and ! 60: neither myself or national guarantee that this code even exists, much ! 61: less that it does anything useful. This, BTW, is a disclaimer. ! 62: ! 63: chuq von rospach ! 64: national semiconductor ! 65: nsc!chuqui ! 66: SHAR_EOF ! 67: cat << \SHAR_EOF > Makefile ! 68: # ! 69: # Makefile for savenews ! 70: # ! 71: CFLAGS = -g ! 72: ! 73: savenews: savenews.c ! 74: ${CC} ${CFLAGS} savenews.c -o savenews ! 75: ! 76: clean: ! 77: rm -f savenews ! 78: ! 79: lint: ! 80: lint -hx savenews.c ! 81: SHAR_EOF ! 82: cat << \SHAR_EOF > savenews.c ! 83: /* ! 84: * savenews filename [filename ...] ! 85: * ! 86: * Savenews is a program designed to clean up and compact a ! 87: * usenet archive. It will take the filename(s) given to it as arguments ! 88: * and save them in a netnews archive (defined by SAVENEWS, default is ! 89: * /usr/spool/savenews). ! 90: * ! 91: * This program was set up to do two main things: ! 92: * ! 93: * 1) compact out the useless parts of the message, specifically the lines ! 94: * in the header that don't serve a useful purpose in an archive. This ! 
95: * is done by removing all but the following header lines: From, Date, ! 96: * Newsgroups, Subject, and Message-ID, and seems to save an average of ! 97: * 500 bytes an article. ! 98: * ! 99: * 2) keep the quadratic nature of unix(TM AT&T Bell labs) directory searches ! 100: * from making your life miserable. Storing a raw archive of ! 101: * net.unix-wizards is a silly thing to do, for example. What I do is ! 102: * create a one level subdirectory set to keep any one directory from ! 103: * getting too large, but this program is currently set so that there ! 104: * are enough directories to keep the total number of files in any one ! 105: * directory below about 150 in the largest parts of my archive. The ! 106: * algorithm I use is abs(atoi(Message-ID)%HASHVAL) with HASHVAL being ! 107: * prime. This quick and dirty hash gives you directories with the ! 108: * numbers 0 to HASHVAL-1, and about the same number of files in each ! 109: * given a random distribution of Message-ID numbers (not bad, in ! 110: * reality) ! 111: * ! 112: * The program will add the name of the file and the subject line of the ! 113: * article in a logfile in subdirectory LOGS, the filename being the ! 114: * newsgroup. ! 115: * ! 116: * As currently written, an article will be saved only to the first ! 117: * newsgroup in the Newsgroups header line. This means that something ! 118: * posted to 'net.sources,net.flame' will end up in net.sources, but that ! 119: * something posted to 'net.flame,net.sources' will end up in net.flame. ! 120: * I consider this a feature. Others may disagree. ! 121: * ! 122: * If an article is saved that has a duplicate message-ID of one already ! 123: * in the archive, then it will be saved by adding the character '_' and ! 124: * some small integer needed to make the filename unique. You can then ! 125: * use ls or find to look for these and see if they are duplicates (and ! 126: * remove them) or if they are simply botches by some other site (it does ! 
127: * happen, unfortunately). ! 128: * ! 129: * This program will do intelligent things if given a non-news article, ! 130: * such as nothing. Don't push it, though -- I haven't tried it on ! 131: * special devices, symbolic links, and other wierdies and it is likely ! 132: * to throw up on some of them since I didn`t feel like protecting someone ! 133: * from trying to archive /dev (if tar can consider this a feature, so can ! 134: * I...) ! 135: * ! 136: * This program uses the 4.2 Directory routines (libndir). If you don't ! 137: * run 4.2, get ahold of a copy of the compatibility library for your ! 138: * system and use it, or hack up do_dir and is_dir to get around it ! 139: * if you believe in messing around with primitive hacks (I LIKE libndir) ! 140: * ! 141: * General usage: every so often run the program with ! 142: * 'savenews /usr/spool/oldnews'. Look through /usr/spool/savenews ! 143: * for duplicated articles and remove them, and then copy all of the ! 144: * stuff to tape. Remove everything except the LOGS directory, so that ! 145: * people can use grep to look for things in the archive. It should be ! 146: * easy to get things back off of tape and make the archive useful this ! 147: * way. Thinking about it, if you can't use the archive, you might as well ! 148: * not have it, which is why this program got written (I needed something ! 149: * out of my archive, and it took me a week to find it). ! 150: * ! 151: * This program is designed to run under 2.10.2, but should work under any ! 152: * B news system. Anyone else is on their own. This is in ! 153: * the public domain by the kindness of my employer, national ! 154: * semiconductor, but neither I nor national make any guarantee that it ! 155: * will work, that we will support this program, or even admit that it ! 156: * exists. This is called a disclaimer, and means that if you use this ! 157: * program, you are on your own. It DOES, however, pass lint cleanly, which ! 
158: * is more than I can say for most stuff posted to the net. Feel free to ! 159: * fix, break, enhance, change, or do anything to this program except ! 160: * claim it to be your own (unless, of course, you break it...). Passing ! 161: * enhancements back to me would be nice, too. ! 162: * ! 163: * chuq von rospach, national semiconductor (nsc!chuqui) ! 164: * ! 165: */ ! 166: ! 167: #include <stdio.h> ! 168: #include <sys/types.h> ! 169: #include <sys/stat.h> ! 170: #include <sys/dir.h> ! 171: #include <ctype.h> ! 172: ! 173: #define FALSE 0 ! 174: #define TRUE 1 ! 175: #define HASHVAL 37 /* hash value for sub-dirs. Prime number! */ ! 176: #define NUMDIRS 1024 /* number of dirs that can be pushed */ ! 177: #define SAVENEWS "/usr/spool/savenews" /* home of the archive */ ! 178: #define LOGFILE "LOGS" /* subdir in SAVENEWS to save logs in */ ! 179: #define JOBLOG "joblog" /* where log of this job is put */ ! 180: #define DIRMODE 0755 /* mkdir with this mode */ ! 181: #define COPYBUF 8192 /* block read/write buffer size */ ! 182: ! 183: char *Progname; /* name of the program for Eprintf */ ! 184: char line[BUFSIZ]; /* general purpose line buffer */ ! 185: ! 186: #define NUM_HEADERS 5 /* number of headers we are saving */ ! 187: #define GROUP_HEADER 1 /* where Newsgroup will be found */ ! 188: #define SUBJECT_HEADER 2 /* where Subject will be found */ ! 189: #define MESSAGE_HEADER 3 /* where Message-ID will be found */ ! 190: char header_data[NUM_HEADERS][BUFSIZ]; ! 191: char *headers[NUM_HEADERS] = ! 192: { ! 193: "From:", ! 194: "Newsgroups:", ! 195: "Subject:", ! 196: "Message-ID:", ! 197: "Date:" ! 198: }; ! 199: ! 200: long num_saved = 0; /* number of articles saved */ ! 201: FILE *logfp; /* file pointer to joblog file */ ! 202: ! 203: char *rindex(), *strcat(), *pop_dir(), *strcpy(), *strsave(), *index(); ! 204: ! 205: main(argc,argv) ! 206: int argc; ! 207: char *argv[]; ! 208: { ! 209: register int i; ! 210: char joblogfile[BUFSIZ]; ! 211: char *dirname; ! 
212: ! 213: /* ! 214: * This removes any preceding pathname so that ! 215: * anything printed out by Eprintf has just the ! 216: * program name and not where it came from ! 217: */ ! 218: if ((Progname = rindex(argv[0],'/')) == NULL) ! 219: Progname = argv[0]; ! 220: else ! 221: Progname++; ! 222: ! 223: if (argc == 1) { ! 224: fprintf(stderr,"Usage: %s file [file ...]\n",Progname); ! 225: exit(1); ! 226: } ! 227: ! 228: sprintf(joblogfile,"%s/%s",SAVENEWS,JOBLOG); ! 229: if ((logfp = fopen(joblogfile,"w")) == NULL) ! 230: fprintf(stderr,"Can't open %s, logging suspended\n",joblogfile); ! 231: ! 232: for (i = 1 ; i < argc; i++) { /* process each parameter */ ! 233: register int rc; ! 234: if ((rc = is_dir(argv[i])) == -1) ! 235: continue; ! 236: else if (rc == TRUE) ! 237: do_dir(argv[i]); ! 238: else ! 239: save_file(argv[i]); ! 240: } ! 241: while((dirname = pop_dir()) != NULL) { ! 242: do_dir(dirname); /* process whatever is left on dirstack */ ! 243: } ! 244: printf("Total articles saved was %ld\n",num_saved); /* num_saved is a long */ ! 245: exit(0); ! 246: } ! 247: ! 248: do_dir(dname) /* process a directory, push other directories on stack */ ! 249: /* to be handled recursively later */ ! 250: char *dname; ! 251: { ! 252: DIR *dirp; ! 253: struct direct *dp; ! 254: char fullname[BUFSIZ]; ! 255: ! 256: if ((dirp = opendir(dname)) == NULL) { ! 257: Eprintf("can't opendir %s\n",dname); ! 258: return; ! 259: } ! 260: ! 261: for (dp = readdir(dirp); dp != NULL; dp = readdir(dirp)) { ! 262: register int rc; ! 263: ! 264: if(dp->d_namlen == 2 && !strcmp(dp->d_name,"..") ! 265: || (dp->d_namlen == 1 && !strcmp(dp->d_name,"."))) ! 266: continue; /* skip . and .. */ ! 267: ! 268: sprintf(fullname,"%s/%s",dname,dp->d_name); ! 269: if((rc = is_dir(fullname)) == -1) ! 270: continue; ! 271: else if (rc == TRUE) ! 272: push_dir(fullname); ! 273: else ! 274: save_file(fullname); ! 275: } ! 276: closedir(dirp); ! 277: } ! 278: ! 279: is_dir(name) ! 280: char *name; ! 281: { ! 282: struct stat sbuf; ! 
283: ! 284: if (stat(name,&sbuf) == -1) { ! 285: Eprintf("can't stat '%s'\n",name); ! 286: return(-1); ! 287: } ! 288: return(((sbuf.st_mode & S_IFMT) == S_IFDIR) ? TRUE : FALSE); /* mask first: S_IFBLK shares a bit with S_IFDIR */ ! 289: } ! 290: ! 291: /* VARARGS */ ! 292: Eprintf(s1,s2,s3,s4,s5,s6,s7,s8,s9) ! 293: char *s1,*s2,*s3,*s4,*s5,*s6,*s7,*s8,*s9; ! 294: { ! 295: if (logfp == NULL) ! 296: return; ! 297: fprintf(logfp,"%s: ",Progname); ! 298: fprintf(logfp,s1,s2,s3,s4,s5,s6,s7,s8,s9); ! 299: fflush(logfp); ! 300: } ! 301: ! 302: /* ! 303: * quick and dirty stack routines. ! 304: * ! 305: * push_dir(name) char *name; ! 306: * stores the given string in the stack ! 307: * char *pop_dir() ! 308: * returns a string from the stack, or NULL if none. ! 309: */ ! 310: ! 311: static char *dirstack[NUMDIRS]; ! 312: static int lastdir = 0; ! 313: static char pop_name[BUFSIZ]; ! 314: ! 315: push_dir(name) ! 316: char *name; ! 317: { ! 318: if (lastdir >= NUMDIRS) { ! 319: Eprintf("push_dir overflow!\n"); ! 320: return; ! 321: } ! 322: dirstack[lastdir] = strsave(name); ! 323: if (dirstack[lastdir] == NULL) ! 324: { ! 325: Eprintf("malloc failed!\n"); ! 326: return; ! 327: } ! 328: lastdir++; ! 329: } ! 330: ! 331: char *pop_dir() ! 332: { ! 333: if(lastdir == 0) ! 334: return(NULL); ! 335: lastdir--; ! 336: strcpy(pop_name,dirstack[lastdir]); ! 337: free(dirstack[lastdir]); /* free the saved string before clearing the slot */ ! 338: dirstack[lastdir] = NULL; ! 339: return(pop_name); ! 340: } ! 341: ! 342: char *strsave(s) ! 343: char *s; ! 344: { ! 345: char *p, *malloc(); ! 346: ! 347: if ((p = malloc((unsigned)strlen(s)+1)) != NULL) ! 348: strcpy(p,s); ! 349: return(p); ! 350: } ! 351: ! 352: save_file(name) /* save the article in the archive */ ! 353: char *name; ! 354: { ! 355: FILE *fp, *ofp, *fopen(), *output_file(); ! 356: register int i, nc; ! 357: char diskbuf[COPYBUF]; ! 358: ! 359: Eprintf("saving '%s'\n",name); ! 360: if ((fp = fopen(name,"r")) == NULL) { ! 361: Eprintf("can't open\n"); ! 362: return; ! 363: } ! 364: ! 365: if ((fgets(line,BUFSIZ,fp) == NULL)) { ! 
366: Eprintf("0 length file\n"); ! 367: fclose(fp); ! 368: return; ! 369: } ! 370: if (!start_header(line)) { ! 371: Eprintf("not a news article\n"); ! 372: fclose(fp); ! 373: return; ! 374: } ! 375: read_header(fp); ! 376: if ((ofp = output_file()) == NULL) { ! 377: Eprintf("Can't save\n"); ! 378: fclose(fp); ! 379: return; ! 380: } ! 381: ! 382: for (i = 0; i < NUM_HEADERS; i++) ! 383: fprintf(ofp,"%s\n",header_data[i]); ! 384: fputc('\n',ofp); ! 385: ! 386: while ((nc = fread(diskbuf,sizeof(char),COPYBUF,fp)) != 0) ! 387: fwrite(diskbuf,sizeof(char),nc,ofp); /* copy body of article */ ! 388: fclose(ofp); ! 389: fclose(fp); ! 390: num_saved++; ! 391: return; ! 392: } ! 393: ! 394: start_header(s) /* see if this is the start of a news article */ ! 395: char *s; ! 396: { ! 397: /* ! 398: * If this is coming from B news, the first line will 'always' be ! 399: * Relay-Version (at least, on my system). Your mileage my vary. ! 400: */ ! 401: if (!strncmp(s,"Relay-Version:",14)) ! 402: return(TRUE); ! 403: /* ! 404: * If you are copying a section of archive already archived by ! 405: * sendnews, then the first line will be From (unless you changed ! 406: * the headers data structure, then its up to you...) ! 407: */ ! 408: if (!strncmp(s,"From:",5)) ! 409: return(TRUE); ! 410: return(FALSE); ! 411: } ! 412: ! 413: /* ! 414: * By the time we get here, the first line will already be read in and ! 415: * checked by start_header(). If we are re-copying a savenews archive ! 416: * (which happens when you decide to play with HASHVAL, trust me) then ! 417: * we need to save the From line, so we can't just throw it away. Hence ! 418: * the funky looking do-while setup instead of something a bit more ! 419: * straightforward ! 420: */ ! 421: read_header(fp) ! 422: FILE *fp; ! 423: { ! 424: register int i; ! 425: ! 426: for (i = 0; i < NUM_HEADERS; i++) ! 427: header_data[i][0] = '\0'; /* remove last articles data */ ! 428: ! 429: do { ! 430: char *cp; ! 431: ! 
432: if (line[0] == '\n') /* always be a blank line after the header */ ! 433: return; ! 434: ! 435: for (i = 0 ; i < NUM_HEADERS; i++) { ! 436: if (!strncmp(headers[i],line,strlen(headers[i]))) { ! 437: strcpy(header_data[i],line); ! 438: if (cp = index(header_data[i],'\n')) ! 439: *cp = '\0'; /* eat newlines */ ! 440: } ! 441: } ! 442: } while (fgets(line,BUFSIZ,fp) != NULL); ! 443: } ! 444: ! 445: FILE *output_file() /* generate the name in the archive */ ! 446: { ! 447: int hashval, copy = 0; ! 448: FILE *fp, *fopen(); ! 449: char *p, newsgroup[BUFSIZ], message_id[BUFSIZ]; ! 450: char shortname[BUFSIZ], filename[BUFSIZ], filename2[BUFSIZ]; ! 451: ! 452: /* get the first newsgroup */ ! 453: p = index(header_data[GROUP_HEADER],':'); /* move past Newsgroups */ ! 454: if (!p) { ! 455: Eprintf("Invalid newsgroups\n"); ! 456: return(NULL); ! 457: } ! 458: p++; /* skip the colon */ ! 459: while (isspace(*p)) ! 460: p++; /* skip whitespace */ ! 461: strcpy(newsgroup,p); ! 462: if (p = index(newsgroup,',')) ! 463: *p= '\0'; /* newsgroup now only has one name in it */ ! 464: ! 465: /* get the message-id */ ! 466: p = index(header_data[MESSAGE_HEADER],':'); ! 467: if (!p) { ! 468: Eprintf("Invalid message-id\n"); ! 469: return(NULL); ! 470: } ! 471: p++; /* skip the colon */ ! 472: while (isspace(*p)) ! 473: p++; /* skip whitespace */ ! 474: if (*p == '<' || *p == '(') ! 475: p++; ! 476: if (*p == '-') /* make negative article id numbers positive (hack) */ ! 477: p++; ! 478: strcpy(message_id,p); ! 479: if (p = index(message_id,'.')) /* trim off the .UUCP if any */ ! 480: *p = '\0'; ! 481: else if (p = index(message_id,'>')) /* or get the closing bracket */ ! 482: *p = '\0'; ! 483: else if (p = index(message_id,')')) /* or get the closing paren */ ! 484: *p = '\0'; ! 485: if (p = index(message_id,'@')) /* change nnn@site */ ! 486: *p = '.'; /* to nnn.site */ ! 487: ! 488: /* generate the hash value for the subdirectory */ ! 489: hashval = atoi(message_id) % HASHVAL; ! 
490: ! 491: /* setup the filename to save to */ ! 492: sprintf(shortname,"%s/%d/%s",newsgroup,hashval,message_id); ! 493: sprintf(filename,"%s/%s",SAVENEWS,shortname); ! 494: while (exists(filename)) { /* make it unique if neccessary */ ! 495: ! 496: sprintf(shortname,"%s/%d/%s_%d",newsgroup,hashval,message_id,++copy); ! 497: sprintf(filename,"%s/%s",SAVENEWS,shortname); ! 498: } ! 499: ! 500: strcpy(filename2,filename); /* must chop off the filename */ ! 501: if (p = rindex(filename2,'/')) /* since we don't want to */ ! 502: *p = '\0'; /* to makeparents */ ! 503: makeparents(filename2); ! 504: ! 505: if ((fp = fopen(filename,"w")) == NULL) { ! 506: Eprintf("Can't open %s for output\n",filename); ! 507: return(NULL); ! 508: } ! 509: log(newsgroup,shortname); ! 510: return(fp); ! 511: } ! 512: ! 513: exists(name) ! 514: char *name; ! 515: { ! 516: struct stat sbuf; ! 517: ! 518: if (stat(name,&sbuf) == -1) { ! 519: return(FALSE); ! 520: } ! 521: return(TRUE); ! 522: } ! 523: ! 524: makeparents(name) /* recursively make parent directories */ ! 525: char *name; ! 526: { ! 527: char *p, buf[BUFSIZ]; ! 528: ! 529: if (exists(name)) ! 530: return; ! 531: strcpy(buf,name); ! 532: if (!(p = rindex(buf,'/'))) { ! 533: Eprintf("makeparents failed!\n"); ! 534: return; ! 535: } ! 536: *p = '\0'; ! 537: makeparents(buf); ! 538: mkdir(name,DIRMODE); ! 539: } ! 540: ! 541: log(group,name) /* write to the logfile */ ! 542: char *group, *name; ! 543: { ! 544: char *subject, logfile[BUFSIZ]; ! 545: FILE *ofp, *fopen(); ! 546: ! 547: /* get the subject */ ! 548: subject = index(header_data[SUBJECT_HEADER],':'); ! 549: if (!subject) { ! 550: Eprintf("Invalid subject, no log entry\n"); ! 551: return; ! 552: } ! 553: subject++; /* skip the colon */ ! 554: while (isspace(*subject)) ! 555: subject++; /* skip whitespace */ ! 556: ! 557: /* generate the place where it goes */ ! 558: sprintf(logfile,"%s/%s",SAVENEWS,LOGFILE); ! 559: makeparents(logfile); ! 560: strcat(logfile,"/"); ! 
561: strcat(logfile,group); ! 562: ! 563: if ((ofp = fopen(logfile,"a")) == NULL) ! 564: { ! 565: Eprintf("open failed on %s\n",logfile); ! 566: return; ! 567: } ! 568: fprintf(ofp,"%s\t%s\n", name, subject); ! 569: fclose(ofp); ! 570: } ! 571: ! 572: SHAR_EOF ! 573: # End of shell archive ! 574: exit 0 ! 575: -- ! 576: :From the misfiring synapses of: Chuq Von Rospach ! 577: {cbosgd,fortune,hplabs,ihnp4,seismo}!nsc!chuqui [email protected] ! 578: ! 579: The offices were very nice, and the clients were only raping the land, and ! 580: then, of course, there was the money... ! 581: ! 582:
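
The header comment in savenews.c describes the storage layout as newsgroup/bucket/message-id, where the bucket is atoi(Message-ID) % HASHVAL. As a rough illustration only, here is a minimal sketch of that naming step in present-day C. The helper name archive_path() is hypothetical and is not part of the posted program, the Message-IDs in the usage note are made up, and strchr/snprintf stand in for the index/sprintf calls of the original.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define HASHVAL 37   /* same prime the posted source defines */

/*
 * archive_path() is a hypothetical helper, not a function from savenews.c.
 * It mirrors the trimming done in output_file(): drop a leading '<' or '('
 * and a leading '-', cut the id at the first '.', else at '>', else at ')',
 * then turn '@' into '.' so "nnn@site" becomes "nnn.site".
 * Writes "group/bucket/id" into out and returns the bucket number,
 * or -1 if either buffer is too small.
 */
int archive_path(const char *group, const char *msgid, char *out, size_t outlen)
{
    char id[256];
    char *p;
    int bucket, n;

    if (*msgid == '<' || *msgid == '(')
        msgid++;
    if (*msgid == '-')              /* treat negative article numbers as positive */
        msgid++;
    if (strlen(msgid) >= sizeof id)
        return -1;
    strcpy(id, msgid);

    if ((p = strchr(id, '.')) != NULL)
        *p = '\0';                  /* trim ".UUCP" and anything after it */
    else if ((p = strchr(id, '>')) != NULL)
        *p = '\0';
    else if ((p = strchr(id, ')')) != NULL)
        *p = '\0';
    if ((p = strchr(id, '@')) != NULL)
        *p = '.';

    /* atoi stops at the first non-digit, so only the leading number counts */
    bucket = atoi(id) % HASHVAL;

    n = snprintf(out, outlen, "%s/%d/%s", group, bucket, id);
    if (n < 0 || (size_t)n >= outlen)
        return -1;
    return bucket;
}
```

Called with the group "net.sources" and the made-up id "<2794@nsc.UUCP>", this sketch yields bucket 19 and the relative path net.sources/19/2794.nsc, since 2794 % 37 is 19. Ids whose leading number does not fit in an int are outside what atoi handles reliably, so a sturdier version would probably use strtol and check the range.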