File:  [Research Unix] / researchv10dc / cmd / sml / lib / lexgen / mlex_int.doc
Revision 1.1.1.1 (vendor branch): download - view: text, annotated - select for diffs
Tue Apr 24 17:21:33 2018 UTC (8 years, 1 month ago) by root
Branches: belllabs, MAIN
CVS tags: researchv10, HEAD
researchv10 Dan Cross

This is minimal documentation for the lexer driver produced by ml-lex.

Main data structures:

   The transition table is stored in tab.  Tab is an array of records, indexed
by state number.  The first field of the record, fin, is a list of final leaves
assocated with it.  The second field of the record, trans, is a transition
table for the state indexed by character number.  It gives the next state
for a given input character.

   The usual initial start state is state #1.  State 0 is a dead state, which
has transitions only to itself.

   The field yyfin has type yyfinstate list.  yyfinstate consists of the
following three constructors:

	* N of int - indicates normal end leaf.
	* D of int - dummy end leaf - for indicating when an end state for
	  a trailing context regular expression has been reached.  These are
	  stored and propagated backwards when action is executed.
	* T of int - indicates an actual end leaf for a trailing context reg.
	  expression, which should be executed only if D i was encountered
	  after this end leaf while scanning forward.  The dummy end leaf is
	  removed from the backward propagating list after this node is
	  encountered.

    
     The function scan inside the function lex operates as a transition
function, scanning the input until it is no longer possible to take any
more transitions.  It accumulates a list of the accepting leaf list 
associated with each accepting state passed through.

       Scan operates as follows:

	 Input: * s - current state
		* AcceptingLeaves - list of accepting leave lists.  Each state
		  has a list of accepting leaves associated with it.  This list
		  may be nil if the state is not a final state.
		* l - position of the next character in the buffer b to read
		* i0 - starting position in the buffer.

	Output: If no match is found, it raises the exception LexError. 
		Otherwise, it returns a value of type lexresult.

	It operates as a transtion function:
	     It (1) adds the list of accepting leaves for the current state to
	            the list of accepting leave lists
		(2) tries to make a transition on the current input character
		    to the next state.  If it can't make a transition, it 
		    executes the action function.
			(a) - if it is past the end of the buffer, it 
				(1) checks if it as at end eof.  If it is then:
				 	It checks to see if it has made any
					transitions since it was first called -
					(l>i0 when this is true.)  If it hasn't
					this implies that scan was called at
					the end of file.  It thus executes
					eof function declared by the user.
				        Otherwise it must execute action w/
					the current accepting state list.
				(2) otherwise it reads a block of up to 1024
				    characters, and appends this block to the
				    useful suffix of characters left in the
				    buffer (those character which have been
				    scanned in this call to lex()).  The buffer
				    operation should be altered if one intends
				    to process reg. expressions whose lexemes'
				    length will be >> 1024.  For most normal
				    applications, the buffer update operation
				    will be fine.

			  	    This buffer update operation requires
				    O(n^2/1024) char. copies for lexemes > 1024
				    characters in length, and O(n) char. copies 
				    for lexemes <= 1024 characters in length.
				    It can be made O(n) using linked list
				    buffers & a Byte.array of size n (not the
				     ^operator!) for concatenating the buffers
				    to return a value for yytext when a lexeme
				    is longer than the typical buffer length.

			(3) If the transition is to a dead state (0 is used
			    for the dead state), action is executed instead.

unix.superglobalmegacorp.com

This archive runs on limited infrastructure. Preserving old code on modern bandwidth. Automated agents are requested to crawl responsibly.