DIFF PROGRAM 
                       First version (10-9-92) 
 
   Philosophy 
 
      Starting in a state where let's say we have two pointers into 
   two files, we basically want to find two blocks in each one which 
   match.  There are two criteria: (1) how big are the blocks, and (2) 
   how many records did we have to skip to find the match?  It seems 
   to me that we cannot simply stop at the first block match, since to 
   continue our search might reveal a much larger block match.  On the 
   other hand, if we do find a block match of a certain size, we have 
   limit somehow the amount of searching we continue to do.  
 
      I propose at the start that we use the following rules: 
 
      (1) Define a search as any record-to-record comparison 
 
      (2) Define a successful comparison as the situation where 
            a record-to-record comparison succeeds.  Define the 
            blocksize as one plus the number of successive 
            successful comparisons.  Define this unit as a 
            comparison block.  
 
            Note: We will add one detail here which reflects the 
            structure of Ibex programs.  If the record which starts 
            the block contains only the word "end", this will not 
            be defined as a successful comparison.  
 
      (3) Define the file start pointers as pointing to one record 
            beyond the last record in each file about which an 
            accounting has been made.  At the beginning of the 
            program, the start pointers are set to 1 in both files.  
 
      (4) Define an accounting as the action of advancing the 
            start pointers.  Note that an accounting may simply 
            involve advancing the start pointers and reporting a 
            mismatch. (end of file condition) 
 
      (5) Define state1 as the condition where you have just 
            made an accounting (i.e., you have reported on a match 
            or mismatch), or you are at the beginning of both 
            files.  
 
      (6) Define the number of further searches as the number 
            of times you can conduct a search without finding 
            a successful comparison.  Define the count of further 
            searches as the counter for this set of searches.  
            Define the start pointers as pointing to the file 
            records where a this particular set of searches is 
            to be started.  Define the search pointers as pointing 
            to the file records which are about to be searched or 
            compared.  
 
      (7) When you enter state1, the number of further searches 
            you may conduct is 2001000.  If you make a successful 
            comparison before you reach this number, then you 
            enter state2.  If you do not, then you make an 
            accounting which says that the the set of 2000 records 
            pointed to be each of the start pointers is mutually 
            exclusive, i.e., no match of any kind can be found 
            amoung them.  This condition will stop the program.  
 
      (8) When you enter state2, you have in hand a successful 
            comparison with an associated blocksize.  The problem 
            is that you may not have found the best or most 
            significant comparison.  For example, the way the search 
            proceeds, you may find two records 50 records away in 
            file 1 which match the next two records in file 2 
            BEFORE you file a block of 20 record which is 30 records 
            away in each file.  You therefore need to reduce the 
            number of further searches according to the situation 
            and go on looking for a better block.  Since you already 
            have a successful comparison and you are looking for 
            a better one, this means you are entering a new state, 
            state3.  
 
      (9) You enter state3 either from itself or from state2.  In 
            either case, you have a successful comparison in hand 
            which is the best you have found so far.  You need to 
            recompute the number of further searches.  I propose 
            the following formula.  Define the controlling pointers 
            to be the value of search pointers for the current   
            successful comparison.  Then: 
 
                 m = controllingpoint1 - startpoint1 + 1 
                 n = controllingpoint2 - startpoint2 + 1 
                                                      
                                                     (m + n)^3/2 * 200 
                 number of further searches = ─────────────────────────────── 
                                              (m * n) * squareroot(blocksize) 
 
 
                               SELECTED EXAMPLES 
                             ═════════════════════ 
                          m  │   n  │  b  │  f   │   S 
                        ─────┼──────┼─────┼──────┼──────── 
                         10  │   1  │  1  │ 3.65 │  730 
                          5  │   5  │  1  │ 1.26 │  252 
                         20  │   1  │  3  │ 2.77 │  554 
                         10  │  10  │  5  │  .40 │   80 
                         40  │   1  │  2  │ 4.64 │  928 
                         20  │  20  │ 30  │  .12 │   24 
                         60  │   1  │  3  │ 4.58 │  916 
                         30  │  30  │ 20  │  .12 │   24 
                        200  │   1  │  2  │10.08 │ 2016 
                        100  │ 100  │ 50  │  .04 │    8 
                             │      │     │      │ 
 
            You will now conduct the number of further searches looking 
            for a larger comparison block.  The current blocksize becomes 
            the controlling blocksize.  The current comparison block 
            becomes the controlling comparison block.  
 
            The search will continue where it left off.  If you find 
            a new successful comparison, you must ascertain whether the 
            new blocksize is larger than the controlling blocksize.  
            If so, you have a found a new contolling comparison 
            block, and you must return to the top of state3 to see 
            how much longer the search must go on.  If not, you 
            must keep searching until another new successful comparison 
            is found or the number of further searches is reached.  
 
            When you have searched the full number of further searches 
            without finding a better successful comparison, your 
            search is finished and you may make an accounting 
            as described above.  
 
     (10) The order for making searches needs to be specified.  
            We have previously defined the search pointers as 
            pointing to the file records which are about to be 
            searched or compared and the start pointers as 
            pointing to the file records where a set of searches 
            was started.  It will be sufficient to define the 
            next value of the search pointers, given the current 
            values.  If the current values are m and n, and the 
            values of the start pointers are a and b, then compute 
            the offset y = n - b.  
               if y = 0 or m = number of records in file1 
                 n = (m - a) + b + 1 
                 if n > number of records in file2 
                   No more comparisons can be made.  Treat this 
                   situation the same as if the count of further 
                   searches = number of futher searches.  
                 end 
                 m = a 
               else 
                 m += 1 
                 n -= 1 
               end 
 
     (11) The last problem to deal with is the corner cases that 
            result when one or both of the pointer point to the 
            end of a file.  I think I will attempt to deal with 
            these in the actual writing of the code.  
 
 
 
 
#define MODS    1 

      str file1.80, file2.80, rec1.1000, rec2.1000, rec3.1000, rec4.1000 
      str esc.1 

      table X(80000), Y(80000) 

      int g,h,i,j,k,y,m,n,r,s,t 
      int max1, max2 
      int searchpoint1, searchpoint2 
      int startpoint1, startpoint2 
      int furthersearches, searchcount 
      int blocksize, controllingblocksize 
      int controllingpoint1, controllingpoint2 
      int recflag 
      real w, x 

      esc = chr(27) 

 
      Step 1: get files; load tables; set limits and counters; 
 
      putc     
      putc The purpose of this program is to compare two similar files.  The 
      putc maximum number of records allowed in each file is 80000.  
      putc       
      putc File 1?  
      getc file1 
      putc File 2?  
      getc file2 

      open [1,1] file1 
      loop for i = 1 to 80001 
        getf [1] rec1 
        rec1 = trm(rec1) 
        tput [X,i] ~rec1 
      repeat 
eof1: 
      max1 = i - 1 
      close [1] 

      open [2,1] file2 
      loop for i = 1 to 80001 
        getf [2] rec1 
        rec1 = trm(rec1) 
        tput [Y,i] ~rec1 
      repeat 
eof2: 
      max2 = i - 1 
      close [2] 
      if max1 = 0 
        putc File 1 is empty 
        stop 
      end 
      if max2 = 0 
        putc File 2 is empty 
        stop 
      end 

      startpoint1 = 1 
      startpoint2 = 1 

┌─────────────────────────────────────┐ 
│               State1                │ 
└─────────────────────────────────────┘ 

State1: 

      searchpoint1 = startpoint1 
      searchpoint2 = startpoint2 
      furthersearches = 2001000 
      searchcount = 0 

S1search: 

      recflag = 0 
      tget [X,searchpoint1] rec1 
      tget [Y,searchpoint2] rec2 

#if MODS 
      rec1 = trm(rec1) 
      rec2 = trm(rec2) 
#endif 

      if rec1 = rec2 and rec1 <> "" 
        if rec1 = "*" or rec1{1} = esc 
          recflag = 1 
        end 
        rec3 = rev(rec1) // pad(5) 
        if rec3{1,5} <> "dne  " 
          i = searchpoint1 
          j = searchpoint2 
          loop while i < max1 and j < max2 
            ++i 
            ++j 
            tget [X,i] rec1 
            tget [Y,j] rec2 
            if rec1 <> rec2 
              --i 
              goto AAA 
            end 
          repeat 
AAA: 
          blocksize = i - searchpoint1 + 1 

&A        dputc match found rec1 = ~searchpoint1  rec2 = ~searchpoint2  size = ~blocksize
&A        dputc Going to State2 

#if MODS 
          if blocksize > 1 
            goto State2 
          end 
#else 
          if recflag = 0 or blocksize > 1 
            goto State2 
          end 
#endif 
        end 
      end 

      ++searchcount 
      if searchcount = 40000 
        putc searching .b46  .b46  .b46   .b27 &a0C...  
      end     

      if searchcount = furthersearches 
        goto S1report 
      end 

      y = searchpoint2 - startpoint2  
      if y = 0 or searchpoint1 = max1 
        searchpoint2 = searchpoint1 - startpoint1 + startpoint2 + 1 + y 
        if searchpoint2 > max2 
          searchpoint1 = startpoint1 + searchpoint2 - max2 
          searchpoint2 = max2 
          if searchpoint1 > max1 
            goto S1report 
          end 
        else 
          searchpoint1 = startpoint1 
        end 
      else 
        ++searchpoint1 
        --searchpoint2 
      end 
      goto S1search 

S1report: 

      if searchcount = furthersearches 
        /* 2000 records in each file have been cross checked and 
        /* no matches of any kind have been found 
        putc 2000 records in each file have been cross checked and no 
        putc matches of any kind have been found.  let's face it, these 
        putc files are substantially different.  You should use some other 
        putc method of comparison 
        goto END 
      end 
      /* end of file has been encountered in both files 
      putc End of file has been encountered in both files with no (further) 
      putc matches found.  ...  
      i = max1 - startpoint1 
      j = max2 - startpoint2 
      if i > 50 or j > 50 
        putc More than 50 records are involved in this report.  
        putc Type "y" if you wish to see them.  
        getk k 
        k &= 0xffff 
        if k <> 'y' and k <> 'Y' 
          goto END 
        end 
      else 
        putc 
      end 
      putc 
      putc Mismatching groups of records 
      putc   File 1: records ~startpoint1  to ~max1 
      loop for i = startpoint1 to max1 
        tget [X,i] rec1 
        putc    ←~rec1 
      repeat 
      putc   File 2: records ~startpoint2  to ~max2 
      loop for i = startpoint2 to max2 
        tget [Y,i] rec1 
        putc    →~rec1 
      repeat 
      goto END 

┌─────────────────────────────────────┐ 
│               State2                │ 
└─────────────────────────────────────┘ 

State2: 

      controllingblocksize = blocksize 
      controllingpoint1 = searchpoint1 
      controllingpoint2 = searchpoint2 
      goto State3 

┌─────────────────────────────────────┐ 
│               State3                │ 
└─────────────────────────────────────┘ 

State3: 
   
                                       (m + n)^3/2 * 200 
   number of further searches = ─────────────────────────────── 
                                (m * n) * squareroot(blocksize) 

      m = controllingpoint1 - startpoint1 + 1 
      n = controllingpoint2 - startpoint2 + 1 
      w = flt(m + n) 
      x = flt(m * n) 
      w = pow(w, 1.5) * 200.0 
      w = w / x 
      x = flt(blocksize) 
      w = w / sqt(x) 
      furthersearches = fix(w) 
      if blocksize = 1 and recflag = 1 
        dputc furthersearchs = ~furthersearches 
        furthersearches *= 1000 
        dputc furthersearchs = ~furthersearches 
      end 
&A    dputc furthersearchs = ~furthersearches 


      searchcount = 0 
      if searchcount = furthersearches 
        goto S3report 
      end 

    /* advance the search pointers 

      y = searchpoint2 - startpoint2  
      if y = 0 or searchpoint1 = max1 
        searchpoint2 = searchpoint1 - startpoint1 + startpoint2 + 1 + y 
        if searchpoint2 > max2 
          searchpoint1 = startpoint1 + searchpoint2 - max2 
          searchpoint2 = max2 
          if searchpoint1 > max1 
            goto S3report 
          end 
        else 
          searchpoint1 = startpoint1 
        end 
      else 
        ++searchpoint1 
        --searchpoint2 
      end 

S3search: 

      tget [X,searchpoint1] rec1 
      tget [Y,searchpoint2] rec2 
      if rec1 = rec2 and rec1 <> "" 
        rec3 = rev(rec1) // pad(5) 
        if rec3{1,5} <> "dne  " 
          i = searchpoint1 
          j = searchpoint2 
          loop while i < max1 and j < max2 
            ++i 
            ++j 
            tget [X,i] rec1 
            tget [Y,j] rec2 
            if rec1 <> rec2 
              --i 
              goto BBB 
            end 
          repeat 
BBB: 
          blocksize = i - searchpoint1 + 1 

          if blocksize > controllingblocksize 
            if searchpoint1 > controllingpoint1 and searchpoint2 > controllingpoint2

          /* don't count new block if it is completely beyond the controlling block

            else 
              controllingblocksize = blocksize 
              controllingpoint1 = searchpoint1 
              controllingpoint2 = searchpoint2 
              goto State3 
            end 
          end 
        end 
      end 

      ++searchcount 
 
 
      y = searchcount / 10000 
      if rem = 0 
        putc ~y  ...  
      end 
 
 
      if searchcount = furthersearches 
        goto S3report 
      end 

      y = searchpoint2 - startpoint2  
      if y = 0 or searchpoint1 = max1 
        searchpoint2 = searchpoint1 - startpoint1 + startpoint2 + 1 + y 
        if searchpoint2 > max2 
          searchpoint1 = startpoint1 + searchpoint2 - max2 
          searchpoint2 = max2 
          if searchpoint1 > max1 
            goto S3report 
          end 
        else 
          searchpoint1 = startpoint1 
        end 
      else 
        ++searchpoint1 
        --searchpoint2 
      end 
      goto S3search 

S3report: 

      if startpoint1 = controllingpoint1 and startpoint2 = controllingpoint2 
        /* only at beginning of file 
        putc 
        if controllingblocksize = 1 
          putc Matching block of 1 record 
          putc  File 1: record 1     matches with 
          putc  File 2: record 1     
        else 
          putc Matching block of ~controllingblocksize  records 
          putc  File 1: records 1 to ~controllingblocksize     match with 
          putc  File 2: records 1 to ~controllingblocksize 
        end 
        putc 

        startpoint1 += controllingblocksize 
        startpoint2 += controllingblocksize 
        if startpoint1 > max1             
          if startpoint2 > max2 
            putc Files are identical 
            goto END 
          end 
          putc File 1 matches File 2 but is shorter 
          goto END 
        end 
        if startpoint2 > max2 
          putc File 2 matches File 1 but is shorter
          goto END 
        end 
        goto State1 
      end 

      if startpoint1 <> controllingpoint1 and startpoint2 <> controllingpoint2 
        putc Mismatching groups of records 
        if startpoint1 = controllingpoint1 - 1 
          putc   File 1: record ~startpoint1 
        else 
          putc   File 1: records ~startpoint1  to ~(controllingpoint1 - 1) 
        end 
        loop for i = startpoint1 to controllingpoint1 - 1 
          tget [X,i] rec1 
          putc    ←~rec1 
        repeat 
        if startpoint2 = controllingpoint2 - 1 
          putc   File 2: record ~startpoint2 
        else 
          putc   File 2: records ~startpoint2  to ~(controllingpoint2 - 1) 
        end 
        loop for i = startpoint2 to controllingpoint2 - 1 
          tget [Y,i] rec2 
          putc    →~rec2 
        repeat 
      else 
        if startpoint1 <> controllingpoint1 
          if startpoint1 = controllingpoint1 - 1 
            putc Unmatched record in File 1 
            putc   File 1: record ~startpoint1 
          else 
            putc Unmatched records in File 1 
            putc   File 1: records ~startpoint1  to ~(controllingpoint1 - 1) 
          end 
          loop for i = startpoint1 to controllingpoint1 - 1 
            tget [X,i] rec1 
            putc    ←~rec1 
          repeat 
        end 

        if startpoint2 <> controllingpoint2 
          if startpoint2 = controllingpoint2 - 1 
            putc Unmatched record in File 2 
            putc   File 2: record ~startpoint2 
          else 
            putc Unmatched records in File 2 
            putc   File 2: records ~startpoint2  to ~(controllingpoint2 - 1) 
          end 
          loop for i = startpoint2 to controllingpoint2 - 1 
            tget [Y,i] rec2 
            putc    →~rec2 
          repeat 
        end 
      end 


      i = controllingpoint1 + controllingblocksize - 1 
      j = controllingpoint2 + controllingblocksize - 1 
      putc 
      if controllingblocksize = 1 
        putc Matching block of 1 record 
        putc   File 1: record ~controllingpoint1        matches with 
        putc   File 2: record ~controllingpoint2 
      else 
        putc Matching block of ~controllingblocksize  records 
        putc   File 1: records ~controllingpoint1  to ~i     match with 
        putc   File 2: records ~controllingpoint2  to ~j 
      end 
      putc 

      startpoint1 = i + 1
      startpoint2 = j + 1
      if startpoint1 > max1 
        if startpoint2 > max2 
          putc      End of both files      
          goto END 
        end 
        putc End of file 1 
        goto END 
      end 
      if startpoint2 > max2 
        putc End of file 2 
        goto END 
      end 
      goto State1 

END:  putc 
      putc Type any key to exit 
      getk k 

      run