DIFF PROGRAM First version (10-9-92) Philosophy Starting in a state where let's say we have two pointers into two files, we basically want to find two blocks in each one which match. There are two criteria: (1) how big are the blocks, and (2) how many records did we have to skip to find the match? It seems to me that we cannot simply stop at the first block match, since to continue our search might reveal a much larger block match. On the other hand, if we do find a block match of a certain size, we have limit somehow the amount of searching we continue to do. I propose at the start that we use the following rules: (1) Define a search as any record-to-record comparison (2) Define a successful comparison as the situation where a record-to-record comparison succeeds. Define the blocksize as one plus the number of successive successful comparisons. Define this unit as a comparison block. Note: We will add one detail here which reflects the structure of Ibex programs. If the record which starts the block contains only the word "end", this will not be defined as a successful comparison. (3) Define the file start pointers as pointing to one record beyond the last record in each file about which an accounting has been made. At the beginning of the program, the start pointers are set to 1 in both files. (4) Define an accounting as the action of advancing the start pointers. Note that an accounting may simply involve advancing the start pointers and reporting a mismatch. (end of file condition) (5) Define state1 as the condition where you have just made an accounting (i.e., you have reported on a match or mismatch), or you are at the beginning of both files. (6) Define the number of further searches as the number of times you can conduct a search without finding a successful comparison. Define the count of further searches as the counter for this set of searches. Define the start pointers as pointing to the file records where a this particular set of searches is to be started. Define the search pointers as pointing to the file records which are about to be searched or compared. (7) When you enter state1, the number of further searches you may conduct is 2001000. If you make a successful comparison before you reach this number, then you enter state2. If you do not, then you make an accounting which says that the the set of 2000 records pointed to be each of the start pointers is mutually exclusive, i.e., no match of any kind can be found amoung them. This condition will stop the program. (8) When you enter state2, you have in hand a successful comparison with an associated blocksize. The problem is that you may not have found the best or most significant comparison. For example, the way the search proceeds, you may find two records 50 records away in file 1 which match the next two records in file 2 BEFORE you file a block of 20 record which is 30 records away in each file. You therefore need to reduce the number of further searches according to the situation and go on looking for a better block. Since you already have a successful comparison and you are looking for a better one, this means you are entering a new state, state3. (9) You enter state3 either from itself or from state2. In either case, you have a successful comparison in hand which is the best you have found so far. You need to recompute the number of further searches. I propose the following formula. Define the controlling pointers to be the value of search pointers for the current successful comparison. Then: m = controllingpoint1 - startpoint1 + 1 n = controllingpoint2 - startpoint2 + 1 (m + n)^3/2 * 200 number of further searches = ─────────────────────────────── (m * n) * squareroot(blocksize) SELECTED EXAMPLES ═════════════════════ m │ n │ b │ f │ S ─────┼──────┼─────┼──────┼──────── 10 │ 1 │ 1 │ 3.65 │ 730 5 │ 5 │ 1 │ 1.26 │ 252 20 │ 1 │ 3 │ 2.77 │ 554 10 │ 10 │ 5 │ .40 │ 80 40 │ 1 │ 2 │ 4.64 │ 928 20 │ 20 │ 30 │ .12 │ 24 60 │ 1 │ 3 │ 4.58 │ 916 30 │ 30 │ 20 │ .12 │ 24 200 │ 1 │ 2 │10.08 │ 2016 100 │ 100 │ 50 │ .04 │ 8 │ │ │ │ You will now conduct the number of further searches looking for a larger comparison block. The current blocksize becomes the controlling blocksize. The current comparison block becomes the controlling comparison block. The search will continue where it left off. If you find a new successful comparison, you must ascertain whether the new blocksize is larger than the controlling blocksize. If so, you have a found a new contolling comparison block, and you must return to the top of state3 to see how much longer the search must go on. If not, you must keep searching until another new successful comparison is found or the number of further searches is reached. When you have searched the full number of further searches without finding a better successful comparison, your search is finished and you may make an accounting as described above. (10) The order for making searches needs to be specified. We have previously defined the search pointers as pointing to the file records which are about to be searched or compared and the start pointers as pointing to the file records where a set of searches was started. It will be sufficient to define the next value of the search pointers, given the current values. If the current values are m and n, and the values of the start pointers are a and b, then compute the offset y = n - b. if y = 0 or m = number of records in file1 n = (m - a) + b + 1 if n > number of records in file2 No more comparisons can be made. Treat this situation the same as if the count of further searches = number of futher searches. end m = a else m += 1 n -= 1 end (11) The last problem to deal with is the corner cases that result when one or both of the pointer point to the end of a file. I think I will attempt to deal with these in the actual writing of the code. #define MODS 1 str file1.80, file2.80, rec1.1000, rec2.1000, rec3.1000, rec4.1000 str esc.1 table X(80000), Y(80000) int g,h,i,j,k,y,m,n,r,s,t int max1, max2 int searchpoint1, searchpoint2 int startpoint1, startpoint2 int furthersearches, searchcount int blocksize, controllingblocksize int controllingpoint1, controllingpoint2 int recflag real w, x esc = chr(27) Step 1: get files; load tables; set limits and counters; putc putc The purpose of this program is to compare two similar files. The putc maximum number of records allowed in each file is 80000. putc putc File 1? getc file1 putc File 2? getc file2 open [1,1] file1 loop for i = 1 to 80001 getf [1] rec1 rec1 = trm(rec1) tput [X,i] ~rec1 repeat eof1: max1 = i - 1 close [1] open [2,1] file2 loop for i = 1 to 80001 getf [2] rec1 rec1 = trm(rec1) tput [Y,i] ~rec1 repeat eof2: max2 = i - 1 close [2] if max1 = 0 putc File 1 is empty stop end if max2 = 0 putc File 2 is empty stop end startpoint1 = 1 startpoint2 = 1 ┌─────────────────────────────────────┐ │ State1 │ └─────────────────────────────────────┘ State1: searchpoint1 = startpoint1 searchpoint2 = startpoint2 furthersearches = 2001000 searchcount = 0 S1search: recflag = 0 tget [X,searchpoint1] rec1 tget [Y,searchpoint2] rec2 #if MODS rec1 = trm(rec1) rec2 = trm(rec2) #endif if rec1 = rec2 and rec1 <> "" if rec1 = "*" or rec1{1} = esc recflag = 1 end rec3 = rev(rec1) // pad(5) if rec3{1,5} <> "dne " i = searchpoint1 j = searchpoint2 loop while i < max1 and j < max2 ++i ++j tget [X,i] rec1 tget [Y,j] rec2 if rec1 <> rec2 --i goto AAA end repeat AAA: blocksize = i - searchpoint1 + 1 &A dputc match found rec1 = ~searchpoint1 rec2 = ~searchpoint2 size = ~blocksize &A dputc Going to State2 #if MODS if blocksize > 1 goto State2 end #else if recflag = 0 or blocksize > 1 goto State2 end #endif end end ++searchcount if searchcount = 40000 putc searching .b46 .b46 .b46 .b27 &a0C... end if searchcount = furthersearches goto S1report end y = searchpoint2 - startpoint2 if y = 0 or searchpoint1 = max1 searchpoint2 = searchpoint1 - startpoint1 + startpoint2 + 1 + y if searchpoint2 > max2 searchpoint1 = startpoint1 + searchpoint2 - max2 searchpoint2 = max2 if searchpoint1 > max1 goto S1report end else searchpoint1 = startpoint1 end else ++searchpoint1 --searchpoint2 end goto S1search S1report: if searchcount = furthersearches /* 2000 records in each file have been cross checked and /* no matches of any kind have been found putc 2000 records in each file have been cross checked and no putc matches of any kind have been found. let's face it, these putc files are substantially different. You should use some other putc method of comparison goto END end /* end of file has been encountered in both files putc End of file has been encountered in both files with no (further) putc matches found. ... i = max1 - startpoint1 j = max2 - startpoint2 if i > 50 or j > 50 putc More than 50 records are involved in this report. putc Type "y" if you wish to see them. getk k k &= 0xffff if k <> 'y' and k <> 'Y' goto END end else putc end putc putc Mismatching groups of records putc File 1: records ~startpoint1 to ~max1 loop for i = startpoint1 to max1 tget [X,i] rec1 putc ←~rec1 repeat putc File 2: records ~startpoint2 to ~max2 loop for i = startpoint2 to max2 tget [Y,i] rec1 putc →~rec1 repeat goto END ┌─────────────────────────────────────┐ │ State2 │ └─────────────────────────────────────┘ State2: controllingblocksize = blocksize controllingpoint1 = searchpoint1 controllingpoint2 = searchpoint2 goto State3 ┌─────────────────────────────────────┐ │ State3 │ └─────────────────────────────────────┘ State3: (m + n)^3/2 * 200 number of further searches = ─────────────────────────────── (m * n) * squareroot(blocksize) m = controllingpoint1 - startpoint1 + 1 n = controllingpoint2 - startpoint2 + 1 w = flt(m + n) x = flt(m * n) w = pow(w, 1.5) * 200.0 w = w / x x = flt(blocksize) w = w / sqt(x) furthersearches = fix(w) if blocksize = 1 and recflag = 1 dputc furthersearchs = ~furthersearches furthersearches *= 1000 dputc furthersearchs = ~furthersearches end &A dputc furthersearchs = ~furthersearches searchcount = 0 if searchcount = furthersearches goto S3report end /* advance the search pointers y = searchpoint2 - startpoint2 if y = 0 or searchpoint1 = max1 searchpoint2 = searchpoint1 - startpoint1 + startpoint2 + 1 + y if searchpoint2 > max2 searchpoint1 = startpoint1 + searchpoint2 - max2 searchpoint2 = max2 if searchpoint1 > max1 goto S3report end else searchpoint1 = startpoint1 end else ++searchpoint1 --searchpoint2 end S3search: tget [X,searchpoint1] rec1 tget [Y,searchpoint2] rec2 if rec1 = rec2 and rec1 <> "" rec3 = rev(rec1) // pad(5) if rec3{1,5} <> "dne " i = searchpoint1 j = searchpoint2 loop while i < max1 and j < max2 ++i ++j tget [X,i] rec1 tget [Y,j] rec2 if rec1 <> rec2 --i goto BBB end repeat BBB: blocksize = i - searchpoint1 + 1 if blocksize > controllingblocksize if searchpoint1 > controllingpoint1 and searchpoint2 > controllingpoint2 /* don't count new block if it is completely beyond the controlling block else controllingblocksize = blocksize controllingpoint1 = searchpoint1 controllingpoint2 = searchpoint2 goto State3 end end end end ++searchcount y = searchcount / 10000 if rem = 0 putc ~y ... end if searchcount = furthersearches goto S3report end y = searchpoint2 - startpoint2 if y = 0 or searchpoint1 = max1 searchpoint2 = searchpoint1 - startpoint1 + startpoint2 + 1 + y if searchpoint2 > max2 searchpoint1 = startpoint1 + searchpoint2 - max2 searchpoint2 = max2 if searchpoint1 > max1 goto S3report end else searchpoint1 = startpoint1 end else ++searchpoint1 --searchpoint2 end goto S3search S3report: if startpoint1 = controllingpoint1 and startpoint2 = controllingpoint2 /* only at beginning of file putc if controllingblocksize = 1 putc Matching block of 1 record putc File 1: record 1 matches with putc File 2: record 1 else putc Matching block of ~controllingblocksize records putc File 1: records 1 to ~controllingblocksize match with putc File 2: records 1 to ~controllingblocksize end putc startpoint1 += controllingblocksize startpoint2 += controllingblocksize if startpoint1 > max1 if startpoint2 > max2 putc Files are identical goto END end putc File 1 matches File 2 but is shorter goto END end if startpoint2 > max2 putc File 2 matches File 1 but is shorter goto END end goto State1 end if startpoint1 <> controllingpoint1 and startpoint2 <> controllingpoint2 putc Mismatching groups of records if startpoint1 = controllingpoint1 - 1 putc File 1: record ~startpoint1 else putc File 1: records ~startpoint1 to ~(controllingpoint1 - 1) end loop for i = startpoint1 to controllingpoint1 - 1 tget [X,i] rec1 putc ←~rec1 repeat if startpoint2 = controllingpoint2 - 1 putc File 2: record ~startpoint2 else putc File 2: records ~startpoint2 to ~(controllingpoint2 - 1) end loop for i = startpoint2 to controllingpoint2 - 1 tget [Y,i] rec2 putc →~rec2 repeat else if startpoint1 <> controllingpoint1 if startpoint1 = controllingpoint1 - 1 putc Unmatched record in File 1 putc File 1: record ~startpoint1 else putc Unmatched records in File 1 putc File 1: records ~startpoint1 to ~(controllingpoint1 - 1) end loop for i = startpoint1 to controllingpoint1 - 1 tget [X,i] rec1 putc ←~rec1 repeat end if startpoint2 <> controllingpoint2 if startpoint2 = controllingpoint2 - 1 putc Unmatched record in File 2 putc File 2: record ~startpoint2 else putc Unmatched records in File 2 putc File 2: records ~startpoint2 to ~(controllingpoint2 - 1) end loop for i = startpoint2 to controllingpoint2 - 1 tget [Y,i] rec2 putc →~rec2 repeat end end i = controllingpoint1 + controllingblocksize - 1 j = controllingpoint2 + controllingblocksize - 1 putc if controllingblocksize = 1 putc Matching block of 1 record putc File 1: record ~controllingpoint1 matches with putc File 2: record ~controllingpoint2 else putc Matching block of ~controllingblocksize records putc File 1: records ~controllingpoint1 to ~i match with putc File 2: records ~controllingpoint2 to ~j end putc startpoint1 = i + 1 startpoint2 = j + 1 if startpoint1 > max1 if startpoint2 > max2 putc End of both files goto END end putc End of file 1 goto END end if startpoint2 > max2 putc End of file 2 goto END end goto State1 END: putc putc Type any key to exit getk k run