Format for the CFT file
=======================

First, a note about version numbers. The original .CFT compression
program was written in zbex. It produced .CFT files with a version
number of 1.00. When .CFT compression was added to the DMUSE program,
the .CFT files it produced were given a version number of 1.01 to set
them apart from the files produced by zbex (although the files were,
in fact, identical).

Making a change 10/18/03

We would like to allow more than 255 pages in a Case 2 compression.
It turns out that there is an empty byte in Field 6, just before
Field 7, which contains the number of pages. We will commandeer
Field 6 and use it to store the high-order portion of the number of
pages, thus increasing the maximum number of pages to over 32K.

Making a change 04/12/06

Adding records to transmit source information and to allow melodic
searches has expanded the size of page files by about 21 percent.
The addition of "A P " pitch records in particular has changed the
dynamics of compression to the point where the compression process
itself was in need of a revision. A new compression algorithm has
been designed, which is about 50 percent more efficient than the old
algorithm. A zbex program has been written to implement the new
compression algorithm. The .CFT files it produces have been given
the version number 1.02. The zbex program that expands .CFT files
into .MPG files has been modified to work with both the version
1.00/1.01 files and the 1.02 files.

In contrast to the old compression algorithm, which made use of two
compression techniques, the new algorithm uses four. It is at least
twice as complicated and takes about ten times longer to run. For
this reason, we do not anticipate that it will be incorporated into
the DMUSE environment. The expansion program in the DMUSE
environment will be modified, however, to read version 1.02 files.
Version 1.01 files will still be buildable in the DMUSE environment
and will continue to be part of the standard.

I. Title and Offset information

   Field 1: (8 bytes)  Source of file, e.g. CCARH
   Field 2: (4 bytes)  Version number (compression technique, etc.)
            version 1.00 (0x100)  uses the original (pre-2006) list
                                  of extended character codes
            version 1.01 (0x101)  same as version 1.00, but generated
                                  by DMUSE
            version 1.02 (0x102)  modified (04/12/06) extended
                                  character codes plus a group of
                                  "common" records
   Field 3: (4 bytes)  Checksum number (for file verification)
   Field 4: (4 bytes)  Offset to File Organization section
   Field 5: (4 bytes)  Offset to Data Group offsets

II. Header information (offset = 24 bytes)

   Field 1: byte 1:    length (up to 100)
            bytes 2..: composer
   Field 2: byte 1:    length (up to 100)
            bytes 2..: work
   Field 3: byte 1:    length (up to 50)
            bytes 2..: type of representation, e.g.
                          Full Score
                          Complete Parts
                          Violin I
   Field 4: byte 1:    length (up to 50)
            bytes 2..: resolution, e.g. 300 dots/inch

III. File Organization (offset contained in bytes 17..20 of file)

   A CFT file may be one of three types.

   1. It may be a compression of a single file (generally not used).
   2. It may be a compression of a set of MPG page files from a
      single directory, e.g. the full score (pages) of a work with
      only one movement.
   3. It may be a compression of <n> directories, each containing a
      set of MPG files, e.g. the full score (pages) of a work with
      <n> movements; the set of parts (pages) for <n> instruments;
      the part for one instrument, divided into <n> movements.

   Field 1: (one byte)   0 = case 1
                         1 = case 2
                         2 = case 3

   Case 1: (version 1.00/1.01 only -- no Case 1 can be constructed
            in version 1.02)

   Field 2: (one byte)   0x00
   Field 3: (one byte)   0x00
   Field 4: byte 1:    length (up to 100)
            bytes 2..: sub-division name (optional)
                          e.g. Movement 1
                          e.g. Violin I
   Field 5: byte 1:    length (up to 40)
            bytes 2..: note size, e.g.
                          14 dots per staff line
                          21 dots per staff line
                          14 and 21 mixed
   Field 6: (one byte)   0x00
   Field 7: (one byte)   <page number of this MPG file>
   Field 8: (two bytes)  0x0002

   Case 2:

   Field 2: byte 1:    length (up to 13)
            bytes 2..: name of default path (directory) for expansion
   Field 3: (one byte)  0x00 = Field 4 contains one sub-division name
                        0x01 = Field 4 contains 2-byte contents-flag
                               plus multiple sub-division names
                        0x02 = Field 4 contains 2-byte contents-flag
                               plus one sub-division name
   Field 4: format-00
            byte 1:    length (up to 100)
            bytes 2..: sub-division name (optional)
                          e.g. Movement 1
                          e.g. Violin I
   Field 4: format-01
            bytes 1-2: number of sub-division names (= number of
                       pages)
            bytes 3-4: page number for "CONTENTS" (zero = no
                       "CONTENTS" page)
            bytes 5-6: number of bytes of sub-division names
            bytes 7--: byte 1:    length (up to 255)
                       bytes 2--: name (" " = no name)
   Field 4: format-02
            bytes 1-2: page number for "CONTENTS" (non-zero)
            byte 3:    length (up to 100)
            bytes 4..: sub-division name (optional)
   Field 5: byte 1:    length (up to 40)
            bytes 2..: note size, e.g.
                          14 dots per staff line
                          21 dots per staff line
                          14 and 21 mixed
   Field 6: (one byte)   <p> pages (high-order byte)
   Field 7: (one byte)   <p> pages
   Field 8: (two bytes)  version 1.00/1.01: 0x0002
                         version 1.02:      0x0003

   Case 3:

   Field 2: byte 1:    length (up to 13)
            bytes 2..: name of default path (directory) for expansion
   Field 3: (one byte)  <n> (1 to 255) number of sub-directories

   Fields 4..8 are repeated <n> times:

   Field 4: byte 1:    length (up to 100)
            bytes 2..: sub-division name
                          e.g. Movement 1
                          e.g. Violin I
   Field 5: byte 1:    length (up to 40)
            bytes 2..: note size, e.g.
                          14 dots per staff line
                          21 dots per staff line
                          14 and 21 mixed
   Field 6: byte 1:    length (up to 13)
            bytes 2..: name of default path (sub-directory) for
                       expansion
   Field 7: (one byte)   <p> pages
   Field 8: (two bytes)  data group number of first page of this
                         sub-directory

IV.
Data group offsets and lengths (offset contained in bytes 21..24 of
file)

   Each offset/length pair is stored in two 4-byte groups. Offsets
   are measured from the beginning of the file.

   Case 1: There are only 2 data groups, i.e. the compression key
           and the page data.

   Case 2: version 1.00/1.01: There are <p> + 1 data groups:
              (1) the compression key
              (2) data for <p> pages
           version 1.02: There are <p> + 2 data groups:
              (1) the compression key
              (2) data for the "common" records
              (3) data for <p> pages

   Case 3: version 1.00/1.01: There are

                 <p(1)> + <p(2)> + ... + <p(n)> + 1

           data groups, i.e. the compression key and the sum of all
           the pages from all of the sub-directories.

           version 1.02: There are

                 <p(1)> + <p(2)> + ... + <p(n)> + 2

           data groups:
              (1) the compression key
              (2) data for the "common" records
              (3) data for the sum of all the pages from all of the
                  sub-directories

V. Data groups

   The first data group is always the cipher.

   version 1.00/1.01:

   Each subsequent data group represents a page of music. Each page
   data group begins with three header fields.

   Field 1: byte 1:    length
            bytes 2..: name of file (for expansion purposes)
   Field 2: 4 bytes:   number of data records in file
            NOTE: This field is eliminated as unnecessary in
                  version 1.02.
   Field 3: 4 bytes:   length of data in bits

   version 1.02:

   For Cases 2 and 3, the second data group is always the group of
   "common" records.

   Field 1a: one byte: n = number of "Huffman" numbers
                       (0 = no common records)
   Field 1b: <n> 16-bit "Huffman" numbers
   Field 2:  2 bytes:  number of "common" data records
   Field 3:  4 bytes:  length of data in bits

   The "common" records have their own system of coding. All records
   start with a 2-bit code:

      00 = code complete line using standard compression
      01 = skip through the n-th blank (n = 1 to 8) and use standard
           compression for the remainder of the line
           (lastchar = blank)
      10 = use "xxx" coding; skip 2 + n (n = 0 to 7) x's
      11 = use "xxx" coding; assume the same total number of leading
           x's as in the previous record
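The data-group bookkeeping of section IV can be captured in a small helper. This is an illustrative sketch only; the function and its name are mine, not part of MAKECFT or DMUSE:

```python
def data_group_count(case, version, pages):
    """Number of 8-byte offset/length pairs listed in section IV.

    `pages` is the page count <p> for Case 2, or a list of the
    per-sub-directory page counts <p(i)> for Case 3.  Version 1.02
    (0x102) adds one extra group for the "common" records.
    """
    if case == 1:
        return 2                              # compression key + page data
    total = pages if isinstance(pages, int) else sum(pages)
    extra = 2 if version >= 0x102 else 1      # key (+ "common" records in 1.02)
    return total + extra
```

For example, a Case 3 version 1.02 file with sub-directories of 3, 4, and 5 pages has 3 + 4 + 5 + 2 = 14 data groups.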
   Codes 01 and 10 must be followed immediately by a 3-bit "skip"
   code. In addition, codes 10 and 11 require a further 3-bit code
   indicating the length (m = 1 to 8) of the encoding. Lengths
   longer than 8 are not allowed.

   The "xxx" coding is possible when the current record has the same
   length as the preceding record in the list. The rule is simple;
   comparison is byte-wise. When there is no change, the code is
   "x"; when there is a change and the new byte is a number, the
   code is that number; when there is a change and the new byte is
   not a number, the "xxx" coding fails.

   The overall algorithm is as follows:

   (1) Try to derive an "xxx" line for the record.
       a) If this record starts with 9 or fewer x's, you are done.
       b) If more than 9 x's, determine the length over which the
          two records match. If this sub-string contains 5 or more
          blanks, then use code 01 to represent the record.
       c) If b) fails, stick with the "xxx" line.
   (2) If the "xxx" representation fails, determine the length over
       which the two records match. If this length is 7 or more, and
       if the matching sub-string does not equal "@ LINE: " or
       "@ TEXT: " exactly, use code 01 to represent the string.
   (3) If (2) fails, use code 00 to represent the string.
   (4) If the previous representation was an "xxx" line, and if this
       line starts with the same number of x's, then the 2-bit "11"
       code for "xxx" can be used. Otherwise, the 5-bit "10bbb" code
       must be used.

   The "common" records are indexed by a number (1 to n < 5000).
   This number, which uncompressed would require 12 or 13 bits, can
   be compressed using a Huffman tree. The specification of this
   tree can be communicated using <n> 16-bit numbers. The best way
   to describe this is with an example. Below is a list of 17
   numbers, where the i-th number is the count of i-bit codes:

          1      0
          2      0
          3      1
          4      1
          5      3
          6      6
          7     24
          8     27
          9     33
         10     44
         11     97
         12    141
         13    249
         14    529
         15   1520
         16   2053
         17     62
             ------
              4790

   What this tells us is that this Huffman tree has:

          1  3-bit code
          1  4-bit code
          3  5-bit codes
          6  6-bit codes
             etc.

   This determines a unique Huffman coding scheme for the entire
   4790 numbers. When each of these codes is weighted by its
   frequency of occurrence, the result is typically a 1.6/1.0
   compression over the uncompressed 13-bit number.

   At the end of each line (record) in the page data encoding there
   is a single bit indicating how the next line is represented.

      0 = next line is represented by cipher encoding.
      1 = next line is a member of the "common" group. In this
          case, the member number is represented using the
          compression method above.

   Each subsequent data group represents a page of music. Each page
   data group begins with two header fields.

   Field 1: byte 1:    length
            bytes 2..: name of file (for expansion purposes)
   Field 2: 4 bytes:   length of data in bits
            NOTE: The version 1.00/1.01 Field 2 (number of data
                  records in file, 4 bytes) has been eliminated as
                  unnecessary.

VI. Description of cipher

   We start by assuming that the data consists solely of ASCII
   bytes between 32 and 126.

   In version 1.00/1.01:
   ---------------------

   The first act of the MAKECFT program is to read all files as one
   long byte string, inserting the byte 222 as "new line" and the
   byte 223 as "beginning of file." The bytes 128 to 221 are
   assigned to 94 of the most common byte combinations.
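The effect of this byte-combination substitution can be sketched as follows. The entries shown are a small sample of the table (the full version 1.00/1.01 list follows), and the greedy longest-match-first order is an assumption about how the compressor applies it, not a statement from the spec:

```python
# Illustrative subset of the version 1.00/1.01 byte-combination table;
# longest-match-first replacement is an assumption for this sketch.
COMBOS = {
    "0 0 0 ": 132,
    "A D ":   140,
    "42 ":    181,
    "0 ":     212,
    "\n":     222,     # byte 222 stands for "new line"
}

def substitute(text):
    """Rewrite plain text as a list of byte values, folding common
    combinations into their extended codes 128..221."""
    keys = sorted(COMBOS, key=len, reverse=True)   # longest match first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(COMBOS[k])
                i += len(k)
                break
        else:
            out.append(ord(text[i]))               # ordinary ASCII 32..126
            i += 1
    return out
```

With these sample entries, the line "A D 42 0 0 0 " plus a newline collapses from 14 bytes to the 4 codes 140, 181, 132, 222.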
      128 = "42 46 1 0 0"    160 = "-63 "    192 = "32→"
      129 = "28 46 1 0 0"    161 = "-74 "    193 = "42→"
      130 = "20 46 1 0 0"    162 = "576 "    194 = "43→"
      131 = "1 82 6913 "     163 = "288 "    195 = "44→"
      132 = "0 0 0 "         164 = "144 "    196 = "45→"
      133 = "K 0 0 "         165 = "-1 "     197 = "48→"
      134 = "1152 "          166 = "-2 "     198 = "49→"
      135 = "1729 "          167 = "-5 "     199 = "53→"
      136 = "2305 "          168 = "-7 "     200 = "54→"
      137 = "3457 "          169 = "10 "     201 = "59→"
      138 = "4609 "          170 = "14 "     202 = "60→"
      139 = "5185 "          171 = "15 "     203 = "63→"
      140 = "A D "           172 = "16 "     204 = "64→"
      141 = "J N "           173 = "20 "     205 = "65→"
      142 = "J R "           174 = "21 "     206 = "96→"
      143 = "J B "           175 = "25 "     207 = "0→"
      144 = "K 0 "           176 = "28 "     208 = "4→"
      145 = "K 1 "           177 = "30 "     209 = "6→"
      146 = "-10 "           178 = "31 "     210 = "8→"
      147 = "-11 "           179 = "32 "     211 = "K "
      148 = "-14 "           180 = "35 "     212 = "0 "
      149 = "-15 "           181 = "42 "     213 = "1 "
      150 = "-20 "           182 = "49 "     214 = "2 "
      151 = "-21 "           183 = "52 "     215 = "3 "
      152 = "-25 "           184 = "56 "     216 = "4 "
      153 = "-28 "           185 = "63 "     217 = "5 "
      154 = "-30 "           186 = "72 "     218 = "6 "
      155 = "-32 "           187 = " * "     219 = "7 "
      156 = "-35 "           188 = " - "     220 = "8 "
      157 = "-42 "           189 = "12→"     221 = "9 "
      158 = "-49 "           190 = "16→"     222 = new line
      159 = "-53 "           191 = "24→"     223 = beginning of file

   A transition profile (i,j) is then constructed for the data. For
   each of the 192 bytes (32 to 223) we know the number of times
   that byte (i) is followed by byte (j). For each byte (i), the (j)
   profile values are sorted by size, largest first. For each (i)
   byte, these reordered values of "j" are the ciphers. For example,
   suppose i = 32 (blank).
   The profile of bytes which follow a blank might be as follows:

      212 = "0 "      320 times   represented by code  00
       68 = "D"       168 times                        01
       50 = "2"       104 times                        10
       49 = "1"        68 times                        11 000
      132 = "0 0 0 "   44 times                        11 001
      213 = "1 "       18 times                        11 010
       67 = "C"        16 times                        11 011
       51 = "3"         9 times                        11 100
       52 = "4"         6 times                        11 101
       53 = "5"         4 times                        11 110
       54 = "6"         3 times                        11 111 000
       55 = "7"         2 times                        11 111 001
       56 = "8"         1 time                         11 111 010
       57 = "9"         1 time                         11 111 011
       76 = "L"         1 time                         11 111 100

   In this example, we say the method of representation is
   3-7-7-7-31-128, because the first 3 ciphers (212, 68, and 50) are
   represented as 2-bit codes, with the fourth 2-bit code (11) being
   an escape; the next 7 ciphers (49, 132, 213, 67, 51, 52, 53) are
   represented by the first 7 5-bit codes, with the eighth 5-bit
   code (11 111) being an escape to the next set.

   In this example, we can calculate the "average" number of bits
   representing the transition from the blank (i = 32) character as
   follows:

      320 x 2 =  640 bits
      168 x 2 =  336 bits
      104 x 2 =  208 bits
       68 x 5 =  340 bits
       44 x 5 =  220 bits
       18 x 5 =   90 bits
       16 x 5 =   80 bits
        9 x 5 =   45 bits
        6 x 5 =   30 bits
        4 x 5 =   20 bits
        3 x 8 =   24 bits
        2 x 8 =   16 bits
        1 x 8 =    8 bits
        1 x 8 =    8 bits
        1 x 8 =    8 bits
      -----      ------
        765        2073 bits

   Thus 765 transitions are represented by 2073 bits, or 260 bytes,
   a compression ratio of 2.95.

   In version 1.00/1.01, there are 8 possible methods of
   representation:

      (1) 1-1-3-3-15-128
      (2) 1-1-7-15-31-128
      (3) 1-3-3-7-31-128
      (4) 1-7-15-31-31-128
      (5) 3-3-3-7-31-128
      (6) 3-7-7-7-31-128
      (7) 7-7-7-7-31-128
      (8) 7-15-15-31-63-128

   For each character (i = 32 .. 127, and 128 .. 223), the method of
   representation which gives the highest compression ratio is
   chosen.

   The cipher itself is stored as follows: for each character
   (i = 32 .. 127, and 128 .. 223), the transition list is stored --
   most common to least common. The list is terminated by the 0x00
   byte.
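The bit accounting in the worked example above can be reproduced mechanically. This sketch (the function names are mine) derives the cumulative code widths of an escape-coded method such as 3-7-7-7-31-128 and totals the bits for a profile sorted largest-first:

```python
def tier_widths(sizes):
    """Cumulative code widths for an escape-coded method, e.g.
    3-7-7-7-31-128.  A tier of s symbols needs ceil(log2(s + 1))
    further bits, the extra pattern being the escape into the next
    tier; the last tier has no escape."""
    widths, total = [], 0
    for i, s in enumerate(sizes):
        slots = s if i == len(sizes) - 1 else s + 1
        total += (slots - 1).bit_length()
        widths.append(total)
    return widths

def total_bits(freqs, sizes):
    """Bits needed to encode a transition profile (sorted largest
    first) under one method of representation."""
    bits, rank = 0, 0
    for size, width in zip(sizes, tier_widths(sizes)):
        for _ in range(size):
            if rank == len(freqs):
                return bits
            bits += freqs[rank] * width
            rank += 1
    return bits
```

For the blank-character profile above (320, 168, 104, ..., 1), method 3-7-7-7-31-128 gives widths 2, 5, 8, 11, 16, 23 and a total of 2073 bits, matching the table.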
   If a particular character does not exist anywhere in the data
   (e.g., "q"), its transition list consists simply of the 0x00
   byte. Following the 0x00 byte is the "method byte", a single byte
   (0x00 to 0x07) representing one of the eight possible methods of
   representation.

   Decompression is simple and fast. Suppose we have decoded a blank
   (i = 32) character. We look at the method for blanks and see that
   it is method (6), or 3-7-7-7-31-128. We then look at the next two
   bits in the .CFT compression. If they are 0x00, 0x01, or 0x02, we
   know the next character. It is "0 ", "D", or "2" respectively.
   Otherwise (0x03), we must read the next three bits in the .CFT
   compression, etc.

   In version 1.02
   ---------------

   The algorithm for bit-wise compression is essentially the same as
   version 1.00/1.01. The number of byte combinations is expanded
   and re-organized.

        0 = New file         127 = "J D 0 "   170 = "-45 "   213 = "52 "
                             128 = "J M 0 "   171 = "-49 "   214 = "54 "
        1 = " 4 22 11 11 "   129 = "T 0 1 "   172 = "-53 "   215 = "63 "
        2 = "0 0 0 0 0 "     130 = "A P 1 "   173 = "-54 "   216 = "73 "
        3 = "1 82 6913 "     131 = "A P 2 "   174 = "-63 "   217 = "112"
        4 = "42 3 1 0 "      132 = "A P 3 "   175 = "-74 "   218 = "111"
        5 = "36 3 1 0 "      133 = "144 1 "   176 = "865 "   219 = "110"
        6 = "28 3 1 0 "      134 = "288 1 "   177 = "864 "   220 = "109"
        7 = "24 3 1 0 "      135 = "432 1 "   178 = "576 "   221 = "108"
        8 = "42 2 1 0 "      136 = "0 0 0 "   179 = "432 "   222 = "107"
        9 = "36 2 1 0 "      137 = "J N 4 "   180 = "288 "   223 = "103"
       10 = "28 2 1 0 "      138 = "6913 "    181 = "144 "   224 = "-9 "
       11 = "24 2 1 0 "      139 = "6049 "    182 = "A P "   225 = "-7 "
       12 = " 4 2 1 1 "      140 = "5761 "    183 = "A D "   226 = "-6 "
       13 = " * 0 21 "       141 = "5185 "    184 = "J B "   227 = "B -"
       14 = " * 0 18 "       142 = "4609 "    185 = "W 0 "   228 = "T -"
       15 = " * 0 14 "       143 = "4321 "    186 = "K 0 "   229 = "B 1"
       16 = " * 0 12 "       144 = "3457 "    187 = "10 "    230 = " * "
       17 = " 3 2 21 "       145 = "2593 "    188 = "11 "    231 = " - "
       18 = "@ LINE: "       146 = "2305 "    189 = "12 "    232 = " \!"
       19 = "@ TEXT: "       147 = "2304 "    190 = "13 "    233 = " \@"
       20 = "10000 0 "       148 = "2050 "    191 = "14 "    234 = " \#"
       21 = " 3 2 1 "        149 = "1729 "    192 = "15 "    235 = "0 0"
       22 = "J N 9 "         150 = "1728 "    193 = "16 "    236 = "\-"
       23 = "J N 8 "         151 = "1153 "    194 = "17 "    237 = "T "
       24 = "J N 7 "         152 = "1152 "    195 = "18 "    238 = "B "
       25 = "J N 6 "         153 = "J B 1"    196 = "19 "    239 = "K "
       26 = "J N 5 "         154 = "X 37 "    197 = "20 "    240 = "H "
       27 = "J R 9 "         155 = "X 31 "    198 = "21 "    241 = "L "
       28 = "J R 8 "         156 = " 2 2 "    199 = "22 "    242 = "S "
       29 = "J R 7 "         157 = "-11 "     200 = "23 "    243 = "9 "
       30 = "J R 6 "         158 = "-12 "     201 = "24 "    244 = "8 "
       31 = "J R 5 "         159 = "-14 "     202 = "25 "    245 = "7 "
                             160 = "-18 "     203 = "27 "    246 = "6 "
   32-126 = Lower ASCII      161 = "-21 "     204 = "28 "    247 = "5 "
                             162 = "-24 "     205 = "30 "    248 = "4 "
                             163 = "-27 "     206 = "31 "    249 = "3 "
                             164 = "-28 "     207 = "32 "    250 = "2 "
                             165 = "-30 "     208 = "35 "    251 = "1 "
                             166 = "-31 "     209 = "36 "    252 = "0 "
                             167 = "-35 "     210 = "42 "    253 = "nnn " (+10 bits)
                             168 = "-36 "     211 = "45 "    254 = "nnnn " (+10 bits)
                             169 = "-42 "     212 = "49 "    255 = New Line

   The number of representation methods has also been expanded.
   Version 1.02 recognizes the following 33 representation methods:

       1)   1---1---1---3--15--31-256
       2)   1---1---1---7---3--15-256
       3)   1---1---3---3--15-127-128
       4)   1---1---7---3---7--31-256
       5)   1---3---3---7--31-127-128
       6)   1---3---7--15--15--31-256
       7)   1---3---7---1--15--31-256
       8)   1---3--15--15--31--63-128
       9)   1---7---1---1--15--31-256
      10)   1---7---1---3--15--63-256
      11)   1---7---3---1---7--31-256
      12)   1---7---3--15--31--63-256
      13)   1---7---7---7--31--63-256
      14)   1---7--15--15--31--63-128
      15)   1---7--15--31--31--63-128
      16)   1--15--31--63--63--63--64
      17)   3---1---1---3---7--31-256
      18)   3---1---3---7--15--31-256
      19)   3---3---3---7--15--31-256
      20)   3---3---7--15--15--31-256
      21)   3---3--15--31--31--63-128
      22)   3---7---3---7--15--31-256
      23)   3---7---3--15--31--63-256
      24)   3--15--31--31--31--63-128
      25)   7---3---3---7--31--63-256
      26)   7---3---7---7--15--31-256
      27)   7---7--15--31--31--63-128
      28)   7--31--31--31--31--63-128
      29)  15---3---3--15--63--63-128
      30)  31---7---7--15--63--63-128
      31)  63--15--31--31--63--63--64
      32) 127--31--31--31--31---7--64
      33) 256---1---1---1---1---1--64

   In addition, there are 10 more "Huffman" style representations,
   each given by the number of codes of each bit length (an entry
   such as 211x20 denotes 211 codes of 20 bits; each column sums to
   256):

   Method ->   41     42     43     44     45     46     47     48     49     50
             ----   ----   ----   ----   ----   ----   ----   ----   ----   ----
      1         1      0      0      0      0      0      0      0      0      0
      2         0      2      2      1      1      0      0      0      0      0
      3         1      1      0      2      0      5      5      4      1      1
      4         2      2      5      4      6      5      2      7      5      0
      5         3      4      3      4      6      1      4      0      8     14
      6         5      4      3      4      6      2      4      1     10     14
      7         4      4      3      4      6      2      4      2     10     14
      8         4      4      3      4      6      2      4      2     10     14
      9         4      4      3      4      6      5      4      2     10     14
     10         4      4      3      4      6      0      4      5     10     14
     11         4      4      3      4      6      0      4      0     10      0
     12         4      4      1      4      6      0      4      0     10      0
     13         4      4      1      4      6      0      4      0      9      0
     14       216      4      3      0      0    234      4    233      1    171
              - 211x20 223x18 213x19 195x18      - 209x20      - 162x17      -
             ----   ----   ----   ----   ----   ----   ----   ----   ----   ----
              256    256    256    256    256    256    256    256    256    256

   In addition, the files are analyzed for an optimum set of
   "common" records, and these are represented in their own data
   set, following directly after the cipher. In the .CFT
   compression, each new record begins with one bit (0 = normal
   compression, 1 = common record).
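Decoding one value under an escape-tier layout like those above (or the version 1.00/1.01 methods such as 3-7-7-7-31-128) can be sketched as follows. This is an illustration, not the DMUSE implementation; the all-ones escape pattern in every tier but the last follows the worked blank-character example in section VI:

```python
def decode_rank(bits, sizes):
    """Read one cipher rank from `bits` (an iterator of 0/1 ints),
    using an escape-tier layout such as (3, 7, 7, 7, 31, 128)."""
    base = 0
    last = len(sizes) - 1
    for i, s in enumerate(sizes):
        slots = s if i == last else s + 1      # +1 slot for the escape
        width = (slots - 1).bit_length()       # bits read in this tier
        v = 0
        for _ in range(width):
            v = (v << 1) | next(bits)
        if i == last or v < s:                 # a real symbol, not the escape
            return base + v
        base += s                              # escape: skip to the next tier
    raise ValueError("ran past the last tier")
```

For layout 3-7-7-7-31-128, the bits 00 decode to rank 0 and 11 000 to rank 3, matching the blank-character example in section VI.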
The number of "common" records is determined at compression time and is represented in the "common" data group.
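The <n> 16-bit "Huffman" numbers of Field 1b (counts of codes per bit length, as in the worked example in section V) determine a unique coding. One standard reading is a canonical Huffman assignment, sketched here; the assignment rule (consecutive codes per length, shorter lengths first) is an assumption, since the spec states only that the counts determine a unique scheme:

```python
def canonical_codes(length_counts):
    """Build (code, bit_length) pairs from length_counts[i] = number
    of codes that are i+1 bits long -- the <n> 16-bit numbers of
    Field 1b in the "common" data group."""
    codes, next_code = [], 0
    for length, count in enumerate(length_counts, start=1):
        for _ in range(count):
            codes.append((next_code, length))
            next_code += 1
        next_code <<= 1            # lengthen by one bit for the next group
    return codes
```

With the leading counts from the section V example (0, 0, 1, 1, 3, 6, ...) this yields one 3-bit code, one 4-bit code, three 5-bit codes, six 6-bit codes, and so on, consistent with the tree described there.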