Format for the CFT file
=======================
First, a note about version numbers. The original .CFT compression
program was written in zbex. It produced .CFT files with a version
number 1.00. When .CFT compression was added to the DMUSE program,
the .CFT files it produced were given a version number of 1.01 to
set them apart from the files produced by zbex (although the files
were, in fact, identical).
Making a change 10/18/03
We would like to allow more than 255 pages in a Case 2 compression.
It turns out that there is an empty byte in Field 6, which is just
before Field 7, which contains the number of pages. We will commandeer
Field 6 and use it to store the "high end" portion of the number of
pages, thus increasing the maximum number of pages to over 32K.
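Reassembling the page count from the two bytes is a simple shift-and-or;
a minimal sketch (the helper name is hypothetical):

```python
def page_count(field6: int, field7: int) -> int:
    """Combine Field 6 (high-order byte) and Field 7 (low-order byte)
    into a single page count, raising the maximum to over 32K pages."""
    return (field6 << 8) | field7
```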
Making a change 04/12/06
Adding records to transmit source information and to allow melodic
searches has expanded the size of page files by about 21 percent.
In particular, the addition of "A P " pitch records has changed
the dynamics of compression to the point where the compression
process itself needed a revision. A new compression algorithm has
been designed, which is about 50 percent more efficient than the
old algorithm.
A zbex program has been written to implement the new compression
algorithm. The .CFT files it produces have been given the version
number 1.02. The zbex program to expand .CFT files into .MPG
files has been modified to work with both the version 1.00/1.01
files and the 1.02 files.
In contrast to the old compression algorithm, which made use of
two compression techniques, the new algorithm uses four compression
techniques. It is at least twice as complicated and takes about ten
times longer to run. For this reason, we do not anticipate that it
will be incorporated into the DMUSE environment. The expansion
program in the DMUSE environment will be modified, however, to
read version 1.02 files. Version 1.01 files will still be buildable
in the DMUSE environment and will continue to be part of the standard.
I. Title and Offset information
Field 1: (8 bytes) Source of file e.g. CCARH
Field 2: (4 bytes) Version number (compression technique, etc)
version 1.00 (0x100) uses the original (pre 2006) list of extended character codes
version 1.01 (0x101) same as version 1.00, but generated by DMUSE
version 1.02 (0x102) -> modified (04/12/06) extended character codes
plus a group of "common" records
Field 3: (4 bytes) Checksum number (for file verification)
Field 4: (4 bytes) Offset to File Organization section
Field 5: (4 bytes) Offset to Data Group offsets
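The five fields above can be unpacked directly; a sketch (little-endian
byte order is an assumption -- the spec does not state byte order):

```python
import struct

def parse_title_offsets(data: bytes):
    """Unpack the 24-byte Title and Offset section: 8-byte source,
    then four 4-byte values (version, checksum, two offsets)."""
    source = data[0:8].rstrip(b"\x00 ").decode("ascii")   # Field 1
    version, checksum, org_off, grp_off = struct.unpack("<4I", data[8:24])
    return source, version, checksum, org_off, grp_off
```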
II. Header information (offset = 24 bytes)
Field 1: byte 1: length (up to 100)
bytes 2..: composer
Field 2: byte 1: length (up to 100)
bytes 2..: work
Field 3: byte 1: length (up to 50)
bytes 2..: type of representation
e.g. Full Score
Complete Parts
Violin I
Field 4: byte 1: length (up to 50)
bytes 2..: resolution
e.g. 300 dots/inch
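All of the header fields share one wire shape: a length byte followed by
that many characters. A possible reader (the function name is illustrative):

```python
def read_lp_string(data: bytes, pos: int):
    """Read one length-prefixed field: byte 1 is the length,
    bytes 2.. are the text.  Returns (text, position after field)."""
    n = data[pos]
    return data[pos + 1:pos + 1 + n].decode("ascii"), pos + 1 + n
```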
III. File Organization (offset contained in bytes 17..20 of file)
A CFT file may be one of three types.
1. It may be a compression of a single file (generally not used)
2. It may be a compression of a set of MPG page files from a
single directory.
e.g. The full score (pages) of a work with only 1 movement
3. It may be a compression of <n> directories, each containing
a set of MPG files.
e.g. The full score (pages) of a work with <n> movements.
The set of parts (pages) for <n> instruments.
The part for one instrument, divided into <n> movements.
Field 1: (one byte) 0 = case 1
1 = case 2
2 = case 3
Case 1: (version 1.00/1.01 only -- no Case 1 can be constructed in version 1.02)
Field 2: (one byte) 0x00
Field 3: (one byte) 0x00
Field 4: byte 1: length (up to 100)
bytes 2..: sub-division name (optional)
e.g. Movement 1
e.g. Violin I
Field 5: byte 1: length (up to 40)
bytes 2..: note size
e.g. 14 dots per staff line
21 dots per staff line
14 and 21 mixed
Field 6: (one byte) 0x00
Field 7: (one byte) <page number of this MPG file>
Field 8: (two bytes) 0x0002
Case 2:
Field 2: byte 1: length (up to 13)
bytes 2..: Name of default path (directory) for expansion
Field 3: (one byte) 0x00 = Field 4 contains one sub-division name
0x01 = Field 4 contains 2-byte contents-flag
plus multiple sub-division names
0x02 = Field 4 contains 2-byte contents-flag
plus one sub-division name
Field 4: format-00
byte 1: length (up to 100)
bytes 2..: sub-division name (optional)
e.g. Movement 1
e.g. Violin I
Field 4: format-01
bytes 1-2: number of sub-division names (= number of pages)
bytes 3-4: page number for "CONTENTS"
(zero = no "CONTENTS" page)
bytes 5-6: number of bytes of sub-division names
bytes 7--: byte 1: length (up to 255)
bytes 2--: name (" " = no name)
Field 4: format-02
bytes 1-2: page number for "CONTENTS" (non-zero)
byte 3: length (up to 100)
bytes 4..: sub-division name (optional)
Field 5: byte 1: length (up to 40)
bytes 2..: note size
e.g. 14 dots per staff line
21 dots per staff line
14 and 21 mixed
Field 6: (one byte) <p> pages (high order byte)
Field 7: (one byte) <p> pages
Field 8: (two bytes)
version 1.00/1.01: 0x0002
version 1.02: 0x0003
Case 3:
Field 2: byte 1: length (up to 13)
bytes 2..: Name of default path (directory) for expansion
Field 3: (one byte) <n> (1 to 255) number of sub-directories
Fields 4..8: are repeated <n> times
Field 4: byte 1: length (up to 100)
bytes 2..: sub-division name
e.g. Movement 1
e.g. Violin I
Field 5: byte 1: length (up to 40)
bytes 2..: note size
e.g. 14 dots per staff line
21 dots per staff line
14 and 21 mixed
Field 6: byte 1: length (up to 13)
bytes 2..: Name of default path (sub-directory)
for expansion
Field 7: (one byte) <p> pages
Field 8: (two bytes) data group number of first page of this
sub-directory
IV. Data group offsets and lengths (offset contained in bytes 21..24 of file)
Each offset/length pair is stored in two 4-byte groups. Offsets are
measured from the beginning of the file.
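Reading the table of pairs is straightforward; a sketch (little-endian
is again an assumption):

```python
import struct

def read_group_table(data: bytes, table_off: int, n_groups: int):
    """Read <n_groups> offset/length pairs, each stored as two
    4-byte values.  Offsets are from the beginning of the file."""
    return [struct.unpack_from("<II", data, table_off + 8 * i)
            for i in range(n_groups)]
```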
Case 1: There are only 2 data groups: i.e. the compression key
and the page data
Case 2: version 1.00/1.01: There are <p> + 1 data groups:
(1) the compression key
(2) data for <p> pages
version 1.02: There are <p> + 2 data groups:
(1) the compression key
(2) data for the "common" records
(3) data for <p> pages
Case 3: version 1.00/1.01:
                     n
       There are    SUM  <p(i)>  +  1   data groups:
                    i=1
       i.e. the compression key and the sum of all the pages
       from all of the sub-directories
version 1.02:
                     n
       There are    SUM  <p(i)>  +  2   data groups:
                    i=1
       (1) the compression key
       (2) data for the "common" records
       (3) data for the sum of all the pages
           from all of the sub-directories
V. Data groups
The first data group is always the cipher.
version 1.00/1.01:
Each subsequent data group represents a page of music.
Each page data group begins with three header fields.
Field 1: byte 1: length
bytes 2..: name of file (for expansion purposes)
Field 2: 4 bytes: number of data records in file
NOTE: This field is eliminated as unnecessary in version 1.02
Field 3: 4 bytes: length of data in bits.
version 1.02:
For Cases 2 & 3, the second data group is always the group
of "common" records
Field 1a: One byte: n = number of "Huffman" numbers.
(0 = no common records)
Field 1b: <n> 16 bit "Huffman" numbers
Field 2: 2 bytes: number of "common" data records
Field 3: 4 bytes: length of data in bits.
The "common" records have their own system of coding. All
records start with a 2-bit code:
00 = code complete line using standard compression
01 = skip through the n-th blank (n = 1 to 8) and use
standard compression for remainder of line.
(lastchar = blank)
10 = use "xxx" coding. skip 2 + n (n = 0 to 7) x's
11 = use "xxx" coding. assume the same total number
of leading x's as in the previous record.
Codes 01 and 10 must be followed immediately by a 3-bit
"skip" code.
In addition, codes 10 and 11 require an additional 3-bit
code indicating the length (m = 1 to 8) of the encoding.
Lengths longer than 8 are not allowed.
The "xxx" coding is possible when the current record has
the same length as the preceding record in the list. The
rule is simple; comparison is byte-wise. When there is no
change, the code is "x"; when there is a change and the
new byte is a number, the code is that number; when there
is a change and the new byte is not a number, the "xxx"
coding fails.
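The byte-wise rule above can be stated directly in code; a sketch
(the function name is illustrative):

```python
def xxx_line(prev: str, cur: str):
    """Derive the "xxx" line for cur against prev: equal bytes
    become 'x', a changed byte that is a digit passes through, and
    any other change -- or a length mismatch -- makes the coding
    fail (returns None)."""
    if len(prev) != len(cur):
        return None
    out = []
    for a, b in zip(prev, cur):
        if a == b:
            out.append("x")
        elif b.isdigit():
            out.append(b)
        else:
            return None
    return "".join(out)
```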
The overall algorithm is as follows:
(1) Try to derive an "xxx" line for the record
a) if this record starts with 9 or fewer x's, you are done.
b) if more than 9 x's, determine the length over which the
two records match. If this sub-string contains 5 or
more blanks, then use code 01 to represent the record.
c) If b) fails, stick with the "xxx" line.
(2) If "xxx" representation fails, determine the length over
which the two records match. If this length is 7 or
more, and if the matching sub-string does not equal
"@ LINE: " or "@ TEXT: " exactly, use code2 (01) to
represent the string.
(3) If 2) fails, use code (00) to represent the string
(4) If the previous representation was an "xxx" line, and if
this line starts with the same number of x's, then the
2-bit "11" code for "xxx" can be used. Otherwise, the
5-bit "10bbb" code must be used.
The "common" records are indexed by a number (1 to n < 5000).
This number, which uncompressed would require 12 or 13 bits,
can be compressed using a Huffman tree.
The specification of this tree can be 1 0
communicated using <n> 16 bit numbers. 2 0
The best way to describe this is with 3 1
an example. At the right is a list of 4 1
17 numbers. What this tells us is that 5 3
this Huffman tree has: 6 6
7 24
1 3-bit code 8 27
1 4-bit code 9 33
3 5-bit codes 10 44
6 6-bit codes 11 97
12 141
etc. 13 249
14 529
This determines a unique Huffman coding 15 1520
scheme for the entire 4790 numbers. When 16 2053
each of this codes is weighted by its 17 62
frequency of occurance, the result is ------
typically a 1.6/1.0 compression over 4790
using the uncompressed 13-bit number.
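A counts-per-length list determines an actual code only once an
assignment convention is fixed. One standard convention is canonical
numbering, sketched below; that the .CFT programs use this particular
assignment is an assumption:

```python
def canonical_codes(counts):
    """Given counts[l-1] = number of codes of bit length l, assign
    canonical prefix codes: within a length, codes count upward;
    moving to the next length shifts left one bit.  Returns a list
    of (bit_length, code_value) pairs in index order."""
    codes = []
    code = 0
    for length, n in enumerate(counts, start=1):
        for _ in range(n):
            codes.append((length, code))
            code += 1
        code <<= 1          # advance to the next bit length
    return codes
```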
At the end of each line (record) in the page data encoding
there is a single bit indicating how the next line is
represented.
0 = next line is represented by cipher encoding.
1 = next line is a member of the "common" group.
In this case, the member number is represented
using the compression method above.
Each subsequent data group represents a page of music.
Each page data group begins with three header fields.
Field 1: byte 1: length
bytes 2..: name of file (for expansion purposes)
Field 2: 4 bytes: length of data in bits.
NOTE: The version 1.00/1.01 Field 2: number of data records
in file (4 bytes) has been eliminated as unnecessary
VI. Description of cipher
We start by assuming that the data consists solely of ASCII bytes
between 32 and 126.
In version 1.00/1.01:
----------------------
The first act of the MAKECFT program is to read all files as one
long byte string, inserting the byte 222 as "new line" and the
byte 223 as "beginning of file." The bytes 128 to 221 are assigned
to 94 of the most common byte combinations.
128 = "42 46 1 0 0" 160 = "-63 " 192 = "32→"
129 = "28 46 1 0 0" 161 = "-74 " 193 = "42→"
130 = "20 46 1 0 0" 162 = "576 " 194 = "43→"
131 = "1 82 6913 " 163 = "288 " 195 = "44→"
132 = "0 0 0 " 164 = "144 " 196 = "45→"
133 = "K 0 0 " 165 = "-1 " 197 = "48→"
134 = "1152 " 166 = "-2 " 198 = "49→"
135 = "1729 " 167 = "-5 " 199 = "53→"
136 = "2305 " 168 = "-7 " 200 = "54→"
137 = "3457 " 169 = "10 " 201 = "59→"
138 = "4609 " 170 = "14 " 202 = "60→"
139 = "5185 " 171 = "15 " 203 = "63→"
140 = "A D " 172 = "16 " 204 = "64→"
141 = "J N " 173 = "20 " 205 = "65→"
142 = "J R " 174 = "21 " 206 = "96→"
143 = "J B " 175 = "25 " 207 = "0→"
144 = "K 0 " 176 = "28 " 208 = "4→"
145 = "K 1 " 177 = "30 " 209 = "6→"
146 = "-10 " 178 = "31 " 210 = "8→"
147 = "-11 " 179 = "32 " 211 = "K "
148 = "-14 " 180 = "35 " 212 = "0 "
149 = "-15 " 181 = "42 " 213 = "1 "
150 = "-20 " 182 = "49 " 214 = "2 "
151 = "-21 " 183 = "52 " 215 = "3 "
152 = "-25 " 184 = "56 " 216 = "4 "
153 = "-28 " 185 = "63 " 217 = "5 "
154 = "-30 " 186 = "72 " 218 = "6 "
155 = "-32 " 187 = " * " 219 = "7 "
156 = "-35 " 188 = " - " 220 = "8 "
157 = "-42 " 189 = "12→" 221 = "9 "
158 = "-49 " 190 = "16→" 222 new line
159 = "-53 " 191 = "24→" 223 beginning of file
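The pre-pass amounts to a greedy dictionary substitution over each
line; a sketch using a three-entry excerpt of the 94-entry table
(longer entries listed first, so the scan prefers them):

```python
# Excerpt of the version 1.00/1.01 substitution table.
SUBS = {"K 0 0 ": 133, "K 0 ": 144, "1152 ": 134}

def substitute(line: str) -> bytes:
    """Replace common byte combinations with bytes 128..221 and
    terminate with byte 222 ("new line")."""
    out, i = bytearray(), 0
    while i < len(line):
        for text, code in SUBS.items():
            if line.startswith(text, i):
                out.append(code)
                i += len(text)
                break
        else:                      # no table entry matched here
            out.append(ord(line[i]))
            i += 1
    out.append(222)                # "new line" marker
    return bytes(out)
```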
A transition profile (i,j) is then constructed for the data. For
each of the 192 bytes (32 to 223) we know the number of times that
byte (i) is followed by byte (j).
For each byte (i), the (j) profile values are sorted by size, largest
first. For each (i) byte, these reordered values of "j" are the
ciphers. For example, suppose i = 32 (blank). The profile of bytes
which follow a blank might be as follows:
212 = "0 " 320 times represented by code 00
68 = "D" 168 times 01
50 = "2" 104 times 10
49 = "1" 68 times 11 000
132 = "0 0 0 " 44 times 11 001
213 = "1 " 18 times 11 010
67 = "C" 16 times 11 011
51 = "3" 9 times 11 100
52 = "4" 6 times 11 101
53 = "5" 4 times 11 110
54 = "6" 3 times 11 111 000
55 = "7" 2 times 11 111 001
56 = "8" 1 time 11 111 010
57 = "9" 1 time 11 111 011
76 = "L" 1 time 11 111 100
In this example, we say the method of representation is 3-7-7-7-31-128
because the first 3 ciphers (212, 68, and 50) are represented as
a 2-bit code, with the fourth code (11) being an escape; the
next 7 ciphers (49, 132, 213, 67, 51, 52, 53) are represented
by the 7 5-bit codes, with the eighth 5-bit code (11 111) being
an escape to the next set. In this example, we can calculate
the "average" number of bits representing the transition from
the blank (i = 32) character as follows:
320 x 2 = 640 bits
168 x 2 = 336 bits
104 x 2 = 208 bits
68 x 5 = 340 bits
44 x 5 = 220 bits
18 x 5 = 90 bits
16 x 5 = 80 bits
9 x 5 = 45 bits
6 x 5 = 30 bits
4 x 5 = 20 bits
3 x 8 = 24 bits
2 x 8 = 16 bits
1 x 8 = 8 bits
1 x 8 = 8 bits
1 x 8 = 8 bits
----- ------
765 2073 bits
Thus 765 transitions are represented by 2073 bits, or 260 bytes,
representing a compression ratio of 2.95.
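The same arithmetic applies to any profile/method pair; a sketch,
assuming each group of g ciphers adds ceil(log2(g+1)) code bits, as
in the worked example (the handling of the final group is simplified):

```python
def total_bits(counts, groups):
    """Total compressed bits for a sorted transition profile under
    an escape-chain method such as (3, 7, 7, 7, 31, 128): each group
    deepens the running code width, and every cipher in that group
    costs that many bits."""
    bits, width, i = 0, 0, 0
    for g in groups:
        width += g.bit_length()   # 3 -> +2 bits, 7 -> +3, 31 -> +5 ...
        for _ in range(g):
            if i == len(counts):
                return bits
            bits += counts[i] * width
            i += 1
    return bits
```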
In version 1.00/1.01, there are 8 possible methods of representation
(1) 1-1-3-3-15-128
(2) 1-1-7-15-31-128
(3) 1-3-3-7-31-128
(4) 1-7-15-31-31-128
(5) 3-3-3-7-31-128
(6) 3-7-7-7-31-128
(7) 7-7-7-7-31-128
(8) 7-15-15-31-63-128
For each character (i = 32 .. 127, and 128 .. 223), the method
of representation which gives the highest compression ratio is
chosen.
The cipher itself is stored as follows:
For each character (i = 32 .. 127, and 128 .. 223), the transition
list is stored -- most common to least common. The list is terminated
by the 0x00 byte. If a particular character does not exist anywhere
in the data (e.g., "q"), its transition list consists simply of the
0x00 byte. Following the 0x00 byte is the "method byte", a single
byte 0x00 to 0x07 representing one of the eight possible methods of
representation.
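Reading one character's stored entry back out follows directly; a
sketch (the function name is illustrative):

```python
def read_cipher(data: bytes, pos: int):
    """Read one transition list (most common first, terminated by
    0x00) followed by its method byte (0x00..0x07).  Returns
    (transition list, method, position after entry)."""
    trans = []
    while data[pos] != 0x00:
        trans.append(data[pos])
        pos += 1
    method = data[pos + 1]         # the byte after the terminator
    return trans, method, pos + 2
```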
De-compression is simple and fast. Suppose we have decoded a
blank (i = 32) character. We look at the method for blanks and
see that it is method (6), or 3-7-7-7-31-128. We then look at the
next two bits in the .CFT compression. If they are 0x00, 0x01,
or 0x02, we know the next character. It is "0 ", "D", or "2"
respectively. Otherwise (0x03), we must read the next three bits
in the .CFT compression, etc.
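The escape-chain lookup just described can be sketched as follows,
with the bit stream modeled as an iterator of 0/1 values (a sketch
of the expansion step, not the DMUSE implementation):

```python
def decode_symbol(bits, trans, groups=(3, 7, 7, 7, 31, 128)):
    """Decode one transition: read the current group's fixed-width
    code; the all-ones value is the escape to the next group."""
    base = 0
    for g in groups:
        width = g.bit_length()      # 3 -> 2 bits, 7 -> 3 bits, ...
        code = 0
        for _ in range(width):
            code = (code << 1) | next(bits)
        if code < g:                # a real code, not the escape
            return trans[base + code]
        base += g                   # escape: continue down the chain
    raise ValueError("ran past the last group")
```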
In version 1.02
-----------------
The algorithm for bit-wise compression is essentially the same as
version 1.00/1.01. The number of byte combinations is expanded and
re-organized.
0 = New file 127 = "J D 0 " 170 = "-45 " 213 = "52 "
128 = "J M 0 " 171 = "-49 " 214 = "54 "
1 = " 4 22 11 11 " 129 = "T 0 1 " 172 = "-53 " 215 = "63 "
2 = "0 0 0 0 0 " 130 = "A P 1 " 173 = "-54 " 216 = "73 "
3 = "1 82 6913 " 131 = "A P 2 " 174 = "-63 " 217 = "112"
4 = "42 3 1 0 " 132 = "A P 3 " 175 = "-74 " 218 = "111"
5 = "36 3 1 0 " 133 = "144 1 " 176 = "865 " 219 = "110"
6 = "28 3 1 0 " 134 = "288 1 " 177 = "864 " 220 = "109"
7 = "24 3 1 0 " 135 = "432 1 " 178 = "576 " 221 = "108"
8 = "42 2 1 0 " 136 = "0 0 0 " 179 = "432 " 222 = "107"
9 = "36 2 1 0 " 137 = "J N 4 " 180 = "288 " 223 = "103"
10 = "28 2 1 0 " 138 = "6913 " 181 = "144 " 224 = "-9 "
11 = "24 2 1 0 " 139 = "6049 " 182 = "A P " 225 = "-7 "
12 = " 4 2 1 1 " 140 = "5761 " 183 = "A D " 226 = "-6 "
13 = " * 0 21 " 141 = "5185 " 184 = "J B " 227 = "B -"
14 = " * 0 18 " 142 = "4609 " 185 = "W 0 " 228 = "T -"
15 = " * 0 14 " 143 = "4321 " 186 = "K 0 " 229 = "B 1"
16 = " * 0 12 " 144 = "3457 " 187 = "10 " 230 = " * "
17 = " 3 2 21 " 145 = "2593 " 188 = "11 " 231 = " - "
18 = "@ LINE: " 146 = "2305 " 189 = "12 " 232 = " \!"
19 = "@ TEXT: " 147 = "2304 " 190 = "13 " 233 = " \@"
20 = "10000 0 " 148 = "2050 " 191 = "14 " 234 = " \#"
21 = " 3 2 1 " 149 = "1729 " 192 = "15 " 235 = "0 0"
22 = "J N 9 " 150 = "1728 " 193 = "16 " 236 = "\-"
23 = "J N 8 " 151 = "1153 " 194 = "17 " 237 = "T "
24 = "J N 7 " 152 = "1152 " 195 = "18 " 238 = "B "
25 = "J N 6 " 153 = "J B 1" 196 = "19 " 239 = "K "
26 = "J N 5 " 154 = "X 37 " 197 = "20 " 240 = "H "
27 = "J R 9 " 155 = "X 31 " 198 = "21 " 241 = "L "
28 = "J R 8 " 156 = " 2 2 " 199 = "22 " 242 = "S "
29 = "J R 7 " 157 = "-11 " 200 = "23 " 243 = "9 "
30 = "J R 6 " 158 = "-12 " 201 = "24 " 244 = "8 "
31 = "J R 5 " 159 = "-14 " 202 = "25 " 245 = "7 "
160 = "-18 " 203 = "27 " 246 = "6 "
32-126 = Lower ASCII 161 = "-21 " 204 = "28 " 247 = "5 "
162 = "-24 " 205 = "30 " 248 = "4 "
163 = "-27 " 206 = "31 " 249 = "3 "
164 = "-28 " 207 = "32 " 250 = "2 "
165 = "-30 " 208 = "35 " 251 = "1 "
166 = "-31 " 209 = "36 " 252 = "0 "
167 = "-35 " 210 = "42 " 253 = "nnn " (+10 bits)
168 = "-36 " 211 = "45 " 254 = "nnnn " (+10 bits)
169 = "-42 " 212 = "49 " 255 = New Line
The number of representation methods has also been expanded. Version 1.02
recognizes the following 33 representation methods:
1) 1---1---1---3--15--31-256
2) 1---1---1---7---3--15-256
3) 1---1---3---3--15-127-128
4) 1---1---7---3---7--31-256
5) 1---3---3---7--31-127-128
6) 1---3---7--15--15--31-256
7) 1---3---7---1--15--31-256
8) 1---3--15--15--31--63-128
9) 1---7---1---1--15--31-256
10) 1---7---1---3--15--63-256
11) 1---7---3---1---7--31-256
12) 1---7---3--15--31--63-256
13) 1---7---7---7--31--63-256
14) 1---7--15--15--31--63-128
15) 1---7--15--31--31--63-128
16) 1--15--31--63--63--63--64
17) 3---1---1---3---7--31-256
18) 3---1---3---7--15--31-256
19) 3---3---3---7--15--31-256
20) 3---3---7--15--15--31-256
21) 3---3--15--31--31--63-128
22) 3---7---3---7--15--31-256
23) 3---7---3--15--31--63-256
24) 3--15--31--31--31--63-128
25) 7---3---3---7--31--63-256
26) 7---3---7---7--15--31-256
27) 7---7--15--31--31--63-128
28) 7--31--31--31--31--63-128
29) 15---3---3--15--63--63-128
30) 31---7---7--15--63--63-128
31) 63--15--31--31--63--63--64
32) 127--31--31--31--31---7--64
33) 256---1---1---1---1---1--64
In addition, there are 10 more "Huffman" style representation
methods, numbered 41 to 50. Each column below gives, for bit
lengths 1 to 14, the number of codes of that bit length; an
entry of the form <c>x<l> means <c> codes of bit length <l>.
Each column sums to 256.
  Method -> 41     42     43     44     45     46     47     48     49     50
          ----   ----   ----   ----   ----   ----   ----   ----   ----   ----
     1       1      0      0      0      0      0      0      0      0      0
     2       0      2      2      1      1      0      0      0      0      0
     3       1      1      0      2      0      5      5      4      1      1
     4       2      2      5      4      6      5      2      7      5      0
     5       3      4      3      4      6      1      4      0      8     14
     6       5      4      3      4      6      2      4      1     10     14
     7       4      4      3      4      6      2      4      2     10     14
     8       4      4      3      4      6      2      4      2     10     14
     9       4      4      3      4      6      5      4      2     10     14
    10       4      4      3      4      6      0      4      5     10     14
    11       4      4      3      4      6      0      4      0     10      0
    12       4      4      1      4      6      0      4      0     10      0
    13       4      4      1      4      6      0      4      0      9      0
    14     216      4      3      0      0    234      4    233      1    171
          211x20 223x18 213x19 195x18        209x20        162x17
          ----   ----   ----   ----   ----   ----   ----   ----   ----   ----
           256    256    256    256    256    256    256    256    256    256
In addition, the files are analyzed for an optimum set of "common"
records, and these are represented in their own data set following
directly after the cipher. In the .CFT compression, each new
record begins with one bit (0 = normal compression, 1 = common record).
The number of "common" records is determined at compression time and
is represented in the "common" data group.