Format for the CFT file
=======================

First, a note about version numbers. The original .CFT compression
program was written in zbex. It produced .CFT files with a version
number of 1.00. When .CFT compression was added to the DMUSE program,
the .CFT files it produced were given a version number of 1.01 to set
them apart from the files produced by zbex (although the files were,
in fact, identical).

Making a change 10/18/03

We would like to allow more than 255 pages in a Case 2 compression.
It turns out that there is an empty byte in Field 6, just before
Field 7, which contains the number of pages. We will commandeer
Field 6 and use it to store the high-order portion of the number of
pages, thus increasing the maximum number of pages to over 32K.

Making a change 04/12/06

Adding records to transmit source information and to allow melodic
searches has expanded the size of page files by about 21 percent.
The addition of "A P " pitch records in particular has changed the
dynamics of compression to the point where the compression process
itself was in need of a revision. A new compression algorithm has
been designed, which is about 50 percent more efficient than the old
algorithm. A zbex program has been written to implement the new
compression algorithm. The .CFT files it produces have been given
the version number 1.02. The zbex program that expands .CFT files
into .MPG files has been modified to work with both the version
1.00/1.01 files and the 1.02 files.

In contrast to the old compression algorithm, which made use of two
compression techniques, the new algorithm uses four. It is at least
twice as complicated and takes about ten times longer to run. For
this reason, we do not anticipate that it will be incorporated into
the DMUSE environment. The expansion program in the DMUSE
environment will be modified, however, to read version 1.02 files.
Version 1.01 files will still be buildable in the DMUSE environment
and will continue to be part of the standard.

I. Title and Offset information

   Field 1: (8 bytes)  Source of file, e.g. CCARH
   Field 2: (4 bytes)  Version number (compression technique, etc.)
            version 1.00 (0x100)  uses the original (pre-2006) list
                                  of extended character codes
            version 1.01 (0x101)  same as version 1.00, but generated
                                  by DMUSE
            version 1.02 (0x102)  modified (04/12/06) extended
                                  character codes plus a group of
                                  "common" records
   Field 3: (4 bytes)  Checksum number (for file verification)
   Field 4: (4 bytes)  Offset to File Organization section
   Field 5: (4 bytes)  Offset to Data Group offsets

II. Header information (offset = 24 bytes)

   Field 1: byte 1:    length (up to 100)
            bytes 2..: composer
   Field 2: byte 1:    length (up to 100)
            bytes 2..: work
   Field 3: byte 1:    length (up to 50)
            bytes 2..: type of representation, e.g.
                          Full Score
                          Complete Parts
                          Violin I
   Field 4: byte 1:    length (up to 50)
            bytes 2..: resolution, e.g. 300 dots/inch

III. File Organization (offset contained in bytes 17..20 of file)

   A CFT file may be one of three types.

   1. It may be a compression of a single file (generally not used).
   2. It may be a compression of a set of MPG page files from a
      single directory, e.g. the full score (pages) of a work with
      only one movement.
   3. It may be a compression of <n> directories, each containing a
      set of MPG files, e.g. the full score (pages) of a work with
      <n> movements; the set of parts (pages) for <n> instruments;
      the part for one instrument, divided into <n> movements.

   Field 1: (one byte)   0 = case 1
                         1 = case 2
                         2 = case 3

   Case 1: (version 1.00/1.01 only -- no Case 1 can be constructed
            in version 1.02)

   Field 2: (one byte)   0x00
   Field 3: (one byte)   0x00
   Field 4: byte 1:    length (up to 100)
            bytes 2..: sub-division name (optional)
                          e.g. Movement 1
                          e.g. Violin I
   Field 5: byte 1:    length (up to 40)
            bytes 2..: note size, e.g.
                          14 dots per staff line
                          21 dots per staff line
                          14 and 21 mixed
   Field 6: (one byte)   0x00
   Field 7: (one byte)   <page number of this MPG file>
   Field 8: (two bytes)  0x0002

   Case 2:

   Field 2: byte 1:    length (up to 13)
            bytes 2..: name of default path (directory) for expansion
   Field 3: (one byte)  0x00 = Field 4 contains one sub-division name
                        0x01 = Field 4 contains 2-byte contents-flag
                               plus multiple sub-division names
                        0x02 = Field 4 contains 2-byte contents-flag
                               plus one sub-division name
   Field 4: format-00
            byte 1:    length (up to 100)
            bytes 2..: sub-division name (optional)
                          e.g. Movement 1
                          e.g. Violin I
   Field 4: format-01
            bytes 1-2: number of sub-division names (= number of
                       pages)
            bytes 3-4: page number for "CONTENTS" (zero = no
                       "CONTENTS" page)
            bytes 5-6: number of bytes of sub-division names
            bytes 7--: byte 1:    length (up to 255)
                       bytes 2--: name (" " = no name)
   Field 4: format-02
            bytes 1-2: page number for "CONTENTS" (non-zero)
            byte 3:    length (up to 100)
            bytes 4..: sub-division name (optional)
   Field 5: byte 1:    length (up to 40)
            bytes 2..: note size, e.g.
                          14 dots per staff line
                          21 dots per staff line
                          14 and 21 mixed
   Field 6: (one byte)   <p> pages (high-order byte)
   Field 7: (one byte)   <p> pages
   Field 8: (two bytes)  version 1.00/1.01: 0x0002
                         version 1.02:      0x0003

   Case 3:

   Field 2: byte 1:    length (up to 13)
            bytes 2..: name of default path (directory) for expansion
   Field 3: (one byte)  <n> (1 to 255) number of sub-directories

   Fields 4..8 are repeated <n> times:

   Field 4: byte 1:    length (up to 100)
            bytes 2..: sub-division name
                          e.g. Movement 1
                          e.g. Violin I
   Field 5: byte 1:    length (up to 40)
            bytes 2..: note size, e.g.
                          14 dots per staff line
                          21 dots per staff line
                          14 and 21 mixed
   Field 6: byte 1:    length (up to 13)
            bytes 2..: name of default path (sub-directory) for
                       expansion
   Field 7: (one byte)   <p> pages
   Field 8: (two bytes)  data group number of first page of this
                         sub-directory

IV.
Data group offsets and lengths (offset contained in bytes 21..24 of
file)

   Each offset/length pair is stored in two 4-byte groups. Offsets
   are measured from the beginning of the file.

   Case 1: There are only 2 data groups, i.e. the compression key
           and the page data.

   Case 2: version 1.00/1.01: There are <p> + 1 data groups:
              (1) the compression key
              (2) data for <p> pages
           version 1.02: There are <p> + 2 data groups:
              (1) the compression key
              (2) data for the "common" records
              (3) data for <p> pages

   Case 3: version 1.00/1.01: There are

                 <p(1)> + <p(2)> + ... + <p(n)> + 1

           data groups, i.e. the compression key and the sum of all
           the pages from all of the sub-directories.

           version 1.02: There are

                 <p(1)> + <p(2)> + ... + <p(n)> + 2

           data groups:
              (1) the compression key
              (2) data for the "common" records
              (3) data for the sum of all the pages from all of the
                  sub-directories

V. Data groups

   The first data group is always the cipher.

   version 1.00/1.01:

   Each subsequent data group represents a page of music. Each page
   data group begins with three header fields.

   Field 1: byte 1:    length
            bytes 2..: name of file (for expansion purposes)
   Field 2: 4 bytes:   number of data records in file
            NOTE: This field is eliminated as unnecessary in
                  version 1.02.
   Field 3: 4 bytes:   length of data in bits

   version 1.02:

   For Cases 2 and 3, the second data group is always the group of
   "common" records.

   Field 1a: one byte: n = number of "Huffman" numbers
                       (0 = no common records)
   Field 1b: <n> 16-bit "Huffman" numbers
   Field 2:  2 bytes:  number of "common" data records
   Field 3:  4 bytes:  length of data in bits

   The "common" records have their own system of coding. All records
   start with a 2-bit code:

      00 = code complete line using standard compression
      01 = skip through the n-th blank (n = 1 to 8) and use standard
           compression for the remainder of the line
           (lastchar = blank)
      10 = use "xxx" coding; skip 2 + n (n = 0 to 7) x's
      11 = use "xxx" coding; assume the same total number of leading
           x's as in the previous record
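The data-group bookkeeping of section IV can be captured in a small helper. This is an illustrative sketch only; the function and its name are mine, not part of MAKECFT or DMUSE:

```python
def data_group_count(case, version, pages):
    """Number of 8-byte offset/length pairs listed in section IV.

    `pages` is the page count <p> for Case 2, or a list of the
    per-sub-directory page counts <p(i)> for Case 3.  Version 1.02
    (0x102) adds one extra group for the "common" records.
    """
    if case == 1:
        return 2                              # compression key + page data
    total = pages if isinstance(pages, int) else sum(pages)
    extra = 2 if version >= 0x102 else 1      # key (+ "common" records in 1.02)
    return total + extra
```

For example, a Case 3 version 1.02 file with sub-directories of 3, 4, and 5 pages has 3 + 4 + 5 + 2 = 14 data groups.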
   Codes 01 and 10 must be followed immediately by a 3-bit "skip"
   code. In addition, codes 10 and 11 require a further 3-bit code
   indicating the length (m = 1 to 8) of the encoding. Lengths
   longer than 8 are not allowed.

   The "xxx" coding is possible when the current record has the same
   length as the preceding record in the list. The rule is simple;
   comparison is byte-wise. When there is no change, the code is
   "x"; when there is a change and the new byte is a number, the
   code is that number; when there is a change and the new byte is
   not a number, the "xxx" coding fails.

   The overall algorithm is as follows:

   (1) Try to derive an "xxx" line for the record.
       a) If this record starts with 9 or fewer x's, you are done.
       b) If more than 9 x's, determine the length over which the
          two records match. If this sub-string contains 5 or more
          blanks, then use code 01 to represent the record.
       c) If b) fails, stick with the "xxx" line.
   (2) If the "xxx" representation fails, determine the length over
       which the two records match. If this length is 7 or more, and
       if the matching sub-string does not equal "@ LINE: " or
       "@ TEXT: " exactly, use code 01 to represent the string.
   (3) If (2) fails, use code 00 to represent the string.
   (4) If the previous representation was an "xxx" line, and if this
       line starts with the same number of x's, then the 2-bit "11"
       code for "xxx" can be used. Otherwise, the 5-bit "10bbb" code
       must be used.

   The "common" records are indexed by a number (1 to n < 5000).
   This number, which uncompressed would require 12 or 13 bits, can
   be compressed using a Huffman tree. The specification of this
   tree can be communicated using <n> 16-bit numbers. The best way
   to describe this is with an example. Below is a list of 17
   numbers, where the i-th number is the count of i-bit codes:

          1      0
          2      0
          3      1
          4      1
          5      3
          6      6
          7     24
          8     27
          9     33
         10     44
         11     97
         12    141
         13    249
         14    529
         15   1520
         16   2053
         17     62
             ------
              4790

   What this tells us is that this Huffman tree has:

          1  3-bit code
          1  4-bit code
          3  5-bit codes
          6  6-bit codes
             etc.

   This determines a unique Huffman coding scheme for the entire
   4790 numbers. When each of these codes is weighted by its
   frequency of occurrence, the result is typically a 1.6/1.0
   compression over the uncompressed 13-bit number.

   At the end of each line (record) in the page data encoding there
   is a single bit indicating how the next line is represented.

      0 = next line is represented by cipher encoding.
      1 = next line is a member of the "common" group. In this
          case, the member number is represented using the
          compression method above.

   Each subsequent data group represents a page of music. Each page
   data group begins with two header fields.

   Field 1: byte 1:    length
            bytes 2..: name of file (for expansion purposes)
   Field 2: 4 bytes:   length of data in bits
            NOTE: The version 1.00/1.01 Field 2 (number of data
                  records in file, 4 bytes) has been eliminated as
                  unnecessary.

VI. Description of cipher

   We start by assuming that the data consists solely of ASCII
   bytes between 32 and 126.

   In version 1.00/1.01:
   ---------------------

   The first act of the MAKECFT program is to read all files as one
   long byte string, inserting the byte 222 as "new line" and the
   byte 223 as "beginning of file." The bytes 128 to 221 are
   assigned to 94 of the most common byte combinations.
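The effect of this byte-combination substitution can be sketched as follows. The entries shown are a small sample of the table (the full version 1.00/1.01 list follows), and the greedy longest-match-first order is an assumption about how the compressor applies it, not a statement from the spec:

```python
# Illustrative subset of the version 1.00/1.01 byte-combination table;
# longest-match-first replacement is an assumption for this sketch.
COMBOS = {
    "0 0 0 ": 132,
    "A D ":   140,
    "42 ":    181,
    "0 ":     212,
    "\n":     222,     # byte 222 stands for "new line"
}

def substitute(text):
    """Rewrite plain text as a list of byte values, folding common
    combinations into their extended codes 128..221."""
    keys = sorted(COMBOS, key=len, reverse=True)   # longest match first
    out, i = [], 0
    while i < len(text):
        for k in keys:
            if text.startswith(k, i):
                out.append(COMBOS[k])
                i += len(k)
                break
        else:
            out.append(ord(text[i]))               # ordinary ASCII 32..126
            i += 1
    return out
```

With these sample entries, the line "A D 42 0 0 0 " plus a newline collapses from 14 bytes to the 4 codes 140, 181, 132, 222.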
      128 = "42 46 1 0 0"    160 = "-63 "    192 = "32→"
      129 = "28 46 1 0 0"    161 = "-74 "    193 = "42→"
      130 = "20 46 1 0 0"    162 = "576 "    194 = "43→"
      131 = "1 82 6913 "     163 = "288 "    195 = "44→"
      132 = "0 0 0 "         164 = "144 "    196 = "45→"
      133 = "K 0 0 "         165 = "-1 "     197 = "48→"
      134 = "1152 "          166 = "-2 "     198 = "49→"
      135 = "1729 "          167 = "-5 "     199 = "53→"
      136 = "2305 "          168 = "-7 "     200 = "54→"
      137 = "3457 "          169 = "10 "     201 = "59→"
      138 = "4609 "          170 = "14 "     202 = "60→"
      139 = "5185 "          171 = "15 "     203 = "63→"
      140 = "A D "           172 = "16 "     204 = "64→"
      141 = "J N "           173 = "20 "     205 = "65→"
      142 = "J R "           174 = "21 "     206 = "96→"
      143 = "J B "           175 = "25 "     207 = "0→"
      144 = "K 0 "           176 = "28 "     208 = "4→"
      145 = "K 1 "           177 = "30 "     209 = "6→"
      146 = "-10 "           178 = "31 "     210 = "8→"
      147 = "-11 "           179 = "32 "     211 = "K "
      148 = "-14 "           180 = "35 "     212 = "0 "
      149 = "-15 "           181 = "42 "     213 = "1 "
      150 = "-20 "           182 = "49 "     214 = "2 "
      151 = "-21 "           183 = "52 "     215 = "3 "
      152 = "-25 "           184 = "56 "     216 = "4 "
      153 = "-28 "           185 = "63 "     217 = "5 "
      154 = "-30 "           186 = "72 "     218 = "6 "
      155 = "-32 "           187 = " * "     219 = "7 "
      156 = "-35 "           188 = " - "     220 = "8 "
      157 = "-42 "           189 = "12→"     221 = "9 "
      158 = "-49 "           190 = "16→"     222 = new line
      159 = "-53 "           191 = "24→"     223 = beginning of file

   A transition profile (i,j) is then constructed for the data. For
   each of the 192 bytes (32 to 223) we know the number of times
   that byte (i) is followed by byte (j). For each byte (i), the (j)
   profile values are sorted by size, largest first. For each (i)
   byte, these reordered values of "j" are the ciphers. For example,
   suppose i = 32 (blank).
   The profile of bytes which follow a blank might be as follows:

      212 = "0 "      320 times   represented by code  00
       68 = "D"       168 times                        01
       50 = "2"       104 times                        10
       49 = "1"        68 times                        11 000
      132 = "0 0 0 "   44 times                        11 001
      213 = "1 "       18 times                        11 010
       67 = "C"        16 times                        11 011
       51 = "3"         9 times                        11 100
       52 = "4"         6 times                        11 101
       53 = "5"         4 times                        11 110
       54 = "6"         3 times                        11 111 000
       55 = "7"         2 times                        11 111 001
       56 = "8"         1 time                         11 111 010
       57 = "9"         1 time                         11 111 011
       76 = "L"         1 time                         11 111 100

   In this example, we say the method of representation is
   3-7-7-7-31-128, because the first 3 ciphers (212, 68, and 50) are
   represented as 2-bit codes, with the fourth 2-bit code (11) being
   an escape; the next 7 ciphers (49, 132, 213, 67, 51, 52, 53) are
   represented by the first 7 5-bit codes, with the eighth 5-bit
   code (11 111) being an escape to the next set.

   In this example, we can calculate the "average" number of bits
   representing the transition from the blank (i = 32) character as
   follows:

      320 x 2 =  640 bits
      168 x 2 =  336 bits
      104 x 2 =  208 bits
       68 x 5 =  340 bits
       44 x 5 =  220 bits
       18 x 5 =   90 bits
       16 x 5 =   80 bits
        9 x 5 =   45 bits
        6 x 5 =   30 bits
        4 x 5 =   20 bits
        3 x 8 =   24 bits
        2 x 8 =   16 bits
        1 x 8 =    8 bits
        1 x 8 =    8 bits
        1 x 8 =    8 bits
      -----      ------
        765        2073 bits

   Thus 765 transitions are represented by 2073 bits, or 260 bytes,
   a compression ratio of 2.95.

   In version 1.00/1.01, there are 8 possible methods of
   representation:

      (1) 1-1-3-3-15-128
      (2) 1-1-7-15-31-128
      (3) 1-3-3-7-31-128
      (4) 1-7-15-31-31-128
      (5) 3-3-3-7-31-128
      (6) 3-7-7-7-31-128
      (7) 7-7-7-7-31-128
      (8) 7-15-15-31-63-128

   For each character (i = 32 .. 127, and 128 .. 223), the method of
   representation which gives the highest compression ratio is
   chosen.

   The cipher itself is stored as follows: for each character
   (i = 32 .. 127, and 128 .. 223), the transition list is stored --
   most common to least common. The list is terminated by the 0x00
   byte.
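The bit accounting in the worked example above can be reproduced mechanically. This sketch (the function names are mine) derives the cumulative code widths of an escape-coded method such as 3-7-7-7-31-128 and totals the bits for a profile sorted largest-first:

```python
def tier_widths(sizes):
    """Cumulative code widths for an escape-coded method, e.g.
    3-7-7-7-31-128.  A tier of s symbols needs ceil(log2(s + 1))
    further bits, the extra pattern being the escape into the next
    tier; the last tier has no escape."""
    widths, total = [], 0
    for i, s in enumerate(sizes):
        slots = s if i == len(sizes) - 1 else s + 1
        total += (slots - 1).bit_length()
        widths.append(total)
    return widths

def total_bits(freqs, sizes):
    """Bits needed to encode a transition profile (sorted largest
    first) under one method of representation."""
    bits, rank = 0, 0
    for size, width in zip(sizes, tier_widths(sizes)):
        for _ in range(size):
            if rank == len(freqs):
                return bits
            bits += freqs[rank] * width
            rank += 1
    return bits
```

For the blank-character profile above (320, 168, 104, ..., 1), method 3-7-7-7-31-128 gives widths 2, 5, 8, 11, 16, 23 and a total of 2073 bits, matching the table.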
   If a particular character does not exist anywhere in the data
   (e.g., "q"), its transition list consists simply of the 0x00
   byte. Following the 0x00 byte is the "method byte", a single byte
   (0x00 to 0x07) representing one of the eight possible methods of
   representation.

   Decompression is simple and fast. Suppose we have decoded a blank
   (i = 32) character. We look at the method for blanks and see that
   it is method (6), or 3-7-7-7-31-128. We then look at the next two
   bits in the .CFT compression. If they are 0x00, 0x01, or 0x02, we
   know the next character. It is "0 ", "D", or "2" respectively.
   Otherwise (0x03), we must read the next three bits in the .CFT
   compression, etc.

   In version 1.02
   ---------------

   The algorithm for bit-wise compression is essentially the same as
   version 1.00/1.01. The number of byte combinations is expanded
   and re-organized.

        0 = New file         127 = "J D 0 "   170 = "-45 "   213 = "52 "
                             128 = "J M 0 "   171 = "-49 "   214 = "54 "
        1 = " 4 22 11 11 "   129 = "T 0 1 "   172 = "-53 "   215 = "63 "
        2 = "0 0 0 0 0 "     130 = "A P 1 "   173 = "-54 "   216 = "73 "
        3 = "1 82 6913 "     131 = "A P 2 "   174 = "-63 "   217 = "112"
        4 = "42 3 1 0 "      132 = "A P 3 "   175 = "-74 "   218 = "111"
        5 = "36 3 1 0 "      133 = "144 1 "   176 = "865 "   219 = "110"
        6 = "28 3 1 0 "      134 = "288 1 "   177 = "864 "   220 = "109"
        7 = "24 3 1 0 "      135 = "432 1 "   178 = "576 "   221 = "108"
        8 = "42 2 1 0 "      136 = "0 0 0 "   179 = "432 "   222 = "107"
        9 = "36 2 1 0 "      137 = "J N 4 "   180 = "288 "   223 = "103"
       10 = "28 2 1 0 "      138 = "6913 "    181 = "144 "   224 = "-9 "
       11 = "24 2 1 0 "      139 = "6049 "    182 = "A P "   225 = "-7 "
       12 = " 4 2 1 1 "      140 = "5761 "    183 = "A D "   226 = "-6 "
       13 = " * 0 21 "       141 = "5185 "    184 = "J B "   227 = "B -"
       14 = " * 0 18 "       142 = "4609 "    185 = "W 0 "   228 = "T -"
       15 = " * 0 14 "       143 = "4321 "    186 = "K 0 "   229 = "B 1"
       16 = " * 0 12 "       144 = "3457 "    187 = "10 "    230 = " * "
       17 = " 3 2 21 "       145 = "2593 "    188 = "11 "    231 = " - "
       18 = "@ LINE: "       146 = "2305 "    189 = "12 "    232 = " \!"
       19 = "@ TEXT: "       147 = "2304 "    190 = "13 "    233 = " \@"
       20 = "10000 0 "       148 = "2050 "    191 = "14 "    234 = " \#"
       21 = " 3 2 1 "        149 = "1729 "    192 = "15 "    235 = "0 0"
       22 = "J N 9 "         150 = "1728 "    193 = "16 "    236 = "\-"
       23 = "J N 8 "         151 = "1153 "    194 = "17 "    237 = "T "
       24 = "J N 7 "         152 = "1152 "    195 = "18 "    238 = "B "
       25 = "J N 6 "         153 = "J B 1"    196 = "19 "    239 = "K "
       26 = "J N 5 "         154 = "X 37 "    197 = "20 "    240 = "H "
       27 = "J R 9 "         155 = "X 31 "    198 = "21 "    241 = "L "
       28 = "J R 8 "         156 = " 2 2 "    199 = "22 "    242 = "S "
       29 = "J R 7 "         157 = "-11 "     200 = "23 "    243 = "9 "
       30 = "J R 6 "         158 = "-12 "     201 = "24 "    244 = "8 "
       31 = "J R 5 "         159 = "-14 "     202 = "25 "    245 = "7 "
                             160 = "-18 "     203 = "27 "    246 = "6 "
   32-126 = Lower ASCII      161 = "-21 "     204 = "28 "    247 = "5 "
                             162 = "-24 "     205 = "30 "    248 = "4 "
                             163 = "-27 "     206 = "31 "    249 = "3 "
                             164 = "-28 "     207 = "32 "    250 = "2 "
                             165 = "-30 "     208 = "35 "    251 = "1 "
                             166 = "-31 "     209 = "36 "    252 = "0 "
                             167 = "-35 "     210 = "42 "    253 = "nnn " (+10 bits)
                             168 = "-36 "     211 = "45 "    254 = "nnnn " (+10 bits)
                             169 = "-42 "     212 = "49 "    255 = New Line

   The number of representation methods has also been expanded.
   Version 1.02 recognizes the following 33 representation methods:

       1)   1---1---1---3--15--31-256
       2)   1---1---1---7---3--15-256
       3)   1---1---3---3--15-127-128
       4)   1---1---7---3---7--31-256
       5)   1---3---3---7--31-127-128
       6)   1---3---7--15--15--31-256
       7)   1---3---7---1--15--31-256
       8)   1---3--15--15--31--63-128
       9)   1---7---1---1--15--31-256
      10)   1---7---1---3--15--63-256
      11)   1---7---3---1---7--31-256
      12)   1---7---3--15--31--63-256
      13)   1---7---7---7--31--63-256
      14)   1---7--15--15--31--63-128
      15)   1---7--15--31--31--63-128
      16)   1--15--31--63--63--63--64
      17)   3---1---1---3---7--31-256
      18)   3---1---3---7--15--31-256
      19)   3---3---3---7--15--31-256
      20)   3---3---7--15--15--31-256
      21)   3---3--15--31--31--63-128
      22)   3---7---3---7--15--31-256
      23)   3---7---3--15--31--63-256
      24)   3--15--31--31--31--63-128
      25)   7---3---3---7--31--63-256
      26)   7---3---7---7--15--31-256
      27)   7---7--15--31--31--63-128
      28)   7--31--31--31--31--63-128
      29)  15---3---3--15--63--63-128
      30)  31---7---7--15--63--63-128
      31)  63--15--31--31--63--63--64
      32) 127--31--31--31--31---7--64
      33) 256---1---1---1---1---1--64

   In addition, there are 10 more "Huffman" style representations,
   each given by the number of codes of each bit length (an entry
   such as 211x20 denotes 211 codes of 20 bits; each column sums to
   256):

   Method ->   41     42     43     44     45     46     47     48     49     50
             ----   ----   ----   ----   ----   ----   ----   ----   ----   ----
      1         1      0      0      0      0      0      0      0      0      0
      2         0      2      2      1      1      0      0      0      0      0
      3         1      1      0      2      0      5      5      4      1      1
      4         2      2      5      4      6      5      2      7      5      0
      5         3      4      3      4      6      1      4      0      8     14
      6         5      4      3      4      6      2      4      1     10     14
      7         4      4      3      4      6      2      4      2     10     14
      8         4      4      3      4      6      2      4      2     10     14
      9         4      4      3      4      6      5      4      2     10     14
     10         4      4      3      4      6      0      4      5     10     14
     11         4      4      3      4      6      0      4      0     10      0
     12         4      4      1      4      6      0      4      0     10      0
     13         4      4      1      4      6      0      4      0      9      0
     14       216      4      3      0      0    234      4    233      1    171
              - 211x20 223x18 213x19 195x18      - 209x20      - 162x17      -
             ----   ----   ----   ----   ----   ----   ----   ----   ----   ----
              256    256    256    256    256    256    256    256    256    256

   In addition, the files are analyzed for an optimum set of
   "common" records, and these are represented in their own data
   set, following directly after the cipher. In the .CFT
   compression, each new record begins with one bit (0 = normal
   compression, 1 = common record).
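Decoding one value under an escape-tier layout like those above (or the version 1.00/1.01 methods such as 3-7-7-7-31-128) can be sketched as follows. This is an illustration, not the DMUSE implementation; the all-ones escape pattern in every tier but the last follows the worked blank-character example in section VI:

```python
def decode_rank(bits, sizes):
    """Read one cipher rank from `bits` (an iterator of 0/1 ints),
    using an escape-tier layout such as (3, 7, 7, 7, 31, 128)."""
    base = 0
    last = len(sizes) - 1
    for i, s in enumerate(sizes):
        slots = s if i == last else s + 1      # +1 slot for the escape
        width = (slots - 1).bit_length()       # bits read in this tier
        v = 0
        for _ in range(width):
            v = (v << 1) | next(bits)
        if i == last or v < s:                 # a real symbol, not the escape
            return base + v
        base += s                              # escape: skip to the next tier
    raise ValueError("ran past the last tier")
```

For layout 3-7-7-7-31-128, the bits 00 decode to rank 0 and 11 000 to rank 3, matching the blank-character example in section VI.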
The number of "common" records is determined at compression time and is represented in the "common" data group.
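The <n> 16-bit "Huffman" numbers of Field 1b (counts of codes per bit length, as in the worked example in section V) determine a unique coding. One standard reading is a canonical Huffman assignment, sketched here; the assignment rule (consecutive codes per length, shorter lengths first) is an assumption, since the spec states only that the counts determine a unique scheme:

```python
def canonical_codes(length_counts):
    """Build (code, bit_length) pairs from length_counts[i] = number
    of codes that are i+1 bits long -- the <n> 16-bit numbers of
    Field 1b in the "common" data group."""
    codes, next_code = [], 0
    for length, count in enumerate(length_counts, start=1):
        for _ in range(count):
            codes.append((next_code, length))
            next_code += 1
        next_code <<= 1            # lengthen by one bit for the next group
    return codes
```

With the leading counts from the section V example (0, 0, 1, 1, 3, 6, ...) this yields one 3-bit code, one 4-bit code, three 5-bit codes, six 6-bit codes, and so on, consistent with the tree described there.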