Humdrum lab 7

From CCARH Wiki

Regular Expressions

Basic Regular Expressions

"Basic" regular expressions are the original syntax used by grep, which came with Unix in 1973. Here are the "metacharacters" in the basic implementation of regular expressions:

Basic-regular-expressions.png

Dot metacharacter

The dot, or period, character is used to indicate any single character. In the following example, the regular expression "c.t" will match to any three characters which start with "c", end with "t", and have any single character between these two characters.

Basic-regular-expression-dot.png
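
The same pattern can be tried on the command line. This is a minimal sketch, assuming a POSIX shell and grep; the sample words are made up for illustration.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' cat cot c3t ct coat)

# Keep the lines that contain "c", any single character, then "t".
dot_matches=$(printf '%s\n' "$words" | grep 'c.t')
printf '%s\n' "$dot_matches"
# prints cat, cot and c3t
```

"ct" is rejected because nothing sits between the two letters, and "coat" is rejected because two characters sit between them.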


Star metacharacter

The star, or asterisk, character is used to indicate that the previous character (or parentheses group) may occur zero or more times in a matched string.

In the following example, the regular expression "c*t" will match to strings that contain zero or more "c" characters followed by the letter "t":

Basic-regular-expression-star.png

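Note that "*" needs something to its left to operate on. If the "*" comes at the start of a regular expression, there is nothing for it to repeat: some regular-expression implementations report this as an error, while grep's basic regular expressions treat a leading "*" as a literal asterisk.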
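
Here is a small sketch of the "c*t" pattern in use, assuming a POSIX shell and grep; the sample words are invented.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' t ct ccct cat dog)

# Keep the lines that contain zero or more "c" characters followed by "t".
star_matches=$(printf '%s\n' "$words" | grep 'c*t')
printf '%s\n' "$star_matches"
# prints t, ct, ccct and cat
```

Notice that "cat" is kept as well: grep looks for the pattern anywhere on the line, and the "t" in "cat" is preceded by zero "c" characters at that spot. Only "dog", which has no "t" at all, is rejected.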


Square-bracket metacharacters

Square brackets enclose a list of allowed characters. The bracketed list matches exactly one character in the search string, and that character must be one of those in the list.

Basic-regular-expression-square-brackets.png


There is more syntax related to square brackets. You can negate the list by adding "^" as the first character. For example, "[^aeiou]" matches to any single character that is not a lowercase vowel.

Basic-regular-expression-square-brackets-negate.png


Another syntax is a character range, such as "[0-9]" which is equivalent to "[0123456789]", or "[A-Ga-g]" which is equivalent to "[ABCDEFGabcdefg]".

Basic-regular-expression-square-brackets-range.png
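
The three square-bracket forms (list, negated list, and range) can be compared side by side. This is a sketch under the assumption of a POSIX shell and grep, with made-up sample words.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' cat cot cut cet c3t cxt)

# List: the middle character must be "a", "o" or "u".
list_matches=$(printf '%s\n' "$words" | grep 'c[aou]t')

# Negated list: the middle character must NOT be a lowercase vowel.
negated_matches=$(printf '%s\n' "$words" | grep 'c[^aeiou]t')

# Range: the middle character must be a digit.
range_matches=$(printf '%s\n' "$words" | grep 'c[0-9]t')

printf 'list: %s\n' $list_matches        # cat cot cut
printf 'negated: %s\n' $negated_matches  # c3t cxt
printf 'range: %s\n' $range_matches      # c3t
```

In each case the bracket expression stands for exactly one character of the matched string.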


Caret metacharacter

The caret metacharacter (^) is a line *anchor*. This character indicates that the matched characters (that follow) must occur at the start of the line. Notice that this character does double duty, as it is also the negation metacharacter when at the start of a list in square brackets!

In the following example "^cat" matches to the first occurrence of "cat" on the line:

Basic-regular-expression-square-brackets-carat.png


Dollar metacharacter

The dollar metacharacter ($) is another line *anchor*. This character indicates that the matched characters (that precede) must occur at the end of the line.

In the following example "cat$" matches to the second occurrence of "cat" on the line:

Basic-regular-expression-square-brackets-dollar.png
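
The two line anchors can be tried together. This is a minimal sketch, assuming a POSIX shell and grep; the three sample lines are invented.

```shell
# Made-up sample lines; all three contain "cat" somewhere.
sample() { printf '%s\n' 'cat and dog' 'dog and cat' 'concatenate'; }

# Without an anchor, every line matches.
plain_count=$(sample | grep -c 'cat')

# "^cat" requires "cat" at the start of the line.
start_matches=$(sample | grep '^cat')

# "cat$" requires "cat" at the end of the line.
end_matches=$(sample | grep 'cat$')

printf '%s\n' "$plain_count" "$start_matches" "$end_matches"
# prints 3, then "cat and dog", then "dog and cat"
```

"concatenate" contains "cat" only in the middle of the line, so neither anchored pattern matches it.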


Backslash metacharacter

The backslash metacharacter is used to un-metafy a metacharacter, turning it into a normal character. For example "c*" means zero or more letter c's, while "c\*" means the letter c followed by an asterisk in a matched string:


Basic-regular-expression-backslash.png
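
The difference between "c*" and "c\*" shows up clearly on a few test lines. This is a sketch assuming a POSIX shell and grep, with made-up input.

```shell
# Made-up sample lines; 'c*' is quoted so the shell passes it through unchanged.
sample() { printf '%s\n' c ccc 'c*' dog; }

# "c\*" needs a literal asterisk right after a "c".
literal_matches=$(sample | grep 'c\*')

# "c*" means zero or more "c" characters, which every line satisfies.
star_count=$(sample | grep -c 'c*')

printf '%s\n' "$literal_matches" "$star_count"
# prints c* and then 4
```

Even "dog" is counted by the second search, since zero "c" characters is an acceptable match.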


Dot-star metacharacter combination

A dot followed by a star means "anything". The dot means any one character, and the star after it means "zero or more" of any one character. So ".*" will match to absolutely anything, including an empty line.
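
As a quick check of the dot-star combination, here is a sketch assuming a POSIX shell and grep; the sample lines, including one empty line, are made up.

```shell
# Made-up sample lines; the fourth line is empty.
sample() { printf '%s\n' ct cat concert '' tic; }

# ".*" on its own matches every line, even the empty one.
all_count=$(sample | grep -c '.*')

# "c.*t" means "c", then anything (possibly nothing), then "t".
between_matches=$(sample | grep 'c.*t')

printf '%s\n' "$all_count" "$between_matches"
# prints 5, then ct, cat and concert
```

"tic" is rejected by the second search because no "t" comes after its "c".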

Using Basic regular expressions

Finding Humdrum files which contain a minor key designation:

   grep -l '^\*[a-g][#-]*:'  *.krn

The -l option for grep means to only show the filename, not the actual matched line(s).

Also notice that the regular expression is enclosed in single quotes. This is usually the safest thing to do; otherwise, the bash shell may try to sneak a look into the regular expression and try to change things due to its own metacharacters. Putting single quotes around it will tell the shell to not treat any characters inside of the quotes as any of its metacharacters.

Getting a list of files containing a major key designation:

   grep -l '^\*[A-G][#-]*:' *.krn

The capital letters for the pitch name indicate major keys in Humdrum **kern data.

To search for files that have any sort of key designation:

    grep -l '^\*[A-Ga-g][#-]*:' *.krn

This matches to major (A-G) and minor (a-g) keys. Another way to do this search is:

    grep -il '^\*[a-g][#-]*:' *.krn

The -i option means to ignore the case of the letters, so lower and upper case letters are equivalent when matching in a string.
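
The key-designation searches above can be tried without any real scores. The following is only a sketch: the three file names and their contents are made up, and it assumes a POSIX shell, grep, and the mktemp command. Setting LC_ALL=C is a precaution, because in some locales a range such as [a-g] may also match capital letters.

```shell
# Byte-based matching, so that [a-g] cannot match capital letters.
export LC_ALL=C

# Hypothetical sample files in a scratch directory (names and contents are invented).
dir=$(mktemp -d)
printf '%s\n' '**kern' '*a:'   '4a'  '*-' > "$dir/minor.krn"
printf '%s\n' '**kern' '*F#:'  '4f#' '*-' > "$dir/major.krn"
printf '%s\n' '**kern' '*M4/4' '4c'  '*-' > "$dir/nokey.krn"

# The same searches as in the text, run inside the scratch directory.
minor_files=$(cd "$dir" && grep -l '^\*[a-g][#-]*:' *.krn)
major_files=$(cd "$dir" && grep -l '^\*[A-G][#-]*:' *.krn)
any_files=$(cd "$dir" && grep -l '^\*[A-Ga-g][#-]*:' *.krn)
any_files_i=$(cd "$dir" && grep -il '^\*[a-g][#-]*:' *.krn)

printf 'minor: %s\n' $minor_files   # minor.krn
printf 'major: %s\n' $major_files   # major.krn
printf 'any:   %s\n' $any_files     # major.krn and minor.krn

rm -r "$dir"
```

nokey.krn is never listed: its only interpretation lines are "**kern", "*M4/4" and "*-", none of which has a pitch letter directly after the first asterisk followed by a colon.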

The example regular expression is not 100% correct, as it would be possible to match to a nonsense key designation such as "*F-#:" which is F-flat-sharp major. But as this is nonsense, it is not expected in the data, so not really a problem. "*F##:" is allowed, meaning F-double-sharp major. This is not a particularly common key signature, and the sanity of the composer or their music editor should probably be checked if it is used...

"grep" means Global Regular Expression Print

Extended Regular Expressions

Basic regular expressions were popular and useful, so in 1975 extended regular expressions were developed to add more possibilities. Extended regular expressions add the following metacharacters to the basic set:


Extended-regular-expressions.png


Question metacharacter

The question mark requires either 0 or 1 of the previous item to be present in a matched string. Think of it like a yes/no question, with the preceding character being optional:

Extended-regular-expression-question.png
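
To try the question metacharacter with grep, the -E option selects extended regular expressions. This is a sketch assuming a POSIX shell and grep; the sample words are made up, and the pattern is anchored with ^ and $ so that the whole line has to match.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' ct cat caat)

# "a?" allows zero or one "a" between the "c" and the "t".
question_matches=$(printf '%s\n' "$words" | grep -E '^ca?t$')
printf '%s\n' "$question_matches"
# prints ct and cat
```

"caat" is rejected because it has two "a" characters, one more than "?" allows.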


Plus metacharacter

The plus metacharacter is similar to the star metacharacter, except that there must be at least one of the preceding item in a matched string:

Extended-regular-expression-plus.png
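
The plus metacharacter can be tried the same way, using grep -E for extended regular expressions. This is a sketch assuming a POSIX shell and grep, with made-up sample words and a pattern anchored to the whole line.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' ct cat caat)

# "a+" requires one or more "a" characters between the "c" and the "t".
plus_matches=$(printf '%s\n' "$words" | grep -E '^ca+t$')
printf '%s\n' "$plus_matches"
# prints cat and caat
```

"ct" is rejected here, although the pattern "^ca*t$" would accept it.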

Curly-bracket metacharacters

The curly-bracket metacharacters are used to fully generalize the counting operators. Within the curly braces can occur one or two numbers. "{4}" means exactly four of the previous item are required in a matched string. "{2,5}" means that the previous item must occur at least twice but not more than 5 times in a row. Leaving out the second number, as in "{2,}", means at least twice with no upper limit.

The three previous counting operators are "*" from basic regular expressions, and "?" and "+" from extended regular expressions. These can all be expressed equivalently with curly-bracket ranges:

Extended-regular-expression-curly.png
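
The counting forms and their equivalences can be checked with grep -E. This is a sketch assuming a POSIX shell and grep; the sample words are made up, and the helper name "match" is hypothetical.

```shell
# Made-up sample words with 0, 1, 2, 4 and 6 "a" characters.
words=$(printf '%s\n' ct cat caat caaaat caaaaaat)

# Hypothetical helper: list the sample words matching an extended regular expression.
match() { printf '%s\n' "$words" | grep -E "$1"; }

exact4=$(match '^ca{4}t$')       # exactly four "a" characters
range25=$(match '^ca{2,5}t$')    # between two and five "a" characters

printf '%s\n' "$exact4"          # prints caaaat
printf '%s\n' "$range25"         # prints caat and caaaat

# The older counting operators written as curly-bracket ranges.
[ "$(match '^ca*t$')" = "$(match '^ca{0,}t$')" ]  && echo '* is {0,}'
[ "$(match '^ca?t$')" = "$(match '^ca{0,1}t$')" ] && echo '? is {0,1}'
[ "$(match '^ca+t$')" = "$(match '^ca{1,}t$')" ]  && echo '+ is {1,}'
```

The three comparisons only show that the paired patterns agree on these sample words, which is consistent with the equivalences but not a proof of them.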


Parentheses metacharacters

Parentheses, (), are used to group more than one character together into a single unit for coordination with counting operators:

Extended-regular-expression-parenthese.png

In the above example the characters "ca" are enclosed in parentheses. This causes the "?" operator to consider the two characters as a single item (or atom in regular-expressionese). A matched string can either have "ca" before the "t" character or not.
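
The effect of the grouping can be seen by running the pattern with and without the parentheses. This is a sketch assuming a POSIX shell and grep -E; the sample words are made up, and the patterns are anchored to the whole line.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' t cat at ct)

# With parentheses, "?" makes the whole group "ca" optional.
grouped=$(printf '%s\n' "$words" | grep -E '^(ca)?t$')

# Without parentheses, "?" applies to the "a" only.
ungrouped=$(printf '%s\n' "$words" | grep -E '^ca?t$')

printf 'grouped: %s\n' $grouped       # t cat
printf 'ungrouped: %s\n' $ungrouped   # cat ct
```

The grouped pattern accepts "t" but not "ct", since the "c" and the "a" must appear together or not at all.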


Pipe metacharacter

The pipe metacharacter is a logical or operation. Either the item on the left, or the item on the right is to be matched.

Typically the two options for the or operation are enclosed in grouping parentheses. In the following regular expression, either "ca" or "ho" can precede the "t" character (or the "t" character can be by itself, due to the question metacharacter making "(ca|ho)" optional).

Extended-regular-expression-or.png
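
The same expression can be tried with grep -E. This is a sketch assuming a POSIX shell and grep; the sample words are made up, and the pattern is anchored so that the whole line has to match.

```shell
# Made-up sample words, one per line.
words=$(printf '%s\n' t cat hot cot)

# Either "ca" or "ho" may come before the "t", or nothing at all.
or_matches=$(printf '%s\n' "$words" | grep -E '^(ca|ho)?t$')
printf '%s\n' "$or_matches"
# prints t, cat and hot
```

"cot" is rejected because "co" is neither of the two alternatives. Also note that the pattern is in single quotes, which keeps the shell from treating "|" as its own pipe character.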