Humdrum lab 7

From CCARH Wiki
Jump to navigation Jump to search

Regular Expressions

Regular expressions are an essential concept when processing text strings, and they are built into many programming languages, such as AWK, PERL, C++, python, and javascript

Basic Regular Expressions

"Basic" regular expressions are the initial implementation syntax of grep, that came with unix in 1973. Here are the "metacharacters" in the basic implementation of regular expressions:

Basic-regular-expressions.png

Dot metacharacter

The dot, or period, character is used to indicate any single character. In the following example, the regular expressin "c.t" will match to any three characters which start with "c", end with "t", and have any single character between these two characters.

Basic-regular-expression-dot.png


Star metacharacter

The star, or asterisk, character is used to indicate the the previous character (or parentheses group) will be matched if it occurs 0 or more times in the search string.

In the following example, the regular expression "c*t" will match to strings that contain zero or more "c" characters followed by the letter "t":

Basic-regular-expression-star.png

Note that "*" must be preceded by a character. If the "*" comes at the start of a line, that is an error because there is nothing to the left of the star for it to operate on.

Carat metacharacter

The carat metacharacter (^) is a line *anchor*. This character indicates that the matched characters (that follow) must occur at the start of the line. Notice that this character does double duty, as it is also the negation metacharacter when at the start of a list in square brackets!

In the following example "^cat" matches to the first occurrence of "cat" on the line:

Basic-regular-expression-square-brackets-carat.png


Dollar metacharacter

The dollar metacharacter ($) is another line *anchor*. This character indicates that the matched characters (that precede) must occur at the end of the line.

In the following example "cat$" matches to the second occurrence of "cat" on the line:

Basic-regular-expression-square-brackets-dollar.png


Backslash metacharacter

The backslash metacharacter is used to un-metafy a metacharacter, turning it into a normal character. For example "c*" means zero or more letters c's, while "c\*" means the letter a followed by an asterisk in a matched string:


Basic-regular-expression-backslash.png



Square-bracket metacharacters

Square brackets enclose a list of allowed characters in a matched string. Only one of the characters will be matched in a search string.

Basic-regular-expression-square-brackets.png


There is more syntax related to square brackets. You can negate the list by adding "^" as the first charcter, such as match to all characters that are not vowels: "[^aeiou]".

Basic-regular-expression-square-brackets-negate.png


Another syntax is a character range, such as "[0-9]" which is equivalent to "[0123456789]", or "[A-Ga-g]" which is equivalent to "[ABCDEFGabcdefg]".

Basic-regular-expression-square-brackets-range.png

Metacharacters in square brackets

There are several special cases for dealing with metacharacters in square brackets. Charcters such as "*" and "." have no special meaning inside of square brackets, so you can consider them regular characters.

Putting a square bracket inside of the list is tricky, since the characters themselves have a special meaning. A simple method would be to add a backslash character before the bracket in the list. Another method is to place the bracket character at the beginning or ending of the list:

Basic-regular-expression-square-brackets-bracket.png

This is similar for the dash. When between two characters in a list, it is a range metacharacter. Place it at the start or end of the list to make it a regular character (or put a backslash in front of it to force it to be a regular character):


Basic-regular-expression-square-brackets-dash.png


Dot-star metacharacter combination

A dot followed by a star means ``anything``. The dot means any one character, and the start after it means "one or more" of any one character. So ".*" will match to absolutely anything, including an empty line.

Using Basic regular expressions

Finding Humdrum files which contain a minor key designation:

   grep -l '^\*[a-g][#-]*:'  *.krn

The -l option for grep means to only show the filename, not the actual matched line(s).

Also notice that the regular expression is enclosed in single quotes. This is usually the safest thing to do; otherwise, the bash shell may try to sneak a look into the regular expression and try to change things due to its own metacharacters. Putting single quotes around it will tell the shell to not treat any characters inside of the quotes as any of its metacharacters.

Getting a list of files containing a major key designation:

   grep -l '^\*[A-G][#-]*:' *.krn

The capital letters for the pitch name indicate major keys in Humdrum **kern data.

To search for files that have any sort of key designation:

    grep -l '^\*[A-Ga-g][#-]*:' *.krn

This matches to major (A-G) and minor (a-g) keys. Another way to do this search is:

    grep -il '^\*[a-g][#-]*:' *.krn

The -i option means to ignore the case of the letters, so lower and upper case letters are equivalent when matching in a string.

The example regular expression is not 100% correct, as it would be possible to match to a nonsense key designation such as "*F-#:" which is F-flat-sharp major. But as this is nonsense, it is not expected in the data, so not really a problem. "*F##:" is allowed, meaning F-double-sharp major. This is not a particularly common key signature, and the sanity of the composer or their music editor should probably be checked if it is used...

"grep" means Global Regular Expression Print

Extended Regular Expressions

Basic regular expressions were popular and useful, so in 1975 Extended regular expressions were developed to extend them with more possibilities. Extended regular expressions add the following metacharacters to the basic set:


Extended-regular-expressions.png


Question metacharacter

The question mark requires either 0 or 1 of the previous item to be present in a matched string. Think of it like a yes/no question, with the preceding character being optional:

Extended-regular-expression-question.png


Plus metacharacter

The plus metacharacter is similar to the start meta character, except that there must be at least one of the preceding item in a matched string:

Extended-regular-expression-plus.png

Curly-bracket metacharacters

The curly-bracket metacharacters are used to fully generalize the counting operators. Within the curly braces can occur one or two numbers. "{4}" means exactly four of the previous item are required in a matched string. "{2-5}" means that the previous item must occur at least twice but not more than 5 times in a row.

The three previous counting operators are "*" from basic regular expressions, and "?" and "+" from extended regular expressions. These can all be expressed equivalently with curly-bracket ranges:

Extended-regular-expression-curly.png


Parentheses metacharacters

Parentheses, (), are used to group more than one character together into a single unit for coordination with counting operators:

Extended-regular-expression-parenthese.png

In the above example the characters "ca" are enclosed in parentheses. This causes the "?" operator to consider the two characters as a single item (or atom in regular-expressionese). A Matched string can have either "ca" or not before the "t" character.


Pipe metacharacter

The pipe metacharacter is a logical or operation. Either the item on the left, or the item on the right is to be matched.

Typically the two options for the or operation are enclosed in grouping parentheses. In the following regular expression, either "ca" or "ho" can precede the "t" character (or the "t" character can be by itself due to the question metacharacter making "(ca|ho)" optional.

Extended-regular-expression-or.png


Using Extended regular expressions

Using extended regular expressions with grep requires the -E option:

      grep -E -l '^\*[a-g][-#]?' *.krn

This will match to all minor keys which have up to one sharp/flat after the pitch name. Alternatively the egrep program can be used as an alias for grep -E:

     egrep -l '^\*[a-g][-#]?' *.krn

In other words, you should always used egrep rather than grep, particularly if you want to use extended regular expressions.

Extended regular expressions in grep

However, you can use extended regular expressions in plain grep if you really desired to do that. In such cases, you add a backslash in front of extended metacharacters in order to turn them into basic regular expression metacharacters...

In other words, you can use "?" in basic regular expressions even though it is an extended regular expression;

     grep -l '^\*[a-g][-#]\?' *.krn

is equivalent to:

     egrep -l '^\*[a-g][-#]?' *.krn

In basic regular expressions, "?" is a regular question mark, but "\?" is a metacharacter. In extended regular expressions, these meanings are reversed: "?" is a metacharacter and "\?" is a regular question mark. This is why you should probably stick to egrep so as not to be confused too much. Otherwise, you must memorize which metacharacters originate in which regular expression variant syntax.

PERL regular expressions

About 10 years after extended regular expressions were developed, the PERL language was created. PERL is basically an extension of bash shell scripting and an enhancement of the AWK programming language. Note that the "A" author of AWK was the creator of egrep.

PERL further extends regular expressions in several ways, as is the current state-of-the-art implementation of regular expressions. Most other modern languages adopt PERL regular expression syntax, although there is slightly variations in the implementations in each programming languages. Javascript is missing a few advanced features, but starting with the 2018 version of the standard it should be fairly complete.

See the Advanced topics section of this page for extensions to regular expression syntax introduced by the PERL language.

Perl syntax and grep

On linux computers (using GNU version of grep), you can use grep -P to turn on perl-style regular expressions. Also egrep -P works, but note that pgrep does not work, as that command already existed before PERL was developed. In MacOS, the -P option for grep and egrep do not work, since these programs have a separate and more primitive development history.

If you want to use the perl-syntax with grep on MacOS, then install GNU versions with homebrew:

  brew install grep

This will install the GNU version of grep as ggrep, so the perl syntax can be used with ggrep -P.

More grep examples

How many C pitch-classes are present in the Bach chorale data set?

    humcat -s h://chorales | serialize | grep -v '^[!*]' | grep -i c | grep -v '[#-]' | wc -l
    8295

Meaning of the individual parts of the command pipeline:

Command Meaning
humcat -s h://chorales Download the chorales as a single stream of data.
serialize Serialize the parts (one part after another in the data stream).
grep -v '^[!*]' Remove lines starting with ! or *. This is similar to ridx -H.
grep -i c Extract all diatonic C pitch-classes.
grep -v '[#-]' Remove cases such as C-sharp and C-flat.
wc -l Count the number of remaining C-naturals.


  • What is the most common pitch class?
  • How can secondary tied notes be excluded from the counts?
  • What is the most common pitch-class to be tied
  • How many C pitch classes occur below middle C?
  • How many C pitch classes occur at middle C and above?
  • How many C pitch classes occur above middle C?
  • How many C5 pitches occur in the chorales?
  • How many C pitch classes have an eighth-note duration?
  • How many C pitch classes have a dotted quarter note duration?

How to win at hangman

Impress your friend by always winning hangman with the help of regular expressions.

On MacOS, there is a dictionary file for spell-checking located at:

   /usr/share/dict/words

There are about 1/4 of a million words listed in it:

   wc -l /usr/share/dict/words

Linux and unix will probably have a similar dictionary, but may be in another location.

Go to the website http://www.hangman.no and then go to the "frequently misspelled" game. Then when a word is selected, run a command like this:

   grep '^........$' /usr/share/dict/words | wc -l

There are 8 dots in the regular expression representing 8 unknown letters.

Guess a letter, such as "e". If the letter is in the word, then replace the dots with the matched e's:

   grep '^e..e....$' /usr/share/dict/words 

Gradually guess letters and fill in the matched letters in the regular expression. There will be fewer and fewer options left after each step.



If you want to really be good at hangman, then this program is for you:

#!/usr/bin/perl
# vim: ts=3

use strict;
my $dictionary = "/usr/share/dict/words";
die "Cannot find dictionary" if !-r $dictionary;
die "Usage: $0 word used-letters\n" if @ARGV < 2;

# The first argument is the word being used.  Use . or _ to 
# represent unknown characters, otherwise fill in the known letters.
# example:   .le...le.
my $word    = $ARGV[0];

# The second argument is the list of letters that have
# already been guessed.  For no guesses, use "".
# Letters in the word do not have to be added to this list.
# example: abcdhatzy or ""
my $letters = $ARGV[1];

my $regex = $word;
if ($letters eq "") {
	$regex =~ s/[_.]/./g;
} else {
	$regex =~ s/[_.]/[^$letters]/g;
}
$regex = "^$regex\$";

my @possiblewords = `grep $regex $dictionary`;
my $count = @possiblewords;
print "There are $count candidate words.\n";
printLetterProbabilitiesByWord($letters . $word, @possiblewords);

sub printLetterProbabilitiesByWord {
	my ($letters, @words) = @_;
	my $wordcount = @words;
	chomp @words;
	my %wordcounter;
	foreach my $word (@words) {
		$word =~ s/[ _$letters]//g;
		$word =~ tr/A-Z/a-z/;
		my %counter = ();
		my @letters = split(//, $word);
		foreach my $letter (@letters) { $counter{$letter}++; }
		foreach my $key (keys %counter) { $wordcounter{$key}++ }
	}
	my @output;
	foreach my $letter (keys %wordcounter) {
		my $percent = int($wordcounter{$letter} / $wordcount * 100 + 0.5);
		push @output, [$letter, $percent];
	}
	@output = sort {$b->[1] <=> $a->[1] } @output;
	foreach my $item (@output) {
		print "\t", $item->[0], ":\t", $item->[1], "% probability of being in word\n";
	}
}


Here is a game in action:

Hangman.png

And here is the results of the above PERL program:

     nohang ..e.i..e "ar"
     There are 6 candidate words.
	l:	83% probability of being in word
	b:	50% probability of being in word
	c:	50% probability of being in word
	n:	50% probability of being in word
	d:	33% probability of being in word
	h:	33% probability of being in word
	y:	17% probability of being in word
	u:	17% probability of being in word
	t:	17% probability of being in word
	p:	17% probability of being in word
	s:	17% probability of being in word
	x:	17% probability of being in word
	f:	17% probability of being in word

The next most likely character to choose is "L" since it occurs 83% of the time in the possible answers.


Advanced topics

More anchors

Basic and extended regular expressions have two anchor metacharacters: caret (^) which anchors the match to the start of a line, and dollar ($) which anchors the match to the end of the line. PERL regular expressions introduced additional anchors: \b means a word boundary, and \B means not at a word boundary.

Here is how \b works. In the following regular expression, the \b prevents the string "perl" from being followed by a letter or number, since this would continue the "word":


Regular-expression-wordboundary.png


The negation of \b is \B. This means that the word must continue and cannot end at the \B position in the match:

Regular-expression-notwordboundary.png


Character set shortcuts

Extended regular expressions introduce character classes. These are shortcuts for groups of characters that can be added inside the [] character list. They are not too useful, but might sometimes be of use. For example [:upper:] means [A-Z], and [:digit:] is the same as [0-9].

A more useful shortcuts for character sets was introduced in PERL regular expressions:

Shortcut Equivalence Meaning
\d [0-9] any digit
\D [^0-9] anything that is not a digit
\s [ \t\n\r\f\v] any space character (space, tab, newline, linefeed, form-feed, vertical tab)
\S [^ \t\n\r\f\v] anything that is not a space
\w [A-Za-z0-9_] a "word" character
\W [^A-Za-z0-9_] not a "word" character

In regular expression syntaxes that understand these shortcuts, you can put them inside of [], such as [\d\s] which would mean any digit or space character.

Non-capturing group

Parentheses are used to group multiple atoms together. In most implementations of regular expressions, they also serve as a capture group that can be extracted from the matched string. Implementing capture groups is computationally expensive, since the computer needs to keep track of the match groups. To avoid using using parentheses for capturing substrings, use this form:

    (?:           )

In other words, prefix the contents of the parentheses with "?:".


Greedy versus Lazy matching

Count operators are typically greedy, meaning that they will try to match a searched string as much as possible before continuing with search of the regular expression. In more modern implementations of regular expression syntax, there question mark is used as a lazy matching qualifier.

For example, searching for ".*" in the following string will match from the first quote to the last quote on the line:


Regular-expression-greedy.png

To generate a lazy search, add a question mark before the star: ".*?". This will cause the match to stop at the next quote character rather than the last one on the line:

Regular-expression-lazy.png


Lookarounds

Lookarounds are similar to non-capture groups and also related to lazy matching.

Lookaheads

Lookbehinds

Further reading

  • quick reference that lists regular expressions and which syntaxes use them.