Subsequently Matched
with Raku and Perl

by Arne Sommer

Subsequently Matched with Raku and Perl

[115] Published 12. February 2021.

This is my response to the Perl Weekly Challenge #099.

Challenge #099.1: Pattern Match

You are given a string $S and a pattern $P.

Write a script to check if given pattern validate the entire string. Print 1 if pass otherwise 0.

The patterns can also have the following characters:
? - Match any single character.
* - Match any sequence of characters.

Example 1:
Input: $S = "abcde" $P = "a*e"
Output: 1
Example 2:
Input: $S = "abcde" $P = "a*d"
Output: 0
Example 3:
Input: $S = "abcde" $P = "?b*d"
Output: 0
Example 4:
Input: $S = "abcde" $P = "a*c?e"
Output: 1

The «?» and «*» characters are Shell Metacharacters (see e.g. flylib.com/books/en/4.356.1.126/1/) used by shells (e.g. bash) and other programs (e.g. grep, sed and awk) in a Unix (and Unix-like) environment. Raku and Perl does not support them, so we have to translate them, like this:

Shell MetacharacterRaku/Perl Regex
*.*
?.

The challenge does not say what to do about other special characters (recognised as regexes by Raku and Perl), so I'll ignore that for now.

File: pattern-match
#! /usr/bin/env raku

unit sub MAIN (Str $S where $S.chars > 0,         # [1]
               Str $P is copy where $P.chars > 0, # [1a]
	       :v(:$verbose));

$P.=trans( [ '*', '?' ] => [ '.*' , '.' ]);       # [2]

say ": Regex: $P " if $verbose;

say $S ~~ /^ <$P> $/ ?? 1 !! 0                    # [3]

[1] Ensure that both strings have at last one character.

[2] Use trans to translate the substrings on the left side of the arrow to the ones on the right side. Note that trans does not change the string itself, but returns the modified version. So we use .= to assign the value back to the variable. Also note the is copy in [1a] so that we can change the value.

[3] Apply the regex. Note the < and > around the variable telling Raku to treat regex metacharacters in the variable as such. The anchors ensure that we match the whole string (beginning: ^, end: $). Using it without them would treat the variable as literal text.

See docs.raku.org/routine/trans for more information about trans.

Running it:

$ ./pattern-match abcde "a*e"
1

$ ./pattern-match abcde "a*d"
0

$ ./pattern-match abcde "?b*d"
0

$ ./pattern-match -v abcde "a*c?e"
1

With verbose mode, to show the underlying regex:

$ ./pattern-match -v abcde "a*e"
: Regex: a.*e 
1

$ ./pattern-match -v abcde "a*d"
: Regex: a.*d 
0

$ ./pattern-match -v abcde "?b*d"
: Regex: .b.*d 
0

$ ./pattern-match -v abcde "a*c?e"
: Regex: a.*c.e 
1

Looing good.

A Perl Version

This is straight forward translation of the Raku version.

File: pattern-match-perl
#! /usr/bin/env perl

use strict;
use warnings;
use feature 'say';

use Getopt::Long;

my $verbose = 0;

GetOptions("verbose" => \$verbose);

my $S = shift(@ARGV);
my $P = shift(@ARGV);

die '"$S" must have length' unless length $S;
die '"$P" must have length' unless length $P;

$P =~ s/\*/.*/g;           # [1]
$P =~ s/\?/./g;            # [1]

say ": Perl Regex: $P " if $verbose;

say $S =~ /^$P$/ ? 1 : 0;  # [2]

[1] Perl does not have a trans routine, but we can use s///. Twice, once for each character sequence.

[2] The varible is interpolated, and treated as a regex if it contains any (as it does). If we had written it like this in Raku, any metacharcters would have been qouted. We can achieve the in Perl with quotemeta. But we do want to treat them as metacharacters, so this does not apply here.

Running it gives the same result as the Raku version:

$ ./pattern-match-perl -v abcde "a*e"
: Perl Regex: a.*e 
1

$ ./pattern-match-perl -v abcde "a*d"
: Perl Regex: a.*d 
0

$ ./pattern-match-perl -v abcde "?b*d"
: Perl Regex: .b.*d 
0

$ ./pattern-match-perl -v abcde "a*c?e"
: Perl Regex: a.*c.e 
1

Challenge #099.2: Unique Subsequence

You are given two strings $S and $T.

Write a script to find out count of different unique subsequences matching $T without changing the position of characters.

Example 1:
Input: $S = 'littleit', $T = 'lit'
Output: 4

    1: [lit] tleit
    2: [li] t [t] leit
    3: [li] ttlei [t]
    4: litt [l] e [it]
Example 2:
Input: $S = 'london', $T = 'lon'
Output: 3

    1: [lon] don
    2: [lo] ndo [n]
    3: [l] ond [on]

A Bitmap

Let us construct a bit mask (a binary number) that tells if the corresponding character should be included in the result. This is a brute force approach, but that should be ok here.

Here is an illustration of the concept, for the third match of the second example:

The character is included if the corresponding bitmap value is 1.

In this case we have six letters, so the task is to construct the binary numbers from 0 (or really 000000) to 111111.

The upper limit is easy(ish), with the String Repetition Operator x:

> say '1' x 'london'.chars;    # -> 111111
> say '1' x 'littleit'.chars;  # -> 11111111

See docs.raku.org/routine/x for more information about x.

Binary numbers are tedious to work with, so we switch back to decimal (with parse-base(2)) for the loop, and convert the loop value back (with fmt, so that we can zero-pad the value).

> my $length = 'london'.chars;
6
    
> my $binary = '1' x $length;
111111

> my $max = $binary.parse-base(2);
63

> say $_.fmt('%0' ~ $length ~ 'b') for 0 .. $max;
000000
000001
000010
000011
...
111100
111101
111110
111111

See docs.raku.org/routine/parse-base for more information about parse-base.

See docs.raku.org/routine/fmt for more information about fmt.

The program is almost trivial now:

File: unique-subsequence
#! /usr/bin/env raku

unit sub MAIN (Str $S where $S.chars > 0,           # [1]
               Str $T where $T.chars > 0,           # [1a]
	       :v(:$verbose));

my $S-length = $S.chars;                            # [2]
my $T-length = $T.chars;                            # [2a]
my $binary   = '1' x $S-length;                     # [3]
my $max      = $binary.parse-base(2);               # [3a]
my $matches  = 0;                                   # [4]

for 1 .. $max -> $current                           # [5]
{
  my $mask = $current.fmt('%0' ~ $S-length ~ 'b');  # [6]

  if $mask.comb.sum != $T-length                    # [7]
  {
    say ":   Skipped binary mask '{ $mask }' - wrong number of 1s" if $verbose;
    next;
  }

  my $candidate = (^$S-length).map({ $mask.substr($_, 1) eq '1'
     ## 8 ######   # 8a ####### # 8b ##########################

  ?? $S.substr($_,1) !! '' }).join;
  # 8c ############# # 8d ### 8e ##

  if $candidate eq $T                               # [9]
  {
    $matches++;                                     # [9a]
    say ": + Match found with binary mask '{ $mask }'." if $verbose;
  }
  else
  {
    say ":   Considering binary mask '{ $mask}' - no match" if $verbose;
  }
}

say $matches;

[1] Ensure that both string have at last one character.

[2] The lengths of the two strings.

[3] The binary flag, with the highest value, and the decimal counterpart [3a].

[4] The number of matches.

[5] Iterating from 1 to the decimal equivalence of the maximum binary value.

[6] Convert the decimal value to binary. The format expression '%0' ~ $S-length ~ 'b' ensures that the string has at least $S-length characters, and the leading zero indicates that any missing length should be compensated by leading zeroes. The 'b' part tells us to print the binary version of the number.

[7] This shortcut speeds up the program quite a bit, but isn't stricly necessary. It skips the rest of the current iteration if the number of 1s in the binary mask is different from the length of $T, as we cannot get a match in that case.

[8] The candidate string given the current binary mask. We start with all the indices of the string $S [8a], and use map to get at the corresponding (= with the same index) binary digit in the current bitmask [8b]. If the digit is 1, we keep the character form $S [8c]. If not, we ignore it (by returning an empty string) [8d]. Dinally, we take that list of characters (including empty strings) and join them together to a single string [8e].

[9] Have we found $T? If so increase the counter [9a].

Running it:

$ ./unique-subsequence littleit lit
5

$ ./unique-subsequence london lon
3

The first one differs from the given result in the challenge. So what is going on? Verbose mode to the rescue. I have filtered out the "Match found" lines, as the rest really is noise:

$ ./unique-subsequence -v littleit lit | grep +
: + Match found with binary mask '00001011'.  # 4
: + Match found with binary mask '10000011'.  # ?
: + Match found with binary mask '11000001'.  # 3 
: + Match found with binary mask '11010000'.  # 2
: + Match found with binary mask '11100000'.  # 1

$ ./unique-subsequence -v london lon | grep +
: + Match found with binary mask '100011'.    # 3
: + Match found with binary mask '110001'.    # 2
: + Match found with binary mask '111000'.    # 1

The numbers at the end are the corresponding entries in the challenge. The one marked with «?» is new. It clerly is a valid combination, so the challenge got it wrong. The answer is 5 for the first example. (The second example is correct, though.)

Note that the challenge has been updated with the fifth match. I checked after writing this section.

Perl

This is a straight forward translation of the Raku version.

File: unique-subsequence-perl
#! /usr/bin/env perl

use strict;
use warnings;
use feature 'say';

use Getopt::Long;
use List::Util qw(sum);

my $verbose = 0;

GetOptions("verbose" => \$verbose);

my $S = shift(@ARGV);
my $T = shift(@ARGV);

die '"$S" must have length' unless length $S;
die '"$T" must have length' unless length $T;

my $S_length = length $S;
my $T_length = length $T;
my $binary   = '1' x $S_length;

my $max      = oct('0b' . $binary);
my $matches  = 0;

for my $current (1 .. $max)
{
  my $mask = sprintf("%0" . $S_length ."b", $current);

  if (sum(split(//, $mask)) != $T_length)
  {
    say ":   Skipped binary mask '{ $mask }' - wrong number of 1s" if $verbose;
    next;
  }

  my $candidate = join("", map { substr($mask, $_, 1) eq '1'
         ? substr($S, $_,1) : '' } (0 .. $S_length -1));

  if ($candidate eq $T)
  {
    $matches++;
    say ": + Match found with binary mask '$mask'." if $verbose;
  }
  else
  {
    say ":   Considering binary mask ' $mask' - no match" if $verbose;
  }
}

say $matches;

Running it gives the same result as the Raku version:

$ ./unique-subsequence-perl -v littleit lit | grep +
: + Match found with binary mask '00001011'.
: + Match found with binary mask '10000011'.
: + Match found with binary mask '11000001'.
: + Match found with binary mask '11010000'.
: + Match found with binary mask '11100000'.

$ ./unique-subsequence-perl -v london lon | grep +
: + Match found with binary mask '100011'.
: + Match found with binary mask '110001'.
: + Match found with binary mask '111000'.

And that's it.