Hi Emby,
Thank you, that was very helpful.
We know that the rules of VB dictate that a comment must be on a single
line, and that it is identified by a single quote that is not surrounded by
double-quotes. That is, if you wish to comment across multiple lines, you
must put a comment marker on each line, essentially creating a separate
comment on each line. Any characters to the right of the comment are
commented out of the code. We also know that the comment may appear at any
point in the line, not necessarily at the beginning.
So now we're down to a single string and a few simple rules:
1. "token" is defined as any character data enclosed by curly brackets.
2. Any token inside a VB comment should be ignored.
3. Any token inside a matching pair of double-quotes should be ignored.
4. All other tokens should be matched.
And I came up with this:
(?m)(?<=^[^']*)'[^\n]+$|(?:"[^"]*"|({[^}]+}))
Let me explain a bit. This regular expression takes advantage of a
characteristic of regular expressions: they consume a string as they match
it. That is, they move through a string in basically a "forward-only" manner
(other than backtracking within a single match attempt, and lookarounds,
which inspect text without consuming it). So, once a portion of a string has
been matched, it is not available for further matching.
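Here is a small sketch of that "consuming" behaviour. It uses Python's re module purely for illustration (my choice of language, not something from this thread); the idea is the same in .NET.

```python
import re

# Matches are found left to right and never overlap. Once "aba" has been
# matched at index 0, scanning resumes at index 3, so the "aba" that starts
# at index 2 is never reported.
overlap_demo = re.findall(r"aba", "ababa")
print(overlap_demo)

# Same idea with pairs of digits: "23" and "45" are never candidates,
# because "2" and "4" were already consumed by earlier matches.
pairs = re.findall(r"\d\d", "12345")
print(pairs)
```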
So, I worked backwards from matching tokens to the 2 exceptions where they
should *not* be matched. The token-matching regular expression is simple:
{[^}]+}
Translated, this says a match is a '{' character, followed by one or more
characters that are *not* '}', followed by a '}' character. Simple enough. It
matches every token in the string. Now we want to weed out the
non-qualifying tokens. Since the comment is the one that always weeds
everything out, I left that for last (first). You'll see why in a minute.
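A quick sketch of the token rule on its own, in Python for illustration. The sample text is put together from lines in the question; the braces are escaped here, which is equivalent to the unescaped form.

```python
import re

# Same rule as {[^}]+}, with the braces escaped.
TOKEN = re.compile(r"\{[^}]+\}")

snippet = 'If {AREA}>1000 Then\nReturn " is {FOO} !"\n\' {AREA} token not used'

# On its own the rule finds every token, including the ones inside the
# quoted string and the comment that we will want to skip later.
tokens = TOKEN.findall(snippet)
print(tokens)

# The + requires at least one character between the braces.
empty = TOKEN.findall("{}")
print(empty)
```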
The rule for quoted tokens is expressed as follows:
"[^"]*"
It is similar to the first: a double-quote, followed by any number of
characters that are *not* a double-quote, followed by a double-quote.
Now, how do we get these 2 working together? We use the OR operator - '|'.
When we OR these together, we get this:
"[^"]*"|{[^}]+}
This seems to expand the number of matches, since matches are now *added*
that include non-tokens. Here's where the "consuming" aspect comes in. A
quoted string that matches the first rule is consumed whole, including any
tokens inside it, so those tokens can never be matched on their own by the
second rule. So, the only real problem here is separating the 2 kinds of
matches. So, we use a group (of course!).
"[^"]*"|({[^}]+})
At this point, every match is either a whole quoted string (with any tokens
inside it swallowed) or a bare token. The only ones that we want are the ones
in "group 1" (the only capturing group in the regular expression), which is
empty whenever the match is a quoted string. So, by using that group, we
eliminate the tokens inside the double-quote pairs.
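To make the role of group 1 concrete, here is a minimal Python sketch. The sample line is made up from pieces of the question, and the braces are escaped, which does not change the meaning.

```python
import re

QUOTED_OR_TOKEN = re.compile(r'"[^"]*"|(\{[^}]+\})')

line = 'Return " is {FOO} !" & {WIDTH}'

# Every match, whichever alternative produced it.
all_matches = [m.group(0) for m in QUOTED_OR_TOKEN.finditer(line)]
print(all_matches)

# Group 1 is only set when the token alternative matched, so filtering on
# it drops the quoted string together with the {FOO} it swallowed.
wanted = [m.group(1) for m in QUOTED_OR_TOKEN.finditer(line)
          if m.group(1) is not None]
print(wanted)
```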
We have one last hurdle now. We want to eliminate anything inside a comment.
I left this for last because the comment eliminates *everything* inside it,
including the double-quote pairs, and thus consumes the most of the 3 rules.
This will make the regular expression more efficient, as it has less work to
do with each match.
The rule for comments is a bit more complicated:
(?m)(?<=^[^']*)'[^\n]+$
First, it must limit a comment to a single line. This is done with the '^'
(start of string/line) and '$' (end of string/line) characters. I also used
the "(?m)" directive, which indicates that the '^' and '$' characters match
at the start and end of each line, not just of the whole string.
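The effect of "(?m)" is easy to see in isolation. A small Python sketch (the sample text is made up for illustration):

```python
import re

text = "first line\nsecond ' note\nthird line"

# Without (?m), ^ only matches at the very start of the string.
without_m = re.findall(r"^\w+", text)
print(without_m)

# With (?m), ^ also matches right after each line break...
with_m = re.findall(r"(?m)^\w+", text)
print(with_m)

# ...and $ matches just before each line break as well as at the end.
line_ends = re.findall(r"(?m)\w+$", text)
print(line_ends)
```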
So, it begins with a positive look-behind: (?<=^[^']*) which means "the
following is *only* a match if preceded by this regular expression". Inside
the look-behind are the start-of-line anchor and a character class matching
0 or more non-single-quotes. The condition applies to the rest of the regular
expression (without the condition itself being part of the match, since
lookarounds do not consume): a single-quote, followed by 1 or more
non-line-break characters, followed by the end of the line or of the string.
This covers comments which begin in the middle of a line as well as at the
beginning. The lookbehind prevents the characters preceding the single-quote
from being consumed, thereby making them available for the other 2
conditions. I finished up by grouping the second 2 regular expressions
into a single non-capturing group - (?:"[^"]*"|({[^}]+})) - making them a
single alternative to the first, and ORing them all together.
In essence, it says, "Match the first (comment) group first. With what is
left over, match either the quoted strings, or the left-over tokens, and put
the left-over tokens into a group." You can do a regular expression match,
and use the values in Group 1 to do your replacements.
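As a rough end-to-end sketch, here is the replacement step in Python, using the tokens and snippet from the question. Two assumptions to flag: Python's re module does not accept the variable-length look-behind (?<=^[^']*), so it is left out here, and the translate helper and TRANSLATIONS table are names I made up for illustration. Leaving the look-behind out does not change the result for the sample, because the quoted-string alternative consumes any single quote inside a string before the comment alternative gets to see it.

```python
import re

# The full pattern minus the look-behind (see the note above), with the
# braces escaped. Group 1 is set only for tokens that should be replaced.
PATTERN = re.compile(r"""(?m)'[^\n]+$|(?:"[^"]*"|(\{[^}]+\}))""")

# Hypothetical lookup table built from the "Known Tokens" in the question.
TRANSLATIONS = {
    "{AREA}": "7075",
    "{HEIGHT}": "2512",
    "{WIDTH}": "75",
    "{FOO}": '"Yes"',
}

def translate(snippet, table=TRANSLATIONS):
    """Replace tokens outside comments and quoted strings."""
    def substitute(match):
        token = match.group(1)
        if token is None:
            # A comment or a quoted string matched: put it back unchanged.
            return match.group(0)
        # Unknown tokens are left as they are (an assumption of this sketch).
        return table.get(token, token)
    return PATTERN.sub(substitute, snippet)

snippet = "\n".join([
    'If {AREA}>1000 Then',
    'Return "Large"',
    "' {AREA} token not used",
    'ElseIf {HEIGHT}>2500 Then',
    'Return "Tall"',
    'ElseIf {WIDTH} >50 Then',
    'Return " \' " & {FOO}',
    'Else',
    'Return " is {FOO} !"',
    'End If',
])

translated = translate(snippet)
print(translated)
```

One case worth trying against whichever pattern you end up using is a line with an apostrophe inside a string followed by a real comment, such as x = "it's" ' note {AREA}. This sketch treats the trailing part as a comment and leaves {AREA} alone.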
I tested it fairly thoroughly. Let me know if it works for you.
--
HTH,
Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist
A lifetime is made up of
Lots of short moments.
Hi Kevin,
You are indeed correct. The original question was a more general, "how can
this be done with RE's?"
To be specific, I will have a set of strings, which I will call tokens, each
of which will consist of upper case alpha-numeric characters in curly
brackets. I also have another set of strings which is the translated value
of these tokens. So if I have 5 tokens, I will have 5 translated values, one
for each token.
I will also have a code snippet - a potentially multi-line string - which
will contain embedded tokens. My task is to replace the tokens in the code
snippet string with their translated values. The snippet is a VB code
string. But:
1) any token in a code line to the right of a single quote character which
is not itself in a quoted string should not be replaced
2) any token that is within a quoted string should not be translated
Sorry, but I'm giving examples because I'm not sure I've described it well or
completely:
Known Tokens    Translation
{AREA}          7075
{HEIGHT}        2512
{WIDTH}         75
{FOO}           "Yes"
Snippet                         Translated code
If {AREA}>1000 Then             If 7075>1000 Then
Return "Large"                  Return "Large"
' {AREA} token not used         ' {AREA} token not used
ElseIf {HEIGHT}>2500 Then       ElseIf 2512>2500 Then
Return "Tall"                   Return "Tall"
ElseIf {WIDTH} >50 Then         ElseIf 75 >50 Then
Return " ' " & {FOO}            Return " ' " & "Yes"
Else                            Else
Return " is {FOO} !"            Return " is {FOO} !"
End If                          End If
Our system compiles the resulting snippet on the fly and executes it to
provide the app with a scripting capability.
Thanks for any help you can extend.