VBA Regex syntax problems (XL2003)

K

ker_01

I have a long string, in the format

BAR garbage text FOO desired string 1 BAR desired string 2 BAR desired
string 3 BAR desired string 4 BAR (etc).

What I'd like to extract is:
desired string 1
desired string 2
desired string 3
desired string 4

I'm trying variations of re.Pattern = "FOO|BAR [\s\S]*?BAR "
but I'm having some problems.

(1) The first returned string returns
BAR garbage text FOO desired string 1 BAR
where I'd prefer to only start with
FOO desired string 1 BAR ... or better,
desired string 1

I can work around that manually, but I welcome any suggestions on how to fix
it.

(2) Each search 'consumes' the terminating string "BAR", causing it to skip
every other string, giving me:
BAR garbage text FOO desired string 1 BAR
BAR desired string 3 BAR
BAR desired string 5 BAR

How do I force it to re-use the terminating string as the start string of
the next part of the search?

(3) I'm not sure if I need it or not, but I tried using re.multiline = True
just to be cautious, and I get a compile error "Method or data member not
found". I do have a reference to Microsoft VBScript Regular Expressions 1.0.
Do I need a different reference to enable multiline? Again, the strings still
seem to be coming across ok (including across multiple lines), I'm just
surprised I don't need (and in fact, can't set) the multiline parameter.

Thank you,
Keith
 
P

Per Jessen

Hi Keith

Look at this:

Sub AAA()
Dim MyString As String
Dim Output As String

MyString = "BAR garbage text FOO desired string 1 BAR desired string 2
BAR desired string 3 BAR desired string 4 BAR"
MyString = Mid(MyString, WorksheetFunction.Find("FOO", MyString) + 4)
Do
Output = Left(MyString, WorksheetFunction.Find(" BAR", MyString))
MyString = Mid(MyString, WorksheetFunction.Find("BAR", MyString) +
4)
Debug.Print Output
Loop Until MyString = ""
End Sub

Hopes this helps.
....
Per
 
R

Rick Rothstein

I'd be interested in knowing what text is in the "(etc)" part of your source
string. Can there be other FOO keywords? If so, do want the desired strings
between them and any BAR keywords that may follow it?
 
K

ker_01

Hi Rick!

In my source string, the 'FOO' string is only in the template once, right
before the first child string; I'm hoping that none of the 'child' strings
happen to contain it by chance, or else I'll be doing some hand-editing on
those records.

There are about 20-30 child strings per parent string I need to process; BAR
is in the template at the end of each child string, so it occurs once between
each child string and at the end of the last child string. Unfortunately, BAR
also occurs once at the top of the template, which is what was giving me
issue #1 below.

Modified Example:
[start of file]
garbage text BAR garbage text FOO desired string 1 BAR desired string 2 BAR
desired string 3 BAR desired string 4 BAR desired string...30 BAR more
garbage text that does not include FOO or BAR
[end of file]

Thanks!
Keith


Rick Rothstein said:
I'd be interested in knowing what text is in the "(etc)" part of your source
string. Can there be other FOO keywords? If so, do want the desired strings
between them and any BAR keywords that may follow it?

--
Rick (MVP - Excel)


ker_01 said:
I have a long string, in the format

BAR garbage text FOO desired string 1 BAR desired string 2 BAR desired
string 3 BAR desired string 4 BAR (etc).

What I'd like to extract is:
desired string 1
desired string 2
desired string 3
desired string 4

I'm trying variations of re.Pattern = "FOO|BAR [\s\S]*?BAR "
but I'm having some problems.

(1) The first returned string returns
BAR garbage text FOO desired string 1 BAR
where I'd prefer to only start with
FOO desired string 1 BAR ... or better,
desired string 1

I can work around that manually, but I welcome any suggestions on how to
fix
it.

(2) Each search 'consumes' the terminating string "BAR", causing it to
skip
every other string, giving me:
BAR garbage text FOO desired string 1 BAR
BAR desired string 3 BAR
BAR desired string 5 BAR

How do I force it to re-use the terminating string as the start string of
the next part of the search?

(3) I'm not sure if I need it or not, but I tried using re.multiline =
True
just to be cautious, and I get a compile error "Method or data member not
found". I do have a reference to Microsoft VBScript Regular Expressions
1.0.
Do I need a different reference to enable multiline? Again, the strings
still
seem to be coming across ok (including across multiple lines), I'm just
surprised I don't need (and in fact, can't set) the multiline parameter.

Thank you,
Keith
 
R

Rick Rothstein

Consider this coding approach...

Dim Children() As String, Parent As String, X As Long
Parent = "Garbage text BAR garbage text FOO desired string 1 BAR " & _
"desired string 2 BAR desired string 3 BAR desired string " & _
"4 BAR desired string 5 BAR More Garbage Text"
Children = Split(Split(Replace(Parent, " BAR ", "BAR"), "FOO ")(1), "BAR")
'
' Let's see if it worked...
'
For X = 0 To UBound(Children) - 1
Debug.Print Children(X)
Next

Note: This code is dependent on the keyword "BAR" being the last word of the
Parent string (I can code around that if it can be the last word, but your
sample seems to indicate it won't be). Also, the code is dependent on the
keywords BAR and FOO always being in upper case as your example shows (I can
code around this also if this will not always be the case).

--
Rick (MVP - Excel)


ker_01 said:
Hi Rick!

In my source string, the 'FOO' string is only in the template once, right
before the first child string; I'm hoping that none of the 'child' strings
happen to contain it by chance, or else I'll be doing some hand-editing on
those records.

There are about 20-30 child strings per parent string I need to process;
BAR
is in the template at the end of each child string, so it occurs once
between
each child string and at the end of the last child string. Unfortunately,
BAR
also occurs once at the top of the template, which is what was giving me
issue #1 below.

Modified Example:
[start of file]
garbage text BAR garbage text FOO desired string 1 BAR desired string 2
BAR
desired string 3 BAR desired string 4 BAR desired string...30 BAR more
garbage text that does not include FOO or BAR
[end of file]

Thanks!
Keith


Rick Rothstein said:
I'd be interested in knowing what text is in the "(etc)" part of your
source
string. Can there be other FOO keywords? If so, do want the desired
strings
between them and any BAR keywords that may follow it?

--
Rick (MVP - Excel)


ker_01 said:
I have a long string, in the format

BAR garbage text FOO desired string 1 BAR desired string 2 BAR desired
string 3 BAR desired string 4 BAR (etc).

What I'd like to extract is:
desired string 1
desired string 2
desired string 3
desired string 4

I'm trying variations of re.Pattern = "FOO|BAR [\s\S]*?BAR "
but I'm having some problems.

(1) The first returned string returns
BAR garbage text FOO desired string 1 BAR
where I'd prefer to only start with
FOO desired string 1 BAR ... or better,
desired string 1

I can work around that manually, but I welcome any suggestions on how
to
fix
it.

(2) Each search 'consumes' the terminating string "BAR", causing it to
skip
every other string, giving me:
BAR garbage text FOO desired string 1 BAR
BAR desired string 3 BAR
BAR desired string 5 BAR

How do I force it to re-use the terminating string as the start string
of
the next part of the search?

(3) I'm not sure if I need it or not, but I tried using re.multiline =
True
just to be cautious, and I get a compile error "Method or data member
not
found". I do have a reference to Microsoft VBScript Regular Expressions
1.0.
Do I need a different reference to enable multiline? Again, the strings
still
seem to be coming across ok (including across multiple lines), I'm just
surprised I don't need (and in fact, can't set) the multiline
parameter.

Thank you,
Keith
 
R

Ron Rosenfeld

I have a long string, in the format

BAR garbage text FOO desired string 1 BAR desired string 2 BAR desired
string 3 BAR desired string 4 BAR (etc).

What I'd like to extract is:
desired string 1
desired string 2
desired string 3
desired string 4

I'm trying variations of re.Pattern = "FOO|BAR [\s\S]*?BAR "
but I'm having some problems.

(1) The first returned string returns
BAR garbage text FOO desired string 1 BAR
where I'd prefer to only start with
FOO desired string 1 BAR ... or better,
desired string 1

I can work around that manually, but I welcome any suggestions on how to fix
it.

(2) Each search 'consumes' the terminating string "BAR", causing it to skip
every other string, giving me:
BAR garbage text FOO desired string 1 BAR
BAR desired string 3 BAR
BAR desired string 5 BAR

How do I force it to re-use the terminating string as the start string of
the next part of the search?

(3) I'm not sure if I need it or not, but I tried using re.multiline = True
just to be cautious, and I get a compile error "Method or data member not
found". I do have a reference to Microsoft VBScript Regular Expressions 1.0.
Do I need a different reference to enable multiline? Again, the strings still
seem to be coming across ok (including across multiple lines), I'm just
surprised I don't need (and in fact, can't set) the multiline parameter.

Thank you,
Keith

A few points:

1. I'd use Microsoft VBScript Regular Expressions 5.5
2. In VBScript, all the multiline option does is affect whether or not ^$
match at line breaks. Your use of [\s\S]+ is appropriate.

3. I would pre-process the string removing everything before the first FOO.
While you're at it, you should probably remove extraneous LF's.

e.g. Regex ".*?(FOO)|\n"
Replacement: "$1"

4. IT would have been helpful had you posted your code. But probably you need
to use submatches to tease out what you want. Here is code that seems to work
on your sample:

==========================
Option Explicit
Function FooBar(s As String, Index As Long) As String
Dim re As RegExp, mc As MatchCollection
Dim sS As String
Set re = New RegExp
With re
.IgnoreCase = True
.Global = True
.Pattern = ".*?(FOO)|\n"
sS = .Replace(s, "$1")

.Pattern = "(FOO|BAR)\s*([\s\S]+?)\s*(?=BAR)"
If .test(sS) = True Then
Set mc = .Execute(sS)
FooBar = mc(Index - 1).submatches(1)
End If
End With
End Function
=======================

The Index argument represents which of the instances of desired string are
extracted. 1=1st, etc.
--ron
 
K

ker_01

Ron, Rick, and Per- thank you all for your help. I'm going to go back and try
to implement your suggestions now.

Ron, also thank you for mentioning the 5.5; I hadn't seen it when I added
the 1.0, but I went back and looked, and there it was.

:)
Keith

Ron Rosenfeld said:
I have a long string, in the format

BAR garbage text FOO desired string 1 BAR desired string 2 BAR desired
string 3 BAR desired string 4 BAR (etc).

What I'd like to extract is:
desired string 1
desired string 2
desired string 3
desired string 4

I'm trying variations of re.Pattern = "FOO|BAR [\s\S]*?BAR "
but I'm having some problems.

(1) The first returned string returns
BAR garbage text FOO desired string 1 BAR
where I'd prefer to only start with
FOO desired string 1 BAR ... or better,
desired string 1

I can work around that manually, but I welcome any suggestions on how to fix
it.

(2) Each search 'consumes' the terminating string "BAR", causing it to skip
every other string, giving me:
BAR garbage text FOO desired string 1 BAR
BAR desired string 3 BAR
BAR desired string 5 BAR

How do I force it to re-use the terminating string as the start string of
the next part of the search?

(3) I'm not sure if I need it or not, but I tried using re.multiline = True
just to be cautious, and I get a compile error "Method or data member not
found". I do have a reference to Microsoft VBScript Regular Expressions 1.0.
Do I need a different reference to enable multiline? Again, the strings still
seem to be coming across ok (including across multiple lines), I'm just
surprised I don't need (and in fact, can't set) the multiline parameter.

Thank you,
Keith

A few points:

1. I'd use Microsoft VBScript Regular Expressions 5.5
2. In VBScript, all the multiline option does is affect whether or not ^$
match at line breaks. Your use of [\s\S]+ is appropriate.

3. I would pre-process the string removing everything before the first FOO.
While you're at it, you should probably remove extraneous LF's.

e.g. Regex ".*?(FOO)|\n"
Replacement: "$1"

4. IT would have been helpful had you posted your code. But probably you need
to use submatches to tease out what you want. Here is code that seems to work
on your sample:

==========================
Option Explicit
Function FooBar(s As String, Index As Long) As String
Dim re As RegExp, mc As MatchCollection
Dim sS As String
Set re = New RegExp
With re
.IgnoreCase = True
.Global = True
.Pattern = ".*?(FOO)|\n"
sS = .Replace(s, "$1")

.Pattern = "(FOO|BAR)\s*([\s\S]+?)\s*(?=BAR)"
If .test(sS) = True Then
Set mc = .Execute(sS)
FooBar = mc(Index - 1).submatches(1)
End If
End With
End Function
=======================

The Index argument represents which of the instances of desired string are
extracted. 1=1st, etc.
--ron
 
R

Rick Rothstein

Just for clarification, my code assigns as many child strings as there are
to the Children array... that array's first element is at index 0 (the Split
function *always* produces a zero-based array) and the last Child string is
contained at index Ubound(Children)-1... there is a last element at index
Ubound(Children), but it contains the trailing garbage. While not as
efficient code-wise, we can make the code store only the child strings with
out the garbage like this...

Dim Children() As String, Parent As String, X As Long
Parent = "Garbage text BAR garbage text FOO desired string 1 BAR " & _
"desired string 2 BAR desired string 3 BAR desired string " & _
"4 BAR desired string 5 BAR More Garbage Text"
Children = Split(Split(Replace(Parent, " BAR ", "BAR"), "FOO ")(1), "BAR")
ReDim Preserve Children(0 To UBound(Children) - 1)

Doing it this way allows you to iterate the array like this...

' Let's see if it worked...
'
For X = 0 To UBound(Children)
Debug.Print Children(X)
Next

where you would use the actual upper bound as the array will now only
contain child strings without the trailing garbage.

--
Rick (MVP - Excel)


Rick Rothstein said:
Consider this coding approach...

Dim Children() As String, Parent As String, X As Long
Parent = "Garbage text BAR garbage text FOO desired string 1 BAR " & _
"desired string 2 BAR desired string 3 BAR desired string " & _
"4 BAR desired string 5 BAR More Garbage Text"
Children = Split(Split(Replace(Parent, " BAR ", "BAR"), "FOO ")(1), "BAR")
'
' Let's see if it worked...
'
For X = 0 To UBound(Children) - 1
Debug.Print Children(X)
Next

Note: This code is dependent on the keyword "BAR" being the last word of
the Parent string (I can code around that if it can be the last word, but
your sample seems to indicate it won't be). Also, the code is dependent on
the keywords BAR and FOO always being in upper case as your example shows
(I can code around this also if this will not always be the case).

--
Rick (MVP - Excel)


ker_01 said:
Hi Rick!

In my source string, the 'FOO' string is only in the template once, right
before the first child string; I'm hoping that none of the 'child'
strings
happen to contain it by chance, or else I'll be doing some hand-editing
on
those records.

There are about 20-30 child strings per parent string I need to process;
BAR
is in the template at the end of each child string, so it occurs once
between
each child string and at the end of the last child string. Unfortunately,
BAR
also occurs once at the top of the template, which is what was giving me
issue #1 below.

Modified Example:
[start of file]
garbage text BAR garbage text FOO desired string 1 BAR desired string 2
BAR
desired string 3 BAR desired string 4 BAR desired string...30 BAR more
garbage text that does not include FOO or BAR
[end of file]

Thanks!
Keith


Rick Rothstein said:
I'd be interested in knowing what text is in the "(etc)" part of your
source
string. Can there be other FOO keywords? If so, do want the desired
strings
between them and any BAR keywords that may follow it?

--
Rick (MVP - Excel)



I have a long string, in the format

BAR garbage text FOO desired string 1 BAR desired string 2 BAR desired
string 3 BAR desired string 4 BAR (etc).

What I'd like to extract is:
desired string 1
desired string 2
desired string 3
desired string 4

I'm trying variations of re.Pattern = "FOO|BAR [\s\S]*?BAR "
but I'm having some problems.

(1) The first returned string returns
BAR garbage text FOO desired string 1 BAR
where I'd prefer to only start with
FOO desired string 1 BAR ... or better,
desired string 1

I can work around that manually, but I welcome any suggestions on how
to
fix
it.

(2) Each search 'consumes' the terminating string "BAR", causing it to
skip
every other string, giving me:
BAR garbage text FOO desired string 1 BAR
BAR desired string 3 BAR
BAR desired string 5 BAR

How do I force it to re-use the terminating string as the start string
of
the next part of the search?

(3) I'm not sure if I need it or not, but I tried using re.multiline =
True
just to be cautious, and I get a compile error "Method or data member
not
found". I do have a reference to Microsoft VBScript Regular
Expressions
1.0.
Do I need a different reference to enable multiline? Again, the
strings
still
seem to be coming across ok (including across multiple lines), I'm
just
surprised I don't need (and in fact, can't set) the multiline
parameter.

Thank you,
Keith
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top