Using regular expressions to parse text (HTML)

Earl Teigrob

I am trying to scan an HTML string for all content within certain tags. In the
simplified example below, I would like to scan the source text for all
occurrences of text that start with "A" and end with "c", without any
overlapping. In the following example, the regular expression finds only one
result that runs from the first "A" to the last "c". This is not what I want.
I want every non-overlapping occurrence of "A" through "c". The result set
should be

Abc, Abc, Abxc, Abxc

NOT

AbcbbAbc something elseAbxcXYZAbxc

as it is now.

Does anyone know how this can be done?

Thanks

Earl

private void ParseTest()
{
    ListBox2.Items.Clear();

    string SourceString = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";
    Regex r = new Regex("A.+c");

    MatchCollection mc = r.Matches(SourceString);
    foreach (Match m in mc)
    {
        ListBox2.Items.Add(m.ToString());
    }
}

Outputs only one result
==> AbcbbAbc something elseAbxcXYZAbxc
 
Jochen Kalmbach

Earl said:
I am trying to scan an HTML string for all content within certain tags.
In the simplified example below, I would like to scan the source text
for all occurrences of text that start with "A" and end with "c", without
any overlapping. In the following example, the regular expression
finds only one result that runs from the first "A" to the last "c". This
is not what I want. I want every non-overlapping occurrence of "A"
through "c". The result set should be
[...]
Outputs only one result
==> AbcbbAbc something elseAbxcXYZAbxc

You ALWAYS get only one result for regex-strings!

Maybe you should try to match ONE occurrence, find out its length
(match[1]), and then use the remaining right-hand part of the string to do
another match, until the whole string has been matched...

And the regex should look like: "(^(A[^c]*c))"
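
The match-one-then-continue idea above could be sketched roughly as follows. This is only an illustration under a few assumptions: `StepwiseDemo` and `FindOneAtATime` are hypothetical names, and the pattern is used without the leading `^`, because with `^` (and no multiline option) the pattern can only match at the very start of the input. `Regex.Match(string, int)` is the .NET overload that begins searching at a given position.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class StepwiseDemo
{
    // Hypothetical helper: finds one match, then resumes the search
    // right after it, until no further match is found.
    public static List<string> FindOneAtATime(string pattern, string input)
    {
        var regex = new Regex(pattern);
        var results = new List<string>();
        int position = 0;

        while (position <= input.Length)
        {
            Match m = regex.Match(input, position);
            if (!m.Success)
            {
                break;
            }
            results.Add(m.Value);
            // Continue after this match; advance at least one character
            // so that an empty match could not loop forever.
            position = m.Index + Math.Max(m.Length, 1);
        }
        return results;
    }

    static void Main()
    {
        string source = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";
        // Prints: Abc, Abc, Abxc, Abxc
        Console.WriteLine(string.Join(", ", FindOneAtATime("A[^c]*c", source)));
    }
}
```

In practice `Regex.Matches` performs this stepping internally, so the manual loop is mainly useful for seeing what happens between matches.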


--
Greetings
Jochen

Do you need a memory-leak finder ?
http://www.codeproject.com/tools/leakfinder.asp

Do you need daily reports from your server?
http://sourceforge.net/projects/srvreport/
 
Niki Estner

".+" and ".*" are greedy: they match as many characters as possible.
Use the lazy quantifiers ".+?" or ".*?" instead.
That should do what you want.
Two pieces of advice:
1. Get some regular expression testing environment - I'm using Expresso, but
I guess there are others, too.
2. If you want to understand what you're doing, get a good book on the
topic!

Niki

PS: Maybe I misunderstood that other post: Of course "Matches" returns more
than one match if there is more than one match. I didn't test it, but I
think your code should run fine if you use ".+?" or "[^c]+".
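
A small sketch of the difference, using the source string from the original question. The class and helper names (`QuantifierDemo`, `FindAll`) are made up for illustration; `Regex.Matches` is the same call used in the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: returns every match of 'pattern' in 'input'.
    public static List<string> FindAll(string pattern, string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main()
    {
        string source = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";

        // Greedy: runs from the first "A" to the last "c".
        // Prints: AbcbbAbc something elseAbxcXYZAbxc
        Console.WriteLine(string.Join(", ", FindAll("A.+c", source)));

        // Lazy: stops at the first "c" that completes a match.
        // Prints: Abc, Abc, Abxc, Abxc
        Console.WriteLine(string.Join(", ", FindAll("A.+?c", source)));

        // Negated character class: can never run past a "c".
        // Prints: Abc, Abc, Abxc, Abxc
        Console.WriteLine(string.Join(", ", FindAll("A[^c]+c", source)));
    }
}
```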
 
Jochen Kalmbach

Niki said:
PS: Maybe I misunderstood that other post: Of course "Matches" returns
more than one match if there is more than one match. I didn't test it,
but I think your code should run fine if you use ".+?" or "[^c]+".

Sorry for the misunderstanding on my side...
The following works well:

<code>
using System.Text.RegularExpressions;

namespace ConsoleApplication1
{
    class Class1
    {
        static void Main(string[] args)
        {
            string SourceString = "XYZAbcbbAbc something elseAbxcXYZAbxcAb";
            // [^c]* cannot run past a "c", so each match ends at the first "c" after "A".
            Regex r = new Regex("A[^c]*c");

            MatchCollection mc = r.Matches(SourceString);
            foreach (Match m in mc)
            {
                // Groups[0] is the whole match.
                System.Console.WriteLine(m.Groups[0]);
            }
        }
    }
}
</code>

It also matches "Ac"; if you do not want this, use "A[^c]+c" instead.
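
To see how the variants differ on edge cases, here is a rough sketch run against a made-up test string, "Ac Abc Acc" (an assumption, it is not from the original question). `EdgeCaseDemo` and `FindAll` are hypothetical names. It also shows that the lazy ".+?" form is not always equivalent to the negated character class: ".+?" must consume at least one character, and that character may itself be a "c".

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class EdgeCaseDemo
{
    // Hypothetical helper: returns every match of 'pattern' in 'input'.
    public static List<string> FindAll(string pattern, string input)
    {
        var results = new List<string>();
        foreach (Match m in Regex.Matches(input, pattern))
        {
            results.Add(m.Value);
        }
        return results;
    }

    static void Main()
    {
        string input = "Ac Abc Acc"; // illustrative input only

        // Zero or more non-"c" characters: "Ac" counts as a match.
        // Prints: Ac | Abc | Ac
        Console.WriteLine(string.Join(" | ", FindAll("A[^c]*c", input)));

        // At least one non-"c" character: bare "Ac" is skipped.
        // Prints: Abc
        Console.WriteLine(string.Join(" | ", FindAll("A[^c]+c", input)));

        // Lazy dot: the required first character may be a "c",
        // so the first match runs on to the next "c".
        // Prints: Ac Abc | Acc
        Console.WriteLine(string.Join(" | ", FindAll("A.+?c", input)));
    }
}
```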

--
Greetings
Jochen

 
Earl Teigrob

Perfect, thanks for the info. I have done a fair bit with regular
expressions, but I was not aware of the concept of lazy quantifiers. This gets
me on track...

And... I will check out one of the regex testing environments. Great advice!

Earl

 