Using a regular expression to match up to but not including HTML start/end tags

Andy B

I need to create a regular expression that will match a 5-digit number, a
space and then anything up to but not including the next closing HTML tag.
Here is an example:

<startTag>55555 any text</aClosingTag>

I need a Regex that will get all of the text between the HTML tags above
(the HTML tags are random and I do not know them beforehand). The match
string always starts with at least 5 digits.
 
Hi Andy,
There's a nice function at this link that retrieves the text between
the tags:
http://www.4guysfromrolla.com/demos/StripHTML1.asp

The page also lets you test the function online: when you enter the line
<startTag>55555 any text</aClosingTag> in the textbox on the page, it
returns "55555 any text", presumably what you want.

You can then use the same function in your own project.

Hope this helps,

Onur Güzel
 
Andy,
I revised the code a bit. You can paste the code below to get the text
between HTML tags.

In the sample, strToSearch is the line from your post:

'-----------------------------------------
' Requires "Imports System.Text.RegularExpressions" at the top of the file
Dim strToSearch As String
' Your HTML line including its tags
strToSearch = "<startTag>55555 any text</aClosingTag>"

' Initialize the Regex with a pattern that lazily matches any single tag,
' even one that spans a line break
Dim objRegExp As New Regex("<(.|\n)+?>")

' Define output variable
Dim strOutput As String

' Replace all HTML tag matches with the empty string
strOutput = objRegExp.Replace(strToSearch, "")

' Encode any remaining < and > as &lt; and &gt;
strOutput = Replace(strOutput, "<", "&lt;")
strOutput = Replace(strOutput, ">", "&gt;")

' Show result in MsgBox
' Displays "55555 any text"
MsgBox(strOutput)
'----------------------------------------------

Hope it's better,

Onur Güzel
 
Two recommendations: 1)
http://msdn.microsoft.com/en-us/library/az24scfc.aspx and 2) a free product
named Expresso from www.ultrapico.com.

Also, having read some of the other replies, \d{5} matches exactly 5
digits, but since you said the string "always starts with at least 5
digits" maybe you will need \d{5,}. Also, beware the * as it is greedy. *?
may work better for you.
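
To make the two points above concrete, here is a minimal sketch in Python rather than .NET (the quantifiers and the lookahead used here behave the same way in both). The sample string and the variable names are made up for illustration; the sample deliberately has seven leading digits and a second element after the first closing tag.

```python
import re

# Hypothetical sample: seven leading digits, plus a second element afterwards.
sample = "<startTag>1234567 any text</aClosingTag><b>more</b>"

# \d{5} stops after exactly five digits; \d{5,} takes the whole run of digits.
exactly_five = re.search(r"\d{5}", sample).group()
five_or_more = re.search(r"\d{5,}", sample).group()

# (?=</) is a lookahead: the match must be followed by "</" but excludes it.
# Greedy .* runs to the LAST closing tag, lazy .*? stops at the FIRST one.
greedy = re.search(r"\d{5,} .*(?=</)", sample).group()
lazy = re.search(r"\d{5,} .*?(?=</)", sample).group()

print(exactly_five)  # 12345
print(five_or_more)  # 1234567
print(greedy)        # 1234567 any text</aClosingTag><b>more
print(lazy)          # 1234567 any text
```

Whether greedy or lazy is the right choice depends on what else is on the page, so treat this as a starting point for experiments, not a finished pattern.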

Do some reading, get Expresso and experiment with the suggestions provided
in the other replies. Regular expressions are very useful. Learning
something about them will pay a high dividend.

I am concerned about the fact that the html tags are "random". Depending on
what else is in the file you may have problems avoiding stuff you do not
want.

Good luck, Bob
 
"I am concerned about the fact that the html tags are "random". Depending
on what else is in the file you may have problems avoiding stuff you do not
want."

Hi. The html I am searching in is not malformed. What I have is a list of
items on a page that start with at least a 5 digit number (\d{5,}) and then
the item title. There are directions for each item that may or may not be
given. Each item block (number, title and directions) is in a <p></p>
element. If there are directions for the item, there will be a <br /> after
the title. If there are no directions for the item, the title ends at </p>
[the end of the p element in question]. Here are a few examples:

<p>11111 This item has no directions</p>

<p>22222 This item has directions<br />1. Stand up. 2. Turn around. 3. Sit
down</p>

This is what I want to do with the Regex object:
1. Return a Match collection containing all p elements starting with at
least a 5 digit number.
2. Test for the <br /> html tag. If it does exist, split the title before
the <br /> and the directions after the <br /> into separate regex groups.
3. Drop the html tags from the output.

Can this be done with a single regular expression?
 
"My suggestion for this is to take care of the html breaks later in your
code after you've captured the text you want. You're trying to do too much
in regular expressions, and it will become obnoxiously complex."

After a little bit of homework, I came up with this so far:

<p>(?<Number>\d{5,})(?<Title>.*)<br />(?<Steps>.*)</p>

The above works like a dream and I can get the text I need captured to the
Number, Title and Steps groups. Now I need to match the same exact thing but
without the Steps section. The example is: <p>11122 Title without steps</p>.
I need to take the results of both of these matches and put them all inside
of a single Match object. How do I do this?
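
One possible way to cover both cases with a single pattern is to make the <br /> part optional. The sketch below is in Python, not .NET, so the named groups are written (?P<Name>...) where .NET uses (?<Name>...). The function name parse_items and the choice to require whitespace after the number are assumptions made for illustration, not something settled in this thread.

```python
import re

# Title is lazy so it stops at the first <br /> or </p>.
# The (?: ... )? wrapper makes the <br /> and the Steps group optional.
# DOTALL lets "." cross the line break inside the directions text.
ITEM = re.compile(
    r"<p>(?P<Number>\d{5,})\s+(?P<Title>.*?)(?:<br />(?P<Steps>.*?))?</p>",
    re.DOTALL,
)

def parse_items(html):
    """Return one dict per <p> item; Steps is None when there are no directions."""
    return [m.groupdict() for m in ITEM.finditer(html)]

page = (
    "<p>11111 This item has no directions</p>\n\n"
    "<p>22222 This item has directions<br />1. Stand up. 2. Turn around. 3. Sit\n"
    "down</p>"
)

items = parse_items(page)
for item in items:
    print(item["Number"], "|", item["Title"], "|", item["Steps"])
```

Every item comes back from the same match collection, and a missing Steps group simply shows up as an empty (here None) value that the calling code can test for.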
 
< snip >
Agreed. However, I do not use Expresso. I do have it installed, but
it unfortunately does not work correctly with many regular expressions
I use.
< snip >

Can you elaborate? I've always assumed that Expresso uses .Net
RegularExpressions and that it would therefore be impossible for Expresso to
get a result different from a program using the same regex and options. The
only problem I've experienced with Expresso is that when it reads "Sample
Text" some characters, such as ñ (n with a tilde over it), get changed.

Bob
 
<snip>
I believe he will want greedy repetition. I'm fairly certain that
resorting to laziness will return undesired results. Consider the
following example, and try it with both greedy and lazy repetition:

<p>11111 ABC <p>123</p> DEF</p>

I have to admit that I am not sure I fully understand the difference between
".*" and ".*?". I can almost recite what the doc says, but that's not the
same as fully understanding. I haven't played with the example you gave yet
but I hope to today.
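
For anyone following along, here is a small sketch of that comparison, written in Python rather than .NET (greedy and lazy quantifiers behave the same way in both for this case). The two patterns are assumptions about what was meant by "greedy and lazy repetition"; only the sample line comes from the quoted post.

```python
import re

nested = "<p>11111 ABC <p>123</p> DEF</p>"

# Greedy: .* grabs as much as possible, then backs off to the LAST </p>.
greedy = re.search(r"<p>(\d{5,}.*)</p>", nested).group(1)

# Lazy: .*? grabs as little as possible and stops at the FIRST </p>.
lazy = re.search(r"<p>(\d{5,}.*?)</p>", nested).group(1)

print(greedy)  # 11111 ABC <p>123</p> DEF
print(lazy)    # 11111 ABC <p>123
```

With nested tags the lazy version cuts the item short, while with several items on one line the greedy version would run them together, so neither is safe in every case.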

BUT ... in general I have found that ".*?" works better for me than ".*". I
had an interesting experience just yesterday. I developed a regex (using
Expresso) with approximately a half dozen uses of ".*". Actually, given my
experience, I was going to use ".*?", but remembering your post I decided to
use ".*". The resulting expression worked, but was taking over 1.7 seconds
to find a relatively short string in a relatively small file! Since this
expression would be used against over a thousand files I could not tolerate
such poor performance. So, not having any better ideas, I just changed all
of the uses of ".*" to ".*?". The expression still worked and took so little
CPU that it was not measurable.

I am not disagreeing with you, I am just reporting my experience.

Bob
 
.NET's regular expression support works fine. What I was referring to
is a bug (or undesired feature) specific to the version of Expresso I
currently have installed (version 3.0.2766.13570).

Take the following regular expression:

^\$\d+(?:\.\d{1,2}|)$

This is valid and works fine under .NET. It will match text that
contains a dollar sign followed by digits with or without hundredths.
Now try this in Expresso with a couple of test scenarios and watch
what happens.

Here are some test cases that all match:

$10
$99.99
$5.50
$0

When I click 'Run Match', I get zero matches. This is a bug. The
expression matches all lines of text. To confirm this, click
'Validate'. You will see that all lines match in this case.
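
As a rough cross-check outside Expresso, the same expression can be applied to each test case on its own, with no line terminators involved. This sketch uses Python's re module rather than .NET; the pattern is valid in both. The names PRICE, cases and rejected, and the three non-matching strings, are made up for illustration.

```python
import re

PRICE = re.compile(r"^\$\d+(?:\.\d{1,2}|)$")

# Each test case is matched as a separate string, one "line" at a time.
cases = ["$10", "$99.99", "$5.50", "$0"]
results = {case: bool(PRICE.search(case)) for case in cases}

# A few strings that should NOT match: no dollar sign, three decimals, no digits.
rejected = [s for s in ["10", "$5.555", "$.50"] if PRICE.search(s)]

print(results)   # every value is True
print(rejected)  # []
```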

I have the same level of Expresso (I think that we have the latest) and I
have the same experience with your expression and sample text. As I am sure
you know, but for the benefit of others who might be listening in, there's
no problem if you remove the $ at the end of your expression. Which I
understand may not be the expression which you need. I would agree that it
is an Expresso bug. But even so I can't imagine developing a non-trivial
regular expression without it. Have you reported this bug to Ultrapico? I
notice at the moment that the web site is down. I hope that doesn't mean
anything!

Thanks for making me aware of this.

Bob
 
I admit that this is confusing, but it is not a bug in Expresso.
Regular expressions are very literal and you have to remember that a
Windows text file has line termination characters that have to be
matched properly. Specifically, each line ends with "\r\n" (carriage
return, line feed). The regular expression in your example properly
matches each of the examples if it is all by itself without any line
termination. (Try using any of the examples as the only text in the
"Sample Text" box, without a new line). If you use a number of
examples on separate lines, it will not work, just as it would not
work in code, unless you also match the carriage return character at
the end of each line. Try this regex, for example:

^\$\d+(?:\.\d{1,2}|)\r?$

This searches for your string, matches zero or one carriage returns,
then looks for the end of the line. (Be sure the "Multiline" option,
which has a confusing name, is turned ON, so that ^ and $ match at every
line rather than only at the start and end of the whole text.) It will
match every line in your example.
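
The effect of the line terminators can be reproduced in a few lines of code. This is a sketch in Python, not .NET, on the assumption that Python's re.MULTILINE plays the same role as the .NET/Expresso "Multiline" option (it makes ^ and $ match at every line). The variable names are made up, and the sample text assumes every line, including the last, ends with "\r\n".

```python
import re

# Four test lines as they would appear in a Windows text file.
text = "$10\r\n$99.99\r\n$5.50\r\n$0\r\n"

original = r"^\$\d+(?:\.\d{1,2}|)$"
with_cr = r"^\$\d+(?:\.\d{1,2}|)\r?$"

# With per-line anchors, $ matches just before "\n", but the "\r" that
# precedes it is not consumed by the original pattern, so nothing matches.
original_hits = re.findall(original, text, re.MULTILINE)

# Allowing an optional "\r" before $ matches every line.
cr_hits = [m.rstrip("\r") for m in re.findall(with_cr, text, re.MULTILINE)]

# Without per-line anchors, ^ and $ refer to the whole text: no matches.
whole_text_hits = re.findall(with_cr, text)

print(original_hits)    # []
print(cr_hits)          # ['$10', '$99.99', '$5.50', '$0']
print(whole_text_hits)  # []
```

Note that the matched text itself includes the trailing "\r", which is why the sketch strips it before printing.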

The "Validate Line by Line" tool was designed specifically to avoid
this confusion. All it does is take each line individually, without
any line termination characters, and apply the regex to that line,
showing whether it matches the whole line, part of it, or none of it.
If you are expecting your text to have no embedded line termination,
it is the ideal tool to use. If you want to know what will happen if
the text has carriage returns, you should use the "Run Match" tool.

This is definitely confusing, but the goal of Expresso's design is to
show you exactly what would happen if you used the regex in your code.
 