regular expressions


JFB

Hi all,
What regular expression pattern would return the first paragraph of a
string, i.e. the text between the first pair of "<b>" tags?
String = "<b>match sample test<b><b>match2 sample2 test2<b>"
Expected result = "match sample test"
Thanks

JFB
 
BTW,
with this pattern I get all the text between the first and the last <br>:
<(?<tag>\w*)>(?<text>.*)<(?<tag>\w*)>
How can I match only up to the next <br>?
Thanks
 
JFB said:
BTW,
with this pattern I get all the text between the first and the last <br>:
<(?<tag>\w*)>(?<text>.*)<(?<tag>\w*)>
How can I match only up to the next <br>?

Try non-greedy matching with
.*?
instead of
.*
Or use
[^<]*
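
To make the difference concrete, here is a minimal VB.NET sketch that runs the original greedy pattern and the two suggested variants against the sample string from the first post. The module name is made up for illustration; the patterns and the input are the ones quoted in this thread.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazyDemo
    Sub Main()
        ' Sample string from the first post.
        Dim input As String = "<b>match sample test<b><b>match2 sample2 test2<b>"

        ' Greedy: .* runs to the end of the string and backtracks to the LAST tag.
        Dim greedy As String = "<(?<tag>\w*)>(?<text>.*)<(?<tag>\w*)>"
        ' Lazy: .*? stops at the first position where the rest of the pattern fits.
        Dim lazy As String = "<(?<tag>\w*)>(?<text>.*?)<(?<tag>\w*)>"
        ' Negated class: [^<]* can never run past the next "<".
        Dim negated As String = "<(?<tag>\w*)>(?<text>[^<]*)<(?<tag>\w*)>"

        ' Prints: match sample test<b><b>match2 sample2 test2
        Console.WriteLine(Regex.Match(input, greedy).Groups("text").Value)
        ' Prints: match sample test
        Console.WriteLine(Regex.Match(input, lazy).Groups("text").Value)
        ' Prints: match sample test
        Console.WriteLine(Regex.Match(input, negated).Groups("text").Value)
    End Sub
End Module
```

Only the first match is inspected here. If you loop over Regex.Matches instead, the last value you see will be the text of the last match, not the first.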
 
Thanks for your reply and help.
Using <(?<tag>\w*)>(?<text>[^<]*)<(?<tag>\w*)>
it skips ahead and returns the last one:

"match2 sample2 test2"
Any other ideas?


 
Never mind, it looks like the string is different, so now I'm confused.
String = "<br><br>match sample test<br>match2 sample2 test2<br>"
How can I get this result?
result = "match sample test"
Thanks!




 
This is getting better :), the string is now
String = "<br /><br />match sample test<br />match2 sample2 test2<br />"
Please help, now I can't get a match at all.
Thanks



 
JFB said:
This is getting better :), the string is now
String = "<br /><br />match sample test<br />match2 sample2 test2<br />"
Please help, now I can't get a match at all.

That looks like an XML fragment now, so you could parse it as XML, e.g.

' Requires: Imports System.IO, System.Xml, System.Xml.XPath
Dim xml As String = "<br /><br />match sample test<br />match2 sample2 test2<br />"
Dim settings As New XmlReaderSettings()
' Accept input with several top-level nodes instead of a single root element.
settings.ConformanceLevel = ConformanceLevel.Fragment
Dim doc As New XPathDocument(XmlReader.Create(New StringReader(xml), settings))
' Select the first text node that follows a top-level <br /> element.
Dim text As XPathNavigator = doc.CreateNavigator().SelectSingleNode("br/following-sibling::text()")
If text IsNot Nothing Then
    Console.WriteLine(text.Value)
End If

would output "match sample test".

Your earlier samples however were not XML fragments or documents so the
above approach would not work with them.
But if you know the input is an XML document or fragment then I wouldn't
bother to try to parse it with regular expressions but instead exploit
the power of XPath.
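
For input that is not well-formed XML, such as the earlier "<br>" sample, a regular expression is still an option. The following is only a sketch, assuming the separator is always some spelling of <br>, <br/> or <br /> and that the wanted text contains no "<". The module and function names are invented for illustration. Requiring at least one non-whitespace character after the tag avoids the empty match between two consecutive tags, which is probably why the [^<]* version appeared to skip ahead.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module BreakTextDemo
    ' Hypothetical helper: returns the first non-blank run of text that follows
    ' a <br>, <br/> or <br /> tag, or Nothing when there is no such text.
    Function FirstTextAfterBreak(ByVal input As String) As String
        ' <br\s*/?>   matches <br>, <br/> and <br />
        ' \s*         skips whitespace and line breaks after the tag
        ' [^<\s][^<]* needs one real character, then runs up to the next "<"
        Dim m As Match = Regex.Match(input, "<br\s*/?>\s*(?<text>[^<\s][^<]*)", RegexOptions.IgnoreCase)
        If m.Success Then
            Return m.Groups("text").Value.Trim()
        End If
        Return Nothing
    End Function

    Sub Main()
        ' Both calls print: match sample test
        Console.WriteLine(FirstTextAfterBreak("<br><br>match sample test<br>match2 sample2 test2<br>"))
        Console.WriteLine(FirstTextAfterBreak("<br /><br />match sample test<br />match2 sample2 test2<br />"))
    End Sub
End Module
```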
 
If you don't have it, get Expresso from UltraPico. It's a FREE tool which
makes it very easy to experiment with regular expressions.

Bob
 
Thanks again for your reply and help.
When I run this code, I get this error:
' ', hexadecimal value 0x0B

It looks like the data from this .doc file is not valid, but when I open the
Word file in Notepad it looks fine, in HTML format.
Maybe the XML parser has a problem reading my text?
The <br \> shows as a square.
Do you have any idea how to solve this?
Regards

J:)hnny
 
JFB said:
When I run this code, I get this error:
' ', hexadecimal value 0x0B

It looks like the data from this .doc file is not valid, but when I open the
Word file in Notepad it looks fine, in HTML format.
Maybe the XML parser has a problem reading my text?
The <br \> shows as a square.
Do you have any idea how to solve this?

Which code exactly do you run, and which statement exactly gives that
error? What exactly does the input look like? Does it contain characters
that are not allowed in XML, such as control characters?

So far you have shown only variables with strings of markup.
If you have a file instead, then you will need to show how you read the
file contents into a string. With XML, though, you would normally let the
XML parser do all that work: if you have a file file1.xml, you would
simply change the code I posted to

Dim settings As New XmlReaderSettings()
settings.ConformanceLevel = ConformanceLevel.Fragment
Dim doc As New XPathDocument(XmlReader.Create("file1.xml", settings))
Dim text As XPathNavigator = doc.CreateNavigator().SelectSingleNode("br/following-sibling::text()")
If text IsNot Nothing Then
    Console.WriteLine(text.Value)
End If

If you still have problems, then you need to provide more details about
where the file comes from and how it is encoded.
 
Martin Honnen said:
Which code exactly do you run that gives that error for which statement

The error is:
{"' ', hexadecimal value 0x0B, is an invalid character. Line 1, position 1."}
It is raised on this line:
Dim doc As New XPath.XPathDocument(XmlReader.Create(New StringReader(tempcontent), settings))

Martin Honnen said:
How does the input exactly look?

I have a Word .doc file that I need to read to get the name from the address
block.
The paragraph looks like this when I open the file in Notepad:
<br /><br />

SHLOMI HELWA<br />

563 ELTINGVILLE BLVD.<br />

STATEN ISLAND, NY 10312<br />

<br />

<br />

<br />

Martin Honnen said:
Does it contain characters that are not allowed in XML, such as control
characters?
So far you have shown only variables with strings of markup.
If you have a file instead then you will need to show how you read the
file contents into a string

Please send me an email at jfb00(at)hotmail.com and I can send you the Word
file.
I have many Word files from which I need to collect only the name in the
address block, so I read each file and take the paragraph that contains the
address block.
Here is my code:
Try
    ' for Office XP
    wordApp = CreateObject("Word.Application")
    wordDoc = CreateObject("Word.document")
Catch
    ' for Office 2000 and 97
    wordApp = New Word.Application
    wordDoc = New Word.Document
End Try
wordApp.Visible = False
wordDoc = wordApp.Documents.Open(FileName:=docName.ToString)

Dim tempcontent As String = ""
Dim subPara As Word.Paragraph
Dim paraCount As Integer
paraCount = 0
For Each subPara In wordDoc.Paragraphs
    tempcontent = subPara.Range.Text
    paraCount = paraCount + 1
    If paraCount = 5 Then ' Here I get the address block
        Exit For
    End If
Next

Dim settings As New XmlReaderSettings()
settings.ConformanceLevel = ConformanceLevel.Fragment
settings.CheckCharacters = True
Dim doc As New XPath.XPathDocument(XmlReader.Create(New StringReader(tempcontent), settings))
Dim text As XPath.XPathNavigator = doc.CreateNavigator().SelectSingleNode("br/following-sibling::text()")
If text IsNot Nothing Then
    MsgBox(text.Value)
End If



Thanks for your help!
 
Thanks for your reply, Bob.
I already have it, but it doesn't help in my case because my file contains
some special characters.
Regards
 
JFB said:
The error is:
{"' ', hexadecimal value 0x0B, is an invalid character. Line 1, position 1."}
It is raised on this line:
Dim doc As New XPath.XPathDocument(XmlReader.Create(New StringReader(tempcontent), settings))

I have a Word .doc file that I need to read to get the name from the address
block.

I am afraid a Word document can contain characters that are not allowed
in XML documents, so using an XML parser on the contents will not work
unless you first strip any characters that XML does not allow.
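
Stripping the disallowed characters could look roughly like the sketch below. It is not code from this thread: the helper name is invented, and the ranges are the XML 1.0 character ranges written out by hand. 0x0B is the vertical tab, which falls outside those ranges. Surrogate code units are kept as they are, without checking that they form valid pairs.

```vbnet
Imports System
Imports System.Text

Module XmlCharCleanerDemo
    ' Hypothetical helper: returns a copy of the input without the characters
    ' that XML 1.0 does not allow (most control characters, &HFFFE, &HFFFF).
    Function StripInvalidXmlChars(ByVal input As String) As String
        Dim sb As New StringBuilder(input.Length)
        For Each c As Char In input
            Dim code As Integer = AscW(c)
            Dim allowed As Boolean = (code = &H9) OrElse (code = &HA) OrElse (code = &HD) _
                OrElse (code >= &H20 AndAlso code <= &HD7FF) _
                OrElse (code >= &HE000 AndAlso code <= &HFFFD) _
                OrElse Char.IsSurrogate(c)
            If allowed Then
                sb.Append(c)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        ' A vertical tab (0x0B) in front of otherwise valid markup.
        Dim raw As String = ChrW(&HB) & "<br /><br />SHLOMI HELWA<br />"
        ' Prints: <br /><br />SHLOMI HELWA<br />
        Console.WriteLine(StripInvalidXmlChars(raw))
    End Sub
End Module
```

The cleaned string can then be passed to the StringReader in place of the raw paragraph text. If the 0x0B characters are meant as line breaks, replacing them with a space or a newline instead of dropping them may keep the lines apart.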
 
I used arrays and it works:
Dim ArrayCadenas() As String

ArrayCadenas = Split("<br \><br \>SHLOMI HELWA<br \>563 ELTINGVILLE BLVD.<br \>STATEN ISLAND, NY 10312<br \><br \>", "<br \>")

' The string starts with two separators, so elements 0 and 1 are empty
' and the name is at index 2.
MsgBox(ArrayCadenas(2))

Thanks for your reply and help!
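
A variation on the same idea, offered only as a sketch: String.Split with StringSplitOptions.RemoveEmptyEntries drops the empty pieces produced by leading or doubled separators, so the first element is the first real line. The module and variable names are invented here, and the separator is assumed to be exactly the "<br \>" text used above.

```vbnet
Imports System

Module SplitOnBreaksDemo
    Sub Main()
        Dim block As String = "<br \><br \>SHLOMI HELWA<br \>563 ELTINGVILLE BLVD.<br \>STATEN ISLAND, NY 10312<br \><br \>"

        ' RemoveEmptyEntries discards the empty strings between adjacent separators.
        Dim parts() As String = block.Split(New String() {"<br \>"}, StringSplitOptions.RemoveEmptyEntries)

        If parts.Length > 0 Then
            ' Prints: SHLOMI HELWA
            Console.WriteLine(parts(0))
        End If
    End Sub
End Module
```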
 