How can I match a hyphen in this regular expression?


Peter

Hi all,

I am searching through directories trying to find the prefix to a
number of files. Unfortunately the files don't have a standard naming
convention yet.

So some of them appear as:
THH307A.Monitoring Public Health Issues.doc ' A single period
THH307A- Monitoring Public Health Issues.doc ' A hyphen, space
THH307A Monitoring Public Health Issues.doc ' A single space
THH307 A.Monitoring Public Health Issues.doc ' A mess

At the moment I can match filenames that have a period or space after
the prefix but can't work out how to also match a hyphen.

This is my regex at the moment:
Dim PrefixRegex As Regex = New Regex("(?<prefix>[^\.| ]+)[\.| ](?<unitName>.+)")

Can someone help me match the hyphen and maybe even the messy ones
where there could be a space near the end of the unit code.

Many thanks,

Peter.
 
Peter wrote:
So some of them appear as:
THH307A.Monitoring Public Health Issues.doc ' A single period
THH307A- Monitoring Public Health Issues.doc ' A hyphen, space
THH307A Monitoring Public Health Issues.doc ' A single space
THH307 A.Monitoring Public Health Issues.doc ' A mess

At the moment I can match filenames that have a period or space after
the prefix but can't work out how to also match a hyphen.

This is my regex at the moment:
Dim PrefixRegex As Regex = New Regex("(?<prefix>[^\.| ]+)[\.| ](?<unitName>.+)")
<snip>

Maybe "(?<prefix>[^\.| ]+)[\.| ]+(?<unitName>.+)" will do. However,
the last example (with an embedded space in the prefix) will be more
challenging...


HTH.

Regards.

Branco.
 

Hey Branco,

Thanks for replying. I tried what you suggested, but it appears to do
the same thing as my original regular expression. I am trying to
extract the prefix without the hyphen. Unfortunately, with your regex
the hyphen remains attached to the prefix.

I am using this small console app to test the regex:

Imports System.Text.RegularExpressions

Module Module1

    Sub Main()

        ' My regex:
        ' Dim PrefixRegex As Regex = New Regex("(?<prefix>[^\.| ]+)[\.| ](?<unitName>.+)")

        ' Branco's regex:
        Dim PrefixRegex As Regex = New Regex("(?<prefix>[^\.| ]+)[\.| ]+(?<unitName>.+)")
        Dim filename As String = "THH307A Monitoring Public Health Issues.doc"
        Dim filename2 As String = "THH307A.Monitoring Public Health Issues.doc"
        Dim filename3 As String = "THH307A- Monitoring Public Health Issues.doc "

        ' Run the same regex against each test filename.
        Dim M As Match = PrefixRegex.Match(filename)
        Dim M2 As Match = PrefixRegex.Match(filename2)
        Dim M3 As Match = PrefixRegex.Match(filename3)

        If M.Success Then
            System.Console.WriteLine("Prefix: " & M.Groups("prefix").Value)
            System.Console.WriteLine("Unit Name: " & M.Groups("unitName").Value)
        Else
            System.Console.WriteLine(filename & " is not a valid filename")
        End If

        If M2.Success Then
            System.Console.WriteLine("Prefix: " & M2.Groups("prefix").Value)
            System.Console.WriteLine("Unit Name: " & M2.Groups("unitName").Value)
        Else
            System.Console.WriteLine(filename2 & " is not a valid filename")
        End If

        If M3.Success Then
            System.Console.WriteLine("Prefix: " & M3.Groups("prefix").Value)
            System.Console.WriteLine("Unit Name: " & M3.Groups("unitName").Value)
        Else
            System.Console.WriteLine(filename3 & " is not a valid filename")
        End If

        System.Console.WriteLine()
        System.Console.WriteLine("Press Enter to Continue...")
        System.Console.ReadLine()
    End Sub

End Module

Output:
-----------
Prefix: THH307A
Unit Name: Monitoring Public Health Issues.doc
Prefix: THH307A
Unit Name: Monitoring Public Health Issues.doc
Prefix: THH307A-
Unit Name: Monitoring Public Health Issues.doc

Press Enter to Continue...

----------------------------------

Do you have any other ideas?

Thanks again,

Peter.
 
Peter wrote:
Thanks for replying. I tried what you suggested but it appears to do
the same thing as my original regular expression. I am trying to
extract the prefix without the hyphen. But unfortunately using your
above mentioned regex the hyphen remains attached to the prefix.
<snip>

Sorry, I really can't recall what I originally understood from your
first post (a real busy day on this side of the country)...

The thing with specifying a literal hyphen in a character class is that
it must be the first or last element of the class (or be escaped as \-);
anywhere else it denotes a range. Therefore, the regex will probably be
like this:

"(?<prefix>[^\.| -]+)[\.| -]+(?<unitName>.+)"

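To see what the hyphen-aware pattern does with the four sample names from the first post, here is a minimal sketch. It uses a slightly trimmed form of the pattern above: inside a character class the period needs no backslash and "|" is just a literal pipe, so both are dropped. The module name HyphenDemo and the field name PrefixRegex are only illustrative.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module HyphenDemo

    ' The hyphen sits last in each class, so it is a literal character
    ' rather than a range operator.
    Public ReadOnly PrefixRegex As New Regex("(?<prefix>[^. -]+)[. -]+(?<unitName>.+)")

    Sub Main()
        Dim names() As String = { _
            "THH307A.Monitoring Public Health Issues.doc", _
            "THH307A- Monitoring Public Health Issues.doc", _
            "THH307A Monitoring Public Health Issues.doc", _
            "THH307 A.Monitoring Public Health Issues.doc"}

        For Each name As String In names
            Dim m As Match = PrefixRegex.Match(name)
            If m.Success Then
                Console.WriteLine("Prefix: [" & m.Groups("prefix").Value & _
                                  "]  Unit Name: [" & m.Groups("unitName").Value & "]")
            End If
        Next
    End Sub

End Module
```

The first three names give the prefix THH307A with no trailing hyphen. The fourth still splits at the embedded space (prefix THH307, unit name starting with "A."), so the messy case needs separate handling.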

HTH.

Regards,

Branco.
 
What about logically approaching it like this (to help catch "the mess" case):

1) Strip off the file extension and the preceding period.
2) Now you're left with just the file name.
3) Reverse the string and then find the first occurrence (in reality the last
occurrence, since we reversed the string) of a non-alphanumeric character
that's not a space. Take everything after this character.
4) If the result of #3, above, is an empty string (the file name contains no
special character to split on), go back to the original file name and just
take everything to the left of the first space. PREFIX FOUND.
5) Else, take the result of #3, above, and reverse it to put it back in the
normal order. PREFIX FOUND.
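The steps above could be sketched roughly as follows. This is one possible reading, not a tested solution: instead of physically reversing the string it scans from the end, which finds the same character. The names PrefixBySeparator and ExtractPrefix are made up for the example, and trimming stray whitespace is an added assumption.

```vbnet
Imports System
Imports System.IO

Module PrefixBySeparator

    ' Returns the prefix of a file name, following steps 1 to 5 above.
    Function ExtractPrefix(ByVal fileName As String) As String
        ' Steps 1-2: drop the extension and the period in front of it.
        Dim name As String = Path.GetFileNameWithoutExtension(fileName.Trim())

        ' Step 3: scan backwards (equivalent to reversing the string) for the
        ' last character that is neither alphanumeric nor a space.
        Dim separatorIndex As Integer = -1
        For i As Integer = name.Length - 1 To 0 Step -1
            Dim c As Char = name(i)
            If Not Char.IsLetterOrDigit(c) AndAlso c <> " "c Then
                separatorIndex = i
                Exit For
            End If
        Next

        ' Step 5: a separator was found, so the prefix is everything before it.
        If separatorIndex > 0 Then
            Return name.Substring(0, separatorIndex).Trim()
        End If

        ' Step 4: nothing to split on, so fall back to the first space.
        Dim spaceIndex As Integer = name.IndexOf(" "c)
        If spaceIndex > 0 Then
            Return name.Substring(0, spaceIndex)
        End If
        Return name
    End Function

    Sub Main()
        Console.WriteLine(ExtractPrefix("THH307A.Monitoring Public Health Issues.doc"))
        Console.WriteLine(ExtractPrefix("THH307A- Monitoring Public Health Issues.doc"))
        Console.WriteLine(ExtractPrefix("THH307A Monitoring Public Health Issues.doc"))
        Console.WriteLine(ExtractPrefix("THH307 A.Monitoring Public Health Issues.doc"))
    End Sub

End Module
```

For the four sample names this prints THH307A three times and then "THH307 A" (with the embedded space kept). Note that a unit name that itself contains a hyphen or period would move the split point, so such names would need an extra rule.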

One approach I've used when trying to parse strings in crazy formats is to
apply whatever rules you've got so far to your list of strings. Make two
groups: those that you were able to parse correctly and those you weren't.
Look at the unparseable group to see what rules you need to add to recognize
more strings. Keep adding rules to shrink the unparseable group. When you're
done, you'll be left with a small group of strings that you might have to
parse manually if no parsing rule can be created for them.
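A small sketch of that workflow, under a loud assumption: the single starting rule below expects a code shaped like THH307A (three letters, three digits, one letter), which is only a guess based on the sample names. RuleTriage and TryParsePrefix are hypothetical names.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module RuleTriage

    ' Returns the prefix from the first rule that matches, or Nothing.
    Function TryParsePrefix(ByVal name As String, ByVal rules As List(Of Regex)) As String
        For Each rule As Regex In rules
            Dim m As Match = rule.Match(name)
            If m.Success Then Return m.Groups("prefix").Value
        Next
        Return Nothing
    End Function

    Sub Main()
        ' Assumed code shape: three letters, three digits, one letter.
        Dim rules As New List(Of Regex)
        rules.Add(New Regex("^(?<prefix>[A-Z]{3}\d{3}[A-Z])[. -]+(?<unitName>.+)$"))

        Dim names() As String = { _
            "THH307A.Monitoring Public Health Issues.doc", _
            "THH307A- Monitoring Public Health Issues.doc", _
            "THH307A Monitoring Public Health Issues.doc", _
            "THH307 A.Monitoring Public Health Issues.doc"}

        ' Split the list into names a rule recognized and names none did.
        Dim parsed As New List(Of String)
        Dim unparsed As New List(Of String)
        For Each name As String In names
            Dim prefix As String = TryParsePrefix(name, rules)
            If prefix Is Nothing Then
                unparsed.Add(name)
            Else
                parsed.Add(prefix & " <- " & name)
            End If
        Next

        Console.WriteLine("Parsed: " & parsed.Count & ", still unparsed: " & unparsed.Count)
        For Each name As String In unparsed
            Console.WriteLine("  needs a new rule: " & name)
        Next
    End Sub

End Module
```

With this one rule, three names are parsed and the "THH307 A" name lands in the unparsed group, which is the cue to add another rule to the list and run it again.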

PJ Simon
 