regex syntax

jg

I am new to both .NET and regex. I have done the basic reading, to
the point where I thought I knew how to use regex to extract date strings, but I
ran into problems.


What is the best regex to look for month names, or date strings for
that matter?

from my testing, I could use
"((JAN)|(FEB)|(MAR)|(APR)|(MAY)|(JUN)|(JUL)|(AUG)|(SEP)|(OCT)|(NOV)|(DEC))"
not
'([ADFJMNOS][ACEOPU][BCGLNPRTVY])"
In other words, I have a syntax problem with the month pattern.

I am working towards handling the various date formats I come across.
My objective is to get the entire date string and parse it into yyyy-mm-dd or
whatever the .NET conversion routine will take.
I will have to deal with many long strings of 64K to 200K. This is the
reason I am looking for a good regex, to minimize processing
delays.

I know I have to deal with
yyyy-mm-dd ( and variants thereof with dot or slash as separator instead
of dash, single digit month or day)
yyyy-MMM-dd ( or just space instead of -)
MMM d, yy ( or yyyy)
and the tougher ones like
d MMM yyyy
d MMM yy
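
A note on the two month patterns above, as a minimal C# sketch rather than a diagnosis: the second pattern as posted opens with ' and closes with ", which would not compile as a string literal. Even with matching quotes, three character classes check each letter position independently, so they accept many combinations that are not month abbreviations. The class name and the sample strings below are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MonthPatternDemo
{
    // Alternation: matches only the twelve listed abbreviations.
    public const string Alternation =
        "(JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";

    // Character classes: each position is tested on its own, so any
    // combination of the listed letters is accepted.
    public const string CharClasses = "([ADFJMNOS][ACEOPU][BCGLNPRTVY])";

    public static void Main()
    {
        // "DUG" and "JPY" are not months, but every letter is in its class.
        foreach (string candidate in new[] { "MAR", "DUG", "JPY" })
        {
            Console.WriteLine("{0}: alternation={1}, classes={2}",
                candidate,
                Regex.IsMatch(candidate, Alternation),
                Regex.IsMatch(candidate, CharClasses));
        }
    }
}
```

The alternation is longer to type but is the only one of the two that is limited to real month abbreviations.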
 
have a look at regexlib.com for customized expressions

--
Regards,
Alvin Bruney
[Shameless Author Plug]
The Microsoft Office Web Components Black Book with .NET
available at www.lulu.com/owc, Amazon, B&H etc


Forth-coming VSTO.NET
 
Thank you.

However, I had no luck accessing that content. All I got was the green
logos; I did not see anything else.

 
jg said:
I know I have to deal with
yyyy-mm-dd ( and variants thereof with dot or slash as separator instead
of dash, single digit month or day)
yyyy-MMM-dd ( or just space instead of -)
MMM d, yy ( or yyyy)
and the tougher ones like
d MMM yyyy
d MMM yy

I have created a regex for you that works with all those samples. Here
it is:

(?<year>\d{4})[-\./\s](?<month>\d{1,2})[-\./\s](?<day>\d{1,2})$ |
(?<year>\d{4})[-\s](?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[-\s](?<day>\d{1,2})$
|
(?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s(?<day>\d{1,2}),\s*?(?<year>\d{4}|\d{2})$
|
(?<day>\d{1,2})\s(?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s(?<year>\d{4}|\d{2})$

I tried this with the following samples, constructed from the templates
you gave:

2005-03-08
2005.03.08
2005/03/08
2005 03 08
2005 3 08
2005 3 8
2005 03 8
2005-MAR-08
2005 MAR 08
2005 MAR 8
MAR 8, 2005
MAR 08, 2005
MAR 8, 05
MAR 08, 05
8 MAR 2005
8 MAR 05
08 MAR 2005
08 MAR 05

As you can see, the expression is comprised of four different parts.
Each of these has a $ sign at the end, which you'll want to get rid of
before using the expression with your own long string. This is only
needed to test the expression in Regulator with multiple samples.

I tried this with the IgnoreWhitespace and the IgnoreCase options
switched on.

Hope this helps!

(If you have any trouble with the regex, I could send you the saved
Regulator file. Just in case things get mangled in the message or
something.)
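
For readers who want to try the expression from code, here is a minimal sketch of one way it might be used in C#. It drops the $ anchors as described above and assumes that the option Regulator calls IgnoreWhitespace corresponds to RegexOptions.IgnorePatternWhitespace in .NET. The class name DateFinder, the Months constant, the FindDates helper and the sample text are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DateFinder
{
    // Hypothetical helper constant; the month list is the one from the post.
    private const string Months = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC";

    // The four alternatives from the post, without the trailing $ anchors.
    // Line breaks and indentation inside the pattern are ignored because of
    // RegexOptions.IgnorePatternWhitespace.
    private static readonly Regex DatePattern = new Regex(
        @"(?<year>\d{4})[-\./\s](?<month>\d{1,2})[-\./\s](?<day>\d{1,2})
        | (?<year>\d{4})[-\s](?<month>" + Months + @")[-\s](?<day>\d{1,2})
        | (?<month>" + Months + @")\s(?<day>\d{1,2}),\s*?(?<year>\d{4}|\d{2})
        | (?<day>\d{1,2})\s(?<month>" + Months + @")\s(?<year>\d{4}|\d{2})",
        RegexOptions.IgnorePatternWhitespace | RegexOptions.IgnoreCase);

    // Returns every date-like substring found in the text, in order.
    public static List<string> FindDates(string text)
    {
        var found = new List<string>();
        foreach (Match m in DatePattern.Matches(text))
        {
            found.Add(m.Value);
        }
        return found;
    }

    public static void Main()
    {
        string text = "Shipped 2005-03-08, invoiced MAR 9, 05, paid 10 mar 2005.";
        foreach (Match m in DatePattern.Matches(text))
        {
            // The named groups are shared by all four alternatives.
            Console.WriteLine("{0}: year={1} month={2} day={3}",
                m.Value,
                m.Groups["year"].Value,
                m.Groups["month"].Value,
                m.Groups["day"].Value);
        }
    }
}
```

The captured year, month and day still have to be converted to a date separately; the sketch only locates the substrings.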


Oliver Sturm
 
That is absolutely wonderful and helpful. Thank you very much; your efforts
are much appreciated.
Thank you again for testing and explaining.

I will try that out.

 
Great, it works even after taking out the $ and the spaces around the |. I
did add \b before the entire expression to make sure the first part of the
date starts on a word boundary. This way I can avoid some (supposedly
low-probability) false matches, such as odd catalogue numbers in dot or dash notation.



Now all I have to do is to make it work with January, February,... ( fully
spelled month names). I guess I can always add another 12 | parts to the
month expressions
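
One possible shortcut for the spelled-out names, sketched in C# under a few assumptions: instead of twelve extra alternatives, each abbreviation gets an optional tail, for example MAR(?:CH)?. Only the d MMM yyyy form is shown to keep the sketch short, and the names MonthNames, Month and DayMonthYear are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MonthNames
{
    // Each abbreviation may be followed by the rest of the full name.
    public const string Month =
        "JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUNE?|JULY?|" +
        "AUG(?:UST)?|SEP(?:TEMBER)?|OCT(?:OBER)?|NOV(?:EMBER)?|DEC(?:EMBER)?";

    // Leading \b: the day must start on a word boundary, as described above.
    public static readonly Regex DayMonthYear = new Regex(
        @"\b(?<day>\d{1,2})\s(?<month>" + Month + @")\s(?<year>\d{4}|\d{2})",
        RegexOptions.IgnoreCase);

    public static void Main()
    {
        string[] samples =
        {
            "8 MAR 2005", "8 March 2005", "A18 March 2005", "8 Marches 2005"
        };
        foreach (string s in samples)
        {
            Match m = DayMonthYear.Match(s);
            Console.WriteLine("{0}: {1}", s,
                m.Success ? m.Groups["month"].Value : "(no match)");
        }
    }
}
```

The \s that follows the month group is what rejects words that merely start with a month name.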

 
jg said:
Great, it works even after taking out the $ and the space around the |.. I
did add \b before the entire expression to make sure the first part of the
date is on the word boundary. This way I can avoid some supposedly low
probability errors like some strange catalogue dot or dash notations

Sure, I didn't know your exact circumstances, so you'd have to make
modifications to my sample to make it work for you completely.

jg said:
Now all I have to do is to make it work with January, February,... ( fully
spelled month names). I guess I can always add another 12 | parts to the
month expressions

Sure you can. If you find the whole thing growing too much, maybe you
could define the various parts you need (the month expression, the day
expression, the two digit year, the four digit year) as string constants
in your code and use a String.Format to put them together to form the
complete regular expression before you use it. That way it might be a
bit more maintainable - otherwise you'll have to make every change to
one of the parts in many places, increasing the probability of an error.
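
A minimal sketch of that idea in C#, with invented names (DatePatternBuilder, Build and the constants). One thing to watch for: String.Format treats { and } in the format string as placeholder delimiters, so a literal quantifier such as {1,2} written there would have to be doubled as {{1,2}}. The sketch sidesteps this by keeping all quantifiers inside the arguments, which String.Format inserts unchanged.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DatePatternBuilder
{
    // Building blocks; each is defined once and reused in several places.
    private const string Year4 = @"(?<year>\d{4})";
    private const string Year = @"(?<year>\d{4}|\d{2})";
    private const string MonthNumber = @"(?<month>\d{1,2})";
    private const string MonthName =
        "(?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)";
    private const string Day = @"(?<day>\d{1,2})";

    // Note: in the format string, {0}..{4} are placeholders, not regex
    // quantifiers. {0}=Year4, {1}=Year, {2}=MonthNumber, {3}=MonthName, {4}=Day.
    public static string Build()
    {
        return String.Format(
            @"{0}[-\./\s]{2}[-\./\s]{4}|{0}[-\s]{3}[-\s]{4}|{3}\s{4},\s*?{1}|{4}\s{3}\s{1}",
            Year4, Year, MonthNumber, MonthName, Day);
    }

    public static void Main()
    {
        string pattern = Build();
        Console.WriteLine(pattern);

        Match m = Regex.Match("8 MAR 05", pattern, RegexOptions.IgnoreCase);
        Console.WriteLine("day={0} month={1} year={2}",
            m.Groups["day"].Value, m.Groups["month"].Value, m.Groups["year"].Value);
    }
}
```

Changing the month list or the year rule then means editing one constant instead of four places in the pattern.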



Oliver Sturm
 
Thank you again. You are wonderfully helpful.

I did find the pattern string getting too huge, so I started to split the date
pattern into 3 components before using them to compose the final pattern,
although I did not use the String.Format method.
 
jg said:
I did find the pattern string getting too huge. So I started to split date
pattern into 3 components before using them to compose the final pattern,
although I did not use the string format method.

Well, if you ask me, you should always use String.Format when putting
together strings from more than two parts. A String.Format call can
create an arbitrarily complicated string in one operation, while a
concatenation a + b + c takes two operations at least. Strings are
immutable in .NET, so a + b + c will end up allocating several new
strings before the final result is ready.

The argument against this is that the compiler might get rid of some of
the overhead for you, at least when a, b and c are static strings. But I
don't like to depend on that, especially when the String.Format call is
usually so much more readable:

"At " + time.ToString() + ", the user " + user +
    " had a problem accessing the " + resource + " resource."

String.Format("At {0}, the user {1} had a problem accessing the {2} resource.",
    time, user, resource);



Oliver Sturm
 
Oliver Sturm said:
Well, if you ask me, you should always use String.Format when putting
together strings from more than two parts.

I disagree.

Oliver Sturm said:
A String.Format call can
create an arbitrarily complicated string in one operation, while a
concatenation a + b + c takes two operations at least.

What do you count as an operation? Bear in mind that String.Format has
to do a lot more work in terms of parsing etc - I very much doubt that
there are many cases where it's more efficient.

Oliver Sturm said:
Strings are
immutable in .NET, so a + b + c will end up allocating several new
strings before the final result is ready.

That's not true if a, b and c are already strings. a+b+c will simply
result in a call to String.Concat(a, b, c) which creates one string
without creating any intermediate ones. It's not like a+b+c is compiled
into (a+b)+c, evaluating a+b first.

string a = "a";
string b = "b";
string c = "c";

string x = a+b+c;

is compiled into:

IL_0000: ldstr "a"
IL_0005: stloc.0
IL_0006: ldstr "b"
IL_000b: stloc.1
IL_000c: ldstr "c"
IL_0011: stloc.2
IL_0012: ldloc.0
IL_0013: ldloc.1
IL_0014: ldloc.2
IL_0015: call string [mscorlib]System.String::Concat(string, string, string)
IL_001a: stloc.3

Oliver Sturm said:
The argument against this is that the compiler might get rid of some of
the overhead for you, at least when a, b and c are static strings. But I
don't like to depend on that

You can depend on it in C# at least - it's in the specification, IIRC.

Oliver Sturm said:
especially when the String.Format call is
usually so much more readable:

"At " + time.ToString() + ", the user " + user +
    " had a problem accessing the " + resource + " resource."

String.Format("At {0}, the user {1} had a problem accessing the {2} resource.",
    time, user, resource);

Sometimes String.Format is more readable; sometimes it's less readable.
In almost all cases, readability should be the key to determining which
to use.
 
Jon said:
I disagree.

I guess I should have qualified my statement better. I might have added
conditions like "and at least one of the parts is not a string in itself".

Jon said:
You can depend on it in C# at least - it's in the specification, IIRC.

I would readily assume it even without reading the specs. I would make a
test if it were in any way important to me. Until then, I wouldn't
depend on it.

Jon said:
Sometimes String.Format is more readable; sometimes it's less readable.
In almost all cases, readability should be the key to determining which
to use.

Right, that was my most important point as well. But apart from
concatenations of literal strings or variables/constants holding
strings, I can't imagine cases where the + concatenation would be more
readable (see above, IMO). Even in these cases I might tend to use
String.Format because during the course of development I find it much
easier to extend and change. I can always change it if the profiler says
it's a problem.



Oliver Sturm
 
Oliver Sturm said:
I guess I should have qualified my statement better. I might have added
conditions like "and at least one of the parts is not a string in itself".

Do you have evidence that String.Format doesn't itself convert the
arguments to intermediate strings? If it does, I can't see that using
it is saving any operations.

Oliver Sturm said:
I would readily assume it even without reading the specs. I would make a
test if it were in any way important to me. Until then, I wouldn't
depend on it.

Well, take it from me - you *can* depend on it. (That's assuming that
by "static" you mean "constant".)

Oliver Sturm said:
Right, that was my most important point as well. But apart from
concatenations of literal strings or variables/constants holding
strings, I can't imagine cases where the + concatenation would be more
readable (see above, IMO). Even in these cases I might tend to use
String.Format because during the course of development I find it much
easier to extend and change. I can always change it if the profiler says
it's a problem.

In cases with a single parameter you want at the end of the string, I
think it's more readable to have:

string x = "Age: "+age;

than:

string x = string.Format("Age: {0}", age);

It's very easy to change the former to the latter if you ever *do* want
to do anything more complicated.
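
The point made earlier in this post about constants can be checked with a small C# sketch. It relies on two assumptions: that the compiler evaluates a concatenation of constant strings at compile time, and that the runtime interns string literals (its default behaviour), so the folded result is the same object as an identical literal. The class and method names are invented for illustration.

```csharp
using System;

public static class ConcatDemo
{
    private const string A = "a";
    private const string B = "b";

    // Both operands are compile-time constants, so A + B is folded into
    // the literal "ab" and no concatenation happens at run time.
    public static bool ConstantsAreFolded()
    {
        return Object.ReferenceEquals(A + B, "ab");
    }

    // A local variable is not a constant, so this compiles to a
    // String.Concat call, which builds a new string object.
    public static bool VariablesAreFolded()
    {
        string x = "a";
        return Object.ReferenceEquals(x + B, "ab");
    }

    // The two spellings from the post produce equal text.
    public static bool SameText(int age)
    {
        return ("Age: " + age) == String.Format("Age: {0}", age);
    }

    public static void Main()
    {
        Console.WriteLine(ConstantsAreFolded());
        Console.WriteLine(VariablesAreFolded());
        Console.WriteLine(SameText(42));
    }
}
```

Reference equality is used here only as a probe for whether a new string was allocated; ordinary code should compare strings by value.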
 