Regular Expressions for email address (lengthy)

  • Thread starter: _AnonCoward

_AnonCoward

I've looked around for a while now for a solid email regexp string, and there doesn't seem to be a consensus on what such a pattern should look like. A common approach is something along these lines:
^[a-zA-Z_0-9\.\-]+@[a-zA-Z_0-9\-]+(\.([a-zA-Z_0-9\-])+)*[a-zA-Z]{2,4}$
However, in researching RFC 822 (the copy hosted by the W3C - http://www.w3.org/Protocols/rfc822/#z66), I determined that this pattern is not consistent with the spec, so I've come up with the following approach. Could others look this over and offer feedback? Thanx.
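
To make the mismatch concrete, here is a minimal Python sketch that runs the "common approach" pattern quoted above against a few addresses. The helper name common_accepts and the sample addresses are my own, chosen for illustration only.

```python
import re

# The "common approach" pattern exactly as quoted in the post.
COMMON = re.compile(
    r'^[a-zA-Z_0-9\.\-]+@[a-zA-Z_0-9\-]+(\.([a-zA-Z_0-9\-])+)*[a-zA-Z]{2,4}$'
)

def common_accepts(address: str) -> bool:
    """Return True when the whole address matches the common pattern."""
    return COMMON.fullmatch(address) is not None

print(common_accepts('john.smith@example.com'))  # True
print(common_accepts('john+news@example.com'))   # False: '+' is a legal atom character in RFC 822
print(common_accepts('john@examplecom'))         # True: no period is required before the final letters
```

So the pattern as quoted is both stricter than RFC 822 (it rejects legal atom characters) and looser than intended (it accepts a domain with no period at all).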

Per RFC822, an email address matches this pattern:
LOCAL-PART @ DOMAIN
The definition for LOCAL-PART is:
word *("." word)
Or in other words, a pattern of characters delimited by periods. The definition for "word" is either a "quoted-string" or an "atom".

A "quoted-string" is pretty much what it sounds like: a pair of double quotes enclosing a text string consisting of zero or more characters. The characters can either be "qtext" or a "quoted-pair". For qtext characters, the only limitation is that they may not be a double quote, a backslash or a CR (ASCII 13). A quoted-pair is "\" followed by any character.

An atom, on the other hand, is a string of one or more characters that is best described by what's disallowed rather than what's permissible. An atom may contain any character except the following:
control chars (ASCII 0 to ASCII 31{HEX: x1F})
space char (ASCII 32 {HEX: x20})
any of the following "special" characters:
()<>@,;:\".[]
In my opinion, that allows for some genuinely odd possibilities for the local part of email addresses that are both counter-intuitive and unlikely in the extreme:
"a quoted string containing \"()<>@\" followed by".a-string_of#atom~characters.followed/by~"another quoted string"
Therefore as a practical matter, I'm limiting the LOCAL-PART of the email addresses to either a single quoted-string or one or more atoms delimited by periods. Further, I've chosen to limit the contents of a quoted-string to qtext only and make no provision for quoted-pairs.

I believe a qtext string can be expressed as:
[^"\x0D\\]
and a quoted-string therefore as:
("[^"\x0D\\]+")
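
As a quick check of the quoted-string pattern above, here is a small Python sketch. The function name is_quoted_string and the test strings are assumptions of mine, not part of the original pattern.

```python
import re

# Pattern copied from the post: qtext only, no quoted-pairs, at least one character.
QUOTED_STRING = re.compile(r'"[^"\x0D\\]+"')

def is_quoted_string(s: str) -> bool:
    """Return True when the whole string is a quoted-string in the post's restricted sense."""
    return QUOTED_STRING.fullmatch(s) is not None

print(is_quoted_string('"john smith"'))  # True: plain qtext between quotes
print(is_quoted_string('"a\\"b"'))       # False: contains a quoted-pair (\"), which is excluded here
print(is_quoted_string('""'))            # False: the + quantifier requires at least one character
```

Note that the last case is a side effect of using + rather than *: RFC 822 itself allows an empty quoted-string.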
The alternative way to express the LOCAL-PART of an email address is by one or more strings of "atoms" that are delimited by periods.

I believe an atom can be expressed as:
[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+
An allowable period-delimited string of atoms would therefore consist of at least one atom (at least 1 allowable character) followed by zero or more further atoms, each preceded by a period (like so):
[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+(\.[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+)*
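
The period-delimited atom pattern above can be exercised like this in Python. This is only a sketch: the names ATOM, DOT_ATOMS and is_dot_atoms are mine, and the pattern is assembled from the atom expression so the two copies cannot drift apart.

```python
import re

# Atom character class copied from the post.
ATOM = r'[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+'

# One atom, then zero or more ".atom" groups.
DOT_ATOMS = re.compile(ATOM + r'(\.' + ATOM + r')*')

def is_dot_atoms(s: str) -> bool:
    """Return True when the whole string is a period-delimited run of atoms."""
    return DOT_ATOMS.fullmatch(s) is not None

print(is_dot_atoms('john.smith'))                   # True
print(is_dot_atoms('a-string_of#atom~characters'))  # True: odd, but every character is a legal atom character
print(is_dot_atoms('john..smith'))                  # False: empty atom between the periods
print(is_dot_atoms('.john'))                        # False: leading period
print(is_dot_atoms('john smith'))                   # False: space is excluded
```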
This means the LOCAL-PART of the email address (in the somewhat limited implementation described above) can be represented as a quoted string or a combination of 1 or more atoms separated by periods:
^(("[^"\x0D\\]+")|([^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+(\.[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+)*))
The DOMAIN portion of the email address is another creature altogether. A domain consists of one or more sub-domains that are delimited by periods. A sub-domain is either a string of 1 or more atom characters OR an IP address.

The problem with this approach is it allows for multiple IP addresses to be concatenated with other IP addresses, atom strings or both. In theory then, the following is permissible:
0.0.0.0.1.1.1.1.abc.xyz.99.99.99.99.999.com
Clearly not a good idea. Of course, an IP address is a collection of strings of atoms separated by periods, so it raises the question: should we even attempt to screen for them? As it happens, if you allow for a list of period-separated atom strings you automatically accommodate an IP address even if it is invalid (e.g.: 999.999.999.999). Unless you choose to limit the format a non-IP domain string may take (for example, only allowing alpha characters for the final sub-domain string), I believe it is impossible to screen out invalid IP addresses.
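
The claim above is easy to confirm in Python. The second half of this sketch is an assumption that goes beyond the post: if bogus numeric domains matter, one option is to check them outside the regex with the standard ipaddress module. The names DOMAIN, ALL_NUMERIC and is_bogus_ipv4 are hypothetical.

```python
import ipaddress
import re

# Atom character class copied from the post; a domain here is period-delimited atoms.
ATOM = r'[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+'
DOMAIN = re.compile(ATOM + r'(\.' + ATOM + r')*')

# Hypothetical helper pattern: a domain made only of digits and periods.
ALL_NUMERIC = re.compile(r'[0-9]+(\.[0-9]+)*')

def is_bogus_ipv4(domain: str) -> bool:
    """True when the domain is all-numeric (so it reads as an IP) but is not a valid IPv4 address."""
    if ALL_NUMERIC.fullmatch(domain) is None:
        return False  # not IP-shaped; leave the decision to the regex
    try:
        ipaddress.IPv4Address(domain)
    except ValueError:
        return True
    return False

# The atom-based pattern accepts both of the odd cases from the post.
print(DOMAIN.fullmatch('999.999.999.999') is not None)                              # True
print(DOMAIN.fullmatch('0.0.0.0.1.1.1.1.abc.xyz.99.99.99.99.999.com') is not None)  # True

# A separate, non-regex check can flag the invalid address.
print(is_bogus_ipv4('999.999.999.999'))  # True
print(is_bogus_ipv4('10.0.0.1'))         # False
print(is_bogus_ipv4('example.com'))      # False
```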

In building this regular expression, I've chosen not to worry about IP addresses as they will automatically be accommodated. To that end, I'm stipulating that the DOMAIN of an email address is a collection of one or more sub-domains (strings of atoms separated by periods). Sub-domains, as limited above, borrow from the LOCAL-PART and look like the following:
[^\x00-\x20\(\)\<\>\[\]\.\\,;:"@]+(\.[^\x00-\x20\(\)\<\>\[\]\.\\,;:"@]+)*
And the full expression is....
^(("[^"\x0D\\]+")|([^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+(\.[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+)*))@([^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+(\.[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+)*)$
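
For anyone who wants to try the full expression, here is a minimal Python sketch that assembles it from the pieces above and runs a few sample addresses. The variable and function names and the sample addresses are my own. I use fullmatch rather than match because in Python a trailing $ also matches just before a final newline.

```python
import re

# Building blocks copied from the post.
QUOTED_STRING = r'"[^"\x0D\\]+"'
ATOM = r'[^\x00-\x20\(\)\<\>\[\]\\,;:\."@]+'
DOT_ATOMS = ATOM + r'(\.' + ATOM + r')*'

# Same structure as the full expression: (quoted-string | dot-atoms) @ dot-atoms
EMAIL = re.compile(r'^((' + QUOTED_STRING + r')|(' + DOT_ATOMS + r'))@(' + DOT_ATOMS + r')$')

def is_email(address: str) -> bool:
    """Return True when the whole address matches the post's restricted RFC 822 pattern."""
    return EMAIL.fullmatch(address) is not None

print(is_email('john.smith@example.com'))      # True
print(is_email('"john smith"@example.com'))    # True: single quoted-string as the local part
print(is_email("o'brien+tag#1@example.com"))   # True: all legal atom characters
print(is_email('john@999.999.999.999'))        # True: invalid IPs pass, as discussed above
print(is_email('john..smith@example.com'))     # False: empty atom
print(is_email('"a"."b"@example.com'))         # False: mixing quoted strings and periods is excluded
print(is_email('john@@example.com'))           # False
```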

Thoughts?

Ralf Thompson
 
You can find all kinds of regular expressions at RegExLib.com

http://www.regexplib.com/Default.aspx

They provide this one for email validating:

^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$
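
To see how this pattern behaves, here is a short Python sketch that runs it against a handful of addresses. The helper name regexlib_accepts and the sample addresses are assumptions made for illustration.

```python
import re

# The RegExLib pattern exactly as quoted above.
REGEXLIB = re.compile(
    r'^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|'
    r'(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$'
)

def regexlib_accepts(address: str) -> bool:
    """Return True when the whole address matches the RegExLib pattern."""
    return REGEXLIB.fullmatch(address) is not None

print(regexlib_accepts('john.smith@example.com'))  # True
print(regexlib_accepts('john@[10.0.0.1]'))         # True: bracketed IP form
print(regexlib_accepts('john+news@example.com'))   # False: '+' is not in the local-part class
print(regexlib_accepts('john@example.museum'))     # False: final label limited to 2-4 letters
print(regexlib_accepts('john@example.com]'))       # True: stray closing bracket is accepted
print(regexlib_accepts('..@example.com'))          # True: periods alone pass as a local part
```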

-- Alan
 
Sorry. I see that you already knew that! http://www.regexplib.com/REDetails.aspx?regexp_id=26

-- Alan
 
Yes, I tried that site. However, the regexp offered doesn't seem to be fully up to the task. I tried posting a version of my comments here on that site, but it was severely truncated and most of my remarks were lost. I didn't bother trying to force the issue, but came here instead.

 