Regex Question

  • Thread starter Thread starter Pete Kane
  • Start date Start date
P

Pete Kane

Hi All, I need to parse a text file which contains lines like the one
shown below, however, one or more of the columns could be blank - the
only constant being the first which always begins with BX - I've cobbled
a program together which uses substrings, but think a Regex would be a
cleaner way ?


BX00027025 AD CW
NO'S ALLOCATED TO KEW TRUNKS ON ARRIVAL 29/03/1994
 
Hello Pete,
Hi All, I need to parse a text file which contains lines like the one
shown below, however, one or more of the columns could be blank - the
only constant being the first which always begins with BX - I've
cobbled a program together which uses substrings, but think a Regex
would be a cleaner way ?

BX00027025 AD CW
NO'S ALLOCATED TO KEW TRUNKS ON ARRIVAL 29/03/1994

Are these fixed length? How do you know when a column is blank?

Regex can be used for these kinds of record parsing, but they're generally
not the best tools. There are tools built specifically for this purpose (Sql
Server Integration Servvices comes to mind, Biztalk can also be used to read
these files programatically). If you want to do it yourself and need a bit
of help, then we need a bit more information about the contents of the file...

Like:
- How are records separated?
- How are columns separated?
- Is there any escaping going on?
- Are Newlines allowed in content?

Kinds regards,
 
Jesse said:
Hello Pete,


Are these fixed length? How do you know when a column is blank?

Regex can be used for these kinds of record parsing, but they're
generally not the best tools. There are tools built specifically for
this purpose (Sql Server Integration Servvices comes to mind, Biztalk
can also be used to read these files programatically). If you want to do
it yourself and need a bit of help, then we need a bit more information
about the contents of the file...

Like: - How are records separated?
- How are columns separated?
- Is there any escaping going on?
- Are Newlines allowed in content?

Kinds regards,
Hi Jesse, thanks for quick reply - the records are separated by newline
- the columns by spaces - no escaping - newlines not allowed - the
sample I gave has unfortunately wrapped in the editor but is an example
of a record with every field present - the column width are :
10,2,40,2,50 and 10 which is why I used substrings - I would just like
to know how to acchieve the same result using a Regex

regards
 
Hello Pete,
Hi Jesse, thanks for quick reply - the records are separated by
newline - the columns by spaces - no escaping - newlines not allowed -
the sample I gave has unfortunately wrapped in the editor but is an
example of a record with every field present - the column width are :
10,2,40,2,50 and 10 which is why I used substrings - I would just like
to know how to acchieve the same result using a Regex

Regex wouldn't be my first choice... But they would work:

private static Regex rx = new Regex("^(.{10}) (.{2}) (.{40}) (.{2}) (.{50})\r?$",
RegexOptions.Compiled | RegexOptions.Multiline);

And then in your parsing routine use:

public void Parse(string value)
{
Match m = rx.Match(value);
while (m.Success)
{
string line = m.Value;
string part1 = m.Groups[0].Value;
string part2 = m.Groups[1].Value;
/// etc ...
m = m.NextMatch();
}
}

Also have a look at the OleDB plaintext driver, which would be a better option
in my opinion.

http://www.codeproject.com/KB/cs/UsingJetForImport.aspx

And if you're importing these values into SQL Server I again suggest you
take a look at SQL Server Integration Services (SSIS).
 
Jesse said:
Hello Pete,
Hi Jesse, thanks for quick reply - the records are separated by
newline - the columns by spaces - no escaping - newlines not allowed -
the sample I gave has unfortunately wrapped in the editor but is an
example of a record with every field present - the column width are :
10,2,40,2,50 and 10 which is why I used substrings - I would just like
to know how to acchieve the same result using a Regex

Regex wouldn't be my first choice... But they would work:

private static Regex rx = new Regex("^(.{10}) (.{2}) (.{40}) (.{2})
(.{50})\r?$", RegexOptions.Compiled | RegexOptions.Multiline);

And then in your parsing routine use:

public void Parse(string value)
{
Match m = rx.Match(value);
while (m.Success)
{
string line = m.Value;
string part1 = m.Groups[0].Value;
string part2 = m.Groups[1].Value;
/// etc ...
m = m.NextMatch();
}
}

Also have a look at the OleDB plaintext driver, which would be a better
option in my opinion.

http://www.codeproject.com/KB/cs/UsingJetForImport.aspx

And if you're importing these values into SQL Server I again suggest you
take a look at SQL Server Integration Services (SSIS).
Tried it Jesse - doesn't work - m.Success always returns false -
m.Groups.Count is = to 1 and has a Value of ""
 
Hello Pete,
Jesse said:
Hello Pete,
Jesse Houwing wrote:

Hello Pete,

Hi All, I need to parse a text file which contains lines like the
one shown below, however, one or more of the columns could be
blank - the only constant being the first which always begins with
BX - I've cobbled a program together which uses substrings, but
think a Regex would be a cleaner way ?

BX00027025 AD CW
NO'S ALLOCATED TO KEW TRUNKS ON ARRIVAL 29/03/1994

Are these fixed length? How do you know when a column is blank?

Regex can be used for these kinds of record parsing, but they're
generally not the best tools. There are tools built specifically
for this purpose (Sql Server Integration Servvices comes to mind,
Biztalk can also be used to read these files programatically). If
you want to do it yourself and need a bit of help, then we need a
bit more information about the contents of the file...

Like: - How are records separated?
- How are columns separated?
- Is there any escaping going on?
- Are Newlines allowed in content?
Kinds regards,
--
Jesse Houwing
jesse.houwing at sogeti.nl
Hi Jesse, thanks for quick reply - the records are separated by
newline - the columns by spaces - no escaping - newlines not allowed
- the sample I gave has unfortunately wrapped in the editor but is
an example of a record with every field present - the column width
are : 10,2,40,2,50 and 10 which is why I used substrings - I would
just like to know how to acchieve the same result using a Regex
Regex wouldn't be my first choice... But they would work:

private static Regex rx = new Regex("^(.{10}) (.{2}) (.{40}) (.{2})
(.{50})\r?$", RegexOptions.Compiled | RegexOptions.Multiline);

And then in your parsing routine use:

public void Parse(string value)
{
Match m = rx.Match(value);
while (m.Success)
{
string line = m.Value;
string part1 = m.Groups[0].Value;
string part2 = m.Groups[1].Value;
/// etc ...
m = m.NextMatch();
}
}
Also have a look at the OleDB plaintext driver, which would be a
better option in my opinion.

http://www.codeproject.com/KB/cs/UsingJetForImport.aspx

And if you're importing these values into SQL Server I again suggest
you take a look at SQL Server Integration Services (SSIS).
Tried it Jesse - doesn't work - m.Success always returns false -
m.Groups.Count is = to 1 and has a Value of ""

Please check if I got the columns right, and I excluded the space from the
column length, but that migth have been a wrong assumption. The regex is
so plain simple that you should be able to solve this yourself without too
much effort.

If it's still not working, please post a sample file ( or email me one) with
at least a few rows in there.
 
Jesse said:
Hello Pete,
Jesse said:
Hello Pete,

Jesse Houwing wrote:

Hello Pete,

Hi All, I need to parse a text file which contains lines like the
one shown below, however, one or more of the columns could be
blank - the only constant being the first which always begins with
BX - I've cobbled a program together which uses substrings, but
think a Regex would be a cleaner way ?

BX00027025 AD CW
NO'S ALLOCATED TO KEW TRUNKS ON ARRIVAL 29/03/1994

Are these fixed length? How do you know when a column is blank?

Regex can be used for these kinds of record parsing, but they're
generally not the best tools. There are tools built specifically
for this purpose (Sql Server Integration Servvices comes to mind,
Biztalk can also be used to read these files programatically). If
you want to do it yourself and need a bit of help, then we need a
bit more information about the contents of the file...

Like: - How are records separated?
- How are columns separated?
- Is there any escaping going on?
- Are Newlines allowed in content?
Kinds regards,
--
Jesse Houwing
jesse.houwing at sogeti.nl
Hi Jesse, thanks for quick reply - the records are separated by
newline - the columns by spaces - no escaping - newlines not allowed
- the sample I gave has unfortunately wrapped in the editor but is
an example of a record with every field present - the column width
are : 10,2,40,2,50 and 10 which is why I used substrings - I would
just like to know how to acchieve the same result using a Regex

Regex wouldn't be my first choice... But they would work:

private static Regex rx = new Regex("^(.{10}) (.{2}) (.{40}) (.{2})
(.{50})\r?$", RegexOptions.Compiled | RegexOptions.Multiline);

And then in your parsing routine use:

public void Parse(string value)
{
Match m = rx.Match(value);
while (m.Success)
{
string line = m.Value;
string part1 = m.Groups[0].Value;
string part2 = m.Groups[1].Value;
/// etc ...
m = m.NextMatch();
}
}
Also have a look at the OleDB plaintext driver, which would be a
better option in my opinion.

http://www.codeproject.com/KB/cs/UsingJetForImport.aspx

And if you're importing these values into SQL Server I again suggest
you take a look at SQL Server Integration Services (SSIS).
Tried it Jesse - doesn't work - m.Success always returns false -
m.Groups.Count is = to 1 and has a Value of ""

Please check if I got the columns right, and I excluded the space from
the column length, but that migth have been a wrong assumption. The
regex is so plain simple that you should be able to solve this yourself
without too much effort.

If it's still not working, please post a sample file ( or email me one)
with at least a few rows in there.
Hi Jesse - I got it to work by modifying the widths of the columns but
agree with you it's not the best way - when you say look at oledb plain
text drive - what advantages would that give ? at the moment I use a
StreamReader and ReadLine() which works fine - thanks again for your help
 
Hello Peter,
Jesse said:
Hello Pete,
Jesse Houwing wrote:

Hello Pete,

Jesse Houwing wrote:

Hello Pete,

Hi All, I need to parse a text file which contains lines like
the one shown below, however, one or more of the columns could
be blank - the only constant being the first which always begins
with BX - I've cobbled a program together which uses substrings,
but think a Regex would be a cleaner way ?

BX00027025 AD
CW NO'S ALLOCATED TO KEW TRUNKS ON ARRIVAL
29/03/1994

Are these fixed length? How do you know when a column is blank?

Regex can be used for these kinds of record parsing, but they're
generally not the best tools. There are tools built specifically
for this purpose (Sql Server Integration Servvices comes to mind,
Biztalk can also be used to read these files programatically). If
you want to do it yourself and need a bit of help, then we need a
bit more information about the contents of the file...

Like: - How are records separated?
- How are columns separated?
- Is there any escaping going on?
- Are Newlines allowed in content?
Kinds regards,
--
Jesse Houwing
jesse.houwing at sogeti.nl
Hi Jesse, thanks for quick reply - the records are separated by
newline - the columns by spaces - no escaping - newlines not
allowed - the sample I gave has unfortunately wrapped in the
editor but is an example of a record with every field present -
the column width are : 10,2,40,2,50 and 10 which is why I used
substrings - I would just like to know how to acchieve the same
result using a Regex

Regex wouldn't be my first choice... But they would work:

private static Regex rx = new Regex("^(.{10}) (.{2}) (.{40}) (.{2})
(.{50})\r?$", RegexOptions.Compiled | RegexOptions.Multiline);

And then in your parsing routine use:

public void Parse(string value)
{
Match m = rx.Match(value);
while (m.Success)
{
string line = m.Value;
string part1 = m.Groups[0].Value;
string part2 = m.Groups[1].Value;
/// etc ...
m = m.NextMatch();
}
}
Also have a look at the OleDB plaintext driver, which would be a
better option in my opinion.
http://www.codeproject.com/KB/cs/UsingJetForImport.aspx

And if you're importing these values into SQL Server I again
suggest you take a look at SQL Server Integration Services (SSIS).

--
Jesse Houwing
jesse.houwing at sogeti.nl
Tried it Jesse - doesn't work - m.Success always returns false -
m.Groups.Count is = to 1 and has a Value of ""
Please check if I got the columns right, and I excluded the space
from the column length, but that migth have been a wrong assumption.
The regex is so plain simple that you should be able to solve this
yourself without too much effort.

If it's still not working, please post a sample file ( or email me
one) with at least a few rows in there.
Hi Jesse - I got it to work by modifying the widths of the columns but
agree with you it's not the best way - when you say look at oledb
plain text drive - what advantages would that give ? at the moment I
use a StreamReader and ReadLine() which works fine - thanks again for
your help

It allows you to open the file in a DataSet, which can very easily be databound
or uploaded to another ADO.NET source. You can also quary a Dataset using
Link to Datasets.

The best part it, that you don't need to write the parsing code yourself,
and that it's easily adaptable to new formats.
 
Back
Top