Parsing space delimited records

  • Thread starter M1iS

M1iS

I’m trying to parse out Amazon S3 server logs which are space delimited.
However date fields are in the following form:

[28/Oct/2008:21:44:21 +0000]

When I try to use the following code to split the record on the spaces, it
also splits the date field:

string[] fields = record.Split(' ');

What can I do to get around this?

Scott
 
Hi Scott,

I personally would use Regular Expressions to split the words in a smart
way. Below is a sample console application to demonstrate it. The regular
expression \[.*\]\s*|.+ means that it can select from two alternatives:

a) Text wrapped inside [ and ]
b) Any other text (your actual server log)

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Alternative 1: bracketed text plus trailing whitespace.
        // Alternative 2: everything else on the line.
        string expr = @"\[.*\]\s*|.+";
        string line = "[28/Oct/2008:21:44:21 +0000] Test with p~nctuat!ion word goes here!";

        Regex regex = new Regex(expr);

        foreach (Match m in regex.Matches(line))
        {
            string value = m.Value.Trim();

            if (value.StartsWith("[") && value.EndsWith("]"))
            {
                // This is part of the timestamp
                Console.WriteLine("TEST: time = " + value);
            }
            else
            {
                // This is an actual slice of the result
                Console.WriteLine("TEST: word = " + value);
            }
        }

        // Keep the console window open until a key is pressed
        Console.Read();
    }
}
 
I was hoping to avoid taking the time to create a regular expression as there
are 17 fields per S3 record. It took me a while but here is what I ended up
with:

(.*?)(\s+)(.*?)(\s+)(\[.*?\])(\s+)((?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))(?![\d])(\s+)(.*?)(\s+)(.*?)(\s+)(.*?)(\s+)(.*)(\s+)(".*?")(\s+)(.*?)(\s+)(.*?)(\s+)(\d+|-)(\s+)(\d+|-)(\s+)(\d+|-)(\s+)(\d+|-)(\s+)(".*?")(\s+)(".*?")

Yuck, I'd rather be doing about a million other things, but oh well,
problem solved.



 
I am sure there is a *more* elegant solution to the problem. Can you post a
sample of the log output? Also, do you want to get the individual words out
of the log?

E.g. if the log line is
[28/Oct/2008:21:44:21 +0000] Test with p~nctuat!ion word goes here!
would you like to have the timestamp, "Test", "with", etc as separate
matches? If so, you could split the text using string.Split() once you have
the actual log text (see my previous code example for the 'log text' case).

--
Stanimir Stoyanov
http://stoyanoff.info

 
Unless you're somehow married to the format, just drop the time zone:

string[] fields = record.Replace(' +0000','',Split(' ');
 
Er, make that:

string[] fields = record.Replace(" +0000", "").Split(' ');

Mark S. Milley said:
Unless you're somehow married to the format, just drop the time zone:

string[] fields = record.Replace(' +0000','',Split(' ');


 
Hello Stanimir,

If you do a Regex.Match with the following regex:

^((\[(?<result>[^\]]*)\]|(?<result>[^ ]*))([ ]|$))*

Should get you a Match object with 1 named group and 17 captures in there.
Exactly what you need...

You should also be able to use the Log parser class that the IIS team once
published... but I cannot find a link at the moment...

Jesse
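
For anyone who wants to try the "one match, many captures" idea above, here is a minimal sketch. It does not use the exact pattern from the post: the pattern below is a tightened variant (an assumption on my part) that requires at least one character per unbracketed field, so the loop cannot spin on empty matches. The class and method names are made up for illustration. Note that it only treats [ ... ] as a unit, so quoted fields that contain spaces would still be split.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CaptureSplitter
{
    // One anchored match over the whole record. Every field is captured into
    // the same named group, so the group ends up holding one capture per field.
    private static readonly Regex RecordPattern = new Regex(
        @"^(?:(?:\[(?<result>[^\]]*)\]|(?<result>[^ ]+))(?: +|$))*$");

    public static List<string> Split(string record)
    {
        var fields = new List<string>();
        Match match = RecordPattern.Match(record);
        if (!match.Success)
        {
            return fields; // e.g. a record with leading spaces
        }

        // Captures are stored in the order they were matched.
        foreach (Capture c in match.Groups["result"].Captures)
        {
            fields.Add(c.Value);
        }
        return fields;
    }
}

public static class CaptureDemo
{
    public static void Main()
    {
        List<string> fields =
            CaptureSplitter.Split("testBucket [28/Oct/2008:21:44:21 +0000] 127.0.0.1");

        Console.WriteLine(fields.Count); // 3
        Console.WriteLine(fields[1]);    // 28/Oct/2008:21:44:21 +0000 (brackets stripped)
    }
}
```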
 
Below is an example of what is in a log file. I'm just trying to read the
logs and dump the fields into a database.

4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:21 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 AAE9C2CCFFE5E6DB REST.GET.ACL - "GET /?acl HTTP/1.1" 200 - 556 - 488 - "-" "-"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:24 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 66FB31B05AFA84E9 REST.GET.LOGGING_STATUS - "GET /?logging HTTP/1.1" 200 - 244 - 171 - "-" "-"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:56 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 40AC4747CFF7ACFD REST.GET.BUCKET - "GET / HTTP/1.1" 200 - 1298 - 15 12 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:44:56 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 5938B6855868E040 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1298 - 642 473 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:33 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 16F565F75362B5A8 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1298 - 508 293 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:33 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 D61C9201C46617CF REST.PUT.OBJECT testFile.zip "PUT /testFile.zip HTTP/1.1" 200 - - 17428 334 11 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:34 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 B2FEB30917A1F050 REST.GET.BUCKET - "GET / HTTP/1.1" 200 - 1634 - 181 15 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:45:34 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 B41FCF38CD590562 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1634 - 15 13 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:11 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 C42BF5C887E61F18 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1634 - 476 299 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:12 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 A590228971F16081 REST.PUT.OBJECT testFile.zip "PUT /testFile.zip HTTP/1.1" 200 - - 1487163 20298 48 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:32 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 6528418F2CCABB59 REST.HEAD.BUCKET - "HEAD / HTTP/1.1" 200 - 1969 - 312 309 "-" "Amazon S3 CSharp Library"
4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 testBucket [28/Oct/2008:21:46:33 +0000] 127.0.0.1 4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887 EE65B98BD633E32C REST.GET.BUCKET - "GET / HTTP/1.1" 200 - 1969 - 16 14 "-" "Amazon S3 CSharp Library"
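
Since the goal is to load these fields into a database, the bracketed date will probably need to become a real date value at some point. Below is a minimal sketch of one way to do that, assuming the field always looks like [dd/MMM/yyyy:HH:mm:ss +hhmm] as in the sample above. The class and method names are hypothetical, and the offset is parsed by hand to keep the format handling explicit.

```csharp
using System;
using System.Globalization;

public static class S3LogTimestamp
{
    // Assumed layout: "[28/Oct/2008:21:44:21 +0000]" (date/time, space, signed hhmm offset).
    public static DateTimeOffset Parse(string field)
    {
        string text = field.Trim().TrimStart('[').TrimEnd(']');
        string[] parts = text.Split(' ');
        if (parts.Length != 2 || parts[1].Length != 5)
        {
            throw new FormatException("Unexpected timestamp: " + field);
        }

        // Date and time portion, parsed culture-independently.
        DateTime clock = DateTime.ParseExact(
            parts[0], "dd/MMM/yyyy:HH:mm:ss", CultureInfo.InvariantCulture);

        // Offset portion: sign, two-digit hours, two-digit minutes.
        int sign = parts[1][0] == '-' ? -1 : 1;
        int hours = int.Parse(parts[1].Substring(1, 2), CultureInfo.InvariantCulture);
        int minutes = int.Parse(parts[1].Substring(3, 2), CultureInfo.InvariantCulture);
        TimeSpan offset = new TimeSpan(sign * hours, sign * minutes, 0);

        return new DateTimeOffset(clock, offset);
    }
}

public static class TimestampDemo
{
    public static void Main()
    {
        DateTimeOffset when = S3LogTimestamp.Parse("[28/Oct/2008:21:44:21 +0000]");

        // Prints: 2008-10-28 21:44:21
        Console.WriteLine(when.UtcDateTime.ToString(
            "yyyy-MM-dd HH:mm:ss", CultureInfo.InvariantCulture));
    }
}
```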


 
One of the following regular expressions might fit better:

\[.*\]|\"[^\"]*\"|[^\s-]+

or

\[.*\]|\"[^\"]*\"|[^\s]+

The difference is that the first omits single dashes as found on some rows
(in between figures), e.g.
200 - 1634 - 181 15

--
Stanimir Stoyanov
http://stoyanoff.info
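
As a rough illustration of how the second expression above could be applied to a whole record, here is a small sketch run against the first sample line from the log. One deliberate deviation, which is my assumption rather than part of the original suggestion: the bracket alternative is written as \[[^\]]*\] instead of \[.*\] so that it cannot run past the first closing bracket if a later field happens to contain one. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class S3LogTokenizer
{
    // Three alternatives, tried in order at each position:
    //   \[[^\]]*\]   a bracketed timestamp
    //   "[^"]*"      a double-quoted field (may contain spaces)
    //   \S+          any other run of non-whitespace characters
    private static readonly Regex FieldPattern =
        new Regex(@"\[[^\]]*\]|""[^""]*""|\S+");

    public static List<string> Tokenize(string record)
    {
        var fields = new List<string>();
        foreach (Match m in FieldPattern.Matches(record))
        {
            fields.Add(m.Value);
        }
        return fields;
    }
}

public static class TokenizerDemo
{
    public static void Main()
    {
        const string owner =
            "4c54d3704f3a82592af823f518a6443186e92168fe07cdcdb20cfc2a21655887";
        string record = owner + " testBucket [28/Oct/2008:21:44:21 +0000] 127.0.0.1 " + owner +
            " AAE9C2CCFFE5E6DB REST.GET.ACL - \"GET /?acl HTTP/1.1\" 200 - 556 - 488 - \"-\" \"-\"";

        List<string> fields = S3LogTokenizer.Tokenize(record);

        Console.WriteLine(fields.Count); // 17
        Console.WriteLine(fields[2]);    // [28/Oct/2008:21:44:21 +0000]
        Console.WriteLine(fields[8]);    // "GET /?acl HTTP/1.1"
    }
}
```

The brackets and quotes stay on the matched values, so they would need trimming before the fields go into a database.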

 