string extraction


tommaso.gastaldi

I have a file containing some commands in free format. Each
command is terminated with ";". The ";" can also appear within the
command, but only enclosed within string delimiters (' or "). Example:

INSERT INTO nation (code, name) VALUES(700448768,
"za; sdfhsd''"sdfa");

INSERT INTO nation (code, name)

VALUES(701464576, 'msd; vasdvas ""hjh"" u');


My question is: what is the best way to extract these commands, one at a
time?
The result should be (2 commands):
INSERT INTO nation (code, name) VALUES(700448768, "za; sdfhsd''"sdfa");
INSERT INTO nation (code, name) VALUES(701464576, 'msd; vasdvas ""hjh"" u');
I was thinking about regex, but it may be tricky to find the right one.
Any ideas?

-tom
 
You should take a look at the parsing algorithms from compiler
theory.

There's a common function found in many programming languages, called
Split or Tokenize, that lets you specify a delimiting character and
returns some sort of collection of strings. Something like:
Array arrayCommands = Split(stringCommands, ";")

And then you would do something like:

foreach (String command in arrayCommands)
{
    if (command ends with a ")   // then there was a quoted ';'
    {
        // so we add the quoted ';' back in and combine again with the
        // next command, which is really part of this command and
        // shouldn't have been split up
        command = command + ';' + (next command in array)
        delete next command in array
    }
}

This is just pseudocode, of course. (next command in array) could be
found by getting the index of the current command, adding one, and
indexing into the array. You'll need to find out what VB.NET's tokenize
or split function is and how it works. I'm sure there is something like
that.
 
It might be worth trying Regex, but your delimiters don't seem to
have any symmetry.

In this line, for instance:
INSERT INTO nation (code, name) VALUES(700448768, "za; sdfhsd''"sdfa");

there are 3 double quotes, not 4 as one would expect. You seem to be
opening with a double quote and closing with a single quote. So, I
couldn't get far with constructing a Regex.
 
Hi snozz,

thank you very much for your advice: I will look for these
functions.

I am not clear about the logic you kindly suggest,
and I have a question.

When I wrote:

<< The ";" can also appear within the
command, but only enclosed within string delimiters (' or ") >>

I meant something like for instance

1. " ;; some string containing; semicolon; within "

not necessarily something like:

2. " ";"';' some string "";"" containing semicolon ... "

I have the impression that you are assuming situation 2
and not 1. Is that so, or am I missing something?

Another point is that the file can be several gigabytes, so I need
some kind of "buffered" logic. But I guess I could read a bunch of
lines at a time.

-tom
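
For a file of several gigabytes, one option is to read fixed-size chunks and carry the scanner state from one chunk to the next. The following is a minimal Python sketch of that idea, not a tested production routine. The names `iter_commands` and `chunk_size` are made up for illustration, and it assumes the quoting rules described in this thread: strings are delimited by ' or ", and a delimiter inside a string is doubled. The same loop should port to VB.NET fairly directly.

```python
import io
from typing import Iterator, TextIO


def iter_commands(stream: TextIO, chunk_size: int = 64 * 1024) -> Iterator[str]:
    """Yield one ';'-terminated command at a time from a text stream."""
    quote = None   # delimiter of the string we are inside, or None
    parts = []     # pieces of the current command collected so far
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        start = 0  # where the not-yet-collected text of this chunk begins
        for i, ch in enumerate(chunk):
            if quote is not None:
                # Inside a string only the matching delimiter matters.
                # A doubled delimiter ('' or "") closes and immediately
                # reopens the string, so it needs no special case.
                if ch == quote:
                    quote = None
            elif ch in "'\"":
                quote = ch
            elif ch == ";":
                parts.append(chunk[start:i + 1])
                yield "".join(parts).strip()
                parts = []
                start = i + 1
        parts.append(chunk[start:])
    tail = "".join(parts).strip()
    if tail:
        yield tail  # trailing text without a terminating ';'


if __name__ == "__main__":
    demo = io.StringIO("a 'x;y';\nb \"p;q\";")
    for command in iter_commands(demo, chunk_size=4):
        print(command)
```

Only the current command is held in memory, so the file size does not matter as long as each command is reasonably small. Line breaks inside a command are kept as they are; collapsing them into single spaces would be an extra step.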



 
Hi Cerebrus,

what I mean is that strings follow exactly the same rules as in VB.NET
or SQL.
The string

"za; sdfhsd''"sdfa"

in the command you refer to is ok because the string content:
<za; sdfhsd''"sdfa>

is meant to be rendered as: <za; sdfhsd''sdfa>
that is, a doubled double quote ("") within the string is rendered
as a single double quote. Just the same as in VB.NET.

You are however right about example 2
it should have been:

2. " "";""';' some string "";"" containing semicolon ... "

Yes, I have often tried to use regex, but it's complicated to
deal even with simple cases of quotes enclosed within quotes.
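
For well-formed input, one regex that seems to cope with this treats a command as a run of ordinary characters and complete quoted strings, followed by a semicolon. A doubled delimiter then simply looks like two adjacent strings. This is a minimal Python sketch under that assumption; `COMMAND` and `split_with_regex` are hypothetical names, and with unbalanced quotes the pattern can produce bogus matches.

```python
import re

COMMAND = re.compile(
    r"""
    (?:                 # any number of:
        [^;'"]          #   an ordinary character (not ; ' ")
      | '[^']*'         #   a complete '...' string
      | "[^"]*"         #   a complete "..." string
    )*
    ;                   # the terminating semicolon
    """,
    re.VERBOSE,
)


def split_with_regex(text: str) -> list[str]:
    """Return every ';'-terminated command found in text."""
    return [m.group(0).strip() for m in COMMAND.finditer(text)]


if __name__ == "__main__":
    for command in split_with_regex("a 'it''s; fine';b;"):
        print(command)
```

Text after the last semicolon is ignored by this sketch, and the whole input has to be in memory.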

------------------------

Put simply, my question is: how do I extract commands of the type

myCommand ;

each command ends where a ; (not enclosed in a string) is found.

The commands are placed freely within a very large file. myCommand can
contain strings, which may themselves contain the semicolon character.
Strings can be delimited by either " or ' and can contain the delimiter
character. In that case the delimiter is doubled (as in VB.NET, SQL, ...)
and is rendered as a single character.

-tom
 
Your best option is probably to use the .IndexOf method on the whole
character string, looking for " and ;. Flags can tell you when to skip
a ; enclosed in quotes.
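
A rough Python sketch of this index-based scan follows, using a regex search to find the next ; ' or " and `str.find` to jump to the matching closing delimiter. The function name `split_with_find` is made up for illustration, and the sketch assumes the whole text fits in memory. A doubled delimiter inside a string needs no special handling here, because it just closes the string and opens a new one right away.

```python
import re

SPECIAL = re.compile(r"[;'\"]")  # the only characters that affect splitting


def split_with_find(text: str) -> list[str]:
    """Split text into commands at each ';' that is outside a string."""
    commands = []
    start = 0  # where the current command begins
    pos = 0    # where the next search begins
    while True:
        match = SPECIAL.search(text, pos)
        if match is None:
            break  # no more semicolons or quotes
        i = match.start()
        if text[i] == ";":
            commands.append(text[start:i + 1].strip())
            start = pos = i + 1
        else:
            # Opening delimiter: jump straight to the matching closing one,
            # skipping any ';' in between.
            close = text.find(text[i], i + 1)
            if close == -1:
                break  # unterminated string: give up splitting
            pos = close + 1
    tail = text[start:].strip()
    if tail:
        commands.append(tail)  # leftover text without a terminating ';'
    return commands


if __name__ == "__main__":
    for command in split_with_find("a 'x;y'; b \"p;q\" ;"):
        print(command)
```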
 
Thanks Dennis,

Actually, I am not completely persuaded it can be done that way in
general, as you could have something like:

.... (" my preferred keywords ""work; work ; work"" ") ;

Hmm, I am afraid that all chars must be parsed, so that one can set
flags to distinguish when a ; occurs within string delimiters and
when it is, instead, a command separator ...

-tom

 
Since compilers already deal with this quite efficiently, you
really will find solid, practical algorithms if you look at some of the
theory that addresses programming languages, syntax, and parsing.

You might want to try a search for "recursive descent parser".

I think your type of parsing falls under "lexical analysis", although
it might be "syntax analysis".
 