Jonathan said:
Peter said:
At the very least, it would be helpful if you could elaborate on these
stated requirements that you be able to "find the first occurrence that
is a given character(s)" and "to find the first occurrence that's not
a given character(s)".
Forgive me, as I'm not understanding how these are unclear. If it helps,
here are some examples (of many possible examples):
[Function 1] Find the index of the next 'a' character. (case-insensitive
IndexOf)
[Function 2] Find the index of the next 'a', 'b', or 'c' character.
(case-insensitive IndexOfAny)
[Function 3] Find the index of the next character that is not 'a'.
[Function 4] Find the index of the next character that is not 'a', 'b',
or 'c'.
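To make those concrete, here is roughly what I picture (just a sketch; the
method names and the simple lower-casing comparison are only for
illustration, not a final design):

    using System;

    static class ScanSketch
    {
        // [Function 1] index of the next occurrence of c, ignoring case
        public static int IndexOfIgnoreCase(string s, int start, char c)
        {
            char target = char.ToLowerInvariant(c);
            for (int i = start; i < s.Length; i++)
                if (char.ToLowerInvariant(s[i]) == target)
                    return i;
            return -1;
        }

        // [Function 2] index of the next occurrence of any of chars, ignoring case
        public static int IndexOfAnyIgnoreCase(string s, int start, params char[] chars)
        {
            for (int i = start; i < s.Length; i++)
            {
                char current = char.ToLowerInvariant(s[i]);
                foreach (char c in chars)
                    if (current == char.ToLowerInvariant(c))
                        return i;
            }
            return -1;
        }

        // [Function 3] index of the next character that is not c
        public static int IndexOfNot(string s, int start, char c)
        {
            for (int i = start; i < s.Length; i++)
                if (s[i] != c)
                    return i;
            return -1;
        }

        // [Function 4] index of the next character that is none of chars
        public static int IndexOfNotAny(string s, int start, params char[] chars)
        {
            for (int i = start; i < s.Length; i++)
                if (Array.IndexOf(chars, s[i]) < 0)
                    return i;
            return -1;
        }
    }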
It's not that the functions you're describing are unclear themselves.
What's unclear is the source of the requirement that those functions be
implemented at all.
I thought I was very specific that I'm working on a generic class right
now. It makes no requirements on what is being parsed. In my current
project, I'll be using it to parse HTML, but it could just as easily be
used to parse source code or something else entirely.
Unless you are a grad student, I think you might want to rethink this
goal. What you're trying to do is a highly non-trivial problem, and that
is in fact why people write specific parsers for specific tasks.
HTML is relatively simple to parse. If what you need to parse right now
is just HTML, you should probably focus on that. It will take a fraction
of the time to implement an HTML parser that it would take to implement
something truly general purpose. (Or you could even just use one of the
many HTML parsers already available.)
In fact, a truly general purpose parser using current technology will
essentially be a number of specific kinds of parsers in one. The
specific parsing techniques necessary depend so much on the exact
grammar of whatever you're parsing that, AFAIK, no one has come up with
a general-purpose way of describing grammars such that a single parsing
implementation can take care of any arbitrary grammar.
Now, from your general description, it seems like maybe you're not
really even at the point where you're trying to parse the input.
Instead, it sounds like you're dealing with the lexical analysis part.
And that's generally a simpler thing to generalize, I think.
But even there, you need to know what kind of parser you're going to
feed with the lexical output. I'm not aware of many parsers that take
as input individual characters (*), and for tokenizing input, the kind
of repeated scanning for specific characters that you describe is not
something I've ever seen.
(*) (an obvious exception being those parsers based on finite state
machines; and these can handle case-insensitivity just fine, as long as
you can represent your token choices in a way that can be translated
into the finite state machine; e.g. provide all possible variations, or
at least some subset that can be used to automatically generate all
possible variations, so that the FSM can be built accordingly).
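As a toy illustration of that footnote (only a sketch; a real FSM-based
lexer would build such transitions automatically from the token
definitions):

    // Hand-written state machine that recognizes the token "if"
    // regardless of case, by giving each transition both case variants
    // of the expected character.
    static bool StartsWithIfIgnoreCase(string s, int start)
    {
        int state = 0;
        for (int i = start; i < s.Length; i++)
        {
            char c = s[i];
            if (state == 0 && (c == 'i' || c == 'I'))
                state = 1;                      // saw 'i'/'I'
            else if (state == 1 && (c == 'f' || c == 'F'))
                return true;                    // accepting state: saw "if"
            else
                return false;                   // no valid transition
        }
        return false;
    }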
If you have some specific reference you're following that has suggested
this particular implementation, where you are scanning repeatedly
character-by-character, it would be helpful if you could provide a link
or other pointer to that reference so we can understand what you're doing.
On the other hand, if this technique is something you've invented on
your own, I would strongly suggest you research existing lexical
analysis and parsing technology (a field that's been around practically
as long as computers have) before trying to reinvent the wheel. It
would probably involve a lot fewer headaches.
Could anyone think of a better approach?
Not off the top of my head. [...]
Actually, maybe I can. How long are the input strings? And how many
times do you need to do this kind of search for each input string?
The strings are potentially very long. Many scans would be done, but
they would be done sequentially. That is, scanning would start from the
beginning of the string and, once a scan finds what it's looking for,
the next scan would continue from that location. One potential use (of
many) of this class would be a programming-language parser that needs to
identify and extract each token from the source code.
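Just to give a rough idea of the shape I have in mind, something along
these lines (all the names here are placeholders, not a final design):

    using System;

    // Each scan starts at the current position and, when it finds what
    // it's looking for, leaves the position there so the next scan
    // continues from that point.
    class SequentialScanner
    {
        private readonly string text;
        public int Position { get; private set; }

        public SequentialScanner(string text) { this.text = text; }

        public bool EndOfText => Position >= text.Length;

        // Advance Position to the next character satisfying the predicate;
        // returns false if no such character exists before the end.
        public bool MoveTo(Predicate<char> match)
        {
            while (Position < text.Length && !match(text[Position]))
                Position++;
            return Position < text.Length;
        }
    }

    // e.g. scanner.MoveTo(c => c == '<');   // find the next '<'
    //      scanner.MoveTo(c => c != ' ');   // skip a run of spaces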
There is definitely no need for that kind of scanning for parsing
programming languages. Programming languages have clear rules for
delimiting tokens, and generating the tokens is a simple affair of just
scanning through a string, breaking it up as needed. The String.Split()
method would be nearly all you need, if it weren't for the fact that
most programming languages allow you to omit spaces between tokens where
there's no ambiguity (e.g. in C, you can't write "inti" and have the
lexer figure out that the tokens are "int" and "i", but you can write
"int i,j" and it will know your tokens are "int", "i", ",", and "j"; in
fact, some implementations might just use the "," as a delimiter,
depending on how the rest of the analysis is handled).
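To illustrate the kind of single pass I mean (greatly simplified, and
only a sketch, not a real C lexer):

    using System.Collections.Generic;

    // Whitespace separates tokens, identifier characters are grouped
    // together, and anything else becomes a one-character token, so
    // "int i,j" comes out as "int", "i", ",", "j".
    static List<string> Tokenize(string source)
    {
        var tokens = new List<string>();
        int i = 0;
        while (i < source.Length)
        {
            if (char.IsWhiteSpace(source[i]))
            {
                i++;                                   // skip separators
            }
            else if (char.IsLetterOrDigit(source[i]) || source[i] == '_')
            {
                int start = i;
                while (i < source.Length &&
                       (char.IsLetterOrDigit(source[i]) || source[i] == '_'))
                    i++;
                tokens.Add(source.Substring(start, i - start));
            }
            else
            {
                tokens.Add(source[i].ToString());      // punctuation as its own token
                i++;
            }
        }
        return tokens;
    }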
I'm not sure if I fully understand this. As pointed out, the string is
always scanned from the "current position" to the end of the file, and
the current position advances as each token is identified. At any rate,
how would you use this to solve the case-insensitive issue? If you find
the index of an 'A' and then later find the index of an 'a', how would
you use those indices to locate 'A' ignoring case?
Your index would treat all characters that are equivalent ignoring case
as the same indexed entity. So your index would have a single entry,
corresponding to both 'A' and 'a'. Of course, this means you have to
have some rules for mapping a given character to a specific entity.
Again, if you're willing to make some restrictions on the input – for
example, being case-insensitive only for characters in the ASCII range –
then this mapping is trivial. Otherwise, you need some way of
representing the mapping, and of converting individual characters
according to that representation (e.g. a hashtable containing every
Unicode character that has case, mapping each one to a single value
shared by all of its case variants).
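As a rough sketch of that idea (using plain lower-casing as the mapping
rule, just for illustration; full Unicode case folding would need a more
careful table):

    using System.Collections.Generic;

    // One entry per case-folded character, holding every position at
    // which any of its case variants occurs.
    static Dictionary<char, List<int>> BuildIndex(string text)
    {
        var index = new Dictionary<char, List<int>>();
        for (int i = 0; i < text.Length; i++)
        {
            char key = char.ToLowerInvariant(text[i]);   // 'A' and 'a' share a key
            if (!index.TryGetValue(key, out var positions))
                index[key] = positions = new List<int>();
            positions.Add(i);
        }
        return index;
    }

    // Finding the first 'a' (ignoring case) at or after some position is
    // then a matter of looking up the list under 'a' and taking its first
    // entry >= that position (a binary search, since the list is sorted).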
Pete