Anyone recognise this coding scheme?

  • Thread starter: Rob

Rob

I'm trying to import a CSV file exported from another application but it
uses some encoding scheme I've not seen before:

19/02/2007,106478,Bob Elder,Sinking Ship,Macclesfield,Cheshire,8,"Mike\'s
\Rude Awakening\"""""

The \' sequence is a single quote but that \Rude Awakening\"""" is a little
strange. I think it means take Rude Awakening and prefix it by the next
character after the final "\" but that's a guess.

Anyone recognise this encoding scheme?

Thanks, Rob.

PS. I've asked if we can find out from the original application authors but
they've not responded yet.
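
For reference, here is a rough check of what a standard CSV parser makes of the sample row. It is a sketch in Python using the standard-library csv module with its default dialect (double quotes escaped by doubling, no backslash escapes). The name SAMPLE_ROW is mine, and it assumes "Mike\'s" and "\Rude" are separated by a space rather than a real line break.

```python
import csv

# The sample row from the post. Assumption: a space, not a line break,
# sits between "Mike\'s" and "\Rude".
SAMPLE_ROW = (
    "19/02/2007,106478,Bob Elder,Sinking Ship,Macclesfield,Cheshire,8,"
    r'''"Mike\'s \Rude Awakening\"""""'''
)

# Default dialect: comma delimiter, double-quote quoting, "" read as one ".
fields = next(csv.reader([SAMPLE_ROW]))
last_field = fields[-1]

print(len(fields))
print(last_field)
```

The row parses without error into eight fields, but the backslashes are kept as ordinary characters and the trailing quotes collapse into two literal quotes, so a stock parser alone does not recover the intended text.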
 
Is there, in fact, a line break between "Mike\'s and \Rude Awakening\"""""
or is it, in fact, "Mike\'s \Rude Awakening\"""""

It might sound like a moot point but it could have an impact on how you
interpret it.

I don't think that it is an 'encoding scheme' per se, rather, it looks to
me like an 'escaping scheme' of some description.

In some escaping schemes a \ followed by another character acts as a
signal.

For example, in C# the escape sequence \r means a carriage return and the
escape sequence \n means a line feed. Often you see them together as a
carriage return/line feed pair, as in \r\n.
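
The same two escapes exist in many languages. As a small illustration in Python (not C#, so treat it as an analogy only), the snippet below shows that "\r\n" in source code is two characters with code points 13 and 10, while a raw string keeps the backslashes as text.

```python
# "\r\n" in an ordinary string literal is two characters: CR then LF.
crlf = "\r\n"
code_points = [ord(ch) for ch in crlf]

# In a raw string the backslashes are kept, so the same text is four characters.
raw = r"\r\n"

print(code_points, len(crlf), len(raw))
```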

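Another example is where you are building a SQL string and you want an
apostrophe to be preserved in a name rather than being treated as a string
delimiter. For this one would double the apostrophe, " ...'O''Brien'...",
in VB.Net and C# alike, because the doubling is a rule of SQL and not of the
host language. In C# the sequence \' is simply an apostrophe, so
" ...'O\'Brien' ..." would put a bare apostrophe into the SQL.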
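
The doubled-apostrophe form can be checked against a real SQL engine. This is a minimal sketch in Python with the standard-library sqlite3 module and an in-memory database, not the VB.Net or C# code discussed here. It also shows parameter binding, which avoids hand-escaping altogether.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# SQL string literal: an apostrophe inside the literal is written twice.
doubled = conn.execute("SELECT 'O''Brien'").fetchone()[0]

# Parameter binding: the value travels separately from the SQL text,
# so no escaping is needed in the statement itself.
bound = conn.execute("SELECT ?", ("O'Brien",)).fetchone()[0]

conn.close()
print(doubled, bound)
```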

Again in C#, the sequence \" serves to embed a quote in a string to give
the same effect as specifying the quote character twice in VB.Net.

In the example data you have given, only the \' makes any sense, and so I
suspect that the escaping scheme is bespoke and only the author will be
able to tell you how to interpret the escaped characters.
 
Stephany Young said:
Is there, in fact, a line break between "Mike\'s and \Rude Awakening\"""""
or is it, in fact, "Mike\'s \Rude Awakening\"""""

HAHAHAHA... RFC 4180 specifies breaks using the Augmented Backus-Naur Form.
If there were a line break there you'd see it.

In the example data you have given, only the \' makes any sense,
LMAO

and so I
suspect that the escaping scheme is bespoke and only the author will be
able to tell you how to interpret the escaped characters.

LMAO - you're blubbering and babbling incoherently just so you can post and
gain the power of the illusion of puffing up your horribly sunken chest, you
less than witless cretin.

RFC 4180 explicitly states that literal double quotes that are not delimiters
must be escaped so that they can be distinguished from delimiters that are
also double quotes.
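
For comparison, RFC 4180 escapes a literal double quote by writing it twice inside a quoted field. The sketch below uses Python's standard-library csv module, whose default dialect follows that convention. The sample text is made up for illustration.

```python
import csv
import io

buffer = io.StringIO()
csv.writer(buffer).writerow(['He said "hi"', "plain"])
encoded = buffer.getvalue()

# Reading the line back restores the original field values.
decoded = next(csv.reader(io.StringIO(encoded)))

print(repr(encoded))
print(decoded)
```

Only the field that contains a quote is wrapped in delimiters, and each inner quote appears doubled. No backslash is involved.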
 
I don't think that it is an 'encoding scheme' per se, rather, it looks to
me like an 'escaping scheme' of some description.

I think you are correct and actually I think the encoding is broken. I've
now got a way to reach the developers. Consider this string:

Taylor's "Old Head"

The encoding should replace single quote with \' and double quote with \""
which makes sense.

However, what it's doing is shuffling the double quotes to the end so you
get:

Taylor\'s \Old Head\""""

It should be:

Taylor\'s \""Old Head\""

I think it's a bug...
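
To make the two behaviours concrete, here is a sketch in Python. The function names are hypothetical, and encode_observed is only a guess at a mechanism that reproduces the output seen in the file. It is not the exporting application's actual code.

```python
def encode_expected(text: str) -> str:
    # Intended rule: ' becomes \' and " becomes \"" in place.
    return text.replace("'", "\\'").replace('"', '\\""')


def encode_observed(text: str) -> str:
    # Guessed model of the bug: the backslash is written in place,
    # but the doubled quotes are deferred to the end of the field.
    out = []
    deferred = []
    for ch in text:
        if ch == "'":
            out.append("\\'")
        elif ch == '"':
            out.append("\\")
            deferred.append('""')
        else:
            out.append(ch)
    return "".join(out) + "".join(deferred)


sample = 'Taylor\'s "Old Head"'
expected = encode_expected(sample)
observed = encode_observed(sample)

print(expected)
print(observed)
```

If this model is right, the number of trailing quotes is twice the number of quotes in the original text, and each stray backslash marks where one of them belonged.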

Rob.
 
That flame was totally uncalled for.

If you think that every CSV file in the world follows RFC 4180 you are sadly
mistaken.
 
HAHAHAHA... RFC 4180 specifies breaks using the Augmented Backus-Naur Form.
If there were a line break there you'd see it.

Re-reading your reply, could you explain what you mean by the above comment?

Because either you don't know what BNF is used for, or you were unable to
adequately communicate your meaning.

BNF is a human readable language used to present the grammar of a file
format. Computers don't use it.

And in any case, according to http://www.rfc-editor.org/rfc/rfc4180.txt,
RFC 4180 specifies that line breaks occurring inside an "escaped" field are
represented by ASCII &H0A or &H0D, which last time I looked were
non-printing. So you wouldn't see them.

You, sir, appear to have no idea what you are talking about.
 
SurturZ said:
That flame was totally uncalled for.

If you think that every CSV file in the world follows RFC 4180 you are sadly
mistaken.

Please show where I echoed any thoughts of that kind. Failure to put up will
be taken as proof that you attempted to create an argument based on a wholly
imaginary event that never took place.

So, how long have you suffered delusions?
 
SurturZ said:
Re-reading your reply, could you explain what you mean by the above comment?

Because either you don't know what BNF is used for, or you were unable to
adequately communicate your meaning.

I see. One extremely salient point seems to have escaped your meagre
attention span. If it is the case that you, your very self, knew what BNF is
for then you would not need to offer a choice between those two options. You
would be able to assert one or the other as empirical fact.
BNF is<BITCHSLAP>

If there were a line break in the text then the CSV should contain a text
representation of the line break at the break point.

HTH

PS: Logic is obviously not your strong suit, antbrain. Have you considered
suicide?
 
If there were a line break in the text then the CSV should contain a text
representation of the line break at the break point.

I quote from RFC 4180:
--------
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
--------

Where's the encoding??
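
The "escaped" rule can be transcribed almost literally into a regular expression, which makes it easy to see that CR and LF appear in the grammar as the raw characters themselves. This is a sketch in Python. The TEXTDATA ranges come from the same RFC (%x20-21 / %x23-2B / %x2D-7E), and the names ESCAPED and is_escaped are mine.

```python
import re

# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
# (printable ASCII except the double quote and the comma)
TEXTDATA = r"[\x20-\x21\x23-\x2B\x2D-\x7E]"

# escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
ESCAPED = re.compile(r'"(?:' + TEXTDATA + r'|,|\r|\n|"")*"')


def is_escaped(field: str) -> bool:
    # True when the whole string matches the "escaped" production.
    return ESCAPED.fullmatch(field) is not None


field_with_break = '"b\r\nbb"'       # real CR and LF characters inside the quotes
field_with_letters = '"b CRLF bb"'   # the four letters C, R, L, F as plain text
field_with_quote = '"say ""hi"""'    # a literal quote written as two quotes
bad_field = '"say "hi""'             # a lone quote inside the field

print(is_escaped(field_with_break), is_escaped(field_with_letters))
print(is_escaped(field_with_quote), is_escaped(bad_field))
```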

Also:
-------
6. Fields containing line breaks (CRLF), double quotes, and commas should
be enclosed in double-quotes. For example:
"aaa","b CRLF
bb","ccc" CRLF
zzz,yyy,xxx
-------

See? No encoding there either. You've misread the RFC - a line break
occurring within a field is simply emitted as the ASCII bytes &H0D &H0A, not
as the string-literals "CRLF" or "%x0D%x0A" as I think you are suggesting.

Otherwise typing the string-literal "CRLF" (or "%x0D%x0A") in a text field
would break the file!

All this you would know if you could read BNF or had ever parsed a CSV file.

If you don't believe me, fire up Microsoft Excel and save a CSV containing a
field with a line break in it, then look at the raw bytes. (actually it
stores a line break as &H0A, but my point still stands).
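
The same experiment can be run without Excel. This is a sketch using Python's standard-library csv module: it writes one field containing a real line break and one containing the letters "CRLF", then inspects the raw bytes and reads the rows back.

```python
import csv
import io

buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["aaa", "b\r\nbb", "ccc"])      # field with a real CR LF pair
writer.writerow(["aaa", "b CRLF bb", "ccc"])    # field with the letters C-R-L-F
raw = buffer.getvalue().encode("ascii")

# Read the bytes back; newline="" leaves embedded line breaks untouched.
rows = list(csv.reader(io.StringIO(raw.decode("ascii"), newline="")))

print(raw)
print(rows)
```

The line break shows up in the output as the bytes 0x0D 0x0A inside a quoted field, and the literal text "CRLF" comes back as ordinary characters in a single field.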
 
SurturZ said:
If there were a line break in the text then the CSV should contain a text
representation of the line break at the break point.

I quote from RFC 4180:
--------
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]

You should ask "Stephany Young" that question, you macaroon.
 
Where's the encoding??
You should ask "Stephany Young" that question, you macaroon.

Guru, You're the goose that said the following:
HAHAHAHA... RFC 4180 specifies breaks using the Augmented Backus-Naur Form.
If there were a line break there you'd see it.

and you also said this:
If there were a line break in the text then the CSV should contain a text
representation of the line break at the break point.

Stephany Young asked if the line break in the original post meant that there
was a line break in the original data, which was a fair question, since line
breaks are NOT ENCODED in CSV (unless you count ASCII as encoding :-P )

You're the one that flamed her and then stupidly referred to RFC4180 which:
1) You clearly have not read
2) Could not understand even if you did read it since you don't even know
what BNF is
3) barely anyone uses when encoding CSV
4) the file in question obviously does not follow since it has some weird
use of the backslash character.

Type CSV into wikipedia and get a clue, you dolt.


--
David Streeter
Synchrotech Software
Sydney Australia


 