replacing characters in a string

Peter · Dec 17, 2008

Hi

In my application I get a lot of strings which I have to "clean up"
before I pass them to a third-party library. The strings I have contain
characters which are invalid for the third-party library, so I have to
either remove them or replace them with reasonable alternatives.

What is a good method of doing this?

At the moment I have the following:

string Clean(string element)
{
    element = element.Replace(",", "");
    element = element.Replace("-", "");
    element = element.Replace("!", "");
    element = element.Replace("/", "");
    element = element.Replace("\\", "");

    element = element.Replace("æ", "ae");
    element = element.Replace("Æ", "AE");
    element = element.Replace("ä", "ae");
    element = element.Replace("Ä", "AE");
    element = element.Replace("ø", "oe");
    element = element.Replace("Ø", "OE");
    element = element.Replace("ö", "oe");
    element = element.Replace("Ö", "OE");
    element = element.Replace("å", "aa");
    element = element.Replace("Å", "AA");

    element = element.Trim(' ', '.');

    return element;
}

Thanks,
Peter

Peter Morris · Dec 17, 2008

It will work, but every Replace call scans the whole string again. I have
no experience of it, but Regex.Replace is likely to scan the string only once
and call a delegate (a MatchEvaluator) each time it finds a match against one
of the patterns you specify...

http://msdn.microsoft.com/en-us/library/ms149475.aspx
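
A minimal sketch of the single-pass idea, assuming the same replacement list as in the original post. The class name RegexCleaner and the member names Map and Pattern are made up for illustration; only Regex.Replace with a MatchEvaluator delegate comes from the suggestion above.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class RegexCleaner
{
    // Same replacements as in the original Clean method.
    private static readonly Dictionary<string, string> Map = new Dictionary<string, string>
    {
        { ",", "" }, { "-", "" }, { "!", "" }, { "/", "" }, { "\\", "" },
        { "æ", "ae" }, { "Æ", "AE" }, { "ä", "ae" }, { "Ä", "AE" },
        { "ø", "oe" }, { "Ø", "OE" }, { "ö", "oe" }, { "Ö", "OE" },
        { "å", "aa" }, { "Å", "AA" }
    };

    // One character class that matches every key in Map.
    private static readonly Regex Pattern = new Regex(@"[,\-!/\\æÆäÄøØöÖåÅ]");

    public static string Clean(string element)
    {
        // The evaluator is called once per match while the input is scanned.
        string replaced = Pattern.Replace(element, m => Map[m.Value]);
        return replaced.Trim(' ', '.');
    }
}
```

Calling RegexCleaner.Clean(" !Yahoo. ") should give "Yahoo". Whether this is actually faster than the chained Replace calls would need measuring; for short names the difference is probably small.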

Mihai N. · Dec 18, 2008

Peter said:
element = element.Replace("æ", "ae");
element = element.Replace("Æ", "AE");
element = element.Replace("ä", "ae");
element = element.Replace("Ä", "AE");
element = element.Replace("ø", "oe");
element = element.Replace("Ø", "OE");
element = element.Replace("ö", "oe");
element = element.Replace("Ö", "OE");
element = element.Replace("å", "aa");
element = element.Replace("Å", "AA");

Any chance of getting a new version of the 3rd party library?
Some of these replacements are locale sensitive.
And even for the locales where they are valid, they affect (negatively)
the quality of the text.
So that is not "clean up", that is "crap".

Imagine someone doing this to English strings:
element = element.Replace("w", "vv");
because some stupid library does not support 'w'.

Jeff Johnson · Dec 18, 2008

My take: build a "conversion matrix" and then run every character in that
string through the matrix, outputting a clean string in the end. Something
like this (air code!):

using System.Collections.Generic;
using System.Text;

public class NameCleaner // placeholder name, use your own class name
{
    private readonly Dictionary<char, string> _conversions =
        new Dictionary<char, string>();

    // Constructor
    public NameCleaner()
    {
        // Ideally you would read these from a database or settings file so
        // that you wouldn't have to recompile if you find new things to replace
        _conversions.Add(',', "");
        _conversions.Add('-', "");
        _conversions.Add('!', "");
        _conversions.Add('/', "");
        _conversions.Add('\\', "");
        _conversions.Add('æ', "ae");
        _conversions.Add('Æ', "AE");
        _conversions.Add('ä', "ae");
        _conversions.Add('Ä', "AE");
        _conversions.Add('ø', "oe");
        _conversions.Add('Ø', "OE");
        _conversions.Add('ö', "oe");
        _conversions.Add('Ö', "OE");
        _conversions.Add('å', "aa");
        _conversions.Add('Å', "AA");
    }

    private string Clean(string element)
    {
        StringBuilder sb = new StringBuilder(element.Length);

        foreach (char c in element)
        {
            // NOTE: A conditional expression mixing string and char does not
            // compile, so this is a full if/else. TryGetValue does the lookup
            // and the fetch in one step.
            string replacement;
            if (_conversions.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }

        return sb.ToString().Trim(' ', '.');
    }
}

Oh, and for what it's worth, it sounds like your third-party library
sucks....

Peter · Dec 18, 2008

Thanks for all the comments.

With regard to the 3rd-party library, it is a content management
system, and it imposes rules on the names that can be used for path
elements and the "items" or "nodes" which make up the hierarchical
content structure. Some restrictions I can accept, like / or \ in a name
(much the same as in Windows) - but I don't really know why one can't use [
or ) or "international" letters like æ or ø. I don't have an exhaustive
list of all the invalid characters.

The data I receive comes from a database, and I have to then insert it
in the CMS - which gives problems if I read "invalid" strings from the
database, so I have to make some sort of "conversion".

/Peter

Mihai N. · Dec 19, 2008

Peter said:
The data I receive comes from a database, and I have to then insert it
in the CMS - which gives problems if I read "invalid" strings from the
database, so I have to make some sort of "conversion".

Is the result visible somewhere "as is", or will it always go through some
"conversion layer"?

Maybe you can come up with some kind of escaping system?

For instance, encode the string as UTF-8, then escape all bytes > 127.
When you get the strings back, you unescape them and get the original UTF-8
strings, with no characters damaged.
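
A rough sketch of such a round trip, under some assumptions: the class name ByteEscaper and the %XX format are made up for illustration, and '%' itself is escaped too so that unescaping stays unambiguous. Whether the CMS accepts '%' in names is unknown.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper; the names and the %XX format are assumptions.
public static class ByteEscaper
{
    public static string Escape(string text)
    {
        StringBuilder sb = new StringBuilder();
        foreach (byte b in Encoding.UTF8.GetBytes(text))
        {
            // Escape every non-ASCII byte, plus '%' itself.
            if (b > 127 || b == (byte)'%')
                sb.Append('%').Append(b.ToString("X2", CultureInfo.InvariantCulture));
            else
                sb.Append((char)b);
        }
        return sb.ToString();
    }

    public static string Unescape(string escaped)
    {
        // Assumes input produced by Escape: ASCII only, well-formed %XX groups.
        byte[] buffer = new byte[escaped.Length];
        int count = 0;
        for (int i = 0; i < escaped.Length; i++)
        {
            if (escaped[i] == '%')
            {
                buffer[count++] = Convert.ToByte(escaped.Substring(i + 1, 2), 16);
                i += 2; // skip the two hex digits just consumed
            }
            else
            {
                buffer[count++] = (byte)escaped[i];
            }
        }
        return Encoding.UTF8.GetString(buffer, 0, count);
    }
}
```

Unlike the Replace approach, nothing is lost: ByteEscaper.Unescape(ByteEscaper.Escape(name)) gives back the original name, so "æ" stays "æ" for the reader even though the stored string is plain ASCII.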

Peter · Dec 20, 2008

Mihai said:
Is the result visible somewhere "as is", or will it always go through
some "conversion layer"?

Maybe you can come up with some kind of escaping system?

For instance, encode the string as UTF-8, then escape all bytes > 127.
When you get the strings back, you unescape them and get the original
UTF-8 strings, with no characters damaged.

Hi - I'm not sure I completely follow you. What I am doing is reading
company data from a database, and putting them into the hierarchical
structure of the CMS (as items/nodes in the CMS) - as well as some
accompanying data (like contact info, address, images etc).

This is to make it easy for site editors to access and change
information which is shown on some of the website's pages.

Eg.

IT companies
    microsoft
    yahoo

And some of the companies might have "illegal" characters in their
names (eg ! in Yahoo!).

/Peter

Mihai N. · Dec 22, 2008

Peter said:
This is to make it easy for site editors to access and change
information which is shown on some of the website's pages.

I am not sure how that CMS works, but I do understand that you can't store
strings with accents in it.
But the idea was that if the CMS strings go through some kind of
layer that you control before showing them to the user, you can
escape them before passing them to the CMS, and unescape them
before passing them to the user.

If (for instance) the company name is used in a URL, then
there are standard ways to escape international text: percent-encoding
for most of the URL, and for the domain name itself:
http://en.wikipedia.org/wiki/Internationalized_domain_name

Peter said:
And some of the companies might have "illegal" characters in their
names (eg ! in Yahoo!).

Going with this example: you can take "!Yahoo" and escape the "!" as
%21 before passing it to the CMS (if % is valid; if not, you can come
up with another escaping method).
When the user asks for info you can unescape (if you have your own
layer between the user and the CMS). So you take "%21Yahoo" from the
CMS, unescape, and pass "!Yahoo" to the user.
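
A small sketch of that idea with the escape marker as a parameter, in case % turns out to be invalid in the CMS. Everything here is an assumption for illustration: the class name NameEscaper, the list of invalid characters passed in by the caller, and the limit to characters up to U+00FF so that two hex digits are enough.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper; the names and the invalid-character list are assumptions.
public static class NameEscaper
{
    public static string Escape(string name, string invalidChars, char marker)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in name)
        {
            // The marker itself is escaped too, so Unescape is unambiguous.
            if (c == marker || invalidChars.IndexOf(c) >= 0)
            {
                if (c > 0xFF)
                    throw new ArgumentException("Only characters up to U+00FF are supported.");
                sb.Append(marker).Append(((int)c).ToString("X2", CultureInfo.InvariantCulture));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static string Unescape(string escaped, char marker)
    {
        // Assumes input produced by Escape with the same marker.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < escaped.Length; i++)
        {
            if (escaped[i] == marker)
            {
                sb.Append((char)Convert.ToInt32(escaped.Substring(i + 1, 2), 16));
                i += 2; // skip the two hex digits just consumed
            }
            else
            {
                sb.Append(escaped[i]);
            }
        }
        return sb.ToString();
    }
}
```

With '%' as the marker, NameEscaper.Escape("!Yahoo", "!", '%') gives "%21Yahoo", and NameEscaper.Unescape("%21Yahoo", '%') gives "!Yahoo" back. If % is not allowed, a marker such as '_' would give "_21Yahoo" instead, at the cost of also escaping any real underscores.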

Peter · Dec 22, 2008

Mihai said:
I am not sure how that CMS works, but I do understand that you can't
store strings with accents in it.
But the idea was that if the CMS strings go through some kind of
layer that you control before showing them to the user, you can
escape them before passing them to the CMS, and unescape them
before passing them to the user.

Going with this example: you can take "!Yahoo" and escape the "!" as
%21 before passing it to the CMS (if % is valid; if not, you can come
up with another escaping method).
When the user asks for info you can unescape (if you have your own
layer between the user and the CMS). So you take "%21Yahoo" from the
CMS, unescape, and pass "!Yahoo" to the user.

Thanks - good idea. I am not sure I can hook into the CMS in that way
though - to call the item "%21Yahoo" in the CMS hierarchy, but display
"!Yahoo" to the CMS user. But definitely worth investigating.

/Peter
