Closed

Rationalize UTF8 vs. ASCII and implement decision

description

I noticed in a recent change that we are using UTF8 for string/byte conversions. This is incorrect per the spec, which calls for ASCII. My original thinking went something like:

This is an STDF reading library. ASCII will be correctly decoded using UTF8, and (if someone is crazy) then they might make use of UTF8 encoding and we will be awesome and read it.

Since we are now in the business of writing STDFs, this is clearly the wrong approach. I think we should go to ASCII to be spec-compliant, but the question of how to deal with non-ASCII characters, particularly when writing STDFs, is an interesting one. Options I see:
  • Just blindly convert to ASCII and let unencodable characters be replaced with '?'. This seems reasonable, but it could be undesirable.
  • Enforce ASCII and throw a nice NotSupportedException if we attempt to write a non-ASCII character. This lets the in-memory representation stay more relaxed; the restriction only applies when you write.
I'm leaning toward the second option. I think enforcing data representation as a part of file integrity is likely a good principle. Open to thoughts.
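
To make the difference between the two options concrete, here is a minimal sketch. It is written in Python for brevity and is not the library's code; the function names `encode_lenient` and `encode_strict` are hypothetical, and `ValueError` stands in for the NotSupportedException mentioned above.

```python
def encode_lenient(text: str) -> bytes:
    """Option 1 (illustrative): silently replace unencodable characters with '?'."""
    return text.encode("ascii", errors="replace")


def encode_strict(text: str) -> bytes:
    """Option 2 (illustrative): refuse to write anything that is not ASCII."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError as exc:
        # exc.start is the index of the first character that could not be encoded.
        bad = text[exc.start]
        raise ValueError(
            f"cannot write non-ASCII character {bad!r} at index {exc.start}"
        ) from exc


if __name__ == "__main__":
    print(encode_lenient("25 \u00b5A"))  # the micro sign is lost to '?'
    print(encode_strict("LOT42"))        # plain ASCII passes through unchanged
```

The lenient version never fails but quietly changes the data that lands in the file. The strict version leaves in-memory strings alone and only complains at write time, which is the behavior option two describes.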
Closed Apr 11, 2012 at 5:10 AM by marklio
I made the change to use ASCII encoding explicitly and to throw if we can't encode something.

comments

marklio wrote Apr 11, 2012 at 4:57 AM

After thinking about it and looking at the change, I felt strongly enough about it, and the fix was easy enough, that I went ahead and made it. We can always revert it if we come to another conclusion.

Selzhanik wrote Apr 11, 2012 at 1:07 PM

I saw that yesterday when I was implementing the single character functions and wondered about that. I think you decided correctly. I'm a tad concerned it kept you up all night though!

I do wish there were some global feature to modify all strings. For instance, if someone wanted to make an STDF-to-ATDF converter, they might want to globally convert all '|' to '?'. Or maybe truncate strings longer than 255 characters to 252 and append "..." (which reminds me, I need to check string lengths on write). That same filter might be able to globally convert from UTF8 to ASCII in some desirable way. But that's a feature for another day, and likely one with little value.
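
A rough sketch of what such a global string filter could look like, again in Python and purely illustrative. Nothing here exists in the library; `replace_pipes`, `truncate`, `ascii_fallback`, and `apply_filters` are hypothetical names, and the 255 limit is taken from the comment above.

```python
from typing import Callable, Iterable

# A filter is any function that takes a string and returns a string.
StringFilter = Callable[[str], str]


def replace_pipes(text: str) -> str:
    """Swap '|' for '?', e.g. for a converter whose output uses '|' as a delimiter."""
    return text.replace("|", "?")


def truncate(text: str, limit: int = 255) -> str:
    """Cut strings longer than `limit` to `limit - 3` characters and append '...'."""
    if len(text) <= limit:
        return text
    return text[: limit - 3] + "..."


def ascii_fallback(text: str) -> str:
    """Replace non-ASCII characters with '?' while keeping the value a string."""
    return text.encode("ascii", errors="replace").decode("ascii")


def apply_filters(text: str, filters: Iterable[StringFilter]) -> str:
    """Run every filter over the string, in order."""
    for string_filter in filters:
        text = string_filter(text)
    return text


if __name__ == "__main__":
    pipeline = [ascii_fallback, replace_pipes, truncate]
    print(apply_filters("site|1 25 \u00b5A", pipeline))
```

Running the ASCII step before the truncation step means the character count and the byte count agree by the time the length is checked.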