Character encodings and Black Diamonds

From an e-mail I wrote today to a colleague confused about character encodings. We copied a bunch of files from an old HPUX web server to a new RHEL server running modern Apache. The files viewed from the new server have the dreaded black diamonds all over the place, and he is trying to understand why.

 

Hi, I misread your mail when I first responded that you were exactly right. Sadly, you weren’t:

Be aware that this looks to be more of a file system issue that a web server issue from my initial findings. It may be a case of both, but the fact that a file moved to [Computer2] got the problem and then moved to [Computer1] still had the problem, when the original file on [Computer1] was fine. I think the UTF8 encoding is being set on the file when it is moved to the [Computer2] machine.

It’s not a file system issue. Character encoding isn’t an attribute of the file, like read-only or hidden, that can be set on a file or munged by moving a file from one file system to another. An encoding is a table that maps the numbers that are actually in the file to the symbols we humans use to read and write with. “Code page” is another term for character encoding; it started with IBM and is the term Microsoft uses today.
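To make the “table” idea concrete, here is a minimal sketch in Python (my choice for illustration only; nothing on either server depends on it). The byte value 0xE9 and the helper name `decode_with` are made up for the example. It shows one and the same number from a file turning into different symbols depending on which table is used to look it up.

```python
# One byte, exactly as it would sit in a file: the number 233 (0xE9).
raw = bytes([0xE9])

def decode_with(data: bytes, encoding: str) -> str:
    """Look the bytes up in the named encoding's table (hypothetical helper)."""
    return data.decode(encoding)

as_cp1252 = decode_with(raw, "cp1252")   # Windows-1252 table: 'é'
as_latin1 = decode_with(raw, "latin-1")  # ISO-8859-1 table: also 'é'
as_cp437 = decode_with(raw, "cp437")     # old IBM PC code page: 'Θ'

print(as_cp1252, as_latin1, as_cp437)
```

The file never changed; only the table used to read it did.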

The files we think of as “just plain text, dangit!” have been written with an encoding too.  It’s just that here in the US, one encoding, ASCII, and a few close relatives and descendants have been ubiquitous since the beginning and we haven’t ever had to even consider the concept.  Now that the web is international, though, we North Americans are having to learn about this.

MS Word uses one set of numbers (an encoding) to represent the characters we type, and the Apache web server is (currently) configured to assume a different set was used.

Specifically: Apache is configured to assume a file is encoded in UTF-8 if it isn’t told otherwise, while MS Word is writing in either Windows-1252 (AKA CP-1252) or ISO-8859-1, which are very similar to one another.
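As a rough illustration of how similar those two encodings are, here is a small Python sketch. The two byte values are picked by me as examples; they are not taken from our actual files. The tables agree on ordinary accented letters and differ only in the 0x80 to 0x9F range, which is where Windows-1252 keeps its curly quotes and similar punctuation.

```python
# Compare how the two encodings read the same bytes.
accented = bytes([0xE9])     # in the range both encodings share
curly_quote = bytes([0x92])  # in 0x80-0x9F, where the two tables differ

# Both tables map 0xE9 to the same character.
same_for_accented = accented.decode("cp1252") == accented.decode("latin-1")

cp1252_char = curly_quote.decode("cp1252")   # U+2019, right single quotation mark
latin1_char = curly_quote.decode("latin-1")  # U+0092, an invisible control character

print(same_for_accented, hex(ord(cp1252_char)), hex(ord(latin1_char)))
```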

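So the files contain numbers that mean something in Windows-1252 (typically things like curly quotes and apostrophes) but are not valid when read as UTF-8. The browser can’t turn them into a character, so it shows us the black diamond thingy, which is the Unicode replacement character (U+FFFD).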
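Here is a minimal Python sketch of that failure, assuming a made-up sample string with a curly apostrophe of the kind Word likes to insert. It is not a reproduction of what Apache or the browser does internally, just the same mismatch in miniature: text written as Windows-1252, then read back as UTF-8.

```python
# Sample text (an assumption for illustration) containing a curly apostrophe.
text = "It\u2019s fine"
on_disk = text.encode("cp1252")  # the apostrophe becomes the single byte 0x92

# Strict UTF-8 decoding rejects 0x92: it is not a valid start of a character.
try:
    on_disk.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Lenient decoding, as browsers do, substitutes U+FFFD (the black diamond).
shown = on_disk.decode("utf-8", errors="replace")

# Reading with the table the file was actually written in recovers the text.
recovered = on_disk.decode("cp1252")

print(strict_ok, shown, recovered)
```

Reading the bytes with the right table brings the apostrophe back, which is why the fix lies in telling the reader which encoding was used (or converting the files), not in the file system.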

Does that help?

Best,
Mike
