Weblog (or Blog)


Thursday, January 10, 2013

What I Know About Using Plain Text

The benefits of using plain text for all the information you want to retain are simplicity, portability, and durability. Almost every application can import and use plain text. Microsoft Word files are a good recent example of the barriers created by proprietary file formats. For the 2007 version of Word, Microsoft changed the document format to one based on XML. The change is an improvement, but older versions of Word cannot open the new format without additional software. In other words, the new format is not backward compatible with older versions of Word. Because this is a recent change, the additional software to work around the incompatibility is readily available. But consider the situation 10 years from now. As this article from Macworld magazine says, sometimes you can't even open your older documents. You never have to worry about that with plain text, nor about what happens when you change to different software. As noted, almost all software can import plain text, so it has ultimate portability.

If that's the case, why doesn't all software simply operate in plain text? Because it's "plain." No formatting, no bold, no italics, no spacing control other than blank lines, spaces, and tabs. The "plain" in plain text comes from its foundation, the ASCII encoding, which includes only the upper and lower case English alphabet, the digits 0 - 9, basic punctuation, and a few control characters. An encoding is a code that relates characters to numbers ('cause computers don't speak English). For a full definition, read the Linux Information Project's Plain Text Definition.
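To make the "characters to numbers" idea concrete, here is a minimal sketch in Python (my choice of language for illustration; nothing about plain text requires it). The sample string is made up. It shows the number behind each character and what happens when a character falls outside ASCII.

```python
# Each ASCII character corresponds to a number from 0 to 127.
text = "Plain text"  # hypothetical sample string

# ord() returns the number assigned to a single character.
codes = [ord(ch) for ch in text]
print(codes)

# Encoding to ASCII produces one byte per character, with the same values.
data = text.encode("ascii")
print(list(data) == codes)

# A character outside ASCII (here a curly quotation mark) cannot be encoded.
try:
    "\u201cquoted\u201d".encode("ascii")
    outside_ascii_failed = False
except UnicodeEncodeError:
    outside_ascii_failed = True
print("curly quote rejected by ASCII:", outside_ascii_failed)
```

The curly quotation mark is a typical case: word processors insert it automatically, but it is not part of ASCII, so a strictly ASCII file has to use the straight mark instead.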

The ASCII encoding is one of the earliest, having been established in 1963. Because ASCII is limited to the English alphabet, more recent encodings have been established to encompass more languages. The current trend is toward Unicode, which has the ambitious goal of encoding every character in every language in the world. Unfortunately, there are several ways of encoding Unicode (e.g., UTF-8, UTF-16) as well as other encodings (e.g., ISO-8859, Windows-1252, and many more). Fortunately, the most frequently used ones, including UTF-8, the ISO-8859 family, and Windows-1252, are supersets of ASCII. In other words, even the most modern encodings, now fifty years from when ASCII was established, can still display ASCII files correctly. Thus, from a durability point of view, it's a good bet that files created in plain text today will be accessible over a lifetime at least.
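The superset claim is easy to check for yourself. The following is a small sketch in Python, using a made-up sample sentence and the codec names Python happens to use ("cp1252" is its name for Windows-1252). It decodes the same ASCII bytes under several encodings and confirms the results are identical. It also shows that UTF-16 is not byte-compatible with ASCII, which is why it is not on the "safe" list.

```python
# Bytes that are pure ASCII (hypothetical sample text).
ascii_bytes = "Notes kept in plain text, 1963 to now.".encode("ascii")

# Decode the same bytes under ASCII and three of its supersets.
decoded = {
    name: ascii_bytes.decode(name)
    for name in ("ascii", "utf-8", "iso-8859-1", "cp1252")
}

# All four readings give exactly the same text.
all_same = len(set(decoded.values())) == 1
print("identical under all four encodings:", all_same)

# UTF-16 stores even a plain "A" as two bytes, so an ASCII file
# is not automatically a valid UTF-16 file.
utf16 = "A".encode("utf-16-le")
print(len(utf16))
```

This only demonstrates the point for text that stays within ASCII. Once a file contains accented letters or curly quotation marks, the encodings above no longer agree on the bytes.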

That does not mean there will never be issues. Almost all of us have seen a web page display a few characters that do not make any sense. For example, quotation marks may appear as gibberish or as a question mark rather than an actual quotation mark. This is almost always the result of the browser using one encoding to display the page when the text was actually saved in a different one. Web pages are supposed to tell the browser what encoding they use. Many, however, don't, and the browser has to guess. If it guesses wrong, the user is left to figure out what those gibberish characters are.
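The gibberish can be reproduced in a few lines. This is a rough sketch in Python, assuming the text was saved as UTF-8 and the reader wrongly guesses Windows-1252 (one common mismatch, not the only one). The sample phrase is made up.

```python
# Text containing a curly apostrophe (U+2019), which is not in ASCII.
original = "It\u2019s plain text"

# The page is saved as UTF-8: the apostrophe becomes three bytes.
utf8_bytes = original.encode("utf-8")

# Wrong guess: the reader treats those bytes as Windows-1252,
# so each of the three bytes shows up as its own character.
garbled = utf8_bytes.decode("cp1252")
print(garbled)

# Another wrong guess: strict ASCII, with undecodable bytes replaced
# by a placeholder character (often drawn as a question mark).
placeholder = utf8_bytes.decode("ascii", errors="replace")
print(placeholder)

# Right guess: decoding as UTF-8 recovers the original text.
recovered = utf8_bytes.decode("utf-8")
print(recovered == original)
```

Note that every plain ASCII letter in the phrase survives both wrong guesses untouched. Only the character outside ASCII is damaged, which is why these problems usually show up as a few odd characters rather than a ruined page.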

The bottom line is that if you predominantly use the English alphabet for your documents, keeping to plain text will ensure that you can access and use them in the future.

posted at: 17:16