Category : C Source Code
Archive   : EUROSET.ZIP
Filename : AAREAD.ME

 
Output of file : AAREAD.ME contained in archive : EUROSET.ZIP
=====================================

HANDLING EUROPEAN CHARACTER SETS IN
USENET NEWS AND ELECTRONIC MAIL

Gisle Hannemyr
Oslonett A.S.
Version 1.0

1994 Jan 21

=====================================


Introduction
============

This note is written as part of an effort to get authors of mail and
newsreaders to treat European character sets with some dignity.


Manifest
========

AAREAD.ME    -- this file
Makefile     -- to make the demo application
demo.c       -- a simple demo of how you can use the fold library
fold.c       -- the fold library
fold.h       -- header file to the fold library
iso8859-1.ps -- a PostScript file showing ISO Latin 1 encoding
                (print this on a PostScript printer to see it)



Background
==========

In the beginning, there was US-ASCII. US-ASCII defined a standard
binding of numeric codes to graphical representations of characters.
The US-ASCII system used the codes from 33 to 126 (inclusive) for its
graphical symbols. This makes room for 94 graphical symbols, and may
comfortably be encoded by 7 bits. Later, the US-ASCII representation
was made into an international standard, which was given the name
ISO-646-IRV (IRV stands for "International Reference Version").

Unfortunately, 94 graphical symbols are too few for all the weird and
wonderful characters used in miscellaneous European languages. There
were several attempts to court the European market by giving us products
with our own characters. IBM and Microsoft, in particular, made a lot
of effort, and created in the process a whole maze of twisty little
encodings, all different. These are known as "codepages" (CPs). A
number of other computer manufacturers also created their own
encodings.

The one thing that was constant in this process was the encoding of
the 94 graphical characters that were defined in US-ASCII, but the
encodings between 128 and 255 were a mess.

Enter the ISO (International Organization for Standardization). ISO, with good
help from the international community, created the ISO 8859-series.
This is a series of standards defining mappings between 8-bit
character codes and graphical symbols. Each part of the series is
designed to serve the needs of a particular geographic area.

The first part of this series (ISO 8859/1, also known as "ISO Latin
alphabet no. 1", or simply "ISO Latin 1") quickly gained wide
acceptance in Scandinavia, Western Europe, and also among several
major US manufacturers (DEC, SCO, Sun, HP). It was subsequently
adopted by X/Open as the preferred character set for Unix
workstations, and it also appears to be the preferred character set in
X.11 and the default character set used by Microsoft Windows
applications, plus a number of others. It has also become the default
character set used when transmitting messages containing European
characters on Usenet.

While Microsoft's _Windows_ defaults to ISO Latin 1, Microsoft's
_MS-DOS_ does not. There is a DOS codepage (CP 850, "multilingual")
that contains all the graphical symbols of ISO Latin 1, but with
different encodings.

Therefore, in order to successfully transmit messages between an MS-DOS
system and Usenet/Internet, some conversion of character encodings is
required to prevent the message from becoming garbled. When
transmitting mail between an Apple Macintosh and Usenet, similar problems
arise.
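
To make the problem concrete, here is a minimal sketch in C of what
such a conversion does, limited to the six Norwegian letters. The
names latin1_to_cp850 and convert_buffer are made up for this
illustration and are not part of the fold library, which is the thing
to use for the full character set.

```c
#include <stddef.h>

/* Hypothetical helper, NOT part of the fold library: map the six
 * Norwegian letters from ISO 8859/1 to CP 850. Every other code is
 * passed through unchanged, which is only correct for US-ASCII. */
static unsigned char latin1_to_cp850(unsigned char c)
{
    switch (c) {
    case 0xC6: return 0x92; /* capital AE */
    case 0xD8: return 0x9D; /* capital O with stroke */
    case 0xC5: return 0x8F; /* capital A with ring */
    case 0xE6: return 0x91; /* small ae */
    case 0xF8: return 0x9B; /* small o with stroke */
    case 0xE5: return 0x86; /* small a with ring */
    default:   return c;
    }
}

/* Convert a buffer of len bytes in place. */
static void convert_buffer(unsigned char *buf, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        buf[i] = latin1_to_cp850(buf[i]);
}
```

A real converter needs one table entry for each of the codes above 127
and a policy for symbols the target set lacks. This sketch only shows
why the same letter ends up as a different byte on each system.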


Briefly, what should be done?
=============================

When shipping messages between systems which use different character
set encodings, codings need to be converted.

1) You need to find out what encoding is used on the system you
implement your reader on, but usually, the following is the case:

Unix/X.11:   ISO 8859/1
MS Windows:  ISO 8859/1
Amiga:       ISO 8859/1
MS-DOS:      IBM CP 850 (or something very similar)
Macintosh:   Macintosh character set

2) When importing messages, you need to find out what encoding is
used in the message (this does not distinguish between mail and
news, because there is no need to):

- If the message has a MIME header, parse the MIME header fields
to determine the actual encoding used.

- If the message does not have a MIME header, check whether the
body contains codes in the range 160-255 (decimal).
If no, assume:
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: 7bit
If yes, assume:
Content-Type: TEXT/PLAIN; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT

You then need to decode the message according to the information
(actual or assumed) found in these headers.

(Don't worry about what "Content-Type" and the other header fields
actually mean at this stage; they are explained below.)

3) When exporting messages, you need to treat news and mail
differently.

NEWS: Check whether the user has actually used any characters
outside the range of US-ASCII in the body of text.

If no, ship it unchanged.

If yes, use my fold library to translate the characters from
whatever your system uses to ISO 8859/1. Add the following 3
lines to the header, and ship it:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=ISO-8859-1
Content-Transfer-Encoding: 8BIT


MAIL: Check whether the user has actually used any characters
outside the range of US-ASCII in the body of text.

If no, add the following 3 lines to the header, and ship it:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: 7bit

If yes, you need to use my fold library to translate the characters
from whatever your system uses to ISO 8859/1, and then encode the
body according to RFC 1341. You then add the following 3 lines to
the header, and ship it:
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=ISO-8859-1
Content-Transfer-Encoding: QUOTED-PRINTABLE
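
The import guess in step 2 and the header choice in step 3 can be
sketched in C roughly as follows. All names here (guess_unlabelled,
has_non_ascii, export_headers) are made up for illustration and are
not part of the fold library. The sketch only picks the header lines;
translating and encoding the body is a separate job.

```c
#include <stddef.h>

enum guess { GUESS_US_ASCII_7BIT, GUESS_ISO_8859_1_8BIT };

/* Import rule for a message without a MIME header: any code in the
 * range 160-255 makes us assume ISO-8859-1 / 8BIT. */
static enum guess guess_unlabelled(const unsigned char *body, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (body[i] >= 160)
            return GUESS_ISO_8859_1_8BIT;
    return GUESS_US_ASCII_7BIT;
}

/* Export rule: does the body contain anything outside US-ASCII? */
static int has_non_ascii(const unsigned char *body, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++)
        if (body[i] > 127)
            return 1;
    return 0;
}

/* Header lines to add on export. Returns NULL when the message should
 * be shipped unchanged (news with a pure US-ASCII body). */
static const char *export_headers(int is_mail, int non_ascii)
{
    if (!non_ascii)
        return is_mail
            ? "MIME-Version: 1.0\n"
              "Content-Type: TEXT/PLAIN; charset=US-ASCII\n"
              "Content-Transfer-Encoding: 7bit\n"
            : NULL;
    return is_mail
        ? "MIME-Version: 1.0\n"
          "Content-Type: TEXT/PLAIN; charset=ISO-8859-1\n"
          "Content-Transfer-Encoding: QUOTED-PRINTABLE\n"
        : "MIME-Version: 1.0\n"
          "Content-Type: TEXT/PLAIN; charset=ISO-8859-1\n"
          "Content-Transfer-Encoding: 8BIT\n";
}
```

A caller would run has_non_ascii on the outgoing body, pass the
result to export_headers together with a mail/news flag, and only
then translate (and, for mail, encode) the body.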



MIME
====

MIME specifies 3 header fields of particular interest to us:

1) The MIME-Version header field.

This tells us that the message contains headers compliant with
RFC 1341.

2) The Content-Type header field.

This describes the data contained in the body part and has the
following syntax:

type "/" subtype *[ ";" parameter ]

We are not going to do a full MIME implementation, so we shall
just handle a single type (TEXT) and a single subtype (PLAIN).
Everything else is UNKNOWN.

The interesting thing here is the parameter. We shall look for
a parameter named "charset", and its value. RFC 1341 defines
the following values:

US-ASCII
ISO-8859-x (where x refers to a specific part of the
ISO-8859 set of standards).

So far, the only ISO-8859 value handled by my fold library is
ISO-8859-1.


3) The Content-Transfer-Encoding header field.

This field tells us how the body is encoded. We consider 3
different values for this field:

QUOTED-PRINTABLE
8BIT
7BIT

Both 8BIT and 7BIT indicate that the body is _not_ encoded.
QUOTED-PRINTABLE tells us that the body is encoded according
to the scheme described in section 5.1 of RFC 1341.
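
For readers who have not met quoted-printable before: a byte that
cannot be sent as-is is written as "=" followed by two hexadecimal
digits, and a "=" at the very end of a line is a soft line break that
must be removed. Below is a minimal decoding sketch in C. The name
qp_decode is made up for this example; it is not the library's
unmimebuffer function and makes no claim to match its behaviour. It
also accepts lower-case hex digits and skips the finer points of the
RFC (such as stripping trailing white space).

```c
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if c is not a hex digit. */
static int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical decoder: decode a NUL-terminated quoted-printable
 * body in place and return the decoded length. The output is never
 * longer than the input, so decoding in place is safe. */
static size_t qp_decode(char *buf)
{
    const char *src = buf;
    char *dst = buf;

    while (*src) {
        if (src[0] == '=' && src[1] == '\n') {
            src += 2;                       /* soft line break (LF) */
        } else if (src[0] == '=' && src[1] == '\r' && src[2] == '\n') {
            src += 3;                       /* soft line break (CRLF) */
        } else if (src[0] == '=' &&
                   hexval((unsigned char)src[1]) >= 0 &&
                   hexval((unsigned char)src[2]) >= 0) {
            *dst++ = (char)(hexval((unsigned char)src[1]) * 16 +
                            hexval((unsigned char)src[2]));
            src += 3;                       /* "=XX" is one byte */
        } else {
            *dst++ = *src++;                /* literal character */
        }
    }
    *dst = '\0';
    return (size_t)(dst - buf);
}
```

A stray "=" that is not followed by two hex digits is copied through
unchanged here. That is a design choice of the sketch; a stricter
reader might want to flag it instead.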


SAMPLES
=======

Below are some sample headers snarfed up by grepping my mailbox and our
news/spool. This is the sort of thing your program should handle.
Assuming your program is running on an MS-DOS computer using CP-850,
and that your message is contained in char *MsgBuffer, these are the
appropriate calls to my folding library to import these correctly:


1) 7bit US-ASCII
----------------
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit

/* do nothing */


2) Unencoded ISO Latin 1
------------------------
Content-Type: TEXT/PLAIN; CHARSET=ISO-8859-1
Content-Transfer-Encoding: 8BIT

initfold(ISOL1, CP850); /* set up iso-8859-1 -> CP 850 */
foldbuffer(MsgBuffer); /* iso-8859-1 -> CP 850 */


3) Encoded ISO Latin 1
------------------------
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable

unmimebuffer(MsgBuffer); /* MIME -> iso-8859-1 */
initfold(ISOL1, CP850); /* set up iso-8859-1 -> CP 850 */
foldbuffer(MsgBuffer); /* iso-8859-1 -> CP 850 */


4) 7 bit Norwegian ISO-IR-60
----------------------------
Content-Type: text/plain; charset=x-iso-ir-60
Content-Transfer-Encoding: 7bit

initfold(ISO646N, CP850); /* set up iso-ir-60 -> CP 850 */
foldbuffer(MsgBuffer); /* iso-ir-60 -> CP 850 */

X-iso-ir-60 is a private value strictly outside the scope of
RFC-1341 (but RFC-1341 explicitly allows private values provided
they start with "X-"). This value is used in Norway, and it
will be appreciated if your software recognizes it.
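
Note that the samples above differ in letter case and in whether the
charset value is quoted. A minimal sketch in C of pulling the charset
out of a Content-Type value, tolerant of both, might look like this.
The names ci_equal_n and get_charset are made up for illustration and
are not part of the fold library, and this is far from a full MIME
parameter parser.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Case-insensitive comparison of the first n characters. */
static int ci_equal_n(const char *a, const char *b, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
        if (a[i] == '\0')
            return 1;
    }
    return 1;
}

/* Hypothetical helper: copy the lower-cased charset value from a
 * Content-Type value such as  text/plain; charset="iso-8859-1"
 * into out. Returns 1 if a charset was found, 0 otherwise. */
static int get_charset(const char *value, char *out, size_t outsize)
{
    const char *p = value;
    size_t n = 0;

    if (outsize == 0)
        return 0;
    while ((p = strchr(p, ';')) != NULL) {
        p++;                                 /* skip the ';' */
        while (*p == ' ' || *p == '\t')
            p++;
        if (!ci_equal_n(p, "charset=", 8))
            continue;                        /* some other parameter */
        p += 8;
        if (*p == '"')
            p++;                             /* optional opening quote */
        while (*p && *p != '"' && *p != ';' &&
               !isspace((unsigned char)*p) && n + 1 < outsize)
            out[n++] = (char)tolower((unsigned char)*p++);
        out[n] = '\0';
        return n > 0;
    }
    return 0;
}
```

The result can then be compared against "us-ascii", "iso-8859-1" and
"x-iso-ir-60" to pick among the calls shown in the samples.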


Q&A
===

Q1: All this is about the message body. What about using European
characters in headers?
A1: Don't do it. There is an RFC on it (RFC-1342), but in my experience,
such headers tend to mess up mail- and newsreaders. When exporting
messages use the "stripbuffer" function in the library to make
sure that headers are US-ASCII, 7BIT.

Q2: What are these RFC thingies?
A2: The RFCs are Internet standards documents available by anonymous
FTP from any decent archive site. If you're really stuck, try:
ds.internic.net:rfc/*
Those of particular interest in this context are:
- RFC-822 : mail format
- RFC-1036 : news format
- RFC-1341 : MIME encoding
- RFC-1342 : MIME encoding of headers

Q3: Why is there an 8BIT Content-Transfer-Encoding, when there are no
standardized Internet transports for which it is legitimate to
include unencoded 8-bit data?
A3: Because people use it, and it works.
Cd to your mail spool, into almost any international group
with some volume, and type:
% fgrep -i content-transfer-encoding * | grep 8
See?
