Dec 13, 2017
Description of new offline reader format for use with Usenet and Internet mail and news.

File HELLDIVE.ZIP from The Programmer’s Corner in
Category UNIX Files
File Name       File Size   Zip Size   Zip Type
FILE_ID.DIZ            91         79   deflated
HELLDIVE.TXT        22588       7529   deflated

Contents of the HELLDIVE.TXT file



Helldiver Packet Format Version 1.0

Copyright (c) 1992 Rhys Weatherley

[email protected]

Last Update: 18 December 1992

INTRODUCTION

For many years, the FidoNet community has been using QWK and other formats to
enable users to download their mail and conferences to be read while off-line.
This not only saves phone charges and prevents tying up BBS lines for long
periods of time; it also allows a user to use much more powerful tools on
their own machine to process the downloaded "packets" than what can be made
available in an on-line environment.

To date, however, very little work has been done in the USENET and dial-in Unix
community to facilitate the same user operations. Some attempts have been
made to use QWK, but due to QWK's limitations and unsuitability for the USENET
message formats, such efforts have not been very successful.

Within USENET, the tendency seems to be either "dial-in to some other machine
and put up with it", or "set up your own USENET site". The former keeps the
user at the mercy of whatever user interfaces the admin of the other machine
sees fit to install, and the latter requires far more computing knowledge than
the average computer user is expected to have. Both of these can serve to
lock out large portions of the computer-literate public from experiencing
USENET. The latter option can also give rise to security problems in the form
of forged USENET messages, which a more controlled dial-in system avoids.

The purpose of this document is to define a new packet format which is aware
of the conventions used in the USENET community, forming a middle ground
between dial-in user interfaces and full USENET connectivity. It is not
limited to downloading USENET news, however. The same format could be used
to enable a Unix user to package up their Unix mailbox and download it for
later perusal. The format is extensible to other kinds of news or conference
systems, so it is feasible, although not yet defined, that QWK or FidoNet
messages could be accommodated within the same packet as USENET messages.

ANATOMY OF A PACKET

A packet is a group of files, collected into a compressed archive. The
standard compression technique defined by this document is ZIP. Other
techniques such as ARJ, ZOO, ARC, LZH, etc. can also be used. It is also
possible for Unix's tar.Z format to be used to transmit Helldiver packets.
The minimum requirement is a method to collect a group of files into a
single packet, and a method to expand the packet back into the original files.
Each of the filenames in a packet should be stored in upper case on those
systems where case matters.

A packet consists of zero or more "message areas" (commonly called "newsgroups"
in USENET jargon). Usually, each message area corresponds to a different topic
of discussion.

The following file specifications may appear in a packet:

    INFO        Optional textual information.
    AREAS       Index of the message areas within the packet.
    REPLIES     Index of the reply message areas from the user.
    *.MSG       Text of the messages in a particular message area.
    *.IDX       Index information for messages in a message area.

Other filenames may also appear in the packet, but are not defined by this
specification, so they should be avoided by generating software, and ignored
by receiving software.

The INFO file is an optional text file which may contain any kind of textual
information from the generating system. Typically this file would only be
present if there is some kind of urgent message that must be sent to the
receiving user. Use of this file to store the name of the generating BBS
and other such static information is possible, but discouraged to save space
and transmission time. Lines in this text file are terminated by LF
characters (not by CR-LF pairs).

The AREAS file contains an index of the message areas present within the
packet, specifying the name of the message area, the filename the messages
may be found in, and the message format. This is specified further in the
next section.

The REPLIES file contains an index of the message areas present within the
packet that contain reply messages from a user which should be mailed or
posted on the receiving system (usually the system that packets are normally
downloaded from for off-line reading). In most cases, a packet will contain
either an AREAS file or a REPLIES file, but both may be present. See the
section "REPLIES FILE" below for more information.

The *.MSG files contain the text of the messages from a single message area.
The actual format of this file depends on the type of message area specified
in the AREAS file. See the section "MESSAGE FILES" below for more information.

The *.IDX files provide an index into the *.MSG files, usually specifying
where each message starts and the contents of some of the common message
header fields. These files are intended for use by reading software on the
recipient's system to quickly display an overview of the messages present in
a message area. See the section "INDEX FILES" below for more information.

AREAS FILE

The AREAS file is a text file containing zero or more lines, each of which
specifies a single message area, its type and the name of the message/index
file pair in which the messages appear. In particular, each line has the
following form:

    prefix<TAB>area name<TAB>type[<TAB>description]

where "prefix" specifies the name of the message/index file pair, "area name"
is the name of the message area, "type" specifies the formats of the message
and index files, and "description" is a descriptive name for the message
area. Lines are terminated with a single LF character (not CR-LF).

The message and index files corresponding to the message area have the names
"prefix.MSG" and "prefix.IDX" respectively. If "prefix" contains alphabetic
characters, they must be upper case.

The message area name may be any sequence of printable ASCII characters (space
through tilde). Under USENET, this is typically a dotted name like
"comp.lang.c". Other networks may include spaces or other unusual characters
in the area names, so the receiving software must be aware of this fact,
and act accordingly. Also, receiving software must deal gracefully with
characters that have the high bit set, or names that contain control
characters, since people in other countries that speak a language other than
English may wish to use their country's native encoding for the message area
name. The only hard rule is that the name may not contain TAB, CR or LF.
Receiving software should treat the name as an indivisible string to be
displayed to the user.

The type field consists of two ASCII characters (usually alphabetic). The
first specifies the format of the message file, and the second specifies the
format of the index file. The following message file formats are currently
defined (case is significant):

    u    USENET news articles
    m    Unix mailbox articles
    M    Mailbox articles in the MMDF format
    b    Binary 8-bit clean mail format
    B    Binary 8-bit clean news format

The individual message file encodings are explained further in the next
section. The following index file formats are currently defined (again, case
is significant):

    n    No index file
    c    C-news overview database format
    C    Shorter C-news overview database format
    i    Offset/length pairs delineating the messages

These types are explained further in the section "INDEX FILES" below.

Further types may be defined in future versions of this specification. If
the receiving software does not recognise a message file type, it should ignore
the corresponding message and index files. If the receiving software does
not recognise an index file type, it can either ignore the message file, or
attempt to break down the message file into separate messages by some other
means. The user should be warned if a message area has been ignored.

It is recommended that packet generation software support at least the index
file type 'C', since it gives the best compromise between transmission time
and assisting the reader software to display message area summaries.

The optional message area description in the AREAS file consists of any
sequence of printable ASCII characters. This may be used to insert a
"readable" name for the message area. It may not contain TAB, CR or LF.
Additional fields may be added in a future version of this specification.
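
As an illustration of the rules above, the following is a minimal Python
sketch of an AREAS reader. It assumes the fields on each line are separated by
single TAB characters (consistent with the rule that names may not contain TAB
and with the TAB-separated fields mentioned elsewhere in this document). The
names "Area" and "parse_areas" are hypothetical and are not part of the
specification. Bytes are decoded as Latin-1 so that names with the high bit
set pass through without raising errors, and any fields beyond the fourth are
ignored to allow for future extensions.

```python
from typing import List, NamedTuple, Optional


class Area(NamedTuple):
    prefix: str                 # base name of the .MSG/.IDX file pair
    name: str                   # message area name, shown to the user as-is
    msg_type: str               # first type character (message file format)
    idx_type: str               # second type character (index file format)
    description: Optional[str]  # optional readable name


def parse_areas(data: bytes) -> List[Area]:
    """Parse the contents of an AREAS file into a list of Area records."""
    areas = []
    for raw in data.split(b"\n"):          # lines end with LF, not CR-LF
        if not raw:
            continue
        fields = raw.decode("latin-1").split("\t")
        if len(fields) < 3 or len(fields[2]) != 2:
            continue                       # malformed line: skip it
        prefix, name, kind = fields[:3]
        description = fields[3] if len(fields) > 3 else None
        areas.append(Area(prefix, name, kind[0], kind[1], description))
    return areas


sample = b"A0001\tcomp.lang.c\tuC\tThe C language\nMAIL\tPersonal mail\tmi\n"
for area in parse_areas(sample):
    print(area.prefix + ".MSG", area.prefix + ".IDX", area.name)
```

A reader would then look at "msg_type" and "idx_type" to decide whether it
understands the area, and warn the user about any area it has to skip.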

A message area may appear more than once in the AREAS file, each time with a
different prefix, but this is discouraged. This could be used to split large
message areas across more than one message file, but this is more conveniently
handled by generating a separate packet containing the area continuation.

MESSAGE FILES

The format of the message file depends on the message area type specified in
the AREAS file. This version of the specification defines three formats,
which are in common use in the USENET and Unix community, and two additional
binary formats which permit messages to be stored with no modification or
assumptions about line lengths and byte values.

For each of the first three formats, lines are terminated with LF characters.
Any CR characters in the messages should be considered as data characters, or
ignored on receipt. In particular, MS-DOS systems should strip CR characters
from text messages before writing them to a packet.

A 'u' (USENET) message file is a text file consisting of one or more messages
prefixed with an rnews header. This header has the form "#! rnews n" where
"n" is the number of bytes in the message that follows the header, excluding
the line-feed character which terminates the header. If the number in the
header is followed by white space and other characters, these other characters
should be ignored, until the terminating LF character is encountered.
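
The rnews framing described above can be walked without any index file. The
following Python sketch is one possible way to do it; the function name
"split_rnews" is hypothetical. It returns each message together with the
offset of its first byte, which is the position just after the header line.

```python
from typing import List, Tuple

RNEWS = b"#! rnews "


def split_rnews(data: bytes) -> List[Tuple[int, bytes]]:
    """Split a 'u' message file into (offset, message) pairs."""
    messages = []
    pos = 0
    while pos < len(data):
        eol = data.find(b"\n", pos)
        if eol == -1:
            raise ValueError("unterminated rnews header")
        header = data[pos:eol]
        if not header.startswith(RNEWS):
            raise ValueError("expected an rnews header at offset %d" % pos)
        parts = header[len(RNEWS):].split()
        if not parts:
            raise ValueError("rnews header has no byte count")
        count = int(parts[0])      # anything after the count is ignored
        start = eol + 1            # the count excludes the header's LF
        messages.append((start, data[start:start + count]))
        pos = start + count
    return messages


first = b"Subject: a\n\nhi\n"
second = b"Subject: b\n\nbye\n"
batch = (b"#! rnews %d\n" % len(first) + first
         + b"#! rnews %d trailing junk\n" % len(second) + second)
for offset, message in split_rnews(batch):
    print(offset, len(message))
```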

A note about the rnews header: although a terser separator could be used, the
rnews header has the following advantages: (a) the messages can be extracted
in the absence of index files, or where the index files have an unknown type,
and (b) the message files can be imported into a USENET system as standard
rnews batches. Thus, if the user wishes to set up a real USENET site, or
simply use dedicated USENET software to read packets, they can use their
existing packet provider as a convenient read-only newsfeed, with no extra
burden placed on the system administrator of the generating system.

A 'm' (Unix mailbox) message file is a text file consisting of one or more
messages. The first line of each message must start with the character
sequence "From ". Any remaining lines in the message which start with
"From " should have the character '>' prepended. Thus the "From " lines
delimit the message file into separate messages.
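
A short Python sketch of the "From " convention follows, with hypothetical
function names. "escape_message" is what a generator might apply before
writing a message, and "split_mbox" is what a reader might use to recover the
messages. The sketch assumes every message already begins with its own
"From " line.

```python
import re
from typing import List

FROM_LINE = re.compile(rb"^From ", re.MULTILINE)


def escape_message(message: bytes) -> bytes:
    """Prepend '>' to every line after the first that starts with 'From '."""
    first, sep, rest = message.partition(b"\n")
    return first + sep + FROM_LINE.sub(b">From ", rest)


def split_mbox(data: bytes) -> List[bytes]:
    """Split an 'm' message file at each line that starts with 'From '."""
    starts = [m.start() for m in FROM_LINE.finditer(data)]
    ends = starts[1:] + [len(data)]
    return [data[s:e] for s, e in zip(starts, ends)]


one = b"From alice Mon Dec 14 1992\nSubject: x\n\nFrom here on it works\n"
two = b"From bob Tue Dec 15 1992\nSubject: y\n\nbye\n"
mailbox = escape_message(one) + escape_message(two)
print(len(split_mbox(mailbox)))
```

Because the escaping changes the message body, this format is not suitable
for data that must arrive byte-for-byte intact.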

A 'M' (MMDF mailbox) message file is a sequence of one or more messages,
separated by at least 4 Control-A characters. The message file may optionally
start and end with a sequence of such characters. If a sequence of 4 or more
Control-A characters occurs in a message, it should be "adjusted" by the
insertion of spaces to split the sequence. The use of Control-A characters
within a message is discouraged.

The 'm' and 'M' formats were chosen for mail because of their common
occurrence in the Unix community. The generating system may elect to instead
convert a mailbox into the USENET format if it wishes, and set the message area
name to some descriptive string to inform the reader. It is recommended
however, that 'm' or 'M' be used for mail and 'u' for USENET news so that
reader software can make a distinction between the two if it wishes.

The 'b' (binary mail) and 'B' (binary news) formats are identical. The
contents of each message must conform to RFC-822/1036 and may contain content
information compatible with RFC-1341 (MIME). The only difference between
the messages of these formats and the preceding formats is that no assumption
is made about line lengths, and any of the 256 values for a byte may be used
in any position. Each message is preceded by a 4-byte value which indicates
the length of the message in bytes, stored in big-endian order (i.e. high
byte first, low byte last). The difference between 'b' and 'B' is a semantic
one: message files of type 'b' are expected to contain mail messages, and
message files of type 'B' are expected to contain news messages. Thus, reader
software can make a distinction between the two if it desires.
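
The length-prefixed framing used by 'b' and 'B' can be sketched in Python with
the standard "struct" module. The function names are hypothetical. The length
is treated as an unsigned 32-bit value, which is an assumption, since the text
above does not say whether the 4-byte value is signed.

```python
import struct
from typing import Iterable, List, Tuple


def pack_binary(messages: Iterable[bytes]) -> bytes:
    """Write each message preceded by its length as 4 big-endian bytes."""
    return b"".join(struct.pack(">I", len(m)) + m for m in messages)


def unpack_binary(data: bytes) -> List[Tuple[int, bytes]]:
    """Return (offset, message) pairs; offset is the first byte after the length."""
    messages = []
    pos = 0
    while pos < len(data):
        if pos + 4 > len(data):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from(">I", data, pos)
        pos += 4
        if pos + length > len(data):
            raise ValueError("truncated message")
        messages.append((pos, data[pos:pos + length]))
        pos += length
    return messages


blob = pack_binary([b"Subject: raw\n\n" + bytes(range(256)), b"Subject: x\n\n"])
for offset, message in unpack_binary(blob):
    print(offset, len(message))
```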

For most practical purposes, 'u', 'm' and 'M' should be sufficient. The binary
'b' and 'B' types should be used for articles that contain 8-bit binary data.
It is possible to use type 'u' for binary data as well, but 'm' and 'M'
cannot be used, because the message contents may be modified. When MIME becomes
more wide-spread, it is expected that binary messages containing programs,
sound, pictures and video will become popular, necessitating these binary
types.

Note that MIME messages can be stored in 'u', 'm' and 'M' message files, but
any binary components should be encoded with quoted-printable or base64 (which
is expected to be the most common usage of MIME in the near future). It is
not required that 'b' or 'B' be used for MIME messages: only those containing
raw unencoded binary data (as indicated by the Content-transfer-encoding
header value "binary").

INDEX FILES

This specification defines four index file types, which provide varying
degrees of support for packet readers.

Type 'n' indicates that no index file is present, and it is up to the packet
reader to extract messages from the message file. Use of this type is
discouraged, except where transmission time must be minimised (at the expense
of packet reader simplicity). It may be useful where the generating system
is providing a USENET newsfeed using packets.

A type 'c' index file is a text file (LF terminated lines), with one line per
message that occurs in the message file. The lines in the index file should
be in the same order as the corresponding messages. Each line has the
following form:

    offset<TAB>subject<TAB>author<TAB>date<TAB>mesgid<TAB>
    refs<TAB>bytes<TAB>lines

[Note: the line-wrapping here is for document-formatting purposes only. No
line-wrapping occurs in the index files]. The fields have the following
semantics:

    offset      Seek position in the message file of where the corresponding
                message starts. The first seek position is 0. For the 'u'
                format, this indicates the start of the line following the
                rnews header line. For the 'm' format, this indicates the
                start of the "From " line and for the 'M' format, this
                indicates the start of the article after the Control-A
                sequence. For the 'b' and 'B' formats, this indicates the
                first byte of the message after the 4-byte message length.

    subject     The "Subject:" line from the message.

    author      The "From:" line from the message.

    date        The "Date:" line from the message.

    mesgid      The "Message-Id:" line from the message.

    refs        The "References:" line from the message.

    bytes       The number of bytes in the message.

    lines       The "Lines:" line from the message. Note that this field
                is pretty useless these days on USENET, but is still popular.
                It is meant to indicate the number of lines in the body of
                the message. Generating software may elect to re-generate
                this value if it is not present in the original message,
                but this is not required.

If any of these fields contained TABs, newlines or other white space in the
original articles, they should be converted into single spaces. All fields
must be present, but some may be empty. The "bytes" field must not be empty,
since it provides necessary information for packet readers. Each field must
conform to the Internet RFC documents RFC-822 or RFC-1036.

Optionally, a header line may end with one or more extra TAB-separated fields
for other RFC-compliant header fields, together with the header field names.
e.g. "Supersedes: <[email protected]>". These fields are not defined by this
version of the specification, and are by arrangement between the generator
and recipient only.
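
A packet reader might handle a type 'c' index line along the following lines.
This is a sketch only: the names "OverviewEntry", "parse_c_index_line" and
"extract_message" are hypothetical, and the fields are assumed to be separated
by single TAB characters. Any extra trailing fields are kept verbatim rather
than interpreted.

```python
from typing import NamedTuple, Tuple


class OverviewEntry(NamedTuple):
    offset: int
    subject: str
    author: str
    date: str
    mesgid: str
    refs: str
    nbytes: int
    lines: str              # kept as text because it may be empty
    extra: Tuple[str, ...]  # extra fields, by arrangement only


def parse_c_index_line(line: str) -> OverviewEntry:
    """Parse one line of a type 'c' index file."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 TAB-separated fields")
    if not fields[6]:
        raise ValueError("the bytes field must not be empty")
    return OverviewEntry(int(fields[0]), fields[1], fields[2], fields[3],
                         fields[4], fields[5], int(fields[6]), fields[7],
                         tuple(fields[8:]))


def extract_message(msg_file: bytes, entry: OverviewEntry) -> bytes:
    """Use the offset and bytes fields to pull one message out of a .MSG file."""
    return msg_file[entry.offset:entry.offset + entry.nbytes]


line = ("12\tRe: pointers\tjdoe@example.com (John Doe)\t"
        "Fri, 18 Dec 1992 10:00:00 GMT\t<1@example.com>\t\t15\t\n")
entry = parse_c_index_line(line)
print(entry.subject, entry.offset, entry.nbytes)
```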

This format is compatible with the news overview database format of C-news;
the only difference is the substitution of an offset for the article
number used by C-news. The C-news format was designed to assist threading
newsreaders, so this packet format should provide similar assistance to
compliant packet readers.

The 'C' format is similar to 'c', except that the "mesgid" and "refs" fields
are dropped. These fields can commonly be quite long and are mainly of use to
packet readers which perform Message-ID based message threading. Packet
readers which perform subject threading (i.e. sort on the subject line and
then on the date and/or arrival order) do not require such information. The
format of the header lines in this case is as follows:

    offset<TAB>subject<TAB>author<TAB>date<TAB>bytes<TAB>lines

Further TAB-separated fields may be added in future versions of this
specification.

The "author" field is slightly different to the 'c' format. Instead of
an RFC-822 format address, it is just the author's name, extracted from the
"From:" line of the message. Most RFC-822 and RFC-1036 "From:" lines have one
of the following forms:

    address
    address (name)
    name <address>


Names may sometimes be surrounded by double-quote characters, have embedded
"(...)" sequences, or contain "useless" information after a comma (",") or
slash ("/"). The main requirement is that the generating software produce
some kind of (more or less) meaningful string for the name of the author which
can be displayed to the user by a packet reader. See RFC-822 and RFC-1036
for more information on the syntax of the "From:" line in messages.
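
One way a generator could derive the 'C' author field is sketched below in
Python, using "parseaddr" from the standard "email.utils" module. The function
name "author_name" is hypothetical, and cutting the name at the first comma or
slash is only one possible reading of the "useless information" remark above.
When no name can be found, the sketch falls back to the address itself so that
the reader always has something to display.

```python
from email.utils import parseaddr


def author_name(from_value: str) -> str:
    """Return a displayable author name for the value of a "From:" line."""
    name, address = parseaddr(from_value)
    name = name.strip().strip('"')
    for separator in (",", "/"):
        name = name.split(separator, 1)[0]   # drop trailing organisation etc.
    name = name.strip()
    return name or address or from_value.strip()


for value in ("jdoe@example.com (John Doe)",
              "John Doe <jdoe@example.com>",
              '"John Doe/ACME" <jdoe@example.com>',
              "jdoe@example.com"):
    print(author_name(value))
```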

The 'i' index format is purely binary, using 8 bytes for each message in the
corresponding message file. The first 4 bytes specify the offset into the
message file of the message and the remaining 4 bytes specify the number of
bytes in the message. Each 4-byte quantity is stored in big-endian order
(high byte first). This format is supplied to provide a trade-off between
transmission time and easy extraction of messages from a message file.
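
The 'i' format maps directly onto the Python "struct" module, as in the sketch
below. The function names are hypothetical, and both quantities are assumed to
be unsigned.

```python
import struct
from typing import Iterable, List, Tuple


def build_i_index(entries: Iterable[Tuple[int, int]]) -> bytes:
    """Pack (offset, length) pairs as 8 bytes each, big-endian."""
    return b"".join(struct.pack(">II", offset, length)
                    for offset, length in entries)


def read_i_index(data: bytes) -> List[Tuple[int, int]]:
    """Unpack an 'i' index file into (offset, length) pairs."""
    if len(data) % 8 != 0:
        raise ValueError("index size is not a multiple of 8 bytes")
    return list(struct.iter_unpack(">II", data))


index = build_i_index([(12, 15), (40, 7)])
print(index.hex())
print(read_i_index(index))
```

Given an entry, the message itself is the slice of the message file from
"offset" up to "offset + length".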

It is recommended that packet generators support at least the 'C' format, and
that packet readers support at least 'c', 'C' and 'i'. If a type is
unrecognised by a packet reader, then it must "pull apart" the message file
into separate messages itself, or flag the message area as unparsable by the
packet reader.

REPLIES FILE

The one remaining requirement is a mechanism for a user to upload replies or
new messages to a generating system for mailing or posting. While it is
possible to re-use the AREAS file for this purpose, keeping the download and
upload sections separate will help prevent messages being fed back into a
network erroneously.

The REPLIES file has a similar format to the AREAS file. Each line has the
following form:

    prefix<TAB>reply kind<TAB>type

The "prefix" and "type" fields are as before. The "reply kind" field indicates
the mechanism to use when transmitting the messages in the message file. The
following values are currently defined:

    mail    Transmit an RFC-822 compliant personal mail message
    news    Transmit an RFC-1036 compliant USENET news posting

On a Unix system, transmission of mail and news is usually performed with the
"sendmail" and "inews" programs respectively. Additional kinds may be
specified in a future version of this specification for other message formats.
Note: using the kinds "mail" and "news" for anything other than RFC-compliant
messages is discouraged. In particular, FidoNet or QWK messages
should use a different reply kind. Messages of the same reply kind can be
placed in the same message file, or in separate message files.

Further TAB-separated fields may be added to the lines in the REPLIES file
in a future version of this specification.

It is recommended that a message file type of 'b' or 'B' be used for sending
replies to minimise the chance of message corruption. The recommended index
file types for replies are 'i' and 'n'. The index types 'c' and 'C' are
discouraged because they do not provide useful information for reply purposes.

The format of the messages in the message files should follow the relevant
RFC standards, with the following restriction: any "From:", "Sender:",
"Control:", "Approved:" or other similar "dangerous" header lines should be
ignored by the system transmitting the replies to prevent forgeries from
occurring. In particular, the "From:" header should be determined from the
user's login name, or some other similar means, rather than from any data
supplied in the user's message.
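
The header filtering described above might be sketched in Python as follows.
The set of header names is limited to the four listed explicitly; a real
system would decide for itself which further "similar" headers to drop. The
function name "sanitize_reply" and the example addresses are hypothetical, and
the trusted "From:" value is assumed to come from the login session.

```python
DANGEROUS = {"from", "sender", "control", "approved"}


def sanitize_reply(message: bytes, trusted_from: str) -> bytes:
    """Drop user-supplied dangerous headers and insert a trusted From: line."""
    head, sep, body = message.partition(b"\n\n")
    kept = []
    skipping = False
    for line in head.split(b"\n"):
        if line[:1] in (b" ", b"\t"):
            # Continuation line: it shares the fate of the header it extends.
            if not skipping:
                kept.append(line)
            continue
        name = line.split(b":", 1)[0].strip().lower().decode("latin-1")
        skipping = name in DANGEROUS
        if not skipping:
            kept.append(line)
    kept.insert(0, b"From: " + trusted_from.encode("latin-1"))
    return b"\n".join(kept) + sep + body


reply = (b"From: root@elsewhere.example\nTo: bob@example.com\n"
         b"Approved: yes\n\tcontinued\nSubject: hi\n\nbody\n")
print(sanitize_reply(reply, "alice@bbs.example").decode("latin-1"))
```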

In most cases, mail messages will contain "To:", "Subject:", "Cc:", "Bcc:"
and "Reply-To:" header lines, and news messages will contain "Newsgroups:",
"Subject:", "Followup-To:", "Keywords:", "Summary:" and "Reply-To:" header
lines. Other optional headers (especially MIME content headers) may also
be present.

The automatic addition of a signature by the transmitting system is
discouraged. Signatures should be added by the user's message composition
software instead, if desired.

A method for allowing replies from more than one person to be stored in the
same packet was considered, but was rejected for security reasons.

FUTURE ENHANCEMENTS

The obvious enhancement that can be made is to support other message formats,
especially FidoNet formats. Currently the message area file code 'q' is
reserved for QWK-format messages. This will be defined in a future version
of this specification if demand warrants.

Experimentation with other formats is encouraged, but please contact the
author first to prevent double-ups from occurring. The author may be
contacted via e-mail at [email protected]



