First Published in PC Magazine June 29, 1993 (Utilities)
PCMCVT strips out all the formatting codes of today's most
popular word processors, leaving you with editable ASCII text. PCMCVT
recognizes and converts the binary files produced by Word for DOS,
Word for Windows, Windows Write, WordPerfect for DOS, WordPerfect for
Windows, and Ami Pro into ASCII files.
What do you do when you receive a document prepared on a word processing
program you don't happen to have? Dozens of formatting codes for every font
change, margin, and mode are embedded in the manuscript, and every one of
them is worse than useless if you have to do any work on the text.
This utility eliminates this hassle. If you ever need to turn text
formatted with today's most popular word processors into straight ASCII,
all you need is PCMCVT. PCMCVT recognizes and converts the binary files
produced by Word for DOS, Word for Windows, Windows Write, WordPerfect for
DOS, WordPerfect for Windows, and Ami Pro into ASCII files.
Keeping all these different word processors on your machine would be
prohibitive both in cost and disk space. Moreover, you'd have to learn
enough of each of their commands to open the files you receive and then
resave them in ASCII format. And you'd have to know which program to use
on the document you received. PCMCVT does all this for you automatically,
making it the perfect answer to word processor overload.
PCMCVT is not a memory-resident program, and it takes up very little
space when it runs. This can be very handy when you want to shell out to
DOS and convert an additional document without having to close the
document on which you're working. Most word processors today provide a
shell-to-DOS capability, and since PCMCVT weighs in at under 10K, it runs
very comfortably in a DOS shell. Some word processors, such as Microsoft
Word for DOS, can accept a DOS command without even requiring you to shell
out to DOS, making it possible to convert a file with PCMCVT without even
seeing a DOS prompt.
If you make any changes in the PCMCVT source code, you'll need
MASM 6.0 or later to reassemble the program.
You should put PCMCVT.COM in a directory on your DOS path. That way
you can then go to any directory and start converting files without having
to move them around. You enter PCMCVT with one or more parameters, from
the DOS prompt. Its complete syntax is
PCMCVT /S source [/D destination] [/Wxx] [/O] [/Tx] [/?]
The only required PCMCVT parameter is /S source, the filename of the
word-processed document you want to convert into ASCII text. Spaces can
be included between the /S and the filename but are not required. The
source filename must be complete, with no wildcards or missing extensions;
the Quick Reference Guide explains how to process a batch of files.
When creating a file for the converted text, PCMCVT uses the same
filename as that of the source file by default, adding only the extension
.TXT. Thus, if you enter
PCMCVT /S MYLETTER.DOC
the source file will remain unchanged and the converted file will be
called MYLETTER.TXT. By adding the optional /D destination parameter
on the command line, you can specify any legal DOS filename (including
a path, if desired) for the converted file.
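The default naming rule can be sketched in a few lines. This is a Python illustration only, not the utility's assembly code; the function name default_destination is hypothetical, and the sketch assumes the source extension is replaced by .TXT, as the MYLETTER.DOC example suggests.

```python
import ntpath  # DOS-style path handling (backslash separators)


def default_destination(source):
    """Return the default output name: same path and base name, .TXT extension."""
    root, _ext = ntpath.splitext(source)
    return root + ".TXT"


print(default_destination("MYLETTER.DOC"))
```

A source name that has no extension simply gains .TXT under this sketch.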
If a same-named file exists in the destination directory, PCMCVT
will warn you and give you the choice to overwrite the file or not.
If you answer No, the easiest way to reenter the command up to the point
of typing in a new destination file name is to hit F3. If you want PCMCVT
to overwrite same-named files automatically and without a warning prompt,
simply include the /O switch parameter on the command line. This can be
useful when you are running PCMCVT from a batch file.
To tell PCMCVT to word-wrap lines at a certain length, you use
the /Wxx parameter, where xx is the maximum number of characters desired.
With a setting of /W60, PCMCVT will force a new line by inserting a
carriage return/line-feed (CRLF) after a maximum of 60 characters.
When converting a line longer than xx characters, PCMCVT tries to find
a convenient space to break the line. Thus in this example, although
the specified length is 60, the line may break at 45, 52, or 59
characters, depending on where the last word space occurred. If a
file contains a long, unbroken character string, PCMCVT has no choice
but to break the line at the specified length.
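The line-breaking rule just described can be sketched as follows. This is a simplified Python illustration, not PCMCVT's actual routine; the name wrap_paragraph is hypothetical, and dropping the space at the break point is an assumption the article does not confirm.

```python
def wrap_paragraph(text, width=78):
    """Split one paragraph into lines of at most `width` characters."""
    lines = []
    while len(text) > width:
        # Find the last space that still leaves the line within `width`.
        cut = text.rfind(" ", 0, width + 1)
        if cut <= 0:
            # No usable space: hard break at the specified length.
            lines.append(text[:width])
            text = text[width:]
        else:
            lines.append(text[:cut])
            text = text[cut + 1:]  # assumption: the break space is dropped
    lines.append(text)
    return lines


for line in wrap_paragraph("aaaa bbbb cccc", 9):
    print(line)
```

A long unbroken string falls through to the hard-break branch, which matches the behavior described for text without spaces.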
If you enter the /W switch parameter without specifying a value
for xx, PCMCVT defaults to a line length of 78 characters and replaces
soft carriage returns in the original document with hard carriage returns.
By default, PCMCVT passes on the tab character (ASCII 9) to the
destination file. This is not always desirable. The /Tx switch replaces
tab characters with the number of spaces you specify as x. Thus, for
example, /T5 replaces all tab characters with 5 spaces, which you might
use to approximate the tab spacing in the original document. If no value
of x is specified, the tab is removed.
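The three tab behaviors can be sketched in Python as shown here. The function and parameter names are hypothetical; None stands for "no /T switch given" and 0 stands for "/T with no value."

```python
def convert_tabs(text, tab_spaces=None):
    """Pass tabs through, expand them to spaces, or remove them."""
    if tab_spaces is None:
        # No /T switch: the tab character (ASCII 9) is passed on unchanged.
        return text
    # /Tx: each tab becomes x spaces; x == 0 removes the tab entirely.
    return text.replace("\t", " " * tab_spaces)


print(repr(convert_tabs("Name\tPhone", 5)))
```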
If you forget the syntax details, you can get a listing of the
PCMCVT options by entering the command with no parameter or with
the /? switch.
PCMCVT can create two kinds of ASCII files. The most common
(and the default) format is called ``one-line-per-paragraph.'' In this
format, everything up to a hard carriage return is considered part of a
single line, no matter how many lines it may occupy on the screen. The
other format inserts hard carriage returns and line-feeds at the end of
each screen line, whose length is specified by the /Wxx word-wrap setting.
Which format is more appropriate in a given situation depends on
how the file will eventually be used. If you wish to import the
converted text into a word processor, the one-line-per-paragraph format
makes the most sense, as it leaves the word processor free to impose its
own margin settings and reformatting on the text.
Fixed-length lines delimited by a carriage return/line-feed, on the
other hand, are perfect for text files that are to be typed or browsed
from a DOS prompt. README files are a good example. With the trend in
shareware toward using the Windows Write format for instructions, you
could preview a newly downloaded file without having to fire up Windows.
In addition to the text that you want and the formatting codes that
you don't, word-processed files contain information that uniquely
identifies the word processor that created them. For this reason, during
the conversion process PCMCVT is able to display the name of the
originating word processing system. Identifying the native document
format also enables PCMCVT to use the appropriate routines to extract
the text from the file.
Among the document formats PCMCVT supports--the DOS or Windows
versions of WordPerfect 5.1 or later; Word for DOS, Version 5.0 or
later; Word for Windows (WinWord), Version 1.0 or later; Windows
Write 3.x; and Ami Pro--the three from Microsoft are very similar,
so I'll discuss them first.
Since its introduction in 1983, the file format for Microsoft Word
has remained unchanged. Microsoft had the foresight at that time to
define structures larger than were immediately needed so that new
features could be added without changing the basic format. A Word
document file consists of three sections: the header, or file
information block; the text section; and the formatting section.
This basic file structure is shared by the three Microsoft file
formats with but slight differences, which I'll discuss as they arise.
PCMCVT is concerned only with the first two sections of the
Microsoft document format, that is, the FIB or file information block,
and the text section. In identifying which word processor was used to
create the original file, however, PCMCVT also makes use of a second
identifying characteristic, called the signature. A file signature is
usually a unique byte sequence at a specific position in the file that
sets it apart from other files. In the Microsoft document files, for
example, the signature is contained in the first 2 bytes, as shown below:
File format          Signature
Word 5.5             31BEh
WinWord 1.x          9BA5h
WinWord 2.0          DBA5h
Windows Write 3.1    31BEh
From the signatures alone we can tell WinWord and Word for DOS apart,
but Windows Write presents a problem, as it shares the same signature
with Word for DOS. This is where the reserved areas of the FIB come
into play. Starting from the beginning of the file, if we look at the
value of the 2-byte word located at offset 96, we'll get our answer.
If the value is 0, then the file is written in Word for DOS; if it has
any other value, it's Windows Write. With the type of file nailed down,
we can move on to figuring out the size of the text portion.
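As a rough Python sketch (not the utility's assembly code), the two-step identification might look like this. The function name is hypothetical, and the sketch assumes the signature values in the table are listed in the order the bytes appear in the file and that the word at offset 96 is stored low byte first.

```python
import struct


def identify_microsoft(header):
    """Return a format name for a Microsoft file header, or None."""
    signature = header[:2]
    if signature == bytes.fromhex("9BA5"):
        return "WinWord 1.x"
    if signature == bytes.fromhex("DBA5"):
        return "WinWord 2.0"
    if signature == bytes.fromhex("31BE"):
        # Shared signature: the 2-byte word at offset 96 settles it.
        (word96,) = struct.unpack_from("<H", header, 96)
        return "Word for DOS" if word96 == 0 else "Windows Write"
    return None


sample = bytearray(128)
sample[0:2] = bytes.fromhex("31BE")
print(identify_microsoft(bytes(sample)))
```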
The text follows the file information block. In both Word for DOS
and Windows Write the text starts 128 bytes from the beginning of the
file. The starting offset of the text in WinWord is specified by a word
stored at offset 24 from the start of the file. Now that we know the file
type and the start of the text, with one more piece of information and a
bit of math, we can move on to read the text.
The actual length of the text plus the header size is stored as a
DWord (double word) at offset 28 from the start of the file. By
subtracting the header size (128 for Word and Write, 384 for WinWord),
we get the length of the text to be converted.
Of the formats handled by PCMCVT, Microsoft's are the easiest to
convert. Once the header information is gathered and the text length is
determined, what is left to be read is essentially pure ASCII text.
Unlike the other formats, only a few characters in the text demand
special treatment during conversion.
In Windows Write, Word for DOS, and WinWord, the pair of ASCII
character codes 13 and 10 (carriage return and line-feed) are used for
paragraph ends; character code 12 is used for explicit page breaks; and
character code 9 is used for tab characters. PCMCVT keeps all these
characters unchanged in the converted text unless, of course, the /Tx
switch is specified to expand tabs into spaces.
Several other special codes are used in Word for DOS and WinWord.
Hard line breaks are stored as character 11, which PCMCVT replaces with
a carriage return/line-feed pair. PCMCVT replaces ASCII 15, the
character that Word uses for the em-hyphen, or em-dash, with two normal
hyphens. Word for DOS defines ASCII 196 as a nonbreaking hyphen and 255
as a nonbreaking space; WinWord uses ASCII 30 for a nonbreaking hyphen
and ASCII 160 for a nonbreaking space. So in these files, depending on
the original format of the file--WinWord or Word--PCMCVT replaces the
30 or 196 with a character 45 (hyphen), and 160 or 255 with a 32
(space). The utility ignores all other characters with a value of ASCII
31 or less.
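The character rules for the Microsoft formats can be summarized in a small Python sketch. This is an illustration of the rules as stated above, not the utility's code; the function name and the winword flag are hypothetical.

```python
def translate_ms_byte(b, winword):
    """Return the output bytes for one input byte from a Word/WinWord file."""
    nb_hyphen, nb_space = (30, 160) if winword else (196, 255)
    if b == 11:                 # hard line break
        return b"\r\n"
    if b == 15:                 # em-dash becomes two hyphens
        return b"--"
    if b == nb_hyphen:          # nonbreaking hyphen
        return b"-"
    if b == nb_space:           # nonbreaking space
        return b" "
    if b in (9, 10, 12, 13):    # tab, line-feed, page break, carriage return
        return bytes([b])
    if b < 32:                  # every other control character is dropped
        return b""
    return bytes([b])


def translate_ms_text(data, winword):
    return b"".join(translate_ms_byte(b, winword) for b in data)


print(translate_ms_text(b"one\x0btwo\x0fthree", winword=True))
```

Tab expansion under the /Tx switch would be applied as a separate step and is left out here.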
WordPerfect file formats are essentially the same in Versions 5.1
for DOS and in 5.1 and 5.2 for Windows. The format goes back only to
WordPerfect 5.0, which represented a considerable departure from that
of the earlier 4.2 version.
WordPerfect 5.x files contain a 16-byte file header, called the
prefix area, that identifies the file and provides certain basic file
information. The first 4 bytes consist of the hexadecimal value 0FFh
and the three letters WPC, which comprise the signature. Although
PCMCVT uses only the signature and the next 4 bytes (which point to
the start of the text), the prefix area also contains information
about the type of document, the version of WordPerfect that created
the file, and so forth.
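Reading those first 8 bytes might be sketched in Python like this. The function name is hypothetical, and the sketch assumes the 4-byte pointer is an unsigned value stored low byte first, as is usual for files written on Intel-based PCs.

```python
import struct

WP_SIGNATURE = b"\xffWPC"  # 0FFh followed by the letters W, P, C


def read_wp_prefix(header):
    """Return the offset of the text area, or None if not a WordPerfect 5.x file."""
    if header[:4] != WP_SIGNATURE:
        return None
    (text_start,) = struct.unpack_from("<L", header, 4)
    return text_start


sample = WP_SIGNATURE + struct.pack("<L", 500) + bytes(8)
print(read_wp_prefix(sample))
```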
After the header, WordPerfect document files are not segmented
in the manner of Microsoft files. Except for some initial formatting
information stored in an additional prefix area immediately after the
16-byte header, all formatting information is embedded in the document
area itself. Having the formatting codes and data in the text area
makes the conversion process somewhat more difficult.
WordPerfect uses both single and multibyte codes to identify
functions. Single-byte codes employ some of the same control
characters (ASCII values of less than 32) used in Word files.
In WordPerfect, such control characters are used to signal carriage
returns, page feeds, page numbers, and special merge codes. Single-byte
codes in the hexadecimal range from 80h to 0BFh (128 to 191 decimal) are
used to turn such functions as justification, columns, and footnotes on
and off. PCMCVT traps the single-byte codes for hard hyphens and hard
spaces. Hard hyphens occur in three types (in-line hyphens, end-of-line
hyphens, and end-of-page hyphens) and use codes 0A9h, 0AAh, and 0ABh.
Hard spaces use the code 0A0h. PCMCVT looks for these codes and replaces
them with the standard hyphen (ASCII 45) and space (ASCII 32) characters.
Except for hard hyphens, hard spaces, carriage returns, and page feeds,
PCMCVT strips out all the rest of the single-byte code characters when
it converts a file.
Multibyte functions are where the real fun starts. There are over
50 pages of documentation on the multibyte functions in the WordPerfect
developers guide! These functions are of two kinds. Fixed-length
functions use the range from hexadecimal 0C0h to 0CFh (192 to 207
decimal), and variable-length functions lie in the range of hexadecimal
0D0h to 0FFh (208 to 255 decimal).
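The code ranges described so far can be collected into one small Python sketch. The category labels and the function name are hypothetical; only the numeric ranges come from the text above.

```python
def classify_wp_byte(b):
    """Name the category a WordPerfect 5.x byte falls into."""
    if b < 0x20:
        return "control character"
    if b < 0x80:
        return "text"
    if b < 0xC0:
        return "single-byte function"      # 80h to 0BFh
    if b < 0xD0:
        return "fixed-length function"     # 0C0h to 0CFh
    return "variable-length function"      # 0D0h to 0FFh


for value in (0x41, 0xA9, 0xC1, 0xD0):
    print(hex(value), classify_wp_byte(value))
```

A converter built on this split would dispatch on the category first and only then look at individual codes such as 0A0h or 0C1h.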
The fixed-length functions begin with a code (for example, 0C1h for
tab-align), which is followed by a series of bytes of data. The end of
the function is signaled by a repetition of the initial code. PCMCVT
recognizes only one of these functions, 0C1h, the tab-align code. While
most word processors use the tab character (ASCII 9), WordPerfect signals
a tab with the tab-align function. The function contains a byte that
distinguishes between standard tabs and left, right, or center
alignment, depending on which bits are set. PCMCVT retrieves the byte
and, if the code is for a tab, replaces the function with either an
ASCII 9 character or the number of spaces given with the /T parameter.
In all other functions, PCMCVT ignores everything between the two
appearances of WordPerfect's multibyte fixed-length codes.
Variable-length multibyte functions have a slightly different
format. The actual size of the function is stored in the codes.
These functions start with a 1-byte function code followed by a
1-byte subfunction code. This combination ensures over 12,000 possible
variations, giving WordPerfect a lot of room for expansion.
Immediately following the subfunction code is a word (2 bytes) that
contains the length of the function. The value represents the size of
the whole function, including the beginning and the ending codes. After
the length word comes the data for the function. Multibyte functions can
cover a wide range of different features, such as margin commands, font
changes, printer codes, and graphics information.
At the end of the code sequence, the first section repeats itself,
only in reverse order. That is to say, the size word of the function
is duplicated, followed by the subfunction code, and finally the function
code. In its conversions, PCMCVT must ignore (that is, strip out) all
the multibyte functions. Instead of examining each byte in search of
the ending code, however, PCMCVT figures out the length from the size
word and jumps past the function, saving the time it might take to
scan through possibly hundreds of bytes.
Compared with the Microsoft and WordPerfect products, Ami Pro
presents an unusual format: Its files consist almost completely of
ASCII text. Like its competitors, Ami Pro does have a header to store
information, followed by the document area. Like WordPerfect, Ami Pro
also embeds formatting commands in the text area. However, the
difference is that the formatting codes are set off by certain low ASCII
characters such as < and @, rather than high-bit ASCII characters.
The Ami Pro signature used by PCMCVT is the text string [ver] at
the very start of the file. Following the signature comes line after
line of information about the file--printer last used, font information,
title of file, and so forth. PCMCVT skips through all this information
and searches straightaway for the Ami Pro start-of-document text string,
[edoc].
The text that follows the [edoc] string stores formatting
information both at the beginning and in the middle of the text lines.
Simple formatting information, such as bold on and off, is signaled
with <> signs surrounding a code. For example, if the single word bold
were supposed to appear in boldface, the code below would be used:
The next word is <+!>bold<-!>
Such formatting is humanly readable. The code for boldface is the
exclamation point. It is turned on with a plus sign and turned off with
a minus sign.
To distinguish between formatting codes and characters that are
part of the user's text, Ami Pro usually places a less-than sign (<)
before the character. For example, if the text contains a less-than
sign, Ami Pro prefixes it with another less-than sign. So if Ami Pro
(or PCMCVT, for that matter) sees two less-than signs (<<) in a row,
it ignores one and prints the other as a regular character. To signify
a greater-than sign, Ami Pro uses the three characters less-than,
semicolon, and greater-than (<;>).
Ami Pro uses a similar coding format for many features, even page
breaks. Although in most word processors hard page breaks are usually
an ASCII character 12, Ami Pro puts the code <:P> at the beginning of
the first line of a new page instead. About the only ASCII control codes
that Ami Pro uses are the carriage return/line-feeds and tab characters
(ASCII 9). And again, as with all but the tab, carriage return/line-feed,
and page feeds, PCMCVT reads the Ami Pro codes and skips over them,
extracting only the actual text of the file.
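The angle-bracket rules can be illustrated with a short Python sketch. It is a simplified model and not the utility's routine: the function name is hypothetical, codes introduced by @ are not handled, and <:P> is simply dropped along with every other bracketed code.

```python
def strip_amipro_codes(text):
    """Remove <...> formatting codes, restoring escaped < and > characters."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "<":
            out.append(text[i])
            i += 1
        elif text.startswith("<<", i):
            out.append("<")          # doubled sign is a literal less-than
            i += 2
        else:
            end = text.find(">", i)
            if end == -1:
                out.append(text[i:])  # unterminated code: keep the rest as is
                break
            if text[i + 1:end] == ";":
                out.append(">")      # <;> is a literal greater-than
            i = end + 1              # any other code is skipped
    return "".join(out)


print(strip_amipro_codes("The next word is <+!>bold<-!>"))
```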
HOW PCMCVT WORKS
PCMCVT was written in assembly language, using Microsoft Macro
Assembler 6.0. (Version 6.1 would work as well.) A modular programming
approach was implemented for each conversion routine, so each conversion
algorithm could be designed and tested independently. What follows is a
broad overview of how PCMCVT works; add the complete commented assembly
code (available on PC MagNet) and you'll have everything you need to
understand the PCMCVT utility.
The main module of the program contains the routines for file
handling, identifying headers, parsing the command line, word wrapping,
and other support functions.
Program execution starts by revectoring INT 24h. This allows PCMCVT
to handle critical errors itself rather than by means of the classic
``Abort, Retry, Ignore'' message. The HookInt24h and UnHookInt24h
functions simply intercept critical errors and send program execution
back to the calling program. Because the carry flag is set on the
return, PCMCVT sees the error and can process it as needed.
As a command line program, PCMCVT needs to read and parse the
parameters the user passes to it. It thus reads offset 80h of the
Program Segment Prefix (PSP), which contains the length in bytes of the
command tail, that is, the text typed after the program name. If the
length is zero, PCMCVT simply displays a help screen and terminates.
If it finds any user-supplied parameters, it calls the
ParseCommandLine function. In order to simplify the parsing process,
PCMCVT first capitalizes the incoming command line string. The parsing
function steps through each parameter, extracts the filenames and
switches, and then stores them for later use.
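A Python sketch of this kind of parsing is shown here for illustration; it is not a translation of ParseCommandLine. The function name and the dictionary keys are hypothetical, and splitting on the slash character assumes that DOS paths in the filenames use backslashes.

```python
def parse_command_line(text):
    """Parse PCMCVT-style switches into a dictionary of options."""
    opts = {"source": None, "dest": None, "wrap": None,
            "tab": None, "overwrite": False, "help": False}
    text = text.strip().upper()      # capitalize first to simplify matching
    if not text:
        opts["help"] = True          # no parameters: show the help screen
        return opts
    for piece in text.split("/")[1:]:
        piece = piece.strip()
        if not piece:
            continue
        letter, arg = piece[0], piece[1:].strip()
        if letter == "S":
            opts["source"] = arg     # space after /S is optional
        elif letter == "D":
            opts["dest"] = arg
        elif letter == "W":
            opts["wrap"] = int(arg) if arg.isdigit() else 78
        elif letter == "T":
            opts["tab"] = int(arg) if arg.isdigit() else 0
        elif letter == "O":
            opts["overwrite"] = True
        elif letter == "?":
            opts["help"] = True
    return opts


print(parse_command_line("/S myletter.doc /W60 /O"))
```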
The next step is to identify the source file's word processing
format by examining the header and signature. PCMCVT opens the source
file and reads a 256-byte chunk of the file into the HeaderBuffer. This
is more than sufficient to identify the five formats supported here, and
it leaves room for expansion later. The function HeadCheck steps through
the beginning of the HeaderBuffer, looking first for a WordPerfect
signature, then for the signature of Word for DOS, WinWord, and finally
Ami Pro. The function StrCmp is called with pointers to the HeaderBuffer
looking for matches. If no matching signatures are found, PCMCVT displays
the message ``Sorry, unknown format'' and exits.
The HeadCheck function returns a value that corresponds to the file
type found. This determines which conversion routine will be called.
If the value is a valid format that PCMCVT can handle, the destination
file is opened and execution moves on to the appropriate conversion
routine. The code shown (below) in Figure 1 is the signature-checking
routine that HeadCheck uses for WordPerfect. The routine is duplicated
in PCMCVT, using the unique signature for each file type.
CALLING CONVERSION ROUTINES
The PCMCVT conversion routines are essentially integral units that
call external routines for file support and that use global data. The
way text is extracted from its native file varies with the specific word
processor. However, the conversion routines use similar steps, starting
with finding the beginning of the text in the file.
Each module begins by getting (or calculating) the start of the text
section, to which the file pointer is then reset. (The file pointer needs
to be reset because it was left 256 bytes from the start by the
header-reading procedure.) Once the pointer is reset, the first block of
text is read into a 2,024-byte (2K) input buffer. This buffer size was
chosen as a compromise between the demands of adequate workspace and
the danger of hogging too much memory. If the file is less than 2K,
the buffer size is adjusted accordingly. If nothing is left to read
from the file, the routine will exit with an error. The file read
routine is self-contained, handling errors and monitoring the amount
of text input by itself.
The actual conversion process consists of a byte-by-byte comparison
between the source file characters and the codes and special characters
that PCMCVT recognizes. PCMCVT performs a trio of tests--Use it,
Replace it, Ignore it--on each successive character (or series of
characters, in the case of a multibyte function). For each file format,
the decision may be different. In Microsoft Word, for example, the
carriage return/line-feed (CRLF) characters are passed on to the
destination (output) file. However, in Ami Pro, because each line
is CRLF delimited, PCMCVT needs to check whether there is one or more
than one CRLF. If there is only one, it is ignored. If there is more
than one CRLF, the first is ignored but each additional one is passed
on to the output.
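The Ami Pro rule amounts to shortening every run of CRLF pairs by one. A minimal Python sketch, with a hypothetical function name, could look like this. It assumes nothing is inserted where a single CRLF is removed, so a line that should be followed by a space must already end with one.

```python
import re


def collapse_amipro_line_ends(text):
    """Turn each run of n CRLF pairs into n - 1 pairs."""
    return re.sub(r"(?:\r\n)+", lambda m: m.group(0)[2:], text)


sample = "para one \r\nstill para one\r\n\r\npara two"
print(repr(collapse_amipro_line_ends(sample)))
```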
All the conversion modules handle tabs in the same manner. When a
tab character (ASCII 9) is found, the DoTabs routine is called. DoTabs,
a simple routine as shown (below) in Figure 2, determines whether PCMCVT
is to pass the tab character on or replace it with spaces. If the /Tx
switch is in effect, DoTabs writes the correct number of spaces to the
output file. Otherwise it simply writes out a tab character.
To write data, whether characters or control codes, PCMCVT calls
the WriteIt function. WriteIt is also self-contained. WriteIt accepts
a character in AL and puts it into the output buffer. The output buffer,
also 2,024 bytes (2K) long, is used exclusively by WriteIt. WriteIt
monitors the current character position and, when the buffer is full,
writes the buffer's contents to the destination file. WriteIt can be
forced to write (or flush) its buffer by calling it with a -1 in AL.
The buffer is flushed when PCMCVT gets to the end of the source file.
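The buffering scheme can be modeled in Python as follows. This is a sketch of the idea rather than of the WriteIt code itself: the class and attribute names are hypothetical, and the flush counter exists only so the behavior can be observed.

```python
import io


class OutputBuffer:
    """Collect bytes and write them to the destination one buffer at a time."""
    FLUSH = -1  # sentinel value that forces the buffer to be written

    def __init__(self, dest, size=2024):
        self.dest = dest
        self.size = size
        self.buf = bytearray()
        self.flushes = 0

    def write_it(self, value):
        if value == self.FLUSH:
            self._flush()
            return
        self.buf.append(value)
        if len(self.buf) >= self.size:   # buffer full: write it out
            self._flush()

    def _flush(self):
        if self.buf:
            self.dest.write(bytes(self.buf))
            self.buf.clear()
            self.flushes += 1


sink = io.BytesIO()
out = OutputBuffer(sink, size=4)
for byte in b"abcdefghij":
    out.write_it(byte)
out.write_it(OutputBuffer.FLUSH)         # end of source file
print(sink.getvalue(), out.flushes)
```

Keeping the position counter and the disk write inside one routine means the conversion code never needs to know how full the buffer is.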
PCMCVT reads the source file in sections. When the conversion
routine comes to the end of the input buffer, a call is made to read
in another piece of the file. The file-reading routine preserves all
registers and updates the input buffer pointer after the file is read.
The input buffer pointer points at the current character the conversion routine
is to consider. Thus, when the file-reading routine returns to the
conversion module, execution can pick up exactly where it left off,
except that now the input buffer has a new block of text to convert.
This is important in cases like that of the WordPerfect conversion
module, where the input buffer can run out of characters while in the
middle of a multibyte-function-scanning loop.
If the /Wxx option was specified, the self-contained word-wrap
module is called when WriteIt writes the buffer to disk. Working with
the output buffer, the word-wrap routines scan the text in a two-step
process. After saving the current pointer into the output buffer, the
first scan looks for carriage return/line-feeds, which signal the end
of a string of characters. While scanning, the word-wrap routine counts
the characters at which it looks. When it finds a CRLF, or if it comes
to the end of the buffer, the number of characters is compared with the
value specified on the command line. If a CRLF was found in fewer
characters than the specified value, the line is printed as is and the
scanning starts over again. But if the CRLF is found beyond the desired
line length (/Wxx value), the second scan is performed.
The second scan, in Figure 3, actually consists of a series of small
scans. The current string or line is scanned for a space (ASCII 32) or
a tab (ASCII 9), which signals a break between two words. Each time a
space is found, its position in relation to the start of the current
string is compared with the specified line length value. If the space's
position is less than the line length, the position value is saved and
the string is further scanned. This continues until either a space is
found with its position greater than the requested length of the line,
or the end of the buffer is encountered.
Several things happen when the scanning process finds a space in
the string at a position greater than the line length value. First, the
value of the starting position of the current scan is retrieved. The
starting value and the last space position are subtracted, yielding the
number of characters in the string. This number of characters is used
for the length of the line when the string is written to the output
file.
Once the string of characters is written, the program writes a
CRLF combination (ASCII 13 and 10) to the destination file, completing
the line. When the CRLF writing is done, the program clears the current
character counter and stores the last space's position in the starting
position variable, and then the scanning process starts all over again.
A slight variation on the previous procedure is required when the
routine runs out of characters at the end of the output buffer before
either a CRLF or the specified line length is reached. In this case,
the current line is written to the destination file without a CRLF.
The length is stored and the word-wrap procedure returns to the main
program. Since the files are read and written in small sections, the
next time the word-wrap function is called, it picks up where it left
off and maintains the correct line length throughout the conversion.
When PCMCVT finishes converting the file, it closes all files and
returns to the command prompt. The utility also returns errorlevel
codes for use in batch files. If the conversion was successful, the
return code is a 0. The other errorlevel exit codes are shown below:
Errorlevel code   Meaning
1                 Source file not found
2                 Destination existed, not overwritten
3                 Drive not ready
4                 Disk full
PCMCVT is designed for quick-and-dirty text conversion. Given the
complexity of today's word processors, you may occasionally run into a
file that either will not convert, or will convert very oddly. One
format that does not convert well with PCMCVT, for example, is WinWord's
so-called fast save format. This format keeps text in the original text
area, but it also incrementally saves text in the area after the
formatting area. While you're unlikely to receive a WinWord document
you need to convert that has not been saved in the regular manner, if
you hit one saved in the fast save format you'll have to consult
Other casualties of stripping out formatting codes are headers and
footers. The loss of formatting on outlines or tables could render the
text unreadable. Fortunately, PCMCVT will prove entirely adequate for
the vast majority of your conversion needs.
JAY MUNRO IS A FREQUENT CONTRIBUTOR TO PC MAGAZINE. HE CAN BE REACHED
ON PC MAGNET (72241,554) IN THE UTILITIES/TIPS FORUM (GO ZNT:TIPS),
THE TECHNICAL SUPPORT AREA FOR PC MAGAZINE UTILITIES.
The Signature-Checking Routine
        Invoke  StrCmp, Addr HeaderBuffer, Addr WPerfID, 4
        Or      AX,AX               ;StrCmp = 0 if a match
        Jnz     @F                  ;no match
        Mov     AX,1                ;1 = Word Perf
        Lea     DX,WPerfName        ;DX contains address
                                    ; source file name
@@:                                 ;next comparison
Figure 1: The HeadCheck function uses this routine to determine if the word processing format of a file is one that PCMCVT supports.
The DoTabs Procedure
DoTabs  Proc Near Uses CX           ;expands tabs
        Mov     CX, TabWidth        ;get tab size
        Jcxz    TabDone             ;nothing to write if size is 0
TabLoop:
        Mov     AL,32               ;put space in AL
        Call    WriteIt             ;write it to dest file
        Loop    TabLoop             ;go back for more
TabDone:
        Ret
DoTabs  EndP
Figure 2: This procedure determines whether PCMCVT is to pass the tab
character on or replace it with spaces. If the /Tx switch is in effect,
DoTabs writes the correct number of spaces to the output file; otherwise
it simply writes out a tab character.
How Word Wrap Detects Spaces
ScanLoop:
        Mov     AL, 32              ;load AL with space character
        RepNe   ScaSb               ;scan for that first space
        Mov     AX, BX              ;get total length (BX)
        Sub     AX, CX              ;see difference
        Mov     CurLine,AX          ;save current length
        Add     AX, LastLen         ;add in the last length
        Cmp     AX, DX              ;check length against maximum (DX)
        Jg      Decision2Make       ;jump if line is longer than width
        Mov     AX,CurLine          ;get current line length back
        Mov     LastSpc,AX          ;save it as last space found
        Jmp     ScanLoop            ;go back for more
Figure 3: The word-wrap routines (invoked when you use the /Wxx option)
scan the text in a two-step process. This is the code for the second scan;
it actually consists of a series of small scans.