Dec 12, 2017
 
Cleanse 'dirty' text files to make clean ASCII files, new options.
File CLEANS12.ZIP from The Programmer’s Corner in
Category Word Processors
File Name      File Size   Zip Size   Zip Type
CLEANSE.COM    2816        1813       deflated
CLEANSE.DOC    9960        3737       deflated


Contents of the CLEANSE.DOC file




CLEANSE
Version 1.2
(c) 1987, 1990

William C. Parke
1812 S St. NW
Washington, D.C. 20009

'Cleanse a dirty text file to make a clean ASCII file'

Description:

The program CLEANSE is a utility for operation as a
command under the MS-DOS operating system, version 2 or
higher. It is designed to remove extraneous characters
from text files and make them into ordinary ASCII text
files. Any non-standard end-of-line terminators will
be replaced by a proper carriage-return linefeed
combination. The cleansed file can then be used by
standard ASCII text editors or printed on standard line
printers. Several optional switches are available to
remove extra spaces, remove line indentations, preserve
tabs, preserve form feeds, cut long lines, and
translate the IBM extended character set to ordinary
characters.

Distribution:

CLEANSE may be freely distributed provided that it
remains unaltered and is not packaged with or in
promotion of any commercial product or venture, except
by direct agreement with the author.

Syntax: CLEANSE file1 file2 [/switch]

File1 is any text file which may have improper end-of-
line termination or extraneous control codes. CLEANSE
creates file2 to contain the processed text. If file2
already exists, the user will be asked if file2 is to
be overwritten. File2 cannot have the same name as
file1. A DOS path name may be used to preface either
file name. The following command line switches are
allowed:

/c# = Cut long lines after column # (1 to 254)
/d# = Divide line between words after column #
/ri = Remove Indentation of lines
/rs = Remove extra Space between words
/st = Save Tab characters
/sf = Save Form Feeds
/te = Translate IBM extended characters
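
The switch syntax above can be sketched as a small parser. This is
only an illustration in Python, not the author's code: the function
name parse_command_tail and the returned data layout are assumptions.
It relies on the documented rules that switches start with a slash,
may appear anywhere after the command, and need no space separators.

```python
import re

FLAG_SWITCHES = ("ri", "rs", "st", "sf", "te")

def parse_command_tail(tail: str):
    """Split a command tail into file names and switches (hypothetical helper)."""
    files, switches = [], {}
    # A token is either "/..." (a switch) or a run of non-slash, non-space text.
    for token in re.findall(r"/[^/\s]*|[^/\s]+", tail):
        if not token.startswith("/"):
            files.append(token)
            continue
        name = token[1:].lower()
        numbered = re.fullmatch(r"([cd])([0-9]+)", name)
        if name in FLAG_SWITCHES:
            switches[name] = True
        elif numbered and 1 <= int(numbered.group(2)) <= 254:
            switches[numbered.group(1)] = int(numbered.group(2))
        else:
            raise ValueError("Invalid switch given.")
    return files, switches
```

For instance, the tail "EXTRA.TXT ORDIN.DOC /D72 /TE" yields the two
file names plus the switches d=72 and te.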

Operation:

CLEANSE removes any extraneous control codes, converts
end-of-line termination to the standard ASCII
carriage-return linefeed pair, drops the parity bit on
each character, and removes delete characters (7F hex).
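
The default pass might look roughly like the following Python sketch.
It is not the program's actual code. The name basic_cleanse is
hypothetical, and treating a lone carriage return or a lone linefeed
as an end of line is an assumption, since the document does not list
the non-standard terminators it recognizes. Tabs and form feeds are
passed through here because separate switches govern them.

```python
import re

def basic_cleanse(data: bytes) -> bytes:
    """Sketch of the default cleansing pass (assumed behavior, see lead-in)."""
    # Drop the parity (eighth) bit of every byte.
    stripped = bytes(b & 0x7F for b in data)
    # Assumption: CR LF, lone CR, and lone LF each mark one end of line.
    normalized = re.sub(rb"\r\n|\r|\n", b"\r\n", stripped)
    # Keep printable ASCII plus CR, LF, tab, and form feed;
    # everything else, including DEL (7F hex), is removed.
    keep = {0x09, 0x0A, 0x0C, 0x0D}
    return bytes(
        b for b in normalized
        if (0x20 <= b < 0x7F) or b in keep
    )
```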


Using the optional switches:

The /c# Switch:

The /c# switch forces CLEANSE to cut lines which extend
beyond column number #, a number between 1 and 254.
The cut will be made irrespective of the location of
word boundaries.
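
A minimal Python sketch of the cut follows. The name cut_line is
hypothetical, and the sketch assumes that the text beyond the cut
column continues on a new line rather than being discarded; the
document does not say which.

```python
def cut_line(line: str, col: int) -> list[str]:
    """Cut a line into pieces of at most `col` characters (assumed behavior)."""
    if not 1 <= col <= 254:
        raise ValueError("column must be between 1 and 254")
    # Slice at fixed widths, ignoring word boundaries.
    pieces = [line[i:i + col] for i in range(0, len(line), col)]
    return pieces or [""]  # an empty line stays a single empty line
```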

The /d# Switch:

By default, CLEANSE attempts to divide long lines. If a
space is found between words beyond column 122, the
line will be divided at that space. The /d# switch
changes the dividing column from the default of 122 to
#, a number between 1 and 254.
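
The dividing rule can be sketched in Python as below. The name
divide_line is hypothetical. The sketch assumes the line is divided
at the first space found beyond the given column and that this space
is dropped; a line with no such space is left undivided.

```python
def divide_line(line: str, col: int = 122) -> list[str]:
    """Divide a long line at spaces beyond column `col` (assumed behavior)."""
    pieces = []
    while len(line) > col:
        # Columns are counted from 1, so index `col` is the first
        # character beyond column `col`.
        cut = line.find(" ", col)
        if cut == -1:
            break  # no space beyond the column: leave the rest as is
        pieces.append(line[:cut])
        line = line[cut + 1:]  # drop the space used as the dividing point
    pieces.append(line)
    return pieces
```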

The /ri Switch:

By default, indented lines are left indented. With the
/ri switch in force, lines indented with spaces or tabs
will have their indentation removed.
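
In Python terms, the effect of /ri on a single line amounts to
stripping leading spaces and tabs. The function name below is
hypothetical.

```python
def remove_indentation(line: str) -> str:
    """Strip leading spaces and tabs; the rest of the line is untouched."""
    return line.lstrip(" \t")
```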

The /rs Switch:

ASCII files are often created by word processors which
add extra space between words in order to justify the
right margin of text. The /rs switch option will let
CLEANSE remove extra spaces in the text. Two spaces
will be saved between sentences even with the /rs
switch in force.
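
One way to express this rule is the Python sketch below. The name
remove_extra_spaces is hypothetical, and two points are assumptions:
a sentence is taken to end with a period, exclamation mark, or
question mark, and leading indentation is left alone because /ri
covers it.

```python
import re

def remove_extra_spaces(line: str) -> str:
    """Collapse runs of spaces, keeping two after a sentence end (assumed rule)."""
    body = line.lstrip(" ")
    indent = line[:len(line) - len(body)]  # indentation is preserved as found

    def shrink(match: re.Match) -> str:
        # body has no leading space, so a character always precedes the run.
        before = body[match.start() - 1]
        return "  " if before in ".!?" else " "

    return indent + re.sub(r" {2,}", shrink, body)
```

For example, "The  quick   fox.   It  ran." becomes
"The quick fox.  It ran." under these assumptions.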

The /st Switch:

By default, CLEANSE expands tab characters into space
characters, taking the tab stops to be in columns
1, 9, 17, 25, 33, 41, etc. (i.e., separated by 8
columns). These positions are the assumed tab stops for
an ASCII file in MS-DOS. With the /st switch in force,
CLEANSE will preserve tab characters found in the text.
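
The default tab expansion can be illustrated with a short Python
sketch. The name expand_tabs is hypothetical; the 8-column tab stops
come from the description above.

```python
def expand_tabs(line: str, tab_width: int = 8) -> str:
    """Replace each tab with spaces up to the next 8-column tab stop."""
    out = []
    col = 0  # zero-based, so tab stops fall at 0, 8, 16, ... (columns 1, 9, 17)
    for ch in line:
        if ch == "\t":
            pad = tab_width - (col % tab_width)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

A tab after a single character is replaced by seven spaces, so the
next character lands in column 9.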

The /sf Switch:

CLEANSE will normally remove all control codes in a
text file except carriage returns and linefeeds. Form
feeds are converted into a pair of carriage-return
linefeeds. However, with the /sf switch on the command
line, the form feed character (0C in hex) will be saved
in the converted ASCII file. This may be desirable if
the converted file is to be sent to a printer and you
wish to keep the same page breaks.
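
As a rough Python illustration of the two behaviors (the function
name and its parameter are hypothetical):

```python
def handle_form_feeds(data: bytes, save_form_feeds: bool = False) -> bytes:
    """Keep form feeds (as with /sf) or turn each into two CR LF pairs."""
    if save_form_feeds:
        return data
    return data.replace(b"\x0c", b"\r\n\r\n")
```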

The /te Switch:

Some text files are created containing characters from
the extended ASCII set defined by IBM. These include
accented characters and graphic symbols. They can be
seen on an IBM PC or compatible by holding the ALT key
and typing three digits on the keypad in the range from
128 to 255. By default, CLEANSE converts these
characters into normal ASCII characters and control
codes, but without regard to their appearance. Using
the /te switch will invoke a translation of the
extended characters into a close equivalent in the
normal ASCII set. For example, an accented e is
translated into an ordinary e.

This switch will be useful for converting files
containing extended characters into text files which
can be sent to printers which do not support the
IBM extended set. In addition, the converted text can
be manipulated within editors which do not support the
IBM extended character set. The converted file may also
be sent through computer networks which preserve only
seven bits in each byte sent, such as those with even
parity settings. Since extended characters use the
eighth bit, they will not pass through such networks
without prior coding.

The /te switch should not be used for text files
produced by word processors which use the eighth bit of
each byte for internal format coding (e.g. WordStar).
In this case, the coded bytes have no relation to the
IBM extended character set.

Example commands:

CLEANSE

will show a short help screen to remind the user of the command syntax.

CLEANSE FOO.TXT FOO.DOC

converts the text file called FOO.TXT into the ASCII file called FOO.DOC.

CLEANSE FOO.TXT \DOCS\FOO.DOC /RS/RI

converts the text file FOO.TXT into FOO.DOC in the subdirectory DOCS.
Extra spaces in the text and line indentation are removed.

CLEANSE EXTRA.TXT ORDIN.DOC /D72 /TE

converts the text file EXTRA.TXT to ORDIN.DOC, dividing lines at word
boundaries after column 72, and translating the IBM extended characters
into similar characters in the ASCII set.

Error Messages:

Two different files must be specified.

CLEANSE found that file1=file2. The second file and path name must be
distinct from the first.

Invalid switch given.

A command line switch begins with the slash character (/) followed by a two
character switch indicator, or a switch character and a number. Several
switches are allowed, and they may appear anywhere on the command line after
the CLEANSE command itself. Space separators are not required.

File not found.

The first file name or path is invalid.

No conversion done.

A request to overwrite an already existing second file was denied by the
user.

Not enough memory.

In order to operate quickly, CLEANSE creates two buffers for the input and
output files. The input buffer is the smaller of 64K or the input file
size. The output buffer is always 64K. As much as 132K may be required.

Error opening file.

The input file cannot be read, or the output file already exists and cannot
be overwritten.

Error creating file.

The output file may already exist as a hidden file.

Error reading file.

CLEANSE could not read a portion of the input file.

Error writing converted file.

CLEANSE could not write a portion of the output file. There may be
insufficient disk space.

File to receive conversion already exists.

If the output file already exists, you will be asked if you wish to
overwrite it. The answer must be Yes or No (y,n,Y,N accepted).

Technical Notes:

The translation of the IBM extended character set is given by the
following table. The numbers indicate the hexadecimal value of
the extended character. The translated symbol for that character
is shown to the right.

80 81 82 83 84 85-86 87 88 89 8A 8B 8C 8D 8E 8F CueaaaaceeeiiiAA
90 91 92 93 94 95-96 97 98 99 9A 9B 9C 9D 9E 9F E&AooouuyOUcL=Rf
A0 A1 A2 A3 A4 A5-A6 A7 A8 A9 AA AB AC AD AE AF aiounNao?--\\!<>
B0 B1 B2 B3 B4 B5-B6 B7 B8 B9 BA BB BC BD BE BF ###|||||||||````
C0 C1 C2 C3 C4 C5-C6 C7 C8 C9 CA CB CC CD CE CF ```|-|||````|-+`
D0 D1 D2 D3 D4 D5-D6 D7 D8 D9 DA DB DC DD DE DF ```````||``#####
E0 E1 E2 E3 E4 E5-E6 E7 E8 E9 EA EB EC ED EE EF abGpSsmgFtOdifeI
F0 F1 F2 F3 F4 F5-F6 F7 F8 F9 FA FB FC FD FE FF =+><||/=o..\n```

This table may be found in CLEANSE at location 104H after loading the
program with DEBUG.
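
The table above can be applied with a few lines of Python. This is a
sketch and not the program's own code; the names ROWS, TABLE, and
translate_extended are hypothetical. The row strings are copied from
the table, except that "~" stands in for the grave accent character
to keep the listing easy to embed; it is swapped back before use.

```python
# One string per table row, covering codes 80-8F, 90-9F, ..., F0-FF hex.
# "~" is a stand-in for the grave accent used in the original table.
ROWS = [
    r"CueaaaaceeeiiiAA",
    r"E&AooouuyOUcL=Rf",
    r"aiounNao?--\\!<>",
    r"###|||||||||~~~~",
    r"~~~|-|||~~~~|-+~",
    r"~~~~~~~||~~#####",
    r"abGpSsmgFtOdifeI",
    r"=+><||/=o..\n~~~",
]
assert all(len(row) == 16 for row in ROWS)
TABLE = "".join(ROWS).replace("~", "`")

def translate_extended(data: bytes) -> bytes:
    """Map bytes 80-FF hex through the table; pass 7-bit bytes unchanged."""
    return bytes(
        ord(TABLE[b - 0x80]) if b >= 0x80 else b
        for b in data
    )
```

With this table, the byte 82 hex (an accented e in the IBM set) comes
out as a plain "e", matching the example given for the /te switch.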

History:

July, 1990: New to Version 1.2:

Switches /c#, /d#, and /te.

August, 1988: New to Version 1.1:

Added automatic removal of extra spaces at the end of
lines.

March, 1987: Version 1.0.

