Contents of the MAPSTAT.DOC file
MULTIVARIATE ANALYSIS PACKAGE 2.0
Copyright 1985, 86, 87, 88
Douglas L. Anderton
Department of Sociology
University of Chicago
1126 E. 59th Street
Chicago, IL 60637
These programs are released for distribution so long as 1) any
charges involved do not exceed costs of media and mailing, and 2)
no portion of programs is used for commercial resale without
written permission of the author.
MAPSTAT is a very serious multivariate statistical analysis
package capable of meeting 90% or more of most users' analytical
needs. The routines have, at this point, been well tested and
provide the most frequently used procedures of the relatively
expensive statistical packages without cost. Unlike any
commercial software package, Turbo Pascal (@Borland Int'l) source
code is included for modifications and elaborations at your own
If data are properly arranged (discussed below) MAPSTAT can
theoretically analyze an unlimited number of variables and cases.
It has been tested on data files containing over 200 variables
and 10,000 cases. It is highly recommended that you read the
entire documentation file before using or modifying MAPSTAT. It
is equally important that you have a knowledge of any statistical
procedure before you attempt to use it and interpret the results.
Fourteen subprograms are included in this seventh release of
MAPSTAT. All of the statistical programs are invoked from the
same common menu, which is displayed when MAPSTAT is invoked
from the DOS prompt. The currently available subprograms in
MAPSTAT include:
1) DESCRPT - descriptive statistics and frequency histograms
2) CORREL - correlation and covariance matrices
3) REGRESS - multiple linear regression
4) CROSSTAB - n-way crosstabulation and association tests
5) TRANSFRM - data transformations
6) HYPOTHS - simple hypothesis tests on means and variances
7) PARTIAL - partial correlation coefficients
8) FACTOR - principal axis factoring with rotations
9) CLUSTER - kmeans clustering program
10) PLOT - simple 2 dimensional plots
12) MANOVA - multiple dependent variable analysis of variance
13) FIXFREE - utility for fixed to free format file conversion
14) SORT - utility to sort mapstat files with DOS SORT
Features and limitations of these programs are discussed below.
First, MAP is written as a sequential case processor to avoid
memory resident storage and achieve the greatest speed possible.
This has several consequences: 1) the package contains powerful
statistical analysis programs without horrendous memory
requirements; 2) the cost, however, is that functions which must
revisit the data, such as histograms and regression residuals,
currently require multiple passes through the data. Even for
large data sets the programs are sufficiently fast to make such
Major benefits of this strategy include the (A) ability to run
the package on a floppy disk system without a hard drive
(including many portable and laptop machines); (B) ability to run
the package with as little as 56K memory; (C) the analysis of a
virtually unlimited number of cases within these modest hardware
requirements; and (D) a blinding speed compared with many larger
and more cumbersome statistical packages. Unlike many of the memory
mammoths, MAPSTAT is ideal for running under multitasking
systems (e.g. DESQVIEW, DOSAMATIC) in that it requires so little
memory that it leaves room for many other tasks to be running.
Even if you currently use other statistical packages for
analysis, I suspect you will find MAPSTAT more useful and
accessible for many of your routine computing tasks.
INPUT DATA REQUIREMENTS:
MAP expects to find your data in a free format with at least one
blank separating each variable and a new line at the end of each
case. All variables for each case must be on a single line, i.e.
new lines separate records. It will not accept alphanumeric
data. Programs assume all data transformation has been performed
(e.g. CROSSTAB expects a finite number of values, not necessarily
integer values). Thus, it is very desirable to master the more
difficult TRANSFRM transformation program before making full use
of MAPSTAT. These are the only data requirements.
As with all statistical packages on all computers, extreme data
values are less precise. For example, means on variable values
such as 0.000001, 0.000002, 0.000003 or 1E+10, 2E+10, 3E+10 are
less accurate than the equivalent computations on 1, 2, and 3.
Thus, after an initial DESCRPT run, accuracy will be improved in
subsequent analyses if such values are transformed with TRANSFRM
by a simple scale shift to more reasonable ranges (e.g. in the
above examples multiply by 1,000,000 or divide by 1E+09).
A simple data file of four variables and five cases might then
look like this:
100 50 90 -9
110 65 91.1 10
104 72 92.63 -9
107 70 92 11
97 36 99 14
Note there is no requirement that variables be aligned, only that
they be separated by a space with one case (or record) per line.
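As an illustration only, the free format described above can be
read by any program that splits each line on blanks. The Python
sketch below is not part of MAPSTAT; the function name read_cases
and the decision to skip blank lines are assumptions made for the
example.

```python
def read_cases(lines):
    """Yield one list of numbers per line (one case per line)."""
    for line in lines:
        fields = line.split()      # any run of blanks separates variables
        if not fields:
            continue               # assumption: blank lines are ignored
        # float() raises ValueError on alphanumeric data, which the
        # documented format does not allow.
        yield [float(field) for field in fields]


sample = """100 50 90 -9
110 65 91.1 10
104 72 92.63 -9
107 70 92 11
97 36 99 14"""

cases = list(read_cases(sample.splitlines()))
```

Each element of cases is one record, and variables need not be
aligned in columns for this to work.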
Of course, we will want to be able to identify variables by an
alphanumeric name, e.g. INCOME, EDUCTION, RESIDNCE, AGE. We will
also want to be able to identify the codes for each variable
which we use to indicate that the value for that variable is
missing, e.g. due to non-response on a survey item or non-
applicable items for this case. Codebook files containing
variable names and missing values are thus provided for.
Alternatively, if the user prefers, these data may be entered
interactively from the keyboard. For further information on
specifying variable names and missing values, see the detailed
instructions on 'running the programs' below.
RUNNING THE PROGRAMS:
1. Specifying Data Input and Output Files -
Once invoked, the programs will ask for the name of an
input data file (or a file created from a prior MAP run - for
example, the output of CORREL is used by REGRESS), and the name
of an output file. For printer output specify the filename as
LST: and for screen output specify CON:. (An exception is
TRANSFRM, which uses buffered output routines: it will accept
LST: and CON: but will send output to LIST.TMP and CONSOLE.TMP
disk files respectively.) To send output to a disk file or obtain
input from a disk file simply enter the name of the file. This
file must reside on the current drive in the current
subdirectory.
For example, on invoking MAPSTAT you should see the main menu:
***** MAPSTAT v1.6 MENU *****
(Copyright 1985,6,7 D. Anderton)
A. Data Transformation, Selection and Recoding
B. Descriptive Statistics and Histograms
C. Two Dimensional Plots or Scattergrams
D. Hypothesis Testing on Variables or Subgroups
E. Correlation and Covariance Matrices
F. Partial Correlation and Covariance Matrices
G. Multiple Linear Regression
H. Factor Analysis with Orthogonal Rotations
I. N-Way Crosstabs and Categorical Association
J. Clustering by KMeans Algorithm
K. Multiple Analysis of Variance
L. Convert Fixed Format to Free Format Mapstat File
M. Sort a Mapstat File with DOS SORT.EXE
X. Exit to OpSys
After selecting 'B' for descriptive statistics MAPSTAT will
respond:
*** DESCRPT: DESCRIPTIVE STATISTICS ***
Name of the data file?
You must now enter the name of a data file on the current disk
and subdirectory, e.g. 'MYDAT.DAT'. If MAPSTAT cannot find this
file an error message will be written and you will be ejected
from the program. Otherwise you will see:
Name of the output file?
You may now enter either (a) the name of a new file to be created
to contain output on the current disk and subdirectory, (b)
'CON:' (no quotations) to send output to the screen, or (c)
'LST:' (again without quotations) to send output to the current
printer.
If you have not made a drastic mistake (e.g. there is room for the
new disk file or the printer is turned on) you will then see the
prompt:
Name of the codebook file (or NONE)?
for appropriate MAPSTAT programs.
2. Specifying Codebook Variable Description Files -
If the input to the program is raw data (i.e. it is not one of
the procedures which input a prior CORREL matrix), then the
program will ask for a codebook file. The codebook file contains
three items of input for each variable in the data file: (1) the
column number, (2) a variable name of eight characters, and (3) a
code marking missing values for that variable. To repeat, one
line must be provided for each variable in the data file (whether
it is used in this particular analysis or not). All three items
must be provided for each variable on a new line and separated by
blanks. For the sample given above in the description of the data
file we might have:
1 INCOME   -1E37
2 EDUCTION -1E37
3 RESIDNCE -1E37
4 AGE      -9
Note that eight spaces must be allowed for variable names; leave
blanks if necessary to fill out the string. Note also that a
missing value code must be given for every variable. The example
above used MAPSTAT's default value of -1E37 for missing data in
the first three variables and a value of -9 for missing data in
the fourth variable. In the sample data set -9 was actually used
to indicate missing values, while the first three variables had no
missing values. When a variable has no missing values, -1E37
simply matches the default missing value that MAPSTAT uses
when it generates new variables in TRANSFRM, etc. This or
another equally implausible value may be given in the codebook.
For normal usage you should construct such a codebook file in the
same drive and directory as the data file and enter the name in
response to the prompt for a codebook file name.
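To make the role of the codebook concrete, here is a small Python
sketch (not MAPSTAT's own code) that reads codebook lines of the
form 'column name missing-code' and drops missing values for one
variable. The names read_codebook and valid_values are
hypothetical, and the sketch assumes variable names contain no
embedded blanks.

```python
def read_codebook(lines):
    """Return {column: (name, missing_code)} from codebook lines."""
    book = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue               # assumption: malformed lines are skipped
        column, name, missing = int(parts[0]), parts[1], float(parts[2])
        book[column] = (name, missing)
    return book


def valid_values(cases, book, column):
    """Values of one variable, excluding its missing-value code."""
    _name, missing = book[column]
    return [case[column - 1] for case in cases if case[column - 1] != missing]


codebook = read_codebook([
    "1 INCOME   -1E37",
    "2 EDUCTION -1E37",
    "3 RESIDNCE -1E37",
    "4 AGE      -9",
])
cases = [
    [100, 50, 90, -9],
    [110, 65, 91.1, 10],
    [104, 72, 92.63, -9],
    [107, 70, 92, 11],
    [97, 36, 99, 14],
]
ages = valid_values(cases, codebook, 4)
```

With the sample data, two of the five cases carry the -9 code for
AGE and are excluded, while all five INCOME values are kept.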
Alternatively, if the user specifies 'none' in answer to the
codebook file query, variable names will default to variable
numbers and the default missing value will be assumed. This is
not a recommended option if you will return to your output
sometime in the future. It is, however, convenient for quick
one-shot runs on small data files.
3. Variable Column Identification -
After file names the programs will typically request the number
of variables in the data file and then the number of variables to
be used in the present run. For example, a DESCRPT run might be
made on a file containing lines for 500 cases, each with 12
variables, only 4 of which we desire to analyze in the present
run. The total number of variables would then be entered as 12
and the number for the present run as 4 in response to the
prompts:
How many variables in data file?
Number of variables to use in DESCRPT?
For each variable to be used the program will request information
on the column number of the variable (e.g. 1 for the first
variable, 2 for the second, etc.). These are column numbers in
the raw file, not among the subset to be used. In the above
example, if the first, third, sixth, and eleventh of the 12
variables were to be used, the user would enter 1, 3, 6, and 11
as responses to the prompts:
Column number for variable 1?
Column number for variable 2?
Column number for variable 3?
Column number for variable 4?
4. Specification of Groups, Weights and Special variables -
Occasionally, the programs will ask you to identify one of the
variables for use in weighting data, grouping data, as a
dependent variable, etc. Again, reference is by original column
number of the input data set. For example, if the descriptives
in the example above were to be weighted by population which is
contained as the sixth variable, you would identify the weight as
column 6, its position in the raw data file. All of the
variables used as weights, groups, etc., must have been included
in the original number of variables to use and selection of the
columns for the analysis. That is, it would not be possible to
specify, for example, column 4 as a weight since it has not been
specified in the variable list above. DESCRPT, for example will
provide the prompts:
Of these Column numbers which is weight (0=none)?
Of these Column numbers which is grouping (0=none)?
To which we may respond with either the column number of a
previously included variable or '0' if no weighting or grouping
(respectively) is to be done.
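The column rules above can be summarized in a short Python sketch.
It is an illustration under assumptions, not DESCRPT's actual
code: the function name weighted_mean and its arguments are
hypothetical. Columns are the original 1-based positions in the
raw file, and a weight column of 0 means no weighting.

```python
def weighted_mean(cases, selected, target, weight=0):
    """Mean of column `target`, optionally weighted by column `weight`."""
    if target not in selected:
        raise ValueError("target column was not selected for this run")
    if weight != 0 and weight not in selected:
        # Mirrors the rule that weights must be among the selected columns.
        raise ValueError("weight column was not selected for this run")
    total = 0.0
    weight_sum = 0.0
    for case in cases:                      # one sequential pass
        w = 1.0 if weight == 0 else case[weight - 1]
        total += w * case[target - 1]
        weight_sum += w
    return total / weight_sum


cases = [[1.0, 2.0, 3.0], [3.0, 4.0, 1.0]]
plain = weighted_mean(cases, [1, 3], 1)             # no weighting
weighted = weighted_mean(cases, [1, 3], 1, weight=3)
```

Asking for column 2 as the weight in this example raises an error,
because column 2 was not among the columns selected for the run.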
5. Hints on Further Documentation -
All other information necessary is prompted for with what I hope
are explicit prompts. If you have problems as to input queries,
or the interpretation of output, refer to a statistics book.
Some of the multivariate routines are recognizably influenced by
those in Fortran by Cooley and Lohnes in their Multivariate Data
Analysis book. The Kmeans clustering routine is found in almost
any book on cluster analysis. Some routines lifted from
numerical methods books, etc., have references in the source
code. The transformation options are relatively well elaborated
if you initially specify to input transformations for the CON:
file. Once you become familiar with the program you can input
transformations from files.
Because of the more difficult nature of TRANSFRM, a special
section on its usage is presented below.
6. Hints on Power Usage -
There are a number of features which the design philosophy of
MAPSTAT precludes. However, most of these features are readily
derivable through coupling TRANSFRM with the other subprograms.
For example, many regression packages output residuals from the
regression and plots of the standardized residuals, etc. MAPSTAT
does not force such a second pass through the data since it is
designed for large data sets without retention of the data in
memory. If the user desires such an analysis the residuals could
be readily computed using TRANSFRM and then plotted with PLOT.
Similarly, FACTOR produces score coefficients which could be used
to generate factor scores for further analysis, etc. Dummy
variables can be coded through use of the recoding facilities in
TRANSFRM and used to compute complicated general linear model
analyses of variance (e.g. GLM/ANOVAs) through REGRESS.
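As a rough picture of the residual computation mentioned above,
the Python sketch below makes a second sequential pass over the
cases using coefficients already obtained from a regression. In
MAPSTAT itself this would be written as TRANSFRM statements; the
function name residuals and the coefficient layout are
assumptions for the example.

```python
def residuals(cases, intercept, slopes, y_column):
    """Yield observed minus fitted value for each case.

    `slopes` maps an original 1-based column number to its
    regression coefficient.
    """
    for case in cases:
        fitted = intercept + sum(
            coef * case[column - 1] for column, coef in slopes.items()
        )
        yield case[y_column - 1] - fitted


# Hypothetical fit: y (column 2) = 1 + 2 * x (column 1)
cases = [[1.0, 3.0], [2.0, 5.0], [3.0, 8.0]]
res = list(residuals(cases, 1.0, {1: 2.0}, 2))
```

Because each residual depends only on the current case, no data
need be held in memory, which matches the sequential design.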
The list goes on, and on, and on. The more you know about
statistics and what you are doing the more you will find these
programs of use. At the same time, if you are a basic user you
will probably not require more than the basic output provided by
YOUR FIRST ENCOUNTER:
A recommended first experimentation is to begin with simple
descriptive statistics using program DESCRPT followed by
bivariate correlations using CORREL. Soon after this you should
attempt to learn the most difficult, and perhaps most useful,
program TRANSFRM for data transformations and sample selection.
Once you have mastered the TRANSFRM program all remaining
programs should come easily. As noted below, many of the
multivariate programs take a correlation matrix generated by
CORREL as input for further analysis. This allows one
correlation matrix to be generated for a large dataset and many
analyses to be computed without recomputing the correlations.
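Why a correlation matrix is enough for further analysis can be
seen in the two-predictor case, where standardized regression
coefficients follow directly from three correlations. The Python
sketch below uses the standard textbook identities; it is not
taken from the REGRESS source, and the function name is
hypothetical.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized coefficients and R-squared for two predictors.

    r_y1, r_y2: correlations of the dependent variable with each
    predictor; r_12: correlation between the two predictors.
    """
    denom = 1.0 - r_12 ** 2          # zero if the predictors are collinear
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta1 * r_y1 + beta2 * r_y2
    return beta1, beta2, r_squared


b1, b2, r2 = standardized_betas(0.5, 0.5, 0.5)
```

No pass over the raw cases is needed here, only the stored
correlations.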
The CROSSTAB program for analysis of frequency tables is a
particularly useful program which will handle up to seven-way
tables and automatically generate all applicable statistics of
association.
The addition of codebooks and transformation files makes these
routines roughly competitive with other micro statistics
packages. Given you have received them free of cost and,
"omigosh," with the source code, they are extremely flexible and
useful tools for data analysis.
Both DESCRPT and CORREL now allow weighted data to be entered.
While the Spicer algorithm provides good accuracy on computations
in both these programs it is not as robust against weighted
data. The results are sufficient for most purposes but exercise
caution with heavily weighted data. You should keep your weights
in a reasonable scale range: e.g. if you are weighting by
population, make sure you are weighting with something like 1.2 in
millions rather than using 1,200,000 as a weight. Then you will
have little cause for concern.
While each of the programs handles virtually unlimited numbers of
cases, you should be cautious of any statistical computations
which require a statistical package to manipulate very large
sums. In the programs using the Spicer algorithm this problem is
minimized. However, if any statistical package produces either
gigantic or minuscule numbers within the output, take the time to
transform the scale of your variables to avoid such strains.
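The danger of very large sums can be avoided by updating the mean
and the sum of squared deviations one case at a time. The Python
sketch below uses the widely known running-update (Welford-style)
recurrence as a stand-in; it may differ in detail from the Spicer
algorithm used in the MAPSTAT source, and the function name is
hypothetical.

```python
def running_stats(values):
    """One-pass count, mean and sample variance without large sums."""
    n = 0
    mean = 0.0
    m2 = 0.0                         # running sum of squared deviations
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)     # uses the updated mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance


n, mean, variance = running_stats([1e9 + 1, 1e9 + 2, 1e9 + 3])
```

A naive sum of squares on the same values would have to carry
numbers near 1E+18, where ordinary floating point no longer
resolves differences of one unit.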
In addition, each MAPSTAT subprogram has some limits on the
number of variables which may be included in computations (not
the number in the data, only those included for analysis in a
particular run). The current settings (which may be modified)
are:
A. Data Transformation, Selection and Recoding  - 100 variables incl created
B. Descriptive Statistics and Histograms        - 100 variables
C. Two Dimensional Plots or Scattergrams        - 30 variables
D. Hypothesis Testing on Variables or Subgroups - 2 variables
E. Correlation and Covariance Matrices          - 50 variables
F. Partial Correlation and Covariance Matrices  - 30 variables
G. Multiple Linear Regression                   - 30 variables
H. Factor Analysis with Orthogonal Rotations    - 30 variables
I. N-Way Crosstabs and Categorical Association  - 8 variables with up
   to 25 codes for variables, resulting in not more than
   3500 cells in the table
J. Clustering by KMeans Algorithm               - 10 variables with
   max-min number of clusters to consider less than 25
   (e.g. consider numbers of clusters between 95 and 80)
K. Multiple Analysis of Variance                - 30 variables
L. Convert Fixed Format to Free Format Mapstat File
                                                - 255 variables
M. Sort a Mapstat File with DOS SORT.EXE        - DOS SORT.EXE required
These settings are easily altered at the beginning of any program
if you wish to recompile them. They have been limited to keep
the programs small enough to run under 56K and so that users do
not make unreasonable demands on their own capacities and data.
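As a quick way to check a planned CROSSTAB run against the limits
listed above, the Python sketch below multiplies the number of
codes per variable to get the table size. The function name and
the reading of '25 codes' as a per-variable limit are assumptions;
the constants are the current settings quoted in this document.

```python
from math import prod


def crosstab_fits(codes_per_variable, max_vars=8, max_codes=25,
                  max_cells=3500):
    """True if a table with these code counts stays within the limits."""
    if len(codes_per_variable) > max_vars:
        return False
    if any(codes > max_codes for codes in codes_per_variable):
        return False
    return prod(codes_per_variable) <= max_cells   # total cells in the table


ok = crosstab_fits([25, 25, 5])      # 3125 cells
too_big = crosstab_fits([25, 25, 6]) # 3750 cells
```

If the limits are changed when recompiling, the defaults here
would need to be changed to match.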
It is easier to divide data into meaningful sets of variables and
work with them than to digest a correlation matrix of 500
variables. However, you are free to use and abuse the source
code as you please so long as you abide by the copyright above.
HARDWARE REQUIREMENTS AND RECOMPILING THE PROGRAMS:
MAP is written in version 2 (or 3) of Turbo Pascal (@Borland
Intl). It has been written to compile with less than 56k. If
you modify the programs and wish to recompile them you should be
familiar with the Borland compiler. First compile all *.PAS
files other than the main menu MAPSTAT.PAS using the 'cHain'
file option in the 'Options' menu of Turbo. For each .CHN file
which results make a note of the resulting code and data size.
Finally, compile MAPSTAT.PAS using the 'Com' option in the
'Options' menu of Turbo. Set the 'Code' and 'Data' segment sizes
to the largest recorded for any of the .CHN files. Failure to
follow these instructions will result in periodic program
Rename all *.CHN files to the names given in the file
MAPSTAT.PAS. REMEMBER: in MSDOS you must compile all of the .CHN
files first and keep track of the largest code and data segment
sizes. I think that, as of now, FACTOR has the greatest code size
and CORREL the greatest data size. These must be set with O and D
commands before compiling the main menu MAPSTAT.PAS.
Only a few statements must be altered to run the programs on
CPM machines. Change HALT calls to BDOS(0) and try to compile.
As I recall only two or three other lines need to be changed out
of all the code herein. The initial versions of MAPSTAT were
written for a KayproII '83 and many such machines are currently
running MAPSTAT both within and outside the United States.
PLOT contains printer control codes for the EPSON MX80 in
procedure Openfiles; modify these codes to suit your printer and
recompile using the Turbo (@Borland Int'l) compiler if your
printer is not compatible with, or capable of emulating, the
Epson standard codes. If your printer is not compatible and you
do not have sufficient knowledge to recompile the programs you
may continue to use MAPSTAT and simply avoid the PLOT subprogram.
Users are encouraged to REPORT BUGS and make REQUESTS for future
versions. Do not release your own versions or modifications
using the copyrighted MAP or MAPSTAT logos - and abide by the
above copyright notice. No liabilities or guarantee of technical
support may be assumed given the cost free nature of the
programs. Telephone requests for support will not be answered;
all questions and/or requests for assistance should be made
through the mail and addressed to the author. Responding to such
queries is an activity which my schedule demands be placed at a
very low priority. I will place priority on responding to clear
inquiries including printed output and self-addressed stamped
envelopes for reply.
If you choose to register your copy of MAPSTAT (no fee is
required), send one self-addressed floppy disk mailer with
postage affixed and include a DSDD DOS FORMATTED disk in the
mailer. Include a note to tell me which version of the program
you have and where you obtained it. When a substantial new
release of MAPSTAT is available I will forward you a copy so long
as time and the number of users remain manageable. There are
currently 177 registered users of MAPSTAT in 21 countries around
the world.
SPECIAL SUPPLEMENT ON PROGRAM TRANSFRM:
In part because it is a very powerful utility, the data
transformation subprogram TRANSFRM is more complicated to use than
other routines in MAPSTAT. This is similar to other statistical
packages where data transformation languages are the most difficult
for the novice.
The transformation language in MAPSTAT is an RPN (reverse Polish
notation) language similar to that in many scientific calculators
such as those made popular by Hewlett-Packard. If you are
experienced with RPN logic you will find the program easier to
master; if not, a small amount of perseverance will pay off.
TRANSFRM will prompt for a transformation file containing
statements to recode, compute, or otherwise transform data:
Name of the transformation file (or con:)?
The first few times you run TRANSFRM enter con: for console input
rather than attempting to create a transformation file. This
will then display a list of available transformations on the
screen:
*** TRANSFRM: DATA TRANSFORMATION ***
Valid Arithmetic Operators:
+ - * / =
Turbo Pascal Functions Supported:
ABS ARCTAN COS EXP FRAC INT
LN SIN SQR SQRT ROUND TRUNC
Nonstandard MAP functions supported:
CASEN IF IFS LAG MOD NORMAL
Leading minus allowed (not plus) number must be less than or
equal to 11 digits, e.g. .001 12 -.0000005 etc.
Note: no check of statements is provided until runtime. [n]
refers to the nth variable read, not the nth column.
Comments may follow transformations on the same line
except END statement. Functions must be UPPERCASE.
This menu gives the names of all transformation functions known
to MAPSTAT. For example, LN is the standard Turbo Pascal natural
log function, NORMAL is a non-standard function to return random
numbers with a normal (0,1) distribution, REC is a recode
function, etc. These functions are described in greater detail
below.
Upon pressing a key to continue, you will get the second menu of
the TRANSFRM program which explains statement syntax with some
examples:
*** TRANSFRM: DATA TRANSFORMATION ***
Data transformation statements are entered in RPN (reverse polish
notation) with blanks separating each operator, constant, or
variable. Statements are terminated by '=' to end the statement
and the variable number to receive the value. Variables are
referred to by column number in brackets '[n]'. New variables
created by transformations are added to the data file. Use
successive numbers for new variables (if you read four variables
the first you create should be '[5]' etc.) 'END' in the first
three columns will end input of transformations.
To put the square root of 3.2 times the first variable into the
->3.2 [1] * SQRT = 
To create a new sixth variable as the natural logarithm of the
second divided by the fifth -
->[5] [2] LN / = [6]
To recode second variable if between 10 and 50 to value 3 -
->[2] 10 50 3 REC = [2]
A summary of available operators is displayed during entry.
If you are not familiar with RPN note the order of the examples
in these menu samples. In the first example the number 3.2 is
placed on a 'stack' of variables. Then, the value of the first
variable read in is placed on top of the value 3.2 on the stack.
When the operator '*' (multiply) is encountered it gets the
needed data from the stack. That is, it gets the value of variable
[1] then multiplies by the next element left on the stack, 3.2.
Finally, it places the result back on the stack in place of the
two elements it removed. When the next operator is encountered
SQRT (square root) it gets the needed data (in this case one
value) from the number placed on the stack last (by the result of
the multiply), takes the square root of the number and places it
back on the stack in place of that removed. Finally, the '='
operator says remove the last value placed on the stack and
assign it as the new value for variable .
Again, if you are not familiar with RPN, work your way through
the other two examples to see how it works and try experimenting
with a few simple transformations on a small test dataset before
relying on TRANSFRM.
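For readers who find code easier to follow than prose, here is a
minimal Python sketch of the stack logic just described. It is an
illustration, not the TRANSFRM source: it handles only a few
operators, the function name run_statement is hypothetical, and
the target variable [5] in the first statement is chosen for the
example. Subtraction and division follow the operand order given
in this document (last value on the stack, then the preceding
value).

```python
import math
import re

VAR = re.compile(r"\[(\d+)\]$")          # matches tokens such as [3]
UNARY = {"SQRT": math.sqrt, "LN": math.log, "ABS": abs}


def run_statement(statement, variables):
    """Evaluate one RPN statement; `variables` maps n to the value of [n]."""
    stack = []
    tokens = statement.split()
    for i, tok in enumerate(tokens):
        var = VAR.match(tok)
        if tok == "=":
            # '=' stores the top of the stack in the variable named next;
            # anything after that is treated as a comment.
            target = int(VAR.match(tokens[i + 1]).group(1))
            variables[target] = stack.pop()
            break
        elif var:
            stack.append(variables[int(var.group(1))])
        elif tok in UNARY:
            stack.append(UNARY[tok](stack.pop()))
        elif tok in ("+", "-", "*", "/"):
            last, prev = stack.pop(), stack.pop()
            if tok == "+":
                stack.append(last + prev)
            elif tok == "-":
                stack.append(last - prev)    # last minus preceding
            elif tok == "*":
                stack.append(last * prev)
            else:
                stack.append(last / prev)    # last divided by preceding
        else:
            stack.append(float(tok))         # numeric constant, e.g. -.5
    return variables


data = {1: 5.0, 2: math.e}
run_statement("3.2 [1] * SQRT = [5]", data)   # sqrt(3.2 * 5)
run_statement("[5] [2] LN / = [6]", data)     # ln([2]) / [5]
```

Tracing the first statement by hand: 3.2 and then 5.0 go on the
stack, '*' replaces them with their product, SQRT replaces that
with its square root, and '=' stores the result.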
A brief summary of how operators work with the stack will aid you
in writing TRANSFRM statements:
Valid Arithmetic Operators:
+ Adds the last two values on the stack
- If not attached to the front of a negative
number (with no spaces) subtracts from the
last value on the stack the preceding value
on the stack
* Multiplies last two values on the stack
/ Divides last value on the stack by the
immediately preceding value on the stack
= Assigns current value of the stack to the
variable number that follows
Turbo Pascal Functions Supported:
ABS Replaces last number placed on the stack with
its absolute value
ARCTAN Replaces last number placed on the stack with
its arctangent function
COS Replaces last number placed on the stack with
its cosine function
EXP Replaces last number placed on the stack with
its natural exponent (i.e. e raised to that
power)
FRAC Replaces last number placed on the stack with
the fractional part of the number only (i.e.
'4.2 FRAC = [2]' will place .2 in variable 2)
INT Replaces last number placed on the stack with
the greatest integer number less than or
equal to it
LN Replaces last number placed on the stack with
its natural logarithm
SIN Replaces last number placed on the stack with
its sine function
SQR Replaces last number placed on the stack with
its squared value
SQRT Replaces last number placed on the stack with
its square root
ROUND Replaces last number placed on the stack with
the value rounded to the nearest integer
TRUNC Effectively identical to INT above
RANDOM Places a uniformly distributed random number
between 0 and 1 on the stack
Nonstandard MAP Functions Supported:
CASEN Places the observation or case number of the
current case onto the stack
IF Operations on the same line continue only if
the top value on the stack is greater than or
equal to zero, i.e. '[2] IF = [3]' will
assign the value of variable 2 to variable 3
only if it is greater than or equal to zero,
'5 [2] - IF 1 = [3]' will assign the value 1
to variable 3 only if variable 2 is greater
than or equal to the value 5.
IFS The case will be included in the output data
file only if the last value placed on the
stack is greater than zero, i.e. '[2] IFS'
will select the subsample of cases in which
variable 2 is greater than zero.
LAG Replaces the current value of the stack with
the similar value from the previous
observation, i.e. '[2] LAG = [3]' will set
variable 3 equal to the value of variable 2
lagged by one case
MOD Places the modulus of the last number placed
on the stack divided by the previous number
on the stack back onto the stack, i.e. '10
123 MOD = [3]' places the remainder 3 of
dividing 123 by 10 into variable 3
NORMAL Places a random number following a standard
normal distribution (with mean 0 and standard
deviation of 1) on the stack
POW Raises the last value placed on the stack to
the power of the immediately prior value
placed on the stack, i.e. '5 [2] POW = [3]'
will set the third variable to the fifth
power of the second
REC If the value placed 4 deep on the stack is
less than or equal to the value 2 deep and
also is greater than or equal to the value 3
deep then the last value on the stack is
returned to the stack, otherwise the value 4
deep is returned, i.e. '[5] 0 10 1 REC = [5]'
will recode variable five to the value 1 if
it is greater than or equal to 0 and less
than or equal to 10, otherwise it is left
unchanged
END Signals the end of transformation statements
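The stack depths used by REC, MOD, POW and IF are easy to get
backwards, so the Python sketch below spells them out as small
stack operations. It follows the descriptions in this summary
only; the function names are hypothetical, and whether IF removes
the value it tests is an assumption made here for simplicity.

```python
import math


def op_rec(stack):
    """REC: value (4 deep), low (3 deep), high (2 deep), new (top)."""
    new, high, low, value = stack.pop(), stack.pop(), stack.pop(), stack.pop()
    stack.append(new if low <= value <= high else value)


def op_mod(stack):
    """MOD: remainder of the last value divided by the preceding value."""
    last, prev = stack.pop(), stack.pop()
    stack.append(math.fmod(last, prev))


def op_pow(stack):
    """POW: last value raised to the power of the preceding value."""
    last, prev = stack.pop(), stack.pop()
    stack.append(last ** prev)


def op_if(stack):
    """IF: True if the rest of the line should run (top value >= 0)."""
    return stack.pop() >= 0          # assumption: the tested value is removed


inside = [5.0, 0.0, 10.0, 1.0]       # '[5] 0 10 1 REC' with [5] = 5
op_rec(inside)
outside = [50.0, 0.0, 10.0, 1.0]     # same statement with [5] = 50
op_rec(outside)
remainder = [10.0, 123.0]            # '10 123 MOD'
op_mod(remainder)
power = [5.0, 2.0]                   # '5 [2] POW' with [2] = 2
op_pow(power)
```

In each case the operands are consumed and a single result is left
on the stack, ready for '=' or a further operator.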
If you enter your transformations from the console, each line is
prompted for by a '->'. Enter the transformation statement
followed by a carriage return. To end transformations enter the
function 'END' as the only item on the line followed by
a carriage return, i.e. '->END'. When you become proficient with
transformation statements, you may enter these statements in a
file (each optionally followed by a comment) and simply give the
name of this file
to TRANSFRM at the prompt for a transformations file discussed
above. A simple transformations file might look something like:
NORMAL [1] + = [3]      set var 3 equal to var 1 plus random error
[2] 0 10 1 REC = [2]    recode var 2 to 1 if 0<=var 2<=10
[2] 10 25 2 REC = [2]   recode var 2 to 2 if 10<=var 2<=25
[2] 25 50 3 REC = [2]   recode var 2 to 3 if 25<=var 2<=50
50 [2] - IF 4 = [2]     recode var 2 to 4 if var 2 >=50
[2] .999 - IF 0 = [2]   recode var 2 to 0 if var 2 < 1 (i.e.<=.999)