File: REGEXP.1
A part of U-SEDIT1.ZIP
Version 1 - January 19, 1990

All text is Copyright 1990 Mike Arst, Box 5, 1407 E. Madison St.,
Seattle, WA 98122. FidoNet address: send netmail c/o 1:343/8.0.

You may copy these files and transmit them in *unaltered* form to
computer bulletin boards. You may print out the text of the files
and/or photocopy the printouts for personal use. This text may
not be reproduced or published for any other purpose, by any
means now known or to be later developed, without the express
written permission of its author.

No one may charge a fee specifically for the distribution of this
file nor for the distribution of the others in the U-SEDIT1.ZIP
archive file (with "?" representing a version number), which
files include U-SED-IT.1, REGEXP.1, REFORMAT.INF, SFILES-A.1,
SFILES-B.1, and SED.EXE.

Copyright notices, and all language related to usage of this
text, must be retained in the files.

All proprietary names herein, such as Microsoft, DOS, MS-DOS,
Unix, and so on, are the property of their various owners, blah
blah blah.

If you upload the U-SEDIT archive to a bulletin board, please
upload it with all files that were in it when you got it.



ABOUT REGULAR EXPRESSIONS

Regular expressions are what make SED a powerful text processing
tool. So what are they? "Regular expression" means something
like: A method of symbolically, rather than literally, describing
a pattern of text, such that you do not have to write out the
text in its entirety.

How the Unix documentation defines "regular expression," I dunno. The
above is good enough for me.

Definition of other terms used in this file (yeah, I know - not
in alphabetical order - sue me, already):








File REGEXP.1 - about SED - Copyright 1990 Mike Arst Page: 1



STRING LITERAL

A string of characters (it could include spaces) meant to be read
quite literally. What you see is what you get.


EXPRESSION

I don't know what *the* formal definition is for "expression," as
used in Unix (or other) documentation. For now, how about this:
"some text you're telling SED about." It might be a string
literal, and on the other hand it might be a full-blown regular
expression.


REGULAR EXPRESSION

See the first paragraph under "ABOUT REGULAR EXPRESSIONS," above.
Consider two strings of text:

reduces and: [Rr]educe[ds]

The first expression, "reduces," is a string literal that means,
simply, "reduces," spelled out entirely in lower-case letters.
The second, "[Rr]educe[ds]," is a regular expression which means:
either a capital *or* a lower-case "r," followed by the string
literal "educe," followed by either a lower-case "d" *or* a lower-
case "s." It describes any of the following:

Reduces Reduced reduces reduced
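
To see the difference in action, here is a minimal sketch. It assumes a
Unix-style shell and a POSIX-style sed (GNU sed, say) rather than the
DOS build in this archive; the sample text and the variable name are
made up for illustration. It wraps every match in angle brackets, using
the "&" shortcut described further on:

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
# "&" on the replace-with side stands for whatever was matched.
marked=$(printf '%s\n' 'Reduces Reduced reduces reduced REDUCES' |
  sed 's/[Rr]educe[ds]/<&>/g')
printf '%s\n' "$marked"
```

The all-caps "REDUCES" should be left alone; the class [Rr] only covers
the first letter.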


DELIMITER

In a SED substitution command like this:

s/hello/goodbye/g

The "/" characters set off (delimit): 1) the start of the
instruction (which is also the beginning of the first expression,
namely the word "hello"); 2) the end of the first expression -
and also the beginning of the second ("goodbye"); and finally the
end of the instruction itself.

When you type SED commands at the DOS prompt (as opposed to
putting them into a SED script file), you often have to surround
a SED editing instruction within double quote marks. In that case
the quote marks are also delimiters.

Some versions of SED allow you to change the delimiter character
for expressions from / to some other character entirely. But do
you *really* want to do that? I didn't think so.





ESCAPING

No, not escaping from this documentation. Too late; you're already
here.

Certain characters in a SED instruction have special meanings;
they don't *automatically* mean themselves (as string literals).
What if you want to tell SED to interpret them as themselves,
however? You "escape" them. Like so:

A period often stands for "any character." That is its special
meaning. In certain SED instructions you would type this:

\.

to mean a period, literally. In computer-ese, you have "escaped"
the period, and therefore the \ is referred to as the "escape
character."

Here are characters that must be escaped when they're placed into
either a simple or regular expression:

. \ / " & [ ] * +

The following must be escaped if they appear on the "search for"
side of a substitution command:

. \ / " [ ] * +

In their escaped form, they would look like this:

\. \\ \/ \" \[ \] \* \+

The double quote mark must be escaped *only* if it appears as
itself within an expression which must be surrounded by double
quotes - when given on the DOS command line. It does not have to
be escaped in a script file.

*** IMPORTANT: Such characters do *not* have to be escaped when
they appear on the "replace with" side of a substitution command -
except as noted below.

The following must be escaped *no matter where they appear:*

\ and /

Example: you want to search for the character string:

C:\BOGUS.BAT





and replace it with:

D:\LEGIT.BAT

Your instruction would read:

s/C:\\BOGUS\.BAT/D:\\LEGIT.BAT/g

The \ character has been escaped each time it appears, on *both*
sides of the substitution command. The period has been escaped
only once - on the "search-for" side.

If the text to be replaced is: EITHER/OR and you want to
replace it with: AND/OR then the command would read:

s/EITHER\/OR/AND\/OR/g
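
Both substitutions can be tried out as follows. This is a sketch under
assumptions: a Unix-style shell and a POSIX-style sed, with single
quotes used (instead of the DOS double quotes) so the shell leaves the
backslashes alone. The variable names are mine.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
# "\\" is an escaped backslash on both sides; "\." is a literal period.
bat=$(printf '%s\n' 'C:\BOGUS.BAT' | sed 's/C:\\BOGUS\.BAT/D:\\LEGIT.BAT/g')
# "\/" is a literal slash, needed because "/" is the delimiter.
either=$(printf '%s\n' 'EITHER/OR' | sed 's/EITHER\/OR/AND\/OR/g')
printf '%s\n' "$bat" "$either"
```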

The & character does *not* have to be escaped when it appears on
the "search for" side of a substitution instruction, but does
have to be escaped when it appears on the "replace with" side.

The version of SED included with this documentation does not
appear to use a question mark as any kind of wildcard or for any
other special purpose. So far, I have not found that it has to be
escaped in any editing or other SED instructions.


CONSTRUCTIONS

"Construction" is my own term for certain kinds of SED
delimiters, including:

[ ] \{ \} \( \)

Their meanings and uses will be explained by and by.


CASE-SENSITIVITY

SED is case-sensitive. If you are searching for the word "HELLO"
but you type "hello" within your regular expression, SED will not
locate "HELLO." The version included with this archive does not
support any command-line switch of the "ignore case" variety (the
GNU version does have such a switch).









DETAILS ABOUT CHARACTERS WITH SPECIAL MEANINGS

WILDCARDS: THE "." CHARACTER - "ANYTHING"

The period-character is recognized by SED in many instructions to
mean "any single character" (no matter what it is). This is
analogous to the way a question mark is used to mean any single
character in a DOS file name. But "." doesn't *always* mean "any
single character" - and this is true for a lot of the special
characters in regular expressions. In certain kinds of
instructions, they mean themselves, literally. That'll be dealt
with in the documentation when necessary.

Example: the expression:

ab.dE

describes any of the following:

abcdE abCdE abXdE ab5dE

The expression:

ab..dE

describes any of the following:

abccdE abX1dE ab3 dE

A space appears within the string "ab3 dE" above; the "."
character, in representing "any character," can also represent a
space, or TAB or any other CTRL character.

If you use an expression with the "." in its escaped form - like:

ab\.dE

then the expression will describe *only* the following string
literal:

ab.dE
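
A quick way to check the difference between "." and "\." - a sketch
only, assuming a Unix-style shell and a POSIX-style sed rather than the
DOS version; the word list and variable names are invented for the
test:

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
words='abcdE
ab.dE
abdE'
# "." matches any one character, so two of the three lines qualify.
any=$(printf '%s\n' "$words" | sed -n '/ab.dE/p')
# "\." matches only a real period.
literal=$(printf '%s\n' "$words" | sed -n '/ab\.dE/p')
printf 'unescaped:\n%s\nescaped:\n%s\n' "$any" "$literal"
```

"abdE" is never printed: there is no character sitting where the "."
wants one.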

Say you want to change the string "yes. However," to the string
"yes. BUT,". You try the following substitution command:

s/yes\. However,/yes\. BUT,/

The result of the substitution would be text which reads:

yes. BUT,

The instruction:

s/yes\. However,/yes. BUT,/



would be good enough. "." doesn't have to be escaped on the
"replace-with" side of the instruction.


THE ^ CHARACTER - beginning of line

When the caret ( ^ ) character appears at the *very beginning* of a
regular expression it refers to the start of a line. If it
appears anywhere else in a regular expression, it means the caret
character itself, except in instances I'll discuss in the section
on "character classes." The instruction:

s/^h/H/

would mean "search for lines beginning with lower-case "h" and
replace the 'h' there with a capital 'H'." Note that the ^
character has NOT been used on the "replace-with" side of the
instruction. If it appeared there, it would tell SED to put the
caret *itself* in the replacement text.

That kind of instruction will not alter the line boundary itself
(carriage return/line feed pair - or line feed alone).

No OTHER "h" would be changed by the instruction just shown.
Reason: there will only be one "h" on a line which meets the
criterion: "Lower-case 'h' appearing at very beginning of line."
So the "g" operator has not been used in the substitution
command. It isn't needed; there is no possibility of *more* than
one occurrence, on a given line, of an "h" as the *very first*
character on the line. If you put the "g" there, no harm. It
might slow down processing a bit - but probably so little you'll
barely notice the delay.

(If this business about the "g" operator doesn't make much sense,
read the file U-SED-IT.1 before proceeding. Substitution
commands, and the characters which modify them, are discussed in
that file.)

You can give a substitution command to replace the beginning of
the line with something. It doesn't alter the line boundary
itself, but it will actually add text to the beginning of a line.
Example:

s/^/XXX/

would add the character string "XXX" to the beginning of every line.
The instruction:

/^I don't/ s/^/It's ridiculous, but /

would find lines beginning with "I don't" and do a substitution
like this - before:




I don't know what's going on.

After:

It's ridiculous, but I don't know what's going on.
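
Here is the same idea as something you can run - a sketch, assuming a
Unix-style shell and a POSIX-style sed (not the DOS build), with a
second made-up input line added so the "s/^h/H/" instruction from
earlier has something to do:

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
text="I don't know what's going on.
hello there"
# First -e: on lines starting with "I don't", insert text at line start.
# Second -e: capitalize an "h" that is the very first character.
result=$(printf '%s\n' "$text" |
  sed -e "/^I don't/ s/^/It's ridiculous, but /" -e 's/^h/H/')
printf '%s\n' "$result"
```

The "h" in "there" is untouched, since only an "h" at the very start of
a line fits "^h."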

This makes for some interesting possibilities if you use
ANSI.SYS as a screen driver and want to use SED in batch
files: you can use ANSI.SYS commands to place the results of
SED's processing at specific places on the screen. Example:
you've already seen in the file U-SED-IT.1 that the "="
character, used on the command line, prints a line number with
each line.

sed -n $= inputfile

would print *only* the line number of the last line.

In the following batch file fragment, I will use the character
string "<ESC>" to represent the actual ESC character as used in
ANSI.SYS screen control commands. "<ESC>[s" saves the current
cursor position. "<ESC>[u" goes to the saved position, and
"<ESC>[K" deletes from the current cursor position to the end of
the line:

echo The number of lines in the file is <ESC>[s...
sed -n $= inputfile | sed "s/^/<ESC>[u<ESC>[K/"

Without the ANSI.SYS commands (i.e., if the "piped" SED command -
to the right of the vertical bar - were *not* present at all) the
result on the screen for a 3,507-line file would be:

The number of lines in the file is ...
3507

If the piped SED command *is* present, the batch file takes the
line number display and gives a SED instruction to replace the
start of the line with the ANSI.SYS commands for "restore cursor
position" and then "delete from cursor position to end of line."
The result would be:

The number of lines in the file is 3507

The hopelessly fancy version of such an instruction would add a
second editing instruction to the second SED command, which would
look like:

sed -e "s/^/<ESC>[u<ESC>[K/" -e s/$/./







This would, as it were, also replace the *end* of the input line
with a period (it wouldn't actually alter the line boundary; see
below). The screen display would look like:

The number of lines in the file is 3507.

A bit far-fetched? Maybe - but it can be done.


THE $ CHARACTER - end of line

When a dollar sign appears at the *very end* of certain regular
expressions - with *nothing* following it other than the "end of
regular expression" delimiter - it tells SED to find the end of a
line.

If the text to search is:

This is LINE ONE
This is not LINE ONE - sorry about that
This is ONE line in this text file

then the following SED instruction:

sed -n /ONE$/p

would tell SED to print *only* the first line - the line that ends
with the word "ONE". If you had given the following instruction:

sed -n -e /ONE/p

then SED would have printed all three lines; they all contain the
character string "ONE".

As with the caret character, when you have used the dollar sign
on the "search-for" side of a substitution command you should NOT
use it on the "replace-with" side unless you intend for a literal
dollar sign to be added to the replaced text.

s/ONE$/TWO$/g

would turn the following:

This is ONE version of line ONE

into:

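This is ONE version of line TWO$

(Only the "ONE" at the very end of the line matches "ONE$," so the
first "ONE" is left alone, "g" or no "g.")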







On the other hand:

s/ONE$/TWO/

would have changed the line to read:

This is ONE version of line TWO

As with the caret-sign, specifying an end-of-line with "$" (in a
substitution instruction) does not actually alter the carriage
return/line feed pair at the end of the line.

Another example: suppose you want to kill all trailing spaces -
those which appear at the ends of lines. You don't want to remove
any other characters on the line - including spaces in other
locations:

s/ \{1,\}$//

This describes *only*: "one or more spaces, then end of line."
(More on the construction \{ \} in a little while.)
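
The end-of-line examples from this section, gathered into one sketch.
Assumptions: a Unix-style shell and a POSIX-style sed rather than the
DOS build; the variable names and the "keep  inner spaces" test line
are mine.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
lines='This is LINE ONE
This is not LINE ONE - sorry about that
This is ONE line in this text file'
# Print only lines that END with "ONE".
ends_with_one=$(printf '%s\n' "$lines" | sed -n '/ONE$/p')
# Replace only the "ONE" at the end of the line.
swapped=$(printf '%s\n' 'This is ONE version of line ONE' | sed 's/ONE$/TWO/')
# Strip trailing spaces; spaces elsewhere on the line stay put.
trimmed=$(printf '%s\n' 'keep  inner spaces   ' | sed 's/ \{1,\}$//')
printf '%s\n%s\n[%s]\n' "$ends_with_one" "$swapped" "$trimmed"
```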

In the event a literal $ character must be the last one on the
"search-for" side of an expression, it should be escaped with
the \ character, so that SED doesn't take it to mean "end of
line."

WILDCARDS: THE "*" CHARACTER - "ZERO OR MORE OCCURRENCES"

If "*" appears on the "search-for" side of a SED substitution
instruction (and in some other kinds of instructions) it means:
"0 or more occurrences of the character *to the immediate left* of
the "*." It does *not* mean: "0 or more occurrences of the word or
entire regular expression immediately preceding the "*."

The instruction:

s/.x*/G/g

would say to SED: "Look for any character, followed by zero or
more occurrences of "x" in a row, and replace ALL of the "x"
characters with the single character "G" - so if your text reads:

Hello, this is a letter "x".
Here are two of them: xx.
Here are three, used nonsensically: hiyaxxxbye.

then the instruction shown just above should result in:

Hello, this is a letter "G".
Here are two of them: G.
Here are three, used nonsensically: hiyaGbye.





Ok, that's what I am SUPPOSED to say about the use of the "*"
character. However, in playing around with three different
versions of SED, I have not been able to get "*" to work quite as
advertised, so to speak. Using it the way shown just above
results in something really weird. *All of the characters in the*
*file* are changed to capital Gs. Wonderful. (In hindsight, the
expression asks for exactly that: *every* character is "any
character followed by zero or more x's," since zero of them will
do.)

When I tried that instruction with GNU's sed, it put a capital
"G" at the *beginning* of each line, and nowhere else (didn't
replace any of the "x" characters).

The following instruction, however, *did* work:

s/xx*/G/g

You can see why this makes sense: the regular expression: xx*
means: a lower-case "x," followed immediately by *zero or more*
lower-case "x"s. In other words, a single "x" satisfies that
criterion; so do two "x"s in a row; so do three - and so on.

The simpler form: x* runs into the same "zero" trouble: no "x" at
all satisfies it, so it can match the empty string at the very
start of a line.

That is why I like the version of SED included with these docs:
it supports "iteration" - more on that shortly. You can play
around with using "*" to mean "0 or more," but you might find the
results rather bizarre and disappointing. The same is true of the
+ character, discussed below.

One way in which I *have* been able to get the asterisk to work
reliably is in a situation like this: you want to search for any
line beginning with spaces and kill *all* of the spaces at the
beginning of a line:

s/^ *//

Something like this also works:

s/Hd*/HELLO/g

which would mean, search for a capital "H" followed by 0 or more
occurrences of lower-case "d" - and then replace the entire
"found string" with the word "HELLO." An absurd example, but it
works. This kind of thing also works:

sed -n "/H.*g/p" inputfilename

In other words: look for a capital "H" followed by zero or more
of *anything* and then a lower-case "g" - print only lines on
which such a string is found.

None of the problems I've encountered with "*" are discussed in
any SED documentation I have seen to date, so you figger it out.
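
For what it's worth, here is a small sketch of the forms that behave
predictably. It assumes a Unix-style shell and a POSIX-style sed (GNU
sed, say), not the DOS build; other versions may differ, and the sample
strings and variable names are invented for the test.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
# "xx*" = one "x", then zero or more: a whole run becomes one "G".
run=$(printf '%s\n' 'hiyaxxxbye' | sed 's/xx*/G/g')
# ".x*" = ANY character, then zero or more x's: everything matches.
every=$(printf '%s\n' 'axxb' | sed 's/.x*/G/g')
# "^ *" = start of line, then zero or more spaces.
lead=$(printf '%s\n' '   indented' | sed 's/^ *//')
printf '%s\n%s\n%s\n' "$run" "$every" "$lead"
```

In the second case "axx" is one match and "b" is another, so only two
Gs come out.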





I have to guess that there are restrictions on the use of the
asterisk and that various writers of SED documentation just
"forgot" to mention them (well, shucks - anyone can make a
mistake). It's also possible that several different versions of
SED are just plain buggy (or "eccentric," perhaps).

Escaping the asterisk: if you have an asterisk that you want to
describe as literal text in a substitution command, precede it
with the escape character:

s/\*\*Files found://g

would find the character string:

**Files found:

and delete it. This kind of instruction:

s/**Files found://g

seems to work sometimes, and then sometimes it doesn't. I haven't
figured out why it would or wouldn't. Well, just experiment with
it. If in doubt, put the escape character in front of the
asterisk.

If these occasional problems with "*" (and "+") are not bugs,
then perhaps the conclusion is: the characters can only be used
successfully when: 1) they modify not the first character in a
regular expression, rather the one in *at least the 2nd position;*
2) they modify a space, but *only* if the space follows a
beginning-of-line character ( ^ ) or some alphabetic character.
Past that, I dunno.


WILDCARDS: the "+" character - one or more occurrences

The function of "+" in a regular expression is similar to that of
the asterisk; but whereas "*" means "0 or more occurrences," the
plus-sign means "*one or more* occurrences."

I leave it to you to do most of the experimenting with this one.
Keep in mind that, as with the use of the asterisk, there are
times when "+" just plain doesn't work the way you think it will.

If you would like to remove any number of spaces at the beginning
of a line (no matter how many), this doesn't work:

s/^ +//g

It *does* work with the GNU version of SED. So, with the version of
SED included with these docs, back to "iteration" to be on the
safe side:




s/^ \{1,\}//

As with the "*" character, escape the plus-sign with a \ when
specifying it as a string literal on the "search-for" side of an
instruction.


WILDCARDS: the "&" character - "whatever you just found"

SED supports the use of "&" to mean "whatever has been described
on the search-for side of a substitution command." If the
instruction reads:

s/yes/& and no/g

and the input line reads:

I said: yes

the result is:

I said: yes and no

If the text reads:

I said hello

and the instruction reads:

s/hello/goodbye and &/g

the result is:

I said goodbye and hello

In other words, you don't have to retype the "search-for"
information at all if you use the & character this way.

*** NOTE: Most SED wildcard characters do NOT have to be
escaped on the "replace-with" side of a substitution command. The
"&" character is one of the few exceptions to this rule. Always
escape it if you want it treated as a string literal on the
right-hand side of the substitution command.

Wrong: s/milk & cookies/MILK & COOKIES/g

Right: s/milk & cookies/MILK \& COOKIES/g
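
You can watch the "wrong" and "right" versions diverge with a sketch
like this one. It assumes a Unix-style shell and a POSIX-style sed
rather than the DOS build; the variable names are mine.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
# "&" re-inserts the matched text.
both=$(printf '%s\n' 'I said: yes' | sed 's/yes/& and no/g')
# Unescaped "&" on the replace-with side drags the whole match back in.
wrong=$(printf '%s\n' 'milk & cookies' | sed 's/milk & cookies/MILK & COOKIES/g')
# "\&" is a plain ampersand.
right=$(printf '%s\n' 'milk & cookies' | sed 's/milk & cookies/MILK \& COOKIES/g')
printf '%s\n%s\n%s\n' "$both" "$wrong" "$right"
```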








CHARACTER CLASSES

This is where you can begin to feel the power. Can you FEEL THE
POWER, BROTHERS AND SISTERS? Sorry ... got carried away, there.

Stated simply, an instruction involving a class of characters
tells SED to find any *single* character matching some non-literal
description. So, for instance, you can get around SED's
case-sensitivity by using a "class" instruction. If you want to
search for any one of the following words:

reserves Reserves reserved Reserved

You don't need to specify each word in four different
instructions. Do this instead:

[Rr]eserve[sd]

You've seen this kind of example before. The construction: [Rr]
means a single character that is *either* a capital *or* a lower case
'R'. The construction: [sd] means a single character that is
either a lower case "s" or a lower case "d."

The construction:

a[Rr]\{2,\}

would describe any of the following:

aRRr aRrRrr arrRRr

and so on.

The construction:

[Aa][Rr]\{2,\}

would describe any of the following:

ARRr aRrRrr ArrRRr arR

That is: "A single character - either 'A' or 'a' - followed by ..."
[etc.]

You can combine this kind of thing with "grouping." This
construction:

[Aa]\([Rr]\{2,\}\)

would place any string of two or more "R"s, either caps or lower
case, into a numbered group that could be referred to on the
"replace-with" side of a substitution command. Uh-oh ... I
haven't yet explained grouping. Sorry about that. I'll get to it.

More examples:

[AbcD94X]

would describe to SED: "any *single* character which is any one of
the following - but no *other* single character:

A b c D 9 4 [or] X

This construction:

[Aa][Bb][Oo][Uu][Tt]

would describe:

About ABout aBOut aboUt AboUT

to name only a few. Because a character class describes only a
*single* character, you can follow a class instruction with a
wildcard: "*" or "+" or an iteration command (to be discussed
shortly):

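string to find: [ x-]*

would mean the literal text "string to find: " followed by zero
or more characters, each of which is a space, an "x," or a
hyphen, in any mixture. The hyphen is placed last inside the
brackets so that SED won't read it as part of a range (see
"RANGES OF CHARACTERS," below).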

The number of possible permutations and combinations of grouping,
character classes, and iteration, is staggering. And it's what
makes SED so gosh-darned powerful. But wait - there's more. It
gets more powerful yet. (Am I gushing? Sorry.)


RANGES OF CHARACTERS

You can indicate a *range* of characters to be searched for within
a class construction. This:

[A-Z]


refers to any single character from "A" through and including
"Z." A range like:

[G-M]

would limit the range to any single character from "G" through/
including "M." Given that instruction, SED, being the
case-sensitive monster that it is, would not find any character in
the range of from "g" to "m" - just capital letters.

[a-z] and [g-m]

would be used to find the same two ranges of characters, only
lower case. To take care of the entire alphabet - upper *or* lower
case:

[A-Za-z]

Right. You can put the range instructions right next to one
another. Given such an instruction, SED will *not* try to find a
hyphen; the hyphen will be assumed to be part of a range, not a
character to be searched for in its own right.

What about numerals? This:

[0-9]

Finds any single integer within the range of 0 through/including
9. (Note - doing it like this: [9-0] - *won't* work.)

[A-Za-z0-9.,!?]

would find any single character in the range of from A to Z, a to
z, 0 to 9, or any single one of these characters: period, comma,
exclamation point, question mark.


SPECIFYING EXCEPTIONS

If a character class instruction begins with a caret mark, the
caret has a special meaning: it tells SED to find any single
character *except for* the ones shown between the [ and ]
characters:

[^ABC]

means any single character *except* capital "A," capital "B," or
capital "C." Given an instruction like:

[^ABC]xxx

SED would find:

Exxx 3xxx ixxx

but would *not* find:

Axxx Bxxx Cxxx





This:

[^ABCe ]

means any single character *except*

A B C e or a space.

Right: a space is not ignored when it is inside a character-class
instruction. The construction:

[^A-Z]

means any single character *except* a capital letter in the range
of from A through/including Z - thus eliminating all capital
letters, in other words. This:

[^E-H.3\ ]

means any single character which is *not* a capital E, F, G, or H,
or a period, or a 3, or a space, or a backslash.

Remember: In order for this to work properly, the ^ character
must be the *first* one to the right of the [ that begins the
character-class construction. If it is anywhere else within the
brackets, it represents *itself* as a string literal.
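
A short sketch of both caret behaviors, assuming a Unix-style shell and
a POSIX-style sed rather than the DOS build (the word list, the "a^b"
test string, and the variable names are invented here):

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
list='Axxx
Exxx
3xxx
Bxxx
ixxx'
# Caret first: any single character EXCEPT A, B, or C, then "xxx".
not_abc=$(printf '%s\n' "$list" | sed -n '/[^ABC]xxx/p')
# Caret not first: it is just another member of the class.
caret=$(printf '%s\n' 'a^b' | sed 's/[b^]/_/g')
printf '%s\n%s\n' "$not_abc" "$caret"
```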


CHARACTERS WHICH DO, AND DON'T, HAVE TO BE "ESCAPED" INSIDE A
CHARACTER CLASS

Virtually all of the special characters, placed inside the square
brackets, don't have to be escaped in order to represent
themselves, literally.

"]" must be done as: \] unless it is in the *first* position.

"[" can be in any position within the square brackets without
needing to be escaped.


WILDCARDS: THE \( \) CONSTRUCTION
(NUMBERED GROUPS)

The "&" shortcut allows you to keep track of an entire
expression. But SED can also keep track of up to nine groups -
separate buffers in memory - containing *portions* of a searched-
for string. Each buffer (or "group," as I'm calling it here) is
enclosed within a two-character construction: \( to begin the
group, and \) to end the group.





The first group construction in the expression can then be
referred to via the following shorthand:

\1

The second group, becomes: \2. The third becomes: \3 - and so on.

If a line of text in the input file reads as follows:

one two three four

and the SED substitution command looks like this:

s/\(one\) \(two\) \(three\) \(four\)/\4\2\1\3/

The result would be:

fourtwoonethree

For crying out loud, already. You didn't remember to include the
spaces on the "replace-with" side of the substitution command;
and none of the groups contains a space. Use this instruction
instead:

s/\(one\) \(two\) \(three\) \(four\)/\4 \2 \1 \3/

for the following result:

four two one three
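
The reordering can be tried like so. This is a sketch, assuming a
Unix-style shell and a POSIX-style sed rather than the DOS build. The
second command is my own variation, not something from the text above:
it swaps the first two words of a line whatever they happen to be.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
reordered=$(printf '%s\n' 'one two three four' |
  sed 's/\(one\) \(two\) \(three\) \(four\)/\4 \2 \1 \3/')
# Variation (hypothetical example): each group is "zero or more
# non-space characters"; the ^ stays OUTSIDE the group.
swapped=$(printf '%s\n' 'hello world again' |
  sed 's/^\([^ ]*\) \([^ ]*\)/\2 \1/')
printf '%s\n%s\n' "$reordered" "$swapped"
```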

Alternately:

s/\(one \)\(two \)\(three \)\(four\)/\4 \2\1\3/

would have the same result as the instruction shown right above
it, apart from a trailing space after "three" (the third group
includes one). Note that the "\4" is followed by a space. If you
assume that the word "four" at the end of the input line is *not*
followed by a space, then you have to *create* one on the
"replace-with" side of the substitution.

I have not investigated these memory buffers to the hilt, but
here are a few observations about them:

They *can* be used to delimit virtually every kind of expression,
but if your regular expression includes, say, the ^ or $
characters, meaning "beginning of line" and "end of line,"
respectively, *don't* put the ^ or $ characters *within* the group
construction.

Example: you have written some documentation containing a number
of different file names. But now you need to update the
documentation. Files with names like HELLO1.DOC and HELLO5.DOC
need to be renamed to HELLO.001 and HELLO.005. You don't know how
many files there are having various kinds of names, but you'd
like to update the names without having to do a lot of manual
searching or replacing. Try something like:

s/HELLO\([0-9]\)\.DOC/HELLO.00\1/g

Blow-by-blow description:

s/HELLO       The string to find begins with "HELLO" and
              is followed by ...

\([0-9]\)     Any *single* integer within the range of from 0 to
              9 - and this is placed into a buffer (a numbered
              group). *Whatever* is found in that position on the
              line will be placed into the numbered group *if* it
              matches the criterion: "Any single integer within
              the range of from 0 to 9." ... followed by ...

\.DOC         A literal period, followed by the string "DOC"

/             End of search. Now, replace that with ...

HELLO         Same primary part of the file name, followed
              by...

.00           Period, two zeroes, and ...

\1            Whatever number appeared in the "found" group

If the file name is HELLO5.DOC, then the number "5" is placed
into the first numbered group (in this example, the *only* numbered
group) and then referred to on the "replace-with" side of the
instruction as: \1.
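
Here is that rename instruction in a form you can run - a sketch,
assuming a Unix-style shell and a POSIX-style sed rather than the DOS
build. The sample sentence (including the HELLO12.DOC that should NOT
change) and the variable name are made up.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
# The group captures exactly one digit; \1 puts it back after ".00".
renamed=$(printf '%s\n' 'See HELLO1.DOC and HELLO5.DOC, but not HELLO12.DOC' |
  sed 's/HELLO\([0-9]\)\.DOC/HELLO.00\1/g')
printf '%s\n' "$renamed"
```

HELLO12.DOC is skipped because the expression allows only a single
digit between "HELLO" and the period.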

The open or closed parentheses themselves don't need to be
"escaped" with a \ character when they appear on either side of
a substitution command.

Note: the abbreviations for the buffers - \1, \2, and so on -
cannot appear on the "search-for" side of a substitution command
- only on the "replace-with" side.

You can't nest group delimiters inside one another - no groups-
within-a-group.









ITERATION - THE \{ \} CONSTRUCTION

The version of SED included with this documentation supports an
extremely useful function called iteration: the ability to count
occurrences of a string or regular expression. The biggest range
of iteration commands I've run across is supported by the Mix
Software version of SED. Like so:

\{x\}      "x" occurrences only
\{x,\}     "x" or more occurrences
\{x,y\}    from "x" to "y" occurrences
\{,x\}     up to, but no more than, "x" occurrences

The enclosed version of SED supports the first three forms shown.

Suppose the following is a line in the input file:

Here is a group of "a" characters: a aa aaa aaaa aaaaa

The expression:

a\{2\} (meaning "two only")

will tell SED to find all but the single character "a," as does:
a\{2,\} (two or more)

The expression:

a\{1,\} (meaning "one or more")

describes all of them.

The expression:

a\{3,\} ("three or more")

would correctly describe only the third, fourth, and fifth
strings.
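
To make the counting visible, this sketch replaces each match with a
capital "X." It assumes a Unix-style shell and a POSIX-style sed, not
the DOS build; results from other versions may differ. Variable names
are mine.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
line='a aa aaa aaaa aaaaa'
# Three or more: each long run is swallowed whole.
three_up=$(printf '%s\n' "$line" | sed 's/a\{3,\}/X/g')
# Exactly two: longer runs are chewed up two "a"s at a time.
two_only=$(printf '%s\n' "$line" | sed 's/a\{2\}/X/g')
printf '%s\n%s\n' "$three_up" "$two_only"
```

Note that "two only" still finds pairs *inside* the longer strings; it
does not insist that the pair stand alone.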

With this version of SED, as with the "*" and "+" characters, an
iteration construction modifies only the single character TO ITS
IMMEDIATE LEFT. It does not modify a longer string of characters
than that. If the single character is part of a range of
characters (see "RANGES OF CHARACTERS," above) then iteration will
work with the range (since the range still describes only a
*single* character).

I am told that iteration consumes a fair amount of memory. If you
have a lot of iteration commands within a single SED instruction
on the command line, and perhaps even on several different lines
in a script file, it's conceivable that you could run low on, or
even out of, memory. It hasn't happened to me yet, but it's
something to look forward to, anyway.




The { and } characters don't seem to need to be escaped when
they represent only themselves. It makes sense: in this instance,
as with the \( \) construction, if you were to escape the braces,
SED might think you were giving an incomplete iteration
instruction and would probably bomb out with an error message.

Practical example: the place where I work often gets text on disk
from customers; the files will eventually become part of some
book or magazine. It's typical for people who learned to type in
high school or at secretarial schools to add two spaces following
punctuation. It's not right at all for typesetting or desktop
publishing, and those extra spaces must be removed.

People also often do strange things like indenting lines using
spaces - guaranteed to make a mess when the text gets to any
program that uses proportional, not monospaced, fonts.

Finally, customers often use spaces, not TABs, to line up columns
in tabular data. It works fine on the screen and when printing
out the text to a dot-matrix printer using a monospaced font.
It's a complete disaster at our end. The challenge, then, is to
strip out leading spaces; remove multiple spaces following
punctuation; and preserve the look of tabular copy as much as
possible. While the following instructions, used in a script
file, don't deal with all possible spacing snafus, they do take
care of most of the problems in one pass through a file:

s/^ *//
s/ \{3,\}/<TAB>/g
s/  / /g

First:

s/^ *//

This instruction says: "find beginning of line, followed by zero
or more spaces, and kill *all* of the spaces." If there are no
spaces at the beginning of the line, nothing will happen at the
beginning of the line.

s/ \{3,\}/<TAB>/g

That one says: Search for three or more spaces in a row, and
replace them with a single TAB character. In the real script
file, where I've used "<TAB>" above, there would be an actual TAB
character (ASCII decimal value: 9).

Keep in mind: given an instruction like this SED will consider
the entire *contiguous* group of spaces as the string to be
replaced. It will not replace them with multiple TAB characters
in a row. Thus:





Col. 1     Col. 2     Col. 3     Col. 4

becomes:

Col. 1<TAB>Col. 2<TAB>Col. 3<TAB>Col. 4

Next:

s/  / /g

By the time the TABs have been processed, there are no more
occurrences of three or more spaces in a row - only two in a row
(if indeed there are any strings of two or more left). The above
instruction finds two spaces in a row, replacing them with one
only. I could also have done it this way:

s/ \{2\}/ /g ("two only" - or even:)

s/ \{2,\}/ /g

but it wouldn't have been necessary.
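
The three-instruction cleanup can be tested on a single line like this.
It is a sketch under assumptions: a Unix-style shell and a POSIX-style
sed rather than the DOS build, with the TAB character produced by
printf instead of being typed into a script file. The sample line and
variable names are invented.

```shell
# Assumption: POSIX shell + POSIX-style sed, not the DOS SED.EXE.
tab=$(printf '\t')   # a real TAB character (ASCII 9)
raw='   Col. 1     Col. 2   end.  Done'
# 1) strip leading spaces, 2) runs of 3+ spaces -> one TAB,
# 3) any remaining double space -> single space.
clean=$(printf '%s\n' "$raw" |
  sed -e 's/^ *//' -e "s/ \{3,\}/${tab}/g" -e 's/  / /g')
# Show the result with the TABs made visible.
printf '%s\n' "$clean" | sed "s/${tab}/<TAB>/g"
```

The order matters: if the two-spaces instruction ran first, it would
break up the long runs before the TAB instruction could see them.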


REVIEW

Special characters:

.       Meaning *any* character

&       Used on the "replace-with" side of a substitution
        command; means "whatever you found on the
        'search-for' side."

*       Means "zero or more" of whatever character or
        character class is immediately to its left (including
        spaces)

+       Same idea as "*" - but means "one or more
        occurrences."

^       At very beginning of regular expression, means "at
        beginning of line."

$       At very end of regular expression, means "at end
        of line."

/       Delimiter for expressions; substitution commands

\       Used to "escape" other characters - itself escaped
        via: \\





Character classes and ranges:

[ABC]     Indicates any single character; any one of the
          ones enclosed within brackets.

[A-Z]     Indicates any single character in the range of from
          A to Z; similar constructions used for "range of
          'a' to 'z'" and "range of '0' to '9'."

[^ABC]    With caret at start of character class, indicates
          any single character *except for* the characters
          enclosed within the brackets.

Other "constructions":

\( \)     Establishes a numbered group; each group is
          referred to in the order it was set up - as
          \1 for the first, \2 for the second, and so on.
          Maximum number of groups per instruction: 9.

\{ \}     Iteration - modifies the single character or
          character class to its immediate left.

          \{a,b\}   From "a" to "b" occurrences
          \{a,\}    "a" or more occurrences
          \{a\}     *Exactly* "a" occurrences

( E N D )