Dec 26, 2017
GNU Recode 3.2.4 translates documents between various character sets: ebcdic, ascii, ibm_pc, latin1, etc. Includes additional support for French and Quebecois usage. C source code, copyrighted freeware.

Full Description of File


GNU Recode 3.2.4 is a program that converts
documents between various character sets:
ascii, ebcdic, latin1, ibm_pc etc. Special
support for French and Quebecois usage. C
source code. Copyright FSF GNU freeware.


File RECOD324.ZIP from The Programmer’s Corner in
Category UNIX Files
File Name      File Size   Zip Size   Zip Type
FILE_ID.DIZ          219        172   deflated
RECOD324.TAR      542720     128527   deflated
RECODE.DOC         39356      13749   deflated


Contents of the RECODE.DOC file


GNU Recode 3.2.4 Copyright 1993 FSF

Conversion of files between different charsets and usages

This `recode' program has the purpose of converting
files between various character sets and usages. When exact
transliterations are not possible, as is often the case,
the program may get rid of the offending characters or fall
back on approximations.

Let us coin the term charset to represent, without
distinction, a character set ``per se'' or a particular usage
of a character set. This program recognizes or produces a
little more than a dozen such charsets. Since it can
convert each charset to almost any other one, more than one
hundred different conversions are possible.

This tool pays special attention to superimposition of
diacritics, particularly for French representation. This
orientation is mostly historical; it does not impair the
usefulness, generality or extensibility of the program. In
fact, this program evolved over several years, through
several programming languages and computer brands, because I
used many different codings for French characters on
different machines, each system having its own peculiarities.

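You may find in this document:

o How to use this program
o Character sets recognized or produced
o Easy French conventions
o Internal aspects
o Future things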

1.1. How to use this program

The general format of the program call is:


2 recode reference manual



recode [OPTION]... [before]:[after] [file]...



Each file is read assuming it is coded with charset
before, and is recoded over itself so as to use charset
after. If no file is given, the program instead acts as a
filter and recodes standard input to standard output.

The available options are:

-C Given this option, all other parameters and op-
tions are ignored. The program prints briefly the
Copyright and copying conditions. See the file
`COPYING' in the distribution for full statement
of the Copyright and copying conditions.

-c With Easy French conventions, use the colon : instead
of the double-quote " for marking diaeresis.
See section Easy French.

-f This option is recognized, but otherwise ignored.
Eventually, this option will be necessary for a
file to be replaced by its recoded contents, if it
is found that the recoding is not fully reversible.
In this version, the replacement is done
unconditionally.

-i When the recoding requires a combination of two or
more elementary recoding steps, this option forces
many passes over the data, using intermediate
files between passes. This is the default be-
haviour when files are recoded over themselves.
If this option is selected in filter mode, that
is, when the program reads standard input and
writes standard output, it might take longer for
programs further down the pipe chain to start re-
ceiving some recoded data.

-o When the recoding requires a combination of two or
more elementary recoding steps, this option forces
the creation of a chain of program instances ini-
tiated through the popen(3) library call, all op-
erating in parallel. In filter mode, at cost of
some overhead, recoded data will be available soon
after the program starts, even if many elementary
recoding steps are required.

If, at installation time, the popen(3) call is
said to be unavailable, selecting option -o is
equivalent to selecting option -i.

-p When the recoding requires a combination of two or
more elementary recoding steps, this option forces
the program to fork itself into a few copies in-
terconnected with pipes, using the pipe(2) system
call. All copies of the program operate in paral-
lel. This method is similar to the method used
through option -o, but is slightly more efficient.
This is the default behaviour in filter mode. If
this option is used when files are recoded over
themselves, this should save some disk accesses
and some disk space, at cost of more system over-
head.

If, at installation time, the pipe(2) call is said
to be unavailable, selecting option -p is equiva-
lent to selecting option -o. If both pipe(2) and
popen(3) are unavailable, selecting option -p is
equivalent to selecting option -i.

-t The touch option is meaningful only when files are
recoded over themselves. Without it, the timestamps
associated with files are preserved, to reflect
the fact that changing the code of a file
does not really alter its informational contents.
When the user wants the recoded files to be
timestamped at recoding time, this option inhibits
the automatic protection of the timestamps.

-v Before proceeding, the program will print on
`stderr' the list and order of application of the
elementary conversions planned to achieve
the global conversion. Then, the program will
print on `stderr' one message per file recoded, so
as to keep the user informed of the progress of the
command.
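
The timestamp behaviour described for option -t can be
illustrated with a minimal Python sketch. It is not taken from
the `recode' sources: the names recode_in_place and
latin1_to_ibmpc are hypothetical, and Python's own latin-1 and
cp437 codecs stand in for the latin1 and ibmpc charsets.

```python
import os

def latin1_to_ibmpc(data):
    # Stand-in conversion using Python's codecs (an assumption,
    # not recode's own tables).
    return data.decode("latin-1").encode("cp437", errors="replace")

def recode_in_place(path, convert, touch=False):
    """Rewrite a file over itself, keeping its timestamps unless touch is set."""
    before = os.stat(path)
    with open(path, "rb") as handle:
        data = handle.read()
    with open(path, "wb") as handle:
        handle.write(convert(data))
    if not touch:
        # Restore access and modification times, as happens without -t.
        os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))
```

Passing touch=True mirrors the -t option: the file simply
keeps the modification time set by the rewrite.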

One or both of the before or after keywords may be
omitted, but the colon which separates them cannot. An
omitted keyword implies the usual or default code in usage
on the system where this program is installed. Usually,
this default code is latin1 for UNIX systems or ibmpc for
MS-DOS machines, but it might be changed to any other sup-
ported code when this program is installed.
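
As an illustration of the call format, here is a small Python
sketch that only assembles the argument list; it does not run
`recode'. The helper name recode_command is hypothetical, and
only the syntax documented above is assumed.

```python
def recode_command(before="", after="", files=(), options=()):
    """Build the argument list: recode [OPTION]... [before]:[after] [file]..."""
    # The colon is always present, even when a keyword is omitted.
    return ["recode", *options, f"{before}:{after}", *files]

# Recode two files from ibmpc to latin1, verbosely.
print(recode_command("ibmpc", "latin1", ["a.txt", "b.txt"], ["-v"]))
# Filter mode from the default charset to latin1: no file is given.
print(recode_command(after="latin1"))
```

The resulting list could then be handed to a process
launcher such as Python's subprocess.run on a system where
`recode' is installed.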

1.2. Character sets recognized or produced

The possible values for charset before or charset after
correspond to the charsets described in the following
subsections.

1.2.1. ASCII 8-bits for Apple's Macintosh

The file was obtained from, or is intended for, a Macintosh
microcomputer from Apple. This is an eight-bit code. The
file is the data fork only.

1.2.2. ASCII 7-bits, BS to overstrike

The file is straight ASCII, seven bits only. According
to the definition of ASCII, diacritics are applied by a
sequence of three characters: the letter, one BS, the
diacritic mark. We deviate slightly from this by exchanging
the diacritic mark and the letter so that, on a screen device,
the diacritic will disappear and leave the letter alone. At
recognition time, both methods are acceptable.

The French quotes are coded by the sequences < BS " or
" BS < for the opening quote and > BS " or " BS > for the
closing quote. This artificial convention was inherited in
straight ascii from habits around bangbang entry, and is not
well known. But we decided to stick to it so that the ascii
charset will not lose French quotes.
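
The overstrike convention can be sketched in Python as
follows. The function names are hypothetical, and treating
"whichever side is alphabetic" as the letter is an assumption
made for this illustration, not a statement about how `recode'
recognizes the sequences.

```python
BS = "\b"  # ASCII back space, octal 010

def overstrike(letter, mark):
    """Produce the order used on output here: mark, BS, letter."""
    return mark + BS + letter

def parse_overstrike(sequence):
    """Accept either order and return (letter, mark)."""
    first, middle, second = sequence  # exactly three characters
    if middle != BS:
        raise ValueError("expected BS in the middle")
    return (first, second) if first.isalpha() else (second, first)

# The four documented sequences for French quotes.
OPENING_QUOTE = {'<' + BS + '"', '"' + BS + '<'}
CLOSING_QUOTE = {'>' + BS + '"', '"' + BS + '>'}
```

Both parse_overstrike("e" + BS + "'") and
parse_overstrike("'" + BS + "e") identify the letter e with
an apostrophe as its mark.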

1.2.2.1. Commented ASCII


oct dec hex name description

000 0 0 nul null character
001 1 1 soh start of header
002 2 2 stx start of text
003 3 3 etx end of text
004 4 4 eot end of transmission
005 5 5 enq enquiry
006 6 6 ack acknowledge
007 7 7 bel bell
010 8 8 bs back space
011 9 9 ht horizontal tab
012 10 a nl new line
013 11 b vt vertical tab
014 12 c np new page
015 13 d cr carriage return
016 14 e so shift out
017 15 f si shift in
020 16 10 dle data link escape
021 17 11 dc1 device control 1
022 18 12 dc2 device control 2
023 19 13 dc3 device control 3
024 20 14 dc4 device control 4
025 21 15 nak negative acknowledge
026 22 16 syn synchronize
027 23 17 etb end of transmitted block
030 24 18 can cancel
031 25 19 em end of medium
032 26 1a sub substitute
033 27 1b esc escape
034 28 1c fs file separator
035 29 1d gs group separator
036 30 1e rs record separator
037 31 1f us unit separator
040 32 20 sp space

177 127 7f del delete



1.2.2.2. Octal ASCII


000 nul 020 dle 040 sp 060 0 100 @ 120 P 140 ` 160 p
001 soh 021 dc1 041 ! 061 1 101 A 121 Q 141 a 161 q
002 stx 022 dc2 042 " 062 2 102 B 122 R 142 b 162 r
003 etx 023 dc3 043 # 063 3 103 C 123 S 143 c 163 s
004 eot 024 dc4 044 $ 064 4 104 D 124 T 144 d 164 t
005 enq 025 nak 045 % 065 5 105 E 125 U 145 e 165 u
006 ack 026 syn 046 & 066 6 106 F 126 V 146 f 166 v
007 bel 027 etb 047 ' 067 7 107 G 127 W 147 g 167 w
010 bs 030 can 050 ( 070 8 110 H 130 X 150 h 170 x
011 ht 031 em 051 ) 071 9 111 I 131 Y 151 i 171 y
012 nl 032 sub 052 * 072 : 112 J 132 Z 152 j 172 z
013 vt 033 esc 053 + 073 ; 113 K 133 [ 153 k 173 {
014 np 034 fs 054 , 074 < 114 L 134 \ 154 l 174 |
015 cr 035 gs 055 - 075 = 115 M 135 ] 155 m 175 }
016 so 036 rs 056 . 076 > 116 N 136 ^ 156 n 176 ~
017 si 037 us 057 / 077 ? 117 O 137 _ 157 o 177 del



1.2.2.3. Decimal ASCII


0 nul 16 dle 32 sp 48 0 64 @ 80 P 96 ` 112 p
1 soh 17 dc1 33 ! 49 1 65 A 81 Q 97 a 113 q
2 stx 18 dc2 34 " 50 2 66 B 82 R 98 b 114 r
3 etx 19 dc3 35 # 51 3 67 C 83 S 99 c 115 s
4 eot 20 dc4 36 $ 52 4 68 D 84 T 100 d 116 t
5 enq 21 nak 37 % 53 5 69 E 85 U 101 e 117 u
6 ack 22 syn 38 & 54 6 70 F 86 V 102 f 118 v
7 bel 23 etb 39 ' 55 7 71 G 87 W 103 g 119 w
8 bs 24 can 40 ( 56 8 72 H 88 X 104 h 120 x
9 ht 25 em 41 ) 57 9 73 I 89 Y 105 i 121 y
10 nl 26 sub 42 * 58 : 74 J 90 Z 106 j 122 z
11 vt 27 esc 43 + 59 ; 75 K 91 [ 107 k 123 {
12 np 28 fs 44 , 60 < 76 L 92 \ 108 l 124 |
13 cr 29 gs 45 - 61 = 77 M 93 ] 109 m 125 }
14 so 30 rs 46 . 62 > 78 N 94 ^ 110 n 126 ~
15 si 31 us 47 / 63 ? 79 O 95 _ 111 o 127 del



1.2.2.4. Hexadecimal ASCII


00 nul 10 dle 20 sp 30 0 40 @ 50 P 60 ` 70 p
01 soh 11 dc1 21 ! 31 1 41 A 51 Q 61 a 71 q
02 stx 12 dc2 22 " 32 2 42 B 52 R 62 b 72 r
03 etx 13 dc3 23 # 33 3 43 C 53 S 63 c 73 s
04 eot 14 dc4 24 $ 34 4 44 D 54 T 64 d 74 t
05 enq 15 nak 25 % 35 5 45 E 55 U 65 e 75 u
06 ack 16 syn 26 & 36 6 46 F 56 V 66 f 76 v
07 bel 17 etb 27 ' 37 7 47 G 57 W 67 g 77 w
08 bs 18 can 28 ( 38 8 48 H 58 X 68 h 78 x
09 ht 19 em 29 ) 39 9 49 I 59 Y 69 i 79 y
0a nl 1a sub 2a * 3a : 4a J 5a Z 6a j 7a z
0b vt 1b esc 2b + 3b ; 4b K 5b [ 6b k 7b {
0c np 1c fs 2c , 3c < 4c L 5c \ 6c l 7c |
0d cr 1d gs 2d - 3d = 4d M 5d ] 6d m 7d }
0e so 1e rs 2e . 3e > 4e N 5e ^ 6e n 7e ~
0f si 1f us 2f / 3f ? 4f O 5f _ 6f o 7f del



1.2.3. ASCII ``bang bang'', escapes are ! and !!

This is the local code in use on Cybers at Universite
de Montreal, which grave and serious people there prefer to
name ASCII code display. This code is also known as
Bang-bang. It is based on a six-bit character set in which
capitals, French diacritics and a few others are coded using an
! escape followed by a single character, and control
characters using a double ! escape followed by a single character.

The routines given here presume that the six-bit code
is already expressed in ASCII by the communication channel,
with embedded ASCII ! escapes.

Here is a table showing which characters are being used
to encode each ASCII character.


000 !!@ 020 !!P 040 060 0 100 @ 120 !P 140 !@ 160 P
001 !!A 021 !!Q 041 !" 061 1 101 !A 121 !Q 141 A 161 Q
002 !!B 022 !!R 042 " 062 2 102 !B 122 !R 142 B 162 R
003 !!C 023 !!S 043 # 063 3 103 !C 123 !S 143 C 163 S
004 !!D 024 !!T 044 $ 064 4 104 !D 124 !T 144 D 164 T
005 !!E 025 !!U 045 % 065 5 105 !E 125 !U 145 E 165 U
006 !!F 026 !!V 046 & 066 6 106 !F 126 !V 146 F 166 V
007 !!G 027 !!W 047 ' 067 7 107 !G 127 !W 147 G 167 W
010 !!H 030 !!X 050 ( 070 8 110 !H 130 !X 150 H 170 X
011 !!I 031 !!Y 051 ) 071 9 111 !I 131 !Y 151 I 171 Y
012 !!J 032 !!Z 052 * 072 : 112 !J 132 !Z 152 J 172 Z
013 !!K 033 !![ 053 + 073 ; 113 !K 133 [ 153 K 173 ![
014 !!L 034 !!\ 054 , 074 < 114 !L 134 \ 154 L 174 !\
015 !!M 035 !!] 055 - 075 = 115 !M 135 ] 155 M 175 !]
016 !!N 036 !!^ 056 . 076 > 116 !N 136 ^ 156 N 176 !^
017 !!O 037 !!_ 057 / 077 ? 117 !O 137 _ 157 O 177 !_
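
The regularities of this table can be summarized by a short
Python sketch. It is only a reading of the table, with a
hypothetical function name, and it assumes that codes 000 and
140 follow the same pattern as their neighbours (!!@ and !@).

```python
def to_bangbang(text):
    """Encode 7-bit ASCII text following the table above."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit ASCII character")
        if code < 0x20:
            out.append("!!" + chr(code + 0x40))  # controls: !!@ to !!_
        elif ch == "!":
            out.append('!"')                     # the escape itself
        elif "A" <= ch <= "Z":
            out.append("!" + ch)                 # capitals are escaped
        elif "a" <= ch <= "z":
            out.append(ch.upper())               # small letters are bare
        elif code >= 0x60:
            out.append("!" + chr(code - 0x20))   # ` { | } ~ del
        else:
            out.append(ch)                       # digits and punctuation
    return "".join(out)
```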



1.2.3.1. Control Data's Display Code


Octal display code to graphic Octal display code to octal ASCII

00 : 20 P 40 5 60 # 00 072 20 120 40 065 60 043
01 A 21 Q 41 6 61 [ 01 101 21 121 41 066 61 133
02 B 22 R 42 7 62 ] 02 102 22 122 42 067 62 135
03 C 23 S 43 8 63 % 03 103 23 123 43 070 63 045
04 D 24 T 44 9 64 " 04 104 24 124 44 071 64 042
05 E 25 U 45 + 65 _ 05 105 25 125 45 053 65 137
06 F 26 V 46 - 66 ! 06 106 26 126 46 055 66 041
07 G 27 W 47 * 67 & 07 107 27 127 47 052 67 046
10 H 30 X 50 / 70 ' 10 110 30 130 50 057 70 047
11 I 31 Y 51 ( 71 ? 11 111 31 131 51 050 71 077
12 J 32 Z 52 ) 72 < 12 112 32 132 52 051 72 074
13 K 33 0 53 $ 73 > 13 113 33 060 53 044 73 076
14 L 34 1 54 = 74 @ 14 114 34 061 54 075 74 100
15 M 35 2 55 75 \ 15 115 35 062 55 040 75 134
16 N 36 3 56 , 76 ^ 16 116 36 063 56 054 76 136
17 O 37 4 57 . 77 ; 17 117 37 064 57 056 77 073
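
Read as a lookup table, the display code above fits in one
64-character string indexed by code value. The following
Python sketch uses hypothetical names and simply transcribes
the table.

```python
# Graphics for display codes 00 to 77 (octal), in table order.
DISPLAY_CODE = (
    ":ABCDEFGHIJKLMNOPQRSTUVWXYZ"        # 00 to 32
    "0123456789"                         # 33 to 44
    "+-*/()$= ,.#[]%\"_!&'?<>@\\^;"      # 45 to 77
)

def display_to_ascii(codes):
    """Translate a sequence of six-bit display codes to ASCII text."""
    return "".join(DISPLAY_CODE[code] for code in codes)

print(display_to_ascii([0o10, 0o11]))  # display codes 10 and 11 are H and I
```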



1.2.4. ASCII 8-bits as seen by Perkin Elmer

This charset represents the way Concurrent Computer
Corporation (formerly Perkin Elmer) expresses EBCDIC using
ASCII.

1.2.5. ASCII 8-bits as seen by Control Data

This charset represents the way Control Data Corporation
relates EBCDIC to ASCII. We also select the lower half
of this table to do straight ASCII to EBCDIC conversions,
back and forth.

1.2.6. ASCII 6/12 from NOS, escapes are ^ and @

This is one of the charsets in use on CDC Cyber NOS
systems to represent ASCII, sometimes named NOS 6/12 code for
coding ASCII. This code is also known as caret ASCII. It
is based on a six-bit character set in which small letters
and control characters are coded using a ^ escape and,
sometimes, a @ escape.

The routines given here presume that the six-bit code
is already expressed in ASCII by the communication channel,
with embedded ASCII ^ and @ escapes.

Here is a table showing which characters are being used
to encode each ASCII character.


000 ^5 020 ^# 040 060 0 100 @A 120 P 140 @G 160 ^P
001 ^6 021 ^[ 041 ! 061 1 101 A 121 Q 141 ^A 161 ^Q
002 ^7 022 ^] 042 " 062 2 102 B 122 R 142 ^B 162 ^R
003 ^8 023 ^% 043 # 063 3 103 C 123 S 143 ^C 163 ^S
004 ^9 024 ^" 044 $ 064 4 104 D 124 T 144 ^D 164 ^T
005 ^+ 025 ^_ 045 % 065 5 105 E 125 U 145 ^E 165 ^U
006 ^- 026 ^! 046 & 066 6 106 F 126 V 146 ^F 166 ^V
007 ^* 027 ^& 047 ' 067 7 107 G 127 W 147 ^G 167 ^W
010 ^/ 030 ^' 050 ( 070 8 110 H 130 X 150 ^H 170 ^X
011 ^( 031 ^? 051 ) 071 9 111 I 131 Y 151 ^I 171 ^Y
012 ^) 032 ^< 052 * 072 @D 112 J 132 Z 152 ^J 172 ^Z
013 ^$ 033 ^> 053 + 073 ; 113 K 133 [ 153 ^K 173 ^0
014 ^= 034 ^@ 054 , 074 < 114 L 134 \ 154 ^L 174 ^1
015 ^ 035 ^\ 055 - 075 = 115 M 135 ] 155 ^M 175 ^2
016 ^, 036 ^^ 056 . 076 > 116 N 136 @B 156 ^N 176 ^3
017 ^. 037 ^; 057 / 077 ? 117 O 137 _ 157 ^O 177 ^4
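
The table above follows a few regular patterns, which the
Python sketch below spells out. The names are hypothetical;
the code is a reading of the table rather than the routine
used by `recode'. Note that the graphics following ^ for
control characters are the display code graphics for codes 40
to 77 (octal).

```python
# Graphic used after ^ for ASCII control characters 000 to 037.
CONTROL_GRAPHICS = "56789+-*/()$= ,.#[]%\"_!&'?<>@\\^;"

# Characters that need the @ escape.
AT_ESCAPES = {"@": "@A", "^": "@B", ":": "@D", "`": "@G"}

def to_nos_6_12(text):
    """Encode 7-bit ASCII text following the table above."""
    out = []
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit ASCII character")
        if code < 0x20:
            out.append("^" + CONTROL_GRAPHICS[code])
        elif ch in AT_ESCAPES:
            out.append(AT_ESCAPES[ch])
        elif "a" <= ch <= "z":
            out.append("^" + ch.upper())
        elif code >= 0x7B:
            out.append("^" + str(code - 0x7B))  # { | } ~ del give ^0 to ^4
        else:
            out.append(ch)
    return "".join(out)
```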



1.2.7. EBCDIC with no further comments

This charset is IBM's Extended Binary Coded Decimal
Interchange Code. This is an eight-bit code.

1.2.8. ASCII without diacritics nor underline

This code is ASCII expunged of all diacritics and
underlines, as long as they are applied using three-character
sequences, with BS in the middle. Also, although
slightly unrelated, each control character is represented by
a sequence of two or three graphic characters. The newline
character, however, keeps its functionality and is not
represented this way.

Note that charset flat is a terminal charset. We can
convert to flat, but not from it.

1.2.9. ASCII 8-bits for IBM's PC

The file was obtained from, or is intended for, a PC
microcomputer from IBM or any compatible. This is an eight-bit
code.

1.2.10. ASCII for the Unisys' ICON

The file uses the Unisys ICON way of representing
diacritics with 0x19 escape sequences. This is a seven-bit
code, even if eight-bit codes can flow through as part of
the IBM-PC charset.

1.2.11. ASCII with LaTeX codes

This charset is an ASCII file coded to be read by LaTeX
or, in certain cases, by TeX.

1.2.12. ASCII extended by Latin Alphabet 1

This charset corresponds to the ISO Latin Alphabet 1.
It is an eight-bit code which coincides with ASCII for the
lower half.

1.2.12.1. Commented Latin-1


oct dec hex description

240 160 a0 no-break space
241 161 a1 inverted exclamation mark
242 162 a2 cent sign
243 163 a3 pound sign
244 164 a4 currency sign

245 165 a5 yen sign
246 166 a6 broken bar
247 167 a7 paragraph sign, section sign
250 168 a8 diaeresis
251 169 a9 copyright sign
252 170 aa feminine ordinal indicator
253 171 ab left angle quotation mark
254 172 ac not sign
255 173 ad soft hyphen
256 174 ae registered trade mark sign
257 175 af macron
260 176 b0 degree sign
261 177 b1 plus-minus sign
262 178 b2 superscript two
263 179 b3 superscript three
264 180 b4 acute accent
265 181 b5 small greek mu, micro sign
266 182 b6 pilcrow sign
267 183 b7 middle dot
270 184 b8 cedilla
271 185 b9 superscript one
272 186 ba masculine ordinal indicator
273 187 bb right angle quotation mark
274 188 bc vulgar fraction one quarter
275 189 bd vulgar fraction one half
276 190 be vulgar fraction three quarters
277 191 bf inverted question mark
300 192 c0 capital A with grave accent
301 193 c1 capital A with acute accent
302 194 c2 capital A with circumflex accent
303 195 c3 capital A with tilde
304 196 c4 capital A diaeresis
305 197 c5 capital A with ring above
306 198 c6 capital diphthong A with E
307 199 c7 capital C with cedilla
310 200 c8 capital E with grave accent
311 201 c9 capital E with acute accent
312 202 ca capital E with circumflex accent
313 203 cb capital E with diaeresis
314 204 cc capital I with grave accent
315 205 cd capital I with acute accent
316 206 ce capital I with circumflex accent
317 207 cf capital I with diaeresis
320 208 d0 capital icelandic ETH
321 209 d1 capital N with tilde
322 210 d2 capital O with grave accent
323 211 d3 capital O with acute accent
324 212 d4 capital O with circumflex accent
325 213 d5 capital O with tilde
326 214 d6 capital O with diaeresis
327 215 d7 multiplication sign
330 216 d8 capital O with oblique stroke
331 217 d9 capital U with grave accent
332 218 da capital U with acute accent
333 219 db capital U with circumflex accent
334 220 dc capital U with diaeresis
335 221 dd capital Y with acute accent
336 222 de capital icelandic THORN
337 223 df small german sharp s
340 224 e0 small a with grave accent
341 225 e1 small a with acute accent
342 226 e2 small a with circumflex accent
343 227 e3 small a with tilde
344 228 e4 small a with diaeresis
345 229 e5 small a with ring above
346 230 e6 small diphthong a with e
347 231 e7 small c with cedilla
350 232 e8 small e with grave accent
351 233 e9 small e with acute accent
352 234 ea small e with circumflex accent
353 235 eb small e with diaeresis
354 236 ec small i with grave accent
355 237 ed small i with acute accent
356 238 ee small i with circumflex accent
357 239 ef small i with diaeresis
360 240 f0 small icelandic eth
361 241 f1 small n with tilde
362 242 f2 small o with grave accent
363 243 f3 small o with acute accent
364 244 f4 small o with circumflex accent
365 245 f5 small o with tilde
366 246 f6 small o with diaeresis
367 247 f7 division sign
370 248 f8 small o with oblique stroke
371 249 f9 small u with grave accent
372 250 fa small u with acute accent
373 251 fb small u with circumflex accent
374 252 fc small u with diaeresis
375 253 fd small y with acute accent
376 254 fe small icelandic thorn
377 255 ff small y with diaeresis



1.2.12.2. Octal Latin-1


200 220 240 nsp 260 ++ 300 A` 320 DD 340 a` 360 dd
201 221 241 !! 261 +- 301 A' 321 N~ 341 a' 361 n~
202 222 242 c| 262 22 302 A^ 322 O` 342 a^ 362 o`
203 223 243 ## 263 33 303 A~ 323 O' 343 a~ 363 o'
204 224 244 cur 264 304 A" 324 O^ 344 a" 364 o^
205 225 245 y- 265 uu 305 A+ 325 O~ 345 a+ 365 o~
206 226 246 || 266 pil 306 AE 326 O" 346 ae 366 o"
207 227 247 $$ 267 .. 307 C, 327 xx 347 c, 367 //
210 230 250 "" 270 ,, 310 E` 330 O/ 350 e` 370 o/
211 231 251 cO 271 11 311 E' 331 U` 351 e' 371 u`
212 232 252 a- 272 o- 312 E^ 332 U' 352 e^ 372 u'
213 233 253 << 273 >> 313 E" 333 U^ 353 e" 373 u^
214 234 254 -. 274 14 314 I` 334 U" 354 i` 374 u"
215 235 255 -- 275 12 315 I' 335 Y' 355 i' 375 y'
216 236 256 tO 276 34 316 I^ 336 PP 356 i^ 376 pp
217 237 257 mac 277 ?? 317 I" 337 ss 357 i" 377 y"



1.2.12.3. Decimal Latin-1


128 144 160 nsp 176 ++ 192 A` 208 DD 224 a` 240 dd
129 145 161 !! 177 +- 193 A' 209 N~ 225 a' 241 n~
130 146 162 c| 178 22 194 A^ 210 O` 226 a^ 242 o`
131 147 163 ## 179 33 195 A~ 211 O' 227 a~ 243 o'
132 148 164 cur 180 196 A" 212 O^ 228 a" 244 o^
133 149 165 y- 181 uu 197 A+ 213 O~ 229 a+ 245 o~
134 150 166 || 182 pil 198 AE 214 O" 230 ae 246 o"
135 151 167 $$ 183 .. 199 C, 215 xx 231 c, 247 //
136 152 168 "" 184 ,, 200 E` 216 O/ 232 e` 248 o/
137 153 169 cO 185 11 201 E' 217 U` 233 e' 249 u`
138 154 170 a- 186 o- 202 E^ 218 U' 234 e^ 250 u'
139 155 171 << 187 >> 203 E" 219 U^ 235 e" 251 u^
140 156 172 -. 188 14 204 I` 220 U" 236 i` 252 u"
141 157 173 -- 189 12 205 I' 221 Y' 237 i' 253 y'
142 158 174 tO 190 34 206 I^ 222 PP 238 i^ 254 pp
143 159 175 mac 191 ?? 207 I" 223 ss 239 i" 255 y"



1.2.12.4. Hexadecimal Latin-1


80 90 a0 nsp b0 ++ c0 A` d0 DD e0 a` f0 dd
81 91 a1 !! b1 +- c1 A' d1 N~ e1 a' f1 n~
82 92 a2 c| b2 22 c2 A^ d2 O` e2 a^ f2 o`
83 93 a3 ## b3 33 c3 A~ d3 O' e3 a~ f3 o'
84 94 a4 cur b4 c4 A" d4 O^ e4 a" f4 o^
85 95 a5 y- b5 uu c5 A+ d5 O~ e5 a+ f5 o~
86 96 a6 || b6 pil c6 AE d6 O" e6 ae f6 o"
87 97 a7 $$ b7 .. c7 C, d7 xx e7 c, f7 //
88 98 a8 "" b8 ,, c8 E` d8 O/ e8 e` f8 o/
89 99 a9 cO b9 11 c9 E' d9 U` e9 e' f9 u`
8a 9a aa a- ba o- ca E^ da U' ea e^ fa u'
8b 9b ab << bb >> cb E" db U^ eb e" fb u^
8c 9c ac -. bc 14 cc I` dc U" ec i` fc u"
8d 9d ad -- bd 12 cd I' dd Y' ed i' fd y'
8e 9e ae tO be 34 ce I^ de PP ee i^ fe pp
8f 9f af mac bf ?? cf I" df ss ef i" ff y"



1.2.13. ASCII with easy French conventions

This charset is identical to ascii, save for French
diacritics which are noted using a slightly different
convention.

See section Easy French for more details.

1.3. Easy French conventions

These conventions are used in the texte and latexte
charsets, which are seven-bit codes. At text entry time,
these conventions provide a little speed up. At read time,
they slightly improve the readability. Of course, it would
be better to have a specialized keyboard to make direct
eight-bit entries and fonts for immediately displaying eight-bit
ISO Latin-1 characters. But not everybody is so fortunate.
In several mailing environments, the eighth bit is often
willfully destroyed (a horrible crime that most people do not
care to straighten up).

See:

o French quotes
o Latin ligatures
o Diacritics
o List of words ending with diaeresis
o When, How and Who

1.3.1. French quotes

French quotes (sometimes called ``angle quotes'') are
noted the same way English quotes are noted in TeX, id est
by `` and ''.

1.3.2. Latin ligatures

No effort has been made to preserve Latin ligatures (ae,
oe) which are representable in several other charsets. So,
these ligatures may be lost through Easy French conventions.

1.3.3. Diacritics

This is almost the French convention for simplified
diacritics entry:

e' Acute accent

e` Grave accent

e^ Circumflex accent

e" Diaeresis

c, Cedilla


In some countries, : is used instead of " to mark
diaeresis. `recode' supports only one convention on a single
call, depending on the -c option of the recode command.

The convention is prone to losing information, because
the diacritic meaning overloads some characters that already
have other uses. To alleviate this, some knowledge of the
French language is built into the recognition routines.
So, the following subtleties are systematically
obeyed by the various recognizers.

o A single quote which follows an e does not
necessarily mean an acute accent if it is followed by
a single other one. For example:

e' will give an e with an acute accent.

e'' will give a simple e, with a closing quota-
tion mark.

e'''
will give an e with an acute accent, followed
by a closing quotation mark.

There is a problem induced by this convention if
there are English citations within a French text.
In sentences like:

There's a meeting at Archie's restaurant.


the single quotes will be mistaken twice for
acute accents. So English contractions and
suffix possessives could be mangled.

o A double quote or colon, depending on the -c
option, which follows a vowel is interpreted
as diaeresis only if it is followed by another
letter. But there are in French several words
that end with a diaeresis; the program also
recognizes them.

See section Ending diaeresis for a study
of all the problematic cases.

o A comma which follows a c is interpreted as a
cedilla only if it is followed by one of the
vowels a, o and u.
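
The three subtleties above can be approximated with regular
expressions, as in the Python sketch below. This is not the
`lex' recognizer from the sources: the function name is
hypothetical, grave and circumflex accents are left out, the
list of words ending with a diaeresis is not handled, and
extending the quote rule to any odd or even run of quotes is
an assumption.

```python
import re

DIAERESIS = {"a": "ä", "e": "ë", "i": "ï", "o": "ö", "u": "ü",
             "A": "Ä", "E": "Ë", "I": "Ï", "O": "Ö", "U": "Ü"}

def easy_french_to_unicode(text, diaeresis_mark='"'):
    """Apply the acute, cedilla and diaeresis rules described above."""
    def acute(match):
        letter, quotes = match.group(1), match.group(2)
        if len(quotes) % 2 == 1:
            # Odd run: the first quote is an accent, the rest close a quotation.
            return ("é" if letter == "e" else "É") + quotes[1:]
        return match.group(0)  # even run: closing quotation mark only

    text = re.sub(r"([eE])('+)", acute, text)
    # A comma is a cedilla only before a, o or u.
    text = re.sub(r"c,(?=[aou])", "ç", text)
    text = re.sub(r"C,(?=[AOU])", "Ç", text)
    # The mark is a diaeresis only when another letter follows.
    mark = re.escape(diaeresis_mark)
    return re.sub(r"([aeiouAEIOU])" + mark + r"(?=[^\W\d_])",
                  lambda match: DIAERESIS[match.group(1)], text)
```

Calling easy_french_to_unicode("There's") returns "Therés",
which reproduces the mangling of English contractions
mentioned above. Passing diaeresis_mark=":" corresponds to
the -c option.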

1.3.4. List of words ending with diaeresis

Here is a classification of all cases of a diaeresis at
the end of a French word:

o Words ending in ``igue''

- Feminine words without a relative masculine:


besaigue" cigue"



- Feminine words with a relative masculine:
(1)


aigue" ambigue" contigue" exigue" subaigue" suraigue"



o Words not ending in ``igue''

- Ended by ``i'': (2)


ai" congai" goi" hai"kai" inoui" sai" samurai" thai" tokai"

- Ended by ``e'':


canoe"



- Ended by ``u'': (3)


Esau"



Notes:

1. There are supposed to be seven words in this case.
So, one is missing.

2. Look at the following sentence:

"Ai"e! Voici le proble`me que j'ai"


or, using the -c option:

Ai:e! Voici le proble`me que j'ai:


There is an ambiguity between an ai", the
small animal, and the present indicative of
avoir (first person singular), when followed
by what could be a diaeresis mark. Fortunately,
the case is solved by the fact that an
apostrophe always precedes the verb and almost
never the animal.

3. I did not pay attention to proper nouns, but
this one showed up as being fairly evident.

Just to complete this topic, note that it would be
wrong to make a rule for all words ending in ``igue'' as
needing a diaeresis. Here are counter-examples:


becfigue be`sigue bigue bordigue bourdigue brigue contre-digue
digue d'intrigue fatigue figue garrigue gigue igue intrigue
ligue prodigue sarigue zigue

1.3.5. When, How and Who.

Easy French has been in use in France for a while.
Loic Dachary first exposed me to this
particular convention. I only slightly adapted it (the
diaeresis option) to make it more comfortable to several
usages in Que'bec originating from Universite' de Montre'al.

In fact, the main problem for me was not necessarily
to invent Easy French, but to recognize the ``best''
convention to use (best not being defined here), and to try to
solve the main pitfalls associated with the selected
convention. I'm particularly grateful to Claude Goutier
<[email protected]> who, through numerous discussions in
August 1988, was quite helpful in evaluating various
hypotheses.

1.4. Internal aspects

This information is organized in:

o Overall organization
o Internal vs external piping
o Some limitations
o Adding new charsets

1.4.1. Overall organization

The main driver has a table giving the conversion
routines available and, for each, the starting charset and the
ending charset. It then tries to figure out the shortest
sequence of conversions that will transform the input
charset into the final charset. Let us consider these
charsets as being the nodes of a directed graph. `recode'
has internally a few elementary recoding methods, called
single-steps, each of which may be considered as an oriented
arc from one node to another. A cost is attributed to
each single-step. Given a starting code and a goal code,
`recode' computes the most economical route through the
elementary recodings.

The main part of `recode' is written in C, as are most
single-steps. A few single-steps which need to recognize
sequences of multiple characters are written in `lex'.
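
The search for the most economical route can be pictured
with the Python sketch below. This document does not say which
algorithm `recode' uses; Dijkstra's algorithm is assumed here
purely for illustration, and the steps and costs in the
example are invented.

```python
import heapq

def cheapest_route(single_steps, start, goal):
    """Return (total cost, list of (before, after) arcs), or None."""
    graph = {}
    for before, after, cost in single_steps:
        graph.setdefault(before, []).append((after, cost))
    queue = [(0, start, [])]
    settled = set()
    while queue:
        total, node, route = heapq.heappop(queue)
        if node == goal:
            return total, route
        if node in settled:
            continue
        settled.add(node)
        for after, cost in graph.get(node, []):
            if after not in settled:
                heapq.heappush(queue, (total + cost, after, route + [(node, after)]))
    return None  # no sequence of single-steps reaches the goal

# Hypothetical single-steps: (starting charset, goal charset, cost).
STEPS = [("ibmpc", "latin1", 1), ("latin1", "ascii", 2),
         ("ascii", "flat", 1), ("ibmpc", "flat", 10)]
print(cheapest_route(STEPS, "ibmpc", "flat"))
```

With these invented costs, the three-step route through
latin1 and ascii is preferred over the direct but expensive
arc. A request starting from flat finds no route, in keeping
with flat being a terminal charset.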

1.4.2. Internal vs external piping

Suppose that four elementary steps are selected at path
optimization time. Then `recode' will split itself into
four different tasks interconnected with pipes, logically
equivalent to:


step1 <input | step2 | step3 | step4 >output
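
The effect of chaining steps can be imitated in a single
Python process with generators, each stage pulling chunks
from the previous one as soon as they are available. This is
only an analogy for the pipes described here; the names
make_step and connect are hypothetical, and the byte tables
are arbitrary.

```python
def make_step(table):
    """Return a streaming step translating bytes through a 256-byte table."""
    def step(chunks):
        for chunk in chunks:
            yield chunk.translate(table)
    return step

def connect(steps, chunks):
    """Chain the steps so that data flows through them chunk by chunk."""
    stream = iter(chunks)
    for step in steps:
        stream = step(stream)  # each stage reads from the previous one
    return stream

steps = [make_step(bytes.maketrans(b"a", b"b")),
         make_step(bytes.maketrans(b"b", b"c"))]
for chunk in connect(steps, [b"aa", b"ab"]):
    print(chunk)
```

The first recoded chunk is produced before the second input
chunk is read, which is the property that makes piping
attractive in filter mode.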

1.4.3. Some limitations

Here are some limitations of the program.

o There is a limit (currently 10) on the number of
steps allowed in one single recodification work.
It should stay sufficient for quite a while, maybe
for ever. This is a simple compilation #define,
in any case.

1.4.4. Adding new charsets

It is fairly easy for a programmer to add a new charset
to `recode'. All it requires is writing two routines,
modifying a few tables, and rebuilding `recode'.

One of the routines should convert from any previous
charset to the new one. Any previous charset will do, but
try to select it so you will not lose too much information
while converting. If you have to read multiple bytes of the
old charset before recognizing the character to produce, you
might write this routine in `lex'; otherwise, use C. Model
your routine on one of those which exist, so as to
keep the sources uniform.

The other routine should convert from the new charset
to any older one. You do not have to select the same old
charset as the one you selected for the previous routine.
Select any charset for which you will not lose too much
information while converting. If the routine has to read
multiple bytes of the new charset before deciding which
character it will produce, you might write this routine in
`lex'; otherwise, use C. Model your routine on one
of those which exist, so as to keep the sources uniform.

Edit `Makefile' to add the object name of your two
routines to the C_STEPS or L_STEPS macro definition, depending
on whether your routines are written in C or in `lex'. Then
edit `steps.h' in the four following places:

1. Create a symbol for your new charset in enum
TYPE_code definition.

2. Add the option name of your new charset in
code_keywords initialization.

3. Add two extern declarations for your routines at
the appropriate places.

4. Add two lines in single_steps array initialization
to declare your routines. For each line, include
the four following fields:

1. The function name of your routine.

2. The starting code enum constant, that is, the
code your routine reads.

3. The goal code enum constant, that is, the
code your routine produces.

4. The cost of your routine, using the
predefined constants STEP, LOOSE, EXACT, SLOW
and FAST. See the comments for the exact
meaning of each of these and follow the
examples. Respect these meanings and be
honest with the costs!


In some circumstances, one of your routines would
be a mere copy. It is better in this case to not
provide the routine, but still declare it in
single_steps using NULL as its function name and
ALREADY alone as its cost.


1.5. Future things

I will be glad to hear criticism and suggestions, even
for details. This program is made up of hundreds of
details, in fact. Write to [email protected]

Some notes and suggestions.

o Accept abbreviations for charsets on the command
call. Accept more than one conversion with
intermediate filters in a single call.

o Support Universite de Montreal ``accent''
convention.

o Support [nt]roff diacritics.

o Support the Atari-ST internal code.

o Segregate charsets and usages.

o Is there some way of specifying that recode should
not contract something that looks like an accent?
Like "There\'s a meeting at Archie\'s restaurant"?
(With corresponding insertion of backslashes or
whatever when converting the other way, of course
- the transformation from accented to ascii should
be exactly invertible in all cases.) Of course,
There\'s will not be contracted.