UTF-EBCDIC
UTF-EBCDIC is a character encoding used to represent Unicode characters. It is meant to be EBCDIC-friendly, so that legacy EBCDIC applications on mainframes may process the characters without much difficulty. Its advantages for existing EBCDIC-based systems are similar to UTF-8's advantages for existing ASCII-based systems. Details on UTF-EBCDIC are defined in Unicode Technical Report #16.
To produce the UTF-EBCDIC encoded version of a series of Unicode code points, an encoding based on UTF-8 (known in the specification as UTF-8-Mod) is applied first. The main difference between this encoding and UTF-8 is that it allows Unicode code points U+0080 through U+009F (the C1 control codes) to be represented as a single byte, so that they can later be mapped to the corresponding EBCDIC control codes. To achieve this, 101XXXXX is used instead of 10XXXXXX as the format for the trailing bytes of a multi-byte sequence. As each trailing byte can hold only 5 bits rather than 6, UTF-EBCDIC will generally produce larger output than UTF-8 for the same input data.
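The first step can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `utf8_mod_encode` is hypothetical, and the lead-byte patterns (110xxxxx, 1110xxxx, 11110xxx, 1111100x) are an assumption carried over from UTF-8 conventions as described in Unicode Technical Report #16. Only the single-byte range and the 101XXXXX trailing-byte format come directly from the text above.

```python
def utf8_mod_encode(cp: int) -> bytes:
    """Encode one Unicode code point as a UTF-8-Mod byte sequence (sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")

    # U+0000..U+009F (ASCII plus the C1 controls) stay as a single byte.
    if cp <= 0x9F:
        return bytes([cp])

    # (largest code point, lead-byte marker, number of trailing bytes)
    # Assumed lead-byte layout; each trailing byte carries 5 payload bits.
    forms = (
        (0x3FF, 0xC0, 1),
        (0x3FFF, 0xE0, 2),
        (0x3FFFF, 0xF0, 3),
        (0x10FFFF, 0xF8, 4),
    )
    for limit, lead, ntrail in forms:
        if cp <= limit:
            out = [lead | (cp >> (5 * ntrail))]
            # Emit trailing bytes, most significant 5-bit group first,
            # each tagged with the 101xxxxx prefix (0xA0).
            for i in range(ntrail - 1, -1, -1):
                out.append(0xA0 | ((cp >> (5 * i)) & 0x1F))
            return bytes(out)
    raise AssertionError("unreachable")
```

Under these assumptions U+0085 (NEL) stays a single byte, while U+4000 needs four bytes where UTF-8 needs three, which illustrates the size penalty of carrying 5 bits per trailing byte.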
This transformation leaves the data in an ASCII-based format, so a reversible byte-to-byte transform is then applied using a lookup table, to make the result as close to normal EBCDIC code pages as feasible. Both steps can easily be reversed to recover the Unicode code points.
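The reversible lookup step can be illustrated with a partial table. The sketch below covers only letters, digits and the space character, with byte values taken from the codepage layout table in this article; the names `RANGES`, `transform` and `restore` are hypothetical, and the full 256-entry table defined in Unicode Technical Report #16 is not reproduced here.

```python
# (first ASCII character, last ASCII character, first UTF-EBCDIC byte value)
# Values read from the single-byte codepage layout table; partial on purpose.
RANGES = [
    ("a", "i", 129), ("j", "r", 145), ("s", "z", 162),
    ("A", "I", 193), ("J", "R", 209), ("S", "Z", 226),
    ("0", "9", 240), (" ", " ", 64),
]

# Forward table: ASCII-based byte -> UTF-EBCDIC byte.
to_ebcdic = {}
for first, last, start in RANGES:
    for offset, code in enumerate(range(ord(first), ord(last) + 1)):
        to_ebcdic[code] = start + offset

# Because the mapping is one-to-one, inverting the dictionary gives the
# reverse table: UTF-EBCDIC byte -> ASCII-based byte.
from_ebcdic = {ebcdic: ascii_ for ascii_, ebcdic in to_ebcdic.items()}


def transform(data: bytes) -> bytes:
    """Apply the byte-to-byte lookup (raises KeyError outside the subset)."""
    return bytes(to_ebcdic[b] for b in data)


def restore(data: bytes) -> bytes:
    """Undo transform() using the inverted table."""
    return bytes(from_ebcdic[b] for b in data)
```

For instance, `restore(transform(b"Hello 42"))` returns the original bytes, since every entry in the forward table has exactly one counterpart in the reverse table.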
This encoding form is rarely used, even on the EBCDIC-based mainframes for which it was designed. IBM EBCDIC-based mainframe operating systems, such as z/OS, usually use UTF-16 for complete Unicode support. For example, DB2 UDB, COBOL, PL/I, Java and the IBM XML toolkit support UTF-16 on IBM mainframes.
Codepage layout
There are 160 characters with single-byte encodings in UTF-EBCDIC; these are shown in the following table. The remaining 96 byte values are used as part of multi-byte characters. The single-byte portion resembles IBM-1047 rather than IBM-37 in the placement of the square brackets: UTF-EBCDIC has [ and ] at hex AD and BD respectively, whereas CCSID 37 has them at hex BA and BB.
| UTF-EBCDIC | —0 | —1 | —2 | —3 | —4 | —5 | —6 | —7 | —8 | —9 | —A | —B | —C | —D | —E | —F |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0− | NUL 0000 0 | SOH 0001 1 | STX 0002 2 | ETX 0003 3 | ST 009C 4 | HT 0009 5 | SSA 0086 6 | DEL 007F 7 | EPA 0097 8 | RI 008D 9 | SS2 008E 10 | VT 000B 11 | FF 000C 12 | CR 000D 13 | SO 000E 14 | SI 000F 15 |
| 1− | DLE 0010 16 | DC1 0011 17 | DC2 0012 18 | DC3 0013 19 | OSC 009D 20 | LF 000A 21 | BS 0008 22 | ESA 0087 23 | CAN 0018 24 | EM 0019 25 | PU2 0092 26 | SS3 008F 27 | FS 001C 28 | GS 001D 29 | RS 001E 30 | US 001F 31 |
| 2− | PAD 0080 32 | HOP 0081 33 | BPH 0082 34 | NBH 0083 35 | IND 0084 36 | NEL 0085 37 | ETB 0017 38 | ESC 001B 39 | HTS 0088 40 | HTJ 0089 41 | VTS 008A 42 | PLD 008B 43 | PLU 008C 44 | ENQ 0005 45 | ACK 0006 46 | BEL 0007 47 |
| 3− | DCS 0090 48 | PU1 0091 49 | SYN 0016 50 | STS 0093 51 | CCH 0094 52 | MW 0095 53 | SPA 0096 54 | EOT 0004 55 | SOS 0098 56 | SGCI 0099 57 | SCI 009A 58 | CSI 009B 59 | DC4 0014 60 | NAK 0015 61 | PM 009E 62 | SUB 001A 63 |
| 4− | SP 0020 64 | | | | | | | | | | | . 002E 75 | < 003C 76 | ( 0028 77 | + 002B 78 | \| 007C 79 |
| 5− | & 0026 80 | | | | | | | | | | ! 0021 90 | $ 0024 91 | * 002A 92 | ) 0029 93 | ; 003B 94 | ^ 005E 95 |
| 6− | - 002D 96 | / 002F 97 | | | | | | | | | | , 002C 107 | % 0025 108 | _ 005F 109 | > 003E 110 | ? 003F 111 |
| 7− | | | | | | | | | | ` 0060 121 | : 003A 122 | # 0023 123 | @ 0040 124 | ' 0027 125 | = 003D 126 | " 0022 127 |
| 8− | | a 0061 129 | b 0062 130 | c 0063 131 | d 0064 132 | e 0065 133 | f 0066 134 | g 0067 135 | h 0068 136 | i 0069 137 | | | | | | |
| 9− | | j 006A 145 | k 006B 146 | l 006C 147 | m 006D 148 | n 006E 149 | o 006F 150 | p 0070 151 | q 0071 152 | r 0072 153 | | | | | | |
| A− | | ~ 007E 161 | s 0073 162 | t 0074 163 | u 0075 164 | v 0076 165 | w 0077 166 | x 0078 167 | y 0079 168 | z 007A 169 | | | | [ 005B 173 | | |
| B− | | | | | | | | | | | | | | ] 005D 189 | | |
| C− | { 007B 192 | A 0041 193 | B 0042 194 | C 0043 195 | D 0044 196 | E 0045 197 | F 0046 198 | G 0047 199 | H 0048 200 | I 0049 201 | | | | | | |
| D− | } 007D 208 | J 004A 209 | K 004B 210 | L 004C 211 | M 004D 212 | N 004E 213 | O 004F 214 | P 0050 215 | Q 0051 216 | R 0052 217 | | | | | | |
| E− | \ 005C 224 | | S 0053 226 | T 0054 227 | U 0055 228 | V 0056 229 | W 0057 230 | X 0058 231 | Y 0059 232 | Z 005A 233 | | | | | | |
| F− | 0 0030 240 | 1 0031 241 | 2 0032 242 | 3 0033 243 | 4 0034 244 | 5 0035 245 | 6 0036 246 | 7 0037 247 | 8 0038 248 | 9 0039 249 | | | | | | APC 009F 255 |
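
The bracket placement discussed before the table can be checked against Python's built-in `cp037` codec, as a rough sketch. The UTF-EBCDIC byte values are copied from the table (decimal 173 and 189); the helper name `compare_with_cp037` is hypothetical, and the sketch assumes only the `cp037` codec is available, so IBM-1047 itself is not queried.

```python
# Byte values for the square brackets in UTF-EBCDIC, read from the table
# (decimal 173 and 189, i.e. hex AD and BD).
UTF_EBCDIC_BRACKETS = {"[": 0xAD, "]": 0xBD}


def compare_with_cp037(ch: str) -> tuple:
    """Return (UTF-EBCDIC byte, code page 37 byte) for a bracket character."""
    return UTF_EBCDIC_BRACKETS[ch], ch.encode("cp037")[0]


for ch in "[]":
    utf_ebcdic_byte, cp037_byte = compare_with_cp037(ch)
    print(f"{ch}: UTF-EBCDIC {utf_ebcdic_byte:02X}, CCSID 37 {cp037_byte:02X}")
```

The two columns differ for both brackets, whereas letters and digits occupy the same positions in UTF-EBCDIC and code page 37.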
External links
- http://www.unicode.org/reports/tr16/ Unicode Technical Report #16: the definition of UTF-EBCDIC