Unicode Convert R←{X} ⎕UCS Y
⎕UCS
converts (Unicode) characters into integers and vice versa.
The optional left argument X
is a character vector containing the name of a variable-length Unicode encoding scheme which must be one of:
- 'UTF-8'
- 'UTF-16'
- 'UTF-32'
If not, a DOMAIN ERROR
is issued.
If X
is omitted, Y
is a simple character or integer array, and the result R
is a simple integer or character array with the same rank and shape as Y
.
If X
is specified, Y
must be a simple character or integer vector, and the result R
is a simple integer or character vector.
Monadic ⎕UCS
Used monadically, ⎕UCS
simply converts characters to Unicode code points and vice-versa.
With a few exceptions, the first 256 Unicode code points correspond to the ANSI character set.
⎕UCS 'Hello World'
72 101 108 108 111 32 87 111 114 108 100
⎕UCS 2 11⍴72 101 108 108 111 32 87 111 114 108 100
Hello World
Hello World
The code points for the Greek alphabet are situated in the 900's:
⎕UCS 'καλημέρα'
954 945 955 951 956 941 961 945
Unicode also contains the APL character set. For example:
⎕UCS 123 40 43 47 9077 41 247 9076 9077 125
{(+/⍵)÷⍴⍵}
Dyadic ⎕UCS
Dyadic ⎕UCS
is used to translate between Unicode characters and one of three standard variable-length Unicode encoding schemes, UTF-8, UTF-16 and UTF-32. These represent a Unicode character string as a vector of 1-byte (UTF-8), 2-byte (UTF-16) and 4-byte (UTF-32) signed integer values respectively.
'UTF-8' ⎕UCS 'ABC'
65 66 67
'UTF-8' ⎕UCS 'ABCÆØÅ'
65 66 67 195 134 195 152 195 133
'UTF-8' ⎕UCS 195 134, 195 152, 195 133
ÆØÅ
'UTF-8' ⎕UCS 'γεια σου'
206 179 206 181 206 185 206 177 32 207 131 206 191 207 133
'UTF-16' ⎕UCS 'γεια σου'
947 949 953 945 32 963 959 965
'UTF-32' ⎕UCS 'γεια σου'
947 949 953 945 32 963 959 965
Because integers are signed, numbers greater than 127 will be represented as 2-byte integers (type 163), and are thus not suitable for writing directly to a native file. To write the above data to file, the easiest solution is to use ⎕UCS
to convert the data to 1-byte characters and append this data to the file:
(⎕UCS 'UTF-8' ⎕UCS 'ABCÆØÅ') ⎕NAPPEND tn
Note regarding UTF-16: For most characters in the first plane of Unicode (0000-FFFF), UTF-16 and UCS-2 are identical. However, UTF-16 has the potential to encode all Unicode characters, by using more than 2 bytes for characters outside plane 1.
'UTF-16' ⎕UCS 'ABCÆØÅ⍒⍋'
65 66 67 198 216 197 9042 9035
⎕←unihan←⎕UCS (2×2*16)+⍳3 ⍝ x20001-x20003
'UTF-16' ⎕UCS unihan
55360 56321 55360 56322 55360 56323
Translation Error
⎕UCS
will generate a DOMAIN ERROR
if the argument cannot be converted. Additionally, in the Classic Edition, a TRANSLATION ERROR
is generated if the result is not in ⎕AV
or the numeric argument is not in ⎕AVU
.