Unicode Normalisation R←X(5581⌶)Y

Converts characters in an array to one of four Unicode Normal Forms.

Y must be one of the following:

  • A character scalar or vector.
  • A nested vector of character scalars/vectors.
  • An array of any shape, in which each element is a character vector, a scalar numeric, or ⎕NULL.

All characters in Y must be valid Unicode code points, that is, within the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF (the surrogate range U+D800 to U+DFFF is excluded).
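
As a minimal sketch of checking this precondition before calling 5581⌶, the following assumes a Unicode edition of Dyalog in which ⎕UCS converts between characters and code points. The name Valid is a hypothetical helper introduced here for illustration only; it is not part of the interpreter.

      ⍝ Hypothetical helper: 1 if no character of ⍵ is a surrogate (U+D800 to U+DFFF)
      Valid←{c←⎕UCS ⍵ ⋄ ~∨/(c≥55296)∧c≤57343}
      Valid 'abc5'
1
      Valid ⎕UCS 67 55296     ⍝ 55296 is U+D800, a surrogate
0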

X must be a character scalar or vector describing the form of normalisation required, from the following:

X      Normalisation form
'D'    Canonical Decomposition
'C'    Canonical Decomposition followed by Canonical Composition
'KD'   Compatibility Decomposition
'KC'   Compatibility Decomposition followed by Canonical Composition

The result R is the same as Y except that the character arrays are normalised as required. If Y is a character scalar or a vector of character scalars and vectors, then scalars in Y are converted to vectors. Vectors can contain a different number of characters after normalisation.
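
The effect on shape and length can be sketched as follows. This assumes that ⎕UCS 199 yields the precomposed character Ç as a character scalar, and that the decomposed form is the two code points described under Normalisation.

      ⍴'D'(5581⌶)⎕UCS 199            ⍝ character scalar in, two-element vector out
2
      ≢¨'D'(5581⌶)(⎕UCS 199)'abc'    ⍝ lengths of each item after normalisation
2 3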

Normalisation

The Universal Character Set allows multiple ways to represent certain glyphs and glyph sequences. For example, the character Ç has its own code point (U+00C7, or ⎕UCS 199) but can also be formed from two separate code points: the character C (U+0043, or ⎕UCS 67) followed by the combining cedilla (U+0327, or ⎕UCS 807). These two representations are canonically equivalent. Some code points or combinations of code points more loosely represent the same character. For example, the characters 5 and ⁵ (superscript 5) both represent the same numeric digit but are visually distinct. These two have compatibility equivalence.
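
To make the two kinds of equivalence concrete, the sketch below applies each of the four forms to a two-character vector and shows the resulting code points. The variable s is introduced here for illustration; it holds Ç (U+00C7, ⎕UCS 199) followed by superscript 5 (U+2075, ⎕UCS 8309).

      s←⎕UCS 199 8309
      ⎕UCS 'D'(5581⌶)s       ⍝ Ç decomposed; superscript 5 unchanged
67 807 8309
      ⎕UCS 'C'(5581⌶)s       ⍝ canonical composition gives back the original
199 8309
      ⎕UCS 'KD'(5581⌶)s      ⍝ superscript 5 also replaced by the digit 5
67 807 53
      ⎕UCS 'KC'(5581⌶)s      ⍝ Ç recomposed; digit 5 stays
199 53

Only the compatibility forms replace the superscript with the plain digit, because the two are compatibility equivalent rather than canonically equivalent.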

Unicode normalisation is used to transform all equivalent text to a single representation. The four normal forms are generated by decomposing, or decomposing then composing, code points into either their canonically equivalent or their compatibility-equivalent sequences. Example use cases for each form are:

Form   Use For
D      Searching, linguistic analysis
C      Display, storage, user input
KD     Text comparison, data indexing, accessibility
KC     Search, login names, spoofing defence

For a full explanation of the different normalisation forms, see Unicode Standard Annex #15.
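
As one possible illustration of the search use case, the sketch below normalises both the search term and the text to form KC before looking for a match. Contains is a hypothetical helper defined here, not a system function; the text contains the ligature ﬁ (U+FB01, ⎕UCS 64257), which is compatibility equivalent to the two letters fi.

      ⍝ Hypothetical helper: 1 if ⍺ occurs in ⍵ once both are normalised to form KC
      Contains←{∨/('KC'(5581⌶)⍺)⍷'KC'(5581⌶)⍵}
      'file' Contains (⎕UCS 64257),'le.txt'
1
      ∨/'file'⍷(⎕UCS 64257),'le.txt'     ⍝ without normalisation the ligature does not match
0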

Examples

      COMBINING_CEDILLA←⎕UCS 807
      'C' (5581⌶) 5 '5⁵' ('C' COMBINING_CEDILLA)
┌→──────────┐
│   ┌→─┐ ┌→┐│
│ 5 │5⁵│ │Ç││
│   └──┘ └─┘│
└∊──────────┘
      'KC' (5581⌶) 5 '5⁵' ('C' COMBINING_CEDILLA)
┌→──────────┐
│   ┌→─┐ ┌→┐│
│ 5 │55│ │Ç││
│   └──┘ └─┘│
└∊──────────┘
      'C' COMBINING_CEDILLA ≡ 'D' (5581⌶) 'Ç'
1
      'BrandTM'≡'Brand™'
0
      'KD'(5581⌶)'Brand™'
BrandTM
      'BrandTM'≡⍥('KD'∘(5581⌶))'Brand™'
1