
Unicode: extensible Character Set for Text-Files

This Section introduces the Unicode Standard and its Benefits for Text Editing.

The World before Unicode

Unicode was created in 1991 by the Unicode Consortium, a non-profit organization similar to the W3C. Its aim is to create a single Definition for all Characters in all written Languages.

Before Unicode, each Computer System had its own Character Definition. Due to limited Memory and Processing Power, only a single Byte (8 Bit) was used to encode a Character. That allows for only 256 Values, which was sufficient for basic English and the most important western European Languages, but not for the upcoming World Wide Web. Hundreds of special Code Pages were created to support the Characters of different Cultures, but the Result was unreadable Content whenever the Code Page of the Source Computer was unknown.

What is Unicode?

Unicode is a Standard that strives to define a unique Number for every Character in any written Language, living or dead. It reserves about a million Numbers (1114112, for Compatibility with UTF-16, one of its Encodings). As of October 2015 Unicode is at Version 8.0 and uses about 10% (119934 Characters) of these Numbers; 90% are still unassigned.

The first 65536 Characters cover nearly all contemporary Scripts, from western European, Cyrillic and Indic to east Asian and CJK (Chinese, Japanese and Korean) Writing Systems.

It also defines Numbers for many common Things, both physical and abstract, so you can use a dedicated Character for many Objects thus forming a new, international Language with defined Semantics. The only Problem left is the limited Keyboard. 


Unicode Examples:

Here are some of the more interesting Unicode Characters. Their Look depends very much on the Browser, the Operating System and the Fonts installed. In old Browsers with outdated Fonts you may well see mostly empty Squares; on the other Hand, Firefox Users on Windows 8 or higher will even see colorful Icons. Support for these Icons will increase over the next Years.

Cyrillic: АБВГ..., абвг...  Latin: ABC...Z, abc...z, áàâãäåæ Shapes: ⊿▬▭▮▯⬠⬡⬭ Runes: ᚠ ᚡ ᚢ ᚣ ᚤ ᚥ ᚦ
Chess: ♔ ♕ ♖ ♗ ♘♙♚ ♛ ♜ ♝ ♞ ♟ CJK: 电买开东车红马无鸟热时 are unified and simplified Versions of Chinese and Japanese. Telugu: గ్రంథాలు స్వయంచాలకంగా అనువదించబడ్డాయి

 Thai: คีย์เท่ากับอย่างชั

Urdu: مکمل ٹیسٹنگ ک

Office: 📇 📅 📆 📎 📏 📐 📛 📍 📌 Map-Icons: 🏯 🏰  ⛽ 🏠 🏡 🏢 🏣 🏤 🏥 🏦 🏧 🏨 🏩 🏪 🏫 🏬 🎪 🎠 🎡 🎢 🏭 Concepts: 💬 💭 🎦 🎬 🎭 🔑 🔒 🔓💢 💤 💥 💦 💧 💨 Globe: 🌐 🌍 🌎 🌏
Metro: 🚩 🚫 🚬 🚭 🚮 🚯 🚰 🚱 🚪 🚲 🚳 🚴 🚵 🚶 🚷 🚸 🚹 🚺 🚻 🚼 🚽 🚾 🚿 🛀 🛁 🛂 🛃 🛄 🛅

😀 😁 😂 😃 😄 😅 😆 😇 😈 😉 😊 😋 😌 😍 😎 😏 😐 😑 😒 😓 😔 😕 😖 😗 😘 😙 😚 😛 😜 😝 😞 😟 😠 😡 😢 😣 😤 😥 😦 😧 😨 😩 😪 😫 😬 😭 😮 😯 😰 😱 😲 😳 😴 😵 😶 😷

Emojis: 🙅 🙆 🙇 🙋 🙌 🙍 🙎 🙏 😸 😹 😺 😻 😼 😽 😾 😿 🙀 🙈 🙉 🙊

Astronomy: 🌌 🌞 🌟 🌠 🌑 🌒 🌓 🌔 🌕 🌖 🌗 🌘 🌙 🌚 🌛 🌜 🌝 ☽☾
Mail: 📧 📨 📩 📤 📥 📪 📫 📬 📭 📮 📦 📯 Office: 📒 📜 📝 📰 💼 📑 📓 📔 📕 📖 📗 📘 📙 📚 🔖 🔗 📁 📂 📃 📄 📋 ✏ ✒ 🎤 🎧 📣 📢 🔇 🔈 🔉 🔊 📞 📱 📲 📴 📵 📳 📠 📟 📡 📶 📼 📻 📺 🎥 📷 📹 🎮
Tools: 🔪 🔦 🔧 🔨 🔩 💺 🔬 🔭 🎨 🏮 💉 💊 🔫 💣 🎣 🎫 💆 💇 💈 💡 🔌

👀 👄 👅 👃 👂 💪 👍 👎 👌 👊 👆 👇 👈 👉 👋 👏 👐 👤 👥 👪 👫 👬 👭 💑 👶 👦 👧 👨 👩 👴 👵 👯 👸 👱 👰 👲 👳 👮 👷 💁 💂 💃 👹 👺

Fashion: 👞 👟 👠 👡 👢🎒 👛 👜 👝 👓 👔 👕 🎽 👖 👗 👘 👙 👚 👒 🎩 🎓 👑 💄 💅 💎 🔮 Drinks: 🍵 🍶 🍷 🍸 🍹 🍺 🍻 🍼
 Events: 🐾 👣 🌃 🌄 🌅 🌆 🌇 🌉 🌊 🌋 Sports: 🏃 🏊 🎿 🏂 🏄 🏇 🏈 🏉 🎾 🏀 🎯 🎰 🎱 🎲 🎳 🎴 🏆 🏁
👻 👼 👽 👾 👿 💀 🎀 🎁 🎂 🎃 🎄 🎅 🎆 🎇 🎈 🎉 🎊 Music: 🎹 🎻 🎷 🎺 🎸 🎵 🎶 🎼
🐭 🐮 🐯 🐵 🐶 🐷 🐸 🐰 🐱 🐲 🐴 🐹 🐺 🐻 🐼 🐽 Animals: 🐀 🐁 🐂 🐃 🐄 🐅 🐆 🐇 🐉 🐍 🐎 🐏 🐐 🐑 🐒 🐓 🐔 🐕 🐩 🐖 🐗 🐈 🐊 🐘 🐪 🐫 🐨 🐋 🐳 🐬 🐟 🐠  🐡 🐙 🐢 🐚 🐌 🐣 🐤 🐥 🐦 🐧 🐛 🐜 🐝 🐞

Traffic:  🚥 🚦 🚧 🚨

Love: 💋 💏 💌 💒 🔞 💍 💐

Plants: 🌰 🌱 🌴 🌵 🌷 🌸 🌹 🌺 🌻 🌼 🌽 🌾 🌿 🍀 🍁 🍂 🍃 🌳 🌲 Transport: 🚢 🚣 🚤 🚀 🚁 ✈ 🚄  🚅 🚂 🚃 🚆 🚇 🚈 🚉 🚊 🚋 🚝 🚞 🚟 🚠 🚡 ⛟ 🚚 🚛 🚌 🚍 🚎 🚏 🚐 🚔 🚕 🚖 🚗 🚘 🚙 🚜 🚑 🚒 🚓  Food: 🍳 🍱 🌽 🌾 🍔 🍕 🍝 🍞 🍖 🍗 🍟 🍠 🍤 🍚 🍜 🍛 🍘 🍙 🍣 🍥 🍲 🍇 🍈 🍉 🍊 🍋 🍌 🍍 🍎 🍏 🍐 🍑 🍒 🍓 🍄 🍅 🍆 🍡 🍢 🍦 🍧 🍨 🍩 🍪 🍫 🍬 🍭 🍮 🍯 🍰

 

ASCII, the common Denominator for the first 128 Characters:

ASCII, a Standard from 1963, used only the lower 128 Characters (7 Bit), allowing the 8th Bit to act as a Parity Check on Teletype Transfers. ASCII differed from previous Teletype Codes: it was not focused on reducing manual Errors, because it was designed for Computers, not human Typists. Therefore it is organized alphabetically, with the Digits starting at 48 (#30 hexadecimal), the upper Case Letters at 65 (#41) and the lower Case Letters at 97 (#61). This simplifies case-insensitive Search and Number Parsing and made ASCII the Core of Unicode.
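This Layout can be verified in any Programming Language; here is a short Python Sketch of the Number-Parsing and case-insensitive Search Tricks it enables:

```python
# ASCII Layout: '0' = 48 (#30), 'A' = 65 (#41), 'a' = 97 (#61)
assert ord("0") == 0x30 and ord("A") == 0x41 and ord("a") == 0x61

# Number Parsing: subtract #30 from a Digit Character
assert ord("7") - 0x30 == 7

# Case Folding: upper and lower Case Letters differ only in Bit #20
assert ord("a") - ord("A") == 0x20
assert chr(ord("a") & ~0x20) == "A"
```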

The first 256 Characters: Latin-1 resp. ISO 8859-1

This Code Page was used in most western European Countries, mostly in its Windows-1252 Variant. As an 8-Bit Encoding it can hold up to 256 Characters. It is a Superset of ASCII and contains the most important accented and Umlaut Characters of many European Cultures, a successful Compromise. It was the most common Text Format on the Internet until 2008, when it was overtaken by Unicode's UTF-8. Its Definitions were therefore also incorporated into the Unicode Standard.
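Because of this Incorporation, every Latin-1 Byte Value is exactly the Unicode Number of its Character, which a short Python Sketch can demonstrate:

```python
text = "Grüße"                      # contains ü (#FC) and ß (#DF)
latin1 = text.encode("latin-1")     # one Byte per Character
# each Byte Value equals the Unicode Number of its Character
assert list(latin1) == [ord(c) for c in text]
# in UTF-8 the two non-ASCII Characters need 2 Bytes each
assert len(text.encode("utf-8")) == len(text) + 2
```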

Unicode Encoding: UCS-2, UTF-16, UTF-32, UTF-8 etc.

A single Byte is not enough to hold more than a million Characters. Unicode has therefore devised several Representations of its Characters as Bytes, both fixed- and variable-length.

UTF-32 resp. UCS-4 Encoding

The easiest is UTF-32, because it always uses 4 Bytes, which can represent up to 4 billion Numbers directly, more than sufficient for the million Unicode Characters, but it wastes more than a Byte per Character. The only Difference between UTF-32 and UCS-4 is that UTF-32 forbids Numbers in the Surrogate Range #D800-#DFFF (see UTF-16).
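A small Python Sketch of this fixed 4-Byte Layout, using the big-endian Variant of UTF-32:

```python
text = "A€𐐷"                            # an ASCII, a BMP and a higher-Plane Character
data = text.encode("utf-32-be")         # big-endian, without a BOM
assert len(data) == 4 * len(text)       # always 4 Bytes per Character
# each Character's Number appears directly as a 4-Byte Integer
assert data[8:12] == (0x10437).to_bytes(4, "big")
```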

UCS-2 Encoding

Unicode first defined only 2-Byte Codes (UCS-2), which were quickly adopted in the early 90s by Web Browsers, Operating Systems (Windows 95...2000) and modern Platforms like Java and .NET due to the Need for world-wide Communication. Most Programming Languages used this 2-Byte Representation for Strings in Memory.

UTF-16 Encoding

Due to the Success of Unicode it soon became clear that 65536 Characters were not sufficient. So, as a Compromise with the Effort already put into UCS-2 by Hard- and Software Vendors, UTF-16 was created.

UTF-16 is a variable-length Encoding that uses either 2 or 4 Bytes. Up to #D800 it uses the same Definitions as UCS-2. These Characters from the so-called "Basic Multilingual Plane" (BMP) represent almost all contemporary Writing Systems. Nearly half of the Numbers are used for unified CJK Han Signs. It is very rare to need a Character outside of the BMP, except for Symbols or Emojis.

For historic Scripts, 16 additional "Planes" with 2^16 (65536) Characters each were created, of which currently only 3 are used. To address these million (2^20) Characters, #10000 is first subtracted from the Code Point; the remaining 20 Bits are then split up and added to #D800 and #DC00, forming a so-called Surrogate Pair of a high (#D800-#DBFF) and a low (#DC00-#DFFF) Surrogate:

𐐷 = #10437 => #10437 - #10000 = #00437 = 0000000001 0000110111 => 1101 1000 0000 0001 and 1101 1100 0011 0111 = D801 DC37

These Surrogate Values are not assigned to any Characters, and single Surrogates are forbidden in UTF-16. The disjoint Number Ranges allow quick Detection of Surrogate Pairs in a Text and restarting at the next proper Character. This "Self-Synchronization" is very important for many Text Operations, because it allows starting anywhere in the Byte Stream without having to read it from the Start. Unfortunately Characters outside the BMP were used very rarely, so this Surrogate Encoding was not as well tested or supported as e.g. UTF-8.
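The Calculation behind the 𐐷 Example can be written out as a small Python Sketch:

```python
def to_surrogates(cp: int) -> tuple:
    """Split a Code Point above #FFFF into its UTF-16 Surrogate Pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000                       # leaves a 20-Bit Number
    high = 0xD800 + (cp >> 10)          # the top 10 Bits
    low = 0xDC00 + (cp & 0x3FF)         # the bottom 10 Bits
    return high, low

assert to_surrogates(0x10437) == (0xD801, 0xDC37)   # 𐐷 = D801 DC37
```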

UTF-8 Encoding

For English Texts the first 128 Characters (#00-#7F) are usually sufficient, and so as not to "waste" a Byte, the UTF-8 Encoding was created, which uses a single Byte for these. A side Effect is that any old English Document is automatically a UTF-8 Document, as long as it does not use any of the higher Characters.

Characters from #80 to #800 are represented as 2-Byte Runs.

Characters from #800 to #10000 (the Rest of the BMP) are represented as 3-Byte Runs, and the higher Planes need 4 Bytes.

Theoretically the UTF-8 Scheme could be extended indefinitely, unlike the UTF-16 Encoding.

As of 2015 about 85% of the Internet Content is encoded using UTF-8.

UTF-8 uses the highest Bit to distinguish ASCII Characters (Bit not set) from Run Characters (Bit always set). To achieve Self-Synchronization on any Character, Start Bytes and Continuation Bytes use disjoint Value Ranges: a Start Byte begins with as many 1-Bits as there are Bytes in its Run, while every Continuation Byte begins with the Bits 10.

UTF-8 Encoding Bit usage

This Table shows the Values for Start Bytes and Continuation Bytes in the original Design, which allowed for 2 Billion Characters (up to #80000000 = 31 Bit, the original Limit of Unicode), before it was truncated in 2003 to the UTF-16 Limit of 21 Bit.

Range          Bits                        Byte 1    Byte 2    Byte 3    Byte 4    (Byte 5)  (Byte 6)
#00...         0tuvwxyz                    0tuvwxyz
#80...         pqrstuvwxyz                 110pqrst  10uvwxyz
#800...        klmnopqrstuvwxyz            1110klmn  10opqrst  10uvwxyz
#10000...      ßghijklmnopqrstuvwxyz       11110ßgh  10ijklmn  10opqrst  10uvwxyz
(#200000...)   (...hijklmnopqrstuvwxyz)    111110xx  10xxxxxx  10ijklmn  10opqrst  10uvwxyz
(#4000000...)  (...hijklmnopqrstuvwxyz)    1111110x  10xxxxxx  10xxxxxx  10ijklmn  10opqrst  10uvwxyz

The last two Rows were deprecated in 2003, when the Unicode Range was limited from 31 to 21 Bits, and must not appear in UTF-8 Files (which, by the way, is a good Test for UTF-8 Encoding in the Presence of Bytes with the highest Bit set).

This Scheme has several useful Characteristics:

  • pure ASCII Text is already valid UTF-8. 
  • it saves up to 50% in western Texts and does not cost more for east European or Middle East Scripts. Only CJK and other east Asian Scripts are better off with UTF-16, which uses only 2 Bytes instead of 3.
  • ASCII, Start and Continuation Bytes have disjoint Ranges, so they are easily discernible.
  • the Number of Bytes to read can be determined from the Bits set in the first Byte.
  • you need to back up at most 3 Bytes to find the Start of the current Character.
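The whole Scheme fits in a few Lines. Here is a Python Sketch of a hand-written Encoder (ignoring the forbidden Surrogate Range for Brevity), checked against the built-in Codec:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Code Point by Hand, following the Table above."""
    if cp < 0x80:                       # 1 Byte: plain ASCII
        return bytes([cp])
    if cp < 0x800:                      # 2 Bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                    # 3 Bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    return bytes([0xF0 | cp >> 18,      # 4 Bytes for the higher Planes
                  0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F,
                  0x80 | cp & 0x3F])

for ch in "A\u00FC\u20AC\U00010437":    # a 1-, 2-, 3- and 4-Byte Example
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```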

Byte Order Mark (BOM)

UTF-16 is used in both big-endian and little-endian Byte Order, unlike UTF-8, which defines the Ordering of higher and lower Order Bytes. To support detecting the actual UTF-16 Encoding, the zero-width #FEFF BOM Character is added at the Start of the Text. When the BOM is missing, big-endian Encoding should be assumed, though many Windows Applications assume the little-endian OS Default. In Texts without a BOM it is a good Idea to search for the Space Character (#0020) to determine the Byte Order, because it is usually the most frequent Character and Language-independent.
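A minimal Python Sketch of this Space Heuristic, assuming the Data starts on a 2-Byte Boundary:

```python
def guess_utf16_order(data: bytes) -> str:
    """Guess the Byte Order of BOM-less UTF-16 Data via the Space Character."""
    be = le = 0
    for i in range(0, len(data) - 1, 2):        # step through aligned 2-Byte Units
        if data[i] == 0x00 and data[i + 1] == 0x20:
            be += 1                             # 00 20 = Space in big-endian
        elif data[i] == 0x20 and data[i + 1] == 0x00:
            le += 1                             # 20 00 = Space in little-endian
    return "utf-16-be" if be >= le else "utf-16-le"   # default to big-endian

assert guess_utf16_order("hello unicode world".encode("utf-16-be")) == "utf-16-be"
assert guess_utf16_order("hello unicode world".encode("utf-16-le")) == "utf-16-le"
```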