Many scripts in Unicode, such as Arabic, have orthographic rules that require certain sequences of letterforms to be joined into special ligature forms. In English, the common ampersand (&) developed from a ligature in which the handwritten Latin letters e and t (spelling et, Latin for "and") were combined.[1] The rules governing ligature formation in Arabic can be quite complex, requiring special script-shaping technologies such as the Arabic Calligraphic Engine by Thomas Milo's DecoType.[2]
As of Unicode 16.0, the Arabic script is contained in the following blocks:[3]
The basic Arabic range encodes the standard letters, the most common diacritics and the Arabic-Indic digits, but does not encode contextual forms; the range U+0621–U+0652 is based directly on ISO 8859-6. The Arabic Supplement range encodes letter variants mostly used for writing African (non-Arabic) languages. The Arabic Extended-B and Arabic Extended-A ranges encode additional Qur'anic annotations and letter variants used for various non-Arabic languages. The Arabic Presentation Forms-A range encodes contextual forms and ligatures of letter variants needed for Persian, Urdu, Sindhi and Central Asian languages. The Arabic Presentation Forms-B range encodes spacing forms of Arabic diacritics and further contextual letter forms. The presentation forms are present only for compatibility with older standards and are not currently needed for coding text.[4] The Arabic Mathematical Alphabetic Symbols block encodes characters used in Arabic mathematical expressions. The Indic Siyaq Numbers block contains a specialized subset of Arabic script that was used for accounting in India under the Mughal Empire, in use by the 17th century and continuing through the middle of the 20th century.[5][6] The Ottoman Siyaq Numbers block contains a specialized subset of Arabic script, also known as Siyakat numbers, used for accounting in Ottoman Turkish documents.[6]
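The division into blocks can be illustrated with a small lookup. The following is a minimal sketch, not part of any library: the function name `arabic_block` is hypothetical, the code point ranges are those published in the Unicode block definitions rather than taken from the text above, and only the blocks named in the preceding paragraph are covered.

```python
# Code point ranges from the Unicode block definitions (Blocks.txt).
# Only the blocks discussed in the paragraph above are listed here.
ARABIC_BLOCKS = [
    (0x0600, 0x06FF, "Arabic"),
    (0x0750, 0x077F, "Arabic Supplement"),
    (0x0870, 0x089F, "Arabic Extended-B"),
    (0x08A0, 0x08FF, "Arabic Extended-A"),
    (0xFB50, 0xFDFF, "Arabic Presentation Forms-A"),
    (0xFE70, 0xFEFF, "Arabic Presentation Forms-B"),
    (0x1EC70, 0x1ECBF, "Indic Siyaq Numbers"),
    (0x1ED00, 0x1ED4F, "Ottoman Siyaq Numbers"),
    (0x1EE00, 0x1EEFF, "Arabic Mathematical Alphabetic Symbols"),
]


def arabic_block(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for start, end, name in ARABIC_BLOCKS:
        if start <= cp <= end:
            return name
    return None


if __name__ == "__main__":
    for ch in ["\u0628", "\u0750", "\uFEFB", "A"]:
        print(f"U+{ord(ch):04X}", arabic_block(ch))
```

A linear scan is sufficient for a table this small; a sorted list with binary search would be the usual choice if every Unicode block were included.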
Below is a demonstration for the basic alphabet used in Modern Standard Arabic, illustrating how Arabic letters are expected to appear in different contexts. Codepoints listed as contextual forms "should not be used in general interchange".[4] Where a particular form has to be requested explicitly, Unicode provides other mechanisms, such as the zero-width joiner.
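The relationship between a base letter and its contextual codepoints can be inspected with Python's standard `unicodedata` module. The sketch below uses the letter beh as an arbitrary example; the variable names are illustrative only. It shows that each contextual codepoint carries a tagged compatibility decomposition back to the base letter, and how a joining form can instead be requested in plain text by placing a zero-width joiner next to the base letter. How the joiner sequences are drawn depends on the font and rendering engine.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH (basic Arabic block)
ZWJ = "\u200D"  # ZERO WIDTH JOINER

# Presentation Forms-B codepoints for the four contextual shapes of beh.
beh_forms = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

for form, ch in beh_forms.items():
    # decomposition() returns a tag such as "<initial>" plus the base letter.
    print(form, unicodedata.name(ch), unicodedata.decomposition(ch))

# Compatibility normalization (NFKC) folds every contextual form to the base letter.
folded = {form: unicodedata.normalize("NFKC", ch) for form, ch in beh_forms.items()}

# Preferred in interchange: the base letter plus joiners to request a shape.
initial_beh = BEH + ZWJ        # joins to a following letter only
medial_beh = ZWJ + BEH + ZWJ   # joins on both sides
final_beh = ZWJ + BEH          # joins to a preceding letter only
```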
Only the Arabic question mark ⟨؟⟩ and the Arabic comma ⟨،⟩ are used in regular Arabic script typing. The Arabic comma is often replaced by the Latin script comma ⟨,⟩, which also serves as the decimal separator when the Eastern Arabic numerals are used (e.g. ⟨100.6⟩ compared to ⟨١٠٠,٦⟩).
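As an illustration of the numeral convention in the example above, the following sketch converts a number written with Eastern Arabic (Arabic-Indic) digits into a Python float. The helper name `parse_eastern_arabic_decimal` is hypothetical, and the code assumes that any comma in the input is a decimal separator, never a thousands separator. It also accepts U+066B ARABIC DECIMAL SEPARATOR, which is not mentioned in the text above.

```python
import unicodedata

# Map Arabic-Indic digits U+0660..U+0669 to ASCII digits, and both the Latin
# comma and U+066B ARABIC DECIMAL SEPARATOR to a decimal point.
_TO_ASCII = {0x0660 + i: str(i) for i in range(10)}
_TO_ASCII[ord(",")] = "."
_TO_ASCII[0x066B] = "."


def parse_eastern_arabic_decimal(text):
    """Parse text such as U+0661 U+0660 U+0660 ',' U+0666 into a float.

    Assumption: a comma always marks the decimal position.
    """
    return float(text.translate(_TO_ASCII))


if __name__ == "__main__":
    sample = "\u0661\u0660\u0660,\u0666"  # the example from the text above
    print(parse_eastern_arabic_decimal(sample))
    # The digit value of a single character is also available directly.
    print(unicodedata.digit("\u0666"))
```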
Arabic Presentation Forms-A has a few characters defined as "word ligatures" for terms frequently used in formulaic expressions in Arabic. They are rarely used outside professional liturgical typing. The Rial grapheme is likewise normally written out in full rather than with the ligature.
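The fully written equivalent of a word ligature can be recovered with compatibility normalization. The sketch below, using Python's standard `unicodedata` module, expands the Rial sign (U+FDFC) and the Allah ligature (U+FDF2); the helper name `spell_out` is hypothetical and the two codepoints were chosen only as examples.

```python
import unicodedata


def spell_out(ligature):
    """Expand a word ligature into ordinary letters using NFKC."""
    return unicodedata.normalize("NFKC", ligature)


RIAL_SIGN = "\uFDFC"
ALLAH_LIGATURE = "\uFDF2"

if __name__ == "__main__":
    for ch in (RIAL_SIGN, ALLAH_LIGATURE):
        expanded = spell_out(ch)
        # Print the ligature's name and the codepoints it expands to.
        print(unicodedata.name(ch), [f"U+{ord(c):04X}" for c in expanded])
```

Text stored in the expanded form remains searchable and sortable as ordinary words, which is one practical reason to prefer it over the ligature codepoints.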
used for writing Samvat era dates in Urdu
may be used with Coptic Epact numbers
→ U+221B ∛ Cube Root
→ U+221C ∜ Fourth Root
→ U+2030 ‰ Per Mille Sign
→ U+2031 ‱ Per Ten Thousand Sign
also used with Thaana and Syriac in modern text
→ U+002C , Comma
→ U+2E32 ⸲ Turned Comma
→ U+2E41 ⹁ Reversed Comma
represents sallallahu alayhe wasallam "may God's peace and blessings be upon him"
represents alayhe assalam "upon him be peace"
represents rahmatullah alayhe "may God have mercy upon him"
represents radi allahu 'anhu "may God be pleased with him"
sign placed over the name or nom-de-plume of a poet, or in some writings used to mark all proper names
marks a recommended pause position in some Qurans published in Iran and Pakistan; should not be confused with the small TAH sign used as a diacritic for some letters such as 0679
early Persian
Arabic Small High Ligature Alef With Yeh Barree
should not be confused with 064E Fatha
should not be confused with 064F Damma
should not be confused with 0650 Kasra
also used with Thaana and Syriac in modern text → U+003B ; Semicolon → U+204F ⁏ Reversed Semicolon → U+2E35 ⸵ Turned Semicolon
also used with Thaana and Syriac in modern text → U+003F ? Question Mark → U+2E2E ⸮ Reversed Question Mark
→ U+02BE ʾ Modifier Letter Right Half Ring
≡ آ U+0627 U+0653
≡ أ U+0627 U+0654
≡ ؤ U+0648 U+0654
≡ إ U+0627 U+0655
in Kyrgyz the hamza is consistently positioned to the top right in isolate and final forms ≡ ئ U+064A U+0654
→ U+01B9 ƹ Latin Small Letter Ezh Reversed → U+02BF ʿ Modifier Letter Left Half Ring
Azerbaijani
= kashida; inserted to stretch characters or to carry tashkil with no base letter; also used with Adlam, Hanifi Rohingya, Mandaic, Manichaean, Psalter Pahlavi, Sogdian, and Syriac
Sindhi uses a shape with a short tail
represents YEH-shaped dual-joining letter with no dots in any positional form; not intended for use in combination with 0654 → U+0626 ئ Arabic Letter Yeh With Hamza Above
loses its dots when used in combination with 0654; retains its dots when used in combination with other combining marks → U+08A8 ࢨ Arabic Letter Yeh With Two Dots Below And Hamza Above
a common alternative form is written as two intertwined dammas, one of which is turned 180 degrees
marks absence of a vowel after the base consonant; used in some Qurans to mark a long vowel as ignored; can have a variety of shapes, including a circular one and a shape that looks like '06E1' → U+06E1 ۡ Arabic Small High Dotless Head Of Khah
used for madd jaa'iz in South Asian and Indonesian orthographies → U+089C ࢜ Arabic Madda Waajib → U+089E ࢞ Arabic Doubled Madda → U+089F ࢟ Arabic Half Madda Over Madda
restricted to hamza and ezafe semantics; is not used as a diacritic to form new letters
Kashmiri, Urdu, Swahili, Somali
Baluchi; indicates nasalization in Urdu
Pashto
African languages
African languages; also used in Quranic text in African and other orthographies
Kalami
Kashmiri
→ U+0025 % Percent Sign
the ordinary comma is most commonly used instead
the Arabic comma is most commonly used instead
→ U+060C ، Arabic Comma
→ U+0027 ' Apostrophe
→ U+2019 ’ Right Single Quotation Mark
appearance rather variable
→ U+002A * Asterisk
Quranic Arabic
Baluchi, Kashmiri
This character is deprecated and its use is strongly discouraged; the sequence 0627 065F is the preferred way of encoding this character.
Kazakh, Jawi; forms digraphs
preferred spelling is ٴا U+0674 U+0627
preferred spelling is ٴو U+0674 U+0648
preferred spelling is ٴۇ U+0674 U+06C7
preferred spelling is ٴی U+0674 U+06CC
Urdu
Sindhi
Persian, Urdu, ...
Pashto, Sarikoli; represents the phoneme /dz/
not used in modern Pashto
Sindhi, historically Bosnian
Pashto, Khwarazmian, Sarikoli; represents the phoneme /ts/ in Pashto
Sindhi, early Persian, Pegon, Malagasy
Lahnda
older shape for DUL, now obsolete in Sindhi; Burushaski
Sindhi; current shape used for DUL
Old Urdu, not in current use
Kurdish
Kurdish, early Persian
Dargwa
Moroccan Arabic
Turkic
Berber, Burushaski
Old Hausa
Jawi
Adighe
Maghrib Arabic
Ingush
Middle Eastern Arabic for foreign words; Kurdish, Khwarazmian, early Persian, Jawi
North African Arabic for foreign words
Maghrib Arabic, Uyghur
Tunisian and Algerian Arabic
= kaf mashkula; Persian, Urdu, Sindhi, ...
represents a letter distinct from Arabic KAF (0643) in Sindhi
Pashto; may appear like an Arabic KAF (0643) with a ring below the base
use for the Jawi gaf is not recommended, although it may be found in some existing text data; recommended character for Jawi gaf is 0762 → U+0762 ݢ Arabic Letter Keheh With Dot Above
Uyghur, Kazakh, Moroccan Arabic, early Jawi, early Persian, ...
Berber, early Persian; Pegon alternative for 08B4
not used in Sindhi
Sindhi, Saraiki
not used in Sindhi, Karakalpak
Kurdish, historically Bosnian
Avar, Soqotri
Urdu, archaic Arabic; dotless in all four contextual forms
dotless in all four contextual forms; Sindhi
forms aspirate digraphs in Urdu and other languages of South Asia; represents the glottal fricative /h/ in Uyghur
for ezafe, use 0654 over the language-appropriate base letter; actually a ligature, not an independent letter; Arabic letter hamzah on ha (1.0) ≡ ۀ U+06D5 U+0654
Urdu; actually a ligature, not an independent letter ≡ ۂ U+06C1 U+0654
Kyrgyz; a glyph variant occurs which replaces the looped tail with a horizontal bar through the tail
Uyghur, Kurdish, Kazakh, Azerbaijani, historically Bosnian
Azerbaijani, Kazakh, Kyrgyz, Uyghur
Uyghur
Kazakh, Kyrgyz, historically Bosnian
Uyghur, Kazakh
Arabic, Persian, Urdu, Kashmiri, ...; initial and medial forms of this letter have dots → U+0649 ى Arabic Letter Alef Maksura → U+064A ي Arabic Letter Yeh
Pashto, Sindhi
Pashto, Uyghur; used as the letter bbeh in Sindhi
Mende languages, Hausa
Uyghur, Kazakh, Kyrgyz
smaller than the typical circular shape used for 0652
the term "rectangular zero" is a translation of the Arabic name of this sign
= Arabic jazm; used in some Qurans to mark absence of a vowel → U+0652 ْ Arabic Sukun
typically used with 06E5, 06E6, 06E7, and 08F3
→ U+08D3 ࣓ Arabic Small Low Waw → U+08F3 ࣳ Arabic Small High Waw
there is a range of acceptable glyphs for this character
also used in Quranic text in African and other orthographies to represent wasla, ikhtilas, etc.
also used in early Persian
Persian has a different glyph than Sindhi and Urdu
Persian, Sindhi, and Urdu share glyph different from Arabic
Persian, Sindhi, and Urdu have glyphs different from Arabic
Urdu and Sindhi have glyphs different from Arabic
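Several of the character notes above use the sign ≡ to mark a canonical equivalence between a precomposed letter and a base letter followed by a combining mark. This can be checked with Python's standard `unicodedata` module. The sketch below assumes only the five equivalences for U+0622 to U+0626 listed in the notes; the helper name `canonically_equivalent` is hypothetical.

```python
import unicodedata

# Precomposed letter -> canonically equivalent sequence, as given in the notes.
CANONICAL_PAIRS = {
    "\u0622": "\u0627\u0653",  # alef with madda above
    "\u0623": "\u0627\u0654",  # alef with hamza above
    "\u0624": "\u0648\u0654",  # waw with hamza above
    "\u0625": "\u0627\u0655",  # alef with hamza below
    "\u0626": "\u064A\u0654",  # yeh with hamza above
}


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


if __name__ == "__main__":
    for composed, sequence in CANONICAL_PAIRS.items():
        # NFC composes the sequence; NFD decomposes the precomposed letter.
        print(
            f"U+{ord(composed):04X}",
            unicodedata.normalize("NFC", sequence) == composed,
            unicodedata.normalize("NFD", composed) == sequence,
        )
```

Because the two spellings are canonically equivalent, text comparison and searching are usually done on normalized strings so that either spelling matches the other.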
The characters of Arabic Presentation Forms-A are mostly ligatures that can be composed from characters in the preceding charts, with the exception of the bracket-like graphemes ﴾ ﴿; some of them are ligatures of common liturgical phrases.
The characters of Arabic Presentation Forms-B can all be composed from the characters in the basic chart.
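The statement that presentation forms can be created from basic characters is reflected in their compatibility decompositions. The following sketch, using Python's standard `unicodedata` module, folds a lam-alef ligature back to its two base letters and shows that the ornate parentheses have no such decomposition. The helper name `to_basic_letters` is hypothetical. Note that NFKC also folds unrelated compatibility characters, such as fullwidth Latin letters, so it is not a targeted Arabic-only operation.

```python
import unicodedata


def to_basic_letters(text):
    """Replace presentation forms with their basic-block equivalents (NFKC)."""
    return unicodedata.normalize("NFKC", text)


LAM_ALEF_ISOLATED = "\uFEFB"   # ligature in Arabic Presentation Forms-B
ORNATE_LEFT_PAREN = "\uFD3E"   # bracket-like grapheme in Presentation Forms-A

if __name__ == "__main__":
    for ch in (LAM_ALEF_ISOLATED, ORNATE_LEFT_PAREN):
        result = to_basic_letters(ch)
        # An unchanged result means the character has no decomposition.
        print(unicodedata.name(ch), [f"U+{ord(c):04X}" for c in result])
```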