1. About this book

Todo

Render text, fonts, RTL, LTR

(The original entry is located in about.rst, line 6.)

Todo

LaTeX: ‘abc’ doesn’t use the right glyph for ‘

(The original entry is located in about.rst, line 10.)

Todo

define a character

(The original entry is located in definitions.rst, line 9.)

Todo

define a glyph

(The original entry is located in definitions.rst, line 15.)

Todo

Nelle: I think an example of this last case would be welcome here

(The original entry is located in definitions.rst, line 137.)

Todo

NELLE: I don’t know much about encodings, but it seems to me that what you state in the previous paragraph is not quite correct: an encoding associates a character/glyph/symbol with something else, such as a sequence of integers, of bytes, or anything else (more precisely, to me an encoding is a way of associating X with Y, with the possibility of decoding from Y back to X). The Wikipedia article on the subject (http://en.wikipedia.org/wiki/Character_encoding) mentions Morse code. Worst of all, it seems to me that there are different kinds of Morse code for different languages. Chinese, among others.

In short, I disagree with the statement “7 and 8 bits don’t need any encoding”. You associate a series of booleans with a character, so by definition there is an encoding. However, I suppose it is a “standard” encoding

(The original entry is located in definitions.rst, line 177.)

Todo

define UCS

(The original entry is located in definitions.rst, line 232.)

Todo

ISO 10646

(The original entry is located in definitions.rst, line 234.)

Todo

write an introduction

(The original entry is located in encodings.rst, line 4.)

Todo

4th: 13%?

(The original entry is located in encodings.rst, line 54.)

Todo

add an explicit list of top3 in 2010

(The original entry is located in encodings.rst, line 60.)

Todo

Perf of the codec

(The original entry is located in encodings.rst, line 73.)

Todo

NELLE “is decoded from an encoding” => “is decoded”

(The original entry is located in encodings.rst, line 109.)

Todo

define “glyph”

(The original entry is located in encodings.rst, line 190.)

Todo

VISCII, EBCDIC

(The original entry is located in encodings.rst, line 215.)

Todo

NELLE: I’d probably replace “rules” with “tips”

(The original entry is located in good_practices.rst, line 7.)

Todo

grammatical problem in the last sentence of the last point

(The original entry is located in good_practices.rst, line 21.)

Todo

explain why byte strings are still used (backward compatibility)

(The original entry is located in good_practices.rst, line 133.)

Todo

explain how to switch from byte to unicode strings: Python 2=>3, Windows A=>W, PHP 5=>6

(The original entry is located in good_practices.rst, line 134.)

Todo

NELLE - test whether bit 7 of every byte is unset

(The original entry is located in guess_encoding.rst, line 18.)

Todo

update/complete this list

(The original entry is located in guess_encoding.rst, line 248.)

Todo

NELLE: “the problem was” and “The problem is” read like a translated French expression: not grammatically wrong in English, but it does not sound right:

8 bits (256 code points) are not enough to store all (Unicode?) characters

(The original entry is located in historical_encodings.rst, line 10.)

Todo

NELLE: an example would be welcome

(The original entry is located in historical_encodings.rst, line 21.)

Todo

Arabic (cp1256, ISO-8859-6)

(The original entry is located in historical_encodings.rst, line 108.)

Todo

which JIS encodings use multibyte?

(The original entry is located in historical_encodings.rst, line 294.)

Todo

usage of surrogates (U+D800-U+DFFF) in security?

(The original entry is located in issues.rst, line 58.)

Todo

what about undecodable filenames?

(The original entry is located in libraries.rst, line 73.)

Todo

write an intro for all OS?

(The original entry is located in operating_systems.rst, line 6.)

Todo

And Windows CE?

(The original entry is located in operating_systems.rst, line 25.)

Todo

Document NormalizeString()

(The original entry is located in operating_systems.rst, line 220.)

Todo

Document the replacement character?

(The original entry is located in operating_systems.rst, line 222.)

Todo

document ReadConsoleW()?

(The original entry is located in operating_systems.rst, line 310.)

Todo

Consequences on TTY and pipes?

(The original entry is located in operating_systems.rst, line 357.)

Todo

write a full example in C

(The original entry is located in operating_systems.rst, line 440.)

Todo

setlocale(“”) means user preference

(The original entry is located in operating_systems.rst, line 463.)

Todo

UDF encoding?

(The original entry is located in operating_systems.rst, line 520.)

Todo

NTFS encoding

(The original entry is located in operating_systems.rst, line 539.)

Todo

Linux: mount options (FAT, NFSv3)

(The original entry is located in operating_systems.rst, line 557.)

Todo

USB keys, camera, memory cards

(The original entry is located in operating_systems.rst, line 558.)

Todo

Network filesystems like NFS (does NFS4 support Unicode?)

(The original entry is located in operating_systems.rst, line 559.)

Todo

“Because there is no Unicode standard library”: add historical/compatibility reasons

(The original entry is located in programming_languages.rst, line 30.)

Todo

toupper() and isprint() are locale dependent

(The original entry is located in programming_languages.rst, line 57.)

Todo

char* points to char, not char*

(The original entry is located in programming_languages.rst, line 63.)

Todo

Create a section for NUL byte/character

(The original entry is located in programming_languages.rst, line 87.)

Todo

towupper() and iswprint() are locale dependent

(The original entry is located in programming_languages.rst, line 112.)

Todo

is wchar_t signed on Windows and Mac OS X?

(The original entry is located in programming_languages.rst, line 113.)

Todo

can wchar_t be signed?

(The original entry is located in programming_languages.rst, line 114.)

Todo

how are non-ASCII characters handled in the format string?

(The original entry is located in programming_languages.rst, line 182.)

Todo

locale encoding should be initialized.

(The original entry is located in programming_languages.rst, line 189.)

Todo

cleanup Python 2/3 here (open)

(The original entry is located in programming_languages.rst, line 471.)

Todo

u flag: instead of which encoding?

(The original entry is located in programming_languages.rst, line 507.)

Todo

Document utf8_encode() and utf8_decode() functions?

(The original entry is located in programming_languages.rst, line 517.)

Todo

PHP6 creation date?

(The original entry is located in programming_languages.rst, line 524.)

Todo

explain isWhitespace()

(The original entry is located in programming_languages.rst, line 554.)

Todo

uppercase/lowercase

(The original entry is located in programming_with_unicode.rst, line 5.)

Todo

MySQL and PostgreSQL

(The original entry is located in programming_with_unicode.rst, line 6.)

Todo

Filesystems (ext2 vs VFAT)

(The original entry is located in programming_with_unicode.rst, line 7.)

Todo

examples of applications only supporting BMP characters?

(The original entry is located in unicode.rst, line 15.)

Todo

NELLE - examples? There are many categories/subcategories that I don’t understand

(The original entry is located in unicode.rst, line 36.)

Todo

NELLE - I think it is worth making a chart for the stats. It is a bit tedious to do, but it makes a big difference for the reader!

(The original entry is located in unicode.rst, line 61.)

Todo

NELLE - typo “replacment”

(The original entry is located in unicode.rst, line 103.)

Todo

CJK and Han issues

(The original entry is located in unicode.rst, line 118.)

Todo

is printable?

(The original entry is located in unicode.rst, line 119.)

Todo

lower/upper case

(The original entry is located in unicode.rst, line 120.)

Todo

character properties: name, category, number, RTL

(The original entry is located in unicode.rst, line 121.)

Todo

NELLE - I don’t understand. Why would UTF-8 support longer sequences (5 bytes) if they are useless?

(The original entry is located in unicode_encodings.rst, line 16.)

Todo

write a section: handle NUL byte/character

(The original entry is located in unicode_encodings.rst, line 31.)

Todo

NELLE: the first sentence does not seem grammatically “correct” to me:

“It is possible to be sure that a byte string is encoded by UTF-8, because UTF-8 adds markers to each byte.” => “Thanks to markers placed at each byte, it is possible to make sure a byte string is encoded in UTF-8”

(The original entry is located in unicode_encodings.rst, line 40.)

Todo

NELLE - “The problem with”

(The original entry is located in unicode_encodings.rst, line 47.)

Todo

NELLE - It seems to me that you use “endianness” without having explained what it is beforehand. Do you assume the reader already knows?

(The original entry is located in unicode_encodings.rst, line 49.)

Todo

NELLE - “If getting”: from that point on, I no longer quite follow

(The original entry is located in unicode_encodings.rst, line 52.)

Todo

which troubles?

(The original entry is located in unicode_encodings.rst, line 143.)

Todo

can a UTF-16 encoder encode characters in U+D800-U+DFFF?

(The original entry is located in unicode_encodings.rst, line 163.)

The book is written in reStructuredText (reST) syntax and compiled by Sphinx.

1.1. License

This book is distributed under the CC BY-SA 3.0 license.

1.2. Thanks to

Reviewers: Alexander Belopolsky, Antoine Pitrou, Feth Arezki, Nelle Varoquaux and Natal Ngétal.

1.3. Notations

  • 0bBBBBBBBB: 8-bit unsigned number written in binary; the first digit is the most significant. For example, 0b10000000 is 128.
  • 0xHHHH: number written in hexadecimal, e.g. 0xFFFF is 65535.
  • 0xHH 0xHH ...: byte sequence with bytes written in hexadecimal, e.g. 0xC3 0xA9 (2 bytes) is the character é (U+00E9) encoded to UTF-8.
  • U+HHHH: Unicode character with its code point written in hexadecimal. For example, U+20AC is the “euro sign” character, code point 8,364. Large code points are written with more than 4 hexadecimal digits, e.g. U+10FFFF is the largest (unallocated) code point of Unicode 6.0: 1,114,111.
  • A—B: range including start and end. Examples:
    • 0x00—0x7F is the range 0 through 127 (128 bytes)
    • U+0000—U+00FF is the range 0 through 255 (256 characters)
  • {U+HHHH, U+HHHH, ...}: a character string. For example, {U+0041, U+0042, U+0043} is the string “ABC” (3 characters).
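
The notations can be cross-checked in an interpreter. The following is only a minimal sketch, assuming Python 3; the helper names decode_utf8, from_code_points and code_point_notation are hypothetical and chosen for illustration, they are not part of the book.

```python
# Minimal sketch (assumes Python 3). The helper names are hypothetical.

def decode_utf8(*byte_values):
    """Decode a byte sequence written as 0xHH 0xHH ... from UTF-8."""
    return bytes(byte_values).decode("utf-8")


def from_code_points(*code_points):
    """Build the character string {U+HHHH, U+HHHH, ...}."""
    return "".join(chr(code_point) for code_point in code_points)


def code_point_notation(character):
    """Format a single character in the U+HHHH notation (at least 4 digits)."""
    return "U+%04X" % ord(character)


# 0bBBBBBBBB and 0xHHHH are ordinary integer literals in Python.
assert 0b10000000 == 128
assert 0xFFFF == 65535

# 0xC3 0xA9 (2 bytes) decodes to a single character, U+00E9.
e_acute = decode_utf8(0xC3, 0xA9)
assert len(e_acute) == 1
assert code_point_notation(e_acute) == "U+00E9"

# U+20AC is code point 8,364; U+10FFFF is code point 1,114,111.
assert ord("\u20ac") == 8364
assert 0x10FFFF == 1114111

# Ranges include both ends, hence the "+ 1" with Python's half-open range().
assert len(range(0x00, 0x7F + 1)) == 128
assert len(range(0x0000, 0x00FF + 1)) == 256

# A character string given as a list of code points.
print(from_code_points(0x0041, 0x0042, 0x0043))
```

The inclusive range notation is the one place where the book's convention differs from Python's range(), which excludes its upper bound.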
