The Absolute Minimum Every Programmer Should Know!

One day you get a cool idea for an app: "Better Half", to help couples keep 'date night' alive. After countless late nights, your Better Half app is ready and doing very well in the market. As its popularity grows, Valentine from France decides to use the app to write a message to his girlfriend on Valentine's Day. Hours pass with no response, and he wonders what went wrong. When he finally asks his girlfriend, he finds that the message she received was all junk characters, "???? ?????? ??? ????", and she never understood it. Soon you receive a very long feedback email from Valentine describing how you ruined his Valentine's Day. After digging into the problem, you find out that you did not handle the encoding.


But as programmers we rarely pay much attention to encoding until someone comes to us with an issue. In this post we will discuss two things: a) encoding and decoding text, and b) the different libraries used.

Encoding/decoding text:

Content-type: application/json; charset=utf-8.

What does this mean? It designates the content to be in JSON format, encoded in the UTF-8 character encoding. So the next question is: what is an encoding?
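Before defining it, here is a sketch in Python of how Valentine's message most likely turned into question marks: somewhere along the way, his text passed through an encoding that could not represent its characters. (The Cyrillic sample message is an assumption for illustration; `errors='replace'` is Python's way of substituting `?` for unencodable characters, which many systems do silently.)

```python
message = "С любовью"  # "With love" in Russian (hypothetical example text)

# If any component re-encodes the text using an encoding that cannot
# represent Cyrillic, every such character collapses to '?':
garbled = message.encode('ascii', errors='replace').decode('ascii')
print(garbled)  # ? ???????
```

Once this happens, the original characters are gone for good; no amount of re-decoding on the receiving end can bring them back.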

“Mapping characters to numbers. Many such mappings exist; once you know the encoding of a piece of text, you know what character is meant by a particular number.”
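A quick sketch in Python makes this definition concrete: every character has a number (its code point), and an encoding determines which bytes represent that number.

```python
# ord() gives the number (code point) for a character,
# chr() maps the number back to the character.
print(ord('A'))       # 65
print(chr(0x20AC))    # €

# The same text maps to different bytes under different encodings:
text = "café"
print(text.encode('utf-8'))   # b'caf\xc3\xa9'  (é takes two bytes)
print(text.encode('latin-1')) # b'caf\xe9'      (é takes one byte)
```

This is exactly why knowing the encoding matters: the byte 0xE9 means "é" only if you know the text is Latin-1, and means something else entirely mid-sequence in UTF-8.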

Two encodings that are used extensively are:

  1. Unicode: Unicode officially defines 1,114,112 code points, from 0x000000 to 0x10FFFF. (The idea that Unicode is a 16-bit encoding is completely wrong.) For maximum compatibility, individual Unicode values are usually passed around as 32-bit integers (4 bytes per character), even though this is more than necessary. Conveniently, the first 128 Unicode characters are the same as those in the familiar ASCII encoding. For more on Unicode, you can go through the link
  2. UTF-8: The consensus is that storing four bytes per character is wasteful, so a variety of representations have sprung up for Unicode characters. The most interesting one for C programmers is UTF-8, a "multi-byte" encoding scheme: it uses a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it to the Unicode value of the character it refers to. UTF-8 has the property that all existing 7-bit ASCII strings are still valid; it only affects the meaning of bytes greater than 127, which it uses to represent higher Unicode characters. A character may require 1, 2, 3, or 4 bytes of storage depending on its value, with more bytes needed as values get larger. To cover the full range of possible 32-bit values, UTF-8 would require a whopping 6 bytes, but since Unicode only defines characters up to 0x10FFFF, this never happens in practice. Concretely, UTF-8 maps a sequence of 1-4 bytes to a number from 0x000000 to 0x10FFFF:
    00000000 -- 0000007F: 	0xxxxxxx
    00000080 -- 000007FF: 	110xxxxx 10xxxxxx
    00000800 -- 0000FFFF: 	1110xxxx 10xxxxxx 10xxxxxx
    00010000 -- 001FFFFF: 	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    The x's are the bits to be extracted from the sequence and glued together to form the final number. UTF-8 is now the dominant encoding on the web.
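The table above can be checked directly in Python: characters in each range really do encode to the predicted number of bytes, and the x bits can be extracted by hand.

```python
# Characters from each row of the table and their expected byte counts:
samples = {'A': 1, 'é': 2, '€': 3, '🙂': 4}
for ch, n in samples.items():
    encoded = ch.encode('utf-8')
    print(f"U+{ord(ch):06X} -> {encoded} ({len(encoded)} bytes)")
    assert len(encoded) == n

# Manually decoding the 3-byte sequence for '€' (U+20AC) by masking
# off the marker bits and gluing the x bits together:
b = '€'.encode('utf-8')  # b'\xe2\x82\xac'
cp = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
assert cp == 0x20AC
```

The masks 0x0F and 0x3F correspond to the `1110xxxx` and `10xxxxxx` patterns in the third row of the table: they strip the fixed marker bits and keep only the x's.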

Other problems surrounding text are

  • Display: After you have decoded the text, how will you represent it? You have to find a font that has the character and render it. This task is greatly complicated by the need to display both left-to-right and right-to-left text, the existence of zero-width combining characters that modify previous characters, the fact that some languages require wider character cells than others, and context-sensitive letter forms.
  • Internationalization (i18n)
    This refers to the practice of translating a program into multiple languages, effectively by translating all of the program’s strings.
  • Lexicography
    Code that processes text as more than just binary data might have to become a lot smarter. The problems of searching, sorting, and modifying letter case (upper/lower) vary per-language. If your application doesn’t need to perform such tasks, consider yourself lucky. If you do need these operations, you can probably find a UI toolkit or i18n library that already implements them.
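As a small taste of why case handling is per-language, here is a sketch in Python of a well-known surprise: the German letter ß uppercases to two letters, so case conversion can change a string's length.

```python
# German ß uppercases to the two-letter sequence "SS":
print("straße".upper())   # STRASSE

# Because simple upper()/lower() round-trips are not guaranteed to
# preserve text, caseless comparison should use casefold():
assert "STRASSE".casefold() == "straße".casefold()
```

Turkish dotted/dotless i is another classic case where the correct mapping depends on the locale, not just the character, which is exactly why a proper i18n library is worth using.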

In the next post, we will talk about the different APIs used for encoding and decoding in various programming languages.

