Almost Everything You Need to Know about Charset Encoding, UTF-8, ISO-8859. Conversion and More

Written by
Date: 2011-08-30 22:50:00 00:00


Introduction

From Wikipedia:

A character encoding system consists of a code that pairs each character from a given repertoire with something else, such as a sequence of natural numbers, octets or electrical pulses, in order to facilitate the transmission of data (generally numbers and/or text) through telecommunication networks or storage of text in computers.

Most modern web browsers feature automatic character encoding detection.

Probably the most widely used character encodings today are UTF-8 and the ISO-8859 family, which are the ones this article deals with.

If you want your web pages to look the way they should, the documents (HTML files) must be created using the same encoding that the browser will use to display them.

First you need to decide which charset you are going to use, and that depends on the language of your content.

ISO-8859-1 is for Western European languages. You can check this page to learn about the other ISO-8859 subsets.

If you want to be on safe ground, choose UTF-8.

From Wikipedia:

UTF-8 can represent every character in the Unicode character set… … It is backward-compatible with ASCII and avoids the complications of endianness and byte order marks (BOM). For these and other reasons, UTF-8 has become the dominant character encoding for the World-Wide Web, accounting for more than half of all Web pages.
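To make the ASCII compatibility concrete, here is a small sketch that prints the raw bytes of a string before and after conversion. The helper name hex_bytes is made up for this illustration; it only relies on od, tr and iconv being available.

```shell
#!/bin/sh
# hex_bytes: print the bytes read from stdin as one hex string.
# (Hypothetical helper name, used only in this example.)
hex_bytes() {
    od -An -tx1 | tr -d ' \n'
}

# Plain ASCII text has exactly the same bytes when encoded as UTF-8.
ascii=$(printf 'Hello' | hex_bytes)
utf8=$(printf 'Hello' | iconv -f ASCII -t UTF-8 | hex_bytes)
echo "ASCII: $ascii  UTF-8: $utf8"

# A non-ASCII character differs: e with acute accent is the single byte
# 0xE9 (octal 351) in ISO-8859-1, but two bytes in UTF-8.
latin1_e=$(printf '\351' | hex_bytes)
utf8_e=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | hex_bytes)
echo "e-acute: ISO-8859-1=$latin1_e  UTF-8=$utf8_e"
```

This is why a page saved as ISO-8859-1 but displayed as UTF-8 (or the other way round) shows garbage only for accented characters, while the plain ASCII parts look fine.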

The tools

Let's look at the tools you need to create UTF-8 files, or to convert files in other encodings to UTF-8.

Check the charset of a file

file -bi [file]

The output looks like this: text/plain; charset=us-ascii, which tells you the file is plain text and uses the us-ascii encoding. Remember from above that UTF-8 is backward-compatible with ASCII, so such a file is already valid UTF-8.

You can also get something like this: text/html; charset=iso-8859-1, which tells you that the file is HTML, and the encoding is ISO-8859-1.
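If you have many files to check, a small wrapper can help. The sketch below assumes a Linux-style file command that accepts -bi; the function names charset_of and is_valid_utf8 are invented for this example. Keep in mind that file guesses the encoding from the content, so treat its answer as a hint.

```shell
#!/bin/sh
# charset_of FILE: print only the charset part of the "file -bi" output.
# (Hypothetical helper name.)
charset_of() {
    file -bi "$1" | sed 's/.*charset=//'
}

# is_valid_utf8 FILE: succeed if the bytes of FILE decode as UTF-8.
# iconv stops with an error on the first invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Report every HTML file in the current directory.
for f in *.html; do
    [ -e "$f" ] || continue    # no HTML files here: nothing to do
    if is_valid_utf8 "$f"; then
        printf '%s: %s (valid UTF-8)\n' "$f" "$(charset_of "$f")"
    else
        printf '%s: %s (NOT valid UTF-8)\n' "$f" "$(charset_of "$f")"
    fi
done
```

A file reported as us-ascii passes the UTF-8 check as well, which is the backward compatibility mentioned above.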

Convert from one charset to another

You may want to convert your files from one charset to another. To do that use this command:

iconv -f ASCII -t UTF-8 [filename] > [newfilename]

This converts from ASCII to UTF-8. Make sure the encoding you are converting to supports all the characters in the document you are re-encoding.
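Because the output is redirected to a new file, a failed conversion can leave a half-written file behind. Here is one way to guard against that, as a sketch: the function name convert_file and its argument order are my own choices, not part of iconv.

```shell
#!/bin/sh
# convert_file FROM TO INFILE OUTFILE
# (Hypothetical helper.) Converts INFILE from encoding FROM to encoding TO
# and only creates OUTFILE if iconv finished without errors.
convert_file() {
    from=$1 to=$2 infile=$3 outfile=$4
    tmp="$outfile.tmp.$$"
    if iconv -f "$from" -t "$to" "$infile" > "$tmp" 2> /dev/null; then
        mv "$tmp" "$outfile"
    else
        # Invalid input, or a character the target encoding cannot hold.
        rm -f "$tmp"
        echo "cannot convert $infile from $from to $to" >&2
        return 1
    fi
}

# Example use (file names are placeholders):
#   convert_file ISO-8859-1 UTF-8 old.html new.html
```

With the iconv that ships with glibc, converting to an encoding that lacks some character (for example UTF-8 text with accents to ASCII) ends with an error, so the function reports it instead of writing a truncated file. Other iconv implementations may substitute a placeholder character instead.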

Create files in UTF-8

To create new files using UTF-8 encoding, set your LANG variable to a UTF-8 locale.

export LANG=en_US.UTF-8

Then create the file. If you are using Vim and want to be sure, force UTF-8 with this command while in vi/vim:

:set encoding=utf-8
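A LANG value only takes effect if that locale is actually installed, and the exact spelling varies between systems. The sketch below picks the first UTF-8 locale from the output of locale -a; the function name first_utf8_locale is an assumption of this example.

```shell
#!/bin/sh
# first_utf8_locale: read locale names on stdin and print the first one
# whose name ends in .utf8 or .UTF-8. (Hypothetical helper name.)
first_utf8_locale() {
    grep -i -E '\.utf-?8$' | head -n 1
}

# Pick an installed UTF-8 locale and show the character map in effect.
candidate=$(locale -a 2> /dev/null | first_utf8_locale)
if [ -n "$candidate" ]; then
    LANG=$candidate
    export LANG
    # Shows the active character map; LC_ALL or LC_CTYPE, if set,
    # take precedence over LANG.
    locale charmap
else
    echo "no UTF-8 locale installed" >&2
fi
```

If the last command does not report UTF-8, check whether LC_ALL or LC_CTYPE is set in your environment.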

Change the encoding of file names

You may want to change the encoding of the file names as well. To do that, use this command:

convmv -f iso-8859-1 -t utf-8 filename

You may need to install the convmv application first, using your package manager. Note that convmv only prints what it would do unless you add the --notest option.

OK, now you have enough tools to make sure your HTML documents use the correct encoding. Next you need to tell browsers which encoding to use.
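For a whole directory tree, a preview followed by the real run is a reasonable routine. This is a sketch under the assumption that convmv is installed; the function name rename_names_to_utf8 is made up, while -r (recurse) and --notest (really rename) are convmv options.

```shell
#!/bin/sh
# rename_names_to_utf8 FROM_ENCODING DIR
# (Hypothetical helper.) Shows the planned renames, then applies them.
rename_names_to_utf8() {
    from=$1 dir=$2
    # First pass: without --notest, convmv only reports what it would rename.
    convmv -f "$from" -t utf-8 -r "$dir"
    # Second pass: --notest performs the renames.
    convmv -f "$from" -t utf-8 -r --notest "$dir"
}

# Example use (the directory is a placeholder):
#   rename_names_to_utf8 iso-8859-1 ./public_html
```

In real use you would probably stop after the first pass, read the list, and only then run the second command by hand.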

Just add this line

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

in the <head> ... </head> section for UTF-8, or

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

for ISO-8859-1. I'm sure you've got the idea.
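The declaration only helps if it matches how the file was really saved. Here is a rough check, as a sketch: both function names are invented for this example, and the grep pattern is a simple text match, not an HTML parser.

```shell
#!/bin/sh
# declared_charset FILE: print the first charset named in FILE, lowercased.
# (Hypothetical helper name.)
declared_charset() {
    grep -i -o 'charset=[^>;/ ]*' "$1" \
        | head -n 1 \
        | cut -d= -f2 \
        | tr -d "\"'" \
        | tr '[:upper:]' '[:lower:]'
}

# matches_declared_charset FILE: succeed if the bytes of FILE can be
# decoded with the charset that FILE declares.
matches_declared_charset() {
    cs=$(declared_charset "$1")
    [ -n "$cs" ] && iconv -f "$cs" -t UTF-8 "$1" > /dev/null 2>&1
}

# Example use (the file name is a placeholder):
#   matches_declared_charset index.html || echo "index.html: check encoding"
```

The check is mostly useful for pages declared as UTF-8, because any byte sequence decodes without error as ISO-8859-1.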

If you are using Nginx, you can declare the charset of your documents in the HTTP response headers, thus improving the speed at which your pages render in the client browser. Just add this line in the location section:

charset	utf-8;
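After reloading Nginx you can check that the header is really being sent. The sketch below extracts the charset from a set of response headers; the function name charset_from_headers is an assumption, and the URL in the usage comment is only a placeholder for your own site.

```shell
#!/bin/sh
# charset_from_headers: read HTTP response headers on stdin and print the
# charset given in the Content-Type header, if there is one.
# (Hypothetical helper name.)
charset_from_headers() {
    tr -d '\r' \
        | grep -i '^content-type:' \
        | head -n 1 \
        | sed -n 's/.*[Cc][Hh][Aa][Rr][Ss][Ee][Tt]=\([^; ]*\).*/\1/p'
}

# Example use, assuming curl is installed and the site answers locally:
#   curl -sI http://localhost/ | charset_from_headers
```

If nothing is printed, the server is not sending a charset and the browser falls back to the meta tag or to its own detection.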

Conclusion

I think you now have all the tools you need to make sure your pages are displayed correctly in your visitors' browsers.