Creating MySQL-powered websites, applications, or content management systems often involves dealing with unpleasant character encoding issues that are hard (and so not fun) to diagnose. In the worst case, the behavior may even differ between your development and production environments.
I’ve never dug deep enough into character encoding (since it’s not fun), but I plan to read this promising blog post recommended somewhere on Stack Overflow: Getting out of MySQL Character Set Hell
Before I get around to it (it’s quite long and detailed), here is a quick list of rules I’ve come up with after a lot of trial and error. I might update it after I read the article.
- The database’s collation must be set to utf8_general_ci (or anything more language-specific, utf8_language_ci)
- Every column in every table must have the same collation
- Every PHP page that opens a connection to the database should follow it immediately with this MySQL query:
SET NAMES 'utf8' COLLATE 'utf8_general_ci'
- PHP must send an HTTP header that specifies the encoding before any output:
header('Content-type: text/html; charset=utf-8');
- The HTML head must also specify the encoding in a meta tag, for example:
<meta charset="utf-8">
There is one special case that needs to be kept in mind:
- Whenever using PHP’s htmlentities() function, you must specify UTF-8 encoding
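The connection, header, and htmlentities() rules can be sketched together in one PHP file. This is only a rough illustration, assuming the mysqli extension is available. The function names open_utf8_connection() and escape_html() and all connection parameters are made up for the example.

```php
<?php
/**
 * Opens a connection and immediately switches it to UTF-8.
 * All four parameters are placeholders for your own settings.
 */
function open_utf8_connection(string $host, string $user, string $pass, string $name): mysqli
{
    $db = new mysqli($host, $user, $pass, $name);
    // The query from the list above; mysqli also offers $db->set_charset('utf8').
    $db->query("SET NAMES 'utf8' COLLATE 'utf8_general_ci'");
    return $db;
}

/** Escapes text for HTML output, always naming the encoding explicitly. */
function escape_html(string $text): string
{
    return htmlentities($text, ENT_QUOTES, 'UTF-8');
}

// The HTTP header has to be sent before any output.
header('Content-type: text/html; charset=utf-8');

// Hypothetical usage (needs a real database, so it is left commented out):
// $db = open_utf8_connection('localhost', 'app_user', 'secret', 'app_db');

echo escape_html('Café & "crème"'), "\n";
```

Wrapping htmlentities() in a small helper means the 'UTF-8' argument cannot be forgotten at individual call sites.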
Just in case anyone is wondering, the Stack Overflow question that brought me to the blog post is here, and the user adrienne, author of the best answer, lists these rules:
- The DB connection is using UTF-8
- The DB tables are using UTF-8
- The individual columns in the DB tables are using UTF-8
- The data is actually stored properly in the UTF-8 encoding inside the database (often not the case if you’ve imported from bad sources, or changed table or column collations)
- The web page is requesting UTF-8
- Apache is serving UTF-8
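The rule about data actually being stored as UTF-8 is the hardest one to check by eye. Here is a minimal sketch of two checks you could run on values fetched from the database, assuming the mbstring extension is loaded. The function names are hypothetical, and the second check is only a heuristic for the common "UTF-8 read as Latin-1 and encoded again" damage (the kind that turns "é" into "Ã©"), not a reliable detector.

```php
<?php
/** Returns true when $bytes is a well-formed UTF-8 byte sequence. */
function is_valid_utf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

/**
 * Heuristic: text that was UTF-8, got misread as Latin-1 and was then
 * encoded to UTF-8 again usually contains U+00C2 or U+00C3 followed by
 * a character in the range U+0080 to U+00BF.
 */
function looks_double_encoded(string $text): bool
{
    return preg_match('/[\x{00C2}\x{00C3}][\x{0080}-\x{00BF}]/u', $text) === 1;
}

// Simulate the damage: treat correct UTF-8 bytes as if they were Latin-1.
$good   = "caf\xC3\xA9"; // "café" in UTF-8
$broken = mb_convert_encoding($good, 'UTF-8', 'ISO-8859-1');

var_dump(is_valid_utf8($good));
var_dump(looks_double_encoded($good));
var_dump(looks_double_encoded($broken));
```

Note that double-encoded text is still valid UTF-8, which is why a validity check alone does not catch it.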