The problem

We’ve all been in this position at some point, working for a company who wants to internationalize their website, and so their mySQL CMS data …

But all is not so well as just using ‘SET NAMES utf8′ and changing all ‘charset’ on tables to utf8,

You may fall foul of seeing content like Á£

This is due to the fact in this case the latin1 encoded £ has not properly been converted to utf8 and as such is not rendering correctly, this is true of most ‘multibyte’ characters.

    The solution

What you need to do is actually convert the character set to utf8, in order to do this your going to need to run it through a program you could use iconv if you are already familiar with it, however if your system has python installed you can grab a copy of my sysadmin program which has iconv like functionality but is far more user friendly.

    What you will need
    Preparing the file

This assumes the database is currently using latin1, in theory this could be any encoding.

Get a dump of the database:

1
mysqldump --set-character-set=latin-1 --set-charset -u user -pPASSWORD databasename > databasename-latin1.sql

Now you have to be aware of what you need to replace using SED, you can’t just replace all instances of ‘latin1′ as Murphy’s law being as it is means that somewhere there will be ‘latin1′ in the physical content, especially for instance if I was using a mysql dump from this blog.

As such you need to replace the following:

1
/*!40101 SET NAMES latin1 */;

If your database dump is small enough (sub 100mb) you can edit this line directly in your text editor, alternatively you can do the following.

1
2
3
cat ./databasename-latin1.sql | sed 's/SET NAMES latin1/SET NAMES utf8/g' > tmp
cat ./tmp > ./databasename-latin1.sql
rm -f ./tmp

Now you need to replace all instances of ‘CHARSET=latin1′

1
2
3
cat ./databasename-latin1.sql | sed 's/CHARSET=latin1/CHARSET=utf8/g' > tmp
cat ./tmp > ./databasename-latin1.sql
rm -f ./tmp

Now we have to run the file through the charset converter

1
sysadmin -c iconv -d ./databasename-latin1.sql,latin-1,utf-8

If your sql dump is over 30mb, you will be prompted to confirm you wish to proceed, please remember that this will load the entire file into memory, so ensure you have enough available system memory before proceeding, I also suggest not running this on a production server.

If any characters could not be converted you will be alerted to their exact position within the file, from there you will either need to use sed to replace the character or use your text editor.

If all went well you now have ./databasename-latin1.sql.utf-8 (note the utf-8 extension), you now have a complete utf8 mySQL dump, all you need do now is import the dump.

    Further reading

  1. Force mySQL utf8 connections
  2. mySQL backup script

Tags: , , , , , , , , , ,
5 Responses to “Converting mySQL latin1 to utf8”
  1. aboad says:

    Thank You , but what if the database has ( binary table ) like image and files ?

  2. Buzz says:

    Hi Aboad,

    In that case I would only run this conversion tool on the non binary data, as far as I am aware there is no encoding for the BLOB data type, running this conversion tool may incorrectly try to convert the data.

  3. dvb says:

    You can use ‘–hex-blob’ to keep binary data intact.
    I’ve just used that to convert an old mybb forum.

    Here is a quote:
    –hex-blob
    Dump binary columns using hexadecimal notation (for example, ‘abc’ becomes 0×616263). The affected data types are BINARY, VARBINARY, the BLOB types, and BIT.

    ( http://dev.mysql.com/doc/refman/5.1/en/mysqldump.html#option_mysqldump_hex-blob )

    BTW,
    Here are some improvements:
    —–

    curdate=`date +%Y-%m-%d`

    mysqldump -u user -pXXXXX -h myhost.com –hex-blob –set-charset –default-character-set=latin1 databasename > “$curdate”.latin1.sql

    sed -i -e “1 iSET NAMES utf8;” -e ‘s/^\([)] ENGINE=MyISAM \(AUTO_INCREMENT=[0-9]\+ \)\?DEFAULT CHARSET=\)latin1;$/\1utf8;/; s#^\(/\*![0-9]\+ SET NAMES \)latin1 \*/;$#\1utf8 \*/;#’ “$curdate”.latin1.sql

    iconv -f latin1 -t utf8 “$curdate”.latin1.sql > “$curdate”.utf8.sql

    rm “$curdate”.latin1.sql

    mysql -u user -pXXXXX -h myhost.com databasename < "$curdate".utf8.sql

    —–
    Of course you should change the user, password, host and database names for your own.
    $curdate just hold the current date in the format YYYY-MM-DD, and used for filenames, you could give it any value.
    I'm using just one sed, so it should be faster for large files.
    The sed instructions are much more accurate and should never conflict with the data.

  4. Simon says:

    Here’s a little script that might help you, doubledecode.py — https://gist.github.com/1282752

  5. Simon says:

    Here’s a script to fix broken double encoded mysqldumps —
    https://gist.github.com/1282752

  6.  
Leave a Reply