Posts Tagged “iconv”

I had a major issue facing the iconv functionality of the sysadmin toolset namely due to rushed coding.

When loading a file to be re-encoded the entire file was loaded into the buffer, encoded as whole and written out to the new file, this of course meant the  memory usage was roughly double the size of the file to be converted plus any overheads to do with the encoding itself.

Today I had need to convert a 1.3GB sql file, needles to say the script was crashing out with a memory error.

As such I have now completely re-written the function it now processes the file in 1kb ‘chunks’, moving the load to the CPU, this process is now very cpu intensive the the memory overhead is minimal (during test processed the 1.3GB file using 113kb of memory!!!).

Also I have now added BOM (Byte order mark) detection:

Tags: , , , ,

Comments No Comments »

    The problem

We’ve all been in this position at some point, working for a company who wants to internationalize their website, and so their mySQL CMS data …

But all is not so well as just using ‘SET NAMES utf8′ and changing all ‘charset’ on tables to utf8,

You may fall foul of seeing content like Á£

This is due to the fact in this case the latin1 encoded £ has not properly been converted to utf8 and as such is not rendering correctly, this is true of most ‘multibyte’ characters.

    The solution

What you need to do is actually convert the character set to utf8, in order to do this your going to need to run it through a program you could use iconv if you are already familiar with it, however if your system has python installed you can grab a copy of my sysadmin program which has iconv like functionality but is far more user friendly.

    What you will need
    Preparing the file

This assumes the database is currently using latin1, in theory this could be any encoding.

Get a dump of the database:

1
mysqldump --set-character-set=latin-1 --set-charset -u user -pPASSWORD databasename > databasename-latin1.sql

Now you have to be aware of what you need to replace using SED, you can’t just replace all instances of ‘latin1′ as Murphy’s law being as it is means that somewhere there will be ‘latin1′ in the physical content, especially for instance if I was using a mysql dump from this blog.

As such you need to replace the following:

1
/*!40101 SET NAMES latin1 */;

If your database dump is small enough (sub 100mb) you can edit this line directly in your text editor, alternatively you can do the following.

1
2
3
cat ./databasename-latin1.sql | sed 's/SET NAMES latin1/SET NAMES utf8/g' > tmp
cat ./tmp > ./databasename-latin1.sql
rm -f ./tmp

Now you need to replace all instances of ‘CHARSET=latin1′

1
2
3
cat ./databasename-latin1.sql | sed 's/CHARSET=latin1/CHARSET=utf8/g' > tmp
cat ./tmp > ./databasename-latin1.sql
rm -f ./tmp

Now we have to run the file through the charset converter

1
sysadmin -c iconv -d ./databasename-latin1.sql,latin-1,utf-8

If your sql dump is over 30mb, you will be prompted to confirm you wish to proceed, please remember that this will load the entire file into memory, so ensure you have enough available system memory before proceeding, I also suggest not running this on a production server.

If any characters could not be converted you will be alerted to their exact position within the file, from there you will either need to use sed to replace the character or use your text editor.

If all went well you now have ./databasename-latin1.sql.utf-8 (note the utf-8 extension), you now have a complete utf8 mySQL dump, all you need do now is import the dump.

    Further reading

  1. Force mySQL utf8 connections
  2. mySQL backup script
Tags: , , , , , , , , , ,

Comments 2 Comments »

Creative Commons License