UTF-8 and utf8: Another piece of the puzzle

I have been trying to parse Google search results and save them in MySQL. Naturally my script was in Perl, and naturally this was not as straightforward as it looked. The most irritating part of the problem was handling the percent-encoded (hex) Unicode in the URLs returned by Google search. Some of this was just normal stuff like ?, &, = and so forth. Some, however, was not; it was Chinese Unicode characters (eeeek!). These mostly came from BBS or blog searches, and a nasty long mess it was.

The normal stuff is easily handled with URI::Escape; just uri_unescape($href) and you’re done.
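For the record, here is a minimal sketch of that easy case. The URL is a made-up example, not one of the real search results:

```perl
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

# Hypothetical href with percent-encoded ASCII punctuation (?, =, &).
my $escaped = 'http://example.com/search%3Fq%3Dperl%26hl%3Den';
my $plain   = uri_unescape($escaped);

print "$plain\n";   # http://example.com/search?q=perl&hl=en
```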

Unicode was a different story. Here is what I finally wound up doing:

use Encode qw(decode encode);
use URI::Escape;

…..

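my $href = uri_unescape($href);    # first pass: catches the ordinary escapes
if ($href =~ s/(%.*)//) {          # cut off everything from the first leftover % onward
    # unescape the leftover part a second time, then decode its bytes as UTF-8
    $href .= decode('UTF-8', uri_unescape($1));
}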

I first apply uri_unescape to the whole string, which catches the ordinary stuff. Then I chop off the bit that didn’t turn into normal characters and unescape it again, wrapping the result in decode('UTF-8', $dehexedunicode).

Two odd things. First, I cannot do the decode-and-unescape step all at once and get the normal characters and the Unicode characters at the same time. When I try, it simply returns the original hex-encoded string. You have to get the regular escapes out of the way before you can proceed. So far my regex manages this; I guess the weird stuff is always at the end of the string or something.
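One possible explanation, and this is only a guess on my part, is that the Chinese part of the URL was percent-encoded twice, so that each % had itself become %25. Under that assumption, a sketch like the following handles the whole string without cutting it up. The helper name unescape_and_decode, the five-round limit and the sample string are all made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);
use URI::Escape qw(uri_unescape);

# Hypothetical helper: unescape repeatedly (at most 5 rounds) until no
# %XX triplets remain, then decode the resulting bytes as strict UTF-8.
sub unescape_and_decode {
    my ($href) = @_;
    my $rounds = 0;
    while ($href =~ /%[0-9A-Fa-f]{2}/ && $rounds++ < 5) {
        $href = uri_unescape($href);
    }
    return decode('UTF-8', $href);
}

# %25 is an escaped '%', so this is the UTF-8 for U+4E2D U+6587, escaped twice.
my $double = 'q=%25E4%25B8%25AD%25E6%2596%2587&hl=en';
my $text   = unescape_and_decode($double);

printf "%d characters\n", length $text;   # 10 characters
```

The obvious catch is that a string which legitimately contains a literal % followed by two hex digits would be unescaped once too often, so this only suits input known to be escaped all the way down.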

Second, you must use decode('UTF-8', $dehexedunicode) to get a result which you can insert into MySQL. NOTICE the hyphen (the capitals turn out not to matter, as the aliases below show). As the Encode package POD explains,

utf8 = UTF8
and
utf-8 = utf_8 = UTF-8 = UTF_8

So the difference is between hyphenated UTF-8 (strict, following the Unicode standard's definition) and unhyphenated utf8 (loose, Perl's more permissive variant).
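A quick way to check which spelling lands on which encoding is to ask Encode itself. This is a small sketch using find_encoding; the list of spellings and the sample bytes are my own:

```perl
use strict;
use warnings;
use Encode qw(find_encoding decode);

# Ask Encode which canonical encoding each spelling resolves to.
for my $label (qw(UTF-8 utf-8 utf_8 UTF_8 utf8 UTF8)) {
    printf "%-6s => %s\n", $label, find_encoding($label)->name;
}

# On well-formed input, the strict and loose decoders should agree.
my $bytes  = "\xE4\xB8\xAD\xE6\x96\x87";   # UTF-8 bytes for U+4E2D U+6587
my $strict = decode('UTF-8', $bytes);
my $loose  = decode('utf8',  $bytes);
print "same result\n" if $strict eq $loose;
```

The hyphenated spellings should report utf-8-strict and the unhyphenated ones utf8. As far as I can tell, the two only part ways on input that is malformed or outside what the Unicode standard allows.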

At first, instead of trying to get the function right AND stuff it into MySQL in one go, I just grabbed the text and wrote it to a file (on a Windows XP machine). For this, I used utf8, and this worked fine; the characters were reconstituted from their freeze-dried hexedness and showed up in the file without problem. But when I applied this proven technique to MySQL (5.5.8), the characters were immediately discombobulated into a primeval morass. Only UTF-8 and its aliases will do for the son of Monty. It’s just a strict sort of program.
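For the file-writing step, a sketch along these lines makes the output encoding explicit instead of leaving it to chance. The temp file stands in for my real output file, and the :encoding(UTF-8) layer is my assumption about how best to write decoded text, not what I originally ran:

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

my $text = decode('UTF-8', "\xE4\xB8\xAD\xE6\x96\x87");   # two characters

# A temp file stands in for the real output file (hypothetical path).
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# The :encoding(UTF-8) layer turns characters back into bytes on output.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Read the file back raw to see the bytes that actually landed on disk.
open my $in, '<:raw', $path or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters in, %d bytes on disk\n", length $text, length $bytes;
```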

Live and learn. Sigh.
