Earlier in this space, I detailed a problem I ran into with accented characters when trying to set up a system that uses NetNewsWire to maintain a blogroll for my weblog.
At long last, I’ve cracked the accented character problem, and so, if only to assist others who find themselves in the same boat, I present the final results here.
My goal was to export my list of RSS subscriptions in NetNewsWire to a text file, transfer that text file to my webserver, and then post-process it in PHP, loading it into a MySQL database where I could manipulate it at will, including using it to draw this blogroll page on the fly.
The first step is to use AppleScript to export the subscription list as a text file:
tell application "NetNewsWire"
    set c to ""
    set linefeed to "\n"
    repeat with thisSub in subscriptions
        -- Collect this subscription's attributes, one per line
        set s to "" as Unicode text
        set s to s & (is group of thisSub) & linefeed
        set s to s & (inGroup of thisSub) & linefeed
        set s to s & (display name of thisSub) & linefeed
        set s to s & (givenName of thisSub) & linefeed
        set s to s & (givenDescription of thisSub) & linefeed
        set s to s & (home URL of thisSub) & linefeed
        set s to s & (RSS URL of thisSub) & linefeed
        set s to s & (icon URL of thisSub) & linefeed
        -- Convert the Unicode text to ISO-8859-1 (TEC OSAX command; must be on one line)
        set t to TECConvertText s fromCode "UNICODE-2-0" toCode "ISO-8859-1"
        set c to c & t
    end repeat
end tell
-- Write the accumulated text to blogroll.txt
set blogroll to "blogroll.txt"
set f to (POSIX path of blogroll)
set n to open for access file f with write permission
write c to n
close access n
This script loops through each of my subscriptions in NetNewsWire, gathers the relevant parts to export as a Unicode string called s, and then converts that string to ISO-8859-1 (aka ISO Latin-1) using a scripting addition called TEC OSAX.
To download and install TEC OSAX, simply follow the download link on this page, and then copy the file called TEC.osax to /Library/ScriptingAdditions (you may need to create that folder if it doesn’t exist already; note that there’s no space in the folder name).
Once the conversion to ISO-8859-1 is complete, the resulting string t is added to a string c which will later be written to a file.
Once all the information is gathered about each subscription, the string c, which contains a linefeed-separated list of attributes for each subscription, is written to a file called blogroll.txt.
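To make the file layout concrete, here is a minimal sketch of how such a file could be split back into records. It is written in Python rather than the PHP I actually use on the server, and the names `Subscription` and `parse_blogroll` are made up for illustration. It assumes eight lines per subscription, in the order the AppleScript writes them, and that no field contains an embedded linefeed.

```python
from typing import List, NamedTuple


class Subscription(NamedTuple):
    # Field order mirrors the order the AppleScript appends them in.
    is_group: str
    in_group: str
    display_name: str
    given_name: str
    given_description: str
    home_url: str
    rss_url: str
    icon_url: str


def parse_blogroll(data: bytes) -> List[Subscription]:
    """Split ISO-8859-1 bytes into records of eight lines each."""
    lines = data.decode("iso-8859-1").split("\n")
    # Every field ends with a linefeed, so the split leaves one empty tail item.
    if lines and lines[-1] == "":
        lines.pop()
    size = len(Subscription._fields)
    # Assumption: no field contains a linefeed; an incomplete trailing
    # record (fewer than eight lines) is silently skipped.
    return [Subscription(*lines[i:i + size])
            for i in range(0, len(lines) - size + 1, size)]
```

Decoding as ISO-8859-1 up front means every later step works with proper characters instead of raw bytes.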
This file is then copied to my webserver, using SCP, by a shell script, which then post-processes the file using a PHP script, the important part of which is this:
$string = htmlentities($string);
This line, which appears in the loop that reads in and parses the blogroll text file, converts the accented characters in the ISO-8859-1 character set to HTML entities.
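For readers who want to see what that conversion amounts to, here is a rough Python approximation; it is not PHP's implementation, and `to_html_entities` is a hypothetical helper name. It uses the HTML 4 entity table from Python's standard library, so, like htmlentities, it also escapes characters such as & and <. One caveat worth knowing: since PHP 5.4, htmlentities assumes UTF-8 by default, so on a current PHP the ISO-8859-1 encoding would have to be passed explicitly as the third argument.

```python
from html.entities import codepoint2name


def to_html_entities(text: str) -> str:
    """Replace every character that has a named HTML 4 entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        # Characters without a named entity pass through unchanged.
        out.append(f"&{name};" if name else ch)
    return "".join(out)


raw = b"Fran\xe7ois Nonnenmacher"  # ISO-8859-1 bytes, as read from the file
print(to_html_entities(raw.decode("iso-8859-1")))
# prints: Fran&ccedil;ois Nonnenmacher
```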
The end result is that the ç that started out in NetNewsWire as a MacRoman 0x8D gets converted to Unicode U+00E7, then to the ISO-8859-1 character 0xE7, and finally to the HTML entity &ccedil;.
And so accents are preserved: François Nonnenmacher is stored as Fran&ccedil;ois Nonnenmacher and displays as François Nonnenmacher.
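The journey of that single ç can be checked byte by byte. The following is a small Python sketch, not part of my actual setup, that relies only on Python's built-in `mac_roman` and `iso-8859-1` codecs and its HTML 4 entity table; the variable names are illustrative. Note that ç is the byte 0xE7 in ISO-8859-1.

```python
from html.entities import codepoint2name

mac_byte = bytes([0x8D])                   # ç as stored in MacRoman
char = mac_byte.decode("mac_roman")        # the Unicode character U+00E7
latin1 = char.encode("iso-8859-1")         # a single byte, 0xE7
entity = f"&{codepoint2name[ord(char)]};"  # the named HTML entity

print(f"U+{ord(char):04X}", latin1.hex(), entity)
# prints: U+00E7 e7 &ccedil;
```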