At long last, I’ve cracked the accented character problem, and so, if only to assist others who find themselves in the same boat, I present the final results here.
My goal was to export my list of RSS subscriptions in NetNewsWire to a text file, transfer that text file to my webserver, and then post-process it in PHP, loading it into a MySQL database where I could manipulate it at will, including using it to draw this blogroll page on the fly.
The first step is to use AppleScript to export the subscription list as a text file:
tell application "NetNewsWire"
    set c to ""
    set linefeed to "\n"
    repeat with thisSub in subscriptions
        set s to "" as Unicode text
        set s to s & (is group of thisSub) & linefeed
        set s to s & (inGroup of thisSub) & linefeed
        set s to s & (display name of thisSub) & linefeed
        set s to s & (givenName of thisSub) & linefeed
        set s to s & (givenDescription of thisSub) & linefeed
        set s to s & (home URL of thisSub) & linefeed
        set s to s & (RSS URL of thisSub) & linefeed
        set s to s & (icon URL of thisSub) & linefeed
        set t to TECConvertText s fromCode "UNICODE-2-0" toCode "ISO-8859-1"
        set c to c & t
    end repeat
end tell
set blogroll to "blogroll.txt"
set f to (POSIX path of blogroll)
set n to open for access file f with write permission
write c to n
close access n
This script loops through each of my subscriptions in NetNewsWire, gathers the relevant parts to export as a Unicode string called s, and then converts that string to ISO-8859-1 (aka ISO Latin-1) using a scripting addition called TEC OSAX.
To download and install TEC OSAX, simply follow the download link on this page, and then copy the file called TEC.osax to /Library/ScriptingAdditions (you may need to create that folder if it doesn’t exist already; note that there’s no space in the folder name).
Once the conversion to ISO-8859-1 is complete, the resulting string t is added to a string c which will later be written to a file.
Once all the information is gathered about each subscription, the string c, which contains a linefeed-separated list of attributes for each subscription, is written to a file called blogroll.txt.
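To make the layout of blogroll.txt concrete, here is a minimal sketch of how such a file could be read back, eight lines per subscription. It is written in Python purely for illustration (my own post-processing is in PHP), the field names and the parse_blogroll helper are hypothetical, and it assumes no field contains a linefeed of its own.

```python
# Hypothetical field names, in the order the AppleScript writes them.
FIELDS = [
    "is_group", "in_group", "display_name", "given_name",
    "given_description", "home_url", "rss_url", "icon_url",
]


def parse_blogroll(raw):
    """Split ISO-8859-1 bytes into one dict per subscription.

    Assumes every subscription occupies exactly len(FIELDS) lines,
    i.e. no field value contains an embedded linefeed.
    """
    lines = raw.decode("iso-8859-1").split("\n")
    # The export ends with a linefeed, which leaves one empty trailing item.
    if lines and lines[-1] == "":
        lines.pop()
    records = []
    # Step through the lines in blocks of eight; ignore an incomplete tail.
    for start in range(0, len(lines) - len(FIELDS) + 1, len(FIELDS)):
        block = lines[start:start + len(FIELDS)]
        records.append(dict(zip(FIELDS, block)))
    return records


# A made-up single-subscription export, encoded the way the script writes it.
sample = (
    "false\nFriends\nFran\xe7ois\nFran\xe7ois Nonnenmacher\n\n"
    "http://example.com/\nhttp://example.com/rss\n\n"
).encode("iso-8859-1")

subscriptions = parse_blogroll(sample)
print(subscriptions[0]["display_name"])
```

Empty attributes, such as a missing description or icon URL, still take up a line, which is what keeps the eight-line blocks aligned.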
This file is then copied to my webserver, using SCP, by a shell script, which then post-processes the file using a PHP script, the important part of which is this:
$string = htmlentities($string);
This line, which appears in the loop that reads in and parses the blogroll text file, converts the accented characters in the ISO-8859-1 character set to HTML entities.
The end result is that the ç that started out in NetNewsWire as MacRoman 0x8D gets converted to Unicode U+00E7, then to the ISO-8859-1 character 0xE7, and finally to the HTML entity &ccedil;.
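The chain of conversions for a single character can be checked with a few lines of code. This is only an illustrative sketch in Python, not part of my AppleScript or PHP; the trace_character helper is a hypothetical name, and it relies on Python's built-in mac_roman and iso-8859-1 codecs and its table of named HTML entities.

```python
from html.entities import codepoint2name


def trace_character(mac_byte):
    """Follow one MacRoman byte through Unicode, ISO-8859-1 and HTML."""
    # MacRoman byte -> Unicode character
    char = bytes([mac_byte]).decode("mac_roman")
    # Unicode character -> single ISO-8859-1 byte
    latin1 = char.encode("iso-8859-1")
    # Unicode code point -> named HTML entity
    entity = "&" + codepoint2name[ord(char)] + ";"
    return ord(char), latin1[0], entity


code_point, latin1_byte, entity = trace_character(0x8D)
print(f"U+{code_point:04X} 0x{latin1_byte:02X} {entity}")
# prints: U+00E7 0xE7 &ccedil;
```

For characters in the Latin-1 range, the ISO-8859-1 byte value is the same as the Unicode code point, which is why ç is 0xE7 in both.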
And so accents get preserved: François Nonnenmacher is stored as Fran&ccedil;ois Nonnenmacher and comes out on the page as François Nonnenmacher.