<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd"
	xmlns:media="http://search.yahoo.com/mrss/"
	>
<channel>
	<title>Comments on: Zip files and Encoding &#8211; I hate you.</title>
	<atom:link href="http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/feed/" rel="self" type="application/rss+xml" />
	<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/</link>
	<description>By reading this blog you&#039;ve signed an NDA.</description>
	<lastBuildDate>Sun, 11 Sep 2011 09:03:42 -0700</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Yuhong Bao</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-2315</link>
		<dc:creator>Yuhong Bao</dc:creator>
		<pubDate>Mon, 27 Jun 2011 02:10:07 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-2315</guid>
		<description>&quot;There’s nothing “weird” or “nonstandard” about the OS X Unicode decomposition,&quot;
Not exactly true:
http://developer.apple.com/library/mac/#technotes/tn/tn1150.html</description>
		<content:encoded><![CDATA[<p>&#8220;There’s nothing “weird” or “nonstandard” about the OS X Unicode decomposition,&#8221;<br />
Not exactly true:<br />
<a href="http://developer.apple.com/library/mac/#technotes/tn/tn1150.html" rel="nofollow">http://developer.apple.com/library/mac/#technotes/tn/tn1150.html</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Yuhong Bao</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-2314</link>
		<dc:creator>Yuhong Bao</dc:creator>
		<pubDate>Mon, 27 Jun 2011 02:00:02 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-2314</guid>
		<description>O. Andersen: FYI, the difference between CP437 and CP1252 is that CP437 is an OEM codepage and CP1252 is the ANSI codepage. Yes, there are two codepages in use in Windows. Most GUI stuff uses the ANSI one; most console stuff and other stuff that comes from DOS uses the OEM one.</description>
		<content:encoded><![CDATA[<p>O. Andersen: FYI, the difference between CP437 and CP1252 is that CP437 is an OEM codepage and CP1252 is the ANSI codepage. Yes, there are two codepages in use in Windows. Most GUI stuff uses the ANSI one; most console stuff and other stuff that comes from DOS uses the OEM one.</p>
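<p>A minimal Python sketch of the OEM/ANSI split described above, assuming a Western-locale Windows system where the OEM codepage is CP437 and the ANSI codepage is CP1252. The filename is made up for illustration; the point is only that the same character gets a different byte in each codepage, so bytes written under one are misread under the other.</p>
<pre>
```python
# Hypothetical filename, used only for illustration.
name = 'caf\u00e9.txt'

oem_bytes = name.encode('cp437')    # OEM codepage (console, DOS-era tools)
ansi_bytes = name.encode('cp1252')  # ANSI codepage (most GUI programs)

# The accented letter is one byte in both, but not the same byte.
print(oem_bytes.hex(), ansi_bytes.hex())

# Reading OEM bytes as if they were ANSI silently yields a different character.
misread = oem_bytes.decode('cp1252')
print(misread)
```
</pre>
<p>No error is raised in the misread case, which is why this kind of mix-up tends to show up as wrong characters rather than as a failure.</p>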
]]></content:encoded>
	</item>
	<item>
		<title>By: VlaD</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-2188</link>
		<dc:creator>VlaD</dc:creator>
		<pubDate>Thu, 18 Nov 2010 21:07:51 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-2188</guid>
		<description>I used this simple UTF-8 normalizing sub, which solved all issues with all the archive types:

Here&#039;s the example:
#!/usr/bin/perl -w

use strict;
use warnings;
use Encode;          # provides decode()
use Encode::Detect;  # registers the &quot;Detect&quot; pseudo-encoding

# Guess the encoding of a byte string and decode it to a Perl character string.
sub normalize_to_utf8 {
    return decode(&quot;Detect&quot;, shift);
}

my $fname = &quot;any_filename_from_archive_as_it_comes&quot;;
$fname = normalize_to_utf8($fname);

# now $fname&#039;s encoding has been detected and the name decoded;
# encode it as UTF-8 when writing it out

Hope this helps.</description>
		<content:encoded><![CDATA[<p>I used this simple UTF-8 normalizing sub, which solved all issues with all the archive types:</p>
<p>Here&#8217;s the example:</p>
<pre>#!/usr/bin/perl -w

use strict;
use warnings;
use Encode;          # provides decode()
use Encode::Detect;  # registers the "Detect" pseudo-encoding

# Guess the encoding of a byte string and decode it to a Perl character string.
sub normalize_to_utf8 {
    return decode("Detect", shift);
}

my $fname = "any_filename_from_archive_as_it_comes";
$fname = normalize_to_utf8($fname);

# now $fname's encoding has been detected and the name decoded;
# encode it as UTF-8 when writing it out
</pre>
<p>Hope this helps.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: gobi</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-69</link>
		<dc:creator>gobi</dc:creator>
		<pubDate>Mon, 24 Aug 2009 10:49:17 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-69</guid>
		<description>The unzip command has -O and -I options to specify the source filename encoding.

If the archive was created on Windows, use the -O option, something like:

unzip -O sjis yourarchive.zip

The -I option is used if the archive was created on Linux/Unix with a different encoding.</description>
		<content:encoded><![CDATA[<p>The unzip command has -O and -I options to specify the source filename encoding.</p>
<p>If the archive was created on Windows, use the -O option, something like:</p>
<p>unzip -O sjis yourarchive.zip</p>
<p>The -I option is used if the archive was created on Linux/Unix with a different encoding.</p>
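<p>For anyone doing the same thing in code rather than on the command line, here is a rough Python sketch of the idea behind specifying the source encoding. It assumes the name was already decoded as CP437 (the default the ZIP specification names for entries not marked as UTF-8) while the archiver really wrote Shift_JIS. The helper name <code>redecode_zip_name</code> is hypothetical, not part of any library.</p>
<pre>
```python
def redecode_zip_name(name, real_encoding):
    """Undo a wrong CP437 decode, then decode the raw bytes properly."""
    raw = name.encode('cp437')        # recover the bytes stored in the archive
    return raw.decode(real_encoding)  # decode with the encoding actually used


# Simulate a Shift_JIS filename that a tool misread as CP437.
original = '\u65e5\u672c\u8a9e.txt'
garbled = original.encode('shift_jis').decode('cp437')

recovered = redecode_zip_name(garbled, 'shift_jis')
print(recovered == original)
```
</pre>
<p>The round trip works because CP437 assigns a character to every byte value, so no information is lost by the wrong decode. You still have to know, or guess, the real encoding.</p>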
]]></content:encoded>
	</item>
	<item>
		<title>By: O. Andersen</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-65</link>
		<dc:creator>O. Andersen</dc:creator>
		<pubDate>Fri, 24 Jul 2009 14:15:55 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-65</guid>
		<description>Windows-1252 is actually a superset of ISO 8859-1 (disregarding control characters which will not appear in file names anyway), so your example is technically incorrect: encoding as ISO 8859-1 and decoding as Windows-1252 would work perfectly fine.

You later mention CP437 as an encoding used on Windows machines, but also say that &quot;everyone&quot; ignores the specification, which says that only CP437 or UTF-8 should be used. (CP437 is effectively incompatible with ISO 8859-1 and Windows-1252.) I am confused as to what encoding Windows actually uses/assumes. Do some versions use Windows-1252 and others CP437? Please clarify. (Obviously, other encodings must be used for non-Western demographics, as touched upon by another commentator, but let us leave that for now.)</description>
		<content:encoded><![CDATA[<p>Windows-1252 is actually a superset of ISO 8859-1 (disregarding control characters which will not appear in file names anyway), so your example is technically incorrect: encoding as ISO 8859-1 and decoding as Windows-1252 would work perfectly fine.</p>
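<p>A quick way to check the superset claim, sketched in Python with its built-in codecs (the codec names <code>iso-8859-1</code> and <code>cp1252</code> are Python&#8217;s, not part of the original post). It compares how the two encodings decode the printable high range and shows one byte from the control range where they differ.</p>
<pre>
```python
# Bytes 0xA0-0xFF are the printable, non-ASCII part of ISO 8859-1.
printable = bytes(range(0xA0, 0x100))

same = printable.decode('iso-8859-1') == printable.decode('cp1252')
print(same)

# The encodings only part ways in 0x80-0x9F: control characters in
# ISO 8859-1, mostly printable characters in Windows-1252.
as_latin1 = b'\x80'.decode('iso-8859-1')
as_cp1252 = b'\x80'.decode('cp1252')
print(repr(as_latin1), repr(as_cp1252))
```
</pre>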
<p>You later mention CP437 as an encoding used on Windows machines, but also say that &#8220;everyone&#8221; ignores the specification, which says that only CP437 or UTF-8 should be used. (CP437 is effectively incompatible with ISO 8859-1 and Windows-1252.) I am confused as to what encoding Windows actually uses/assumes. Do some versions use Windows-1252 and others CP437? Please clarify. (Obviously, other encodings must be used for non-Western demographics, as touched upon by another commentator, but let us leave that for now.)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Christopher Warner</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-67</link>
		<dc:creator>Christopher Warner</dc:creator>
		<pubDate>Fri, 08 May 2009 20:38:42 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-67</guid>
		<description>Blame all the programmers who think that encoding doesn&#039;t matter and refuse to get on the UTF-8 bandwagon even though the rest of the world has long since been on the bus.</description>
		<content:encoded><![CDATA[<p>Blame all the programmers who think that encoding doesn&#8217;t matter and refuse to get on the UTF-8 bandwagon even though the rest of the world has long since been on the bus.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Todd Morrison</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-68</link>
		<dc:creator>Todd Morrison</dc:creator>
		<pubDate>Wed, 08 Apr 2009 16:15:27 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-68</guid>
		<description>Thank you for this research. I have been facing this problem recently.</description>
		<content:encoded><![CDATA[<p>Thank you for this research. I have been facing this problem recently.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Cary Clark</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-75</link>
		<dc:creator>Cary Clark</dc:creator>
		<pubDate>Mon, 08 Dec 2008 17:18:56 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-75</guid>
		<description>Future zip executables (compressors) should assume filenames are in the system encoding (a very reasonable assumption in my opinion) and convert them to UTF-8 in the created zip files.</description>
		<content:encoded><![CDATA[<p>Future zip executables (compressors) should assume filenames are in the system encoding (a very reasonable assumption in my opinion) and convert them to UTF-8 in the created zip files.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: WAHa.06x36</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-74</link>
		<dc:creator>WAHa.06x36</dc:creator>
		<pubDate>Mon, 08 Dec 2008 13:01:27 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-74</guid>
		<description>Also, tar is a very inflexible and limited format, and not much good for any platform with filesystem metadata of any kind, which is pretty much all of them these days.</description>
		<content:encoded><![CDATA[<p>Also, tar is a very inflexible and limited format, and not much good for any platform with filesystem metadata of any kind, which is pretty much all of them these days.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: WAHa.06x36</title>
		<link>http://datadriven.com.au/2008/12/zip-files-and-encoding-i-hate-you/comment-page-1/#comment-73</link>
		<dc:creator>WAHa.06x36</dc:creator>
		<pubDate>Mon, 08 Dec 2008 12:59:20 +0000</pubDate>
		<guid isPermaLink="false">http://datadriven.com.au/?p=112#comment-73</guid>
		<description>There&#039;s nothing &quot;weird&quot; or &quot;nonstandard&quot; about the OS X Unicode decomposition; it&#039;s just plain NFD as far as I know. Now, if Unicode decomposition were the only problem, this would all be trivial.

But the real problem is the already mentioned ISO-8859-1, Windows-1252, CP437, and the as-yet unmentioned Shift_JIS, EUC-KR, Big5, ISO-8859-2 through ISO-8859-15 or however many there are, and so on, and so on.

Really, the only way to reliably open a zip file is to either ask the user for the character encoding (and he probably doesn&#039;t know), or to try and autodetect it.

I&#039;ve had some success using Mozilla&#039;s universalchardet to open Zip files in http://code.google.com/p/theunarchiver/. A friend is currently also helping to get some of the core code to run on Linux. It&#039;s all Objective-C, though, which will probably scare people off from using it.</description>
		<content:encoded><![CDATA[<p>There&#8217;s nothing &#8220;weird&#8221; or &#8220;nonstandard&#8221; about the OS X Unicode decomposition; it&#8217;s just plain NFD as far as I know. Now, if Unicode decomposition were the only problem, this would all be trivial.</p>
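<p>To make the decomposition issue concrete, here is a small Python sketch using the standard <code>unicodedata</code> module. It shows standard NFD only; as another comment on this post points out with a link to Apple&#8217;s TN1150, the filesystem&#8217;s own decomposition may not match it exactly, so treat this as an approximation. The sample string is made up.</p>
<pre>
```python
import unicodedata

composed = 'caf\u00e9'  # accented e stored as a single code point
decomposed = unicodedata.normalize('NFD', composed)  # e followed by U+0301

# The two strings look the same on screen but compare as different.
print(len(composed), len(decomposed))
print(composed == decomposed)

# Normalizing both sides to one form before comparing makes them match.
recomposed = unicodedata.normalize('NFC', decomposed)
print(recomposed == composed)
```
</pre>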
<p>But the real problem is the already mentioned ISO-8859-1, Windows-1252, CP437, and the as-yet unmentioned Shift_JIS, EUC-KR, Big5, ISO-8859-2 through ISO-8859-15 or however many there are, and so on, and so on.</p>
<p>Really, the only way to reliably open a zip file is to either ask the user for the character encoding (and he probably doesn&#8217;t know), or to try and autodetect it.</p>
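<p>As a rough illustration of what autodetection has to do, here is a deliberately naive Python sketch: try a list of candidate encodings in order and take the first that decodes cleanly. This is an assumption-laden stand-in, not how a statistical detector works; the function name <code>guess_decode</code> and the candidate order are hypothetical.</p>
<pre>
```python
def guess_decode(raw, candidates=('utf-8', 'shift_jis', 'cp1252', 'cp437')):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError('no candidate encoding fits')


sample = '\u65e5\u672c\u8a9e'  # made-up test name

text_a, enc_a = guess_decode(sample.encode('utf-8'))
text_b, enc_b = guess_decode(sample.encode('shift_jis'))
print(enc_a, enc_b)
```
</pre>
<p>Strict encodings such as UTF-8 go first because they reject most foreign byte sequences. Single-byte codepages accept almost anything, so a first-fit approach like this cannot tell them apart, which is where real detectors need statistics.</p>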
<p>I&#8217;ve had some success using Mozilla&#8217;s universalchardet to open Zip files in <a href="http://code.google.com/p/theunarchiver/" rel="nofollow">http://code.google.com/p/theunarchiver/</a>. A friend is currently also helping to get some of the core code to run on Linux. It&#8217;s all Objective-C, though, which will probably scare people off from using it.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

