Zip files and Encoding – I hate you.

December 8th, 2008

I’ve written about some of the issues with depending on zip as a packaging format in the past. As people know, Web Apps is depending on Zip as the packaging format for Widgets.

Zip the good

Zip has a lot going for it. It is ubiquitous and dependable… so long as you don’t want to share files across cultures.

Zip the bad

The Zip spec does not seem to know that there are normalization forms for Unicode, when there are actually four (or more, because there are some non-standard ones too!). The Zip spec gives no guidance as to how file names inside zip files are to be normalized.

Consider: when a zip file is created on Linux, the tool just writes the bytes for the file name in the encoding of the underlying file system. So, if the file system is in ISO-8859-1, the bytes are written in ISO-8859-1. This may seem OK, but when you decompress the zip file on Windows, which runs on the Windows-1252 encoding, the file names get all mangled. If the underlying encoding of the file system on Linux is something else, you won’t be able to share files with other systems at all. So in this case, it is not Windows’ fault.

The Zip spec says that the only supported encodings are CP437 and UTF-8, but everyone has ignored that. Implementers just encode file names however they want (usually byte for byte as they are in the OS… see table below).
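To make the mismatch concrete, here is a minimal Python sketch. It assumes a Linux tool that stored the name in ISO-8859-1 and a reader that only tries the two encodings the spec allows; the helper `try_decode` and the file name are made up for illustration.

```python
from typing import Optional


def try_decode(raw: bytes, encoding: str) -> Optional[str]:
    """Return the decoded name, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# Bytes that a tool on an ISO-8859-1 file system would store for "ñ.txt".
raw_name = "\u00f1.txt".encode("iso-8859-1")

# The same bytes, read under different assumptions about the encoding.
for encoding in ("iso-8859-1", "cp437", "utf-8"):
    print(encoding, repr(try_decode(raw_name, encoding)))
```

Only the reader that guesses the writer's encoding gets the original name back. The CP437 reader sees a different character, and the UTF-8 reader cannot decode the bytes at all.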

It gets worse! Because Mac OS runs on some weird non-standard decomposed Unicode mode, you can only share zip files with other Mac OS users. According to this email, the LimeWire guys also ran into a similar problem with encodings in Mac OS:

“for example a French, German or Spanish Windows user cannot exchange files that contain [file names with] French, German or Spanish accents with a French, German or Spanish Macintosh users”

The following table illustrates the problem:

Bytes that represent ñ in a Zip file (in hex)

File name | Zip in Windows               | Zip in Linux      | Zip in Mac OS
ñ         | A4 (Extended US-ASCII/CP437) | C3 B1 (UTF-8 NFC) | 6E CC 83 (UTF-8 NFD)

Yes! Holy crap! Three different byte sequences, corresponding to different character encodings.
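The three byte sequences can be reproduced with a few lines of Python. This is only a sketch using the standard `unicodedata` module; the variable names are mine.

```python
import unicodedata

n_tilde = "\u00f1"  # ñ as a single precomposed code point

# Windows column: legacy DOS code page.
cp437_bytes = n_tilde.encode("cp437")
# Linux column: UTF-8 with the character in composed form (NFC).
nfc_bytes = unicodedata.normalize("NFC", n_tilde).encode("utf-8")
# Mac OS column: UTF-8 with the character decomposed into n + combining tilde (NFD).
nfd_bytes = unicodedata.normalize("NFD", n_tilde).encode("utf-8")

for label, data in (("CP437", cp437_bytes), ("UTF-8 NFC", nfc_bytes), ("UTF-8 NFD", nfd_bytes)):
    print(label, data.hex(" ").upper())
```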

The only way around this would be a *special* custom-built widget zipping tool that normalizes file name strings to NFC. If the widget engine needs to decompress the widget to disk, then it would take the NFC names and convert them to the operating system’s native encoding (or store the files in memory, and reference them that way). This affects the URI scheme and DOM normalization of Widgets, so Web Apps will have to deal with it eventually… but I am not sure exactly how.
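As a rough idea of what the zipping side of such a tool could look like, here is a sketch built on Python's standard `zipfile` module, which stores non-ASCII names as UTF-8. The function name `write_nfc_zip` and the sample file are hypothetical, and this is not something the Widgets spec defines.

```python
import io
import unicodedata
import zipfile


def write_nfc_zip(files: dict, target) -> None:
    """Write `files` (name -> bytes) to a zip, storing every name in NFC."""
    with zipfile.ZipFile(target, "w") as archive:
        for name, data in files.items():
            # Normalize before the name ever reaches the archive, so a
            # decomposed name from Mac OS and a composed one from Linux
            # end up as the same bytes.
            nfc_name = unicodedata.normalize("NFC", name)
            archive.writestr(nfc_name, data)


# A decomposed (NFD) name, as a Mac OS file system would hand it over.
buffer = io.BytesIO()
write_nfc_zip({"n\u0303.txt": b"hello"}, buffer)

with zipfile.ZipFile(buffer) as archive:
    stored_names = archive.namelist()
    stored_data = archive.read(stored_names[0])

print(stored_names)
```

The engine side would do the reverse step when writing to disk: take the NFC name and let the operating system convert it to whatever form the file system expects.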

  1. December 8th, 2008 at 20:48 | #1

    at least it’s possible to work around this problem.

  2. kL
    December 8th, 2008 at 22:15 | #2

    I haven’t had much success with tar.bz2 either.

    There’s The Unarchiver for Mac OS X, which tries to guess the encoding of filenames.

    Since UTF-8 can be mostly-reliably distinguished from 8-bit encodings, I think it should be required for all decompressors.

    And NFD is Mac OS X’s problem, not ZIP’s. If some app tries to use bytes in filenames that the system simply does not allow by definition, then that’s a bug in the app, and the app should be fixed.

    I think going forward, all ZIP-dependent specs should require filenames in UTF-8 and forbid applications from relying on any particular Unicode normalization.

  3. Mike Seth
    December 8th, 2008 at 22:49 | #3

    Use tar(1)?

  4. WAHa.06×36
    December 8th, 2008 at 22:59 | #4

    There’s nothing “weird” or “nonstandard” about the OS X Unicode decomposition, it’s just plain NFD as far as I know. Now, if Unicode decomposition was the only problem, this would all be trivial.

    But the real problem is the already mentioned ISO-8859-1, Windows-1252, CP437, and the as-yet unmentioned Shift_JIS, EUC-KR, Big5, ISO-8859:s 2 through 15 or however many there are, and so on, and so on.

    Really, the only way to reliably open a zip file is to either ask the user for the character encoding (and he probably doesn’t know), or to try and autodetect it.

    I’ve had some success using Mozilla’s universalchardet to open Zip files in http://code.google.com/p/theunarchiver/. A friend is currently also helping to get some of the core code running on Linux. It’s all Objective-C, though, which will probably scare people off from using it.

  5. WAHa.06×36
    December 8th, 2008 at 23:01 | #5

    Also, tar is a very inflexible and limited format, and not much good for any platform with filesystem metadata of any kind, which is pretty much all of them these days.

  6. Cary Clark
    December 9th, 2008 at 03:18 | #6

    Future zip executables (compressors) should assume filenames are in the system encoding (a very reasonable assumption in my opinion) and convert them to UTF-8 in the created zip files.

  7. noep
    December 9th, 2008 at 04:19 | #7

    Serves people right for using languages other than English.

  8. April 9th, 2009 at 02:15 | #8

    Thank you for this research. I have been facing this problem recently.

  9. May 9th, 2009 at 06:38 | #9

    Blame all the programmers who think that encoding doesn’t matter and refuse to get on the UTF-8 bandwagon even though the rest of the world has long since been on the bus.

  10. O. Andersen
    July 25th, 2009 at 00:15 | #10

    Windows-1252 is actually a superset of ISO 8859-1 (disregarding control characters which will not appear in file names anyway), so your example is technically incorrect: encoding as ISO 8859-1 and decoding as Windows-1252 would work perfectly fine.

    You later mention CP437 as an encoding used on Windows machines, but also say that “everyone” ignores the specification, which says that only CP437 or UTF-8 should be used. (CP437 is effectively incompatible with ISO 8859-1 and Windows-1252.) I am confused as to what encoding Windows actually uses/assumes. Do some versions use Windows-1252 and others CP437? Please clarify. (Obviously, other encodings must be used for non-Western demographics, as touched upon by another commentator, but let us leave that for now.)

  11. gobi
    August 24th, 2009 at 20:49 | #11

    The unzip command has -O and -I options to specify source filename encodings.

    If the archive was created on Windows, use the -O option, something like:

    unzip -O sjis yourarchive.zip

    The -I option is used if you archived it on Linux/Unix with a different encoding.