2010-09-19

CLI tool to review PO files

If there is something annoying about reviewing PO files is that it is impossible. When there are two hundred messages in a PO file, how are you going to know which messages changed? Well, that's the way it works currently for Transifex but there are very good news, first a review board is already available which is a good step forward but second it is going to get some good kick to make it awesome. But until this happens, I have written two scripts to make such a review.

A shell script msgdiff.sh

Pros: tools available on every system
Cons: ugly output, needs template file

#!/bin/sh
PO_ORIG=$1
PO_REVIEW=$2
PO_TEMPL=$3

MSGMERGE=msgmerge
DIFF=diff
PAGER=more
RM=/bin/rm
MKTEMP=mktemp

# Usage
if test "$1" = "" -o "$2" = "" -o "$3" = ""; then
    echo Usage: $0 orig.po review.po template.pot
    exit 1
fi

# Merge
TMP_ORIG=`$MKTEMP po-orig.XXX`
TMP_REVIEW=`$MKTEMP po-review.XXX`
$MSGMERGE $PO_ORIG $PO_TEMPL > $TMP_ORIG 2> /dev/null
$MSGMERGE $PO_REVIEW $PO_TEMPL > $TMP_REVIEW 2> /dev/null

# Diff
$DIFF -u $TMP_ORIG $TMP_REVIEW | $PAGER

# Clean up files
$RM $TMP_ORIG $TMP_REVIEW

Example:
$ ./msgdiff.sh fr.po fr.review.po thunar.pot
[...]
 #: ../thunar-vcs-plugin/tvp-git-action.c:265
-#, fuzzy
 msgid "Menu|Bisect"
-msgstr "Différences détaillées"
+msgstr "Menu|Couper en deux"
 
 #: ../thunar-vcs-plugin/tvp-git-action.c:265
 msgid "Bisect"
-msgstr ""
+msgstr "Couper en deux"
[...]

A Python script podiff.py

Pros: programmable output
Cons: external dependency

The script depends on polib that can be installed with the setuptools scripts. Make sure setuptools is installed and than run the command sudo easy_install polib.

#!/usr/bin/env python
import polib

def podiff(path_po_orig, path_po_review):
    po_orig = polib.pofile(path_po_orig)
    po_review = polib.pofile(path_po_review)
    po_diff = polib.POFile()
    po_diff.header = "PO Diff Header"
    for entry in po_review:
        orig_entry = po_orig.find(entry.msgid)
        if not entry.obsolete and (orig_entry.msgstr != entry.msgstr \
        or ("fuzzy" in orig_entry.flags) != ("fuzzy" in entry.flags)):
            po_diff.append(entry)
    return po_diff


if __name__ == "__main__":
    import sys
    import os.path

    # Usage
    if len(sys.argv) != 3 \
      or not os.path.isfile(sys.argv[1]) \
      or not os.path.isfile(sys.argv[2]):
        print "Usage: %s orig.po review.po" % sys.argv[0]
        sys.exit(1)

    # Retrieve diff
    path_po_orig = sys.argv[1]
    path_po_review = sys.argv[2]
    po_diff = podiff(path_po_orig, path_po_review)

    # Print out orig v. review messages
    po = polib.pofile(path_po_orig)
    for entry in po_diff:
        orig_entry = po.find(entry.msgid)
        orig_fuzzy = review_fuzzy = "fuzzy"
        if "fuzzy" not in orig_entry.flags:
            orig_fuzzy = "not fuzzy"
        if "fuzzy" not in entry.flags:
            review_fuzzy = "not fuzzy"
        print "'%s' was %s is %s\n\tOriginal => '%s'\n\tReviewed => '%s'\n" % (entry.msgid, orig_fuzzy, review_fuzzy, orig_entry.msgstr, entry.msgstr)

Example:
$ ./podiff.py fr.po fr.review.po
'Menu|Bisect' was fuzzy is not fuzzy
 Original => 'Différences détaillées'
 Reviewed => 'Menu|Couper en deux'

'Bisect' was not fuzzy is not fuzzy
 Original => ''
 Reviewed => 'Couper en deux'
[...]

2010-09-06

Benchmarking Compression Tools

Comparison of several compression tools: lzop, gzip, bzip2, 7zip, and xz.
  • Lzop: small and very fast yet good compression.
  • Gzip: fast and good compression.
  • Bzip2: slow for both compression and decompression although very good compression.
  • 7-Zip: LZMA algorithm, slower than Bzip2 for compression but very good compression.
  • Xz: LZMA2, evolution of LZMA algorithm.

Preparation

  • Be skeptic about compression tools and wanna promote the compression tool
  • Compare quickly old and new compression tools and find interesting results
So much for the spirit, what you really need is to write a script (Bash, Ruby, Perl, anything will do) because you will want to generate the benchmark data automatically. I picked up Ruby as it's nowadays the language of my choice when it comes to any kind of Shell-like scripts. By choosing Ruby I have a large panel of classes to process benchmarking data, for instance I have a Benchmark class (wonderful), I have a CSV class (awfully documented, redundant), and I have a zillion of Gems for any kind of tasks I would need to do (although I always avoid them).

I first focused on retrieving the data I was interested into (memory, cpu time and file size) and saving it in the CSV format. By doing so I am able to produce charts easily with existing applications, and I was thinking maybe it was possible to use GoogleCL to generate charts from the command line with Google Docs but it isn't supported (maybe it will maybe it won't, it's up to gdata-python-client). However there is an actual Google tool to generate charts, it is the Google Chart API that works by providing a URI to get an image. The Google Image Chart Editor website helps you to generate the chart you want in a friendly WYSIWYG mode, after that it is just a matter of computing the data into shape for the URI. But well while focusing on the charts I found the Ruby Gem googlecharts that makes it friendly to pass the data and save the image.

Ruby Script

The Ruby script needs the following:
  • It was written with Ruby 1.9
  • Linux/Procfs for reading the status of processes
  • Googlecharts: gem install googlecharts
  • ImageMagick for the command line tool convert (optional)
The Ruby script takes a path as argument, with which it creates a tarball inside a tmpfs directory in order to avoid I/O latencies from a hard-drive. Next it runs a number of commands over the tarball from which it collects benchmark data. The benchmark data is then saved inside CSV files that are reusable within spreadsheet applications. The data is also reused to retrieve charts from the Google Chart API and finally it calls the ImageMagick tool ''convert'' to collect the charts inside a single image. The summary displayed on the standard output is also saved inside a text file.

The script is a bit long for being pasted here (more or less 300 lines) so you can download it from my workstation. If the link doesn't work make sure the web browser doesn't encode ~ (f.e. to "%257E"), I've seen this happening with Safari (inside my logs)! If really you are out of luck, it is available on Pastebin.

Benchmarks

The benchmarks are available for three kinds of data. Compressed media files, raw media files (image and sound, remember that the compression is lossless), and text files from an open source project.
Media Files
Does it make sense at all to compress already compressed data. Obviously not! Let's take a look at what happens anyway.


As you see, compression tools with focus on speed don't fail, they still do the job quick while gaining a few hundred kilo bytes. However the other tools simply waste a lot of time for no gain at all.

So always make sure to use a backup application without compression over media files or the CPU will be heating up for nothing.
Raw Media Files
Will it make sense to compress raw data? Not really. Here are the results:


There is some gain in the order of mega bytes now, but the process is still the same and for that reason it is simply unadapted. For media files there are existing formats that will compress the data lossless with a higher ratio and a lot faster.

Let's compare lossless compression of a sound file. The initial WAV source file has a size of 44MB and lasts 4m20s. Compressing this file with xz takes about 90s, this is very long while it reduced the size to 36MB. Now if you choose the format FLAC, which is doing lossless compression for audio, you will have a record. The file is compressed in about 5s to a size of 24MB! The good thing about FLAC is that media players will decode it without any CPU cost.

The same happens with images, but I lack knowledge about photo formats so your mileage may vary. Anyway, except the Windows bitmap format, I'm not able to say that you will find images uncompressed just like you won't find videos uncompressed... TIFF or RAW is the format provided by many reflex cameras, it has lossless compression capabilities and contains many information about image colors and so on, this makes it the perfect format for photographers as the photo itself doesn't contain any modifications. You can also choose the PNG format but only for simple images.
Text Files
We get to the point where we can compare interesting results. Here we are compressing data that is the most commonly distributed over the Internet.


Lzop and Gzip perform fast and have a good ratio. Bzip2 has a better ratio, and both LZMA and LZMA2 algorithms even better. We can use an initial archive of 10MB, 100MB, or 400MB, the charts will always look alike the one above. When choosing a compression format it will either be good compression or speed, but it will definitely never ever be both, you must choose between this two constraints.

Conclusion

I never heard about the LZO format until I wanted to write this blog post. It looks like a good choice for end-devices where CPU cost is crucial. The compression will always be extremely fast, even for giga bytes of data, with a fairly good ratio. While Gzip is the most distributed compression format, it works just like Lzop, by focusing by default on speed with good compression. But it can't beat Lzop in speed, even when compressing in level 1 it will be fairly slower in matter of seconds, although it still beats it in the final size. When compressing with Lzop in level 9, the speed is getting ridiculously slow and the final size doesn't beat Gzip with its default level where Gzip is doing the job faster anyway.

Bzip2 is noise between LZMA and Gzip. It is very distributed as default format nowadays because it beats Gzip in term of compression ratio. It is of course slower for compression, but easily spottable is the decompression time, it is the worst amongst all in all cases.

Both LZMA and LZMA2 perform almost with an identical behavior. They are using dynamic memory allocation, unlike the other formats, where the higher the input data the more the memory is allocated. We can see the evolution of LZMA is using less memory but has on the other hand a higher cost on CPU time. And we can see they have excellent decompression time, although Lzop and Gzip have the best scores but then again there can't be excellent compression ratio and compression time. The difference between the compression ratio of the two formats is in the order of hundred of kilo bytes, well after all it is an evolution and not a revolution.

On a last note, I ran the benchmarks on an Intel Atom N270 that has two cores at 1.6GHz but I made sure to run the compression tools with only one core.

A few interesting links: