Release of NeXus v1.3.0 with data compression

Brian Tieman tieman at aps.anl.gov
Thu Dec 16 03:55:19 GMT 1999


First off--Thanks for implementing this feature so quickly!

I took some time to test the various data compression schemes on some typical
data we collect here.  The preliminary results were both very exciting and
disappointing.  I haven't been able to test an active acquisition system yet, so
I don't yet know how disappointing the results really are.  Below is a
description of what I found, if anyone cares to read it.

The first set of data I tested with is from a high speed acquisition system
(768x480 pixel images at >4 Hz).  Due to the high speed, we can acquire
massive amounts of data in a few minutes.  This data not only needs to be stored
on the acquisition machine during acquisition, but needs to be moved over the
net to another machine for detailed analysis.  Data compression can help with
both these issues.

The data itself is mostly noise with a single round or donut shaped peak
somewhere in the image.  The noise is such that neighboring pixels are almost
always of a different value, but the range of values in the noise is very
small--4-5 counts at most.  The data is consistent enough that one would expect
nearly identical compression ratios for all images.

The second set of data is much harder to characterize.  Images can literally be
anything.  The data may be very uniform over broad areas, or it may be a mess.
The images may also come in very different sizes.  The data rate is
considerably smaller than the first data set, but the volume for a complete set
is still very large (~1GB and expected to grow).  The data I selected was from a
recent run which used a 512x512 pixel/frame image with a pretty low level of
uniformity.

The two sets of data are from two different projects with differing data rates
but similar data volumes.  To perform the test, I wrote a simple program to read
in uncompressed files and write out compressed ones.  To test compression
ratios, I compressed a large number (200) of different files and compared file
sizes.  To test overhead, I picked a typical file and compressed it 100 times in
a row to different file names (to help eliminate caching effects).  The time I
report is the total time to compress all 100 files.
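
In case anyone wants to repeat this, here is roughly the kind of
compress-and-copy loop the test program runs, written against the core C API
described in Ray's announcement below.  It is only a sketch: the group name
"entry1", the data name "image", the image dimensions, and the NX_INT32 data
type are placeholders for our local files, not anything defined by NeXus, and
error checking has been stripped out to keep it short.

    /* Sketch of a compress-and-copy test using the NeXus core C API (napi.h).
     * Names, dimensions, and data type below are placeholders for our files. */
    #include <stdio.h>
    #include "napi.h"

    #define ROWS 480
    #define COLS 768

    int main(int argc, char *argv[])
    {
        NXhandle in, out;
        static int image[ROWS][COLS];           /* one 768x480 frame */
        int rank = 2, dims[2] = {ROWS, COLS};

        if (argc < 3) {
            printf("usage: %s infile outfile\n", argv[0]);
            return 1;
        }

        /* read one uncompressed frame */
        NXopen(argv[1], NXACC_READ, &in);
        NXopengroup(in, "entry1", "NXentry");   /* placeholder group name */
        NXopendata(in, "image");                /* placeholder data name  */
        NXgetdata(in, image);
        NXclosedata(in);
        NXclose(&in);

        /* write it back out, compressed with Skipping Huffman */
        NXopen(argv[2], NXACC_CREATE, &out);
        NXmakegroup(out, "entry1", "NXentry");
        NXopengroup(out, "entry1", "NXentry");
        NXmakedata(out, "image", NX_INT32, rank, dims);
        NXopendata(out, "image");
        NXcompress(out, NX_COMP_HUF);           /* between NXopendata and
                                                   NXputdata, as described in
                                                   the announcement */
        NXputdata(out, image);
        NXclosedata(out);
        NXclosegroup(out);
        NXclose(&out);

        return 0;
    }

Swapping algorithms is just a matter of changing the NX_COMP_HUF constant to
NX_COMP_LZW or NX_COMP_RLE and recompiling.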

Here is a chart of my results:

Set1
Compression Type   |   Time for 100 files   |   % of original file size
None               |           4 s          |           1.00
LZW                |          35 s          |           0.18
HUF                |          17 s          |           0.19
RLE                |          33 s          |           0.50

Set2
Compression Type   |   Time for 100 files   |   % of original file size
None               |           6 s          |           1.00
LZW                |          39 s          |           0.51
HUF                |          42 s          |           0.52
RLE                |          12 s          |           1.01

It's interesting to note that in the second data set the RLE algorithm actually
expanded the data rather than compressing it.  This helps illustrate why one
needs to be careful about which algorithm is used--it can help tremendously to
know what the data will look like in advance!

Looking at Set1, the HUF algorithm looks ideal.  It's fast and makes tiny files
for that type of data.  Just compressing the data, I could get slightly better
than 5 Hz (100 files in 17 s works out to nearly 6 frames per second).  That
rate will of course go down in the actual acquisition program, but I know that
our data rate there is mostly disk bound.  In any event, I think the benefits to
network traffic and data storage will probably outweigh the added overhead--even
if it means losing a few Hz in the data rate.

The other type of data we acquire is a more open question.  We don't want to
burden our users with knowing that there even is a compression scheme--let alone
which one will work best for them.  We'll need more experience to see how useful
this will be in that project.
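
If compression does prove worthwhile there, one option might be a small
site-local wrapper that always applies a sensible default before writing, so
users never see the choice at all.  The helper below is purely hypothetical--the
function name and the LZW default are my own invention, not part of the NeXus
API--but it shows how little would be involved.

    /* Hypothetical site wrapper: hides the compression choice from users by
     * always calling NXcompress with a facility-wide default before NXputdata.
     * The name and the default are invented here, not part of the NeXus API. */
    #include "napi.h"

    static int default_compression = NX_COMP_LZW;   /* assumed site default */

    NXstatus site_putdata(NXhandle file, void *data)
    {
        /* called in place of NXputdata, after NXopendata as usual */
        if (NXcompress(file, default_compression) != NX_OK) {
            return NX_ERROR;
        }
        return NXputdata(file, data);
    }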

Hope this is of some help to others.

Brian Tieman
tieman at aps.anl.gov

Ray Osborn wrote:

> Following popular demand, Mark Koennecke has updated the NeXus API to
> include data compression.  Users can choose to compress using either LZW
> (gzip), Skipping Huffman, or Run-Length Encoding algorithms.
>
> In the core API, compression is invoked by a call to NXcompress between
> calls to NXopendata and NXputdata or NXputslab.  Data is automatically
> compressed in NXUwritedata (currently only a part of the F90 Utility API) if
> there has been a call to NXUsetcompress, which defines the compression
> algorithm and the minimum size of data set to be compressed (it makes no
> sense to compress very small arrays).  See the web pages at
> <http://www.neutron.anl.gov/NeXus/> for more details.
>
> One of the beauties of using HDF to do the data compression is that it
> automatically decompresses the data when you read them back.  There is no
> need therefore to rewrite any data input routines.
>
> The scale of compression you get will depend on the type of data you have,
> but if the NeXus file is dominated by a few large data sets, you can get an
> idea of what you should achieve by gzipping the whole file.
>
> The current version of the API is v1.3.0.  Please send messages to this list
> if you have any useful experience (positive or negative) to report, and
> definitely report any bugs.
>
> Ray Osborn
>
> P.S. I have been unable to update all of the NeXus web pages on our NT
> server here at Argonne for a currently unknown reason.  We are trying to
> sort out the problem now, but in the meantime, use the European mirror until
> the home page shows today's date at the bottom.
>
> --
> Dr Ray Osborn                Tel: +1 (630) 252-9011
> Materials Science Division   Fax: +1 (630) 252-7777
> Argonne National Laboratory  E-mail: ROsborn at anl.gov
> Argonne, IL 60439-4845



