[Nexus-developers] NXconvert-NXtranslate

Tue Nov 18 19:13:32 GMT 2003

I'm delighted by this debate, because I think a robust translation utility
will make NeXus much more attractive for everyone.  I'm already itching to
write a CCD translation file using one of these schemes.  I have read Mark
and Peter's two proposals, which seem to be converging.  However, I want to
give some reasons why I favour Peter's scheme, particularly because it
provides a coherent architecture for handling live data, something we lack
so far. 

Both schemes now use a single XML translation file, based on the NeXus
metaDTD format, to define how to copy the existing source data to a target
file in a new NeXus format.  However, Mark is suggesting that we make the
translation file by replacing the data of a NeXus XML file with scripting
commands while Peter is suggesting that we add attributes to the data tag
that point to the missing data.  Although this may seem a fairly minor
difference, I think it profoundly affects the versatility and ease of use of
this translation process.

Let's take an example in both schemes (eliminating unnecessary attributes):

In Mark's scheme:

<script>
  source nxsupport.tcl
  set inputFile [nx_open [lindex $argv 4] $NXACC_READ]
</script>
<NXentry>
  <title>copyFromNexus $inputFile /entry/title</title>
  <name>writeText SEPD</name>
  <distance>
      writeFloat [expr [getFromNexus $inputFile \
              /entry/sample/distance] * -1]
  </distance>
  <counts type="NX_FLOAT" signal="1" units="">
    copyFromNexus $inputFile /entry/detector1/counts
  </counts>
</NXentry>

In Peter's scheme:

<NXentry source="file.dat" mime_type="NeXus">
  <title tag="/entry/title" />
  <name>SEPD</name>
  <distance tag="/entry/sample/distance" />
  <counts source="livedata" tag="raw_counts" mime_type="IPNSlive" />
</NXentry>

1) Ease of writing translation files

I believe that the general user will find it much easier to write Peter's
translation files.  Of course, someone needs to do the hard work of
providing the data reading libraries, and the various wrappers that will
interface those libraries to the translation utility.  The complexity will
be similar in both schemes; you have to write a set of C-wrappers, either to
interface the scripting language (Mark's) or the translation utility
(Peter's) with the source library.  However, it only needs to be done once
for each source library.

Now, if the user wants to produce their own NeXus file using these
libraries, we need to make it easy for them to customize the translation
file.  Mark's scheme requires that a user learn a particular scripting
language and all the scripts that are written in it, e.g. copyFromNexus.
There would have to be a new set of scripts written for every type of source
file, with their own sets of arguments.

In Peter's scheme, that complexity is hidden from the user.  All they need
to know is the file name and a set of tags that point to the data, which, in
NeXus files, are the paths to the data items.  No programming is required -
just some documentation stating what the source data items are called by the
source library.
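
To make this concrete, here is a minimal sketch (in Python, purely for
illustration) of how a translation utility might resolve the "tag"
attributes in a file like the one above.  Everything in it is an
assumption on my part: the READERS registry, the read_nexus stand-in, the
FAKE_FILES dictionary and the resolve function are hypothetical names, and
a real utility would call the wrapped source libraries instead of an
in-memory dictionary.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a source file: maps a tag (here a NeXus path)
# to a value.  A real reader would open the file through its library.
FAKE_FILES = {
    "file.dat": {
        "/entry/title": "Test run",
        "/entry/sample/distance": "-10.5",
    },
}


def read_nexus(source, tag):
    """Hypothetical reader: return the item called `tag` in `source`."""
    return FAKE_FILES[source][tag]


# Hypothetical registry: mime_type -> reader function(source, tag).
READERS = {"NeXus": read_nexus}


def resolve(element, source=None, mime_type=None):
    """Replace every 'tag' pointer under `element` with the data it names."""
    # source and mime_type are inherited from the enclosing element
    # unless the element overrides them.
    source = element.get("source", source)
    mime_type = element.get("mime_type", mime_type)
    tag = element.attrib.pop("tag", None)
    if tag is not None:
        element.text = READERS[mime_type](source, tag)
    # Elements without a 'tag' (e.g. <name>SEPD</name>) are left untouched.
    # The source and mime_type attributes are also left in place here.
    for child in element:
        resolve(child, source, mime_type)
    return element


TRANSLATION = """
<NXentry source="file.dat" mime_type="NeXus">
  <title tag="/entry/title" />
  <name>SEPD</name>
  <distance tag="/entry/sample/distance" />
</NXentry>
"""

entry = resolve(ET.fromstring(TRANSLATION.strip()))
print(ET.tostring(entry, encoding="unicode"))
```

The point is that the person writing the translation file only touches the
XML; the per-format code lives in the reader and is written once.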

2) Versatility

In the example above, the "name" tag had the value SEPD.  Assume that this
was not in the source file but needs to be added to the target file.  This
is easy to do in Peter's scheme.  Missing data are just put in as they would
appear in the final XML version of the data file.  In Mark's scheme, we have
to use a script command (writeText SEPD) for every data item because every
tag would have to be parsed for a possible scripting command.  This also has
implications for using a NeXus file to access live data.

It is often the case that an instrument scientist would like to treat
archived data and live data with the same software.  Before a run starts, it
is often possible to construct the entire NeXus file, but with the data
itself missing.  Once the run starts, it would be nice to use that file for
all the meta-information, but have a way to access the live data as well.
Peter's scheme allows this.  In this scenario, the translation file is just
a regular NeXus file; it can even be a binary HDF file.  However, the data
tag, instead of containing data, contains attributes that point to the data
and the library used to read it.

This is not possible in Mark's scheme because every data item has to contain
a scripting command.  I don't think there is any way to parse a data item
and automatically tell whether it contains data or scripts, unless every
script is enclosed by some special delimiters.  That would make the files
look even more complex, and we would have to ensure that the delimiters
never appear in real data.
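
The ambiguity can be illustrated with a small Python sketch.  The "[[" and
"]]" delimiters and both function names are hypothetical; neither proposal
defines them.  The sketch only shows that a test on the content of a data
item can be fooled by real data, whereas a test on an attribute cannot.

```python
import xml.etree.ElementTree as ET

# Hypothetical script delimiters, invented for this illustration.
OPEN, CLOSE = "[[", "]]"


def is_script_by_delimiter(element):
    """Guess from the content whether a data item holds a script."""
    text = (element.text or "").strip()
    return text.startswith(OPEN) and text.endswith(CLOSE)


def is_pointer_by_attribute(element):
    """Decide from the attributes whether a data item points elsewhere."""
    return "tag" in element.attrib


script_item = ET.fromstring(
    "<title>[[copyFromNexus $inputFile /entry/title]]</title>"
)
# Real data that happens to contain the delimiters.
literal_item = ET.fromstring("<comment>[[see logbook page 12]]</comment>")
pointer_item = ET.fromstring('<title tag="/entry/title" />')

print(is_script_by_delimiter(script_item))    # a genuine script
print(is_script_by_delimiter(literal_item))   # real data, misread as a script
print(is_pointer_by_attribute(pointer_item))  # a genuine pointer
print(is_pointer_by_attribute(literal_item))  # real data, correctly left alone
```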

Incidentally, Peter's method also becomes a scheme for referencing external
data within a NeXus file.  I'm not a great fan of doing this, but I could
see that it might be nice to have a link to some publicly accessible
document as part of the file's self-documentation.

e.g.

<NXinstrument name="lrmecs">
   <picture source="http://www.pns.anl.gov/lrmecs/lrmecs.jpg"
            mime_type="image/jpeg" />
...
</NXinstrument> 

Of course, we would have to put something in the API to handle data requests
for such external links gracefully, but that's something for the future.

Eventually, I think we should extend the API to have a thin GetData layer
that interfaces to external libraries, or returns sensible error messages if
the libraries are not available.  Open Genie has a generalized data access
library; something similar could be adapted for NeXus.  Initially, it would
be part of the translation utility, but eventually, it could be part of the
standard API.
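
As a rough idea of what I mean by a thin GetData layer, here is a sketch in
Python.  It is not a proposal for the actual API: get_data,
register_reader, ReaderUnavailable and the fake live-data reader are all
hypothetical names, and the counts it returns are made up so that the
example runs on its own.

```python
class ReaderUnavailable(Exception):
    """Raised when no library is available for the requested mime_type."""


# Hypothetical registry: mime_type -> reader function(source, tag).
_READERS = {}


def register_reader(mime_type, reader):
    """Make an external library's reader available to get_data."""
    _READERS[mime_type] = reader


def get_data(source, tag, mime_type):
    """Fetch one data item, or fail with a message that says why."""
    reader = _READERS.get(mime_type)
    if reader is None:
        raise ReaderUnavailable(
            f"no reader available for mime_type {mime_type!r}; "
            f"cannot fetch {tag!r} from {source!r}"
        )
    return reader(source, tag)


def fake_live_reader(source, tag):
    """Stand-in for a live-data library; returns made-up counts."""
    return [0, 3, 7]


register_reader("IPNSlive", fake_live_reader)

# A data item whose library is installed.
print(get_data("livedata", "raw_counts", "IPNSlive"))

# A data item whose library is missing: the caller gets a clear message
# rather than a crash somewhere inside the translation utility.
try:
    get_data("http://www.pns.anl.gov/lrmecs/lrmecs.jpg", None, "image/jpeg")
except ReaderUnavailable as err:
    print("skipped:", err)
```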

Anyone else have any thoughts?

Ray Osborn