[Nexus] NeXus - a solution to what is not the real problem ?

Tue Mar 9 16:51:46 GMT 2010

I'll chime in here with my perspective.  I've been using Nexus for 10+ 
years.  I have written acquisition and processing software that has 
generated literally 100s of TB of Nexus files.

It is my opinion that a distinction needs to be made between the Nexus 
API and the Nexus Schema.  The API is a tool to create files.  The 
Schema defines what's in the files and is what is important for 
consistency of data storage and the *possibility* of developing smart 
applications that know how to find data nuggets of interest.  The 
community seems to focus too much on the API and not enough on the 
Schema.  I *do not* need the API to write Nexus Schema compliant files 
and, in fact, the first thing I did when starting to use Nexus was wrap 
the API into something I could better deal with.  I've since thought 
many times of ripping out the Nexus API and writing HDF or XML directly.
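
To illustrate the API/Schema distinction, here is a minimal sketch that writes and reads a schema-shaped file using only the Python standard library, with no NeXus API involved. NXroot, NXentry, NXsample and NXdata are NeXus class names; the element layout, the field names and the helpers build_entry and find_field are assumptions made up for this illustration. The output is not claimed to be valid NeXus XML or readable by the NeXus API.

```python
import xml.etree.ElementTree as ET


def build_entry(sample_name, counts):
    """Build a small NeXus-like tree with plain XML calls (no NeXus API)."""
    root = ET.Element("NXroot")
    entry = ET.SubElement(root, "NXentry", name="entry")
    sample = ET.SubElement(entry, "NXsample", name="sample")
    ET.SubElement(sample, "name").text = sample_name
    data = ET.SubElement(entry, "NXdata", name="data")
    ET.SubElement(data, "counts").text = " ".join(str(c) for c in counts)
    return root


def find_field(root, path):
    """Look a field up by its agreed location, e.g. 'entry/sample/name'."""
    *groups, field = path.split("/")
    node = root
    for group in groups:
        # Groups are identified by their 'name' attribute (assumed layout).
        node = next(child for child in node if child.get("name") == group)
    return node.find(field).text


root = build_entry("sample_001", [3, 1, 4])
xml_text = ET.tostring(root, encoding="unicode")

# Any consumer that knows the agreed layout can find the data again.
reloaded = ET.fromstring(xml_text)
print(find_field(reloaded, "entry/sample/name"))
```

What makes the file useful to a reader is the agreed location of each item, not the library that wrote it.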

For the past 10 years I have been the only developer in my community 
developing tools that generate and consume Nexus files.  The biggest 
adopter of Nexus has been Tomography where I am the primary developer 
responsible for 90% of the applications in the typical work flow.  Nexus 
has worked very well here *for me--the developer*, but less well for the 
scientists.

As a developer, having a standard data format that all processes in the 
work flow can read and write has allowed for a lot of easy integration.  
We carry one meta-data file around with the raw data and use it to store 
a process history of the data as well as to store data for future 
applications.  The acquisition control system creates the meta-data file 
and populates it with lots of interesting information about the sample.  
When it's time to acquire, the meta-data file is read and information 
within used to configure and control acquisition.  The acquisition 
process writes back real world information into the meta-data file for 
the next application to read and use, and so on.  Using this system, we have 
an automated tomography beamline capable of handling ~100 samples per 
day with little to no human interaction.  Using a defined Schema for our 
nexus files allows this to work and demonstrates the power of *any* file 
format that imposes a data dictionary on how and where meta-data is 
stored.  It's not specific to Nexus.
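
The workflow above can be sketched in a format-neutral way. This is a minimal illustration, not the beamline software: the dictionary paths, the field names and the helpers get_item, validate and record_step are all hypothetical, and JSON stands in for whatever file format actually carries the meta-data.

```python
import json

# Hypothetical data dictionary: where each item lives and what type it has.
DATA_DICTIONARY = {
    "sample/name": str,
    "acquisition/exposure_s": float,
    "acquisition/n_projections": int,
}


def get_item(meta, path):
    """Walk a nested dict using a 'group/field' path."""
    node = meta
    for key in path.split("/"):
        node = node[key]
    return node


def validate(meta):
    """Return the dictionary paths that are missing or have the wrong type."""
    problems = []
    for path, expected in DATA_DICTIONARY.items():
        try:
            value = get_item(meta, path)
        except (KeyError, TypeError):
            problems.append(path)
            continue
        if not isinstance(value, expected):
            problems.append(path)
    return problems


def record_step(meta, step, **details):
    """Append one entry to the process history carried with the data."""
    meta.setdefault("history", []).append({"step": step, **details})


# The control system creates the file; acquisition reads it and writes back.
meta = {
    "sample": {"name": "sample_001"},
    "acquisition": {"exposure_s": 0.25, "n_projections": 1500},
}
record_step(meta, "setup")
assert validate(meta) == []
record_step(meta, "acquire",
            projections_written=get_item(meta, "acquisition/n_projections"))

# The file survives a round trip to disk-style text and back.
round_trip = json.loads(json.dumps(meta))
print([h["step"] for h in round_trip["history"]])
```

Each stage only needs the data dictionary to find its inputs, which is what lets the stages be chained without human intervention.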

Now, the problems with the system--which are near and dear to the 
everyday lives of the scientists.

1)  No tools to edit Nexus files.  Downstream applications consume the 
Nexus meta-data file to read in parameters to be used for processing.  
How does the user change those parameters?  Simple answer:  can't!  To 
my knowledge there is no generic Nexus editor that lets me open a Nexus 
file and modify the contents of a group or field.  And yes, I've tried 
XML as my data format and ran into too many issues.  The final deal 
breaker was the inability of the Nexus API to read back an XML format 
file that the Nexus API itself generated!  To be fair, this was a year 
or so ago and I haven't tried again since--but I wasted a lot of effort 
trying to switch to XML back then, why should I give it a try again?  
I've already had to implement another solution.

2)  Scientists do not want to learn an API.  The most basic computer 
programming class teaches the equivalent of fread/fwrite.  This is what 
scientists know how to do.  For example, I have one user of Nexus files 
who was clever enough to probe the raw file to sort out where the data 
resides.  He uses fread to open an HDF file and pull out the data!  When 
someone adds a new group or field and the data moves, he knows how to 
open the raw file and look for where the raw data got moved to.  This is 
standard operation for this scientist!  As a developer, it's easy to 
stand on my podium and proclaim he's "doing it wrong" and circumventing 
the advantages of such a well thought out system as Nexus.  But, the 
fact is that Nexus has been a common buzz word here at the APS since the 
day I started.  *Every* tool I have developed can read and write Nexus 
files and I have personally spent numerous hours lobbying scientists to 
adopt the format.  The fact is scientists I routinely deal with continue 
to work around learning Nexus and continue to actively revolt against 
its proliferation--despite my efforts to convince them otherwise!  This 
in and of itself points to a problem with Nexus either in implementation 
or philosophy.
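
The fread habit described in point 2 can be made concrete with a toy example. This sketch uses a made-up container (a length-prefixed JSON index followed by raw float64 values); it is not HDF and not the NeXus API, and pack, read_by_name and read_by_offset are hypothetical helpers. It only shows why a hand-probed byte offset stops working when a field is added, while a lookup by name keeps working.

```python
import json
import struct


def pack(fields):
    """Toy container: 4-byte index length, JSON index, raw float64 payloads."""
    index, payload = {}, b""
    for name, values in fields.items():
        index[name] = {"offset": len(payload), "count": len(values)}
        payload += struct.pack(f"<{len(values)}d", *values)
    header = json.dumps(index).encode("ascii")
    return struct.pack("<I", len(header)) + header + payload


def read_by_name(blob, name):
    """Locate a field through the index, wherever it sits in the file."""
    (header_len,) = struct.unpack_from("<I", blob, 0)
    index = json.loads(blob[4:4 + header_len])
    entry = index[name]
    start = 4 + header_len + entry["offset"]
    return list(struct.unpack_from(f"<{entry['count']}d", blob, start))


def read_by_offset(blob, offset, count):
    """fread-style access: the caller must already know the byte position."""
    return list(struct.unpack_from(f"<{count}d", blob, offset))


old = pack({"image": [1.0, 2.0, 3.0]})
new = pack({"temperature": [295.0], "image": [1.0, 2.0, 3.0]})

# Offset worked out once by inspecting the old file by hand.
(old_header_len,) = struct.unpack_from("<I", old, 0)
probed_offset = 4 + old_header_len

print(read_by_offset(old, probed_offset, 3))  # matches the old layout
print(read_by_name(new, "image"))             # survives the added field
# The stale offset now lands inside the index text, not on the image data.
print(read_by_offset(new, probed_offset, 3) == [1.0, 2.0, 3.0])
```

The by-name reader is only a few lines longer than the fread-style one, but it requires knowing that an index exists, which is exactly the learning step the scientists decline to take.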

3) Where are the tools?  Since our scientists don't buy into Nexus, no 
tools--aside from the ones I myself write--are being developed to 
leverage Nexus.  I've pointed people to the website and the API.  I've 
gone so far as to offer my C++ wrapper code to users to provide a 
relatively trivial way to interface to Nexus files but they refuse.  
I've even spent my own time writing importers for certain applications 
and, if given the opportunity, the scientists invariably pick raw binary 
over Nexus.  In the end, the most used Nexus tools are the ones that 
convert data out of Nexus format.  The Tomography PI went so far as to 
write his own tool--using raw HDF calls no less--to convert data out of 
Nexus into binary so he can give usable data to his end-users who have 
no way of handling Nexus files.

I like the HDF format.  I've been waiting a long time for a usable Nexus 
data dictionary to standardize the meta-data schema.  I've spent many 
many development hours making sure my applications support Nexus 
wherever it makes sense to do so.  I think of myself as a supporter of 
Nexus.  And yet I've spent the last three months backing away from Nexus 
for the one group that has been writing Nexus files for years because 
they've finally given up on the ability of the end-user to deal with the 
Nexus format.

In the end, people need to choose to collaborate for collaborations to be 
successful and what I see with the people I support is a group refusing 
to collaborate with Nexus.

Brian

On 3/9/2010 8:39 AM, Pete Jemian wrote:
>
> Joachim:
>
> You've almost got hold of the point, just a couple more steps (IMHO).  
> Careful retention of the raw data is desired by so many scientists 
> (experimentalists) that they become uncomfortable if that information 
> is not retained.  That must have been an early driving force for 
> NeXus.  Practical experience shows that the real common denominator 
> for data analysis is data which has been reduced to some common form 
> (common, as decided by the science underlying that data analysis).  So 
> where you introduce yet another data format (with good aims for sure), 
> does it progress towards the goal that the new format will be adopted 
> by more than one facility?  Be careful there.  No such thing as 
> temporary software.  Recently, there was a workshop at the ESRF to 
> discuss the suitability of HDF as a common underlying file format for 
> multispectral data.  At this workshop, some raised points similar to 
> yours.  I'm sure you could get a copy of the workshop summary from V. 
> Armando Solé <sole at esrf.fr> (it does not appear to be easily found by 
> Google today).
>
>     http://www.esrf.eu/events/conferences/hdf5/workshop-agenda
>
> Recently, NeXus has begun to broaden its view of data from raw data to 
> the description of processed data such as the reduced data for a 
> specific technique.  One of the barriers has been documenting what 
> should be in such files.  The NIAC is just about ready to introduce 
> the NeXus Definition Language (that has been engineered by instrument 
> responsibles) to document what should be needed for a specific 
> technique such as powder diffraction or SAS.  Yours truly has been 
> working on the manual to help those new to NeXus learn how to use this 
> resource.  It's not a requirement to use a NXDL specification when 
> writing data but it can help to codify what some analysis program or 
> scientific technique might require for processing.  Here are links to 
> the draft manual in PDF and HTML forms:
>
>     http://download.nexusformat.org/doc/NeXusManual.pdf
>     http://download.nexusformat.org/doc/NeXusManual.html
>
> Another response by the NIAC to the community's expressed desire for 
> human readable data is XML as an alternative (to HDF) for the 
> underlying file format.  The current NeXus API now has support for 
> writing and reading a "NeXus" file in XML.  For some, this is great 
> news since data sets such as 1-D SAS are easily expressed by a few 
> columns of numbers and rarely go beyond a few hundred, let alone a few 
> thousand rows.  Other techniques, such as 2-D SAS or even to an 
> extreme, tomography and protein crystallography, cannot suffer the 
> performance penalties of being written in a TEXT (ASCII or utf-8) file 
> such as XML.  For these, HDF is the common best choice of many.
>
> So, my summary is thus:
>  * The NIAC has been listening and is trying to meet the community needs.
>  * I believe you are describing the need to communicate not raw data but
>     processed data as input for common analysis routines.
>  * Other techniques than yours have also expressed this need.
>  * NeXus is capable of handling these needs.
>  * NeXus is a 100% volunteer effort and is always looking for more
>     helpers.
>
> I welcome your input here.
>
> Regards,
>    Pete
>
>
>
> On 3/9/10 7:37 AM, Wuttke, Joachim wrote:
>> Dear colleagues,
>>
>> I am currently preparing a deliberately provocative memo with
>> working title »Why don't we have better data processing software
>> for quasielastic neutron scattering ?«. One section in this paper
>> will deal with data storage, and in its present form, it is quite an
>> attack on NeXus. To play fair, I post it here, looking forward to
>> your comments. Maybe you will convince me that I am mistaken.
>>
>> Looking forward to a sound discussion - Joachim
>>
>>
>> Though all raw data produced by QENS instruments have basically the
>> same structure, many different storage formats are in use.
>> Therefore, porting data processing software from one instrument
>> to another is generally not possible without
>> adapting at least a read-in routine or providing a raw-data 
>> conversion tool.
>> This is a severe nuisance for users,
>> and an obstacle for code sharing and collaborative software development.
>> For these reasons,
>> it is a popular idea that efforts to improve the software environment
>> should start with the adoption of a \textsl{common raw data format} ---
>> I shall call this strategy \textsl{data format first}.
>>
>> The common raw data format of our time will be NeXus, if any.
>> Under development for more than 15 years,
>> NeXus~\cite{qda3} addresses neutron as well as X-ray scattering.
>> It enjoys strong political backing,
>> as evidenced by an International Advisory Committee
>> with delegates from all major facilities.
>> A growing number of new spectrometers actually use NeXus,
>> be it by choice or forced by site policy;
>> on the other hand, so far only a few existing instruments have migrated.
>>
>> When writing the instrument software for SPHERES,
>> I consciously opted against NeXus,
>> in favor of a less rigid self-defined format
>> that is easier to read by a human,
>> thereby facilitating the debugging of data acquisition and
>> raw data processing software.
>> Maybe my wishes could have been accommodated within NeXus,
>> had I communicated more intensely with the project team.
>> However, I have more fundamental objections ---
>> not against NeXus itself,
>> but against unrealistic promises,
>> against overestimating data formats,
>> against the flawed strategy \textsl{data format first}.
>>
>> Unifying data formats reminds me of church history:
>> attempts to (re)unify $n$ different denominations regularly
>> result in $n+1$ denominations being around:
>> the new, unified church, plus all the groups that split off
>> to preserve the good old faith of their own.
>> When migrating an existing spectrometer towards NeXus,
>> the instrument scientist needs either to support for a long time
>> read-in routines for both the old and the new data format,
>> or to provide routines that achieve lossless conversion from the old
>> into the new format.
>> Choosing NeXus as raw data format is not sufficient to guarantee
>> that data from different instruments can be read by the same software.
>> For instance, at SPHERES,
>> energy calibration is done at acquisition time,
>> and energy transfers $\hbar\omega$ are part of the raw data set.
>> At the ILL backscattering spectrometers,
>> only a few hardware parameters are stored from which
>> the downstream software must construct the energy scale.
>> Translating the current output format into something looking like NeXus
>> would not make the raw data files mutually legible.
>> Unifying raw data formats is not possible without unifying
>> data acquisition programs ---
>> which will be rarely feasible
>> because in most cases the hardware is too different.
>>
>> Some time ago,
>> NeXus may have been attractive for developers
>> because its rich application programming interface (API)
>> relieved them from implementing write-out and read-in routines.
>> However, this advantage has vanished because
>> modern generic data formats like YAML \cite{qda5}
>> allow one to store and retrieve
>> complex data, composed of scalars, hashes, arrays
>> in arbitrary tree-like structures,
>> at zero cost through a much simpler API.
>>
>> Most fundamentally,
>> I think that efforts to unify the raw data format
>> are addressing the wrong interface:
>> most users do not want to see raw data at all.
>> What users want is a calibrated, normalized, reasonably binned
>> scattering law $S(q,\omega)$.
>> What should be standardized is the procedure to obtain such 
>> $S(q,\omega)$.
>> While most of this procedure can be implemented in quite a generic way,
>> it will remain the instrument scientist's responsibility
>> to plug in a low-level routine that reads in and calibrates the
>> raw data from his instrument.
>> Only he has the technical knowledge required to do it correctly,
>> and hardly anybody else needs to care about the raw data and their 
>> format.
>>