[NeXus-committee] Example of links

Caswell, Thomas tcaswell at bnl.gov
Mon Feb 3 18:55:35 GMT 2025


A couple of quick thoughts:


  * Handling hdf5 external links could presumably be done with an extra level of indirection (use @target to look at an internal dataset and have that be the actual external link).
  * The internal links can be extended to handle external links in a straightforward way by adding a URI scheme (something like nexus_external:// ) to the spec, with the understanding that if the scheme is missing it means THIS FILE (well, "this nexus tree" to be pedantic), as is standard for URIs. All existing files are still valid and new files grow the ability to target external files.
  * I don't fully grasp why we care where the data "really" is. The important thing is that if you start from the top of (one of) the defined trees you can walk down and find what you need with the names you expect. It should not matter to a consumer whether that is implemented as a combination of soft/hard/external links, as the only copy of the data, or via a database that fabricates the structure on demand.
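
As a rough illustration of the second point, a reader could split a target string as sketched below. The scheme name nexus_external comes from the suggestion above; the use of '#' to separate the file name from the path inside that file, and the function name parse_target, are assumptions made only for this example and are not part of any NeXus spec.

```python
# Hypothetical scheme name from the suggestion above; not part of NeXus.
EXTERNAL_SCHEME = "nexus_external"


def parse_target(target):
    """Split a target string into (filename, path).

    filename is None when no scheme is present, meaning "this file".
    The '#' separator between file name and internal path is an assumption.
    """
    scheme, sep, rest = target.partition("://")
    if not sep:
        # No scheme: an ordinary internal target, as in all existing files.
        return None, target
    if scheme != EXTERNAL_SCHEME:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    filename, _, path = rest.partition("#")
    if not filename or not path.startswith("/"):
        raise ValueError(f"malformed external target: {target!r}")
    return filename, path


internal = parse_target("/entry/instrument/detector/polar_angle")
external = parse_target("nexus_external://raw/scan_0001.h5#/entry/data/data")
```

Existing targets pass through unchanged, which is what keeps old files valid under this kind of extension.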


Tom



Thomas A Caswell (he,him,his)
Data Acquisition and Detectors Deputy Group Leader

Data Science and Systems Integration Program

National Synchrotron Light Source II
Brookhaven National Laboratory
Office: 631.344.3146
tcaswell at bnl.gov



________________________________
From: NeXus-committee <nexus-committee-bounces at shadow.nd.rl.ac.uk> on behalf of Raymond Osborn via NeXus-committee <nexus-committee at shadow.nd.rl.ac.uk>
Sent: Monday, February 3, 2025 12:45
To: Brockhauser Sandor <sandor.brockhauser at physik.hu-berlin.de>
Cc: NeXus Committee <nexus-committee at nexusformat.org>; Lukas Pielsticker <lukas.pielsticker at physik.hu-berlin.de>
Subject: Re: [NeXus-committee] Example of links

Hi Sandor,
Thanks for taking the time to summarize your thinking. It is good to know that your reasoning is similar to numerous discussions we had 20 or more years ago, which was then incorporated by Mark Koennecke into the NAPI (RIP) and then by Paul Kienzle and I into the Python API. Now that NIAC doesn’t support an API, I think it does make the documentation of links much more important than it was when we provided the APIs, because it is very easy to make mistakes when writing directly using the HDF5 libraries.

Here are a few comments on your notes.

On Jan 31, 2025, at 10:58 AM, Brockhauser Sandor <sandor.brockhauser at physik.hu-berlin.de> wrote:

In fact, it does not know anymore, if /g3/g32 was supposed to point to /g1/g12 (e.g. nice_instrument/nice_detector) and not to /g2/g22 (e.g. bad_instrument/bad_detector), because it does not point to a path(!), but to the physical object.

This is a big difference between hard links and soft links in hdf5! In case of a soft link, the link is actually a path and it is resolved in runtime. Just like linux symbolic links, these can be broken and can point to different things if the targeted object is changed or replaced.
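
The Linux analogy can be tried directly on a file system. The sketch below is only an analogy (it uses ordinary files, not HDF5 objects, and assumes a platform that supports both hard and symbolic links); the file names are made up, with the contents borrowed from the nice_detector/bad_detector example above.

```python
import os
import tempfile


def replace_target_demo():
    """Return what a hard link and a soft link see after the target is replaced."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "g12")
        hard = os.path.join(d, "hard_g32")
        soft = os.path.join(d, "soft_g32")
        with open(target, "w") as f:
            f.write("nice_detector")
        os.link(target, hard)     # hard link: bound to the physical object
        os.symlink(target, soft)  # soft link: stores the path only

        # Put a different object at the same path.
        replacement = os.path.join(d, "replacement")
        with open(replacement, "w") as f:
            f.write("bad_detector")
        os.replace(replacement, target)

        with open(hard) as f:
            hard_content = f.read()
        with open(soft) as f:
            soft_content = f.read()
    return hard_content, soft_content


hard_content, soft_content = replace_target_demo()
# hard_content is "nice_detector": the hard link still reaches the original object.
# soft_content is "bad_detector": the soft link follows the path to the new object.
```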

I agree that the HDF5 object pointed to by a soft link could in principle be replaced by


Additionally, the so called external links can even point you to a path in a different file. Obviously, if you change the content of this file, such links can easily point to a different physical object.

We have to have a completely separate discussion about external links. They are not the same as internal links. For example, you cannot add a ’target’ attribute to an external link because the physical object is in a different file with a completely different tree. If it was added to the external file object, then the path would apply to the external file, not the local file, and would, indeed, be meaningless in the local file, which has no way of knowing the tree structure of the external file.

The reason we need the concept of a "target" attribute is so that we can record, for any group or dataset the attribute is attached to, that the object was actually derived from a given location. Please note the difference: we do not assume that the data object here is the same as the referenced one (e.g. the one here may contain only the relevant section of what a monitor measured during the experiment, or the one here may be converted to a different unit compared to the referenced one). This is a big difference compared to a simple hdf5 link (or even a soft link). We argue that in some cases the community using NeXus would like to know where the data originated.
Hence, in addition to the data (which is either a new dataset, a hard/soft/external link, or even a virtual dataset; which one it is is just an hdf5 implementation detail when NeXus is used on top of hdf5), we would like to allow attaching an attribute telling where it comes from.

Although there are reasons to criticize the documentation, this is precisely what is described in https://manual.nexusformat.org/design.html#links. The diagram explicitly shows a link from the two-theta axis in the NXdata group to the same array in the detector group. However, I think your description is clearer, because unfortunately, the text above talks about avoiding replication of the data between the two groups, which may have made people think the link was to save space. That has never been the main reason for needing links.

- @napimount: the doc says that it is a group attribute, but isn't it a linkType attribute? Note that the link provided for further explanation (http://manual.nexusformat.org/_static/NeXusIntern.pdf) is not valid.

The ’napimount’ attribute was a programmatic mechanism for the NAPI to handle external links. It is not part of the standard, and is not used by the Python API at all. You cannot explicitly add a napimount attribute within the local file for the same reason you can’t add a target attribute.

- @target: the doc says that it is added only because of hdf5, but we believe that its usefulness is independent of whether the backend is hdf5 or something else.

The concept of links using the target attribute was introduced when NeXus supported HDF4 and XML files. In fact, if we had only supported HDF5, we could have used soft links instead. So it was limitations particularly in XML that made the target attribute necessary. The documentation must have been written after dropping support for XML and was probably explaining why the attribute was necessary when using hard links.

- in the example @target is added to /entry/data/polar_angle, which corresponds to the explanation, but it is also added to /entry/instrument/detector/polar_angle. It is not explained why it is needed there. Is it because this is not derived from anywhere else? Why isn't it then simply a ".", which is the convention used throughout NeXus?

They are the same object, so of course the target attribute appears in both groups.

In the Python API, when an object has a target attribute, the API checks if the target is the same as the object path. If it’s the same, the dataset is read in as a NXfield object. If it’s different, the dataset is read in as a NXlink object. The NXlink object can still be used as if it were a field (e.g., you can check its dtype or shape), but it is in fact a sub-class of both NXlink and NXfield. Structurally, this works well. The user is alerted to the fact that it is a link, and can recover the link target using the ’nxlink’ attribute, but it can be used in most contexts that a NXfield is used. However, changes directly to the NXlink are forbidden. If the user wants to change a link value, they need to explicitly change the parent’s value (even though they are the same object). Of course, this is the same behavior as soft links.
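
The decision described above can be sketched in a few lines. This is not the actual Python API code; the class names Field and Link and the function read_object are hypothetical stand-ins, and the single-inheritance layout is a simplification.

```python
class Field:
    """Stand-in for a field read from the file (hypothetical class)."""

    def __init__(self, path, value):
        self.path = path
        self._value = value

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        self._value = new_value


class Link(Field):
    """Stand-in for a linked field: usable like a Field, but read-only."""

    def __init__(self, path, value, target):
        super().__init__(path, value)
        self.target = target

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new_value):
        # Changes must be made through the object at self.target instead.
        raise TypeError(f"cannot modify a link; change {self.target} instead")


def read_object(path, value, attrs):
    """Return a Link if the target attribute differs from the path, else a Field."""
    target = attrs.get("target")
    if target is None or target == path:
        return Field(path, value)
    return Link(path, value, target)


source = "/entry/instrument/detector/polar_angle"
parent = read_object(source, [1.0, 2.0], {"target": source})
linked = read_object("/entry/data/polar_angle", [1.0, 2.0], {"target": source})
```

Here parent comes back as a plain Field and linked as a Link, even though both carry the same target attribute.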

- If these two datasets were actually the same physical object (e.g. both occurrences were hdf5 hard links to the same object), this would explain the example, but as pointed out above, we foresee other use cases, too.

I think this could be dangerous. If you make the NXlink object a physically different object to its target, then any assumption that the user might make about their equivalence could be invalid. Whether we use hard or soft links, I believe that the two objects have to be identical.

- according to the documentation @target must be an absolute path (although validTargetName suggests its future extension to relative paths, even including 'parent' relationship although we have seen that in case of hdf5 hard links in the tree, the interpretation of 'parent' can become tricky)

The interpretation of the parent may be tricky in HDF5, but it’s not in the NeXus standard, where it is explicitly defined by the target attribute.

- note that the example at validTargetName is not at all an absolute path as explained at linkType. Instead of an absolute (hdf5) path, a class_path is used here, like "NXentry/NXinstrument/analyser:NXcrystal/ef". This does not point to a unique location in a NeXus file (e.g. if we have 2 entries with their respective instruments and analysers defined), resulting in ambiguity when link targets are resolved.

As you know, the use of classes in validTargetName rather than names is because NeXus allows different names, so a validator would only be able to check that the actual target, which has to use names, contains the right chain of classes.
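
A validator check of this kind might look roughly like the sketch below. The function name, the dictionary of group classes, and the rule that a bare segment starting with "NX" is a class while any other bare segment is a name are all assumptions for illustration; they are not taken from the NXDL documentation.

```python
def matches_class_path(target, class_path, nx_classes):
    """Check that a named target path follows the chain of classes in class_path.

    nx_classes maps absolute group paths to their NX_class values.
    Segments are "NXclass", "name:NXclass", or "name" (assumed convention).
    """
    names = target.strip("/").split("/")
    segments = class_path.strip("/").split("/")
    if len(names) != len(segments):
        return False
    current = ""
    for name, segment in zip(names, segments):
        current += "/" + name
        want_name, sep, want_class = segment.partition(":")
        if not sep:
            if segment.startswith("NX"):
                want_name, want_class = None, segment
            else:
                want_name, want_class = segment, None
        if want_name is not None and name != want_name:
            return False
        if want_class is not None and nx_classes.get(current) != want_class:
            return False
    return True


# A made-up file with two entries, each with its own analyser.
nx_classes = {
    "/entry1": "NXentry",
    "/entry1/instrument": "NXinstrument",
    "/entry1/instrument/analyser": "NXcrystal",
    "/entry2": "NXentry",
    "/entry2/instrument": "NXinstrument",
    "/entry2/instrument/analyser": "NXcrystal",
}
class_path = "NXentry/NXinstrument/analyser:NXcrystal/ef"
first = matches_class_path("/entry1/instrument/analyser/ef", class_path, nx_classes)
second = matches_class_path("/entry2/instrument/analyser/ef", class_path, nx_classes)
```

Both targets pass, which shows the two sides of the exchange: the class path can validate an actual target, but it cannot by itself pick one when a file has several entries.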

- https://manual.nexusformat.org/datarules.html#index-3 explains the use of NXdata. In explaining signal, it says that it shall point to a Field (field or link) with such name. This either suggests
   (1) that an NXdata group shall have the referenced dataset child inside either as a fieldType or as a linkType implemented. Note that ; or
   (2) NXdata needs a dataset inside as a child which is either a field (aka hdf5 dataset) or a link (aka hdf5 link), but both are actually fieldType from NeXus point of view.

Personally, I think it’s safer for the NXDL files not to specify whether a NXfield or NXgroup should be a link or not, i.e., NXDL should only refer to fields and groups, with the understanding that a user could choose to make some of them links at runtime. Those links would have to conform to what is described in the NeXus Design web page. Is this equivalent to your option 2?

- How does one implement and use a linkType object in an hdf5 nexus file? It is actually stated (https://manual.nexusformat.org/design.html#index-17) that NeXus links are hdf5 hard links to objects having a @target attribute inside. This statement alone makes linkType unusable in practice, since data collected at many facilities (e.g. EuXFEL) are in multiple (huge) hdf5 files, so one cannot just create hard links between them.

As I wrote above, external links cannot be handled the same way as internal links. Even if we think it’s a limitation, there is simply no practical way of treating them the same. I tried within the Python API and I couldn’t make it work.

I don’t actually think this is a problem because I am fairly sure that external links do not serve the same structural purpose as internal links in the NeXus standard. We too make extensive use of external links to point to raw data, which is really big (at least to me - 36GB) but which we never want to touch. But it’s only the data - never the axes - and it’s the axes that contain metadata that we might want to associate with other metadata, e.g., associating 'time-of-flight' with 'distance'. These are small arrays so there is no reason not to keep them in the local file. If you want to store all the metadata in an external file, then I think you will find that there is no way to make the ’target’ attribute work. Please let me know if I’m wrong.

With best regards,
Ray
--
Ray Osborn, Senior Scientist
Materials Science Division
Argonne National Laboratory
Lemont, IL 60439, USA
Phone: +1 (630) 252-9011
Email: ROsborn at anl.gov

