[NeXus-committee] Example of links

Chang, Peter (DLSLtd,RAL,LSCI) Peter.Chang at Diamond.ac.uk
Mon Feb 3 15:54:48 GMT 2025



Yes, you can add any attribute to a VDS.
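
A minimal h5py sketch of that, assuming a virtual dataset stacked from two datasets in the same file. Storing a list of source paths under a "target" attribute is only an assumption prompted by Ben's question, not something NeXus defines, and all names are illustrative:

```python
import os
import tempfile

import numpy as np
import h5py

# Hypothetical source datasets; a list-valued "target" is an assumption.
source_paths = ["/entry/part_a", "/entry/part_b"]

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "vds_demo.h5")
    with h5py.File(filename, "w") as f:
        for i, path in enumerate(source_paths):
            f.create_dataset(path, data=np.full(5, i, dtype="i4"))

        # Stack the two 1-D sources into one 2x5 virtual dataset
        # ("." means the sources live in this same file).
        layout = h5py.VirtualLayout(shape=(2, 5), dtype="i4")
        for i, path in enumerate(source_paths):
            layout[i] = h5py.VirtualSource(".", path, shape=(5,))
        vds = f.create_virtual_dataset("/entry/stacked", layout, fillvalue=-1)

        # Attributes attach to a virtual dataset like to any other dataset.
        vds.attrs.create("target", data=source_paths, dtype=h5py.string_dtype())

    with h5py.File(filename, "r") as f:
        stored_targets = [str(p) for p in f["/entry/stacked"].attrs["target"]]
```

Whether readers should expect a list here, rather than a single path, would still need to be agreed on.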

My guess is that the linkType was historically implemented as a field with a target attribute, which is how NAPI did it in
HDF4:
https://github.com/nexusformat/code/blob/5b803b3a0014bd9759b3d846da3cd3c1cfafd7d5/src/napi4.c#L1323
and, in xml:
https://github.com/nexusformat/code/blob/5b803b3a0014bd9759b3d846da3cd3c1cfafd7d5/src/nxxml.c#L1891

Regards,
Peter




From: NeXus-committee <nexus-committee-bounces at shadow.nd.rl.ac.uk> On Behalf Of Watts Benjamin via NeXus-committee
Sent: 03 February 2025 11:00
To: Raymond Osborn <rayosborn at mac.com>; Aaron Brewster <asbrewster at lbl.gov>; Lukas Pielsticker <lukas.pielsticker at physik.hu-berlin.de>; Brockhauser Sandor <sandor.brockhauser at physik.hu-berlin.de>
Cc: NeXus Committee <nexus-committee at nexusformat.org>
Subject: Re: [NeXus-committee] Example of links

Hi Everyone,
   Thank you, Sandor, for that very informative wall of text. I like this revisiting of links in NeXus and agree with everything that Sandor says. However, I would also like to make sure that we are considering other cases, such as:

  1.  Is the @target attribute usable with an HDF5 virtual dataset? This would have multiple data sources and so would perhaps require a list of paths in the value of the attribute?

  2.  The NeXus data format is supposed to be container-format agnostic, so we should make sure that we clearly document how linking and @target operate in less featureful container formats such as XML.
Cheers,
Ben



________________________________
From: NeXus-committee <nexus-committee-bounces at shadow.nd.rl.ac.uk> on behalf of Brockhauser Sandor via NeXus-committee <nexus-committee at shadow.nd.rl.ac.uk>
Sent: Friday, January 31, 2025 17:58
To: Raymond Osborn <rayosborn at mac.com>; Aaron Brewster <asbrewster at lbl.gov>; Lukas Pielsticker <lukas.pielsticker at physik.hu-berlin.de>
Cc: NeXus Committee <nexus-committee at nexusformat.org>
Subject: Re: [NeXus-committee] Example of links

Dear all,

In fact, hdf5 internally has a graph data structure, not a tree (in which we may or may not set separately marked, so-called links between nodes). The tree view we generally see is just how most software presents this graph, but each parent-child relationship (created when adding a subgroup or a dataset to a group) is actually just a hard link, an edge of the graph, just like any other hard link we may set at a later stage while creating the file. These links are registered at the object being targeted, and approaching an object via any of these hard links is basically the same from the hdf5 perspective; I am not sure whether h5py would be able to tell you if you are coming from the direction of the "original" link or not. Although /g1/g12 and /g2/g22 are actually the same physical object in the example below, the parent of that object depends on where you are coming from:
>>> import h5py
>>> f=h5py.File('htest.h5','w')
>>> g1 = f.create_group("g1")
>>> g12 = f['g1'].create_group("g12")
>>> g2 = f.create_group("g2")
>>> f['g2']['g22']=g12
>>> f['g2']['g22'].parent
<HDF5 group "/g2" (1 members)>
>>> f['g1']['g12'].parent
<HDF5 group "/g1" (1 members)>
>>> f['g2']['g22']==f['g1']['g12']
True
>>> f.close()
In fact, the created links are ordered according to their creation, so one could work out some chronology. This is how h5dump does it:
HDF5 "htest.h5" {
GROUP "/" {
   GROUP "g1" {
      GROUP "g12" {
      }
   }
   GROUP "g2" {
      GROUP "g22" {
         HARDLINK "/g1/g12"
      }
   }
}
}
But please note(!), this is not the "original" assignment, as shown below by extending the test a bit:
>>> import h5py
>>> f=h5py.File('htest.h5','r+')
>>> g3 = f.create_group("g3")
>>> f['g3']['g32']=f['g2']['g22']
>>> f.close()
Here, one would naively expect the hard link to point to /g2/g22, but have a look at the h5dump output:
HDF5 "htest.h5" {
GROUP "/" {
   GROUP "g1" {
      GROUP "g12" {
      }
   }
   GROUP "g2" {
      GROUP "g22" {
         HARDLINK "/g1/g12"
      }
   }
   GROUP "g3" {
      GROUP "g32" {
         HARDLINK "/g1/g12"      <--!!!
      }
   }
}
}
In fact, it no longer knows whether /g3/g32 was supposed to point to /g1/g12 (e.g. nice_instrument/nice_detector) and not to /g2/g22 (e.g. bad_instrument/bad_detector), because it does not point to a path(!), but to the physical object.

This is a big difference between hard links and soft links in hdf5! In the case of a soft link, the link is actually a path, and it is resolved at run time. Just like Linux symbolic links, these can be broken and can point to different things if the targeted object is changed or replaced.
Additionally, the so-called external links can even point you to a path in a different file. Obviously, if you change the content of that file, such links can easily point to a different physical object.
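
To make the difference concrete, here is a small h5py sketch, assuming an in-memory file and illustrative names (describe_link is a hypothetical helper, and other_file.h5 does not need to exist). It shows that soft and external links store a path that can be read back, that a hard link stores none, and that only the soft link breaks when the name it points to is removed:

```python
import h5py


def describe_link(group, name):
    """Return (kind, stored_path) for the link called `name` in `group`."""
    link = group.get(name, getlink=True)
    if isinstance(link, h5py.SoftLink):
        return ("soft", link.path)
    if isinstance(link, h5py.ExternalLink):
        return ("external", f"{link.filename}:{link.path}")
    return ("hard", None)  # a hard link stores no path, only the object


# In-memory file: nothing is written to disk.
f = h5py.File("links_demo.h5", "w", driver="core", backing_store=False)
g12 = f.create_group("g1/g12")
g2 = f.create_group("g2")
g2["g22"] = g12                                           # hard link
g2["soft"] = h5py.SoftLink("/g1/g12")                     # soft link (a path)
g2["ext"] = h5py.ExternalLink("other_file.h5", "/entry")  # path in another file

kinds = {name: describe_link(g2, name) for name in ("g22", "soft", "ext")}

# Remove the name /g1/g12: the object survives through /g2/g22,
# but the soft link now dangles because its path no longer resolves.
del f["g1/g12"]
hard_still_works = isinstance(g2["g22"], h5py.Group)
try:
    g2["soft"]
    soft_dangles = False
except KeyError:
    soft_dangles = True
f.close()
```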

================
Up to now, it was all about hdf5. In NeXus, we use these hdf5 features a lot, and more besides, such as virtual datasets (where a dataset appears as a NeXus Field, but its content is not a plain binary block of bits; instead it is assembled on the fly by the hdf5 library from multiple datasets that are referenced separately, e.g. via external links, so we can concatenate, crop, slice, etc. on the fly).

The reason we need the concept of a "target" attribute is so that, for any group or dataset the attribute is attached to, we can record where that object was actually derived from. Please note the difference: we do not assume that the data object here is the same as the referenced one (e.g. the one here may contain only the relevant section of what a monitor measured during the experiment, or the one here may have been converted to a different unit compared to the referenced one). This is a big difference compared to a simple hdf5 link (or even a soft link). We argue that in some cases the community using NeXus would like to know where the data originated.
Hence, in addition to the data (which is either a new dataset, a hard/soft/external link, or even a virtual dataset; which one it is is just an hdf5 implementation detail when NeXus is used on top of hdf5), we would like to allow attaching an attribute that tells where the data comes from.

Indeed, the documented linkType has a very similar purpose: with its target attribute it can deliver the information about where a given object comes from. Some problems with its documentation (https://manual.nexusformat.org/nxdl_desc.html#linktype) pushed us to propose something (indeed) similar:

- linkType says that it can be defined under definition, group, or field, but the documentation of fieldType (contrary to the documentation of definition and groupType) does not list it as a possible child.

- @napimount: the doc says that it is a group attribute, but isn't it a linkType attribute? Note that the link provided for further explanation (http://manual.nexusformat.org/_static/NeXusIntern.pdf) is not valid.

- @target: the doc says that it is added only because of hdf5, but we believe that its usefulness is independent of the backend, whether that is hdf5 or something else.

- In the example, @target is added to /entry/data/polar_angle, which corresponds to the explanation, but it is also added to /entry/instrument/detector/polar_angle. It is not explained why it is needed there. Is it because this is not derived from anywhere else? Why is it then not simply a ".", the convention used throughout NeXus?

- If these two datasets were actually the same physical object (e.g. both occurrences were hdf5 hard links to the same object), this would explain the example, but as pointed out above, we foresee other use cases, too.

- according to the documentation @target must be an absolute path (although validTargetName suggests its future extension to relative paths, even including 'parent' relationship although we have seen that in case of hdf5 hard links in the tree, the interpretation of 'parent' can become tricky)

- Note that the example at validTargetName is not an absolute path at all, contrary to what is explained at linkType. Instead of an absolute (hdf5) path, a class_path is used here, like "NXentry/NXinstrument/analyser:NXcrystal/ef". This does not point to a unique location in a NeXus file (e.g. if we have 2 entries, each with its own instrument and analyser defined), resulting in ambiguity when link targets are resolved.

- https://manual.nexusformat.org/datarules.html#index-3 explains the use of NXdata. In explaining signal, it says that it shall point to a Field (field or link) with that name. This suggests either
   (1) that an NXdata group shall contain the referenced dataset as a child, implemented either as a fieldType or as a linkType; or
   (2) that NXdata needs a dataset inside as a child which is either a field (aka hdf5 dataset) or a link (aka hdf5 link), but both are actually fieldType from the NeXus point of view.
Interpretation (1) contradicts the NeXus documentation in several places. E.g. in the NXdata definition (https://github.com/nexusformat/definitions/blob/main/base_classes/NXdata.nxdl.xml), NXdata/DATA is actually defined as a fieldType: <field name="DATA" type="NX_NUMBER" nameType="any">. The NXDL syntax (https://github.com/nexusformat/definitions/blob/main/nxdl.xsd) handles fieldType and linkType separately, as terms that are not interchangeable but can each be used in definitions. Hence, NXDL supports defining a field or a link. Note that linkType is rarely used in NeXus. An example is NXxas (https://github.com/nexusformat/definitions/blob/main/applications/NXxas.nxdl.xml), where 'energy' is not a fieldType but a linkType: <link name="energy" target="/NXentry/NXinstrument/monochromator:NXmonochromator/energy"/>. Based on this, "energy" cannot be referenced in NXdata under @signal as an NXdata/DATA (which is a fieldType) or under @axes as an NXdata/AXISNAME (which is also a fieldType).
Interpretation (2) tries to resolve the problem by simply saying that a linkType is a kind of fieldType, but this is not made clear at all by the NeXus documentation.

- How should a linkType object be implemented and used in an hdf5 NeXus file? It is actually stated (https://manual.nexusformat.org/design.html#index-17) that NeXus links are hdf5 hard links to objects that carry a @target attribute. This statement alone makes linkType unusable in practice, since data collected at many facilities (e.g. EuXFEL) are spread over multiple (huge) hdf5 files, and one cannot create hard links between files.
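
The ambiguity of the class path mentioned in the list above can be sketched in plain Python. The tree model, the node and find_by_class_path helpers, and the matching rules (a segment is either an NX class, a name:NXclass pair, or a plain name) are all assumptions made for illustration, not the behaviour of any NeXus tool:

```python
def node(nx_class=None, **children):
    """Hypothetical tree node: a group with an NX class, or a field (class None)."""
    return {"class": nx_class, "children": children}


def segment_matches(segment, name, nx_class):
    """Assumed matching rule for one class-path segment."""
    if ":" in segment:
        want_name, want_class = segment.split(":", 1)
        return name == want_name and nx_class == want_class
    if segment.startswith("NX"):
        return nx_class == segment
    return name == segment


def find_by_class_path(root, class_path):
    """Return every absolute path in the tree that satisfies the class path."""
    matches = []

    def walk(current, remaining, prefix):
        if not remaining:
            matches.append(prefix)
            return
        for name, child in current["children"].items():
            if segment_matches(remaining[0], name, child["class"]):
                walk(child, remaining[1:], f"{prefix}/{name}")

    walk(root, class_path.strip("/").split("/"), "")
    return matches


def instrument():
    return node("NXinstrument", analyser=node("NXcrystal", ef=node()))


# Two entries, each with its own instrument and analyser.
tree = node(
    "NXroot",
    entry1=node("NXentry", instrument=instrument()),
    entry2=node("NXentry", instrument=instrument()),
)

matches = find_by_class_path(tree, "NXentry/NXinstrument/analyser:NXcrystal/ef")
```

With two entries the class path matches two locations, whereas an absolute path such as /entry1/instrument/analyser/ef names exactly one.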

====================
Hence, a clarification in documentation would be nice:
- Groups can have Groups and Fields inside.  (In hdf5, they can be either created directly as children /groups, datasets, or virtual datasets/ or referenced via links /hard, soft, or external links/)
- @target (or @origin? or @reference?) attribute can be added to any Group or Field to declare where the data is coming from.
   + If a new Group or Field is created here, its 'origin' attribute can be set to the other object where its data is coming from.
   + If we use a link here, the origin attribute can be set in the referenced object.
   + Note that multiple linking loses the intermediate connections: e.g. in the case of a -> b -> c where c@origin=c, resolving 'a@origin' will report that its origin is 'c' and not the direct parent 'b'. This is not necessarily a problem, because the data actually comes from there.
   + In case the data has only been referenced (and potentially altered) somewhere in the chain and not linked, this will also be resolved correctly, e.g. a (being different from b but having a@origin=b) with b -> c (being different from d, but having c@origin=d) and d -> e where e@origin=e. The full chain of dependency would be readable from the attributes properly: a@origin: b; b@origin: d [also c@origin: d]; d@origin: e
- An application definition may require the presence of this attribute, so that one can find out where the data came from and what the corresponding data objects are.
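
The two cases in the list above can be checked with a small backend-independent model. This is only a sketch: the dictionaries stand in for a file, "links" maps a linked path to the path it refers to, "origin" holds the attribute stored on each real object, and the function names are hypothetical:

```python
def resolve(path, links):
    """Follow link edges until a path that names a real object is reached."""
    seen = set()
    while path in links:
        if path in seen:
            raise ValueError(f"link cycle at {path!r}")
        seen.add(path)
        path = links[path]
    return path


def read_origin(path, links, origin):
    """Read @origin as a reader would: through any links, from the object."""
    return origin[resolve(path, links)]


def origin_chain(start, links, origin):
    """List (path, @origin) pairs until an object names itself as its origin."""
    pairs, visited, path = [], set(), start
    while True:
        obj = resolve(path, links)
        if obj in visited:
            raise ValueError(f"origin cycle at {path!r}")
        visited.add(obj)
        target = read_origin(path, links, origin)
        pairs.append((path, target))
        if resolve(target, links) == obj:
            return pairs
        path = target


# Case 1: a -> b -> c with c@origin=c; the intermediate 'b' is not visible.
case1 = read_origin("a", {"a": "b", "b": "c"}, {"c": "c"})

# Case 2: a@origin=b, b -> c, c@origin=d, d -> e, e@origin=e.
case2 = origin_chain("a", {"b": "c", "d": "e"}, {"a": "b", "c": "d", "e": "e"})
```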

Thanks,
Sandor


On Fri, 2025-01-31 at 08:43 -0600, Raymond Osborn via NeXus-committee wrote:
It is possible to query a soft link in HDF5, but I don’t believe there is any way to query a hard link, without walking through the entire file checking for object IDs. And, of course, there is no way of telling which is the parent.
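
A rough sketch of such a walk in h5py, assuming the object address reported by the low-level h5py.h5o.get_info call is an acceptable identity check within a single file (hard_link_paths is a hypothetical helper):

```python
import h5py


def hard_link_paths(h5file):
    """Map object address -> all hard-link paths, for objects with several paths."""
    root = h5file["/"]
    paths = {h5py.h5o.get_info(root.id).addr: ["/"]}

    def walk(group, prefix):
        for name in group:
            if not isinstance(group.get(name, getlink=True), h5py.HardLink):
                continue  # soft and external links store a path; skip them
            obj = group[name]
            addr = h5py.h5o.get_info(obj.id).addr
            first_visit = addr not in paths
            path = f"{prefix}/{name}"
            paths.setdefault(addr, []).append(path)
            # Descend only once per group so that link cycles terminate.
            if first_visit and isinstance(obj, h5py.Group):
                walk(obj, path)

    walk(root, "")
    return {addr: sorted(p) for addr, p in paths.items() if len(p) > 1}


# Same layout as the htest.h5 example earlier in the thread, kept in memory.
f = h5py.File("htest_demo.h5", "w", driver="core", backing_store=False)
g12 = f.create_group("g1/g12")
f.create_group("g2")["g22"] = g12
f.create_group("g3")["g32"] = f["g2/g22"]
shared = sorted(hard_link_paths(f).values())
f.close()
```

The result lists the three paths as equals; nothing in it marks one of them as the parent.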

Ray


On Jan 30, 2025, at 3:50 PM, Aaron Brewster <asbrewster at lbl.gov> wrote:

In h5py, I had thought you could query a group or field to see if it's a soft link and get its original location.  I don't know how to do the same for a hard link but I presume it's possible.  Therefore the target attribute would appear to be redundant.

However, to me, the most important reason to have @target is to not be tied to HDF5.  It's useful to have it from a specification point of view.
-Aaron

On Thu, Jan 30, 2025 at 1:28 PM Raymond Osborn via NeXus-committee <nexus-committee at shadow.nd.rl.ac.uk> wrote:
Hi Paul,
Thanks for the follow-up questions. I will try to answer them below.

From: NeXus-committee <nexus-committee-bounces at shadow.nd.rl.ac.uk> on behalf of Paul Millar via NeXus-committee <nexus-committee at shadow.nd.rl.ac.uk>
Date: Thursday, January 30, 2025 at 12:07 PM
To: nexus-committee at nexusformat.org <nexus-committee at nexusformat.org>
Subject: Re: [NeXus-committee] Example of links

Hi Ray,

Thanks for sharing these examples, and for talking about the "target" attribute.

For me, this is very interesting.

I took the opportunity to read through the description of groups and
links in the HDF5 manual.  I've a background in storage and filesystem
programming, so the concepts in HDF5 make perfect sense to me: it's
(more or less) just the standard POSIX filesystem's namespace.  HDF5
even reuses some of the POSIX vocabulary.

What confuses me is the "target" attribute in NeXus.

As the NeXus Design page itself describes, hard links (i.e., the same
object being linked to under multiple groups) are symmetric. There is no
sense of source and destination.  Instead, hard links are simply being
able to refer to the same object via two (or more) paths.  Under HDF5,
these paths are equivalent: neither path is more important.

From what I see, the NeXus "target" attribute seeks to break this
symmetry.  The "target" attribute's value is the absolute form of one of
these paths.  This makes the "target" path a preferred way of referring to
the object.

What I'm missing is why having a preferred path is necessary in NeXus.

If the reason for using links is to save space (e.g., adding the same sample information to multiple entries), then it probably doesn’t matter which is the parent and which the child. The purpose of the link could also be to ensure that, e.g., the sample lattice parameter is updated in every entry when it is changed in one of them. Again, none of the objects is obviously the parent.

However, there are important structural reasons for adding links with one of the objects as the parent. The most common use of links is in the NXdata group, where the axes are stored elsewhere. Here’s a shortened version of chopper.nxs, for example.

>>> print(chopper.tree)
chopper:NXroot
    entry:NXentry
       data:NXdata
           @axes = ['polar_angle', 'time_of_flight']
           @signal = 'data'
           data = int32(148x750)
           polar_angle -> /entry/instrument/detector/polar_angle
           time_of_flight -> /entry/instrument/detector/time_of_flight
       instrument:NXinstrument
           detector:NXdetector
               distance = float32(148)
               polar_angle = float32(148)
               time_of_flight = float32(751)
               type = 'He3 gas cylinder'

Here the main NXdata group plots the data against polar angle and time-of-flight, both of which are properties of the detector and so are stored in ‘entry/instrument/detector’. If someone plotting the data wants to know about other detector properties, such as the sample-to-detector distance, those are also in the NXdetector group and the target attribute shows the user where to look. There could be multiple NXdetector groups, but the link identifies the right one. So the target attribute provides important functionality. In a data reduction script that wants to convert from time-of-flight to energy transfer, it is essential they know in which group the relevant distance fields are stored. That is only possible by making the object in the NXdetector group the parent and using the ’target’ attribute to point to it.
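
As a sketch of that use, assuming h5py, an in-memory stand-in for chopper.nxs with made-up values, and a hypothetical owning_group helper, a reduction script could go from the plotted axis to the detector distances like this:

```python
import numpy as np
import h5py


def owning_group(field):
    """Open the group that holds the object named by the field's @target."""
    return field.file[field.attrs["target"]].parent


# In-memory stand-in for chopper.nxs; all values are made up.
f = h5py.File("chopper_demo.nxs", "w", driver="core", backing_store=False)
detector = f.create_group("entry/instrument/detector")
detector.attrs["NX_class"] = "NXdetector"
detector.create_dataset("distance", data=np.full(148, 2.5, dtype="f4"))
tof = detector.create_dataset("time_of_flight", data=np.arange(751, dtype="f4"))
tof.attrs["target"] = tof.name  # /entry/instrument/detector/time_of_flight

data = f.create_group("entry/data")
data.attrs["NX_class"] = "NXdata"
data["time_of_flight"] = tof  # hard link: same object, second path

# Starting from the plotting axis, recover the detector group and its distance.
axis = f["entry/data/time_of_flight"]
source = owning_group(axis)
source_path = source.name
distance = source["distance"][()]
f.close()
```

Without the attribute, the script would have only the hard link itself to go on, which carries no preferred path.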

Ironically, I think this functional purpose is what led the Fairmat group to propose the ’target’ attribute, so the original reasoning was sound, if now forgotten.

The NeXus Design page is somewhat coy about saying why a "target"
attribute is needed.  There's some vague mention of people getting
confused when using a particular tool, but nothing concrete.  If people
are confused, isn't this rather a problem with that tool or with how
NeXus is organising data?

The importance of links was crystal-clear to the original developers of NeXus twenty years ago for the reasons I described above. I hadn’t realized that this aspect of the standard was no longer understood. I guess we did a bad job of documenting it at the time.

The page also includes some rather confusing use of terminology. The
page seemingly confuses "links" (all objects are accessible through at
least one link, if not they are garbage collected) with "hard linking"
(a common term for creating a new reference to some existing objects).

If documentation of NeXus links is intermingled with discussions of garbage collection, then it should be changed.


The NeXus Design page also talks about the "original dataset". This is
arguably wrong.  There is no "original dataset" since all hard links
refer to the same, single dataset. One might talk about the "original
path".  However, given two paths, what is it that makes one path "original"?

This may be clumsy wording, but I think the meaning in the above example is that ‘/entry/instrument/detector/time_of_flight’ is the “original dataset.” It is reproduced in the NXdata group to make plotting more convenient.

As a counter example using the "Linking in a NeXus file" diagram from
the NeXus Design page, with HDF5 semantics I could create the dataset in
one group (that happens to be NXdata) and then create a link to that
dataset under a different group (which happens to be
NXinstrument/NXdetector). In temporal order, the "original dataset" (or
original path, if you prefer) would be under the NXdata group, which
isn't what is shown on the NeXus Design page and (I suspect) not what is
intended.

The temporal order when writing the file is irrelevant.

All your complaints about the documentation seem justified, so we should probably revise it, but the value of using the target attribute is still, I believe, valid.

I hope this helps.

With best regards,
Ray
 --
Ray Osborn, Senior Scientist
Materials Science Division
Argonne National Laboratory
Lemont, IL 60439, USA
Phone: +1 (630) 252-9011
Email: ROsborn at anl.gov

_______________________________________________
NeXus-committee mailing list
NeXus-committee at nexusformat.org
https://lists.nexusformat.org/mailman/listinfo/nexus-committee


