[NeXus-committee] Example of links

Brockhauser Sandor sandor.brockhauser at physik.hu-berlin.de
Fri Jan 31 16:58:54 GMT 2025


Dear all,

In fact, hdf5 internally has a graph data structure and not a tree
(in which we may or may not set some separately marked, so-called links
between nodes). The tree view we generally see is just how this graph
is presented by most software, but each parent-child relationship
(established when creating a subgroup or a dataset in a group) is
actually just a hard link, an edge of the graph, just like any other
hard link we may set at a later stage during the creation of the file.
These links are registered at the object being targeted, and
approaching an object via any of these hard links is basically the same
from the hdf5 perspective, so I am not sure whether h5py would be able
to tell you if you are coming from the direction of the "original" link
or not. Although /g1/g12 and /g2/g22 are actually the same physical
object in the example below, the parent of this object actually depends
on where you are coming from:
>>> import h5py
>>> f=h5py.File('htest.h5','w')
>>> g1 = f.create_group("g1")
>>> g12 = f['g1'].create_group("g12")
>>> g2 = f.create_group("g2")
>>> f['g2']['g22']=g12
>>> f['g2']['g22'].parent
<HDF5 group "/g2" (1 members)>
>>> f['g1']['g12'].parent
<HDF5 group "/g1" (1 members)>
>>> f['g2']['g22']==f['g1']['g12']
True
>>> f.close()
In fact, the links are listed in a fixed order (by name by default;
hdf5 tracks their creation order only if this is requested for the
group), so one could be tempted to work out some chronology. This is
how h5dump shows it:
HDF5 "htest.h5" {
GROUP "/" {
   GROUP "g1" {
      GROUP "g12" {
      }
   }
   GROUP "g2" {
      GROUP "g22" {
         HARDLINK "/g1/g12"
      }
   }
}
}
But please note(!), this is not the "original" assignment, as shown
below by extending the test a bit:
>>> import h5py
>>> f=h5py.File('htest.h5','r+')
>>> g3 = f.create_group("g3")
>>> f['g3']['g32']=f['g2']['g22']
>>> f.close()
Here, one would naively expect to see that the hard link actually
points to /g2/g22, but have a look at the h5dump output:
HDF5 "htest.h5" {
GROUP "/" {
   GROUP "g1" {
      GROUP "g12" {
      }
   }
   GROUP "g2" {
      GROUP "g22" {
         HARDLINK "/g1/g12"
      }
   }
   GROUP "g3" {
      GROUP "g32" {
         HARDLINK "/g1/g12"      <--!!!
      }
   }
}
}
In fact, it no longer knows whether /g3/g32 was supposed to point to
/g1/g12 (e.g. nice_instrument/nice_detector) or to /g2/g22 (e.g.
bad_instrument/bad_detector), because the link does not point to a
path(!), but to the physical object.
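
Since a hard link carries no path, the only way to list all names of an
object is to walk the file and group the paths by object. Below is a
minimal sketch of this, not a definitive implementation. It assumes
that the installed h5py exposes the low-level h5py.h5o.get_info() call
with its addr field; the helper names hard_link_paths and aliases, and
the file name alias_demo.h5, are made up for illustration. The file
lives in memory only and reproduces the g1/g2/g3 test above.

```python
import h5py

def hard_link_paths(group, prefix="", ancestors=frozenset()):
    """Yield (path, address) for every hard link reachable from `group`."""
    ancestors = ancestors | {h5py.h5o.get_info(group.id).addr}
    for name in group:
        if not isinstance(group.get(name, getlink=True), h5py.HardLink):
            continue  # soft/external links store a path, not an object
        obj = group[name]
        path = f"{prefix}/{name}"
        addr = h5py.h5o.get_info(obj.id).addr
        yield path, addr
        # Hard links may form cycles, so never descend into an ancestor.
        if isinstance(obj, h5py.Group) and addr not in ancestors:
            yield from hard_link_paths(obj, path, ancestors)

def aliases(h5file):
    """Return the groups of paths that lead to one and the same object."""
    by_addr = {}
    for path, addr in hard_link_paths(h5file["/"]):
        by_addr.setdefault(addr, []).append(path)
    return sorted(sorted(paths) for paths in by_addr.values() if len(paths) > 1)

# In-memory file (nothing is written to disk) with the same links as above.
with h5py.File("alias_demo.h5", "w", driver="core", backing_store=False) as f:
    g12 = f.create_group("g1").create_group("g12")
    f.create_group("g2")["g22"] = g12
    f.create_group("g3")["g32"] = f["g2/g22"]
    shared = aliases(f)

print(shared)
```

The three paths come out as one group with no indication of which one
was used when the object was created, which is the point made above.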

This is a big difference between hard links and soft links in hdf5! In
case of a soft link, the link is actually a path and it is resolved at
runtime. Just like Linux symbolic links, these can be broken and can
point to different things if the targeted object is changed or
replaced.
Additionally, the so-called external links can even point you to a path
in a different file. Obviously, if you change the content of that file,
such links can easily end up pointing to a different physical object.
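
The difference between the three link kinds can be seen from h5py as
well. The following is only a sketch: the file names links_demo.h5 and
other_file.h5 and the path /entry are made up, and the external file is
never opened. It uses Group.get(..., getlink=True), which returns the
link itself instead of resolving it.

```python
import h5py

with h5py.File("links_demo.h5", "w", driver="core", backing_store=False) as f:
    f.create_group("g1").create_group("g12")
    f["hard"] = f["g1/g12"]                  # hard link: bound to the object
    f["soft"] = h5py.SoftLink("/g1/g12")     # soft link: stores a path only
    f["ext"] = h5py.ExternalLink("other_file.h5", "/entry")  # hypothetical file

    kinds = {name: type(f.get(name, getlink=True)).__name__
             for name in ("hard", "soft", "ext")}
    soft_path = f.get("soft", getlink=True).path

    # Remove the name /g1/g12: the object survives through the hard link,
    # while the soft link still stores "/g1/g12" and is now broken.
    del f["g1/g12"]
    hard_ok = isinstance(f["hard"], h5py.Group)
    soft_target = f.get("soft")              # None: the path does not resolve

    # A new, different object under the old path is seen by the soft link only.
    f["g1"].create_dataset("g12", data=[1, 2, 3])
    soft_now = type(f["soft"]).__name__
    hard_now = type(f["hard"]).__name__

print(kinds, soft_path, hard_ok, soft_target, soft_now, hard_now)
```

After the replacement, the soft link resolves to the new dataset while
the hard link still refers to the old group.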

================
Up to now, it was all about hdf5. In NeXus, we do use these hdf5
features a lot, and even more, like virtual datasets (where a dataset
looks like a NeXus Field, but its content is actually not a plain
binary block of bits, but a dataset created on the fly by the hdf5
library from multiple datasets that are referenced separately, e.g. in
external files, so we can concatenate, crop, slice, etc. on the fly).

This is the reason why we need the concept of a "target" attribute: for
any group or dataset this attribute is attached to, we can register
that the object was actually derived from here or there. Please note
the difference: we do not assume that the data object here is the same
as the referenced one (e.g. the one here may contain only the relevant
section of what a monitor was measuring during the experiment, or the
one here may have been converted to a different unit compared to the
referenced one). This is a big difference compared to a simple hdf5
link (or even a soft link). We argue that in some cases the community
using NeXus would like to know where the data originated from. Hence,
in addition to the data (which is either a new dataset, a
hard/soft/external link, or even a virtual dataset; which one it is is
just an hdf5 implementation detail when NeXus is used on top of hdf5),
we would like to allow attaching an attribute telling where it is
coming from.
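
As an illustration of the proposed use (and not of the currently
documented NeXus behaviour), here is a small h5py sketch in which a
derived dataset records its origin in a "target" attribute although it
is a different object with different content. The paths, the units and
the file name origin_demo.h5 are made up for this example.

```python
import h5py
import numpy as np

with h5py.File("origin_demo.h5", "w", driver="core", backing_store=False) as f:
    # Source data as recorded, e.g. monitor time stamps in microseconds.
    raw = f.create_dataset("/entry/instrument/monitor/time",
                           data=np.arange(10, dtype="uint32") * 1000)
    raw.attrs["units"] = "us"

    # Derived object: only the relevant section, converted to seconds.
    cut = f.create_dataset("/entry/data/time", data=raw[2:6] / 1e6)
    cut.attrs["units"] = "s"
    cut.attrs["target"] = raw.name   # proposed meaning: where the data came from

    origin = cut.attrs["target"]
    is_same_object = f[origin] == cut   # False: not a link, a new dataset
    n_source, n_derived = f[origin].shape[0], cut.shape[0]

print(origin, is_same_object, n_source, n_derived)
```

A plain hdf5 link could not express this relation, because the two
datasets differ in length, type and unit.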

Indeed, the documented linkType has a very similar purpose: with its
target attribute it can deliver the information where a given object is
coming from. There are some problems with its documentation
(https://manual.nexusformat.org/nxdl_desc.html#linktype), which pushed
us to propose something (indeed) similar:

- linkType says that it can be defined under definition, group, or
field, but the documentation of fieldType (contrary to the
documentation of definition and groupType) does not list it as a
possible child.

- @napimount: the doc says that it is a group attribute, but is it not
a linkType attribute? Note that the provided link for further
explanation (http://manual.nexusformat.org/_static/NeXusIntern.pdf) is
not valid.

- @target: the doc says that it is added only because of hdf5, but we
believe that its usefulness is independent of whether the backend is
hdf5 or something else.

- in the example, @target is added to /entry/data/polar_angle, which
corresponds to the explanation, but it is also added to
/entry/instrument/detector/polar_angle. It is not explained why it is
needed there. Is it because this is not derived from anywhere else? Why
is it then not simply ".", the convention used throughout NeXus?

- If these two datasets were actually the same physical object (e.g.
both occurrences were hdf5 hard links to the same object), this would
explain the example, but as pointed out above, we foresee other use
cases, too.

- according to the documentation, @target must be an absolute path
(although validTargetName suggests its future extension to relative
paths, even including a 'parent' relationship, although we have seen
that in case of hdf5 hard links in the tree, the interpretation of
'parent' can become tricky)

- note that the example at validTargetName is not at all an absolute
path as explained at linkType. Instead of an absolute (hdf5) path, a
class_path is used here, like
"NXentry/NXinstrument/analyser:NXcrystal/ef". This does not at all
point to a unique location in a NeXus file (e.g. if we have 2 entries
with their respective instruments and analysers defined), resulting in
ambiguity when one tries to resolve link targets.

- https://manual.nexusformat.org/datarules.html#index-3 explains the
use of NXdata. In explaining signal, it says that it shall point to a
Field (field or link) with that name. This suggests either
   (1) that an NXdata group shall have the referenced dataset child
inside either as a fieldType or as a linkType implemented. Note that ;
or
   (2) NXdata needs a dataset inside as a child which is either a field
(aka hdf5 dataset) or a link (aka hdf5 link), but both are actually
fieldType from NeXus point of view. 
Interpretation (1) contradicts the NeXus documentation in several
places. E.g. in the NXdata definition
(https://github.com/nexusformat/definitions/blob/main/base_classes/NXdata.nxdl.xml
), NXdata/DATA is actually defined as a fieldType: <field name="DATA"
type="NX_NUMBER" nameType="any">. The NXDL syntax
(https://github.com/nexusformat/definitions/blob/main/nxdl.xsd) handles
fieldType and linkType separately, not as interchangeable terms but as
terms which can be used separately in definitions. Hence, NXDL supports
defining a field or a link. Note that linkType is rarely used in NeXus.
An example is NXxas
(https://github.com/nexusformat/definitions/blob/main/applications/NXxas.nxdl.xml
), where 'energy' is not a fieldType but a linkType:
<link name="energy"
target="/NXentry/NXinstrument/monochromator:NXmonochromator/energy"/>.
Based on this, "energy" cannot be referenced in NXdata under @signal as
an NXdata/DATA (which is a fieldType) or under @axes as an
NXdata/AXISNAME (which is also a fieldType).
Interpretation (2) tries to resolve the problem by simply saying that a
linkType is a kind of fieldType, but this is not at all made clear in
the NeXus documentation.

- How should a linkType object be implemented and used in an hdf5 NeXus
file? It is actually stated
(https://manual.nexusformat.org/design.html#index-17) that NeXus links
are hdf5 hard links to objects having a @target attribute inside. This
statement alone makes linkType unusable in practice, since data
collected at many facilities (e.g. EuXFEL) is spread over multiple
(huge) hdf5 files, and one cannot create hard links between files.
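
To make the class_path ambiguity from the validTargetName item above
concrete, here is a rough sketch of a resolver. Its matching rules are
my own assumption and are not taken from the NeXus manual: a segment
"name:NXclass" must match both name and class, a segment starting with
"NX" matches the class only, and anything else matches the name. The
helper names and the two-entry test file are made up.

```python
import h5py

def nx_class(obj):
    """Return the NX_class attribute of an object as str ('' if missing)."""
    value = obj.attrs.get("NX_class", "")
    return value.decode() if isinstance(value, bytes) else str(value)

def matches(name, obj, segment):
    """Assumed matching rules for one class_path segment (see lead-in)."""
    if ":" in segment:
        want_name, want_class = segment.split(":", 1)
        return name == want_name and nx_class(obj) == want_class
    if segment.startswith("NX"):
        return nx_class(obj) == segment
    return name == segment

def resolve_class_path(group, class_path):
    """Return every hdf5 path below `group` that matches `class_path`."""
    head, _, rest = class_path.partition("/")
    hits = []
    for name, obj in group.items():
        if not matches(name, obj, head):
            continue
        if not rest:
            hits.append(obj.name)
        elif isinstance(obj, h5py.Group):
            hits.extend(resolve_class_path(obj, rest))
    return hits

with h5py.File("classpath_demo.h5", "w", driver="core", backing_store=False) as f:
    for entry_name in ("entry1", "entry2"):
        entry = f.create_group(entry_name)
        entry.attrs["NX_class"] = "NXentry"
        instrument = entry.create_group("instrument")
        instrument.attrs["NX_class"] = "NXinstrument"
        analyser = instrument.create_group("analyser")
        analyser.attrs["NX_class"] = "NXcrystal"
        analyser["ef"] = 5.0
    hits = resolve_class_path(f, "NXentry/NXinstrument/analyser:NXcrystal/ef")

print(hits)
```

With two entries in the file, the single class_path resolves to two
different hdf5 paths, so it cannot serve as a unique link target.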

====================
Hence, a clarification in the documentation would be nice:
- Groups can have Groups and Fields inside.  (In hdf5, they can be
either created directly as children /groups, datasets, or virtual
datasets/ or referenced via links /hard, soft, or external links/)
- @target (or @origin? or @reference?) attribute can be added to any
Group or Field to declare where the data is coming from.
   + If a new Group or Field is created here, its 'origin' attribute
can be set to the other object where its data is coming from.
   + If we use a link here, the origin attribute can be set in the
referenced object. 
   + Note that multiple linking loses the intermediate connections:
e.g. in case of a -> b -> c where c@origin=c, resolving 'a@origin'
will tell that its origin is 'c' and not the direct parent 'b'. This is
not necessarily a problem, because the data is actually coming from
there.
   + In case the data has only been referenced (and potentially
altered) somewhere in the chain and not linked, this will also be
resolved correctly, e.g. a (being different from b but having
a@origin=b) with b -> c (c being different from d, but having
c@origin=d) and d -> e where e@origin=e. The full chain of dependency
would be properly readable from the attributes: a@origin: b;
b@origin: d; [also c@origin: d]; d@origin: e
- An application definition may require the presence of this attribute,
so that one can find out where the data came from and what the
corresponding data objects are.
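
The chain resolution described in the last "+" item can be sketched in
a few lines of plain Python. This is only a model of the idea: the two
dictionaries stand in for the links and the proposed origin attributes
of a file, and the names origin_of and origin_chain are made up. With
h5py, origin_of could presumably be replaced by a lookup of the
attribute on the opened object, since opening a link already resolves
it.

```python
# Model of the example: b -> c and d -> e are links,
# a@origin=b, c@origin=d, e@origin=e are attributes.
links = {"b": "c", "d": "e"}
origin_attr = {"a": "b", "c": "d", "e": "e"}

def origin_of(name):
    """Read the origin attribute of `name`, following links first."""
    while name in links:
        name = links[name]
    return origin_attr.get(name)

def origin_chain(start):
    """Follow origin attributes until an object names itself (or nothing)."""
    chain = [start]
    while True:
        nxt = origin_of(chain[-1])
        if nxt is None or nxt == chain[-1]:
            return chain
        if nxt in chain:
            raise ValueError(f"cyclic origin chain at {nxt!r}")
        chain.append(nxt)

chain_from_a = origin_chain("a")
print(chain_from_a)
print(origin_of("c"))
```

Starting from a, the chain visits b, d and e, and c reports d as its
origin, matching the list given above.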

Thanks,
Sandor


On Fri, 2025-01-31 at 08:43 -0600, Raymond Osborn via NeXus-committee
wrote:
> It is possible to query a soft link in HDF5, but I don’t believe
> there is any way to query a hard link, without walking through the
> entire file checking for object IDs. And, of course, there is no way
> of telling which is the parent. 
> 
> Ray
> 
> > On Jan 30, 2025, at 3:50 PM, Aaron Brewster <asbrewster at lbl.gov>
> > wrote:
> > 
> > In h5py, I had thought you could query a group or field to see if
> > it's a soft link and get its original location.  I don't know how
> > to do the same for a hard link but I presume it's possible. 
> > Therefore the target attribute would appear to be redundant.
> > 
> > However, to me, the most important reason why to have @target is to
> > not be tied to HDF5.  It's useful to have it from a
> > specification point of view.
> > -Aaron
> > 
> > On Thu, Jan 30, 2025 at 1:28 PM Raymond Osborn via NeXus-committee
> > <nexus-committee at shadow.nd.rl.ac.uk> wrote:
> > > Hi Paul,
> > > Thanks for the follow-up questions. I will try to answer them
> > > below.
> > > 
> > > From: NeXus-committee
> > > <nexus-committee-bounces at shadow.nd.rl.ac.uk> on behalf of Paul
> > > Millar via NeXus-committee <nexus-committee at shadow.nd.rl.ac.uk>
> > > Date: Thursday, January 30, 2025 at 12:07 PM
> > > To: nexus-committee at nexusformat.org
> > > <nexus-committee at nexusformat.org>
> > > Subject: Re: [NeXus-committee] Example of links
> > > 
> > > > Hi Ray,
> > > >  
> > > > Thanks for sharing these examples, for talking about the
> > > > "target" attribute.
> > > >  
> > > > For me, this is very interesting.
> > > >  
> > > > I took the opportunity to read through the description of
> > > > groups and 
> > > > links in the HDF5 manual.  I've a background in storage and
> > > > filesystem 
> > > > programming, so the concepts in HDF5 make perfect sense to me:
> > > > it's 
> > > > (more or less) just the standard POSIX filesystem's namespace. 
> > > > HDF5 
> > > > even reuses some of the POSIX vocabulary.
> > > >  
> > > > What confuses me is the "target" attribute in NeXus.
> > > >  
> > > > As the NeXus Design page itself describes, hard links (i.e.,
> > > > the same 
> > > > object being linked to under multiple groups) are symmetric.
> > > > There is no 
> > > > sense of source and destination.  Instead, hard links are
> > > > simply being 
> > > > able to refer to the same object via two (or more) paths. 
> > > > Under HDF5, 
> > > > these paths are equivalent: neither path is more important.
> > > >  
> > > >  From what I see, the NeXus "target" attribute seeks to break
> > > > this 
> > > > symmetry.  The "target" attribute's value is the absolute path
> > > > of these 
> > > > paths.  This makes the "target" path a preferred way of
> > > > referring to the 
> > > > object.
> > > >  
> > > > What I'm missing is why having a preferred path is necessary in
> > > > NeXus.
> > > 
> > > 
> > > If the reason for using links is to save space (e.g., adding the
> > > same sample information to multiple entries), then it probably
> > > doesn’t matter which is the parent and which the child. The
> > > purpose of the link could also be to ensure that, e.g., the
> > > sample lattice parameter is updated in every entry when it is
> > > changed in one of them. Again, none of the objects is obviously
> > > the parent.
> > > 
> > > However, there are important structural reasons for adding links
> > > with one of the objects as the parent. The most common use of
> > > links is in the NXdata group, where the axes are stored
> > > elsewhere. Here’s a shortened version of chopper.nxs, for
> > > example. 
> > > 
> > > >>> print(chopper.tree)
> > > chopper:NXroot
> > >     entry:NXentry
> > >        data:NXdata
> > >            @axes = ['polar_angle', 'time_of_flight']
> > >            @signal = 'data'
> > >            data = int32(148x750)
> > >            polar_angle -> /entry/instrument/detector/polar_angle
> > >            time_of_flight ->
> > > /entry/instrument/detector/time_of_flight
> > >        instrument:NXinstrument
> > >            detector:NXdetector
> > >                distance = float32(148)
> > >                polar_angle = float32(148)
> > >                time_of_flight = float32(751)
> > >                type = 'He3 gas cylinder'
> > > 
> > > Here the main NXdata group plots the data against polar angle and
> > > time-of-flight, both of which are properties of the detector and
> > > so are stored in ‘entry/instrument/detector’. If someone plotting
> > > the data wants to know about other detector properties, such as
> > > the sample-to-detector distance, those are also in the NXdetector
> > > group and the target attribute shows the user where to look.
> > > There could be multiple NXdetector groups, but the link
> > > identifies the right one. So the target attribute provides
> > > important functionality. In a data reduction script that wants to
> > > convert from time-of-flight to energy transfer, it is essential
> > > they know in which group the relevant distance fields are stored.
> > > That is only possible by making the object in the NXdetector
> > > group the parent and using the ’target’ attribute to point to it.
> > > 
> > > Ironically, I think this functional purpose is what led the
> > > Fairmat group to propose the ’target’ attribute, so the original
> > > reasoning was sound, if now forgotten.
> > >  
> > > > The NeXus Design page is somewhat coy about saying why a
> > > > "target" 
> > > > attribute is needed.  There's some vague mention of people
> > > > getting 
> > > > confused when using a particular tool, but nothing concrete. 
> > > > If people 
> > > > are confused, isn't this rather a problem with that tool or
> > > > with how 
> > > > NeXus is organising data?
> > > 
> > > 
> > > The importance of links was crystal-clear to the original
> > > developers of NeXus twenty years ago for the reasons I described
> > > above. I hadn’t realized that this aspect of the standard was no
> > > longer understood. I guess we did a bad job of documenting it at
> > > the time.
> > > 
> > > > The page also includes some rather confusing use of
> > > > terminology. The 
> > > > page seemingly confuses "links" (all objects are accessible
> > > > through at 
> > > > least one link, if not they are garbage collected) with "hard
> > > > linking" 
> > > > (a common term for creating a new reference to some existing
> > > > objects).
> > > 
> > > 
> > > If documentation of NeXus links is intermingled with discussions
> > > of garbage collection, then it should be changed. 
> > > >  
> > > > The NeXus Design page also talks about the "original dataset" .
> > > > This is 
> > > > arguable wrong.  There is no "original dataset" since all hard
> > > > links 
> > > > refer to the same, single dataset. One might talk about the
> > > > "original 
> > > > path".  However, given two paths, what is it that makes one
> > > > path "original"?
> > > 
> > > 
> > > This may be clumsy wording, but I think the meaning in the above
> > > example is that ‘/entry/instrument/detector/time_of_flight’ is
> > > the “original dataset.” It is reproduced in the NXdata group to
> > > make plotting more convenient.
> > > >  
> > > > As a counter example using the "Linking in a NeXus file"
> > > > diagram from 
> > > > the NeXus Design page, with HDF5 semantics I could create the
> > > > dataset in 
> > > > one group (that happens to be NXdata) and then create a link to
> > > > that 
> > > > dataset under a different group (which happens to be 
> > > > NXinstrument/NXdetector). In temporal order, the "original
> > > > dataset" (or 
> > > > original path, if you prefer) would be under the NXdata group,
> > > > which 
> > > > isn't what is shown on the NeXus Design page and (I suspect)
> > > > not what is 
> > > > intended.
> > > 
> > > 
> > > The temporal order when writing the file is irrelevant. 
> > > 
> > > All your complaints about the documentation seem justified, so we
> > > should probably revise it, but the value of using the target
> > > attribute is still, I believe, valid.
> > > 
> > > I hope this helps.
> > > 
> > > With best regards,
> > > Ray
> > >  -- 
> > > Ray Osborn, Senior Scientist
> > > Materials Science Division
> > > Argonne National Laboratory
> > > Lemont, IL 60439, USA
> > > Phone: +1 (630) 252-9011
> > > Email: ROsborn at anl.gov
> > > 
> > > _______________________________________________
> > > NeXus-committee mailing list
> > > NeXus-committee at nexusformat.org
> > > https://lists.nexusformat.org/mailman/listinfo/nexus-committee
> 
