[Nexus] [netcdfgroup] [Hdf-forum] Detecting netCDF versus HDF5 -- PROPOSED SOLUTIONS --REQUEST FOR COMMENTS

dmh at ucar.edu dmh at ucar.edu
Thu Apr 21 23:02:20 BST 2016


If you have hdf5 files that should be readable, then I will undertake to
look at them and see what the problem is.
WRT old files: We could produce a utility that would redef the file
and insert the _NCProperties attribute. This would allow someone to
wholesale mark old files.
=Dennis Heimbigner
   Unidata
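
For readers who want to consume the attribute once it exists, here is a minimal sketch of parsing the _NCProperties form quoted below ("key=value" pairs separated by "|"). It is an illustration only, not the netCDF library's code: the function name parse_ncproperties and the value "version=1" are assumptions, and the library versions are simply the examples mentioned in this thread. Reading the attribute out of the file (via the HDF5 or netCDF API) is not shown.

```python
def parse_ncproperties(value):
    """Split a '|'-separated list of key=value pairs into a dict.

    Hypothetical helper: the field layout follows the form quoted in
    this thread, which may still change.
    """
    props = {}
    for field in value.split("|"):
        if not field:
            continue  # tolerate a trailing or doubled separator
        key, sep, val = field.partition("=")
        if not sep:
            raise ValueError("malformed field: %r" % field)
        props[key.strip()] = val.strip()
    return props


# Assumed example value; "version=1" is a placeholder, not a confirmed number.
example = "version=1|netcdflibversion=4.4.1-rc1|hdflibversion=1.8.18"
props = parse_ncproperties(example)
print(props["netcdflibversion"])
```

Keeping the result as a plain dict means any extra fields added to the format later would pass through without a code change.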

On 4/21/2016 2:17 PM, Pedro Vicente wrote:
> Dennis
>
>> I am in the process of adding a global attribute in the root group
>> that captures both the netcdf library version and the hdf5 library 
>> version
>> whenever a netcdf file is created. The current  form is
>> _NCProperties="version=...|netcdflibversion=...|hdflibversion=..."
>
>
> ok, good to know, thank you
>
>
>> 1. I am open to suggestions about changing the format or adding 
>> info to it.
>
>
> I personally don't care, anything that uniquely identifies a netCDF 
> file (HDF5 based) as such will work
>
>
>> 2. Of course this attribute will not exist in files written using 
>> older
>> versions of the netcdf library, but at least the process will have 
>> begun.
>
> yes
>
>
>> 3. This technically does not address the original issue because there 
>> exist
>>      hdf5 files  not written by netcdf that are still compatible with 
>> and can be
>>      read by netcdf. Not sure this case is important or not.
>
> there will always be HDF5 files  not written by netcdf that netCDF 
> will read as we are now.
>
> this is not really the issue, but you just made a further issue :-)
>
> the issue is that I would like an application that reads a netCDF 
> (HDF5 based) file to decide to use the netCDF or HDF5 API.
> your attribute writing will do, for future files.
> for older netCDF files there may be a way to detect the current 
> attributes and data structures to see if we can make it "identify itself"
> as netCDF. A bit of debugging will confirm that, since Dimension 
> Scales are used, that would be an (imperfect maybe) way to do it
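
The decision described above can be sketched as a small function over the names of the attributes found on the root group. This is a rough illustration under stated assumptions, not an agreed convention: choose_api is a hypothetical name, obtaining the attribute names from the file is not shown, and the only markers used are the two named in this thread (_NCProperties for new files, and _nc3_strict, which is written only in classic-model mode). Older files with neither marker fall through to "hdf5", which is exactly the imperfect case under discussion.

```python
def choose_api(root_attr_names):
    """Pick an API name from the attribute names on the root group.

    Hypothetical helper. Returns "netcdf" when a netCDF marker attribute
    is present, otherwise "hdf5".
    """
    names = set(root_attr_names)
    if "_NCProperties" in names:
        # Proposed provenance attribute: present only in files written
        # by a netCDF library that implements it.
        return "netcdf"
    if "_nc3_strict" in names:
        # Written for classic-model files only, so its absence proves nothing.
        return "netcdf"
    return "hdf5"


print(choose_api(["_NCProperties", "title"]))
print(choose_api(["title"]))
```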
>
> regarding the "further issue " above
>
> you could go one step further and for any HDF5 files  not written by 
> netcdf , you could make netCDF reject the file reading,
> because it's not "netCDF compliant".
> Since having netCDF read pure HDF5 files is not a problem (at least 
> for me), I don't know if you would want to do this, just an idea.
> In my mind taking complexity and ambiguities out of problems is always a 
> good thing
>
>
> ah, I forgot one thing, related to this
>
>
> In the past I have found several pure HDF5 files that netCDF failed in 
> reading.
> Since netCDF is HDF5 binary compatible, one would expect that all HDF5 
> files will be read by netCDF.
> Except if you specifically wrote something in the code that makes it 
> fail if some condition is not met.
> This was a while ago, I'll try to find those cases and I'll send a bug 
> report to the bug report email
>
> ----------------------
> Pedro Vicente
> pedro.vicente at space-research.org
> https://twitter.com/_pedro__vicente
> http://www.space-research.org/
>
> ----- Original Message ----- From: <dmh at ucar.edu>
> To: "Pedro Vicente" <pedro.vicente at space-research.org>; "HDF Users 
> Discussion List" <hdf-forum at lists.hdfgroup.org>; 
> <cf-metadata at cgd.ucar.edu>; "Discussion forum for the NeXus data 
> format" <nexus at nexusformat.org>; <netcdfgroup at unidata.ucar.edu>
> Cc: "John Shalf" <jshalf at lbl.gov>; <Richard.E.Ullman at nasa.gov>; 
> "Marinelli, Daniel J. (GSFC-5810)" <daniel.j.marinelli at nasa.gov>; 
> "Miller, Mark C." <miller86 at llnl.gov>
> Sent: Thursday, April 21, 2016 2:30 PM
> Subject: Re: [netcdfgroup] [Hdf-forum] Detecting netCDF versus HDF5 
> --  PROPOSED SOLUTIONS --REQUEST FOR COMMENTS
>
>
>> I am in the process of adding a global attribute in the root group
>> that captures both the netcdf library version and the hdf5 library 
>> version
>> whenever a netcdf file is created. The current  form is
>> _NCProperties="version=...|netcdflibversion=...|hdflibversion=..."
>> Where version is the version of the _NCProperties attribute and the 
>> others
>> are e.g. 1.8.18 or 4.4.1-rc1.
>> Issues:
>> 1. I am open to suggestions about changing the format or adding info 
>> to it.
>> 2. Of course this attribute will not exist in files written using 
>> older versions
>>     of the netcdf library, but at least the process will have begun.
>> 3. This technically does not address the original issue because there 
>> exist
>>      hdf5 files  not written by netcdf that are still compatible with 
>> and can be
>>      read by netcdf. Not sure this case is important or not.
>> =Dennis Heimbigner
>>    Unidata
>>
>>
>> On 4/21/2016 9:33 AM, Pedro Vicente wrote:
>>> DETECTING HDF5 VERSUS NETCDF GENERATED FILES
>>> REQUEST FOR COMMENTS
>>> AUTHOR: Pedro Vicente
>>>
>>> AUDIENCE:
>>> 1) HDF, netcdf developers,
>>> Ed Hartnett
>>> Kent Yang
>>> 2) HDF, netcdf users, that replied to this thread
>>> Miller, Mark C.
>>> John Shalf
>>> 3 ) netcdf tools developers
>>> Mary Haley  , NCL
>>> 4) HDF, netcdf managers and sponsors
>>> David Pearah  , CEO HDF Group
>>> Ward Fisher, UCAR
>>> Marinelli, Daniel J. , Richard Ullman, Christopher Lynnes, NASA
>>> 5)
>>> [CF-metadata] list
>>> After this thread started 2 months ago, there was an announcement on 
>>> the [CF-metadata] mail list
>>> about
>>> "a meeting to discuss current and future netCDF-CF efforts and 
>>> directions.
>>> The meeting will be held on 24-26 May 2016 in Boulder, CO, USA at 
>>> the UCAR Center Green facility."
>>> This would be a good topic to put on the agenda, maybe?
>>> THE PROBLEM:
>>> Currently it is impossible to detect if an HDF5 file was generated 
>>> by the HDF5 API or by the netCDF API.
>>> See previous email about the reasons why.
>>> WHY THIS MATTERS:
>>> Software applications that need to handle both netCDF and HDF5 files 
>>> cannot decide which API to use.
>>> This includes popular visualization tools like IDL, Matlab, NCL, HDF 
>>> Explorer.
>>> SOLUTIONS PROPOSED: 2
>>> SOLUTION 1: Add a flag to HDF5 source
>>> The hdf5 format specification, listed here
>>> https://www.hdfgroup.org/HDF5/doc/H5.format.html
>>> describes a sequence of bytes in the file layout that have special 
>>> meaning for the HDF5 API. It is common practice, when designing a 
>>> data format,
>>> to leave some fields "reserved for future use".
>>> This solution makes use of one of these empty  "reserved for future 
>>> use" spaces to save a byte (for example) that describes an enumerator
>>> of "HDF5 compatible formats".
>>> An "HDF5 compatible format" is a data format that uses the HDF5 API 
>>> at a lower level (usually hidden from the user of the upper API),
>>> and providing its own API.
>>> This category can still be divided into 2 formats:
>>> 1) A "pure HDF5 compatible format". Example, NeXus
>>> http://www.nexusformat.org/
>>> NeXus just writes some metadata (attributes) on top of the HDF5 API, 
>>> that has some special meaning for the NeXus community
>>> 2) A "non pure HDF5 compatible format". Example, netCDF
>>> Here, the format adds some extra feature besides HDF5. In the case 
>>> of netCDF, these are shared dimensions between variables.
>>> This sub-division between 1) and 2) is irrelevant for the problem 
>>> and solution in question
>>> The solution consists of writing a different enumerator value on the 
>>> "reserved for future use" space. For example
>>> Value decimal 0 (current value): This file was generated by the HDF5 
>>> API (meaning the HDF5 only API)
>>> Value decimal 1: This file was generated by the netCDF API (using HDF5)
>>> Value decimal 2: This file was generated by <put here another HDF5 
>>> based format>
>>> and so on
>>> The advantage of this solution is that this process involves 2 
>>> parties: the HDF Group and the other format's organization.
>>> This allows the HDF Group to "keep track" of new HDF5 based formats. 
>>> It also allows the other format to be made "HDF5 certified".
>>> SOLUTION 2: Add some metadata to the other API on top of HDF5
>>> This is what Nexus uses.
>>> A Nexus file on creation writes several attributes on the root 
>>> group, like "NeXus_version" and other numeric data.
>>> This is done using the public HDF5 API calls.
>>> The solution for netCDF consists of the same approach, just write 
>>> some specific attributes, and a special netCDF API to write/read them.
>>> This solution just requires the work of one party (the netCDF group)
>>> END OF RFC
>>> In reply to people that commented in the thread
>>> @John Shalf
>>> >>Perhaps NetCDF (and other higher-level APIs that are built on top of
>>> HDF5) should include an attribute attached
>>> >>to the root group that identifies the name and version of the API
>>> that created the file?  (adopt this as a convention)
>>> yes, that's one way to do it, Solution 2 above
>>> @Mark Miller
>>> >>>Hmmm. Is there any big reason NOT to try to read a netCDF produced
>>> HDF5 file with the native HDF5 library if someone so chooses?
>>> It's possible to read a netCDF file using HDF5, yes.
>>> There are 2 things that you will miss doing this:
>>> 1) the ability to inquire about shared netCDF dimensions.
>>> 2) the ability to read remotely with openDAP.
>>> Reading with HDF5 also exposes metadata that is supposed to be 
>>> private to netCDF. See below
>>> >>>> And, attempting  to read an HDF5 file produced by Silo using just
>>> the HDF5 library (e.g. w/o Silo) is a major pain.
>>> This I don't understand. Why not read the Silo file with the Silo API?
>>> That's the whole purpose of this issue, each higher level API on top 
>>> of HDF5 should be able to detect "itself".
>>> I am not familiar with Silo, but if Silo cannot do this, then you 
>>> have the same design flaw that netCDF has.
>>>
>>> >>> In a cursory look over the libsrc4 sources in netCDF distro, I see
>>> a few things that might give a hint a file was created with netCDF. . .
>>> >>>> First, in NC_CLASSIC_MODEL, an attribute gets attached to the
>>> root group named "_nc3_strict". So, the existence of an attribute on 
>>> the root group by that name would suggest the HDF5 file was 
>>> generated by netCDF.
>>> I think this is done only by the "old" netCDF3 format.
>>> >>>>> Also, I tested a simple case of nc_open, nc_def_dim, etc.
>>> nc_close to see what it produced.
>>> >>>> It appears to produce datasets for each 'dimension' defined with
>>> two attributes named "CLASS" and "NAME".
>>> This is because netCDF uses the HDF5 Dimension Scales API internally 
>>> to keep track of shared dimensions. These are internal attributes
>>> of Dimension Scales. This approach would not work because an HDF5 
>>> only file with Dimension Scales would have the same attributes.
>>>
>>> >>>> I like John's suggestion here.
>>> >>>>>But, any code you add to any applications now will work *only*
>>> for files that were produced post-adoption of this convention.
>>> yes. there are 2 actions to take here.
>>> 1) fix the issue for the future
>>> 2) try to retroactively have some workaround that makes it possible now 
>>> to differentiate HDF5/netCDF files made before the adopted convention
>>> see below
>>>
>>> >>>> In VisIt, we support >140 format readers. Over 20 of those are
>>> different variants of HDF5 files (H5part, Xdmf, Pixie, Silo, Samrai, 
>>> netCDF, Flash, Enzo, Chombo, etc., etc.)
>>> >>>>When opening a file, how does VisIt figure out which plugin to
>>> use? In particular, how do we avoid one poorly written reader plugin 
>>> (which may be the wrong one for a given file) from preventing the 
>>> correct one from being found. Its kinda a hard problem.
>>>
>>> Yes, that's the problem we are trying to solve. I have to say, that 
>>> is quite a list of HDF5 based formats there.
>>> >>>> Some of our discussion is captured here. . .
>>> http://www.visitusers.org/index.php?title=Database_Format_Detection
>>> I'll check it out, thank you for the suggestions
>>> @Ed Hartnett
>>> >>>I must admit that when putting netCDF-4 together I never considered
>>> that someone might want to tell the difference between a "native" 
>>> HDF5 file and a netCDF-4/HDF5 file.
>>> >>>>>Well, you can't think of everything.
>>> This is a major design flaw.
>>> If you are in the business of designing data file formats, one of 
>>> the things you have to do is how to make it possible to identify it 
>>> from the other formats.
>>>
>>> >>> I agree that it is not possible to canonically tell the
>>> difference. The netCDF-4 API does use some special attributes to 
>>> track named dimensions,
>>> >>>>and to tell whether classic mode should be enforced. But it can
>>> easily produce files without any named dimensions, etc.
>>> >>>So I don't think there is any easy way to tell.
>>> I remember you wrote that code together with Kent Yang from the HDF 
>>> Group.
>>> At the time I was with the HDF Group but unfortunately I did not follow 
>>> closely what you were doing.
>>> I don't remember any design document being circulated that explains 
>>> the internals of the "how to" make the netCDF (classic) model of 
>>> shared dimensions
>>> use the hierarchical group model of HDF5.
>>> I know this was done using the HDF5 Dimension Scales (that I wrote), 
>>> but is there any design document that explains it?
>>> Maybe just some internal email exchange between you and Kent Yang?
>>> Kent, how are you?
>>> Do you remember having any design document that explains this?
>>> Maybe something like a unique private attribute that is written 
>>> somewhere in the netCDF file?
>>>
>>> @Mary Haley, NCL
>>> NCL is a widely used tool that handles both netCDF and HDF5
>>> Mary, how are you?
>>> How does NCL deal with the case of reading both pure HDF5 files and 
>>> netCDF files that use HDF5?
>>> Would you be interested in joining a community based effort to deal 
>>> with this, in case this is an issue for you?
>>>
>>> @David Pearah  , CEO HDF Group
>>> I volunteer to participate in the effort of this RFC together with 
>>> the HDF Group (and netCDF Group).
>>> Maybe we could make a "task force" between HDF Group, netCDF Group 
>>> and any volunteer (such as tools developers that happen to be in 
>>> these mail lists)?
>>> The "task force" would have 2 tasks:
>>> 1) make a HDF5 based convention for the future and
>>> 2) try to retroactively salvage the current design issue of netCDF
>>> My phone is 217-898-9356, you are welcome to call in anytime.
>>> ----------------------
>>> Pedro Vicente
>>> pedro.vicente at space-research.org 
>>> <mailto:pedro.vicente at space-research.org>
>>> https://twitter.com/_pedro__vicente
>>> http://www.space-research.org/
>>>
>>>     ----- Original Message -----
>>>     *From:* Miller, Mark C. <mailto:miller86 at llnl.gov>
>>>     *To:* HDF Users Discussion List 
>>> <mailto:hdf-forum at lists.hdfgroup.org>
>>>     *Cc:* netcdfgroup at unidata.ucar.edu
>>>     <mailto:netcdfgroup at unidata.ucar.edu> ; Ward Fisher
>>>     <mailto:wfisher at ucar.edu>
>>>     *Sent:* Wednesday, March 02, 2016 7:07 PM
>>>     *Subject:* Re: [Hdf-forum] Detecting netCDF versus HDF5
>>>
>>>     I like John's suggestion here.
>>>
>>>     But, any code you add to any applications now will work *only* for
>>>     files that were produced post-adoption of this convention.
>>>
>>>     There are probably a bazillion files out there at this point that
>>>     don't follow that convention and you probably still want your
>>>     applications to be able to read them.
>>>
>>>     In VisIt, we support >140 format readers. Over 20 of those are
>>>     different variants of HDF5 files (H5part, Xdmf, Pixie, Silo,
>>>     Samrai, netCDF, Flash, Enzo, Chombo, etc., etc.) When opening a
>>>     file, how does VisIt figure out which plugin to use? In
>>>     particular, how do we avoid one poorly written reader plugin
>>>     (which may be the wrong one for a given file) from preventing the
>>>     correct one from being found. Its kinda a hard problem.
>>>
>>>     Some of our discussion is captured here. . .
>>>
>>> http://www.visitusers.org/index.php?title=Database_Format_Detection
>>>
>>>     Mark
>>>
>>>
>>>     From: Hdf-forum <hdf-forum-bounces at lists.hdfgroup.org
>>>     <mailto:hdf-forum-bounces at lists.hdfgroup.org>> on behalf of John
>>>     Shalf <jshalf at lbl.gov <mailto:jshalf at lbl.gov>>
>>>     Reply-To: HDF Users Discussion List <hdf-forum at lists.hdfgroup.org
>>>     <mailto:hdf-forum at lists.hdfgroup.org>>
>>>     Date: Wednesday, March 2, 2016 1:02 PM
>>>     To: HDF Users Discussion List <hdf-forum at lists.hdfgroup.org
>>>     <mailto:hdf-forum at lists.hdfgroup.org>>
>>>     Cc: "netcdfgroup at unidata.ucar.edu
>>>     <mailto:netcdfgroup at unidata.ucar.edu>"
>>>     <netcdfgroup at unidata.ucar.edu
>>>     <mailto:netcdfgroup at unidata.ucar.edu>>, Ward Fisher
>>>     <wfisher at ucar.edu <mailto:wfisher at ucar.edu>>
>>>     Subject: Re: [Hdf-forum] Detecting netCDF versus HDF5
>>>
>>>         Perhaps NetCDF (and other higher-level APIs that are built on
>>>         top of HDF5) should include an attribute attached to the root
>>>         group that identifies the name and version of the API that
>>>         created the file?  (adopt this as a convention)
>>>
>>>         -john
>>>
>>>             On Mar 2, 2016, at 12:55 PM, Pedro Vicente
>>>             <pedro.vicente at space-research.org
>>> <mailto:pedro.vicente at space-research.org>> wrote:
>>>             Hi Ward
>>>             As you know, Data Explorer is going to be a general
>>>             purpose data reader for many formats, including HDF5 and
>>>             netCDF.
>>>             Here
>>>             http://www.space-research.org/
>>>             Regarding the handling of both HDF5 and netCDF, it seems
>>>             there is a potential issue, which is, how to tell if any
>>>             HDF5 file was saved by the HDF5 API or by the netCDF API?
>>>             It seems to me that this is not possible. Is this correct?
>>>             netCDF uses an internal function NC_check_file_type to
>>>             examine the first few bytes of a file, and for example for
>>>             any HDF5 file the test is
>>>             /* Look at the magic number */
>>>                /* Ignore the first byte for HDF */
>>>                if(magic[1] == 'H' && magic[2] == 'D' && magic[3] == 'F') {
>>>                  *filetype = FT_HDF;
>>>                  *version = 5;
>>>                }
>>>             The problem is that this test works for any HDF5 file and
>>>             for any netCDF file, which makes it impossible to tell
>>>             which is which.
>>>             Which makes it impossible for any general purpose data
>>>             reader to decide to use the netCDF API or the HDF5 API.
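
To make the limitation concrete, here is a small stand-alone sketch of the same kind of signature test, written against the published HDF5 format rather than copied from netCDF. Assumptions: the function name find_hdf5_signature and the max_offset cap are illustrative; the 8-byte signature and the allowed superblock offsets (0, 512, 1024, 2048, ...) are as I understand the HDF5 file format specification. A netCDF-4 file and a plain HDF5 file both pass this test, which is the point being made above.

```python
import io

# 8-byte HDF5 format signature: \211 H D F \r \n \032 \n
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def find_hdf5_signature(stream, max_offset=1 << 20):
    """Return the offset of the HDF5 signature in a binary stream, or None.

    Checks offset 0, then 512, 1024, 2048, ... up to max_offset
    (an arbitrary cap chosen for this sketch).
    """
    offset = 0
    while offset <= max_offset:
        stream.seek(offset)
        if stream.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
            return offset
        offset = 512 if offset == 0 else offset * 2
    return None


# In-memory stand-ins for files; real use would pass open(path, "rb").
plain = io.BytesIO(HDF5_SIGNATURE + b"\x00" * 8)
with_user_block = io.BytesIO(b"\x00" * 512 + HDF5_SIGNATURE)
print(find_hdf5_signature(plain), find_hdf5_signature(with_user_block))
```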
>>>             I have a possible solution for this , but before going any
>>>             further, I would just like to confirm that
>>>             1)      Is indeed not possible
>>>             2)      See if you have a solid workaround for this,
>>>             excluding the dumb ones, for example deciding on a
>>>             extension .nc or .h5, or traversing the HDF5 file to see
>>>             if it's non netCDF conforming one. Yes, to further
>>>             complicate things, it is possible that the above test says
>>>             OK for a HDF5 file, but then the read by the netCDF API
>>>             fails because the file is a HDF5 non netCDF conformant
>>>             Thanks
>>>             ----------------------
>>>             Pedro Vicente
>>>             pedro.vicente at space-research.org
>>>             <mailto:pedro.vicente at space-research.org>
>>>             http://www.space-research.org/
>>>             _______________________________________________
>>>             Hdf-forum is for HDF software users discussion.
>>>             Hdf-forum at lists.hdfgroup.org
>>>             <mailto:Hdf-forum at lists.hdfgroup.org>
>>>
>>> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
>>>
>>>             Twitter: https://twitter.com/hdf5
>>>
>>>
>>>
>>>
>>>
>>> ------------------------------------------------------------------------ 
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> netcdfgroup mailing list
>>> netcdfgroup at unidata.ucar.edu
>>> For list information or to unsubscribe,  visit: 
>>> http://www.unidata.ucar.edu/mailing_lists/
>>
>


