[Nexus-developers] Handling of C-string null characters

Wed Jul 6 14:28:39 BST 2005

I do not have enough experience (and just go with trial/error) to
suggest anything on the length of strings, but I do have an opinion on
spaces and strings.

There should be four options (by whitespace I mean \n, \t, ^M, and
" "):
  NXACC_NOSTRIP - leave all whitespace as discovered in the file
  NXACC_STRIP_LEAD - remove leading and trailing whitespace, leave
    interior whitespace unchanged
  NXACC_STRIP_INTERIOR - collapse each run of interior whitespace into
    a single " "
  NXACC_STRIP_ALL - the opposite of NXACC_NOSTRIP

These should be present regardless of what the default behaviour is, so
the user can declare what they want if they care (much like which file
base is the default for NX_CREATE). Also, this should be settable for
HDF4 and HDF5 files as well, for consistency of interface. As for the
default, I prefer NXACC_STRIP_ALL.

P^2

-----Original Message-----
From: nexus-developers-bounces at anl.gov
[mailto:nexus-developers-bounces at anl.gov] On Behalf Of Akeroyd, FA
(Freddie)
Sent: Wednesday, July 06, 2005 9:04 AM
To: nexus-developers at anl.gov
Subject: RE: [Nexus-developers] Handling of C-string null characters

NXmalloc() should allocate length+1 bytes (where length is what
NXgetinfo() returns) and then set the last of them (the "length+1"th
element, i.e. index "length" in C) to NULL ('\0'). When NXgetdata() is
called, even though it doesn't add a NULL byte itself, there will then
be a NULL present at the end of the string, as NXgetdata() will only
write "length" characters. If the user instead uses malloc(), they need
to remember to allocate "length+1" bytes and then add the NULL
themselves.
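
As a minimal sketch of the malloc() case, the following allocates
length+1 bytes and terminates the buffer before the data is copied in.
fake_getdata() and read_cstring() are hypothetical names; fake_getdata()
only stands in for NXgetdata() under the assumption that it writes
exactly "length" characters and no terminator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for NXgetdata(): copies exactly `length`
 * characters and does NOT append a terminating '\0'. */
void fake_getdata(char *dest, const char *stored, size_t length)
{
    memcpy(dest, stored, length);
}

/* Allocate length+1 bytes, terminate the buffer, then fill it.
 * Returns NULL if the allocation fails; the caller must free(). */
char *read_cstring(const char *stored, size_t length)
{
    char *buf;

    assert(stored != NULL);
    buf = malloc(length + 1);
    if (buf == NULL)
        return NULL;

    buf[length] = '\0';      /* last byte of the allocation */
    fake_getdata(buf, stored, length);
    return buf;
}
```

Calling read_cstring("neutron", 7) yields a buffer that strlen() reports
as 7 characters long, i.e. the same length the Fortran API would see.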

The other question is the stripping of spaces in strings. Currently the
API, unless you open the file with the new NXACC_NOSTRIP option, will
strip both leading and trailing spaces and also collapse/merge multiple
spaces between words to a single space e.g.

"  nexus       data    "    ->   "nexus data"

I think stripping leading + trailing spaces is probably reasonable, but
what about embedded spaces - is it reasonable to always reduce them to a
single space? Note that "space" here means anything recognised as a
space by the isspace() C function i.e. tabs and newline characters will
also get removed/turned into a single space. I think we need another
option to control the merging of spaces between words in addition to
stripping leading and trailing spaces - embedded spaces/tabs/newlines
may be important for formatting purposes if a text data/log file has
been included in a NeXus file. I would propose that the default be to
strip leading/trailing "spaces" but to preserve embedded "spaces".
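
A small sketch of that proposed default, assuming an in-place helper
(trim_ends() is a hypothetical name, not an existing NeXus call): it
removes leading and trailing characters that isspace() recognises and
leaves everything in between untouched.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: trim leading/trailing isspace() characters in
 * place, preserving embedded spaces, tabs and newlines. */
char *trim_ends(char *s)
{
    size_t len, start = 0;

    assert(s != NULL);
    len = strlen(s);

    /* isspace() needs a value representable as unsigned char. */
    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;
    while (start < len && isspace((unsigned char)s[start]))
        start++;

    memmove(s, s + start, len - start);   /* regions may overlap */
    s[len - start] = '\0';
    return s;
}
```

Applied to a buffer holding "  run 1\n\tok  \n", this leaves
"run 1\n\tok", so the line break and tab inside an included log file
survive.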

Freddie

> -----Original Message-----
> From: nexus-developers-bounces at anl.gov
> [mailto:nexus-developers-bounces at anl.gov] On Behalf Of Ray Osborn
> Sent: 05 July 2005 18:20
> To: Nexus-Developers at anl.gov
> Subject: [Nexus-developers] Handling of C-string null characters
> 
> There is one urgent thing that we need to clear up before we release NAPI
> v3.0, and that concerns how we handle string lengths.  Following problems
> with the XML API, Mark has now changed NXgetinfo so that it returns the
> length of the string in the Fortran API but adds one to the length in the
> C API to accommodate the NULL character.  I think this is the wrong way to
> approach this problem, and I think Freddie agreed with me when he wrote to
> confirm what the API now does.  We need to resolve this quickly so other
> opinions are welcomed.
> 
> So I'm raising the old question - how long is a string?
> 
> Current Behaviour:
> 
> NXgetinfo and NXmalloc add the extra byte to the length of character
> strings when called in C, but it is removed in the Fortran API.  The
> length of "neutron" is 8 in C but 7 in Fortran (and presumably other
> APIs such as Python).  NXgetdata will return "neutron\0" in C, but
> "neutron" in Fortran.
> 
> Proposal (my view, and I believe Freddie's):
> 
> The length of a character string returned by NXgetinfo should be the
> number of characters excluding the NULL character, and NXgetdata should
> return exactly those characters.  The documentation should warn the
> C-programmer to add one byte to the allocation, if they use malloc
> directly, and to add the NULL character to the string returned by
> NXgetdata to make a C-string.  NXmalloc will automatically add the extra
> byte when allocating memory.
> 
> This ensures that the length does not depend on the language used to read
> the NeXus file.   C-programmers are used to dealing with this issue and
> don't need to be spoon-fed.  The average non-programming user will,
> however, be confused why "neutron" is 8 characters long according to
> NXbrowse and most other generic file readers, but only seven according to
> the Fortran API.  This will prevent such confusion in a well-documented
> way.
> 
> We may need to put this to a vote, but we should settle it before Friday
> if Nick's timetable is to be kept.
> 
> Regards,
> Ray
> --
> Dr Ray Osborn                Tel: +1 (630) 252-9011
> Materials Science Division   Fax: +1 (630) 252-7777
> Argonne National Laboratory  E-mail: ROsborn at anl.gov
> Argonne, IL 60439-4845
> 
> 
> 
> _______________________________________________
> NeXus-developers mailing list
> NeXus-developers at anl.gov
> http://www.neutron.anl.gov/mailman/listinfo/nexus-developers