[Nexus-developers] NeXus 2D Strings

Akeroyd, FA (Freddie) F.A.Akeroyd at rl.ac.uk
Tue Jan 16 16:02:21 GMT 2007


Peter,

According to the mxml web page it supports reading of UTF-8 and UTF-16
and writing of UTF-8 encoded XML files and strings. The release notes
for HDF5 1.8 say that they now support UTF-8 strings in datasets, the
names of links and the names of attributes. I'm not sure about the HDF4
situation.

There are two places where Unicode support could be added in NeXus: for
variable/group names and for the data itself. We could add wchar_t
versions of all our functions and then convert these arguments to UTF-8
before talking to the lower file writing layers etc. As for the character
data itself, supporting UTF-16 would mean changing a few parts where the
API interprets the character data; such places might be:
- stripping of whitespace
- NUL-terminating strings
As far as I understand it, the issues with Unicode are more presentation
than storage ... you should be able to store/read/write/compare UTF-8
strings without knowing they are UTF-8, you just will not be able to
display them correctly on the screen. Though NX_CHAR would be sufficient
to store UTF-8 we may want to define a way for an application to
indicate the fact that the bytes it has written should be interpreted as
UTF-8 rather than ASCII by a reading application ... say NX_UTF8.
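
A minimal sketch of the storage-versus-presentation point, using plain
Python bytes as a stand-in for NX_CHAR data (no NeXus call is involved,
and the sample string is made up). It shows that UTF-8 bytes can be
stored and compared blindly, and why UTF-16 would collide with
byte-wise NUL termination:

```python
# Sketch only: bytes objects stand in for NX_CHAR data; no NeXus API is used.
text = "Temp\u00e9rature 25 \u00b0C"   # made-up sample with non-ASCII characters

utf8 = text.encode("utf-8")            # what a wide-character wrapper would pass down
utf16 = text.encode("utf-16-le")

# Storage, round trip and comparison need only the raw bytes.
stored = bytes(utf8)
same = stored == text.encode("utf-8")

# UTF-8 never emits a zero byte for a non-NUL character,
# so NUL-terminated handling keeps working.
utf8_has_nul = b"\x00" in utf8

# UTF-16 emits a zero byte for every ASCII character, so byte-wise
# NUL termination and whitespace stripping would have to change.
utf16_has_nul = b"\x00" in utf16

# Only presentation needs the encoding: the reader has to decode as UTF-8.
shown = stored.decode("utf-8")
```

This is the reason a marker such as the suggested NX_UTF8 would only
affect how a reader interprets the bytes, not how they are written.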

Freddie

-----Original Message-----
From: Peterson, Peter F. [mailto:petersonpf at ornl.gov] 
Sent: 16 January 2007 13:51
To: Akeroyd, FA (Freddie); nexus-developers at nexusformat.org
Cc: tieman
Subject: RE: [Nexus-developers] NeXus 2D Strings

Freddie,

I don't have a strong opinion on much of this, but I do have a question
about Unicode support: What does hdf4/hdf5 do for Unicode support? I
know that xml is generically Unicode, but does mxml provide real support
for it? 

I know that more popular / widely known libraries (like libxml2) deal
with this by defining their own typedef for 16 bit characters and provide
recipes for converting it to traditional 8 bit characters. What is your
strategy for this?

Finally, a note on procedure: if our tech chair (Nick) agrees then this
should be added as a milestone on trac so people like Dr. Tieman can see
how close it is to being done without looking directly at the code.

P^2

-----Original Message-----
From: nexus-developers-bounces at nexusformat.org
[mailto:nexus-developers-bounces at nexusformat.org] On Behalf Of Akeroyd,
FA (Freddie)
Sent: Sunday, January 14, 2007 5:11 PM
To: tieman; nexus-developers at nexusformat.org
Subject: Re: [Nexus-developers] NeXus 2D Strings

Hi,

I think there is a general need that 2D character arrays should be
supported so I propose we should now:

* Not strip whitespace for character arrays with
number_of_dimensions >= 2; NXACC_NOSTRIP only applies to 1 dimensional
character arrays where whitespace stripping is the default (this needs
to be documented).
* Remove the warning messages from napi.c for HDF4 creating
multi-dimensional string arrays
* Add HDF5 support for these arrays (should be easy to do)
* Print an error message if you try to create them with the XML
interface for the moment.
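
The stripping rule in the first bullet could look roughly like this.
It is an illustration in Python, not the napi.c code: read_char_data is
a hypothetical helper, and stripping both ends of a 1D string is an
assumption about the current behaviour.

```python
# Hypothetical helper illustrating the proposed rule; not the napi.c code.
def read_char_data(raw, dims, nostrip=False):
    """Return character data as the API would hand it back under the proposal."""
    if len(dims) == 1 and not nostrip:
        # 1D character arrays: whitespace stripping stays the default
        # (stripping both ends is an assumption made for this sketch).
        return raw.strip()
    # Rank 2 and above, or NOSTRIP requested: return the bytes untouched,
    # so fixed-width rows keep their padding and dims[] stays accurate.
    return raw


one_d = read_char_data(b"  abc  ", [7])                    # stripped
one_d_raw = read_char_data(b"  abc  ", [7], nostrip=True)  # left alone
two_d = read_char_data(b"ab  cd  ", [2, 4])                # left alone
```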

As for 3D and higher dimensional character arrays, these should
generalise straight away from HDF and probably too from the XML once it
is implemented. 

With regard to XML, I think the embedded whitespace issue is OK i.e.
they can be preserved; however there are a few funny rules (see sections
2.10 and 2.11 of http://www.w3.org/TR/REC-xml/ and
http://www.oracle.com/technology/pub/articles/wang-whitespace.html)

As for the XML array data itself, we could just "write it raw" to the
file, though it may be best to split this into lines and escape embedded
newlines as this avoids \n being converted into \r\n etc. by some
application/copy and causing the character count to be wrong; it also
makes things more readable. Using a CDATA section would avoid us having
to escape any < or > characters etc. in the string data, but would break
if an XML file containing another CDATA was embedded in an NX_CHAR array
and we didn't catch the extra ]]> sequence ourselves.
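
A rough sketch of both ideas, assuming one text line per row with a
backslash escape for embedded line breaks, and the usual trick of
splitting a CDATA section wherever the data contains "]]>". The
function names and the <data> element are made up for the example and
are not part of the NeXus XML format:

```python
import xml.etree.ElementTree as ET


def escape_row(row):
    """Escape backslashes and line breaks so each row fits on one text line."""
    return row.replace("\\", "\\\\").replace("\n", "\\n").replace("\r", "\\r")


def unescape_row(line):
    """Invert escape_row."""
    out, i = [], 0
    while i < len(line):
        if line[i] == "\\" and i + 1 < len(line):
            nxt = line[i + 1]
            out.append({"n": "\n", "r": "\r", "\\": "\\"}.get(nxt, nxt))
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)


def to_cdata(text):
    """Wrap text in CDATA, splitting any embedded "]]>" across two sections."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


# Row 0 has an embedded newline; row 1 is an XML fragment with its own CDATA.
rows = ["line one\nstill row 0", "<inner><![CDATA[x]]></inner>"]
body = "\n".join(escape_row(r) for r in rows)
xml_text = "<data>" + to_cdata(body) + "</data>"

# A standard parser rejoins the split CDATA sections into one text node.
parsed = ET.fromstring(xml_text).text
restored = [unescape_row(line) for line in parsed.split("\n")]
```

With the line breaks escaped, a \n to \r\n conversion by some copy tool
changes only the row separators, which the parser normalises back, so
the character count of each row is unaffected.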

> Then there is the issue of ragged string arrays. Usually strings are 
> of different length in a string array. Currently this is solved by 
> padding arrays to the longest string in the set.
We can document the API as only supporting character arrays, which are
by nature rectangular, and provide a utility function to do this
"convert and pad" for any user that requires it. The user would be free
to choose any pad character they like as the API will not need to use
it.
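
Such a "convert and pad" utility might look like the sketch below.
pad_to_rectangle is a hypothetical name, the file names are invented,
and the result is just a flat buffer plus dims that the caller would
then write as a 2D character array:

```python
def pad_to_rectangle(strings, pad=b" "):
    """Pad byte strings to the longest one; return (flat buffer, dims)."""
    if len(pad) != 1:
        raise ValueError("pad must be a single byte")
    width = max((len(s) for s in strings), default=0)
    # Row-major layout: each string becomes one fixed-width row.
    flat = b"".join(s.ljust(width, pad) for s in strings)
    return flat, [len(strings), width]


# The caller picks the pad character; here NUL is used instead of a space.
flat, dims = pad_to_rectangle([b"run1.hdf", b"dark.hdf", b"w.hdf"], pad=b"\x00")
```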

>
> This gets even more complicated if we start to think about unicode.....
>
Were you thinking about lines of the same "length" being in fact
different lengths when put into UTF-8 encoding? Maybe we say NX_CHAR is
really NX_CHAR8 (i.e. 8 bit ASCII only) and create an NX_CHAR16 for
unicode purposes?
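
To illustrate the length question with made-up strings: three strings
of four characters each occupy different numbers of bytes in UTF-8 but
the same number in UTF-16 (as long as no character lies outside the
Basic Multilingual Plane, which is an assumption of this example):

```python
# Made-up examples; each string is four characters long.
names = ["cafe", "caf\u00e9", "\u6e2c\u5b9a\u8a66\u6599"]

char_counts = [len(n) for n in names]
utf8_bytes = [len(n.encode("utf-8")) for n in names]

# UTF-16 is fixed-width only for BMP characters; surrogate pairs would
# make these counts differ as well.
utf16_bytes = [len(n.encode("utf-16-le")) for n in names]
```

Here the UTF-8 rows would need padding to 12 bytes to stay rectangular,
while the 16 bit rows are already the same width.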

Regards,

Freddie

-----Original Message-----
From: nexus-developers-bounces at nexusformat.org
[mailto:nexus-developers-bounces at nexusformat.org] On Behalf Of tieman
Sent: 15 December 2006 15:49
To: Mark Koennecke
Cc: nexus-developers at nexusformat.org
Subject: Re: [Nexus-developers] NeXus 2D Strings

Mark,

Freddie suggested:

>> I think the problem is due to an error in the way the API tries to
>> strip whitespace on strings - try opening the file with the flags
>> NXACC_READ|NXACC_NOSTRIP

This does, indeed, work to read the HDF4 files.  I had to hack 
NXmakedata in napi.c to remove the check on multi-dimensional character 
arrays that was preventing the writes of 2D data in order to get writes 
to work as I'm used to, though.

For the most part, my 2D char arrays are in a sort of electronic log we 
generate for each sample.  The "experiment file" as we refer to it is a 
quasi complete log of all experimental parameters (beamline setting, 
detector setting, etc...) as well as a processing history of the data 
(acquired data file names, acquired white/dark file names, processing 
algorithms used, cluster machines used to process, etc...)  The 
experiment file contains all the data that would be redundant to put 
into each data file itself.

The only place I use 2D char arrays is for lists of file names which, in
my case, are a fixed size for a given list.  The file names are not
terminated nor are there embedded escape characters.  On read, I know
how long a file name is and how many there are simply by looking at
dims[].
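
A sketch of that read pattern, with a flat byte buffer standing in for
the data returned by the API and invented file names; split_rows is a
hypothetical helper, not a NAPI function:

```python
def split_rows(flat, dims):
    """Cut a flat row-major buffer into dims[0] rows of dims[1] bytes each."""
    count, width = dims
    if len(flat) != count * width:
        raise ValueError("buffer size does not match dims")
    return [flat[i * width:(i + 1) * width] for i in range(count)]


# Three fixed-width names, no terminators and no escape characters.
flat = b"scan_0001.hdfscan_0002.hdfscan_0003.hdf"
names = split_rows(flat, [3, 13])
```

This only works if the reader gets the bytes back untouched, which is
why whitespace stripping on the 2D array breaks it.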

I'd like to continue to be able to do this with HDF4 and HDF5 if
possible.

I don't care much about XML but I would almost argue to treat strings in
XML the same as HDF does--that is, a '\n' is a single character.  Sure,
looking at the XML in a text editor will look funny and one will need to
be careful about how those files are copied about, but I think XML will
handle it OK if you don't try and strip the unprintable characters.
And, as you mentioned, there is no need for supporting multi-dimensional
char arrays in the Nexus spec.  Some of us just like Napi as an API and
only loosely adhere to the Nexus spec though...

...my $0.02 worth...

Brian

Mark Koennecke wrote:
> Hi,
>
> 2D string arrays should work in HDF-4. We never supported them in
> HDF-5 because the NeXus standard nowhere requires 2D strings and we
> were lazy. It is possible to support string arrays in HDF-5. As
> Freddie rightly mentioned, there is a problem writing 2D string arrays
> in XML. The obvious solution is to make a new line for each row in the
> array. However, this falls over when newlines are in the data. This
> can be solved by escaping newlines in the data. But this causes
> trouble for those who solved the current NeXus 2D string problem by
> formatting their string arrays as a newline separated long string.
> This may be solved by escaping newlines only when the dimensionality
> is higher than 1.
>
> This raises the question of dimensionality: is 2D sufficient or do we 
> have to go for the most general case of up to 32 dimensional string 
> arrays?
>
> Then there is the issue of ragged string arrays. Usually strings are
> of different length in a string array. Currently this is solved by
> padding arrays to the longest string in the set.
>
> This gets even more complicated if we start to think about unicode.....
>
> Summing it up, before we can implement 2D string arrays we need to 
> find some consensus on:
> - Padding strings to match arrays
> - Formatting string arrays in XML
> - Decide if 2D is enough or if we wish to support the more general 
> case which is also more work.
>
> Finally, I wish to point out that storing the strings in an array of
> NX_UINT8 might be a feasible workaround. This is just ugly to look at
> when printed with a program which does not know about this.
>
>                     Best Regards,
>
>                                   Mark Koennecke
>

_______________________________________________
NeXus-developers mailing list
NeXus-developers at nexusformat.org
http://lists.nexusformat.org/mailman/listinfo/nexus-developers
