[NeXus-committee] Unicode support

Osborn, Raymond rosborn at anl.gov
Wed May 27 16:51:00 BST 2015


Paul Kienzle’s email to the NeXus Mailing List made me go and check some of the files in the ExampleData repository. I think at some point we have to go through them to make sure they conform to the latest version of the standard. Since the purpose of this directory is to provide people with working examples, we need to make sure that they don’t lead people astray. Perhaps legacy files that do not conform to the standard can be put in a subdirectory.

One issue I came across is that the Soleil examples use non-ASCII characters for their units (e.g., Å for Angstrom) stored with ISO-8859-1 encoding ('\xc5'), rather than the UTF-8 encoding that HDF5 uses for variable-length strings. I had a quick Google of the nexusformat.org site, and couldn’t find a definitive answer concerning how we treat Unicode characters. I don’t know if it’s embedded in a PDF anywhere. If we have not defined an encoding, then I think we should define UTF-8 as the officially recommended encoding, and enforce it in the API.
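
To make the mismatch concrete, here is a minimal Python sketch (not taken from the Soleil files or from NeXpy; the variable names are illustrative). It shows that the single byte '\xc5' is the ISO-8859-1 form of Å, that the UTF-8 form is two bytes, and that a strict UTF-8 decoder rejects the ISO-8859-1 byte.

```python
# Å is U+00C5 (LATIN CAPITAL LETTER A WITH RING ABOVE).
angstrom = "\u00c5"

# The same character in the two encodings discussed above.
latin1_bytes = angstrom.encode("iso-8859-1")  # single byte 0xC5
utf8_bytes = angstrom.encode("utf-8")         # two bytes 0xC3 0x85

# A lone 0xC5 byte is a UTF-8 lead byte with no continuation byte,
# so strict UTF-8 decoding raises UnicodeDecodeError.
try:
    latin1_bytes.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False
```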

I presume the Soleil files were created before we deprecated HDF4, so the choice of encoding was arbitrary. I don’t think there is a field in their file to state what the encoding is, so I’m not sure how to handle this in NeXpy. I guess I could assume ISO-8859-1 if UTF-8 decoding triggers an exception.
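
The fallback described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `decode_text` is a hypothetical helper name, not an existing NeXpy or NeXus API function, and it assumes string values arrive from the file as raw `bytes`.

```python
def decode_text(raw, fallback="iso-8859-1"):
    """Decode a string value read from a file (hypothetical helper).

    Tries UTF-8 first and falls back to a legacy encoding only if
    strict UTF-8 decoding raises an exception.
    """
    if isinstance(raw, str):
        # Already decoded by the reading library; nothing to do.
        return raw
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this
        # branch cannot itself raise a decoding error.
        return raw.decode(fallback)
```

For example, `decode_text(b"\xc5")` and `decode_text(b"\xc3\x85")` both yield "Å". The approach is a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8 and would be decoded as UTF-8 without any exception, so an explicit encoding declaration in the file would still be more reliable.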

Ray
--
Ray Osborn, Senior Scientist
Materials Science Division
Argonne National Laboratory
Argonne, IL 60439, USA
Phone: +1 (630) 252-9011
Email: ROsborn at anl.gov



