NeXus error bars: Suggestions for "treated" data
Tanya Maria Riseman
tmr at thsun7.ph.bham.ac.uk
Mon Mar 26 19:58:29 BST 2001
Suggestions for extending the NeXus data format for "treated" data.
from Tanya Riseman tmr at th.ph.bham.ac.uk.
(... DKA300:[TANYA.TEX.DATAFORMATS]nexus2001_treated_suggest.txt ...)
I apologize for not spell checking; I lost my spell checker in a
disk crash.
The current raw data format does not contain any information about
systematic and statistical errors assoicated with the experimental
data. Designing how to handle treated data with errors will be a
fair amount of work. I expect a lot of discussion.
This rather long email contains my suggestions for how it might be done,
and what different styles of storage might useful.
There is a summary at the end.
********************************************************
***** Strong Opinion ****** ***** Strong Opinion ******
********************************************************
Raw data should be write-protected both as a matter of safety
and as evidence (not proof) that you have not fraudulently modified it.
(It will also simply archiving.)
Added experimental conditions, "treated" data and results should NOT
be put in the same file. With HDF 5 "mounted" file links will allow
the raw data to appear as if it is in the NeXus file containing
treated data, which should reduce the desire/need to modify the raw data.
********************************************************
***** Strong Opinion ****** ***** Strong Opinion ******
********************************************************
"Treated" data (or "manipulated" data, even parameters from fits to
treated data) have errors associated with it, which ideally should be tightly
bound to the "treated" data itself so that
1. The errors appear automatically when you plot and
including error bars along y, x, and z axis, etc. simultaneously.
2. They are immediately avaible for error propagation.
3. They are automatically available for fits.
4. You should have to go out of your way to strip the
errors from the treated data.
The drawback of having a separate array called "errors" associated with
the data is that it is easily forgotten and usually left out as an extra
step when plotting initially (when exported to spreadsheet programs).
I am not sure if having an item called "error" next to the data
within NXdata is sufficently tighly bound. Alternatively,
we could interweave the data and the errors.
Perhaps both approaches are useful to have! Certainly when
exporting to spreadsheet type programs, it is usually easier
to have data and errors interwoven. For exporting to symbolically
based programs like Octave, it is easier to have them in separate
arrays.
Let me give you an example of tightly bound errors associated
with data, using Jess Brewer's DB format (used by some muSR people)
The following example is a file which stores
the time [t]
the treated asymmetry for detector 1 [data]
the fitted asymmetry for detector 1 [th]
difference between the fit and the data [diff]
Another test of the accuracy the fit [test]
==========================================================================
title
spec01.db SLO r1588_2 : 500nm YBCO (Kinder, 35.5 149.19981026 16:32:00
labels
Time (musec)
a(t) Theory
a(t) data
Difference a(t)
ratio*(datum(i,j)-guess(i,j))/sigma(i,j)
Comments
binfactor= 1
group = 1
data t th data diff test
4.0000003E-02 ,,, \
-0.1639927 ,,, \
-0.1156916 , 3.1063100E-02 ,, \
-4.8301034E-02 ,,, \
1.045283 ,,,
6.0000006E-02 ,,, \
-0.1761850 ,,, \
-0.1668108 , 3.0301264E-02 ,, \
-9.3742078E-03 ,,, \
-0.1680480 ,,,
8.0000006E-02 ,,, \
-0.1490452 ,,, \
-0.1568016 , 3.0621275E-02 ,, \
7.7564027E-03 ,,, \
-0.6223261 ,,,
...
8.860001 ,,, \
9.2142010E-03 ,,, \
0.0000000E+00 , 6.2336028E-02 ,, \
9.2142010E-03 ,,, \
-0.1273109 ,,,
(data ended by a single blank line, no need to say ahead of time low long it
will be.)
==========================================================================
Note: this is essentially an "ntple" storage of treated data,
where the column name "data" is associated with its plotting label
"a(t) data". The first instance of a(t=4.0000003E-02) is
"-0.1156916 , 3.1063100E-02 ,," where "-0.1156916" is the value
and "3.1063100E-02" is the symmetric error. If there were asymmetric
errors, it would be something like "-0.1156916, 3.106E-02, 2.1e-2,"
with the error bar extending from -0.1156916 +3.106E-02 down to
-0.1156916 - 2.1e-2. This format allows no-errors, symmetric and
and asymmetric errors to be mixed freely.
Maybe we should call this kind of data storage "etple"
short for error-enabled ntple, like ntples in OpenGenie. (I'm not sure
how ntples are actually handled in OpenGenie, since I haven't used the
program.)
Plotting "Plot t data" automatically produces y-axes error bars
on the "data" variable. One can also plot "plot data th"
with x-axis errors on "data".
One can also generate new variables with rundamentary error
propagation :
generate diff = data - th
label diff
Difference a(t)
==========================================================================
When one extracts this data to a spreadsheet readable form, one
gets only the errors actually used:
!! Comment: spec01.db SLO r1588_2
!! Comment: t th data edata diff test
4.0000003E-02, -0.1639927, -0.1156916, 3.11E-02, -4.8301034E-02, 1.045283
6.0000006E-02, -0.1761850, -0.1668108, 3.03E-02, -9.3742078E-03,-0.1680480
8.0000006E-02, -0.1490452, -0.1568016, 3.06E-02, 7.7564027E-03, -0.6223261
...
8.860001, 9.2142010E-03, 0.00E+00, 6.23E-02, 9.2142010E-03, -0.1273109
Note that the errors should be reported in a form that truncates the
number of digits after the decimal place. This emphasises the fact that
they are only error estimates.
==========================================================================
Suggestions for extending NeXus to include treatment of errors.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
NXdata data An array of numbers
Only store raw data in this catagory!
Don't store error information directly for raw data?
add attribute errors_type
*Needed* errors_style="unknown"
*Needed* errors_style="not_recorded"
Needed? errors_style="derived" For something too complicated to
explain here.
*Needed* errors_style="counting" For Neutrons and muons
Limited use? errors_style="fractional"
errors_value = 0.01 for 1 percent error
Limited use? errors_style="constant" same units as data.
errors_value = 1023.4
Allow for raw? errors_style="stored" See error matrix
for example:
data size NxM for 2-D array
errors_style="stored"
with
errors size 2xNxM for 2-D array
errors_type = "asymmetric" Stores asymmetric errors.
Symmetric errors have same value saved twice for + and - errors.
Data with no error have zero in errors matrix.
or errors (same rank as data) size NxM for 2-D array
errors_type = "symmetric" Only stores symmetric errors.
Data with no error have zero in errors matrix.
or errors (zero rank) size 0 for 2-D array
errors_type = "none" (Errors matrix empty)
Is raw data ALWAYS uncorrelated? It is for muSR and SANS.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
NXTreated
Might have two different ways to represent uncorrelated treated data:
NXtreated
or NXetple
Designers should probably impliment both or chose one representation
of uncorrelated treated data for initial use.
What shall we decide? I prefer NXetple for data reduction, especially
in the advanced stages. But NXetple will probably be pretty wasteful
compared to NXtreated for only slightly treated data.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
NXTreated
NXprogram program_name
NXmacro Array of strings, containing macros to create the
results stored in NXtreated using the program
program_name. Basically, want to be able to reproduce
independently the results shown with only info stored
in the NeXus file.
NXcomments Less extensive information than macro
if automatic storing of the macro is not possible.
Any other info user wants to add.
my_spectra size NxM for 2-D array for example
errors_style="stored"
parent_data_link Link to parent data, used to create my_spectra
with
errors size 2xNxM for 2-D array
errors_type = "asymmetric" Stores asymmetric errors.
Symmetric errors have same value saved twice for + and - errors.
Data with no error have zero in errors matrix.
or errors (same rank as data) size NxM for 2-D array
errors_type = "symmetric" Only stores symmetric errors
and any number with no error have zero.
or errors (zero rank) size 0 for 2-D array
errors_type = "none" (Errors matrix empty)
Note: all the errors have to be stored the same way.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
NXetple
NXprogram program_name
NXmacro Array of strings, containing macros to create the
results stored in NXtreated using the program
program_name. Basically, want to be able to reproduce
independently the results shown with only info stored
in the NeXus file.
NXcomments Less extensive information than macro
if automatic storing of the macro is not possible.
Any other info user wants to add.
my_fit_results
nrows= Number of local rows (variable)
(data can always be appended to in this direction)
ncolumns= Number of columns (less easy to modify)
variable_name(ncolumns) Arrays of short names
lables(ncolumns)= Array of strings for axes lables
units(ncolumns)= Units for each column
errors_type(ncolumns)= "asymmetric", "symmetric", "none"
parent_data_link(nrows) Link to parent data, used to create row.
row_note(nrows) Optional short comment for each row
(Run number, sample, time stamp perhaps)
mplots= Number of plots to produce automatically
default_plot(mplots)= which variables (columns) to plot
condition_plot(mplots)= select only some row to plot.
The "NXetple" data will be stored with its errors interwoven.
NXget_etple(handle, pointer, thiscolumn or variable_name(thiscolumn), thisrow)
will return
a scalar for "none" (just the value),
Or vector of length 3 (value, 0.0, 0.0)
a vector of length 2 for "symmetric" (just the value and symmetry error),
Or vector of length 3 (value, error, error)
a vector of length 3 for "asymmetric" (the value, positive and negative errors)
i.e. (value, positive_error, negative_error)
NXget_eptple_column(handle, pointer, thiscolumn or variable_name(thiscolumn))
will return up to three vectors of length ncolumns
value(ncolumns), positive_error(ncolumns), and negative_error(ncolumns)
That is the symmetric and asymmetric errors (when present) are always
assumed to be independent errors.
Each row will contain ncolumns of variables with errors.
data var1, var2, var3, ....
val1(1),e+1(1),e-1(1), val2(1),e+2(1),e-2(1), val3(1),e+3(1),e-3(1),...
val1(2),e+1(2),e-1(2), val2(2),e+2(2),e-2(2), val3(2),e+3(2),e-3(2),...
...
val1(n) Or vector of length 3 (value, error, error)
a vector of length 3 for "asymmetric" (the value, positive and negative errors)
i.e. (value, positive_error, negative_error)
NXget_eptple_column(handle, pointer, thiscolumn or variable_name(thiscolumn))
will return up to three vectors of length ncolumns
value(ncolumns), positive_error(ncolumns), and negative_error(ncolumns)
That is the symmetric and asymmetric errors (when present) are always
assumed to be independent errors.
Each row will contain ncolumns of variables with errors.
data var1, var2, var3, ....
val1(1),e+1(1),e-1(1), val2(1),e+2(1),e-2(1), val3(1),e+3(1),e-3(1),...
val1(2),e+1(2),e-1(2), val2(2),e+2(2),e-2(2), val3(2),e+3(2),e-3(2),...
...
val1(n),e+1(n),e-1(n), val2(n),e+2(2),e-2(2), val3(n),e+3(n),e-3(n),...
For treated data containing information like fit parameters,
it VERY USEFUL to be able to easily
1. add extra columns of information by hand
2. add extra columns of information such as selectively
adding experimental conditions (from the NXsample, NXinstrument,
NXmonitor)
3. Append rows
4. Delete columns or rows.
5. Mask/comment out columns or rows (like in Kaledeagraph).
For the slighted treated data in the example, editing is less useful.
Internally, the columns of data could be rearranged to bunch
together all the columns with no errors, with symmetric errors, and
with asymmetric errors, respectively; to save space.
Try NOT to require that any given column has consistently
the same sorts of errors. For example, from minuit,
one can optionally get asymmetric errors but it is much slower
and will return symmetric errors if there is a failure.
If one is fitting interactively, one might store symmetric
errors for one run and asymmetric errors for another. Therefore, allow
automatic upgrading of columns from "none" to "symmetric" to "asymmetric"
--------------- Automatically plotting ----------------------------
Now the mechanism for plotting data automatically
that is used for NXdata won't work easily for ntples.
Conceptually it is easier to refer to the variables by their
variable name, rather than by integer position in the array.
Also, it would be nice to have the option to apply a
condition on which data points are plotted.
NXplot_etple?() could optionally return either all the rows for the variables
requested, or it could return only those that satisfy the condition.
Better way
default_plot(1)= "plot variable_name(i) variable_name(j)"
for x-y plot
default_plot(2)= "plot+ variable_name(l) variable_name(k)"
superimpose on previous x-y plot
default_plot(3)= "plot variable_name(i) variable_name(j) variable_name(k)"
for 2-D z(x,y) x-y-z plot in new window
For example
default_plot(1)= "plot t data"
condition_plot(1)= "t<3.0"
default_plot(2)= "plot+ t th"
condition_plot(2)= "t<3.0"
default_plot(3)= "plot t diff"
condition_plot(3)= null
default_plot(4)= "plot t test"
condition_plot(4)= "(t>3)&(abs(diff) < 0.01)"
--------------------------------------------------------------------------
--------------------------------------------------------------------------
NXCorrelated data
For the real analysis warrior!
limited use? (usually because too hard to keep track of in normal programs)
Correlated treated data is most useful with obviously correlated
raw data or with inverse solutions to uncorrelated raw data.
It is also probably useful for treated data that has been "deconvoluted".
In this case, the size of the array is fixed (N) and
there is a covariance matrix (NxN) because the
errors are too correlated to use just the diagonal elements.
You can't just add things to the end, like in an NXetple.
I am using 1-D data for my example, but it should be generalized
to at least 3-D.
data (N) size N for 1-D array
signal=1
errors_style="stored"
xarray
axis =1
primary =1
errors size N^2 for 1-D array
errors_type= "correlated" (Covariance matrix)
accuracy= 1.2e-9??? How accurately errors made,
might be much worse than values in data array.
Possible variations on NXCorrelated :
-------------------------------------
--- Variation 1 ----------------------------------
1.
NXCorrelated
my_lineshape
signal=1
errors_style="stored"
parent_data_link Link to parent data, used to create my_lineshape
xarray
axis =1
primary =1
errors size N^2 for 1-D array
errors_type= "correlated" (use full covariance matrix)
--- Variation 2 ----------------------------------
2.
my_lineshape
signal=1
errors_style="stored"
parent_data_link Link to parent data, used to create my_lineshape
xarray
axis =1
primary =1
errors roughly (N*sqrt(2)) X (num_off_diagonal)
errors_type= "n_off_diagonal"
num_off_diagonal= 5
Use only the diagonal and the first n=5 off diagonal rows
of the covariance matrix and zero fill the rest.
num_off_diagonal= 0 defaults to the case of using
only the diagonal elements as if they are independent errors.
--- Variation 3 ----------------------------------
For the real analysis warrior:
3.
my_lineshape
signal=1
errors_style="stored"
parent_data_link Link to parent data, used to create my_lineshape
xarray
axis =1
primary =1
errors_hessian_row size N*sqrt(2)
errors_type= "hessian_row"
Maximum in first element then decays towards zero,
perhaps with oscillations.
num_off_diagonal= 5
Use only the first element and the next n=5 elements
of the hessian_row and zero fill the rest.
If not present, fill the whole hessian.
errors_hessian_diagonal size N*sqrt(2)
errors_type= "hessian_diagonal"
Create the hessian by taking copying errors_hessian_row
in every row of hessian matrix (NxN), shifting it so that the
maximum is centered on the diagonal, and mirrored symmetrically
about the diagonal. Optionally zero-fill the off-diagonal rows
beyond num_off_diagonal. Then add the values of errors_hessian_diagonal
to the diagonal. Then invert the matrix to make the covariance matrix.
Maybe there are clever, fast routines to invert matrices of
this shape.
This is useful for the case of using Maximum Entropy for
inverse problems. For example, with muSR, one has time domain data a(t)
and uses Maximum Entropy (instead of ordinary Fourier transform) to
find the distribution of local fields P(b). The hessian_row term comes
from requiring the reduced chi^2 = 1 (in the time domain) and the
hessian_diagonal term comes from the entropy, which is a function of P(b).
I suspect that covariance matrices from any generalized Maximum Entropy
technique or one that uses some smoothness criterion instead could be
separated into two vectors (each corresponding term being best described in
either the "forward" or "inverse" domains) which are used to construct
the hessian. It might also be useful for conserving disk space for the
error matrices associated with image-reconstruction of samples using
the phase information of neutron or x-rays. Whereas this representation saves
disk space, it is slow to invert the matrix.
The two vectors making up the hessian are more "meaningful" in
monitoring the quality of the results than is the resulting covariance matrix.
For simplicity, one could just store them separately from the
pre-constructed covariance matrix.
==========================================================================
For the real analysis warrior:
Treated data and its errors probably need an optional accuracy
attribute, since accumulated rounding errors plus the machine accuracy
of the analysis program will probably give an accuracy smaller
smaller than that implied by the data type (single or double precision).
Since the covariance matrix is typically created by inverting
a hessian, one would expect the accuracy of the errors to be
worse than the data itself, and highly dependent on the inverse algorthm used.
Fitting packages such as Minuit in the Cern library can provide a reasonable
estimate of the accuracy of its results.
accuracy= 1.2e-9???
==========================================================================
There should be links/ and/or mounted files from NXtreated and
NXetple to the parent data used to create the treated data.
Links would be for parent data within the same NeXus file,
mounts to parent data in a separate NeXus file.
This will allow one follow up the analysis "tree" and see all the steps
used to distill the data. The final results will be the "trunk"
and will be of type NXTreated, NXcorrelated or NXetple.
Raw data will be in principle the top-most leaf nodes,
although in practice people may not bother to use or maintain the links.
Of course, different people might analyse the same set of
data, or a data set might be analyzed again for a different purpose,
so there will be multiple analysis trees, with some or all
NXdata leafs in common. This is why I don't like treated data being stored in
the same file as the raw data. Doing so encourages turf wars.
Raw data should be write-protected both as a matter of safety
and as evidence that you have not fraudulently modified it.
NXData
Always one ending point "leaf" in the tree. Will never have a parent.
NXTreated
parent_data_link Link to parent data, used to create my_spectra
NXetple
parent_data_link(nrows) Link to parent data, used to create row.
NXcorrelated
parent_data_link Link to parent data, used to create my_lineshape
==========================================================================
Example of
1. parent links
2. shared parameters in NXetple
3. diffuse plotting of NXetple
Let's suppose a NeXus file contains NXtreated data for
a series of runs corresponding to a temperature scan.
For each temperature, there is
NXEntry n=3 (say temperature = 20K)
NXSample (actually stored in a separate file, via mount)
...
NXdata
raw_data (actually stored in a separate file, via mount)
NXtreated
my_spectra_dectector1 time, data, theory
my_spectra_dectector2 time, data, theory
parent_data_link = raw_data
NXCorrelated my_lineshape field, probability
parent_data_link = raw_data
NXetple my_fit_results
variable_names= (run, temp, det, asy, b, rlx, ph)
(pretend that the following are numbers, except for the links)
run, temperature, dectector1, \
asymmetry1, avfield, relaxationrate, phase1, \
parent_data_link = my_spectra_detector1
run, temperature, dectector2, \
asymmetry2, link=avfield, link=relaxationrate, phase2, \
parent_data_link = my_spectra_detector2
Where "run" and "temperature" are copies of the values
stored under other parts of NXEntry such as NXSample.temperature.
Note that the fitting program assume the same value of
avfield and relaxationrate for all detectors, so there are a links
for all but the first detector.
Now, I would like to plot the NXetple items
temperature and relaxationrate for all temperatures (one etple per NXentry).
Therefore, we either need
1. to extract all the NXetple information which has the same name
"my_fit_results" and has the same dimensions (a consistency check),
but not necessarily the same variable names, into a separate NeXus file.
[This is probably prefered, because one might want to
delete the intermediary results due to the size and
because the NeXus defintion so far is a very flat implementation.
We can maintain an effectively deeper heirachical structure
artifically by using file mounts in HDF 5 while keeping
the NeXus file format flat.]
and/or
2. Have the analysis program append all related NXetple results to
a separate NeXus file.
or
3. Make the NeXus data format have a deeper structure.
So we need to define above the level of NXentry
a lay to plot diffuse data
mplots=1 Number of plots to produce automatically
which_NXetple(mplots)=my_fit_results
default_plot(mplots)="plot temp rlx"
condition_plot(mplots)="det=1" Only detector # 1.
This will encourage a very deep heirachical structure.
It is probably inefficient for accessing and difficult to
keep track of for both the programmer and the user when browsing a file.
For "documentation" purposes, it is a good idea to keep a copy
of the NXetple information in the same NXentry as
the treated data (NXtreated or NXcorrelated) it is derived from
as well as appending it to a summary NeXus file containing only NXetples.
(Combo of options 1 and 2.)
It might be a good idea to import/export NeXus files
containing only NXetples items into XML or some other ascii
format (maybe one optimized for NeXus NXetple files)
to that people have the option of using their favorite
ascii-based editor. I think that there will be less interest
in extensive hand-editing of NXtreated or NXcorrelated.
==========================================================================
Summary
NXdata (Only store raw data in this catagory!)
------
data
errors_style="unknown"
or errors_style="not_recorded"
or errors_style="derived" For something too complicated to
explain here.
or errors_style="counting" For Neutrons and muons
or? errors_style="fractional"
errors_value = 0.01 for 1 percent error
or? errors_style="constant" same units as data.
errors_value = 1023.4
or? errors_style="stored" See error matrix
errors (only for errors_style="stored") Appropriate for raw?
errors_type = "asymmetric"
or errors_type = "symmetric"
or errors_type = "none"
NXTreated (Best for manipulated data, not for final reduction)
---------
NXprogram program_name that generated results
NXmacro
NXcomments
my_spectra
errors_style="stored"
parent_data_link Link to parent data, used to create my_spectra
errors (only for errors_style="stored")
errors_type = "asymmetric"
or errors_type = "symmetric"
or errors_type = "none"
NXcorrelated Could possibly be folded in with NXtreated,
------------ except tor? errors_style="stored" See error matrix
errors (only for errors_style="stored") Appropriate for raw?
errors_type = "asymmetric"
or errors_type = "symmetric"
or errors_type = "none"
NXTreated (Best for manipulated data, not for final reduction)
---------
NXprogram program_name that generated results
NXmacro
NXcomments
my_spectra
errors_style="stored"
parent_data_link Link to parent data, used to create my_spectra
errors (only for errors_style="stored")
errors_type = "asymmetric"
or errors_type = "symmetric"
or errors_type = "none"
NXcorrelated Could possibly be folded in with NXtreated,
------------ except that you may not truncate or append to the set.
NXprogram program_name that generated results
NXmacro
NXcomments
my_lineshape
errors_style="stored"
parent_data_link Link to parent data, used to create my_lineshape
errors
accuracy= 1.2e-9??? How accurately errors made
errors_type= "correlated" Covariance matrix
or errors
errors_type= "n_off_diagonal" Strip along the diagonal of cov. matrix
(Could be folded in with "correlated".)
num_off_diagonal= 5
or errors_hessian_row May be left out initially.
errors_type= "hessian_row"
num_off_diagonal= 5
errors_hessian_diagonal
errors_type= "hessian_diagonal"
NXetple (Best for data reduction)
-------
NXprogram program_name that generated results
NXmacro
NXcomments
my_fit_Results
nrows= Number of local rows (variable)
ncolumns= Number of columns (less easy to modify)
variable_name(ncolumns) Arrays of short names
lables(ncolumns)= Array of strings for axes lables
units(ncolumns)= Units for each column
errors_type(ncolumns)= "asymmetric", "symmetric", "none"
parent_data_link(nrows) Link to parent data, used to create row.
row_note(nrows) Optional short comment for each row
mplots= Number of plots to produce automatically
default_plot(mplots)= which variables (columns) to plot
condition_plot(mplots)= select only some row to plot.
etple(ncolumns, nrows) or etple(nrows, ncolumns) of 3-vectors
(value, positive_error, negative_error)
The "NXetple" data will be stored with its errors interwoven.
---
Dr. Tanya Riseman
School of Physics and Astronomy phone 44-121-414-7322
University of Birmingham fax 44-121-414-4719
Birmingham B15 2TT, United Kingdom email: tmr at th.ph.bham.ac.uk
More information about the NeXus
mailing list