The TruScan handheld Raman system analyses a material against a stored method and reports a PASS/FAIL result with high accuracy and precision.
However, when a material is not found in the method, i.e. it does not pass the test and returns a "FAIL" verdict, what is the performance of the discovery mode in determining the correct unknown identity? Can we characterise the discovery library performance with objective measurement criteria?
Most dedicated spectroscopic analytical methods are quantitative and target a small number of chemical species, e.g., protein content in wheat or moisture content in a pharmaceutical drying process. Traditional analytical performance criteria such as mean-squared error, sensitivity, selectivity and limit-of-detection are often used to characterise the performance of these systems.
Increasingly, devices such as FTIR and handheld Raman spectroscopy systems are being employed for qualitative analysis tasks, namely in situ verification of raw materials. While the developers and end-users of these devices would still greatly benefit from a rigorous means of performance characterisation, traditional figures of merit are ill-suited to material identification tasks. Further, since these devices are increasingly employed by non-experts, they should be rigorously characterised in terms of their qualitative end-use attributes.
Decisions and Outcomes
Analytical chemists and laboratory personnel usually associate spectroscopic material identification with software library search methods, which tend to be viewed like internet search tools. A material is presented to the spectroscopic system much as a query is submitted to a search engine, and the results, sorted by some ranking measure, are returned for expert spectroscopic review. But the end-users of field material identification systems are not usually spectroscopy experts, so they rely on an algorithm to convert instrument data to a qualitative result.

Like all data-driven judgments, this qualitative result is either right or wrong. For spectroscopic material identification, however, the outcome tree is slightly more complex than simply right or wrong, as the material identification decision tree illustrates.
[Figure: Material Identification Decision Tree]
Therefore, the critical question for the end-user is "If my instrument fails to PASS the method verification, how often am I correct in identifying the unknown, and how often am I mistaken?"
Performance Characterisation
Receiver operating characteristic (ROC) curves [1-3] have long been used to depict the tradeoff between sensitivity of detection and false-positive rate for qualitative tests. They are easy-to-understand, graphical descriptors of test capability. But ROC curves are insufficient for the material identification task in at least one sense: a spectroscopic material identification system can return multiple material records in response to a single measurement query. Qualitative precision is therefore an additional performance characteristic that is important from an end-use perspective.
The effectiveness of information retrieval systems is more typically characterised by precision-recall (PR) curves [4]. McLafferty et al. employed PR curves [5] in their evaluation of mass spectral library search software; however, their objective was substructure identification rather than full molecular identification. The PR curve also does not reflect false-positive rates, because in document retrieval it is assumed that there are always relevant records in the database. This is an inappropriate assumption for spectroscopic material identification because the systems will invariably encounter materials that are not in the device's library of materials.
Therefore, a spectroscopic material identification system needs its own set of performance parameters, and can be characterised by the following:
- true-positive rate @ t: if materials in the system library are tested under field conditions, how likely is the system to declare a match of an unknown to the correct library materials?
- false-positive rate @ t: if materials that are not in the system library are tested under field conditions, how likely is the system to declare false matches of an unknown to other irrelevant materials in the system library?
- imprecision @ t: when the device declares a 'match', regardless of whether the match is correct or not, how long is the list of inappropriately listed materials?
To examine the system's operating characteristics, we also have to assume that the spectroscopic material identification device has a tunable parameter (denoted "t" in the definitions above) which would allow it to be more liberal or conservative in suggesting a material identification.
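The three characteristics defined above can be sketched in a few lines of Python. This is a minimal illustration, not the device's algorithm: the `Challenge` record, the per-material match scores, and the rule "declare every library material whose score is at least t" are all assumptions introduced here, and the example data are invented.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class Challenge:
    """One measurement presented to the device (hypothetical record layout)."""
    true_material: Optional[str]   # None when the material is not in the library
    scores: Dict[str, float]       # library material -> match score (assumed: higher is better)


def declared(c: Challenge, t: float) -> Set[str]:
    """Library materials the (assumed) decision rule would list at threshold t."""
    return {m for m, s in c.scores.items() if s >= t}


def operating_point(challenges, t: float) -> Tuple[float, float, float]:
    """Return (true-positive rate, false-positive rate, imprecision) at threshold t."""
    in_lib = [c for c in challenges if c.true_material is not None]
    out_lib = [c for c in challenges if c.true_material is None]

    # True positive: an in-library material whose correct record is listed.
    tp = sum(c.true_material in declared(c, t) for c in in_lib)
    # False positive: an out-of-library material for which anything is listed.
    fp = sum(bool(declared(c, t)) for c in out_lib)

    # Imprecision: mean number of incorrectly listed materials per declared match.
    matched = [c for c in challenges if declared(c, t)]
    wrong = sum(len(declared(c, t) - {c.true_material}) for c in matched)

    nan = float("nan")
    tpr = tp / len(in_lib) if in_lib else nan
    fpr = fp / len(out_lib) if out_lib else nan
    imprecision = wrong / len(matched) if matched else nan
    return tpr, fpr, imprecision


# Invented example data: two in-library and two out-of-library challenges.
challenges = [
    Challenge("acetone", {"acetone": 0.95, "ethanol": 0.10}),
    Challenge("ethanol", {"acetone": 0.40, "ethanol": 0.60}),
    Challenge(None, {"acetone": 0.55, "ethanol": 0.20}),
    Challenge(None, {"acetone": 0.05, "ethanol": 0.15}),
]

for t in (0.3, 0.5, 0.7):
    print(t, operating_point(challenges, t))
```

Sweeping t from liberal to conservative traces out the ROC curve (true-positive rate against false-positive rate), while the third value tracks how cluttered the declared matches are at each setting.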
Experiment
We examined the performance of Ahura Scientific's handheld Raman verification system, TruScan. This device employs power-tunable 785 nm laser excitation with a sub-0.2 nm linewidth, giving a Raman shift range from 200 to 2900 cm⁻¹. Embedded systems handle all data acquisition and verification algorithm calculations, and the on-board LCD screen displays results and system navigation. Weighing less than 4 lbs, this droppable, water-tight, environmentally robust device was designed to meet demanding specifications for field use by non-experts. An experimental performance verification of the resulting system was therefore critical.
A multi-device, multi-user experiment was designed and executed over 8 weeks with materials chosen at random by serial number. This involved:
| 6 devices | 335 measurements in vial-holder |
| 6 operators | 455 measurements free-space |
| 261 unique materials | 376 liquid measurements |
| 790 total measurements | 414 solid measurements |
1454 total challenges to the systems
Five of the users were characterised as having light experience with the device (equivalent to two days of training) and one user was a novice, having approximately five to ten minutes of operational training at the start of the study.
664 of the 790 measurements were made on materials represented in the system library. For a more rigorous assessment of false-positive rates, the results of these tests were also re-analysed by removing the relevant library record and re-examining test results for any false-positives that would have occurred. The devices executed the embedded version of Ahura Scientific's probabilistic material verification software in real-time, and were operating in "auto-measure" mode, in which the device governs all tunable software parameters on a measurement-by-measurement basis.
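The reported total of 1454 challenges is consistent with counting each in-library measurement twice: once as measured, and once more in the re-analysis with its library record removed. The source does not state this breakdown explicitly, so the short check below is an inference from the published totals rather than a documented calculation.

```python
# Counts taken from the study description.
total_measurements = 790
in_library = 664

# Geometry and material-state splits should each account for every measurement.
assert 335 + 455 == total_measurements   # vial-holder + free-space
assert 376 + 414 == total_measurements   # liquid + solid

# Measurements of materials with no library record (derived, not stated in the source).
out_of_library = total_measurements - in_library

# Assumed reading: each in-library measurement is re-analysed once with its record removed.
reanalysed = in_library
total_challenges = total_measurements + reanalysed

print(out_of_library, total_challenges)
```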
The aggregate ROC curve [6] for all six devices is shown here. The experimental uncertainty (95th percentile) in the point estimates is indicated by the blue shaded area, while the measured curve itself is in red. The area under this ROC curve, a bulk measure of qualitative accuracy, is 96.9% with an uncertainty of 1.0% (95th percentile).
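As an illustration of how an area under the ROC curve and a percentile uncertainty band can be estimated, the sketch below uses the rank (Mann-Whitney) formulation of the area and a bootstrap percentile interval. The source does not say how its uncertainty was derived, so the bootstrap is a swapped-in technique, and the scores are synthetic values generated purely for the example.

```python
import random


def auc(pos, neg):
    """Area under the ROC curve: probability that a positive outscores a negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_auc_interval(pos, neg, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval for the AUC, resampling each class with replacement."""
    rng = random.Random(seed)
    stats = sorted(
        auc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return stats[lo_idx], stats[hi_idx]


# Synthetic match scores (assumption): in-library challenges tend to score higher.
rng = random.Random(42)
pos_scores = [rng.gauss(2.0, 1.0) for _ in range(60)]   # correct material present in library
neg_scores = [rng.gauss(0.0, 1.0) for _ in range(60)]   # material absent from library

point = auc(pos_scores, neg_scores)
low, high = bootstrap_auc_interval(pos_scores, neg_scores)
print(f"AUC = {point:.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Resampling the two classes separately keeps the number of in-library and out-of-library challenges fixed in every bootstrap replicate, which mirrors a study design where those counts are set in advance.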
The results were very consistent across the 6 devices/users participating in the study.
FD1203 had distinctly lower performance for free-space measurements of solids. This was the device operated by the novice user and was therefore more prone to user-related error. There were no other measurable degradations in performance associated with either material state (solid or liquid) or measurement geometry (vial-holder or free-space). The areas under the ROC curves across all stratifications of the experiment are tabulated below.
| Measured samples (by instrument serial #) | FD1006 | FD1051 | FD1053 | FD0801 | FD1603 | FD1203 | Aggregate |
| All | 97.3% | 97.2% | 98.8% | 97.5% | 98.3% | 94.6% | 96.9% (±1.0%) |
| Free-space | 94.7% | 95.1% | 99.1% | 98.8% | 99.9% | 92.1% | 95.6% (±1.7%) |
| Vial-holder | 99.4% | 99.5% | 98.7% | 96.4% | 97.1% | 97.1% | 98.2% (±1.1%) |
| Liquids | 94.6% | 96.0% | 98.7% | 98.0% | 99.2% | 96.9% | 97.0% (±1.4%) |
| Solids | 99.2% | 98.3% | 99.5% | 95.8% | 96.0% | 92.4% | 97.0% (±1.5%) |
The ROC curve does not convey precision information, so we have tabulated the precision characteristics of the experiment below.
| # of additional (incorrect) materials listed | % of cases | Interval estimate |
| 0 | 92.7% | [89.7% 94.9%] |
| 1 | 5.2% | [3.4% 7.8%] |
| 2 | 1.1% | [0.5% 2.7%] |
| 3 | 0.6% | [0.2% 2.0%] |
| ≥4 | 0.4% | [0% 1.6%] |
In 93 percent of the 1490 cases, the system reported only the correct material as a plausible match.
In 99.4 percent of all cases, the system reported the correct material as the first (i.e. most likely) or only choice.
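Interval estimates for proportions like those tabulated above are commonly obtained with a binomial confidence interval. The source does not state which method or sample size it used, so the sketch below shows one standard choice, the Wilson score interval, applied to a purely hypothetical count.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for roughly 95% coverage)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical counts for illustration only; these are not the study's data.
low, high = wilson_interval(successes=90, n=100)
print(f"90/100 -> [{low:.3f}, {high:.3f}]")
```

Unlike the simple normal approximation, the Wilson interval stays within 0 to 1 and remains usable for proportions near the extremes, which matters for rare outcomes such as long lists of incorrect matches.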
Summary
Traditional analytical figures of merit (signal-to-noise ratio, selectivity, sensitivity, limit of detection) are unsuitable for characterising the identification of unknowns. Like all analytical devices, however, these qualitative detectors require performance characterisation, and strict rules must be defined for robust, repeatable operation by non-expert users. ROC curves and precision characteristics represent a promising means of measuring spectroscopic discovery library performance. When rigorously tested in an end-use situation, the Ahura Scientific handheld Raman devices accurately and precisely identified the correct or most likely material. These spectroscopic material identification tools have been employed by non-expert field users and have been found acceptable for pharmaceutical validation requirements.
References
1. JA Swets et al., "Assessment of diagnostic technologies", Science 205:753-759 (1979)
2. TD Wickens, Elementary Signal Detection Theory, Oxford University Press (2001)
3. CD Brown and HT Davis, "Receiver operating characteristic curves and related decision theory measures: a tutorial", Chemometrics and Intelligent Laboratory Systems 80:24-38 (2006)
4. D Lewis, "Representation quality in text classification: an introduction and experiment", Proceedings of the Workshop on Speech and Natural Language, Morgan Kaufmann, pp. 312-318 (1990)
5. GM Pesyna et al., "Probability based matching system using a large collection of reference mass spectra", Analytical Chemistry 48:1362-1368 (1976)
6. Proc. of SPIE, spie.org, Vol. 6378 637809, 1-11

