At one end of the scale are sensors – body-worn medical sensors, perhaps – that are clearly defined objects and clearly important. At the other end of the scale (you might argue) are social media comments, but even these can be important.
NPL head of data science Mike Oldham tells Electronics Weekly: “There are UK SMEs analysing social media tweets for business decisions – for example, mapping tweets onto the UK transport network and picking up problems, before a transport company’s own instruments in some cases.”
With ever more machine learning sifting data for decisions, or even making decisions, how can confidence in data quality be maintained throughout the data's lifecycle, asks Oldham. It's a cycle he summarises as "collect, connect, comprehend" – with "compress" added to the list when a lot of data has to pass through a narrow pipe.
Is there any feel for what is important data and what is not?
“If you have spoof data on a smart power grid, you don’t want a semi‑autonomous machine to close down the grid,” says Oldham.
For the tweets and transport situation, “I am not sure NPL needs to get involved if navigation data is a bit wrong and you are late once in a while, but for medical diagnosis you will need the gold standard. Data importance is all about lives, safety and financial risk.”
With its counterpart metrology institutes in the US and Germany – NIST and PTB – NPL has just kicked off a project to set some standards for data quality, according to NPL strategy manager Sundeep Bhandari. It is also working with Brunel University, some China-based organisations and the Turing Institute in various ways towards similar ends.
Part of the research has involved quizzing organisations, including telecoms firms, energy firms, healthcare providers, the BBC and the Metropolitan Police, about what they need from data quality metrics. “After speaking to them, we will try to distil down what NPL needs to do for industry,” says Bhandari.
One known need is to quantify measurements made during scan‑based medical diagnoses, to allow ‘big data’ techniques to extract new knowledge from millions of scans, and to remove variability from individual diagnoses.
“At the moment, medical diagnosis is a close working relationship between a clinician and the machine they use,” says data scientist Oldham. “We are working on standardising this, so any clinician can work with results from any machine. Part of the process is deciding what kind of metadata you have to collect.”
One specific medical data quality project at NPL is an attempt to improve measurements made by MRI scanners observing ‘myocardial perfusion rate’ – the rate blood, and therefore oxygen, is delivered to heart tissue.
The scan is acquired over time, then post-processed by a clinician who selects the objects of interest within it, allowing the machine to extract a time-curve of the behaviour of a contrast agent. This curve is fitted to a mathematical model to estimate the perfusion rate.
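To make the idea concrete, here is a minimal sketch of that post-processing step in Python. The gamma-variate curve, the parameter values and the derived indices are illustrative assumptions for a synthetic bolus, not NPL's actual model or clinical software:

```python
import numpy as np

# Synthetic contrast-agent time curve for a region of interest:
# a gamma-variate-shaped first-pass bolus (hypothetical example).
t = np.linspace(0.0, 60.0, 121)        # seconds
t0, alpha, beta = 8.0, 2.0, 4.0        # arrival time and shape (assumed)
dt = np.clip(t - t0, 0.0, None)        # zero before the bolus arrives
curve = 5.0 * dt**alpha * np.exp(-dt / beta)

# Two simple semi-quantitative indices commonly derived from
# such a curve before model fitting:
peak = curve.max()                     # peak enhancement
upslope = np.gradient(curve, t).max()  # maximum rate of enhancement

# A crude relative perfusion index: upslope normalised by peak.
perfusion_index = upslope / peak
```

In practice the measured curve would be noisy and the model parameters would be estimated by least-squares fitting rather than assumed, but the shape of the workflow – time curve in, perfusion estimate out – is the same.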
The trouble is, an MRI scanner is largely a qualitative instrument, unsuited to absolute measurement. The best way to measure perfusion is with a PET scan, but this needs an injection of radioactive material.
“PET is the most quantitative, but a person can only have so many PET scans in one life,” NPL data scientist Nadia Smith tells Electronics Weekly. “If we can bring up MRI to be more quantitative, it would be much better for diagnosis.”
And that is exactly what NPL will be attempting as part of a three-year European project alongside its peers in France and Germany (LNE and PTB), King’s College London and a Finnish hospital.