Clive Humby (he of Tesco Clubcard fame) is credited with coining the data-as-oil analogy in 2006. Then it was given a boost by the Economist. There are definite parallels between the costs of turning crude oil into a garage forecourt product and the effort necessary to turn raw data into the fuel that powers the data revolution. The analogy has survived because it is good. So good that it is now something of a cliché, as all-pervading as the light bulb is as shorthand for all things innovative.
What I am curious about is whether the analogy holds for data collected as part of an academic research project, particularly routinely collected health data.
Expensive stuff, data
The analogy catches one of the expensive problems with data. Wiser heads than mine have mentioned that you can’t own data. The analogy implicitly acknowledges that data, like the sticky black stuff that comes out of the ground, needs work before it becomes the user-friendly tool on which theories can be based and algorithms tested. That work takes time, resource and ultimately, money. Hydrocarbons need to be extracted from unforgiving parts of the planet. Data needs to be harvested from people or pulled out of record systems or databases that can be labyrinthine (try getting two NHS Trust IT systems to talk to one another) or just plain muddy (put too much data into a datalake without paying attention and the datalake starts to look like plasticine at the end of a toddler group session: a brownish-purple blob with few distinguishable features). Where crude oil needs refining, data needs contextualising. In its crude state, data is noisy and messy: illegible, out of date, incomplete or in the wrong format. Routinely collected health data often isn’t collected or organised in a way that makes the switch to a research use easy.
To stretch the data-as-oil analogy: just as extracting and refining oil carries risk, so the use of data can be a minefield. In the oil industry, stuff can blow up. Data does not (always) pose the same immediate threat to life and environment but it does carry risks: where data is used in a clinical setting there is a very real risk to life. If you read the data wrong, or if you fail to correct inherent bias in the data or in the process of collecting it, then your predictions and algorithms will be a massive fail. Legally speaking, data use is dependent on permissions. Exceeding those permissions risks infringement actions and damages. Data protection imposes restrictions and obligations that are not quick, simple or cheap to comply with. Failure to comply risks regulatory action, damages claims and/or a public backlash. There have been two attempts to use NHS data on a national scale: the first failed miserably in the face of transparency and confidentiality issues (remember care.data?); the second (GP Data for Planning and Research) has been postponed to an unspecified start date having been heavily criticised for … a lack of transparency. [Lack of] trust is a massive issue.
Breaking the analogy, there are two aspects of research that do not map onto the process of refining crude oil.
Refining crude is a one-way process. Oil cannot be unrefined and reused. Refined data can be interrogated and broken down and then reused. That is how analysts seek to understand anomalies: go back to the source. It doesn’t always work – refining data often involves removing parts of the data, and that loss may make it impossible to reverse the process.
The costs of refining crude into petrol are predictable: forecourt pricing can be broken down into its component costs.
The costs of refining data are harder to predict: not all processes are equal or equally effective.
This highlights a big problem: how do you value data? Spoiler alert, I have no answer to that. I can offer two observations.
When it comes to valuing data, there are extreme views and nuances in between.
One extreme view: raw data is in such a mess that the user has to spend so much time and so much money sorting it out that there is no sense in the user paying for it, even though once the user has sorted it out it will be commercialised with enthusiasm. That view is bonkers. If you rely on another source to supply the data then you need to accept that that source will need to see a return.
Another extreme view: data has so much potential value in research terms and in financial terms that no sensible provider should release it unless the user is prepared to pay handsomely and upfront. That view is less bonkers but only if the price charged is calculated based on the cost of extracting the data. Want to price your data according to the value that will be realised from it? You need a crystal ball. Or an algorithm (ironically). Valuation is a commonplace conundrum in the research world: my material (cell line, antibody, tissue sample, data) forms a tiny part of your process. But your process cannot progress without my material. Hence observation one: be realistic. It takes money and resource both to refine data to a usable product and to collect the data in the first place. In many cases, access to data becomes as valuable as the data itself. There is a reason that AI companies fight to forge links with the NHS. Providers and users each need to recognise that the other has incurred costs already (collecting data or building a business) and that each is contributing a valuable asset (raw data or the expertise and resource to refine and use it). Each needs to be realistic in their expectations of return (‘the more resource I have to commit, the less I will pay for access’) but also realistic enough to recognise that each needs what the other has.
Do you need to value the data itself?
Logically, if you can’t own data, you can’t sell data so you don’t need a notional price per data unit. Focus on selling the things around the data: sell the right to access and use the data by way of granting licences to database right or copyright or by using confidentiality as a controlling mechanism. You can assess and charge for the resource you have expended in collecting and cleaning (mining and refining) the data. The real money isn’t in the data. It is in the data services.
So, does the analogy work for research data?
Does the ‘data is oil’ analogy work for research data including routinely collected health data used in research? Only up to a point. Tesco’s Clubcard is nothing if not an exercise in research: collecting data and spotting trends. Crude oil and crude data are both difficult to mine and refine. But there are key differences: the cost of extracting and refining oil is easier to quantify and the price of the end product is easier to predict. The costs of extracting and refining data are much harder to quantify and the end product is much harder to value. The money is in the services, not the product, especially in a health care setting where selling data is still at least a mild taboo.
“Once the data is collected, it will only be used for the purposes of improving health and care. Patient data is not for sale and will never be for sale.” [Extract from letter from DHSC (Jo Churchill MP) to GPs about GPDPR – 19 July 2021.]