Dark Data is the ROI problem.
Author: Tatiana Collins
The age of data generation
Data has been a massive generative force in the Digital Economy, no doubt, and as a result has accumulated in vast amounts.
Zettabytes of data storage.
Billions of applications.
Trillions of queries.
Thousands of data centers.
Continuous generation, duplication, storage and processing of data is increasing exponentially with the adoption of AI, especially the latest large language models, which continuously consume and tokenize words and characters to produce the desired output. By 2030, the amount of data produced is expected to reach 1 yottabyte per annum! Let’s be honest: none of us can comprehend or visualize what this means, except that it is A LOT, largely invisible, and increasingly puts pressure on financial and natural resources.
Dark Data, defined by Gartner as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes”, is now estimated to account for 85% of all the data an enterprise stores.
Keeping all this data safe and secure is a nagging headache for Chief Information Officers, who are expected to deliver investment returns on data assets and on vast, sophisticated data infrastructures that typically run into $100 million or more in capex.
All data are business assets
Technologists have long argued that there is perpetual value in data. It does not depreciate and does not deteriorate.
The other aspect is its permanence: data has become almost impossible to delete and very easy to multiply.
We are told that dark data holds significant commercial potential, for instance:
- Unique Insights: historic data can reveal hidden patterns, correlations and insights that are not visible when analysing only recent, “hot” data.
- Better-Informed Decision Making: a wider pool of information improves strategic planning.
- Hidden Treasures: businesses could gain a competitive advantage by uncovering unique assets.
- Resource Optimization: Analyzing dark data can lead to better resource allocation and optimization, reducing operational inefficiencies.
Data storage dynamics
As organizations and individuals have shifted their data to the cloud as part of the ongoing digital transformation, more often than not in a “lift and shift” fashion, hyperscalers have decided how best to tier storage based on each company’s usage patterns (and structured their commercial contracts accordingly). Taking AWS as an example, data that goes unread is progressively tiered down from standard storage to infrequent-access and archive classes.
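For illustration, here is a minimal sketch of how such tiering is often expressed in code as an S3 lifecycle rule; the bucket name, prefix and transition thresholds are assumptions for illustration, not values taken from this article or any real contract.

```python
# A minimal sketch of cloud storage tiering, expressed as an S3 lifecycle rule
# via boto3. Bucket name, prefix and day thresholds are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-enterprise-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "age-out-rarely-used-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    # Rarely accessed after a quarter: move to an infrequent-access tier.
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    # Effectively dark after a year: move to deep archive.
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```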
Businesses, therefore, can make an informed decision on how to store their dark data, and they do.
Whilst most of this stored dark data is kept for compliance purposes, financial institutions, for instance, today make only limited use of historic transactions, customer communication logs and market data to optimise investment strategies and detect fraud.
With the arrival of Generative AI, there is now massive demand for training data, which means cold archives must be accessed, analysed and read far more frequently to feed Large Language Models. This calls for a complete re-assessment of the commercial terms originally agreed with cloud providers.
We’ve got lots of data. Let’s go!
Some businesses have already deployed small armies of data engineers and scientists on a dark data treasure hunt. Provided all this effort is aimed at a clearly defined problem, what are the risks?
Technical
A typical data mart or warehouse used for isolated data analysis is designed for static Business Intelligence reports and relies on batch data processing. AI models instead require powerful GPUs (graphics processing units) rather than general-purpose CPUs, streaming data pipelines, and greater network capacity to reach cold data held in deep-freeze archives. As data sets grow, so does the time it takes to access and process them, resulting in speed and efficiency issues to which many data specialists can attest. Current computing capacity and data infrastructure were simply not designed to cope with this scale.
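To make the access problem concrete, here is a minimal sketch assuming the cold data sits in S3 Glacier Deep Archive: an archived object has to be explicitly restored before it can be read at all, and a bulk restore job can take many hours. The bucket, key and retention values are illustrative assumptions.

```python
# Sketch: why cold data is slow to reach. Objects in a deep-archive storage
# class must be restored before they can be read; a bulk restore is measured
# in hours, not milliseconds. Bucket, key and tier values are assumptions.
import boto3

s3 = boto3.client("s3")

# Ask S3 to stage an archived object back into a readable tier.
s3.restore_object(
    Bucket="example-archive-bucket",
    Key="transactions/2015/ledger.parquet",
    RestoreRequest={
        "Days": 7,  # keep the restored copy readable for a week
        "GlacierJobParameters": {"Tier": "Bulk"},  # cheapest tier, slowest turnaround
    },
)

# The object is unreadable until the restore completes; poll its status.
head = s3.head_object(
    Bucket="example-archive-bucket",
    Key="transactions/2015/ledger.parquet",
)
print(head.get("Restore"))  # e.g. 'ongoing-request="true"' while the job is running
```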
Financial
When they try to access cold data, the data multiplication effect leaves business users and their technology partners facing higher storage costs and even higher usage costs, occasionally three times higher than their typical monthly bills from cloud and energy providers. Gartner, Inc. reports that “worldwide IT spending is expected to reach $5.06 trillion in 2024, an increase of 8% from 2023”. Commenting on the gold-rush level of spend on Gen AI projects, Gartner analyst John-David Lovelock said: “In 2024, AI servers will account for close to 60% of hyperscalers’ total server spending.”
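A back-of-the-envelope sketch of that cost asymmetry is below; every per-gigabyte rate in it is a placeholder assumption for illustration, not a current price from AWS or any other provider.

```python
# Back-of-the-envelope sketch of why reading cold data inflates the bill.
# All per-GB rates are placeholder assumptions for illustration only,
# not current prices from any cloud provider.

ARCHIVE_STORAGE_PER_GB_MONTH = 0.001  # assumed deep-archive storage rate
RETRIEVAL_PER_GB = 0.02               # assumed bulk retrieval fee
EGRESS_PER_GB = 0.09                  # assumed data-transfer-out fee

def monthly_cost(stored_gb: float, retrieved_gb: float) -> dict:
    """Split a month's bill into storage vs. access components."""
    return {
        "storage": stored_gb * ARCHIVE_STORAGE_PER_GB_MONTH,
        "access": retrieved_gb * (RETRIEVAL_PER_GB + EGRESS_PER_GB),
    }

# Leaving 500 TB asleep is cheap; pulling a fifth of it out for model training is not.
print(monthly_cost(stored_gb=500_000, retrieved_gb=0))        # {'storage': 500.0, 'access': 0.0}
print(monthly_cost(stored_gb=500_000, retrieved_gb=100_000))  # {'storage': 500.0, 'access': 11000.0}
```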
Reputational & Ethical
An ethical AI framework is only now being defined in some countries and regions. The Infocomm Media Development Authority (IMDA) in Singapore recently unveiled a Model AI Governance Framework for Generative AI, alongside the world’s first testing framework and toolkit for AI governance for businesses operating in Singapore. The purpose of the framework is to begin addressing the ethical and governance issues raised when deploying AI solutions: explainability, bias and human-centricity.
Environmental
While other industries have begun to quantify and reduce their environmental footprint, the technology industry has been quick to come up with use cases, yet slow to quantify its own impact. IT contributes up to 5% of the total carbon footprint worldwide, more than aviation. Training a single large language model can consume as much electricity as a small town does in a month. Besides energy, water is another critical resource, often overlooked and under-estimated in impact.
We are damned if we don’t and damned if we do. So now what?
The good news is, it is not a new problem and lessons can be learnt from the past few years of digital transformation. Here is a case in point:
ITV is the UK’s largest commercial content, media and broadcasting company, home to some of the most iconic British filmography and unscripted entertainment. Courted by Apple at one stage and pushing into streaming itself, ITV identified that it held close to 100,000 hours of content in over 1,000 formats, often duplicated in both physical and digital form. While certainly an asset, the cost of storing, maintaining, re-mastering and streaming that content has sat at the top of the executive agenda and been subject to years of optimisation. The broadcaster was looking to expand internationally and diversify revenue streams by “monetising” its existing archives, and an extensive content audit ensued to hunt for golden nuggets.
After scouring the archives for months, not much “gold” was found. Some of the assets perceived as valuable were sentimental or historic in nature; others had already served their purpose and were well documented. Surfacing and utilising them added a new dimension but did not make ITV a top earner. What undoubtedly did bring value was a new content governance framework, which reduced the number of permissible formats from over 1,000 to 50. Such a radical approach, in a company that lived and breathed content, was a key step in the transition to the new age of digital content management, and in becoming a global, diversified business. The new content governance, and a rigid definition of what constitutes an “asset”, imposed strict tagging, streamlining and minimisation disciplines on programme producers. Content minimisation KPIs for content creators? Yes, and ITV has almost doubled its revenues in the last 10 years. Had it not been for such drastic measures, the king of content would have been dead long ago.
5 Lessons Learnt:
- “Hot” data is your biggest asset: clean, recent, tagged source data that is core to your business model is your biggest asset. It is fascinating how many businesses fail to use it to their advantage while searching for more.
- Minimise: hoarding more poor-quality data and multiplying it across formats and databases is counter-productive. Instead, aim for data minimisation techniques at the point of data creation, starting with computation and architectural design (a short sketch of one such technique follows this list).
- Bin it: there is a reason why dark data goes unused, and not all of it comes down to storage. Why waste valuable resources cleaning, tagging and restoring it, or even holding on to it?
- Let them eat cake? When it comes to training AI models, unless an organization has free-flowing access to someone else’s data (e.g. social media), accessing dark, cold data from cloud archives quickly becomes very expensive. Smaller AI/ML or statistical models can be just as effective at solving clearly defined business problems. Not every problem is an AI problem.
- Simplicity is the new sexy: simple, robust governance and discipline in data management, whether in architecture, storage or tagging, will go a long way towards utilising existing data and reaping the desired commercial benefits.
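As promised above, here is a minimal sketch of minimisation at the point of data creation: records are hashed and exact duplicates are dropped before anything reaches storage. The record shape and the in-memory index are illustrative assumptions; a production system would use a shared deduplication index.

```python
# A minimal sketch of data minimisation at the point of creation:
# deduplicate records by content hash so identical payloads are never
# multiplied across systems. Record shape and the "seen" store are assumptions.
import hashlib
import json

_seen_hashes = set()  # in production this would be a shared index, not a local set

def minimise(record):
    """Return the record only if its content has not been stored before."""
    digest = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()
    if digest in _seen_hashes:
        return None  # drop the duplicate instead of storing it again
    _seen_hashes.add(digest)
    return record

records = [
    {"customer": "A-001", "event": "login"},
    {"customer": "A-001", "event": "login"},  # exact duplicate, will be dropped
    {"customer": "A-002", "event": "purchase"},
]
kept = [r for r in records if minimise(r) is not None]
print(len(kept))  # 2
```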
References:
- Gartner: Planning for GenAI Initiatives is Helping to Drive IT Spending in 2024 and Beyond: https://www.gartner.com/en/newsroom/press-releases/2024-04-16-gartner-forecast-worldwide-it-spending-to-grow-8-percent-in-2024
- Statista: Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025: https://www.statista.com/statistics/871513/worldwide-data-created/
- Gartner: definition of dark data: https://www.gartner.com/en/information-technology/glossary/dark-data
- Dark data: The underestimated cybersecurity threat: https://www.securitymagazine.com/articles/98473-dark-data-the-underestimated-cybersecurity-threat
- TED: The dark side of data: https://www.ted.com/playlists/130/the_dark_side_of_data
- Splunk: The state of dark data: https://www.splunk.com/content/dam/splunk2/en_us/gated/white-paper/the-state-of-dark-data.pdf
- AWS data storage master course: https://www.udemy.com/course/amazon-s3-master-course