So the resulting synthetic dataset looks like a real distribution, but it reveals nothing about any individual in the real-world dataset. The synthetic data points are all randomly generated, but within certain boundaries that preserve the original relationships (like height to weight).
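To make the height-to-weight idea concrete, here is a minimal sketch in Python. It assumes a very simple generator: summarise a small sample by its means, spreads and correlation, then draw fresh random points from a bell-curve model with those same properties. The ten "real" records and the function names are invented for illustration; production tools use far richer models.

```python
import math
import random
import statistics

# Hypothetical "real" records (height in cm, weight in kg), invented for illustration.
real = [(152, 51), (158, 57), (163, 60), (168, 66), (171, 69),
        (175, 72), (178, 79), (182, 83), (186, 88), (191, 95)]


def pearson(pairs):
    """Pearson correlation between the two columns of a list of pairs."""
    xs, ys = zip(*pairs)
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def fit(pairs):
    """Reduce the sample to five summary numbers; no individual row is kept."""
    hs, ws = zip(*pairs)
    return {
        "mean_h": statistics.fmean(hs), "sd_h": statistics.stdev(hs),
        "mean_w": statistics.fmean(ws), "sd_w": statistics.stdev(ws),
        "rho": pearson(pairs),
    }


def synthesise(params, n, seed=0):
    """Draw n random (height, weight) points from a bivariate normal model."""
    rng = random.Random(seed)
    rho = params["rho"]
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        height = params["mean_h"] + params["sd_h"] * z1
        # Blending z1 and z2 gives weight a correlation of rho with height.
        blend = rho * z1 + math.sqrt(1 - rho ** 2) * z2
        weight = params["mean_w"] + params["sd_w"] * blend
        rows.append((height, weight))
    return rows


params = fit(real)
synthetic = synthesise(params, 5000)
print("real correlation:     ", round(params["rho"], 3))
print("synthetic correlation:", round(pearson(synthetic), 3))
```

The synthetic points are new random draws, yet taller synthetic people still tend to be heavier. A sketch this simple carries no formal privacy guarantee; real generators typically add further safeguards.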
Gathering, labelling, and analysing data can be difficult, time-consuming, and expensive. Synthetic data will democratise these tasks by making useful data abundant, challenging what we used to call “big data”. It can provide the vast volumes of data needed to train machine-learning models and also allow for the testing of software applications in a controlled environment.
Synthetic data is taking over
The tech research and consulting firm Gartner predicts that over the next 10 years, synthetic data will start to massively overshadow real data in AI models. By 2024, Gartner projects, 60 per cent of data used for AI and machine learning will be synthetic data. Rob Toews claims that “the rise of synthetic data will completely transform the economics, ownership, strategic dynamics, even (geo)politics of data”. Toews cites Ofir Zuk, CEO and founder of synthetic data startup Datagen, who claims that the total addressable market of synthetic data and the total addressable market of data will converge.
Synthetic data was used as early as 1993, when the US Census Bureau released synthetic samples from the Census so that it didn’t disclose any real microdata. But the technology has advanced beyond recognition since then, and is now used in sectors as disparate as robotics, geospatial imagery, banking (e.g. to better analyse the risk of fraud), and genome studies into diseases.
Use cases in Aotearoa go back to at least 2009, when NIWA and MetService created synthetic 10-minute wind datasets at sites across the country, to help model the impact of wind farms on the national grid.
Other public-sector use cases here have included modelling ways to improve housing affordability in south Auckland and exploring options for the delivery of social services.
Access to oceans of data
Digital platforms are two-sided marketplaces, matching sellers and buyers – TradeMe is an obvious local example. Some, like Apple, both manage the marketplace and sell their own hardware and software.
Many, especially social media platforms, gain a competitive advantage from what they learn about their buyers and sellers. These datasets are key to the network effects that successful platforms ride to become ever more dominant.
Access to this data has allowed them to replace the 19th-century model of economies of scale in production with a new model of economies of scale in consumption – even a single person living in Okarito can order from Amazon. The bigger a platform gets, the better its ability to serve each customer in a bespoke way.
This phenomenon acts as a moat, as Warren Buffett might say – a barrier to entry so daunting that once a platform has achieved dominance, competitors fade away.
Synthetic data challenges this digital platform model in several ways, but there are opportunities as well as threats.
Less dependence on user data
Synthetic data can be used to train machine-learning algorithms to generate insights and to model different consumer offerings without relying on real data about buyers, sellers, and other users.
Many digital platforms generate much of their revenue by collecting user data and using it to target advertisements. If synthetic data can generate the same insights and enable the same modelling of consumer offerings, then it becomes harder for platforms to justify collecting, using, and keeping huge quantities of personal information.
Better insights and modelling
Synthetic datasets can also sometimes be more accurate than real-world datasets.
This counter-intuitive outcome arises for two reasons. First, synthetic datasets can be adjusted to correct for known bias. Second, the statistical properties designed into the synthetic data can more readily account for rare outlier events that can be missed in smaller real-world data collection – like when a giraffe runs in front of your car just as your front-left tyre blows out.
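Both adjustments can be sketched in a few lines of Python. The example assumes a real sample in which one group is over-represented, a population split that is known from another source, and a rare-event rate chosen by the modeller. All records, names and numbers are invented for illustration.

```python
import random
import statistics

# Hypothetical real sample: group "A" makes up 8 of the 10 records.
real = [("A", 30.0), ("A", 32.5), ("A", 29.0), ("A", 31.5), ("A", 33.0),
        ("A", 30.5), ("A", 28.5), ("A", 31.0), ("B", 41.0), ("B", 44.0)]

# Assumptions for this sketch: the true population is split evenly, and the
# modeller wants a rare event to appear in about 1 per cent of records.
TARGET_SHARE = {"A": 0.5, "B": 0.5}
RARE_EVENT_RATE = 0.01


def fit_by_group(records):
    """Summarise each group by the mean and spread of its values."""
    groups = {}
    for group, value in records:
        groups.setdefault(group, []).append(value)
    return {g: (statistics.fmean(vs), statistics.stdev(vs))
            for g, vs in groups.items()}


def synthesise(stats, n, seed=0):
    """Draw groups at the target shares, not the biased shares in the sample."""
    rng = random.Random(seed)
    names = list(TARGET_SHARE)
    weights = [TARGET_SHARE[g] for g in names]
    rows = []
    for _ in range(n):
        group = rng.choices(names, weights=weights)[0]
        mean, sd = stats[group]
        rows.append({
            "group": group,
            "value": rng.gauss(mean, sd),
            # The rare event is injected at the designed-in rate.
            "rare_event": rng.random() < RARE_EVENT_RATE,
        })
    return rows


synthetic = synthesise(fit_by_group(real), 10_000)
share_a = sum(r["group"] == "A" for r in synthetic) / len(synthetic)
rare_rate = sum(r["rare_event"] for r in synthetic) / len(synthetic)
print("share of group A:", round(share_a, 3))
print("rare-event rate: ", round(rare_rate, 4))
```

The real sample is 80 per cent group A and contains no rare events at all; the synthetic one is close to evenly split and includes them at the chosen rate. The quality of the result still depends on the assumed shares and rates being right.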
This is an opportunity for digital platforms, but also a threat, because sellers may be able to generate their own insights and do their own modelling without relying on the platforms’ datasets.
Better protection of personal data
A key advantage of synthetic data is that it allows companies to analyse data and model scenarios without using any data relating to real individuals.
Concerns about the use and security of personal data won’t go away, and regulation and best practice are only heading towards more restrictions. Synthetic data offers companies a way off this track without stifling innovation in their offerings.
Software and machine-learning models can be tested and validated using synthetic data at vastly larger scales and without security and privacy risks.
Stupendously large synthetic datasets underpin the enormous investment in self-driving vehicles (using real-world video from vehicles driving the streets as input to the machine-learning models would take at least 300 years of driving). As an aside, this use has the greatest concentration of synthetic data engineers.
A pathway to a new competitive advantage?
Synthetic data is evolving fast, but already it provides advantages over real-world datasets, as this diagram illustrates:
Global platforms like Meta and Google have vast amounts of data and, of course, also have the resources to use synthetic data to good effect themselves to match any new competitor. That, coupled with their many other revenue streams, means they are not at risk. However, smaller firms that rely on their customer data for their competitive advantage, including here in Aotearoa, need to take a close look at their business model. They may find synthetic data offers a pathway to a new competitive advantage.
- Kevin Jenkins is a professional director, and founder of www.martinjenkins.co.nz, and writes about the intersection of business, innovation and regulation.