Modern Australian
The Times

Tech companies are turning to ‘synthetic data’ to train AI models – but there’s a hidden cost

Written by James Jin Kang, Senior Lecturer in Computer Science, RMIT University Vietnam

Last week the billionaire and owner of X, Elon Musk, claimed the pool of human-generated data that’s used to train artificial intelligence (AI) models such as ChatGPT has run out.

Musk didn’t cite evidence to support this. But other leading tech industry figures have made similar claims in recent months. And earlier research indicated human-generated data would run out within two to eight years.

This is largely because humans can’t create new data such as text, video and images fast enough to keep up with the rapidly growing demands of AI models. When genuine data does run out, it will present a major problem for both developers and users of AI.

It will force tech companies to depend more heavily on data generated by AI, known as “synthetic data”. And this, in turn, could make the AI systems currently used by hundreds of millions of people less accurate and reliable, and therefore less useful.

But this isn’t an inevitable outcome. In fact, if used and managed carefully, synthetic data could improve AI models.

Image: A phone running the ChatGPT application in front of the OpenAI logo. Tech companies such as OpenAI are using more synthetic data to train AI models. (T. Schneider/Shutterstock)

The problems with real data

Tech companies depend on data – real or synthetic – to build, train and refine generative AI models such as ChatGPT. The quality of this data is crucial. Poor data leads to poor outputs, in the same way using low-quality ingredients in cooking can produce low-quality meals.

Real data refers to text, video and images created by humans. Companies collect it through methods such as surveys, experiments, observations or mining of websites and social media.

Real data is generally considered valuable because it includes true events and captures a wide range of scenarios and contexts. However, it isn’t perfect.

For example, it can contain spelling errors and inconsistent or irrelevant content. It can also be heavily biased, which can lead generative AI models to create images that show, say, only men or only white people in certain jobs.

This kind of data also requires a lot of time and effort to prepare. First, people collect datasets, before labelling them to make them meaningful for an AI model. They will then review and clean this data to resolve any inconsistencies, before computers filter, organise and validate it.

This process can take up to 80% of the total time investment in the development of an AI system.
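
The cleaning step described above can be illustrated with a minimal Python sketch. The `Record` type, the `ALLOWED_LABELS` set and the sample rows are hypothetical and exist only for illustration; real pipelines are far larger and more varied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    text: str
    label: str


# Hypothetical label set, assumed only for this example.
ALLOWED_LABELS = {"positive", "negative"}


def clean(records):
    """Normalise whitespace, drop empty or mislabelled rows, remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        text = " ".join(record.text.split())  # collapse repeated whitespace
        label = record.label.strip().lower()  # make labels consistent
        if not text or label not in ALLOWED_LABELS:
            continue  # irrelevant or invalid entry
        key = (text.lower(), label)
        if key in seen:
            continue  # duplicate that differs only in case or spacing
        seen.add(key)
        cleaned.append(Record(text, label))
    return cleaned


raw = [
    Record("  Great   phone ", "Positive"),
    Record("great phone", "positive"),       # duplicate of the first row
    Record("", "negative"),                  # empty text
    Record("Battery died fast", "negativ"),  # misspelled label
    Record("Screen cracked", "negative"),
]
cleaned = clean(raw)
```

Of the five raw rows, only two survive: the duplicate, the empty row and the mislabelled row are all dropped. Even this toy version shows why preparation takes so much effort, since every rule has to be chosen and checked by a person.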

But as stated above, real data is also in increasingly short supply because humans can’t produce it quickly enough to feed burgeoning AI demand.

The rise of synthetic data

Synthetic data is artificially created or generated by algorithms, such as text generated by ChatGPT or an image generated by DALL-E.

In theory, synthetic data offers a cost-effective and faster solution for training AI models.

It also addresses privacy concerns and ethical issues, particularly with sensitive personal information like health data.

Importantly, unlike real data, it isn’t in short supply. In fact, it can be generated on demand in effectively unlimited quantities.

The challenges of synthetic data

For these reasons, tech companies are increasingly turning to synthetic data to train their AI systems. Research firm Gartner estimates that by 2030, synthetic data will become the main form of data used in AI.

But although synthetic data offers promising solutions, it is not without its challenges.

A primary concern is that AI models can “collapse” when they rely too much on synthetic data. This means they generate so many “hallucinations” (responses that contain false information) and decline so much in quality and performance that they become unusable.

For example, AI models already struggle with spelling some words correctly. If their mistake-riddled output is used to train other models, those models are likely to replicate the errors.

Synthetic data also carries a risk of being overly simplistic. It may be devoid of the nuanced details and diversity found in real datasets, which could result in the output of AI models trained on it also being overly simplistic and less useful.
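
One mechanism behind this loss of diversity can be shown with a toy Python sketch. It is not a simulation of real model training: it assumes each “generation” simply resamples, with replacement, from the previous generation’s output. The vocabulary, sample size and function name are all illustrative assumptions.

```python
import random


def resample_generations(data, generations, sample_size, seed=0):
    """Track how many distinct items survive when each generation is
    drawn only from the previous generation's output."""
    rng = random.Random(seed)
    history = [len(set(data))]
    current = list(data)
    for _ in range(generations):
        # The next "model" only ever sees what the previous one produced.
        current = rng.choices(current, k=sample_size)
        history.append(len(set(current)))
    return history


# Hypothetical "real" data: a few common words and many rare ones.
vocabulary = [f"word{i}" for i in range(50)]
real_data = [w for i, w in enumerate(vocabulary) for _ in range(50 // (i + 1))]

history = resample_generations(real_data, generations=20, sample_size=100)
print(history)  # distinct words remaining after each generation
```

Once a rare word fails to appear in one generation, it can never reappear in a later one, so the count of distinct words can only stay the same or fall. Mixing fresh real data back in at each step is what would break this pattern.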

Creating robust systems to keep AI accurate and trustworthy

To address these issues, it’s essential that international bodies and organisations such as the International Organisation for Standardisation or the United Nations’ International Telecommunication Union introduce robust systems for tracking and validating AI training data, and ensure the systems can be implemented globally.

AI systems can be equipped to track metadata, allowing users or systems to trace the origins and quality of any synthetic data they have been trained on. This would complement a globally standardised tracking and validation system.
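
As a rough sketch of what such metadata might look like, the Python below attaches a provenance record to each training sample. The field names, the example sources and the `synthetic_share` helper are hypothetical and do not follow any existing standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    origin: str                      # "human" or "synthetic"
    source: str                      # where the sample was collected from
    generator: Optional[str] = None  # model that produced it, if synthetic


@dataclass(frozen=True)
class Sample:
    text: str
    provenance: Provenance


def synthetic_share(samples):
    """Return the fraction of samples marked as synthetic."""
    if not samples:
        return 0.0
    synthetic = sum(1 for s in samples if s.provenance.origin == "synthetic")
    return synthetic / len(samples)


# Illustrative samples; the sources and generator name are made up.
samples = [
    Sample("Survey answer A", Provenance("human", "survey")),
    Sample("Forum post B", Provenance("human", "web")),
    Sample("Interview note C", Provenance("human", "interview")),
    Sample("Generated summary D", Provenance("synthetic", "pipeline", "example-model")),
]
print(synthetic_share(samples))  # 0.25
```

With records like these, a developer or auditor could check how much of a training set is synthetic and which system produced it before deciding whether to use it.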

Humans must also maintain oversight of synthetic data throughout the training process of an AI model to ensure it is of a high quality. This oversight should include defining objectives, validating data quality, ensuring compliance with ethical standards and monitoring AI model performance.

Somewhat ironically, AI algorithms can also play a role in auditing and verifying data, ensuring the accuracy of AI-generated outputs from other models. For example, these algorithms can compare synthetic data against real data to identify any errors or discrepancies, helping to ensure the data is consistent and accurate. So in this way, synthetic data could lead to better AI models.
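
A very simple form of this comparison can be sketched in Python. The example below assumes the data has already been reduced to category labels and uses total variation distance, which is one possible measure among many. The job labels are invented for illustration.

```python
from collections import Counter


def distribution(values):
    """Convert a non-empty list of labels into relative frequencies."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(real, synthetic):
    """Return 0.0 for identical label distributions, up to 1.0 for disjoint ones."""
    p = distribution(real)
    q = distribution(synthetic)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(l, 0.0) - q.get(l, 0.0)) for l in labels)


# Hypothetical labels attached to images of people at work.
real_jobs = ["nurse"] * 5 + ["engineer"] * 5
synthetic_jobs = ["nurse"] * 2 + ["engineer"] * 8

distance = total_variation_distance(real_jobs, synthetic_jobs)
print(round(distance, 3))  # 0.3
```

A distance near zero suggests the synthetic data mirrors the real data on this measure, while a larger value flags a skew worth a human look. Note that it says nothing about whether the real data was biased to begin with.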

The future of AI depends on high-quality data. Synthetic data will play an increasingly important role in overcoming data shortages.

However, its use must be carefully managed to maintain transparency, reduce errors and preserve privacy – ensuring synthetic data serves as a reliable supplement to real data, keeping AI systems accurate and trustworthy.


Read more https://theconversation.com/tech-companies-are-turning-to-synthetic-data-to-train-ai-models-but-theres-a-hidden-cost-246248
