Where synthetic data was once viewed as a poor substitute for real data, some now see it as a panacea. Real data is messy and riddled with bias, and new data privacy regulations make it hard to collect. Synthetic data, by contrast, is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces of different ages, shapes, and ethnicities, for example, to build a face detection system that works across populations.
But synthetic data has its limitations. If it fails to reflect reality, it could end up producing AI that performs even worse than AI trained on messy, biased real-world data, or it could simply inherit the same problems. “What I don’t want to do is give the thumbs up to this model and say, ‘Oh, that would solve a lot of problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA. “Because it would also ignore a lot of things.”
Realistic, not real
Deep learning has always been about data. But in the past few years, the AI community has learned that good data is more important than big data. Even a small amount of cleanly labeled data can do more to improve an AI system’s performance than ten times as much uncurated data, or even a more advanced algorithm.
Datagen CEO and co-founder Ofir Chakon says this is changing how companies must approach developing their AI models. Today, they start by gathering as much data as possible and then tweak and tune their algorithms for better performance. Instead, they should do the opposite: keep the algorithm fixed and optimize the composition of their data.
But collecting the real-world data needed for this kind of iterative experimentation is too costly and time consuming. This is where Datagen comes in. With a synthetic data generator, teams can create and test dozens of new data sets a day to determine which ones maximize a model’s performance.
To ensure its data is realistic, Datagen gives its vendors detailed instructions on how many individuals to scan in each age group, BMI range, and ethnicity, as well as a specific list of actions for them to perform, such as walking around a room or drinking a soda. Vendors send back high-resolution still images and motion capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthetic data is sometimes checked against real data: fake faces are overlaid on real faces, for example, to see whether they look realistic.
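The combinatorial expansion described above can be illustrated with a toy sketch. The attribute names and values below are hypothetical stand-ins, not Datagen's actual parameters; the point is only that a handful of scanned subjects and captured actions multiply quickly once rendering variations are layered on:

```python
from itertools import product

# Hypothetical illustration: a few base scans and motions, varied by
# rendering parameters, expand into a much larger synthetic data set.
subjects = ["subject_01", "subject_02"]          # scanned individuals
actions = ["walk", "drink_soda"]                 # captured motions
lightings = ["daylight", "indoor", "night"]      # rendering variations
camera_angles = ["front", "side", "overhead"]

variants = [
    {"subject": s, "action": a, "lighting": l, "camera": c}
    for s, a, l, c in product(subjects, actions, lightings, camera_angles)
]
print(len(variants))  # 2 * 2 * 3 * 3 = 36 combinations from just 2 scans
```

Adding one more axis of variation with five values, such as background scenes, would already multiply the output to 180 samples, which is how a few dozen scans can become hundreds of thousands of combinations.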
Datagen is now creating facial expressions to monitor driver attention in smart cars, body movements to track customers in cashier-free stores, and iris and hand movements to improve the eye and hand tracking capabilities of VR headsets. The company says its data has already been used to develop computer vision systems that serve tens of millions of users.
It’s not just artificial humans being manufactured at scale. Click-Ins is a startup that uses AI to perform automated vehicle inspections. Using design software, it recreates all the car makes and models its AI needs to recognize and then renders them in different colors, with different damage and distortions, under different lighting conditions and against different backgrounds. This lets the company update its AI when automakers roll out new models, and helps it avoid data privacy violations in countries where license plates are considered private information and therefore cannot appear in images used to train AI.
Mostly AI works with financial, telecom, and insurance companies to provide spreadsheets of fake customer data that let companies share their customer databases with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. But synthetic data can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data the company does not yet have, including a more diverse customer population or scenarios such as fraudulent activity.
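A minimal sketch of that idea, assuming a toy two-column customer table and the simplest possible generative model (fit a multivariate Gaussian to the real data, then sample fresh rows from the fit; commercial tools use far richer models such as GANs, but the principle is the same):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" customer table: age and annual spend (illustrative columns,
# not any vendor's actual schema).
real = rng.multivariate_normal(
    mean=[40.0, 2000.0],
    cov=[[100.0, 1500.0], [1500.0, 250000.0]],
    size=5000,
)

# Fit the distribution's summary statistics on the real data...
mu = real.mean(axis=0)
sigma = np.cov(real, rowvar=False)

# ...then sample entirely new rows. No synthetic row corresponds to a
# real customer, but the fake table matches the real one statistically.
synthetic = rng.multivariate_normal(mu, sigma, size=5000)

# Correlations in the synthetic set track the real ones closely.
print(np.corrcoef(real, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```

Because the generator captures distributions rather than records, the same machinery can also oversample rare cases, such as fraudulent transactions, that the real data contains too few of.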
Proponents say synthetic data can help evaluate AI, too. In a recent paper presented at an AI conference, Suchi Saria, an associate professor of machine learning and health care at Johns Hopkins University, and her co-authors showed how data-generation techniques can be used to extrapolate different patient populations from a single set of data. This could be useful, for example, if a company had data only from a younger population in New York City but wanted to understand how its AI performs on an aging population with a higher prevalence of diabetes. Saria has now started her own company, Bayesian Health, which will use the technique to help test medical AI systems.
The limits of counterfeiting
But is synthetic data overhyped?
When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data-generation techniques have been shown to closely reproduce images or text found in their training data, for example, while others are vulnerable to attacks that make them regurgitate that data entirely.
This might be fine for a firm like Datagen, whose synthetic data is not meant to conceal the identities of the individuals who consented to be scanned. But it would be bad news for companies that pitch their solutions as a way to protect sensitive financial or patient information.
Research indicates that the combination of two synthetic-data techniques in particular, differential privacy and generative adversarial networks, can produce the strongest privacy protections, says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance can get lost in the marketing language of synthetic-data vendors, which won’t always be forthcoming about what techniques they are using.
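Differential privacy, one of the two techniques Herman points to, can be sketched via its most basic building block, the Laplace mechanism: a query's true answer is perturbed with noise calibrated to the query's sensitivity and a privacy budget epsilon. This toy counting query is only illustrative, not the GAN combination the research evaluated:

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_count(data, threshold, epsilon):
    """Differentially private count of values above a threshold.

    A counting query changes by at most 1 when any one person's record
    is added or removed (sensitivity = 1), so Laplace noise with scale
    1/epsilon suffices for epsilon-differential privacy.
    """
    true_count = sum(1 for x in data if x > threshold)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Smaller epsilon means more noise and stronger privacy.
noisy = laplace_count(range(100), threshold=49.5, epsilon=1.0)
print(round(noisy, 2))
```

Lower values of epsilon buy stronger privacy guarantees at the cost of accuracy; a generative model trained under such a budget inherits the same guarantee for every synthetic sample it emits, which is what makes the pairing attractive.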