Updated: Jun 7, 2021
Artificial intelligence is poised to disrupt nearly every industry by the end of the decade with the promise of increased efficiencies, higher profitability, and smarter, data-driven business decisions.
And yet, as Gartner has publicized, 85% of AI projects fail. Four barriers are cited repeatedly: skills of staff; data quality; unclear business case; and security and privacy. A study by Dimensional Research revealed that 96% of organizations have problems with training data quality and quantity, and that most AI projects require more than 100,000 data samples for success.
Data security is an increasingly important consideration in nearly every industry. Privacy laws are expanding rapidly, leading to a shortage of available data sets; even if the data needed to train AI models exists, it may not be accessible because of compliance requirements.
As a result, companies are now searching for ways to adopt AI without large data sets. More data is not necessarily better. The key is good data, not just big data.
But what do you do when good data just isn’t available? Increasingly, enterprises are discovering the gap can be filled with synthetic data — a move that promises to revolutionize the industry, enabling more companies to use AI to improve processes and solve business problems with machine intelligence.
Synthetic data is artificial data generated by a computer program rather than collected from real-world events. Ideally, synthetic data is created from a “seed” of real data: a few false positives and negatives, and a few true positives and negatives. Those real pieces of data can then be manipulated in various ways to create a synthetic dataset good enough and large enough to drive the creation of successful AI models.
There are many synthetic data generators on the market for structured data, such as Gretel, MOSTLY AI, Synthetic IO, Synthesized IO, Tonic, and the open-source Synthetic Data Vault. Scikit-learn, a free, open-source machine learning library for Python, also has some synthetic data generation capabilities. Data scientists can also generate synthetic data manually without such tools, though it takes more effort.
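As a rough illustration of the scikit-learn capabilities mentioned above, the sketch below uses its `make_classification` function to generate a small structured dataset. The sample count, feature counts, and 90/10 class split are arbitrary values chosen for the example, not recommendations.

```python
from sklearn.datasets import make_classification

# Generate a purely synthetic, labeled tabular dataset.
# All parameter values here are illustrative assumptions.
X, y = make_classification(
    n_samples=1000,       # number of synthetic rows
    n_features=10,        # total columns
    n_informative=5,      # columns that actually carry signal
    n_redundant=2,        # linear combinations of informative columns
    weights=[0.9, 0.1],   # imbalanced classes, e.g. rare "bad" cases
    random_state=42,      # fixed seed so the data is reproducible
)

print(X.shape, y.shape)
print("share of minority class:", y.mean())
```

Data produced this way follows a statistical recipe rather than any real business process, so it is better suited to testing a pipeline than to standing in for domain data.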
Generative adversarial networks (GANs) are a type of neural network in which two models, a generator and a discriminator, are trained against each other so that the generator learns to produce realistic new samples that resemble real data. For images, GANs can add new samples to a dataset through techniques such as image blending and image translation. This type of work is labor-intensive but does provide a way to solve seemingly unsolvable AI challenges.
While several emerging synthetic data generators exist on the market today, these “out of the box” tools often cannot solve the problem without significant customization, cannot handle unstructured data sets such as photos and videos, or both.
Training an AI model for a global auto maker with synthetic data
A project my team recently worked on with one of the world’s top three auto manufacturers provides a good example of how you can quickly deploy synthetic data to fill a data gap.
Specifically, this example shows how to create synthetic data when the data takes the form of images. Because images are unstructured, manipulating them is more complex than manipulating numerical or text-based structured datasets.
The company has a product warranty system that requires customers and dealers to submit photos to file a warranty claim. The process of manually examining millions of warranty submissions is time-consuming and expensive. The company wanted to use AI to automate the process: create a model to look at the photos, simultaneously validate the part in question, and detect anomalies.
Creating an AI model to automatically recognize the product in the photos and determine warranty validity wasn’t an impossible task. The catch: for data privacy reasons, the available data set was inaccessible. Instead of the tens of thousands of product photos needed to train the AI models, the company could provide only a few dozen images. Frankly, I felt it was a showstopper. Without a sizable data set, conventional data science ground to a halt.
And yet, where there is a will, there is a way. We started with a few dozen images, a mixture of good and bad examples, and replicated those images using a proprietary tool for synthetic data. The tool applied creative filtering techniques, color scheme changes, and lighting changes, much as a studio designer does to create different effects.
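The proprietary tool itself is not described here, but the general idea of multiplying a seed image through lighting and color changes can be sketched with NumPy. Everything below is an assumption made for illustration: the function names, the parameter ranges, and the random array standing in for a real photo.

```python
import numpy as np

def adjust_lighting(image, gain=1.0, bias=0.0):
    """Scale and shift pixel intensities; image is a float array in [0, 1]."""
    return np.clip(image * gain + bias, 0.0, 1.0)

def shift_color(image, channel_gains):
    """Multiply each RGB channel by its own gain to simulate a color cast."""
    gains = np.asarray(channel_gains, dtype=image.dtype).reshape(1, 1, 3)
    return np.clip(image * gains, 0.0, 1.0)

def make_variants(image, n_variants, rng):
    """Create n_variants randomly relit and tinted copies of one seed image."""
    variants = []
    for _ in range(n_variants):
        gain = rng.uniform(0.6, 1.4)          # darker or brighter exposure
        bias = rng.uniform(-0.1, 0.1)         # overall lift or drop
        tint = rng.uniform(0.9, 1.1, size=3)  # mild per-channel color cast
        variants.append(shift_color(adjust_lighting(image, gain, bias), tint))
    return variants

rng = np.random.default_rng(0)
seed_image = rng.random((64, 64, 3))  # hypothetical stand-in for a real photo
variants = make_variants(seed_image, n_variants=20, rng=rng)
print(len(variants), variants[0].shape)
```

Each variant keeps the label of its seed image, which is what lets a few dozen labeled photos grow into a much larger labeled training set.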
One of the primary challenges of using synthetic data is thinking of every possible scenario and creating data that covers it. We started out with 30 to 40 warranty images from the auto manufacturer. Based on these few images, which included good and bad examples, we were able to create false positives, false negatives, true positives, and true negatives. We first trained the model to recognize the part in question for the warranty, then trained it to tell apart other things in the image, for example glare on the camera lens versus a scratch on a wheel.
The challenge was that, as we moved along, we found that outliers were missing from the dataset. When creating synthetic data, it is important to stop, look at the complete dataset, and see what might be needed to improve the success of the model at predicting what is in the photo. That means considering every possible variable including angles, lighting, blur, partial visibility, and more. Since many of the warranty photos were taken outside, we had to consider cloudy days, rain, and other environmental factors and add those to the synthetic photos as well.
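Some of the variables listed above can be simulated with simple array operations. The following sketch, again using NumPy and hypothetical function names and parameter values, shows a box blur for out-of-focus shots, a rectangular occlusion for partial visibility, and a flattened, dimmed look for overcast days. It is one plausible way to cover such cases, not the method used in the project.

```python
import numpy as np

def box_blur(image, radius):
    """Average each pixel with its neighbors within `radius` (simple box blur)."""
    if radius == 0:
        return image.copy()
    k = 2 * radius + 1
    h, w = image.shape[:2]
    # Repeat edge pixels so the output keeps the input size.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    out = np.zeros_like(image)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def occlude(image, top, left, height, width, fill=0.0):
    """Cover a rectangle to mimic a partially hidden part."""
    out = image.copy()
    out[top:top + height, left:left + width] = fill
    return out

def overcast(image, contrast=0.7, brightness=0.85):
    """Flatten contrast around mid-gray, then dim, to suggest a cloudy day."""
    flattened = (image - 0.5) * contrast + 0.5
    return np.clip(flattened * brightness, 0.0, 1.0)

rng = np.random.default_rng(1)
photo = rng.random((32, 32, 3))  # hypothetical stand-in for a warranty photo

blurred = box_blur(photo, radius=2)
hidden = occlude(photo, top=8, left=8, height=12, width=12)
cloudy = overcast(photo)
print(blurred.shape, hidden.shape, cloudy.shape)
```

Effects like rain or arbitrary camera angles need more involved transformations and are left out of this sketch.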
We started with a 70% success rate at identifying the right part and predicting whether it was good or bad and, hence, whether to apply the warranty. With further manipulation, the AI model steadily improved until we reached an accuracy rate above 90%.
The result: In under 90 days, the customer had a web-based proof of concept that let them upload any image and get two yes/no answers: whether the image contained the right part in question, and whether the part had in fact failed. An AI model was successfully trained with only a few dozen pieces of actual data, with the gaps filled in by synthetic data.
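For readers who want to see how an accuracy figure relates to the true and false positives and negatives discussed earlier, here is a small arithmetic sketch. The counts are invented for illustration and are not results from the project.

```python
def summarize(tp, fp, tn, fn):
    """Compute basic metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that were right
        "precision": tp / (tp + fp),     # of predicted positives, how many were real
        "recall": tp / (tp + fn),        # of real positives, how many were caught
    }

# Hypothetical counts for 100 test images.
metrics = summarize(tp=45, fp=3, tn=47, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Looking at precision and recall alongside accuracy helps show whether the remaining errors are mostly false alarms or mostly missed cases.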
Dataless AI comes of age
This story is not unique to auto makers. Exciting work is underway to revolutionize industries from insurance and financial services to health care, education, manufacturing, and retail.
Synthetic data does not make real data irrelevant or unnecessary. Synthetic data is not a silver bullet. However, it can achieve two key things:
- Fast-track proofs-of-concept to understand their viability;
- Accelerate AI model training by augmenting real data.
Make no mistake: data, and importantly unified data across the enterprise, is the key to competitive advantage. The more real data an AI system is trained on, the smarter it gets.
For many enterprises today, each AI project represents millions or tens of millions of dollars and years of effort. However, if companies can validate proofs of concept in months — not years — with limited data sets bolstered with synthetic data, AI costs will radically decrease, and AI adoption will accelerate at an exponential pace.