Training Models with Synthetic Data: OpenAI Dall-E Image Generation

“How can I train a model? I don’t have any data!” Data synthesis is here to help. We’ve explored this topic in recent videos on keyword spotting, physics simulations, and conveyor belt counting. Now we have put together a video exploring how generative AI (OpenAI Dall-E) can be used to generate an image dataset for detecting whether a user is wearing gloves.

Generative AI can help speed up your initial proof-of-concept development, allowing you to prove your use case without the expense of data collection. This dataset was generated by Dall-E in a matter of minutes:

Example of images generated with Dall-E to create a gloves vs. no-gloves dataset
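
To make the idea concrete, here is a minimal sketch of how such a dataset could be requested from the OpenAI Images API using the official `openai` Python package. It is not the code from the accompanying notebook. The subject phrases, settings, the `build_prompts` and `generate_dataset` helpers, and the choice of the `dall-e-2` model at 256x256 are all assumptions made for illustration; check which models and sizes your account supports. Varying the wording across prompts is one simple way to avoid a dataset in which every image looks the same.

```python
import itertools

# Hypothetical prompt ingredients, chosen for illustration only.
SUBJECTS = {
    "gloves": [
        "a hand wearing a blue nitrile glove",
        "a hand wearing a yellow work glove",
    ],
    "no_gloves": [
        "a bare hand with no glove",
        "an open bare palm",
    ],
}
SETTINGS = ["on a workbench", "in a warehouse", "against a plain wall"]


def build_prompts(subjects=SUBJECTS, settings=SETTINGS):
    """Return (label, prompt) pairs for every subject/setting combination."""
    return [
        (label, f"A photo of {phrase} {setting}")
        for label, phrases in subjects.items()
        for phrase, setting in itertools.product(phrases, settings)
    ]


def generate_dataset(client, prompts, images_per_prompt=1, size="256x256"):
    """Request images for each prompt and return (label, image_url) pairs.

    `client` is expected to behave like `openai.OpenAI()`.
    The model name below is an assumption, not taken from the article.
    """
    results = []
    for label, prompt in prompts:
        response = client.images.generate(
            model="dall-e-2",
            prompt=prompt,
            n=images_per_prompt,
            size=size,
        )
        results.extend((label, item.url) for item in response.data)
    return results


# Usage (needs the `openai` package and an OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   pairs = generate_dataset(OpenAI(), build_prompts(), images_per_prompt=4)
```

The returned URLs are temporary, so in practice you would download each image promptly and store it under a folder or filename that carries its label.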

There are a number of other use cases beyond simple image generation. The OpenAI tools can create variations of existing images, allowing you to expand an existing dataset. They can also extend images using a mask: you could pass in an image of your end environment, with transparency where you want your desired object to appear, and get back a usable dataset.
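
The two workflows described above could be sketched as follows with the `openai` Python package. The helper names, file paths, and the `dall-e-2` model choice are assumptions for illustration. As far as we understand the API, it expects square PNG inputs, and the mask must match the scene's dimensions, with fully transparent pixels marking the region to be filled in.

```python
def expand_with_variations(client, image_path, n=4, size="256x256"):
    """Ask for `n` variations of an existing image and return their URLs.

    `client` is expected to behave like `openai.OpenAI()`.
    """
    with open(image_path, "rb") as image_file:
        response = client.images.create_variation(
            model="dall-e-2",  # assumed model name
            image=image_file,
            n=n,
            size=size,
        )
    return [item.url for item in response.data]


def inpaint_object(client, scene_path, mask_path, prompt, n=4, size="256x256"):
    """Generate the prompted object inside the transparent area of the mask.

    `scene_path` is a photo of the end environment; `mask_path` is the same
    image with the target region made transparent.
    """
    with open(scene_path, "rb") as scene, open(mask_path, "rb") as mask:
        response = client.images.edit(
            model="dall-e-2",  # assumed model name
            image=scene,
            mask=mask,
            prompt=prompt,
            n=n,
            size=size,
        )
    return [item.url for item in response.data]


# Usage (hypothetical file names):
#   from openai import OpenAI
#   client = OpenAI()
#   urls = expand_with_variations(client, "glove_001.png")
#   urls = inpaint_object(client, "bench.png", "bench_mask.png",
#                         "a hand wearing a blue nitrile glove")
```

Because the mask approach reuses a photo of your real environment, the generated images may sit closer to what the deployed device will actually see than images generated from a text prompt alone.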

While these tools can be powerful, it is important to recognise their limitations. You may be introducing bias into your end model, as you have no control over the data that was used to train Dall-E. Bias can also come from the language used to prompt Dall-E.

Generative AI is also at risk of “hallucination”: Dall-E may generate images that do not accurately represent the intended object or scene. This is particularly true for complex or abstract concepts that may be difficult for Dall-E to understand and represent accurately. As a result, it is important to carefully evaluate the quality and accuracy of your dataset before using it for training. For production models, it is important to incorporate other sources of data to ensure that the resulting model is robust and accurate.

Edge Impulse provides some tools to help you identify outliers in your dataset. The Data Explorer gives you a complete view of all of your data in a 2D plane, enabling you to easily spot outliers and remove them.

Overview of the Data Explorer feature in Edge Impulse Studio with dataset generated with Dall-E 
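
If you want to apply the same idea outside the Studio, the sketch below flags samples that sit far from the rest of their class in a 2D projection. This is not how the Data Explorer works internally; it is a simple distance-from-median heuristic, and it assumes you already have 2D coordinates for each sample (for example from your own embedding followed by dimensionality reduction). The `find_outliers` helper and its `threshold` parameter are hypothetical.

```python
import math
import statistics


def find_outliers(points, threshold=3.0):
    """Return ids of samples that lie unusually far from their class centre.

    `points` is a list of (sample_id, label, x, y) tuples. A sample is flagged
    when its distance to the per-class median centre exceeds `threshold`
    times the median distance for that class.
    """
    by_label = {}
    for sample_id, label, x, y in points:
        by_label.setdefault(label, []).append((sample_id, x, y))

    outliers = []
    for members in by_label.values():
        # Component-wise median is less sensitive to outliers than the mean.
        cx = statistics.median(x for _, x, _ in members)
        cy = statistics.median(y for _, _, y in members)
        distances = [(sid, math.hypot(x - cx, y - cy)) for sid, x, y in members]
        typical = statistics.median(d for _, d in distances)
        if typical == 0:
            continue  # all points coincide, so there is no scale to compare against
        outliers.extend(sid for sid, d in distances if d > threshold * typical)
    return outliers
```

Flagged samples are candidates for a visual check rather than automatic deletion: a generated image can be far from its cluster because it is wrong, or simply because it is a rare but valid example.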

The Dall-E Image Generation tool is now available in a Transformation Block for enterprise users, allowing easy integration with your data pipeline. You can also check out the public project and accompanying Python Notebook to understand how it works.
