Machine Learning: Tech Startups Can and Do Use Fake Data for AI Training

Machine learning algorithms need a lot of data to solve problems. Some AI companies don’t have that. What they have is fake data.

One startup based in Berlin had a very interesting problem to solve last spring.

The startup lets people step inside any video they like.

Last spring, it had started work on a different kind of augmented reality application.

Think of Snapchat’s selfie filters.

And then turn them into full-body versions.

Using the app, a user would hold up a smartphone and see friends’ bodies transform with special effects such as flames or fur.

To make sure that the app worked correctly, the startup needed to train its machine-learning algorithms to track human bodies in the video both closely and very quickly.

But it is just a tech startup.

And a scrappy one at that.

In other words, the company simply did not have the resources to collect tens or hundreds of thousands of hand-labeled images.

Technology startups and AI companies alike typically need this kind of data to teach machine learning algorithms in projects of this type.

Max Schneider, the company’s CTO, recently said that he found it really hard to keep a startup going in the AI industry.

That is because companies like his, according to Max, cannot afford to pay for enough data to train their machine learning algorithms.

The company is still running.

So, the obvious question that arises from this is, how did it manage to survive without having a ton of data?

Well, Max came up with a unique solution, of sorts.

The company fabricated the data.

Engineers working at the company began to create its very own labeled images.

They then used these images to train their machine learning algorithms.

And how did they do that?

They did that by adapting techniques that people use to make video game and movie graphics.



And it worked for the company very well.

Roughly 12 months later, the company has come up with 10 million images.

Engineers at the company created these images by pasting digital humans, which the company likes to call “simulants,” into photographs of real-world scenes.

So of course, the images themselves look slightly weird.

But, as far as the company is concerned, they work.

Think of such techniques as putting the artificial into artificial intelligence.
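To make the pasting idea concrete, here is a minimal sketch in Python. It is not the startup's actual pipeline, which the article does not describe in detail. The function names (`paste_figure`, `make_labeled_examples`), the tiny grid "images," and the bounding-box label format are all assumptions made for illustration. The point it shows is that when you place the figure yourself, the label comes for free.

```python
import random


def paste_figure(background, figure, mask, top, left):
    """Paste `figure` onto a copy of `background` wherever `mask` is 1.

    Returns the composite image and a bounding-box label
    (top, left, bottom, right) with bottom and right exclusive.
    The box covers the figure's full rectangle, not just the masked pixels.
    """
    composite = [row[:] for row in background]  # leave the original photo untouched
    for r, mask_row in enumerate(mask):
        for c, keep in enumerate(mask_row):
            if keep:
                composite[top + r][left + c] = figure[r][c]
    label = (top, left, top + len(mask), left + len(mask[0]))
    return composite, label


def make_labeled_examples(background, figure, mask, count, seed=0):
    """Generate `count` (image, label) pairs with random figure placement."""
    rng = random.Random(seed)
    max_top = len(background) - len(mask)
    max_left = len(background[0]) - len(mask[0])
    examples = []
    for _ in range(count):
        top = rng.randint(0, max_top)
        left = rng.randint(0, max_left)
        examples.append(paste_figure(background, figure, mask, top, left))
    return examples


# Toy data (hypothetical): a 6x8 "photo" of zeros and a 2x2 figure of nines.
background = [[0] * 8 for _ in range(6)]
figure = [[9, 9], [9, 9]]
mask = [[1, 1], [1, 0]]  # the figure does not fill its whole rectangle

examples = make_labeled_examples(background, figure, mask, count=3)
for image, box in examples:
    print(box)
```

No human ever has to annotate these images, because the code that places the figure already knows where it is.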

Another engineer working at the company, Adam Schuster, said that the company trained its models solely on synthetic data.

And according to Schuster, models trained on synthetic data work pretty much like models that the company has trained on real data.

The company also showed a demo to reporters last week.

The demo itself showed a table.

And a monkey.

A virtual monkey.

The virtual monkey appeared on a table and jumped to the ground nearby.

Of course, users have to view all of this through the camera of a smartphone such as an iPhone.

The virtual monkey continued to move around and then squirted paint directly onto the clothes of a real person who stood nearby.

Perhaps there is nothing wrong with this approach.


Because operating a startup is a tough business.

Especially when larger competitors are always stalking little startups like this one.

These startups want to survive in the harsh AI market.

And to do that successfully, these startups usually follow the motto “fake it ’til you make it.”

Of course, faking it can lead startups into trouble as well.

Some companies, such as Theranos, the now almost defunct blood-test “innovator,” ran into big trouble last year.

But Theranos worked in another industry, one where technology companies deal with real humans and their health. Startups like the Berlin one fall completely within the world of machine learning.

This is an industry where spoofing different kinds of training data is not a problem.

In fact, it has become a legitimate and effective strategy for startups to jump-start their projects when they are short on real training data or cash.

Time and time again, the media has mentioned that data is like the new oil as far as the fields of machine learning and artificial intelligence are concerned.

If that is true, then the technique of using fake data to train machine learning algorithms is exactly like brewing biodiesel.

But doing it in one’s backyard instead of a lab.


Some don’t call it fake data.

They call it phony data.

And this “data” movement could accelerate the use of all kinds of artificial intelligence techniques in new areas of business and life.

Of course, one can’t ignore the fact that human intelligence is not the same as machine learning algorithms.

Unlike human intelligence, machine-learning algorithms are actually pretty inflexible.

That is why startups run into problems when they try to apply machine learning algorithms to new situations and problems.

Generally, such an endeavor requires machine learning engineers to use new training data.

That training data must be specific to the new problem or situation that a company wants to handle.

Take Neuromation.

It is a technology startup based in Tallinn, Estonia.

The company churns out images which contain simulated pigs.


Well, apparently that’s just a part of the company’s work for its clients.

But what type of clients?

Clients who want to use camera equipment in order to track their livestock’s growth.

Big technology companies such as Microsoft, Google, and Apple have all published research papers that noted the convenience of using fake or synthetic data to train machine learning algorithms.

Evan Nisselson, a partner at the venture firm LDV Capital, recently said that synthetic data offered a lot to AI startups.

This fake data offered them hope.

Hope that they could also compete with AI giants who are swimming in data.

Even talented AI teams are regularly hamstrung by a lack of data.

He also said that the ability to create and use synthetic data to train machine learning models could help AI startups level the playing field.

Big companies and startups may have to share the AI market a lot more if startups can leverage synthetic data.

One recent story about synthetic data adds weight to that argument.

Back in February, the social media giant Facebook unveiled its own machine learning software, DensePose.

Facebook uses DensePose to apply special effects to videos that have humans in them.

Engineers working at Facebook used more than 50,000 images of people and 5 million hand-annotated points to train the machine learning algorithm.

After Facebook disclosed DensePose, it did not take long for the Berlin startup and others like it to take notice and begin synthesizing their own data to build software that would perform similarly to Facebook’s.

Additionally, the startup has integrated ideas from Facebook’s DensePose into its own products.

Other AI startups including Neuromation want to approach the market a different way.

They want to establish themselves not necessarily as providers of AI products but as brokers.

Brokers who provide fake data to other AI startups.

Neuromation has launched a lot of projects simultaneously.

One of those projects involves the company creating images of various shelves at grocery stores for OSA Hybrid Platform.

OSA Hybrid Platform is a new retail analytics company that deals with customers such as Auchan, the French supermarket group.

OSA HP uses the data to train algorithms to read images.

After models have learned how to do that, they can move on to tracking stock on the shelves.

According to Alex Isaev, the CEO of OSA HP, the sheer number of product categories, along with widely varying retail environments, made gathering and then labeling images very impractical.

Ofir Chakon, the co-founder of DataGen (a startup in Israel), recently mentioned that his company charged clients seven-figure sums to generate custom videos of creepily realistic, but still simulated, hands.

How does DataGen attain such realism?

Well, in part, the realism comes from a technique that has recently become trendy in machine learning circles.

This technique goes by the name of generative adversarial networks.



Generative adversarial networks can help machine learning engineers to create images which are photo-realistic.

It is true that no human eye would pass Neuromation’s “synthetic” pigs and DataGen’s hands as real.

But that is beside the point.

According to Schuster, when he first saw the fake (or synthetic) dataset, he thought to himself that it was a terrible idea.

How was it possible that the machine could learn anything from those images?

After a while, Schuster came to realize that his perception of the images did not matter.

What mattered was that the machine understood a lot from a single image.

That said, there is little doubt that getting the computer to learn the right thing can take a serious amount of work. In the beginning, the startup synthesized only naked figures.

But then the company found that the software actually learned only to look for skin.

That has changed now.

The company’s systems can now generate different types of people with varying:

  • Body shapes
  • Clothing
  • Hair
  • Skin tones
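One simple way to get that kind of variety is to sample each attribute at random for every generated figure. The sketch below assumes hypothetical attribute values and a hypothetical `sample_figure_spec` helper. The article names only the four categories, not how the company's generator actually works.

```python
import random
from collections import Counter

# Hypothetical values: the article lists the four categories only.
ATTRIBUTES = {
    "body_shape": ["slim", "average", "broad"],
    "clothing": ["t-shirt", "coat", "dress"],
    "hair": ["short", "long", "none"],
    "skin_tone": ["light", "medium", "dark"],
}


def sample_figure_spec(rng):
    """Pick one value per attribute, independently and uniformly."""
    return {name: rng.choice(values) for name, values in ATTRIBUTES.items()}


rng = random.Random(42)
specs = [sample_figure_spec(rng) for _ in range(1000)]

# Check how evenly one attribute is covered across the generated figures.
clothing_counts = Counter(spec["clothing"] for spec in specs)
print(clothing_counts)
```

Because every attribute varies independently, a model trained on figures like these has less chance to latch onto a single shortcut such as bare skin.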

The startup and other AI companies like it often train their machine learning systems on only a very small number of real images.

But they find a lot of success when, in addition to real images, they also use millions and millions of synthetic examples.
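The article does not say how these companies combine the two sources, so the following is only one plausible sketch. It assumes a hypothetical `mixed_batches` helper that fills a fixed fraction of every training batch from the small real pool, reusing real examples as needed, and draws the rest from the large synthetic pool.

```python
import random


def mixed_batches(real, synthetic, batch_size, real_fraction, num_batches, seed=0):
    """Yield batches that mix a few real examples with many synthetic ones.

    Real examples are drawn with replacement because the pool is small.
    Synthetic examples are drawn without replacement within each batch.
    """
    rng = random.Random(seed)
    n_real = max(1, round(batch_size * real_fraction))
    n_synthetic = batch_size - n_real
    for _ in range(num_batches):
        batch = rng.choices(real, k=n_real) + rng.sample(synthetic, n_synthetic)
        rng.shuffle(batch)
        yield batch


# Toy pools (hypothetical): 5 real examples versus 1,000 synthetic ones.
real = [("real", i) for i in range(5)]
synthetic = [("synthetic", i) for i in range(1000)]

batches = list(
    mixed_batches(real, synthetic, batch_size=8, real_fraction=0.25, num_batches=4)
)
for batch in batches:
    n_real = sum(1 for source, _ in batch if source == "real")
    print(n_real, "real of", len(batch))
```

The `real_fraction` value here is arbitrary. In practice it would be something to tune against a held-out set of real images.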

Using synthetic data isn’t a stigma anymore.


Because some of the world’s most cash-rich and data-rich AI teams have already embraced fake, or synthetic, data.

For example, it is a well-known fact that researchers working for Google routinely train robots in nothing but simulated worlds.

Other technology giants such as Microsoft have, in the last year alone, published results showing how just 2 million synthetic, or fake, sentences might improve translation of the Levantine dialect of Arabic.

Of course, not all companies are as eager as Google and Microsoft to publish AI research and thereby reveal their AI aspirations.

Some technology companies like to keep their AI plans a bit more under the radar.

One of those technology companies is Apple.

The reason the company keeps its AI aspirations secret is a topic for another day, but what is relevant to this post is that even Apple has signaled its interest in using fake data to train machine learning algorithms.

Back in 2016, Apple released a research paper of its own on the subject.

The paper talked about how to generate realistic images of human eyes in order to improve and enhance gaze-detection software applications.

Not even a year later, Apple released its latest smartphone model in the form of the iPhone X.

The iPhone X can unlock itself by recognizing the user’s face and detecting the user’s gaze.

Both projects took advantage of contributions from the same set of researchers.

Apple, being Apple, declined to make a comment on whether the company incorporated its research findings to come up with the unlocking feature of the iPhone X.

In the robotics industry, researchers have started to use synthetic training data to carry out many different types of experiments at a much greater scale than before.

Before, they couldn’t, because such scales were not possible in the limited physical world.

Waymo, a company that Alphabet owns, recently said that its self-driving cars had driven many millions of real-world miles on public roads.

But that is small by comparison, because Waymo’s car-control software has traveled billions of miles on simulated streets.

Some think that machines may well need a digital double.


When machines do have that luxury, robots are able to learn complex tasks, such as handling objects in factories or homes, more quickly.

Elon Musk, the billionaire of Tesla fame, co-founded OpenAI, an AI research institute whose researchers work on various AI-related problems.

These researchers have found that they can train software models in a simulated world that, for all intents and purposes, work reasonably well on real robots too.

There are many tricks that researchers use to advance their machine learning projects.

Sometimes they randomly vary textures and colors present in the given simulated world.

This forces the software to focus only on the core of the given problem.

Researchers also generate millions of different, oddly shaped objects for the software to grasp.

According to Josh Tobin, a researcher at OpenAI, the prevailing belief in the AI community two years ago was that simulated or synthetic data did not have much use.
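The trick of randomly varying textures and colors can be sketched in a few lines. This is an illustration only, not OpenAI's code. The `random_scene` function, the texture names, and the parameter ranges are all assumptions. The idea is that the label (where the object is) is sampled separately from the appearance, so appearance carries no information about the answer.

```python
import random

# Hypothetical texture names for the simulated surfaces.
TEXTURES = ["checker", "stripes", "noise", "flat"]


def random_color(rng):
    """Return a random RGB color as three integers in 0..255."""
    return tuple(rng.randint(0, 255) for _ in range(3))


def random_scene(rng):
    """Sample one simulated scene as an (appearance, label) pair.

    The label is the object's (x, y) position on a unit-square table.
    Appearance is drawn independently of the label.
    """
    label = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
    appearance = {
        "object_color": random_color(rng),
        "table_color": random_color(rng),
        "texture": rng.choice(TEXTURES),
        "light_intensity": rng.uniform(0.3, 1.5),
    }
    return appearance, label


rng = random.Random(0)
scenes = [random_scene(rng) for _ in range(100)]

appearance, label = scenes[0]
print(appearance["texture"], label)
```

If the software sees the same task under enough random appearances, the hope is that the real world ends up looking like just one more variation.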

But, Josh mentioned, that belief had changed a lot in the last year or so.

In other words, AI researchers are beginning to shift their perceptions about synthetic data.

In spite of all these successes, AI startups and giant companies aren’t prepared to consider fake data as omnipotent.

According to DataGen’s Chakon, researchers simply do not understand many complex problems in the AI field.

And hence can’t simulate them realistically.

In other cases, people and organizations have too much at stake to risk developing a machine learning system that has any disconnect from reality.

A University of Iowa professor, Michael Abramoff, has developed methods to generate retina images.

He recently stated that he used synthetic or fake data in most of his graduate student projects.

That said, he used only real images when it came to developing his retina-checking software.

Michael’s startup, IDx, convinced the FDA to approve its retina-checking software last month.

Abramoff said that his company wanted to make sure it came off as maximally conservative.


Teaching machines to perform complex but useful tasks is difficult.

However, machine-learning projects will give rise to lots of supporting industries.

There are human beings on this planet who make a decent living by performing on video in order to school various machine learning algorithms.

The search engine giant Google has already started to train algorithms that can train other algorithms, in order to accelerate the company’s embrace of technologies such as artificial intelligence.

Interesting part?

It did that last year.


Zohair A. Zohair is currently a content crafter at Security Gladiators and has been involved in the technology industry for more than a decade. He is an engineer by training and, naturally, likes to help people solve their tech related problems. When he is not writing, he can usually be found practicing his free-kicks in the ground beside his house.