Researchers at DeepMind, Google's United Kingdom-based AI subsidiary, have demonstrated that deep neural networks have an extraordinary capacity to comprehend a scene and represent it in a remarkably compact format. Having built that representation, the networks can then "imagine" what the same scene would look like from a perspective they have never seen before.
The researchers note that humans are already quite good at this kind of task. Shown a picture of a table with only three legs visible, for example, most people intuitively understand that the table most likely has a fourth leg on the opposite side. Looking at the same picture, most people would also infer that the wall behind the table probably matches the color of the visible portions. With enough practice, humans can even learn to sketch the scene from an entirely different angle, accounting for shadows, perspective, and a host of other visual effects.
Danilo Rezende and Ali Eslami led a DeepMind team that worked on developing software which was based on the same deep neural networks with the above-mentioned capabilities.
The software can handle at least simplified geometric scenes. DeepMind calls it the Generative Query Network, or GQN.
Given just a handful of snapshots of a virtual scene, the GQN uses a neural network to build a compact mathematical representation of that scene. The software then uses that representation to render images of the same room from new perspectives, ones the network has never seen before.
The DeepMind researchers also emphasize that they did not hard-code any prior knowledge about the kinds of environments the Generative Query Network would be asked to render.
Humans currently outperform neural networks at such tasks because they have the benefit of years of first-hand experience with the real-world environments and objects around them. The network DeepMind has developed builds a similar intuition of its own by examining large numbers of images of similar-looking scenes.
In a phone interview with Ars Technica a few days ago, Eslami said the DeepMind team found it deeply surprising that the GQN could successfully handle effects such as occlusion, perspective, shadows, and lighting.
He also said that the team at DeepMind knew how to code graphics engines and renderers.
As indicated earlier, the most remarkable thing about the Generative Query Network is that the DeepMind researchers did not even attempt to hard-code physical laws into the software.
Instead, Eslami explained, the researchers started the software off with a blank slate and let it discover the required laws of physics and rendering rules on its own, simply by looking at the images it was fed.
Some see this as just the latest demonstration of how incredibly versatile deep neural networks really are. Researchers already know that deep neural networks can play Atari 2600 games, beat master-level players at Go, and help classify images. Now, with the developments at Google DeepMind, they also know that deep networks can show a remarkable ability to understand and reason about three-dimensional spaces.
DeepMind GQN (Generative Query Network) And How It Works
DeepMind has put out a rather simple schematic to help readers build an intuition for how the new Generative Query Network is put together.
Under the hood, the Generative Query Network actually uses two separate deep neural networks connected to each other.
On the left-hand side of the schematic is the representation network. It takes in a collection of images depicting a scene, each paired with data about the camera's position for that image.
The representation network condenses that collection of images down to the compact mathematical representation mentioned earlier. Fundamentally, it converts the images into a vector of numbers, continuing until it has a mathematical representation of the whole scene.
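Conceptually, the representation step can be sketched in a few lines of Python. This is a toy illustration, not DeepMind's actual architecture (the real system uses convolutional encoders, and the function names here are hypothetical); the one property it shares with the published design is that per-view encodings are combined with an element-wise sum, so the scene representation does not depend on the order in which views arrive:

```python
def encode_view(image_vec, camera_vec):
    # Hypothetical per-view encoder: just concatenate image features
    # with the camera pose (real encoders are learned networks).
    return image_vec + camera_vec  # list concatenation

def represent_scene(views):
    # Element-wise sum of the per-view encodings -> one compact vector.
    encodings = [encode_view(img, cam) for img, cam in views]
    dim = len(encodings[0])
    return [sum(e[i] for e in encodings) for i in range(dim)]

views = [
    ([0.2, 0.5], [1.0, 0.0]),   # view 1: image features, camera pose
    ([0.4, 0.1], [0.0, 1.0]),   # view 2
]
scene_rep = represent_scene(views)
```

Because the encodings are summed rather than concatenated, feeding the views in a different order yields exactly the same scene vector.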
Once that step is completed, the generation network takes over. Its role is to reverse the process of the representation network: it starts with the vector that the representation network created to represent the whole scene, and it also accepts a camera location as an input. From these, it generates a new image showing how the scene would look to an observer from the specified angle.
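A matching toy sketch of the generation step (again purely illustrative — the real generator is a recurrent convolutional network, and the weights here are made up): the decoder takes the compact scene vector together with a query camera pose and maps them to an "image", here just a short vector of pixel values:

```python
def generate_view(scene_rep, query_camera, weights):
    # Concatenate the scene vector with the query camera pose, then
    # apply a hypothetical linear decoder (one weight row per pixel).
    x = scene_rep + query_camera
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

scene_rep = [0.6, 0.6]             # from the representation network
camera_a = [1.0, 0.0]              # a camera pose seen during encoding
camera_b = [0.0, 1.0]              # a novel query pose
weights = [[1.0, 0.0, 0.5, -0.5],
           [0.0, 1.0, -0.5, 0.5]]

img_a = generate_view(scene_rep, camera_a, weights)
img_b = generate_view(scene_rep, camera_b, weights)
```

The same scene vector yields different rendered "images" depending on the query camera, which is the essential behavior the article describes.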
Obviously, if researchers give the generation network a camera location corresponding to one of the input images, it should have no trouble reproducing that input image in its entirety. But researchers can also feed the generation network other camera positions, including positions for which the network has never seen a corresponding image. The GQN can produce images from these new locations that closely match what the real scene would look like from the same viewpoint.
The DeepMind paper notes that the representation network and the generation network are trained jointly, end to end, which lets the team iteratively improve both networks at once.
The GQN software feeds the training images into the deep neural network, generates an output image, and then measures how far that output diverged from the expected result.
This is somewhat different from what conventional neural networks do. A conventional network relies on externally supplied labels to judge whether its output is correct. The GQN's training algorithm instead uses scene images both as input to its representation network and as the standard for judging whether the generation network's output is correct.
When the generation network's output fails to match the target image, the GQN software back-propagates the errors, updating the numerical weights applied to its hundreds of thousands of artificial neurons so that the network's performance on the task improves.
The software then repeats this process many times, and with each repetition the network gets better at matching its output images to the input scenes.
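The training loop described above can be sketched with a toy model (illustrative only: the real GQN trains convolutional networks with stochastic gradient descent, while this sketch hand-derives the gradient for a tiny linear decoder). The idea is the same — predict a target "image" from a scene representation plus query camera, measure the squared reconstruction error, and adjust the weights to shrink it:

```python
def predict(weights, x):
    # Hypothetical linear decoder: one weight row per output pixel.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def train_step(weights, x, target, lr=0.1):
    pred = predict(weights, x)
    err = [p - t for p, t in zip(pred, target)]
    # Gradient of 0.5 * sum(err**2) w.r.t. weight w[j][i] is err[j] * x[i],
    # so each weight moves opposite to its error contribution.
    return [[w - lr * e * xi for w, xi in zip(row, x)]
            for row, e in zip(weights, err)]

x = [0.6, 0.6, 1.0, 0.0]    # scene representation + query camera pose
target = [1.0, -1.0]        # "true" image at the query viewpoint
weights = [[0.0] * 4, [0.0] * 4]

def loss(w):
    return sum((p - t) ** 2 for p, t in zip(predict(w, x), target))

before = loss(weights)
for _ in range(50):
    weights = train_step(weights, x, target)
after = loss(weights)      # the reconstruction error shrinks toward zero
```

Each pass through `train_step` plays the role of one backpropagation update: the output is compared to the scene image itself, not to an external label.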
According to Eslami, the two neural networks in the GQN can be thought of as a pair of funnels joined at their narrow ends. Because the bottleneck in the middle is tight, the representation network and the generation network eventually learn to work together so they can communicate the scene's contents as compactly as possible.
The Generative Query Network Can Make Educated Predictions About Regions That Aren't Visible
During training, the researchers make sure to provide the neural network with many images drawn from a large number of similar-looking but distinct rooms.
In one experiment, the DeepMind team generated a large number of stylized square rooms containing various geometric shapes, with multiple light sources and randomly chosen wall colors and textures.
Because the network trained on data from many different virtual rooms, it had to find a way to represent each room's contents in a manner that generalized well.
After training, researchers can show the network one or more images of a room it has never seen before. Having trained on many similar-looking rooms with similar characteristics, the network has a strong intuition about what such rooms tend to look like, and that intuition lets it make educated guesses about the portions of a room it cannot see directly.
For example, a GQN can predict that a repeating pattern on a wall is likely to continue behind the objects obstructing it. Remarkably, it does this without the DeepMind researchers hard-coding any explicit rules about the physics or characteristics of the light in the scenes it analyzes.
Eslami told Ars Technica that the GQN could also learn things the researchers themselves did not know how to teach by hand. Humans intuitively know, for instance, that chairs and tables tend to sit next to each other, but researchers have found that fact very hard to quantify and turn into code. The GQN can learn it anyway, using the same method it uses to learn that objects cast shadows.
To put it in simpler terms, suppose researchers have trained a GQN on a large set of home-interior images and then feed it images of a house it has never seen. If those images show only half of the dining room table, the network can figure out what the other half looks like, and it can estimate that the table probably has chairs next to it. Likewise, if the house has an upstairs room whose size matches a bedroom, but the network was never shown the room's interior, it can guess that the room contains a bed and a dresser.
That said, when a GQN infers that a room probably contains furniture, it is not because it has any conceptual understanding of what chairs, tables, or beds actually are. It simply observes, statistically, that bedroom-shaped rooms tend to contain bed-shaped objects, and that table-shaped objects tend to have chair-shaped objects around them.
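The statistical flavor of this behavior can be conveyed with a toy example. To be clear, this is not how the GQN works internally — it learns such regularities implicitly in its weights rather than counting anything — but prediction by co-occurrence is the underlying idea, and all the data below is invented for illustration:

```python
from collections import Counter

# Invented training data: room shape paired with objects seen inside it.
training_rooms = [
    ("bedroom", ["bed", "dresser"]),
    ("bedroom", ["bed", "lamp"]),
    ("dining room", ["table", "chairs"]),
    ("dining room", ["table", "chairs"]),
]

def likely_contents(room_shape):
    # Count how often each object co-occurs with this room shape.
    counts = Counter()
    total = 0
    for shape, objects in training_rooms:
        if shape == room_shape:
            counts.update(objects)
            total += 1
    # Objects seen in more than half of matching rooms are "expected".
    return sorted(obj for obj, c in counts.items() if c > total / 2)
```

Here `likely_contents("bedroom")` predicts a bed (seen in every training bedroom) but not a dresser or lamp (each seen only once) — a guess driven entirely by statistics, with no concept of what a bed is.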
GQN And Its High Versatility
Researchers at Google DeepMind have built networks that can draw extraordinarily rich inferences from a reasonably limited amount of data.
But are they really practical yet?
Not yet, according to Eslami. Speaking to Ars Technica, he stressed that the GQN research is still in its preliminary stages and remains far from any practical application.
After all, the DeepMind researchers have only tested the GQN on computer-rendered virtual rooms and objects, so they do not yet know how it would handle the more complex and varied environments humans encounter in the physical world, or how well the technique would generalize.
There is little doubt that the GQN's success hinges on one key thing: its ability to condense a complex scene into a compact mathematical representation. That approach assumes, of course, that the scene is simple enough to be captured in nothing but a vector of numbers.
How would a GQN perform on a typical real-world scene containing tens, hundreds, or even thousands of objects? And how would such techniques let the network recognize complex objects like cars, cats, and trees? It stands to reason that at some point a scene becomes too complex to represent efficiently with the kind of compact numerical representation the GQN is built on. Researchers will also have to confirm whether these techniques are reliable enough to scale to even more complex scenes with even more varied objects.