How SFU researchers are teaching AI to see depth in photographs and paintings
For humans, telling which objects in a photograph are near and which are far away is effortless. While we take this ability for granted, it’s quite challenging for computers. That may not be the case for long: researchers in the Computational Photography Lab at SFU are successfully teaching artificial intelligence (AI) to determine depth from a single photograph, as described in their recent CVPR 2021 publication.
This process, called monocular depth estimation, has many applications in image editing and beyond.
“When we look at a picture, we can tell the relative distance of objects by looking at their size, position, and relation to each other,” says Mahdi Miangoleh, an MSc student in the Computational Photography Lab.
“This requires recognizing the objects in a scene and knowing what size the objects are in real life. This task alone is an active research topic for neural networks.”
In recent years, there has been considerable progress on this research topic, but existing methods fail to produce output at a resolution high enough to transform the image into three-dimensional (3D) space. To address this, the lab’s approach exploits the untapped potential of existing neural network models.
The proposed research traces the lack of high-resolution results in current methods to the limitations of convolutional neural networks. Despite major advances in recent years, these networks still have a relatively small capacity for generating many details at once. Another limitation is the number of pixels the networks can ‘look at’ at once, known as the receptive field, which determines how much information the neural network can use to understand complex scenes.
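As a rough illustration of this second limitation, the receptive field of a stack of convolutional layers (how many input pixels influence a single output value) can be computed from each layer’s kernel size and stride. The layer configurations below are hypothetical examples, not the architecture used in the paper:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers.

    Each layer is a (kernel_size, stride) pair. Uses the standard
    recurrence: the field grows by (kernel - 1) times the product of
    all earlier strides. Illustrative only; not the paper's network.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # widen by this layer's reach
        jump *= stride             # later layers step in bigger jumps
    return rf

# A single 3x3 layer sees 3 pixels across; stacking grows this slowly.
print(receptive_field([(3, 1)]))          # 3
print(receptive_field([(3, 1), (3, 1)]))  # 5
print(receptive_field([(3, 2), (3, 1)]))  # 7
```

Because the receptive field grows slowly with depth, each output pixel is computed from a fixed window of the input: on a very large image, the network reasons about only a small fraction of the scene at a time.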
“We analyze the image and optimize our process by looking at the image content according to the limitations of current architectures,” says Sebastian Dille, a PhD student in the same lab.
“We give our input image to the network in many different forms, to create as many details as the model allows while preserving a realistic geometry.”
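The quote above describes a multi-resolution strategy: run the network on the image at more than one resolution, then merge the results so that the low-resolution pass supplies consistent global structure and the high-resolution pass contributes fine detail. A minimal NumPy sketch of that idea, with a stand-in `estimate_depth` in place of a real pretrained depth network (everything here is illustrative, not the lab’s actual pipeline):

```python
import numpy as np

def estimate_depth(image):
    """Stand-in for a pretrained monocular depth network (hypothetical).
    Returns the input unchanged so the merging logic below is runnable
    without a trained model."""
    return image.astype(np.float64)

def downsample(x, f):
    """Average-pool by factor f (assumes dimensions divisible by f)."""
    h, w = x.shape
    return x.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def upsample(x, f):
    """Nearest-neighbour upsample by factor f."""
    return np.kron(x, np.ones((f, f)))

def merged_depth(image, factor=4):
    """Blend a coarse, globally consistent estimate with the
    high-frequency detail of a full-resolution estimate."""
    low = estimate_depth(downsample(image, factor))  # coherent but coarse
    high = estimate_depth(image)                     # detailed, may drift
    base = upsample(low, factor)
    # Keep only the detail the high-res pass adds beyond its own
    # low-frequency content, and graft it onto the coarse base.
    detail = high - upsample(downsample(high, factor), factor)
    return base + detail
```

With the identity stand-in model, the merge reconstructs the input exactly; with a real network, the low- and high-resolution estimates differ, and the merge is what reconciles them.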
By increasing the resolution of the estimation, the researchers make it possible to create detailed 3D renderings that look realistic to a human eye. Depth maps are used to create 3D renderings of scenes and simulate camera motion in computer graphics.
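Turning a depth map into a 3D rendering typically starts by back-projecting each pixel into space with the standard pinhole camera model. A short sketch, where the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative placeholders rather than values from the paper:

```python
import numpy as np

def depth_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map to a point cloud (pinhole model).

    For pixel (u, v) with depth Z:
        X = (u - cx) * Z / fx
        Y = (v - cy) * Z / fy
    Returns an (h, w, 3) array of XYZ coordinates.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel grids, shape (h, w)
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Example: a flat plane one unit from the camera.
pts = depth_to_points(np.ones((4, 4)), fx=1.0, fy=1.0, cx=1.5, cy=1.5)
```

Once the pixels live in 3D, rendering the point cloud from a slightly shifted virtual camera produces the parallax that makes the simulated camera motion look convincing.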
“With the high-resolution depth maps that we are able to get for real-world photographs, artists and content creators can now immediately transfer their photograph or artwork into a rich 3D world,” says computing science professor Yağız Aksoy, who leads the Computational Photography Lab.
Aksoy points to an interesting application of being able to estimate depth from a single image—the same method can also generate realistic depth maps of paintings. This is not possible with traditional geometry estimation methods in computer vision, which require multiple cameras looking at the same object. Since a painting is a 2D object, only single-image depth estimation can ‘see’ depth within it.
We are already seeing artists around the world make use of the applications enabled by Aksoy’s lab’s research. Akira Saito, a visual artist based in Japan, creates videos that take viewers into fantastic 3D worlds dreamed up in 2D artwork. To do this, he combines tools such as Houdini, a computer animation package, with the depth maps generated by Aksoy and his team.
Content creators on TikTok are already using similar research to express themselves in new ways. One tutorial explains how high-resolution depth maps can be used to create impressive 3D visualizations of photographs, and shows how the technique can produce short, eye-catching clips.
“It’s a great pleasure to see independent artists make use of our technology in their own way,” says Aksoy.
“We have made great leaps in computer vision and computer graphics in recent years, but the adoption of these new AI technologies by the artist community needs to be an organic process, and that takes time.”
The next steps in this lab’s research are extending this work to videos and developing new tools that will make depth maps more useful for artists.
*This research was done in collaboration with Sylvain Paris and Long Mai from Adobe Research.*