Out of shape? Why deep learning works differently than we thought

Unlike humans, AI object recognition pays a lot of attention to small details—and misses the broader picture. But there are ways to close the gap.

If you look at the image below, which animal do you see?

You probably won’t have any trouble identifying a cat in the image above. Here is what a top-notch deep learning algorithm sees: an elephant!

This story is about why artificial neural networks see elephants where humans see cats. Moreover, it’s about a paradigm shift in how we think about object recognition in deep neural networks — and how we can leverage this perspective to advance neural networks. It is based on our recent paper at ICLR 2019, a major deep learning conference.


How do neural networks recognize a cat? A widely accepted answer to this question is: by detecting its shape. Evidence for this hypothesis comes from visualization techniques like DeconvNet (examples shown below), suggesting that along the different stages of processing (called layers), networks seek to identify increasingly larger patterns in an image, from simple edges and contours in the first layers to more complex shapes such as a car wheel — until the object, say, a car, can be readily detected.

Different shapes recognized by a neural network: from small patterns during the early stages of processing (layers 1 and 2) to more complex shapes (car wheel, layer 3) and finally objects (car, layer 5). Image credit: Kriegeskorte (2015).

This intuitive explanation has attained the status of common knowledge. Modern deep learning textbooks such as the classic “Deep Learning” book by Ian Goodfellow, Yoshua Bengio and Aaron Courville explicitly refer to shape-based visualization techniques when explaining how deep learning works, as do influential researchers like Nikolaus Kriegeskorte (p. 9):

“The network acquires complex knowledge about the kinds of shapes associated with each category. […]
High-level units appear to learn representations of shapes occurring in natural images, such as faces, human bodies, animals, natural scenes, buildings, and cars.”

But there is a problem: Some of the most important and widely used visualization techniques, including DeconvNet, have recently been shown to be misleading: instead of revealing what a network looks for in an image, they merely reconstruct image parts — that is, those beautiful human-interpretable visualizations have little to do with how a network arrives at a decision.

That leaves little evidence for the shape hypothesis. Do we need to revise the way we think about how neural networks recognize objects?


What if the shape hypothesis is not the only explanation? Beyond the shape, objects typically have a more or less distinctive color, size and texture. All of these factors could be harnessed by a neural network to recognize objects. While color and size are usually not unique to a certain object category, almost all objects have texture-like elements if we look at small regions — even cars, for instance, with their tyre profile or metal coating.

And in fact, we know that neural networks happen to have an amazing texture representation — without ever being trained to acquire one. This becomes evident, for example, when considering style transfer. In this fascinating image modeling technique, a deep neural network is used to extract the texture information from one image, such as the painting style. This style is then applied to a second image, enabling one to “paint” a photograph in the style of a famous painter. (You can try it out yourself here!)

Left: arbitrary photograph | Middle: style=texture image (“starry night” by Van Gogh) | Right: photograph rendered in the style=texture of the painting by a deep neural network. Image credit: deepart.io.
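To make "texture information" a little more concrete: in the classic neural style transfer approach, texture is commonly summarized by a Gram matrix, that is, the correlations between a layer's feature channels averaged over all image positions. The sketch below is a toy illustration of that idea and not the code behind the images shown here. The function names, the tiny hand-written activations and the normalization by the number of positions are assumptions made for the example.

```python
from typing import List, Sequence

Features = List[List[float]]  # one row per channel, flattened over positions


def gram_matrix(features: Features) -> Features:
    """Channel-by-channel correlations, averaged over spatial positions."""
    n_positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) / n_positions
         for ch_j in features]
        for ch_i in features
    ]


def shuffle_positions(features: Features, order: Sequence[int]) -> Features:
    """Rearrange spatial positions in the same way for every channel."""
    return [[row[k] for k in order] for row in features]


# Toy activations (invented): 2 channels, 4 spatial positions.
feats = [[1.0, 0.0, 2.0, 1.0],
         [0.0, 1.0, 1.0, 3.0]]
shuffled = shuffle_positions(feats, [2, 0, 3, 1])

print(gram_matrix(feats))
print(gram_matrix(feats) == gram_matrix(shuffled))
```

Rearranging the positions leaves the Gram matrix unchanged. A summary of this kind records which features occur together but discards where they occur, which is why it describes texture rather than global shape.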

The fact that neural networks acquire such a powerful representation of image textures despite being trained only on object recognition suggests a deeper connection between the two. It is a first piece of evidence for what we call the texture hypothesis: textures, not object shapes, are the most important aspects of an object for AI object recognition.


How do neural networks classify images: based on shape (as commonly assumed) or texture? To find out which explanation is more plausible, I came up with a simple experiment. It is based on images like the ones below, in which shape and texture provide evidence for different object categories:

cat with elephant texture | car with clock texture | bear with bottle texture

In these three example images, texture and shape are no longer from the same category. We created them with style transfer: the same technique used to “paint” a photograph in the style of van Gogh can be used to create a cat with the texture of an elephant, if the input is a photograph of elephant skin instead of a painting.

Using images like these, we can now investigate shape or texture biases by looking at classification decisions from deep neural networks (and humans for comparison). Consider the following analogy: We would like to find out whether someone speaks Arabic or Chinese, but we are not allowed to talk to them. What could we do? One possibility would be to take a piece of paper, write “go left” in Arabic, next to it “go right” in Chinese, and then simply observe whether the person would walk right or left. Similarly, if we show an image with conflicting shape and texture to a deep neural network, we can find out which “language” it speaks by observing whether it makes use of the shape or the texture to identify the object (that is, whether it thinks the cat with elephant texture is a cat or an elephant).

This is precisely what we did. We conducted a series of nine experiments encompassing nearly a hundred human observers and many widely used deep neural networks (AlexNet, VGG-16, GoogLeNet, ResNet-50, ResNet-152, DenseNet-121, SqueezeNet1_1), showing them hundreds of images with conflicting shapes and textures. The results left little room for doubt: we found striking evidence in favor of the texture explanation! A cat with elephant skin is an elephant to deep neural networks, and still a cat to humans. A car with the texture of a clock is a clock to deep neural networks, as much as a bear with the surface characteristics of a bottle is recognized as a bottle. Current deep learning techniques for object recognition primarily rely on textures, not on object shapes.
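One way to turn such decisions into a single number is to count, among all conflict images where the answer matched either the shape or the texture category, how often it matched the shape. The snippet below is a minimal sketch of that bookkeeping. The function name and the handful of trials are invented for illustration and are not data from the experiments.

```python
from typing import Iterable, Optional, Tuple

# (shape category, texture category, category chosen by the observer or network)
Trial = Tuple[str, str, str]


def shape_bias(trials: Iterable[Trial]) -> Optional[float]:
    """Fraction of shape decisions among all shape-or-texture decisions."""
    shape_hits = 0
    texture_hits = 0
    for shape_label, texture_label, predicted in trials:
        if shape_label == texture_label:
            continue  # not a conflict image, so it says nothing about bias
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # answers matching neither category are ignored
    decided = shape_hits + texture_hits
    return shape_hits / decided if decided else None


# Invented toy trials, purely for illustration.
trials = [
    ("cat", "elephant", "elephant"),
    ("car", "clock", "clock"),
    ("bear", "bottle", "bottle"),
    ("cat", "elephant", "cat"),
    ("car", "clock", "dog"),
]
print(shape_bias(trials))  # 1 shape decision out of 4 counted decisions
```

A value near 1 would indicate an observer that follows shape, a value near 0 one that follows texture.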

Here is one exemplary result for ResNet-50, a commonly used deep neural network, showing the percentage of its first three “guesses” (classification decisions) below the image:

As you can see, the cat with elephant skin is classified as an elephant based on the texture, rather than as a cat based on its shape. Current AI object recognition seems to work quite differently from what we previously assumed, and it is fundamentally different from how humans recognize objects.
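For readers wondering where such percentages come from: a classification network outputs one raw score per category, and these scores are usually converted into percentages with a softmax before the highest ones are reported. Here is a small sketch of that step. The category names and scores are made up and are not the actual ResNet-50 outputs.

```python
import math
from typing import Dict, List, Tuple


def top_k_guesses(scores: Dict[str, float], k: int = 3) -> List[Tuple[str, float]]:
    """Convert raw scores to percentages (softmax) and return the k largest."""
    peak = max(scores.values())  # subtracted for numerical stability
    exps = {label: math.exp(value - peak) for label, value in scores.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [(label, 100.0 * value / total) for label, value in ranked[:k]]


# Invented raw scores for a cat image with elephant texture.
scores = {
    "Indian elephant": 4.0,
    "tabby cat": 2.0,
    "African elephant": 3.0,
    "sports car": 0.0,
}
for label, percent in top_k_guesses(scores):
    print(f"{label}: {percent:.1f}%")
```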


Is there anything we can do about this? Can we make AI object recognition more human-like? Can we teach it to use shapes instead of textures?

The answer is yes. Deep neural networks, when learning to classify objects, make use of whatever information is useful. In standard images, textures reveal a lot about object identities, hence there may simply be no need to learn a lot about object shapes. If the tyre profile and glossy surface already give the object identity away, why bother checking whether the shape matches, too? This is why we devised a novel way to teach neural networks to focus on shapes instead of textures, in the hope of eliminating their texture bias. Again using style transfer, it is possible to exchange the original texture of an image for an arbitrary different one (see figure below for examples). In the resulting images, the texture is no longer informative and thus the object shape is the only useful information left. If a deep neural network wants to classify objects from this new training dataset, it now needs to learn about shapes.

Left: normal image with both texture and shape information | Right: ten different examples of arbitrary textures, yet identical object shapes.
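In code, the recipe might look roughly like the sketch below: every training image receives a randomly chosen texture but keeps its original label, so the label can only be predicted from the shape. This is an illustration under assumptions, not our actual pipeline. `build_stylized_set` and `fake_stylize` are hypothetical names, and `fake_stylize` merely joins file names where a real style transfer model would produce a new image.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]  # (image identifier, object label)


def build_stylized_set(
    images: Sequence[Example],
    styles: Sequence[str],
    stylize: Callable[[str, str], str],
    seed: int = 0,
) -> List[Example]:
    """Give each image a random texture while keeping its object label."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    stylized = []
    for content, label in images:
        style = rng.choice(styles)  # texture is unrelated to the label
        stylized.append((stylize(content, style), label))
    return stylized


def fake_stylize(content: str, style: str) -> str:
    """Placeholder for a real style transfer model (assumption)."""
    return f"{content}+{style}"


# Hypothetical file names, for illustration only.
images = [("cat_01.jpg", "cat"), ("car_07.jpg", "car"), ("bear_03.jpg", "bear")]
styles = ["painting_a.jpg", "painting_b.jpg", "painting_c.jpg"]

for stylized_image, label in build_stylized_set(images, styles, fake_stylize):
    print(stylized_image, "->", label)
```

Because the texture is drawn independently of the label, a classifier trained on the result gains nothing from memorizing textures.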

After training a deep neural network on thousands and thousands of these images with arbitrary textures, we found that it actually acquired a shape bias instead of a preference for textures! A cat with elephant skin is now perceived as a cat by this new shape-based network. Moreover, there were a number of emergent benefits. The network suddenly got better than its normally trained counterpart at both recognizing standard images and locating objects in images, highlighting how useful human-like, shape-based representations can be. Our most surprising finding, however, was that it learned how to cope with noisy images (in the real world, this could be objects behind a layer of rain or snow), without ever having seen any of these noise patterns before! Simply by focusing on object shapes instead of easily distorted textures, this shape-based network is the first deep neural network to approach general, human-level noise robustness.
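To give a feeling for what a "noisy image" means in such a test, here is a minimal sketch that perturbs every pixel of a small grayscale image with random noise. Uniform noise, the `width` parameter and the toy image are assumptions chosen for illustration. A robustness test would then compare classification accuracy on clean and perturbed versions of the same images.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale pixel values in [0, 1]


def add_uniform_noise(image: Image, width: float, seed: int = 0) -> Image:
    """Add noise drawn uniformly from [-width, width], clipped to [0, 1]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [
        [min(1.0, max(0.0, pixel + rng.uniform(-width, width))) for pixel in row]
        for row in image
    ]


# Invented 3x3 toy image: a bright vertical bar on a dark background.
image = [
    [0.1, 0.9, 0.1],
    [0.1, 0.9, 0.1],
    [0.1, 0.9, 0.1],
]
noisy = add_uniform_noise(image, width=0.3)
for row in noisy:
    print([round(pixel, 2) for pixel in row])
```

Small perturbations like these change local pixel patterns, and with them the fine texture, far more than they change the overall outline of an object.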

At the crossroads of human visual perception and artificial intelligence, inspiration can come from both fields. We used knowledge about the human visual system and its preference for shapes to better understand deep neural networks, learning that they primarily use textures to classify objects. This led to the creation of a network that more closely resembles robust, human-like performance on a number of different tasks. Looking ahead, if this network turns out to predict more accurately how neurons in the brain “fire” when we look at objects, it could be very useful for better understanding human visual perception. In this truly exciting age, inspiration from human vision has the potential to improve today’s AI technologies just as much as AI has the potential to advance today’s vision science!


The link below leads to the full paper on which this article is based.

ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness
Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann & Wieland Brendel.

If not stated otherwise, images and figures are taken from this publication; the respective image rights mentioned there apply accordingly.

