Are all taxis yellow? A story of biases

Reading time: 8 minutes

I had the idea of using Stable Diffusion XL to create images to illustrate articles I’m about to write. To be honest, I was not getting the results I expected and was failing prompt after prompt.

Prompting is probably an art form, but it is also hard to tell which parts of a scene the model assumes and covers on its own (and how), and which ones need to be spelled out. Many voices have begun to say that the gap between humans and artificial intelligence systems is shrinking quickly.

Let’s confront the state of the art and challenge this claim with a few simple experiments.

A story of colors and taxis

Taxis come in different colors in different parts of the world, and they don’t all follow the same car stereotype.

  • New York is well known for its yellow taxis.
  • London for the black ones.
  • Paris does not have a specific color.
  • Casablanca is known to have a majority of red taxis.

This is our reality. It’s not rock-solid knowledge and probably a small detail, but we could argue that it could be part of the basic training context of a tool like Stable Diffusion. Other knowledge that is equally “unimportant” is already part of it.

Let’s run the experiment and see whether these different cases are handled, or whether a pattern emerges in the model’s proposals.

"a traditional detailed taxi on a road in New-York"
“a traditional detailed taxi on a road in London”
“a traditional detailed taxi on a road in Paris”
“a traditional detailed taxi on a road in Casablanca”
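
As a reference point, here is a minimal sketch of how such a batch of prompts can be generated with the Hugging Face diffusers library. The checkpoint name is the public SDXL base model; the output file names and the absence of extra generation parameters are my own assumptions, not the exact setup used for the images in this article.

# Sketch: one image per location prompt with Stable Diffusion XL via diffusers.
# Assumes a CUDA GPU; file names are illustrative.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

prompts = {
    "new-york": "a traditional detailed taxi on a road in New-York",
    "london": "a traditional detailed taxi on a road in London",
    "paris": "a traditional detailed taxi on a road in Paris",
    "casablanca": "a traditional detailed taxi on a road in Casablanca",
}

for city, prompt in prompts.items():
    # Same prompt structure, only the location changes.
    image = pipe(prompt=prompt).images[0]
    image.save(f"taxi-{city}.png")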

All the taxis are yellow; only the car model sometimes changes. We can confirm that the location context is taken into account, because the Eiffel Tower appears in the Paris image.

Apparently, it’s normal for the Eiffel Tower to be in Paris, and for the taxis there to be yellow.

Maybe those prompts are too simplistic. Let’s build a more complete scene by adding details to our description, a phone booth for example, and see if the result makes more sense.

“a detailed phone booth in the foreground, the eiffel tower and taxis in the background”

Buildings and trees seem to be correctly rendered and styled according to the location, but we discover a new interesting behavior: phone booths are red, and consistently red across multiple attempts in different locations.

The model falls back on a canonical, generic image of what a phone booth and a taxi should look like. Those representations seem to sit somewhere between the USA for the color of taxis and England for the shape and color of phone booths.


Let’s add even more detail to get rid of those default behaviors by naming our target precisely: “a detailed french newspaper kiosk in the foreground, the eiffel tower and citroen cars in the background”

What we just did is show that these models are biased, and it was fairly easy. Note: a “taxi” is not a “cab”; the results differ, and the second term could give more accurate results… or not.

The data behind Stable Diffusion

LAION-5B is the dataset used to train Stable Diffusion.
[https://laion.ai/blog/laion-5b/]
We are talking about a database of 5 billion text-image pairs.
Browsing that amount of data is complicated, and no tool can really handle all of it.
[https://waxy.org/2022/08/exploring-12-million-of-the-images-used-to-train-stable-diffusions-image-generator/]
It’s still worth trying and testing: the exploration tool below can give us an idea of the underlying issue behind our taxi story.

Does the data used to train the model say that a taxi is mainly yellow?

https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn.laion.ai&index=laion5B-H-14&useMclip=false&query=taxi
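
The same index can also be queried programmatically with the clip-retrieval Python client. The snippet below is a sketch assuming the public knn.laion.ai backend is still reachable; the index name ("laion5B-L-14") is the one from the client’s documentation and may differ from the one used in the link above.

# Sketch: ask the LAION-5B index what it associates with the word "taxi".
# Assumes the knn.laion.ai backend and the index name below are still available.
from clip_retrieval.clip_client import ClipClient

client = ClipClient(
    url="https://knn.laion.ai/knn-service",
    indice_name="laion5B-L-14",
    num_images=40,
)

results = client.query(text="taxi")
for item in results[:10]:
    # Each result carries a caption, an image URL and a similarity score.
    print(item.get("similarity"), item.get("caption"), item.get("url"))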

If we do the same exercise for the other prompts, we notice that phone booths are overwhelmingly from England, and that a Citroen is an old car whose main model seems to be the “2CV”.

A story of cars and traffic rules

The color of the car is not the only problem. I would really like to create a realistic scene of a car driving on a road that, at some point, follows some basic driving rules.

Particularly this one: Countries That Drive on the Left
https://wisevoter.com/country-rankings/countries-that-drive-on-the-left/

This is common sense and knowledge for a human being, but what about AI?
Can Stable Diffusion “understand” this rule from the context and its training data when we provide the country in the prompt?

“a detailed traditional taxi driving on a large two-way road in Australia”

“a traditional taxi on a large two-way road in Scotland”


All we see is cars on the right (but wrong) side of the road.
Let's test a more elaborate scenario with more cars and a roundabout.

“a traffic jam in a roundabout in England seen from above”

Cars are literally parked in the middle of nowhere; nothing in this image makes sense.
Using Stable Diffusion to illustrate books meant to teach students how to drive is not a good idea at this point.


For comparison, Adobe Firefly gives mostly the same results.

  • Taxis are yellow.
  • Phone booths are red and British-styled.
  • Left-driving countries are not identified, but monuments are.

This is not only about the color of cars and roundabouts

The goal of this article is to highlight biases and the fact that synthetic content relies on data sources that can’t cover every case and don’t apply common-sense rules unless we describe the expected result in detail.

Even when we do, some basic data is missing from the start, and the behavior we would like to see on screen will likely never show up.

The information used to feed and train the model is selected by humans. The output depends on how that data was selected, in which culture, and for which audience. An American would probably be fine with driving on the right and with all taxis being yellow.

This question of diversity and representational weight could be addressed by adding more examples of taxis. But exactly how many should we add to restore some sense of balance, and how do we pick them?

Beyond taxi colors and driving rules, we could question the entire representation of the world in these models, since every concept must follow the same statistical pattern in order to be computed.

We could ask ourselves other types of questions:

  • What is the skin color of a “successful businessman”, and of a “blue collar worker”?
  • Should a nurse mainly be represented as a woman?
  • Should a mechanic mainly be represented as a young man?
  • Should a politician mainly be represented as a white man with a blue tie? (Try it yourself with “a politician talking to the crowd”: change the seed, change the location, and a woman rarely shows up; see the sketch after this list.)
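
For that last experiment, here is a minimal sketch using the same diffusers SDXL pipeline as before: the prompt stays fixed and only the seed changes, so any recurring pattern comes from the model rather than from the wording. The seed values and file names are arbitrary choices of mine.

# Sketch: fixed prompt, varying seed, to observe the model's default representation.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "a politician talking to the crowd"
for seed in (1, 7, 42, 1234, 99999):  # arbitrary seeds
    generator = torch.Generator(device="cuda").manual_seed(seed)
    image = pipe(prompt=prompt, generator=generator).images[0]
    image.save(f"politician-seed-{seed}.png")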

Those exact questions are part of the following study, which shows how biases already present in our society can worsen once they are embedded in content-generation AI models.
[https://www.bloomberg.com/graphics/2023-generative-ai-bias/]

This scientific approach highlights the same issue. “We are essentially projecting a single worldview out into the world, instead of representing diverse kinds of cultures or visual identities,” says this other scientific study. [https://arxiv.org/abs/2303.11408]

Another study, based on 3,000 images generated with MidJourney: “How AI reduces the world to stereotypes” [https://restofworld.org/2023/ai-image-stereotypes/]

Or this article: “AI was asked to create images of Black African docs treating white kids. How'd it go?” A giraffe popped up in the middle of a medical context. That's how the model sees medicine in Africa.

MidJourney was contacted: “Midjourney itself has not commented on the experiment. The company did not respond to NPR’s request to explain how the images were generated.”

In the following example, Dall-e v3 can’t produce an image of a watch showing anything other than 10:10, “because almost all of the product images advertise timepieces using 10:10 since it is more visually appealing.”
[https://twitter.com/mengyer/status/1712920849177539000]
I tested Stable Diffusion with the following prompt and guess what? 10:10 again.
"an image of a watch when it's noon. minimalist."

At a macro level, the Mozilla Foundation also points to this unbalanced representation in a study. The questions are tied to data sources, but not only: the broader issue is also how companies, public organizations, universities, and even countries share power over artificial intelligence globally.
"50% of datasets are connected to 12 institutions"
[https://2022.internethealthreport.org/facts/]

This article is already quite long, so I won’t get into the topic of competition between humans and machines around efficiency and creativity here.

A final word could be that these tools are not magic and are conditioned by their training. Even if the amount of data is huge, that training can embed rules and data that don’t match your expectations.
So far, adjectives like “typical”, “traditional”, or “authentic”, and notions like common sense or basic rules, even combined with a detailed context, can lead to a distortion of reality when generated by a model.

We need to keep control over the generated artifacts and keep in mind that limitations do exist. Don’t trust these tools blindly.

Thanks for reading,
Julien

Note: I don’t know whether MidJourney or Dall-E have been trained or reinforced to handle those biases in a better way. If you have access to those tools, you can run your own experiments and send me your feedback so I can complete this article.

I found this interesting tweet about Adobe Firefly fighting biases in women’s default appearance: it "doesn't generate women as young and attractive by default unless we specifically say it."
[https://twitter.com/doganuraldesign/status/1711864851528610045]

Published: 2023-10-12 07:00:00 +0000 UTC