Nenuphars and gas consumption: let's ask Gemini!

Reading time: 7 minutes

In a previous article from October 2023, I tested Llama's and GPT's logic with simple math problems.

https://www.kindrobot.org/article/nenuphars-gas-consumption-and-crypto-stories-coding-is-not-dead-yet/

And the results were not that impressive.

As the technology evolves and improves, and as vendors loudly claim "giant" progress on a regular, if not daily, basis, let's challenge one of them: Google Gemini.

Gemini, the large language model developed by Google, was launched last December with a big roadshow.

Here is the latest update from Google.
https://blog.google/technology/ai/google-gemini-update-sundar-pichai-2024/

They describe it as a new “state of the art”.
“Our Gemini models are built from the ground up for multimodality — reasoning seamlessly across text, images, audio, video, and code.”

Gemini and the nenuphar growth

My question is: “A nenuphar does grow and doubles its surface every day. After 20 days, it does cover half of the water area. How many more days will it take to cover it entirely?"

For this first version, Gemini did better than Llama in my previous experiments and gave me the correct answer.
It kind of “understood” that the key was the doubling factor of two, or at least this problem was now part of the training data.
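As a sanity check, here is the expected arithmetic as a minimal Python sketch (the loop and variable names are mine, not part of the prompt or of Gemini's answer):

```python
# Day 20: the nenuphar covers half of the pond.
# It doubles every day, so one more day is enough.
coverage = 1 / 2        # fraction of the pond covered after 20 days
extra_days = 0
while coverage < 1:
    coverage *= 2       # the surface doubles each day
    extra_days += 1
print(extra_days)       # -> 1
```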

Let’s challenge it one more time (in the same thread, so it isn’t judged on a zero-shot prompt) to see if the math still holds: let’s use a factor of three instead and change the number of days.

The new version of the same question: “a nenuphar does grow and triples its surface every day. After 37 days, it does cover a third of the water area. How many more days will it take to cover it entirely?"

The problem is the same and the answer is still one day: on day 38, the surface triples from one third to the whole area.

Here’s Gemini’s answer:


The nenuphar will completely cover the water area in 2 more days.
Here's the reasoning:
  • We know the nenuphar triples its surface area daily.
  • After 37 days, it covers a third (1/3) of the water area.
This doesn't directly tell us the starting area, but it tells us about the growth rate. Since it triples daily, we can reason that:
  • On day 36, it covered 1/3 divided by 3 (tripling factor) = 1/9 of the water area.
  • On day 35, it covered 1/9 divided by 3 = 1/27 of the water area.
We see a pattern here. The area covered keeps getting divided by 3 each day. So, to reach the point where it covers the entire water area (area = 1), we need two more days.
  • On day 38, it will cover 1/27 multiplied by 3 = 1/9 of the water area.
  • On day 39, it will cover 1/9 multiplied by 3 = 1/3 of the water area.
Therefore, after 37 days (covering 1/3) and 2 more days (day 38 and 39), the nenuphar will completely cover the water area.

And this is wrong. There is still some work to do before it understands the factor of three. Next challenge: the factor of four.
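The same check works for any growth factor, including the factor of four. Here is a minimal Python sketch (the function name and structure are mine):

```python
def extra_days_to_full(growth_factor, covered_fraction):
    """Days needed for the covered fraction to reach the whole pond."""
    days = 0
    while covered_fraction < 1:
        covered_fraction *= growth_factor
        days += 1
    return days

print(extra_days_to_full(2, 1 / 2))  # doubling, half covered   -> 1
print(extra_days_to_full(3, 1 / 3))  # tripling, third covered  -> 1
print(extra_days_to_full(4, 1 / 4))  # quadrupling, a quarter   -> 1
```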

Gemini and the gas consumption math problem

The question: “The gas tank of my car can contain 80 liters of gas and was full at the beginning of my trip. After 200 kilometers I decided to take a break for lunch. I’m in the middle of my trip back home and the tank now contains a third of its full capacity. Can I reach home and finish my travel without making a stop at a gas station before?"

The answer is that the question contains enough elements to compute that I most likely don't have enough gas to finish the trip.

Here’s Gemini’s answer:


Let's analyze the situation to see if you can reach home without refueling:
  • Full tank capacity: 80 liters
  • Gas used after 200 km (assuming full tank at the beginning): We don't have the fuel consumption rate (liters per kilometer) to calculate the exact amount of gas used.
  • Gas remaining: One-third of the full capacity, which is (1/3)*80 liters = 26.67 liters (approximately 27 liters).
Since you're halfway through your trip (assuming the distance back home is the same as the distance traveled so far), you'll need to cover another 200 kilometers to reach home.
Insufficient information: We cannot definitively say whether you'll reach home without refueling because we lack the car's fuel consumption rate. Here's what you can do:
  • Check your car's manual or trip computer (if available) to find out the average fuel consumption rate (in liters per kilometer).
  • With the fuel consumption rate, you can calculate the distance you can travel with the remaining 27 liters of gas. If that distance is greater than 200 kilometers, you can make it home without refueling.
Recommendation: Given the uncertainty, it's advisable to consider these options:
  • Play it safe and refuel at a gas station before continuing your trip home. This ensures you have enough gas to complete your journey without getting stranded.
  • If you're comfortable taking a slight risk, continue driving and monitor your gas gauge closely. If the gauge indicates you're running low on gas before reaching home, find a gas station and refuel.
I hope this helps!

“We don’t have the fuel consumption rate (liters per kilometer) to calculate the exact amount of gas used”: Yes, you can compute it, or at least estimate it, from the remaining volume in the tank.

“we lack the car’s fuel consumption rate”: This is not correct. You can compute it from the first leg of the trip and say with reasonable certainty that the remaining gas will not be enough.
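Here is the computation Gemini missed, as a minimal Python sketch (assuming, as the question implies, that the lunch break is the midpoint of a 400 km trip):

```python
tank_capacity = 80.0                 # liters, full at departure
distance_so_far = 200.0              # km driven before the lunch break
remaining_fuel = tank_capacity / 3   # "a third of its full capacity" ~ 26.7 L

fuel_used = tank_capacity - remaining_fuel      # ~53.3 L burned over 200 km
consumption = fuel_used / distance_so_far       # ~0.267 L/km
remaining_range = remaining_fuel / consumption  # 100 km on the remaining fuel

distance_left = 200.0                # the second half of the trip
print(remaining_range >= distance_left)         # -> False: better stop for gas
```

At a constant consumption rate, the remaining third of the tank buys exactly half the distance already covered: 100 km, with 200 km left to drive.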

Gemini is not the only LLM with these reasoning problems.

Obvious basic errors in so-called amazing education technology are still not treated as a major problem.

Those basic errors are not neutral: users can read and learn them as the truth.
Especially in the education world, where those tools are meant to assist (replace?) teachers and are sold as flawless and cheap.

An article in the Wall Street Journal pointed to those issues regarding Khanmigo.

But first, here’s how they present themselves:
“Khanmigo is a ground-breaking AI tool designed to save teachers time for $1/week”
That’s pushy, right?

And here is what the WSJ found.

Article: “Khanmigo, Khan Academy’s ChatGPT-powered tutoring bot, makes frequent calculation errors”
https://www.wsj.com/tech/ai/ai-is-tutoring-students-but-still-struggles-with-basic-math-694e76d3

Podcast:
https://www.wsj.com/podcasts/google-news-update/what-happens-when-an-ai-tutor-struggles-with-basic-math/eab0f8d6-c587-4c61-b71a-35e7a650ac95

Another source on the same topic:
https://iblnews.org/khanmigo-struggles-with-basic-math-showed-a-report/

Hallucinations, biases, same old, same old, says Khanmigo

Here’s one answer from Khanmigo on those issues:
https://support.khanacademy.org/hc/en-us/articles/13888935335309-How-do-the-Large-Language-Models-powering-Khanmigo-work

“Sometimes, the context of a conversation or text is not clear”
“some language models have been known to generate inappropriate or off-topic responses, which is why it’s important to use them responsibly”

That’s exactly the type of answer I received while testing Gemini on a simple anagram question.
https://twitter.com/kindrobot_org/status/1771173878548635839

Guess what? I was not using it correctly.
It’s obvious: user’s fault. #magic_lousy_argument

Those answers are not acceptable. You can’t blame the user when the tool you provide as a vendor starts interacting in a faulty way.
Especially when your domain is education.

My favorite argument from Khanmigo is this one:
“It’s important to understand their limitations and be aware of the potential biases or errors in the data they are trained on."
https://support.khanacademy.org/hc/en-us/articles/13888935335309-How-do-the-Large-Language-Models-powering-Khanmigo-work

  • Their dataset is not public.
  • And even if it were, the way they select their sources and the way they train and supervise the model is not the user’s responsibility at all.

What can you “understand” from this when your interface is a chat and you receive wrong answers? #blackbox

Ground-breaking, state-of-the-art, seamless, systems that can see, hear, and speak.

This vocabulary is repeated over and over by vendors to convince us that those tools are almost perfect, can fulfill (almost) all our needs, and answer our questions correctly.

The only way to know whether you should trust them is to consider them for what they are (tools under development) and challenge them on domains and knowledge you can easily fact-check.

That’s what I’m trying to do in these blog posts: publishing simple, reproducible examples to temper the overplayed “wow” effect.

Thanks for reading,
Julien

Published: 2024-03-23 07:00:00 +0000 UTC