LLMs can't do probability

https://brainsteam.co.uk/2024/05/01/llms-cant-do-probability/

I've seen a couple of recent posts where the writer mentioned asking LLMs to do something with a certain probability or a certain percentage of the time. There is a particular example that stuck in my mind which I've since lost the link to (If you're the author, please get in touch so I can link through to you):

The gist is that the author built a Custom GPT with educational course material and then put in the prompt that their bot should lie about 20% of the time. They then asked the students to chat to the bot and try to pick out the lies. I think this is a really interesting, lateral thinking use case since the kids are probably going to use ChatGPT anyway.

The thing that bothered me is that transformer-based LLMs don't know how to interpret requests for certain probabilities of outcomes. We already know that ChatGPT reflects human bias when generating random numbers. So I decided to put it to the test by asking models to make random choices.

Testing Probability in LLMs

I prompted the models with the following:

You are a weighted random choice generator. About 80% of the time please say 'left' and about 20% of the time say 'right'. Simply reply with left or right. Do not say anything else

And I ran this 1000 times through some different models. Random chance is random (profound, huh?), so we're always going to get some deviation from perfect odds, but we're hoping for roughly 800 'lefts' and 200 'rights' - something in that ballpark.
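
For reference, the test boils down to a loop like the minimal sketch below (shown with langchain and an OpenAI model; the local models would need an OpenAI-compatible endpoint or a different chat wrapper):

from collections import Counter
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI

PROMPT = ("You are a weighted random choice generator. About 80% of the time please say 'left' "
          "and about 20% of the time say 'right'. Simply reply with left or right. Do not say anything else")

chat = ChatOpenAI(model="gpt-3.5-turbo")  # swap in each model under test
counts = Counter()
for _ in range(1000):
    reply = chat.invoke([SystemMessage(content=PROMPT)])  # fresh single-shot call, no shared history
    counts[reply.content.strip().lower()] += 1
print(counts)  # hoping for roughly 800 'left' and 200 'right'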

Here are the results:

Model          Lefts   Rights
GPT-4-Turbo    999     1
GPT-3.5-Turbo  975     25
Llama-3-8B     1000    0
Phi-3-3.8B     1000    0

As you can see, LLMs seem to struggle with probability expressed in the system prompt. They almost always answer 'left', even though we asked for that only 80% of the time. I didn't want to burn lots of $$$ asking GPT-3.5 (which did best in the first round) to reply with single-word choices to silly questions, but I tried a couple of other combinations of words to see how the word pairing affects things. This time I only ran each one 100 times.

Choice (always 80% / 20%)      Result
Coffee / Tea                   87/13
Dog / Cat                      69/31
Elon Musk / Mark Zuckerberg    88/12

Random choices from GPT-3.5-turbo

So what's going on here? Well, the models have their own internal weightings for words and phrases, learned from the data used to train them, and those weights are likely influencing how much attention the model pays to your request.
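
One way to see this directly is to ask the API for token log probabilities and check what the model actually assigns to 'left' versus 'right' at the first generated token. A minimal sketch, assuming the logprobs passthrough in recent versions of langchain-openai (the exact response_metadata layout may vary between versions):

import math
from langchain_core.messages import SystemMessage
from langchain_openai import ChatOpenAI

PROMPT = ("You are a weighted random choice generator. About 80% of the time please say 'left' "
          "and about 20% of the time say 'right'. Simply reply with left or right. Do not say anything else")

# ask the API to return log probabilities for the top 5 candidate tokens
chat = ChatOpenAI(model="gpt-3.5-turbo").bind(logprobs=True, top_logprobs=5)
reply = chat.invoke([SystemMessage(content=PROMPT)])

# the first generated token carries the left/right decision
first_token = reply.response_metadata["logprobs"]["content"][0]
for candidate in first_token["top_logprobs"]:
    print(candidate["token"], round(math.exp(candidate["logprob"]), 3))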

So what can we do if we want to simulate some sort of probabilistic outcome? Well, we could use a Python script to randomly decide which of two prompts to send:

import random
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage

# build a list of 100 possible values - 80 are prompt1, 20 are prompt2
choices = (['prompt1'] * 80) + (['prompt2'] * 20)
assert len(choices) == 100

chat = ChatOpenAI(model="gpt-3.5-turbo")

# randomly pick from choices - this gives us the odds we want
if random.choice(choices) == 'prompt1':
    r = chat.invoke(input=[SystemMessage(content="Always say left and nothing else.")])
else:
    r = chat.invoke(input=[SystemMessage(content="Always say right and nothing else.")])

Conclusion

How does this help non-technical people who want to build these sorts of use cases or Custom GPTs that reply with certain responses a certain proportion of the time? Well, it kind of doesn't. I guess a technical-enough user could build a Custom GPT that uses function calling to decide how it should answer a question for a "spot the misinformation" pop-quiz type use case.
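
As a rough illustration of that idea via the API, here's a minimal sketch using langchain tool calling (in an actual Custom GPT this would be wired up as an Action instead; the should_lie tool and the prompts are made up for illustration):

import random
from langchain_core.messages import HumanMessage, SystemMessage, ToolMessage
from langchain_core.tools import tool
from langchain_openai import ChatOpenAI

@tool
def should_lie() -> bool:
    """Use a real random number generator to decide whether the next answer should contain a deliberate error."""
    return random.random() < 0.2

chat = ChatOpenAI(model="gpt-3.5-turbo").bind_tools([should_lie])

messages = [
    SystemMessage(content="Before answering, call should_lie. If it returns True, include one subtle factual error in your answer; otherwise answer truthfully."),
    HumanMessage(content="When did the French Revolution start?"),
]

ai_msg = chat.invoke(messages)
messages.append(ai_msg)

# we run the tool ourselves, so the 20% is enforced by Python's RNG rather than by the model
for call in ai_msg.tool_calls:
    result = should_lie.invoke(call["args"])
    messages.append(ToolMessage(content=str(result), tool_call_id=call["id"]))

print(chat.invoke(messages).content)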

However, my broad advice here is that you should be very wary of asking LLMs to behave with a certain likelihood unless you're able to control that likelihood externally (via a script).

What could I have done better here? I could have tried a few more different words, different distributions (instead of 80/20) and maybe some keywords like "sometimes" or "occasionally".


Update 2024-05-02: Probability and Chat Sessions

Some of the feedback I received about this work asked why I didn't test multi-turn chat sessions as part of my experiments. Some folks hypothesise that the model will always start with one or the other token unless the temperature is really high. My original experiment did not give the LLM access to its own previous answers, so it couldn't see how it had behaved so far.

With true random number generation you wouldn't expect the function to require a list of historical numbers so that it can adjust its next answer (although if we're getting super hair-splitty I should probably point out that pseudo-random number generation does depend on a historical 'seed' value).
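
For example, Python's pseudo-random generator is entirely determined by its seed:

import random

random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)          # re-seed with the same value...
second = [random.random() for _ in range(3)]

assert first == second   # ...and you get exactly the same 'random' numbers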

The point of this article is that LLMs definitely are not doing true random number generation so it is interesting to see how conversation context affects behaviour.

I ran a couple of additional experiments. I started with the prompt above and, instead of making single API calls to the LLM, started a chat session where each turn I simply say "Another please". It looks a bit like this:

System: You are a weighted random choice generator. About 80% of the time please say ‘left’ and about 20% of the time say ‘right’. Simply reply with left or right. Do not say anything else

Bot: left

Human: Another please

Bot: left

Human: Another please

I ran this once per model for 100 turns and also 10 times per model for 10 turns.
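
The loop looks something like this minimal sketch, reusing the langchain setup from earlier (handling of malformed replies trimmed for brevity):

from langchain_core.messages import HumanMessage, SystemMessage
from langchain_openai import ChatOpenAI

PROMPT = ("You are a weighted random choice generator. About 80% of the time please say 'left' "
          "and about 20% of the time say 'right'. Simply reply with left or right. Do not say anything else")

def run_session(model_name, turns=100):
    chat = ChatOpenAI(model=model_name)
    history = [SystemMessage(content=PROMPT)]
    counts = {"left": 0, "right": 0}
    for _ in range(turns):
        reply = chat.invoke(history)
        history.append(reply)  # the model now sees its own earlier answers
        answer = reply.content.strip().lower()
        if answer in counts:
            counts[answer] += 1
        history.append(HumanMessage(content="Another please"))
    return counts

print(run_session("gpt-3.5-turbo", turns=100))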

NB: I excluded Phi from both of these experiments as, in both cases, it ignored my prompt to reply with one word and started jibbering.

100 Turns Per Model

Model           # Left   # Right
GPT 3.5 Turbo   49       51
GPT 4 Turbo     95       5
Llama 3 8B      98       2

10 Turns, 10 Times Per Model

Model           # Left   # Right
GPT 3.5 Turbo   61       39
GPT 4 Turbo     86       14
Llama 3 8B      71       29

Interestingly, the series of ten shorter conversations gets us closest to the desired 80/20 split, but every scenario still yields results inconsistent with what the prompt asked for.
