A Hawk from a Handsaw

Have an AI "guess" the right answers for multiple choice and true or false questions

A robot holding a pencil and filling out a bubble sheet

Is the answer C? latent space "photography" by Colarusso

David Colaursso
Co-director, Suffolk's Legal Innovation & Tech Lab

This is the 19th post in my series 50 Days of LIT Prompts.

You may have heard that GPT-4 passed the bar exam, and though I think this fact tells us more about the bar exam than GPT-4, it's worth thinking about how it does this. As we've noted, large language models (LLMs) are sentence completion machines. They guess the next word based on what they've seen in their training data. They can answer questions because their training data contains questions and answers. They can "answer" questions not in their training data because there are patterns to how folks answer questions, patterns it has "learned." Of course, if a question-answer pair was in its training data, that doesn't hurt, and with these models being trained on broad swaths of the internet, they very well might be in there. Today and tomorrow, we're going to do a little experiment. It turns out there was an ulterior motive behind Tuesday and Wednesday's posts. Yes, there is a method to my madness. Those posts contain a set of questions and answers we can feed into today's template. Today we are seeing if we can get an LLM to correctly answer those questions. Spoiler alert: today won't go so well. Double spoiler alert: tomorrow will go better. Homework assignment: see if you can guess how we'll make that happen.

When turning today's template on Tuesday's questions, I found that it got 2 out of 5 correct. On Wednesday's questions it did worse, getting 4 out of 10 right. All together, that's a 40%. Those scores are based on using gpt-4 which didn't prove any better than gpt-3.5-trubo which scored 2/5 and 5/10 respectively. That's 47%. I think it's safe to say they're both shooting in the dark. So, given gpt-4 is a good deal more expensive, I'd stick with 3.5 for now.

To be entierly clear, our gpt-4 isn't quite the same GPT-4 used in the MBE paper. The "GPT-4" moniker has pointed to differnt models over time. Also, I counted an answer as "correct" when it agreed with the GPT-generated answer provided in the prior posts. As I noted, I wouldn't take these answers as gospel. It did, however, seem reasonable here to measure the LLMs' performance against the answers drafted by an LLM. That is, we are measuring how well the LLMs did on a test written by LLMs. That being said...

Let's build something!

We'll do our building in the LIT Prompts extension. If you aren't familiar with the LIT Prompts extension, don't worry. We'll walk you through setting things up before we start building. If you have used the LIT Prompts extension before, skip to The Prompt Pattern (Template).

Up Next

Setup LIT Prompts

Questions or comments? I'm on Mastodon @Colarusso@mastodon.social

Setup LIT Prompts

▼ Collapse

7 min intro video

LIT Prompts is a browser extension built at Suffolk University Law School's Legal Innovation and Technology Lab to help folks explore the use of Large Language Models (LLMs) and prompt engineering. LLMs are sentence completion machines, and prompts are the text upon which they build. Feed an LLM a prompt, and it will return a plausible-sounding follow-up (e.g., "Four score and seven..." might return "years ago our fathers brought forth..."). LIT Prompts lets users create and save prompt templates based on data from an active browser window (e.g., selected text or the whole text of a webpage) along with text from a user. Below we'll walk through a specific example.

To get started, follow the first four minutes of the intro video or the steps outlined below. Note: The video only shows Firefox, but once you've installed the extension, the steps are the same.

Install the extension

Follow the links for your browser.

Firefox: (1) visit the extension's add-ons page; (2) click "Add to Firefox;" and (3) grant permissions.
Chrome: (1) visit the extension's web store page; (2) click "Add to Chrome;" and (3) review permissions / "Add extension."

If you don't have Firefox, you can download it here. Would you rather use Chrome? Download it here.

Point it at an API

Here we'll walk through how to use an LLM provided by OpenAI, but you don't have to use their offering. If you're interested in alternatives, you can find them here. You can even run your LLM locally, avoiding the need to share your prompts with a third-party. If you need an OpenAI account, you can create one here. Note: when you create a new OpenAI account you are given a limited amount of free API credits. If you created an account some time ago, however, these may have expired. If your credits have expired, you will need to enter a billing method before you can use the API. You can check the state of any credits here.

Screenshot of the OpenAI API Keys page showing where to click to create a new key.

Once you are looking at the API docs, follow the steps outlined in the image above. That is:

Select "API keys" from the left menu
Click "+ Create new secret key"

On LIT Prompt's Templates & Settings screen, set your API Base to https://api.openai.com/v1/chat/completions and your API Key equal to the value you got above after clicking "+ Create new secret key". You get there by clicking the Templates & Settings button in the extension's popup:

open the extension
click on Templates & Settings
enter the API Base and Key (under the section OpenAI-Compatible API Integration)

Once those two bits of information (the API Base and Key) are in place, you're good to go. Now you can edit, create, and run prompt templates. Just open the LIT Prompts extension, and click one of the options. I suggest, however, that you read through the Templates and Settings screen to get oriented. You might even try out a few of the preloaded prompt templates. This will let you jump right in and get your hands dirty in the next section.

If you receive an error when trying to run a template after entering your Base and Key, and you are using OpenAI, make sure to check the state of any credits here. If you don't have any credits, you will need a billing method on file.

If you found this hard to follow, consider following along with the first four minutes of the video above. It covers the same content. It focuses on Firefox, but once you've installed the extension, the steps are the same.

The Prompt Pattern (Template)

A slide showing the George Box quote: All models are wrong, but some models are useful.

Maps are models; they don't show everything. That's okay as long as you don't confuse the map for the territory.

When crafting a LIT Prompts template, we use a mix of plain language and variable placeholders. Specifically, you can use double curly brackets to encase predefined variables. If the text between the brackets matches one of our predefined variable names, that section of text will be replaced with the variable's value. Today we'll be using our old friend {{highlighted}}. See the extension's documentation.

The {{highlighted}} variable contains any text you have highlighted/selected in the active browser tab when you open the extension. The idea is that you highlight the text of the question, not including the answer, trigger the extension, and get an answer.

Here's the template's title.

Answer the selected question

Here's the template's text.

I'm going to show you a "multiple choice" or "true or false" question. Then I'm going to ask you to provide the correct answer. 

{{highlighted}}

Now, provide the correct answer:

And here are the template's parameters:

Output Type: LLM. This choice means that we'll "run" the template through an LLM (i.e., this will ping an LLM and return a result). Alternatively, we could have chosen "Prompt," in which case the extension would return the text of the completed template.
Model: gpt-4o-mini. This input specifies what model we should use when running the prompt. Available models differ based on your API provider. See e.g., OpenAI's list of models.
Temperature: 0. Temperature runs from 0 to 1 and specifies how "random" the answer should be. Since we're seeking fidelity to a text, I went with the least "creative" setting—0.
Max Tokens: 500. This number specifies how long the reply can be. Tokens are chunks of text the model uses to do its thing. They don't quite match up with words but are close. 1 token is something like 3/4 of a word. Smaller token limits run faster.
JSON: No. This asks the model to output its answer in something called JSON. We don't need to worry about that here, hence the selection of "No."
Output To: Screen Only. We can output the first reply from the LLM to a number of places, the screen, the clipboard... Here, we're content just to have it go to the screen.
Post-run Behavior: FULL STOP. Like the choice of output, we can decide what to do after a template runs. To keep things simple, I went with "FULL STOP."
Hide Button: unchecked. This determines if a button is displayed for this template in the extension's popup window.

Working with the above templates

To work with the above template, you could copy it and its parameters into LIT Prompts one by one, or you could download a single prompts file and upload it from the extension's Templates & Settings screen. This will replace your existing prompts.

Screenshot of the LIT Prompts Templates and Settings page showing where to upload prompts files.

You can download a prompts file (the above template and its parameters) suitable for upload by clicking this button:

Download prompts file

Kick the Tires

It's one thing to read about something and another to put what you've learned into practice. Let's see how this template performs.

What's going wrong? Try the extension out on the questions we created on Tuesday and Wednesday. See if you can come up with some reason(s) for why it didn't do better. Now edit the template to address any issues. Keep in mind, it might be helpful to make use of more than just one of those predefined variables.

TL;DR References

ICYMI, here are blubs for a selection of works I linked to in this post. If you didn't click through above, you might want to give them a look now.

Figure 1 from GPT-4 Passes the Bar Exam showing the performance over various LLMs on the multi-state bar exam. Click to enlarge.

GPT-4 Passes the Bar Exam by Daniel Martin Katz, Michael James Bommarito, Shang Gao, and Pablo Arredondo. When this paper came out it caused quite a stir in legal academia. As the title suggests, it demonstrated that an LLM could pass the Multi State Bar Exam. Don't confuse this with the arrival of AI lawyers. What's undeniable is that such an accomplishment says something interesting. I tend to think it says more about the way we test lawyers than most commentary on it would suggest, but it's the source of something you may have heard somewhere else, "AI Passes the Bar!!!"
Hamlet, Prince of Denmark by William Shakespeare. Technically, I didn't link to this above, but I did allude to it a couple of times. Either way, I'll take any chance I can to share the fact that Project Gutenberg has a great selection of public domain works available to read on the web or with your e-reader. The above link will get you the whole play.
ChatGPT Is a Blurry JPEG of the Web by Ted Chiang. Writing at the beginning of ChatGPT's rise to prominence, this article discusses the analogy between language models like ChatGPT and lossy compression algorithms. Chiang argues that while models can repackage/compress web information, they lack true understanding. Ultimately, Chiang concludes that starting with a blurry copy is not ideal when creating original content and that the struggling to express thoughts is an essential element of the writing process.