Rubric-Based Machine Evaluation of Assignments

Have students check their work against your rubric before they turn things in (unit testing for written work)

A red hand-written A+ atop a paper next to the tip of a red pen

A+, latent space "photography" by Colarusso

David Colaursso
Co-director, Suffolk's Legal Innovation & Tech Lab

This is the 45th post in my series 50 Days of LIT Prompts.

I teach a project-based class at Suffolk called Coding the Law. Part of the class involves teaching students to code, not because I want to turn law students into coders, but because if you can build a thing, chances are you have a good functional understanding of that thing. It's about preparing students for the world in which they will practice, giving then a lay of the land so they know what's possible and when to call BS. A nice thing about coding projects, if you're a student, is the fact that you can get really concrete feedback about how well you did before you turn in your work. Because for the most part, you can test your work and know if it's doing the job. With writing, it's not always as clear. A large part of this has to do with the ambiguity of success criteria. Grading rubrics can help here, but in my experience, some students still turn in work that doesn't fulfill the rubric's minimum standards. When this happens, I often start to plead with the student in my head, "No, no, you had to have addressed this point somewhere, the rubric, which you've had all semester, which we've covered in class multiple times, said it was 10% of the grade. How did you miss this?" I've often wished my students could run their writing through some automatic tool that warned them that they had missed this or that point. I wish they had access to unit testing for written work. Next year I think they will, and today we'll talk about what that might look like and show you how to build such a tool.

I want to be clear about something, I don't think these tools are at the point where they should be grading student work on their own, and this comes from someone who has a provisional patent on machine-based scoring of free response questions using pre-LLM technology which I also don't trust to do grading on its own. As I've observed in every one of the previous 44 posts, a model's output should start, not end discussion. What I'm advocating for here is using LLMs to provide feedback much in the spirit of our interactive style guide or our logical fallacy detector. The point is to provide feedback to an author so they can improve their work, not to evaluate their work product. If Week 4 didn't make it clear: formative assesment 👍, summative assesment 😝.

There is a larger point worth making here. Sometimes the same tool used in different ways can take on different moral dimensions. Consider the possible uses of risk scores, mathematical models that attempt to classify how likely someone is to find themselves caught up in anti-social activities. Such models are wrong, because as we know, all models are wrong. The question is, "Are they useful?" To answer this, we have to consider context and the costs of getting it wrong. If such risk models are used to triage a limited set of social services the worst that happens is you provide help to someone who might not need it as much as someone else. This might be a real tragedy but when faced with how to provision scarce resources, it is a forgivable mistake, and one that leaves the ill effected no worse off than they would have been absent any intervention. If, however, you use these models to justify holding someone in jail pretrial (while they are presumed innocent), the cost of a false positive is suddenly the placement of someone in a cage who shouldn't be there. This comes with a very different moral salience, esp. when we consider the mistakes such systems are likely to make (i.e., those that reflect the biases in their training data).

The use of an automated rubric by students to help them improve their work is very different than having a machine decide if they have done the job. One is morally defensible even in the face of an imperfect evaluation the other is not. Of course, one could change this calculus a bit by having the machine help with grading as long is it wasn't the final answer, but we would have to think very carefully about fighting automation bias. Maybe instead of providing a provisional score it could simpily flag those it "thinks" are in trouble. As I like to say, context matters. But there is ample evidence to suggest we don't want these models making the ultimate judgment, and they should be used with an understanding of how and why they go wrong. All models are wrong, the question, as always is, "Are they useful?"

I've taken the punctuation exercise from The Elements of Style that we saw back in week 6 and imagined it as a student answer to the prompt "Describe the circumstances and actions that led to the storming of the Bastille."

Describe the circumstances and actions that led to the storming of the Bastille.

In 1788 the King's advisers warned him that the nation was facing bankruptcy therefore he summoned a body called the States-General believing that it would authorize him to levy new taxes. The people of France however were suffering from burdensome taxation oppressive social injustice and acute scarcity of food and their representatives refused to consider projects of taxation until social and economic reforms should be granted. The King who did not realize the gravity of the situation tried to overawe them collecting soldiers in and about Versailles where the sessions were being held. The people of Paris seeing the danger organized militia companies to defend their representatives. In order to supply themselves with arms they attacked the Invalides and the Bastille which contained the principal supplies of arms and munitions in Paris.

Here is the response I got from today's template for the above text. Note: the score is out of 5.

Score: 4

Justification: The student provides a clear explanation of the events and factors that led to the storming of the Bastille, including the nation's financial troubles, social injustice and food scarcity, the King's summoning of the States-General, and the organization of militia companies by the people of Paris. The answer demonstrates a good understanding of the historical context. The student connects most of the important causes and effects, such as the King's actions leading to the people's responses, but does not delve into all aspects in depth. For instance, they could have elaborated further on the significance and outcomes of the storming of the Bastille.

You'll notice this when you read the template below, but it's worth saying that the rubric does not contain any of the content found in the answer. Rather it is concerned with form and structure. That is, it is a rubric that can be shared with students before they answer. That being said, the LLM is making some judgement calls, even though it doesn't have any judgement. For example, what counts as an "oversimplified or incorrect explanation of the events." My general sense is this tool is stronger when it acts like our interactive style guide, not when it tries to evaluate the correctness of an answer. We saw some of the problems with this back in Week 4 generally, and in Week 5 when discussing tone. The point is, it can make suggestions, nothing more. Learning when and when not to act on those suggestions can be made part of the lesson, and that benefits students.

Why not just provide a rubric? Isn't it worth their time to apply it to their work? If your goal is teaching them to produce a certain structure, isn't such engagement more likely to have them internalize it? Maybe, but I can also see an argument for repetition of feedback leading to such an internalization, and I would wonder if someone making such a suggestion would also have concerns about suggesting a writer get a second set of eyes on their work / work with an editor. I don't want to suggest that any of these tools are slot-in replacements for things that already work. I want to ask, can they be a supplement, can they make up for an existing gap? For more of my thinking on this, I suggest checking out Build an AI-Augmented Word Processor. That being said...

Let's build something!

We'll do our building in the LIT Prompts extension. If you aren't familiar with the LIT Prompts extension, don't worry. We'll walk you through setting things up before we start building. If you have used the LIT Prompts extension before, skip to The Prompt Pattern (Template).

Up Next

Setup LIT Prompts

Questions or comments? I'm on Mastodon @Colarusso@mastodon.social

Setup LIT Prompts

▼ Collapse

7 min intro video

LIT Prompts is a browser extension built at Suffolk University Law School's Legal Innovation and Technology Lab to help folks explore the use of Large Language Models (LLMs) and prompt engineering. LLMs are sentence completion machines, and prompts are the text upon which they build. Feed an LLM a prompt, and it will return a plausible-sounding follow-up (e.g., "Four score and seven..." might return "years ago our fathers brought forth..."). LIT Prompts lets users create and save prompt templates based on data from an active browser window (e.g., selected text or the whole text of a webpage) along with text from a user. Below we'll walk through a specific example.

To get started, follow the first four minutes of the intro video or the steps outlined below. Note: The video only shows Firefox, but once you've installed the extension, the steps are the same.

Install the extension

Follow the links for your browser.

Firefox: (1) visit the extension's add-ons page; (2) click "Add to Firefox;" and (3) grant permissions.
Chrome: (1) visit the extension's web store page; (2) click "Add to Chrome;" and (3) review permissions / "Add extension."

If you don't have Firefox, you can download it here. Would you rather use Chrome? Download it here.

Point it at an API

Here we'll walk through how to use an LLM provided by OpenAI, but you don't have to use their offering. If you're interested in alternatives, you can find them here. You can even run your LLM locally, avoiding the need to share your prompts with a third-party. If you need an OpenAI account, you can create one here. Note: when you create a new OpenAI account you are given a limited amount of free API credits. If you created an account some time ago, however, these may have expired. If your credits have expired, you will need to enter a billing method before you can use the API. You can check the state of any credits here.

Screenshot of the OpenAI API Keys page showing where to click to create a new key.

Once you are looking at the API docs, follow the steps outlined in the image above. That is:

Select "API keys" from the left menu
Click "+ Create new secret key"

On LIT Prompt's Templates & Settings screen, set your API Base to https://api.openai.com/v1/chat/completions and your API Key equal to the value you got above after clicking "+ Create new secret key". You get there by clicking the Templates & Settings button in the extension's popup:

open the extension
click on Templates & Settings
enter the API Base and Key (under the section OpenAI-Compatible API Integration)

Once those two bits of information (the API Base and Key) are in place, you're good to go. Now you can edit, create, and run prompt templates. Just open the LIT Prompts extension, and click one of the options. I suggest, however, that you read through the Templates and Settings screen to get oriented. You might even try out a few of the preloaded prompt templates. This will let you jump right in and get your hands dirty in the next section.

If you receive an error when trying to run a template after entering your Base and Key, and you are using OpenAI, make sure to check the state of any credits here. If you don't have any credits, you will need a billing method on file.

If you found this hard to follow, consider following along with the first four minutes of the video above. It covers the same content. It focuses on Firefox, but once you've installed the extension, the steps are the same.

The Prompt Pattern (Template)

A slide showing the George Box quote: All models are wrong, but some models are useful.

Maps are models; they don't show everything. That's okay as long as you don't confuse the map for the territory.

When crafting a LIT Prompts template, we use a mix of plain language and variable placeholders. Specifically, you can use double curly brackets to encase predefined variables. If the text between the brackets matches one of our predefined variable names, that section of text will be replaced with the variable's value. Today we'll be using {{highlighted}}. See the extension's documentation.

The {{highlighted}} variable contains any text you have highlighted/selected in the active browser tab when you open the extension.

To use this template, highlight your question and the answer you want to evaluate, then trigger the template.

FWIW, the rubric below was made with the aid of GPT-4.

Here's the template's title.

Run rubric

Here's the template's text.

You're a teaching assistant helping evaluate student work against a rubric. In a moment I will provide you with a rubric for answering shot-answer questions, a short-answer question, and a student answer to that question. You should then respond with a score for that question along with a justification for that score based on the rubric. 

---

RUBRIC

Score Range: 0 - 5 Points

    5 Points (Excellent)
        The answer provides a comprehensive and detailed explanation of the historical events in question, including all relevant factors, actions, and outcomes.
        It demonstrates a deep understanding of the historical context and the significance of these events.
        The response includes specific details and examples that enrich the explanation.
        It clearly and accurately connects the causes and effects within the historical events being discussed.

    4 Points (Good)
        The answer provides a clear explanation of the key events and factors but may lack some minor details or examples.
        It demonstrates a good understanding of the historical context and significance of the events.
        The response connects most of the important causes and effects but may not cover all aspects in depth.

    3 Points (Satisfactory)
        The answer mentions key events and factors but lacks detail and depth in the explanation.
        It demonstrates a basic understanding of the historical context but may not fully elaborate on the significance of the events.
        The response makes an effort to connect causes and effects but may miss important connections or details.

    2 Points (Needs Improvement)
        The answer provides a vague or incomplete overview of the historical events in question.
        It shows a limited understanding of the historical context and struggles to connect relevant causes and effects.
        Key events or factors are mentioned but not effectively explained or connected.

    1 Point (Poor)
        The answer provides an oversimplified or incorrect explanation of the events.
        It demonstrates a lack of understanding of the historical context.
        There is minimal or no attempt to connect causes and effects, or the information provided is largely irrelevant.

    0 Points (No Attempt)
        The answer does not address the question or is left blank.

---

QUESTION AND ANSWER 

{{highlighted}}

And here are the template's parameters:

Output Type: LLM. This choice means that we'll "run" the template through an LLM (i.e., this will ping an LLM and return a result). Alternatively, we could have chosen "Prompt," in which case the extension would return the text of the completed template.
Model: gpt-4. This input specifies what model we should use when running the prompt. Available models differ based on your API provider. See e.g., OpenAI's list of models.
Temperature: 0.7. Temperature runs from 0 to 1 and specifies how "random" the answer should be. Here I'm using 0.7 because I'm happy to have the text be a little "creative."
Max Tokens: 250. This number specifies how long the reply can be. Tokens are chunks of text the model uses to do its thing. They don't quite match up with words but are close. 1 token is something like 3/4 of a word. Smaller token limits run faster.
JSON: No. This asks the model to output its answer in something called JSON. We don't need to worry about that here, hence the selection of "No."
Output To: Screen Only. We can output the first reply from the LLM to a number of places, the screen, the clipboard... Here, we're content just to have it go to the screen.
Post-run Behavior: FULL STOP. Like the choice of output, we can decide what to do after a template runs. To keep things simple, I went with "FULL STOP."
Hide Button: unchecked. This determines if a button is displayed for this template in the extension's popup window.

Working with the above template

To work with the above template, you could copy it and its parameters into LIT Prompts one by one, or you could download a single prompts file and upload it from the extension's Templates & Settings screen. This will replace your existing prompts.

Screenshot of the LIT Prompts Templates and Settings page showing where to upload prompts files.

You can download a prompts file (the above template and its parameters) suitable for upload by clicking this button:

Download prompts file

Kick the Tires

It's one thing to read about something and another to put what you've learned into practice. Let's see how this template performs.

Make it your own. Swap in you own rubric, give it a spin.

TL;DR References

ICYMI, here are blubs for a selection of works I linked to in this post. If you didn't click through above, you might want to give them a look now.

Unsupervised Machine Scoring of Free Response Answers—Validated Against Law School Final Exams by David Colarusso. This paper presents a novel method for unsupervised machine scoring of short answer and essay question responses, relying solely on a sufficiently large set of responses to a common prompt, absent the need for pre-labeled sample answers—given said prompt is of a particular character. That is, for questions where “good” answers look similar, “wrong” answers are likely to be “wrong” in different ways. Consequently, when a collection of text embeddings for responses to a common prompt are placed in an appropriate feature space, the centroid of their placements can stand in for a model answer, providing a lodestar against which to measure individual responses. This paper examines the efficacy of this method and discusses potential applications.
Dialect prejudice predicts AI decisions about people's character, employability, and criminality by Valentin Hofmann, Pratyusha Ria Kalluri, Dan Jurafsky, Sharese King This research paper investigates the presence of covert racism in language models, specifically in relation to dialect prejudice. The study finds that language models exhibit covert stereotypes about speakers of African American English that are more negative than any human stereotypes recorded. In contrast, the language models' overt stereotypes about African Americans are more positive. The paper also demonstrates the potential harmful consequences of dialect prejudice by showing that language models are more likely to suggest assigning less prestigious jobs, convicting of crimes, and sentencing to death to speakers of African American English. The study further reveals that existing methods for mitigating racial bias in language models do not alleviate dialect prejudice and may even exacerbate the discrepancy between covert and overt stereotypes. The findings have significant implications for the fair and safe use of language technology in employment. Summary based on a draft from our day one template.
The Elements of Style by William Strunk Jr. The Elements of Style is a style guide for writing American English. It was originally written by William Strunk Jr. in 1918 and published in 1920. The book includes eight rules of usage, ten principles of composition, some matters of form, a list of commonly misused words and expressions, and a list of often misspelled words. Summary based on a draft from our day one template.