Taming Texts or: How To Turn Unstructured Prose Into Structured Data

Diagram selected sentence, output JSON

Robot at a chalkboard 'diagraming' a sentence.

See Spot run, latent space "photography" by Colarusso. Looks like our robot instructor isn't the best at diagraming sentences. Hopefully, their text-based counterpart fairs better.

David Colaursso
Co-director, Suffolk's Legal Innovation & Tech Lab

This is the 12th post in my series 50 Days of LIT Prompts.

A good deal of social science, and empirical legal work for that matter, relies on the parsing of large texts. Say you want to explore how often a court sides with the "government" over "industry," and more to the point, when and how much money is involved in such cases. Unless someone else has done the work, there isn't a spreadsheet you can consult. Someone will have to read through all of that court's cases and figure out who was who, who won, and how much money was involved. You might be able to make some progress with something like regular expressions, but unless the text is very regular in its presentation, you'll miss things. Regular expressions let you search for well-defined patterns in text. Instead of saying, "find 867-5309," you can say, "find me all three-digit numbers adjacent to a dash followed by a four-digit number. (i.e., find the phone numbers in this text, not just a single phone number)." The problem is, what if the there's a phone number of the form 555.5555? That darn pretentious period can break everything. So, a more robust method is often called for. The point being, it can be useful to transform prose into something like a spreadsheet, what is sometimes called structured data. Such data is easy for computers to consume. That is, it's easy to sort, count, and connect.

To illustrate how one might go about having an LLM turn prose into structured data, I settled upon diagraming sentences. Hopefully, the reasons why will become clear once we start going through the prompt template below. So, let's build something!

We'll do our building in the LIT Prompts extension. If you aren't familiar with the LIT Prompts extension, don't worry. We'll walk you through setting things up before we start building. If you have used the LIT Prompts extension before, skip to The Prompt Pattern (Template).

Up Next

Setup LIT Prompts

Questions or comments? I'm on Mastodon @Colarusso@mastodon.social

Setup LIT Prompts

▼ Collapse

7 min intro video

LIT Prompts is a browser extension built at Suffolk University Law School's Legal Innovation and Technology Lab to help folks explore the use of Large Language Models (LLMs) and prompt engineering. LLMs are sentence completion machines, and prompts are the text upon which they build. Feed an LLM a prompt, and it will return a plausible-sounding follow-up (e.g., "Four score and seven..." might return "years ago our fathers brought forth..."). LIT Prompts lets users create and save prompt templates based on data from an active browser window (e.g., selected text or the whole text of a webpage) along with text from a user. Below we'll walk through a specific example.

To get started, follow the first four minutes of the intro video or the steps outlined below. Note: The video only shows Firefox, but once you've installed the extension, the steps are the same.

Install the extension

Follow the links for your browser.

Firefox: (1) visit the extension's add-ons page; (2) click "Add to Firefox;" and (3) grant permissions.
Chrome: (1) visit the extension's web store page; (2) click "Add to Chrome;" and (3) review permissions / "Add extension."

If you don't have Firefox, you can download it here. Would you rather use Chrome? Download it here.

Point it at an API

Here we'll walk through how to use an LLM provided by OpenAI, but you don't have to use their offering. If you're interested in alternatives, you can find them here. You can even run your LLM locally, avoiding the need to share your prompts with a third-party. If you need an OpenAI account, you can create one here. Note: when you create a new OpenAI account you are given a limited amount of free API credits. If you created an account some time ago, however, these may have expired. If your credits have expired, you will need to enter a billing method before you can use the API. You can check the state of any credits here.

Screenshot of the OpenAI API Keys page showing where to click to create a new key.

Once you are looking at the API docs, follow the steps outlined in the image above. That is:

Select "API keys" from the left menu
Click "+ Create new secret key"

On LIT Prompt's Templates & Settings screen, set your API Base to https://api.openai.com/v1/chat/completions and your API Key equal to the value you got above after clicking "+ Create new secret key". You get there by clicking the Templates & Settings button in the extension's popup:

open the extension
click on Templates & Settings
enter the API Base and Key (under the section OpenAI-Compatible API Integration)

Once those two bits of information (the API Base and Key) are in place, you're good to go. Now you can edit, create, and run prompt templates. Just open the LIT Prompts extension, and click one of the options. I suggest, however, that you read through the Templates and Settings screen to get oriented. You might even try out a few of the preloaded prompt templates. This will let you jump right in and get your hands dirty in the next section.

If you receive an error when trying to run a template after entering your Base and Key, and you are using OpenAI, make sure to check the state of any credits here. If you don't have any credits, you will need a billing method on file.

If you found this hard to follow, consider following along with the first four minutes of the video above. It covers the same content. It focuses on Firefox, but once you've installed the extension, the steps are the same.

The Prompt Pattern (Template)

A slide showing the George Box quote: All models are wrong, but some models are useful.

Maps are models; they don't show everything. That's okay as long as you don't confuse the map for the territory.

When crafting a LIT Prompts template, we use a mix of plain language and variable placeholders. Specifically, you can use double curly brackets to encase predefined variables. If the text between the brackets matches one of our predefined variable names, that section of text will be replaced with the variable's value. Today we'll make use of our old friend {{highlighted}}. See the extension's documentation.

The {{highlighted}} variable contains any text you have highlighted/selected in the active browser tab when you open the extension. Our goal is to take in this selected text and return a sentence diagram of sorts. Like yesterday's template, this one is a good deal more complex than what we saw in weeks one and two. It is an exercise in data extraction and labeling. However, this template is here mostly to show off the JSON parameter (I'm not sure how much I really trust it as a sentence diagraming tool). By setting the JSON parameter to Yes, we are asking the LLM to construct output in JSON. Consequently, the LLM should produce well-structured JSON output. If you haven't seen JSON before, you might want to read up on it here: https://en.wikipedia.org/wiki/JSON. The prompt below does an okay job of telling you what to expect. As we discussed above, the ability to make nice machine-readable output—structured data—can be very useful. For us, this will prove helpful when working with some of our more complex interactions. FWIW, I had ChatGPT create the specifications below.

Here's the template's title.

"Diagram" selected sentence

Here's the template's text.

Below I will provide you with a string of text. Your job is to produce a JSON representation of its sentence structure. 

1. Representation and JSON Structure:

The JSON representation of sentence structure consists of the following key-value pairs:

a) "subject": This key represents the subject of the sentence and contains an object describing the subject. The subject object can include properties such as "type" (to specify the type of subject, e.g., noun or pronoun) and "value" (to store the actual subject word or phrase).

b) "predicate": This key represents the predicate of the sentence and contains an object describing the predicate. The predicate object can include properties such as "type" (to specify the type of predicate, e.g., verb or verb phrase) and "value" (to store the actual predicate word or phrase).

c) "object": This key represents the object of the sentence and contains an object describing the object. The object can include properties such as "type" (to specify the type of object, e.g., noun or pronoun) and "value" (to store the actual object word or phrase).

d) "complement": This key represents the complement of the sentence and contains an object describing the complement. The complement object can include properties such as "type" (to specify the type of complement, e.g., adjective or noun phrase) and "value" (to store the actual complement word or phrase).

e) "modifiers": This key represents any modifiers or additional information associated with the sentence. It contains an array of objects, where each object describes a specific modifier. Each modifier object can include properties such as "type" (to specify the type of modifier, e.g., adverbial or prepositional phrase) and "value" (to store the actual modifier word or phrase).

2. Example JSON Structure:

{
  "subject": {
    "type": "noun",
    "value": "cat"
  },
  "predicate": {
    "type": "verb",
    "value": "jumped"
  },
  "object": {
    "type": "noun",
    "value": "fence"
  },
  "complement": {
    "type": "adjective",
    "value": "high"
  },
  "modifiers": [
    {
      "type": "adverbial",
      "value": "quickly"
    },
    {
      "type": "prepositional phrase",
      "value": "over the wall"
    }
  ]
}

In this example, the JSON structure represents a sentence where the subject is "cat," the predicate is "jumped," the object is "fence," the complement is "high," and there are two modifiers: "quickly" (an adverbial modifier) and "over the wall" (a prepositional phrase modifier). 

3. Conclusion:
The JSON representation of sentence structure provides a standardized way to describe sentence elements such as subject, predicate, object, complement, and modifiers. It allows for the structured representation of sentence components, making it easier to process and analyze sentence structures programmatically.

Now that I've given you these specifications, your job is to make such an object for the following text string:

{{highlighted}}

Now provide your JSON object:

And here are the template's parameters:

Output Type: LLM. This choice means that we'll "run" the template through an LLM (i.e., this will ping an LLM and return a result). Alternatively, we could have chosen "Prompt," in which case the extension would return the text of the completed template.
Model: gpt-4o-mini. This input specifies what model we should use when running the prompt. Available models differ based on your API provider. See e.g., OpenAI's list of models.
Temperature: 0. Temperature runs from 0 to 1 and specifies how "random" the answer should be. Since we're seeking fidelity to a text, I went with the least "creative" setting—0.
Max Tokens: 300. This number specifies how long the reply can be. Tokens are chunks of text the model uses to do its thing. They don't quite match up with words but are close. 1 token is something like 3/4 of a word. Smaller token limits run faster.
JSON: Yes. This asks the model to output its answer in something called JSON, which is a nice machine-readable way to structure data. See https://en.wikipedia.org/wiki/JSON
Output To: Screen Only. We can output the first reply from the LLM to a number of places, the screen, the clipboard... Here, we're content just to have it go to the screen.
Post-run Behavior: FULL STOP. Like the choice of output, we can decide what to do after a template runs. Here we're happy just to get our recipe. So, "FULL STOP" it is.
Hide Button: unchecked. This determines if a button is displayed for this template in the extension's popup window.

Working with the above templates

To work with the above templates, you could copy it and its parameters into LIT Prompts one by one, or you could download a single prompts file and upload it from the extension's Templates & Settings screen. This will replace your existing prompts.

Screenshot of the LIT Prompts Templates and Settings page showing where to upload prompts files.

You can download a prompts file (the above template and its parameters) suitable for upload by clicking this button:

Download prompts file

Kick the Tires

It's one thing to read about something and another to put what you've learned into practice. Let's see how this template performs.

Find some worksheets. The real point of today's prompt is to get you thinking about how to extract structured data from texts and how you might be able to leverage JSON output, but if you really want to see how well the template is at diagraming sentences, find some worksheets, and see how it does.