Practical LLM: use cases, techniques, and costs

Like many folks, I'm exploring the edges of what LLMs can do. We often hear about the more dramatic use cases for AI and the changes it'll bring to our world, but honestly it's incredibly useful right now and saves me time every day. The goal of this post is to inspire and help us all think of ways to make better products.

Just a robot

Here are 3 ways I'm using LLMs in my code every day:

Better Defaults

Let's say you're trying to add an emoji to indicate the sentiment of game reviews. In the future you hope reviewers will select it themselves, but for now maybe you want to backfill the data a bit. The simple approach is to guess the sentiment by picking from 3 emojis based on the score, but honestly this is likely to turn out a bit bland. If you have a bigger appetite for fun and want to really juice the product, try introducing an LLM into the mix: give it a prompt with the reviews and have it choose an emoji that matches the tone of your brand.
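
For reference, the bland baseline might look something like this (a sketch; the thresholds are arbitrary):

// The boring default: map the review score to one of 3 emojis.
// Thresholds here are arbitrary, purely for illustration.
const emojiForScore = (score: number): string => {
  if (score >= 75) return "😄";
  if (score >= 50) return "😐";
  return "😞";
};

The LLM version is barely more work.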


A prompt for something like this could be as simple as:

You help summarize game reviews using a single emoji. I'll give you the game title, an overall score and a few reviews. You should give each review an emoji and the overall game an emoji as well. The emoji choice should be a little sassy.

Game name: Cyberpunk
Overall Score: 86

Review #1: Cyberpunk 2077 isn't perfect, but it is ambitious. It marries a gripping story with a ....
Review #2: Cyberpunk 2077 is a cynical look at a future, fallen America...
Review #3: Cyberpunk’s running time is littered with familiar problems and...


ChatGPT responds with something like this:

Game: Cyberpunk 😎
Overall Score: 86

Review #1: 😁Phenomenal, that's what Cyberpunk 2077 is.
Review #2: 😏A cynical look at a future, fallen America with a few bugs.
Review #3: 😬Searching in vain for the actual cyberpunk in a sea of bland sci-fi.

After some tuning of the prompt you'll likely get it to produce some JSON, and it'll usually be close enough to usable output. Just in case, I usually write a function like this to find the JSON blob in the response:


const extractJSONFromCompletion = (completion: string) => {
  const trimmed = completion?.trim() ?? "";
  // Bare JSON: the whole completion is the blob.
  if (trimmed.startsWith("{")) {
    return trimmed;
  }
  // Otherwise pull the contents out of a fenced code block.
  const match = trimmed.match(/```(?:json|javascript)?\n([^`]+)\n```/);
  return match?.[1] ?? "";
};


You'll probably want to fine-tune that a bit based on your prompt - or maybe just ask an LLM to write the code lol.

Some further improvements, if you really want to take it to the next level: give it a large but more specific set of emojis to choose from, or replace the default emoji set with something you collaborated on with your designers (a quick sketch follows below). This starts to look a bit more like specifying a taxonomy/categorization.... speaking of which...
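
Here's that sketch - the emoji set, prompt wording, and helper names are just placeholders:

// A curated emoji set your designers signed off on (placeholder values).
const ALLOWED_EMOJIS = ["😎", "😏", "😬", "🔥", "😅"];

const buildReviewPrompt = (reviews: string[]) =>
  `Pick exactly one emoji per review from this set: ${ALLOWED_EMOJIS.join(" ")}.\n` +
  reviews.map((review, i) => `Review #${i + 1}: ${review}`).join("\n");

// Validate the model's picks so a stray emoji never ships.
const isAllowedEmoji = (emoji: string) => ALLOWED_EMOJIS.includes(emoji);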

Categorization

Categorization feels like a traditional ML/NLP problem. I've at times tried to build a model from scratch to help with categorization, but it's never really felt like it's been worth the effort compared to a more heuristic-based approach. These days LLMs fill these gaps easily, and after you've proved that you *can* categorize this way, you could fine-tune your own model later.

Let's say you're running an online store with thousands of items and you want to start showing "similar items." A simple way to do this is to categorize each item with one or more *tags* (a has-and-belongs-to-many relationship).
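
To make that concrete, here's the shape of the data and a naive ranking (the types and the heuristic are mine, purely illustrative):

// Items and tags in a has-and-belongs-to-many relationship.
type Tag = { id: number; name: string };
type Item = { id: number; name: string; tagIds: number[] };

// Naive "similar items": rank by the number of shared tags.
const similarItems = (target: Item, all: Item[]) =>
  all
    .filter((item) => item.id !== target.id)
    .map((item) => ({
      item,
      shared: item.tagIds.filter((id) => target.tagIds.includes(id)).length,
    }))
    .filter(({ shared }) => shared > 0)
    .sort((a, b) => b.shared - a.shared)
    .map(({ item }) => item);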

In my opinion there are 3 important questions here:

  1. Should we pre-generate the list of tags once?
  2. Should the LLM just continually make new tags?
  3. If we're continually generating tags, should we have an approval or merging tool?

These questions are hard to answer; it depends on your project, your risk tolerance and the data you have.
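
On question 3: if the LLM keeps minting new tags, a small merging pass helps keep the taxonomy from sprawling. A minimal sketch, assuming a hand-curated synonym table:

// Collapse near-duplicate tags before they hit the database.
const TAG_SYNONYMS: Record<string, string> = {
  couch: "sofa",
  settee: "sofa",
  // ...curated by hand, or with another prompt
};

const normalizeTag = (raw: string) => {
  const tag = raw.trim().toLowerCase();
  return TAG_SYNONYMS[tag] ?? tag;
};

// Anything not seen before gets queued for human approval.
const needsApproval = (tag: string, knownTags: Set<string>) =>
  !knownTags.has(normalizeTag(tag));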

Let's assume GPT takes the wheel with a completely naive prompt:

You assist with generating tags for furniture in an online store. I'll provide the name, category and description. You should provide just a JSON blob representing the tags.

Name: Malta Leather Sofa
Category: Sofa
Description: Refined yet casual, our Malta Collection features a welcoming aesthetic with soft, clean lines and rustic iron accents. Malta’s cozy bench-style seats, plush side bolsters, and rich, artisan-finished leather will infuse contemporary style into any living space.

Response:

Here's a JSON blob representing tags for the "Malta Leather Sofa" in the "Sofa" category based on the provided description:

```json
{
 "Name": "Malta Leather Sofa",
 "Category": "Sofa",
 "Tags": [
   "Malta Leather Sofa",
   "Sofa",
   "Leather Sofa",
   "Refined",
   "Casual",
   "Welcoming",
   "Clean Lines",
   "Rustic Iron Accents",
   "Bench-Style Seats",
   "Plush Side Bolsters",
   "Artisan-Finished Leather",
   "Contemporary Style",
   "Living Space"
 ]
}
```

Personally, I don't think this produces a good set of tags. Let's try one more time, this time giving it more guidance on which tags are acceptable:

You assist with generating tags for furniture in an online store. I'll provide the name, category and description. You should provide just a JSON blob representing the tags. The types of tags that are appropriate include the collection name, the primary material used, the accent material, and finally the object type. Feel free to include other highly relevant tags but limit it to 5. Example:

```json
{
 "tags": ["Malta", "Leather", "Rustic Iron", "Sofa"]
}
```

...[product description here]

Response:

Here's a JSON blob representing the tags for the "Malta Leather Sofa" based on the provided information:

```json
{
 "tags": ["Malta", "Leather", "Rustic Iron", "Sofa", "Contemporary"]
}
```

In addition to the collection name, primary material, accent material, and object type, I included the "Contemporary" tag as it's a relevant style descriptor based on the description provided.

Now this is feeling like it could be smarter than me at categorizing things, and I'm starting to get more comfortable with letting it take the wheel. I very well might....

Text Generation and Summaries

2 quick examples here, because this one is just straightforward.

  • I recently launched a game - whenever I needed just a little extra bit of story, I'd reach for ChatGPT for a little inspiration, then tune the result.
  • I've also been making some simple websites here and there. Often I've had to provide previews of pages in a sentence for OG (Open Graph) tags, site metadata and subtitles before users click on a page. This type of content often gets forgotten and goes unoptimized. It's generally better to take a good guess at the content rather than wait for the client to give you copy - especially for content like this. They'll definitely edit what they don't like later, but 90% of the time what you (or you + an LLM) come up with is probably pretty good. A sketch of this follows below.
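
For that second case, it could be as simple as this (the prompt wording and helper name are mine):

// Draft an OG/meta description for a page; a human can still edit it later.
const draftOGDescription = async (pageTitle: string, pageText: string) => {
  const completion = await openai.createChatCompletion({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "user",
        content:
          `Summarize this page in one enticing sentence (under 160 characters) ` +
          `suitable for a meta description.\n\nTitle: ${pageTitle}\n\n${pageText}`,
      },
    ],
  });
  return completion.data.choices[0].message?.content?.trim() ?? "";
};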

Overall, when I've seen LLMs write copy for websites, they tend to go off the rails and the tone is never quite right. But they have inspired me: for marketing pages especially, an LLM tends to "hype" more often than I would, and that's probably a good thing.

Considerations

but what about cost?

I recently ran the categorization example above. With around 15k entries (a title and a paragraph) it cost me around $12. I then had to fix subsets of the data for $1-$2 a couple of times because I found issues. For that amount of data, the rule of thumb in my mind is roughly 1,000 paragraphs summarized per $1 - not too bad.
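
If you want to sanity-check that math for your own data, a back-of-envelope estimator is easy to write. The per-token prices below are assumptions for illustration - check your provider's current pricing:

// Rough cost estimator. Prices are assumed for illustration only.
const PRICE_PER_1K_INPUT_TOKENS = 0.0015; // USD, assumption
const PRICE_PER_1K_OUTPUT_TOKENS = 0.002; // USD, assumption
const TOKENS_PER_WORD = 1.3; // rough rule of thumb for English

const estimateCostUSD = (rows: number, inputWords: number, outputWords: number) => {
  const inputTokens = rows * inputWords * TOKENS_PER_WORD;
  const outputTokens = rows * outputWords * TOKENS_PER_WORD;
  return (
    (inputTokens / 1000) * PRICE_PER_1K_INPUT_TOKENS +
    (outputTokens / 1000) * PRICE_PER_1K_OUTPUT_TOKENS
  );
};

// e.g. 15k rows with ~200-word inputs and ~30-word outputs ≈ $7
console.log(estimateCostUSD(15_000, 200, 30).toFixed(2));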

But what if you have a million or a billion rows? At that point you can weigh the engineering + hardware + electricity costs of setting up your own servers and running something like Llama 2 vs. relying on 3rd party APIs. You could probably buy a machine with a nice graphics card for $3k-$5k, throw it in a room somewhere in your office, spend a week setting it up, and allow yourself to make API requests to it. I don't quite know what the electricity cost would be for something like this, but I imagine it's fairly negligible since it'll be inactive most of the time. So for sub $50k a year with maintenance I bet you could run this at your company - if your company is into scrappy solutions like this.
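
For a taste of what "API requests to it" might look like, here's a sketch that assumes you've stood up something like llama.cpp's bundled example server on that box - the hostname is made up, and you'd adapt the request shape to whatever server you actually run:

// Hit a self-hosted model server instead of a paid API.
// Endpoint and fields follow llama.cpp's example server; adjust as needed.
const completeLocally = async (prompt: string) => {
  const res = await fetch("http://llm-box.internal:8080/completion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt, n_predict: 256 }),
  });
  const data = await res.json();
  return data.content as string;
};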

Ultimately, as you're thinking about cost you should ask yourself: is it cheaper than more complicated engineering/design solutions? Is it cheaper than running ads or making more webpages to promote new features or your product?

but what about accuracy?

As I mentioned, I spent some time/money correcting some of my rows. One of the issues I had to address was rows with missing data. My current strategy to improve accuracy is to just retry the prompt.

I'll end up writing something like this:


const messages = [...originalPrompt];
const MAX_RETRIES = 3;
let succeeded = false;
for (let i = 0; i < MAX_RETRIES; i++) {
  const completion = await openai.createChatCompletion({
    model: "gpt-3.5-turbo",
    messages,
  });
  const message = completion.data.choices[0].message;
  // include the response in the history so the next attempt can see it
  messages.push({ role: "assistant", content: message?.content ?? "" });
  const json = extractJSONFromCompletion(message?.content ?? "");
  // time to validate; if it's valid, break the loop
  try {
    const result = JSON.parse(json);
    if (result.hasSomeImportantKey) {
      succeeded = true;
      break;
    }
  } catch {
    // unparseable JSON; fall through and retry
  }
}
// make sure to check if nothing worked, and flag it in some way
if (!succeeded) {
  // e.g. record the row ID somewhere for manual review
}

Nowwww, if you want to double your costs you could also make a separate prompt that helps validate the results of the original prompt. I may use this technique eventually, but for now simple checks work just fine.
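
If you do go the second-prompt route someday, it could be as simple as this (the wording is hypothetical):

// Ask the model to double-check an earlier result; true if it answers YES.
const looksValid = async (task: string, output: string) => {
  const completion = await openai.createChatCompletion({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "user",
        content:
          `Given this task:\n${task}\n\nAnd this output:\n${output}\n\n` +
          `Does the output satisfy the task? Answer only YES or NO.`,
      },
    ],
  });
  const answer = completion.data.choices[0].message?.content ?? "";
  return answer.trim().toUpperCase().startsWith("YES");
};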

what if we get it wrong?

I think the apt questions are "what's the worst that can happen?", "do we care?" and, if we do, "how do we mitigate that risk?" Even the emoji example above can end up precarious. What if the emoji makes sarcastic eyes at the post content and makes fun of the reviewer? You can reduce that case by prompting a 2nd time just to check, or you can try to determine the likelihood of a "dangerous selection". There will always be some risk with using this tech, but I don't think it's worth halting entire projects over that risk, especially when it makes your whole product way better. Another way to think about it: if it messes up 1 in 100 choices that's probably fine; if it's embarrassing 1 in 10,000 times that's probably *still OK* depending on your company/context.

but what about privacy?

This one is tricky. Who owns the data sent in to your company? It really depends on your terms of service and the content you're using. If you're sending sensitive data over to OpenAI, you may want to switch to their enterprise tier and sign a DPA (Data Processing Agreement). This is a pretty standard agreement, and almost every vendor I've ever worked with is down to agree to terms that prohibit selling or reusing your data - if they violate this, time to go to court. I don't know what OpenAI charges for this, but I bet it costs at least on the order of $30-50k a year.

If that answer isn't good enough from a cost or privacy perspective, you may want to take that week to set up an internal version of Llama 2 to receive API calls.

Final thoughts?

Ya this is a good tool, and uhhhh I should end this article right now and go try out the new Cyberpunk update ✌️