Guide

Fine-tuning GPT-4o-mini for Spam Detection

Ted Spare

November 29, 2024

TLDR: We fine-tuned a small LLM on a few dozen spam tags and it's working well.

Disclaimer on naming: we called it ros-spam, which is hopefully more scalable than OpenAI's gpt-3-3.5-turbo-4-4o-06-23-2024-11-20 or Anthropic's sonnet-3-3.5-3.5-but-better. Hindsight is 20/20.

What We Built

We created a spam detection model purpose-built for our contact schema:

type Contact = {
  company: string;
  email: string;
  message: string;
  // ...
}

While it's currently tailored to our needs, the approach could be generalized for broader email spam detection.

The Problem

We get lots of inbound messages. Most are spam. Our workflow for triaging them was simple but inefficient:

  1. Someone submits a message on rubriclabs.com/contact

  2. We get a Slack notification

  3. Someone on our team manually flags it as spam or legitimate

Here's the kicker: even at just 30 seconds per day, this adds up to roughly three hours per year (not to mention the lingering cost of context-switching), making it worthwhile to automate in a post-Cursor world.

The Solution: Fine-tuning

The data from our spam flags was simply stored in Postgres, creating what would become our training dataset:

{ "message": "We sell the best leather couches", "status": "👎" }
// ...
{ "message": "Looking to build an agentic flight booking system", "status": "👍" }

With hundreds of upvotes and downvotes on hand, deduping and cleaning took about 10 minutes thanks to the simple schema, and fine-tuning on OpenAI was straightforward.
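As a rough illustration of that cleanup step, the sketch below dedupes flagged rows and maps each one to a chat-format training example. The `FlaggedRow` shape, the system prompt, and the `spam`/`legitimate` labels are assumptions made for illustration, not necessarily what ros-spam was trained on.

```typescript
// Hypothetical shape of a flagged row; the real table likely has more columns.
type FlaggedRow = { message: string; status: "👍" | "👎" };

type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type TrainingExample = { messages: ChatMessage[] };

// Assumed system prompt, for illustration only.
const SYSTEM_PROMPT = "Classify the contact form message as spam or legitimate.";

// Normalize whitespace and case so near-identical messages collapse to one key.
function normalize(message: string): string {
  return message.trim().replace(/\s+/g, " ").toLowerCase();
}

// Drop empty messages and duplicates, keeping the first flag seen for each message.
function dedupe(rows: FlaggedRow[]): FlaggedRow[] {
  const seen = new Set<string>();
  const unique: FlaggedRow[] = [];
  for (const row of rows) {
    const key = normalize(row.message);
    if (key === "" || seen.has(key)) continue;
    seen.add(key);
    unique.push(row);
  }
  return unique;
}

// Map one flagged row to a chat-format training example.
function toExample(row: FlaggedRow): TrainingExample {
  return {
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: row.message.trim() },
      { role: "assistant", content: row.status === "👎" ? "spam" : "legitimate" },
    ],
  };
}
```

Calling `dedupe(rows).map(toExample)` would then yield the array of examples to fine-tune on.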

The Technical Details

The fine-tuning schema follows a standard chat message format:

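type Message = {
  role: "user" | "assistant" | "system";
  content: string;
}

// Each training example (one line of the JSONL file) wraps a list of messages.
type Example = {
  messages: Message[];
}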

The examples themselves are stored as JSONL, a file format where each line is a valid JSON value.
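For a concrete picture of the format, here is a minimal sketch of serializing examples to JSONL and reading them back. The function names are made up for this example; only `JSON.stringify` and `JSON.parse` are doing real work.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type TrainingExample = { messages: ChatMessage[] };

// Serialize examples to JSONL: one JSON object per line, no enclosing array.
// JSON.stringify escapes newlines inside strings, so each example stays on one line.
function toJsonl(examples: TrainingExample[]): string {
  return examples.map((example) => JSON.stringify(example) + "\n").join("");
}

// Parse JSONL back into objects, skipping blank lines.
function fromJsonl(text: string): TrainingExample[] {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as TrainingExample);
}
```

The resulting string can be written to disk with any file API (for instance Node's `fs.writeFileSync`) before uploading.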

The process was refreshingly simple:

  1. Write our array of examples to a .jsonl file

  2. Upload the file

  3. Wait ~10 minutes

  4. Pay ~$1

  5. Profit?

Does It Work?

Quantitative Evaluation

We did a head-to-head comparison between GPT-4o and ros-spam:

  • We held back 10% of our dataset for testing

  • We ran comparisons in both OpenAI playground and OpenPipe evals

The result: ros-spam achieved 100% accuracy on the held-out set, versus ~80% for GPT-4o, a frontier model, even after prompt engineering.
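The evaluation setup can be sketched as a hold-out split plus an accuracy score. This is a simplified stand-in, not the tooling used above (OpenAI playground and OpenPipe evals); the helper names and the every-tenth-item split are assumptions.

```typescript
// Hold back every tenth item for testing; the rest is used for training.
// Shuffle beforehand if the rows are ordered in any meaningful way.
function splitHoldout<T>(items: T[], every: number = 10): { train: T[]; test: T[] } {
  const train: T[] = [];
  const test: T[] = [];
  items.forEach((item, index) => {
    if (index % every === every - 1) {
      test.push(item);
    } else {
      train.push(item);
    }
  });
  return { train, test };
}

// Fraction of predictions that match the human labels at the same positions.
function accuracy(labels: string[], predictions: string[]): number {
  if (labels.length === 0 || labels.length !== predictions.length) {
    throw new Error("labels and predictions must be non-empty and the same length");
  }
  const correct = labels.filter((label, index) => label === predictions[index]).length;
  return correct / labels.length;
}
```

Running both models over the same `test` split and comparing their `accuracy` values gives the head-to-head number.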

Qualitative Assessment

We shipped it to prod with a feedback loop:

  • Each run appears alongside the message in Slack as 👍/👎

  • We can immediately spot and correct errors

  • When needed, we re-tune the model 🔃
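One way to picture the loop: every human correction becomes a new labeled example for the next tuning round. The `Run` record and its `corrected` field are hypothetical names for this sketch, not the actual production schema.

```typescript
type Label = "spam" | "legitimate";
type LabeledExample = { message: string; label: Label };

// Hypothetical record of one production run; `corrected` is set only when
// a human overrides the model's verdict in Slack.
type Run = { message: string; predicted: Label; corrected?: Label };

// Turn human-corrected runs into new labeled examples for the next tuning round.
function correctionsToExamples(runs: Run[]): LabeledExample[] {
  const examples: LabeledExample[] = [];
  for (const run of runs) {
    if (run.corrected !== undefined && run.corrected !== run.predicted) {
      examples.push({ message: run.message, label: run.corrected });
    }
  }
  return examples;
}
```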

Deployment

The implementation was surprisingly painless. Accessing the model requires just a single-line change from standard GPT-4o calls (swapping in the fine-tuned model's name), whatever client or other standard method you use.
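To make the one-line change concrete, here is a sketch that builds the request body for OpenAI's Chat Completions endpoint. The fine-tuned model ID below is a made-up placeholder (real IDs start with `ft:` and are shown once the tuning job finishes), and the system prompt is likewise an assumption.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type ChatRequest = { model: string; messages: ChatMessage[] };

const BASE_MODEL = "gpt-4o";
// Hypothetical placeholder; substitute the ID reported by your own fine-tuning job.
const FINE_TUNED_MODEL = "ft:gpt-4o-mini-2024-07-18:your-org:ros-spam:abc123";

// Build a Chat Completions request body; only `model` differs between the two calls.
function buildRequest(model: string, message: string): ChatRequest {
  return {
    model,
    messages: [
      // Assumed prompt, for illustration only.
      { role: "system", content: "Classify the contact form message as spam or legitimate." },
      { role: "user", content: message },
    ],
  };
}

const before = buildRequest(BASE_MODEL, "We sell the best leather couches");
const after = buildRequest(FINE_TUNED_MODEL, "We sell the best leather couches"); // the only change
```

Everything else about the call (endpoint, authentication, response parsing) stays as it was.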

For those interested in alternatives, you could also host this on:

  • Fireworks

  • Together

  • OpenPipe

or self-host on bare metal.

Conclusion

The ROI of this exercise was clear: human-level spam tagging running 24/7 in exchange for a couple of hours of dev time.

Have questions or feedback? Drop us a message at hello@rubriclabs.com.