How we optimized Dash's relevance judge with DSPy
Dropbox Dash brings your files, messages, and team's knowledge together in one place, so you can ask questions and get useful answers that are actually grounded in your company's context. Under the hood, that experience relies heavily on one deceptively simple capability: reliably judging which results are relevant to a query at scale. Relevance judges are used across multiple pipelines like ranking, training data generation, and offline evaluation. Without systematic optimization, they can become a primary source of regressions, cost blowups, and loss of trust as models change.
Making a relevance judge work in production is harder than it looks. A prototype might lean on a state-of-the-art model, but real systems have latency and cost budgets, which usually means migrating to smaller or cheaper models. The catch is that prompts often don't transfer cleanly across models. We ran into this while scaling our LLM-as-a-judge work: manual prompt tuning got us to a functioning judge, but quality plateaued early and every model swap, or even a small prompt edit, risked regressions in unexpected cases.
To address prompt brittleness and scale up relevance label generation for the long tail of candidates, we brought in DSPy. DSPy is an open-source framework for systematically optimizing prompts against a measurable objective, turning a manual, fragile process into a repeatable optimization loop. In this article, we'll show how we defined that objective, used DSPy to adapt our judge across models, and made the judge both cheaper and more reliable in production.
How to measure agreement with humans
Before we can improve a relevance judge, we need a clear definition of what "good" means. At its core, the judge's job is straightforward: given a query and a document, it assigns a relevance score from 1 to 5, where 5 indicates a perfect match and 1 indicates no meaningful connection to the query and user intent. To evaluate how well the judge performs, we compare its scores to those assigned by human annotators performing the same task.
In our evaluation dataset, humans are shown a query and a candidate document and asked to rate its relevance on that same 1-5 scale. They also provide a short explanation describing why they chose that score. These human judgments serve as our reference point. For more details on the annotation process, see our LLM-as-a-judge blog. (Dropbox conducts these reviews with limited, non-sensitive internal datasets; no customer data is reviewed by humans as part of this process.)
We then measure how far the model's ratings deviate from the human ratings using normalized mean squared error (NMSE), a metric that summarizes the model's average disagreement with humans as a single number. If a human assigns a 5 and the model assigns a 4, that's a small disagreement; if the human assigns a 5 and the model assigns a 1, that's a much larger one. NMSE captures those differences across the entire dataset by computing the average squared gap between the model's score and the human score, scaled to a 0-100 range. An NMSE of 0 indicates perfect agreement, while higher values indicate worse alignment.
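As a rough illustration of the metric described above, the sketch below computes a normalized mean squared error for 1-5 ratings. The exact normalization is not spelled out here, so dividing by the largest possible squared gap (4² = 16) and multiplying by 100 is an assumption, and the function name `nmse` is hypothetical.

```python
from typing import Sequence

MIN_RATING, MAX_RATING = 1, 5
# Largest possible squared gap on a 1-5 scale: (5 - 1) ** 2 = 16.
MAX_SQUARED_GAP = (MAX_RATING - MIN_RATING) ** 2


def nmse(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Mean squared gap between model and human ratings, scaled to 0-100.

    Assumption: the scaling divides by the maximum squared gap (16).
    """
    if len(predicted) != len(human) or not human:
        raise ValueError("need two non-empty rating lists of equal length")
    squared_gaps = [(p - h) ** 2 for p, h in zip(predicted, human)]
    mean_squared_gap = sum(squared_gaps) / len(squared_gaps)
    return 100.0 * mean_squared_gap / MAX_SQUARED_GAP


# A one-point miss costs far less than a four-point miss.
small_miss = nmse([4], [5])  # 100 * 1 / 16 = 6.25
large_miss = nmse([1], [5])  # 100 * 16 / 16 = 100.0
```

Because the gap is squared, a single four-point miss weighs as much as sixteen one-point misses, which is why large disagreements dominate the score.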
We also account for structural reliability. The judge's output is formatted as JSON; if the model returns broken JSON or fails to follow the expected structure, that output cannot be parsed and therefore cannot be used. In those cases, we treat the response as fully incorrect. These formatting failures aren't cosmetic: if the output cannot be read, examples may be dropped, batches can fail, and evaluation metrics become unreliable.
Taken together, this framework gives us a clear and measurable objective: minimize disagreement with human relevance judgments while ensuring that outputs remain consistently usable in production systems. That's the objective DSPy optimizes against.
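One way to apply this rule is to parse each response defensively and charge anything unusable the worst possible error. The following is a minimal sketch: the `rating` field name, the helper names, and the choice to charge the maximum squared gap (16 on a 1-5 scale) for invalid output are assumptions made for illustration.

```python
import json
from typing import Optional

MAX_SQUARED_GAP = 16  # (5 - 1) ** 2 on a 1-5 scale


def parse_rating(raw_response: str) -> Optional[int]:
    """Return the rating from a JSON response, or None if it is unusable."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return None  # broken JSON
    if not isinstance(payload, dict):
        return None  # valid JSON, but not the expected object
    rating = payload.get("rating")  # hypothetical field name
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(rating, bool) or not isinstance(rating, int):
        return None
    return rating if 1 <= rating <= 5 else None


def squared_gap(raw_response: str, human_rating: int) -> int:
    """Squared disagreement; unusable output is charged the maximum gap."""
    rating = parse_rating(raw_response)
    if rating is None:
        return MAX_SQUARED_GAP
    return (rating - human_rating) ** 2
```

Scoring failures instead of dropping them keeps the dataset size constant across runs, so two prompts are always compared on the same examples.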
Adapting our relevance judge for large-scale use
Our best-performing relevance judge was built on the most powerful proprietary model at the time (OpenAI's o3). It produced high-quality scores and aligned closely with human ratings, but it was expensive to run at scale. As Dash grew, we needed to score orders of magnitude more query-document pairs. Running the most expensive model for every judgment wasn't sustainable. We wanted to move to a lower-cost, open-weight model that we could run at scale.
We chose gpt-oss-120b, an open model that offered a strong balance between cost and performance. In simple terms, it was much cheaper to run, but still capable of following complex instructions. The problem was that our carefully tuned prompt for o3 did not transfer cleanly. When we applied it to the cheaper model, quality dropped under our evaluation metric. Manual prompt rewriting could eventually recover performance, but it would require weeks of iteration and regression chasing. Instead of starting over by hand, we used DSPy to systematically adapt the judge to the new model.
How DSPy helped us adapt the judge
We already had everything needed to define the problem clearly. The task was fixed: given a query and a document, assign a relevance score from 1 to 5. The dataset was fixed: human-annotated examples with ratings and explanations. And the metric was fixed: NMSE, which measures how far the model's ratings deviate from human ratings.
DSPy allows you to define that setup (task, data, and metric) and then systematically search for prompt variants that improve performance on that metric. We used DSPy's GEPA optimizer (a method that iteratively improves prompts by analyzing where the model disagrees with humans and generating feedback) to adapt and optimize the relevance-judging program for a specific target model, in this case gpt-oss-120b.
Rather than treating evaluation as a single score, GEPA generates structured feedback for each example where the model disagrees with a human annotator. In our case, we combined the size and direction of the gap with the human explanation and the model's reasoning, producing concrete signals about what went wrong and why.
This feedback powers the DSPy reflection loop. The prompt is evaluated, its failure modes are surfaced in plain language, the prompt is revised, and the cycle repeats, all while directly optimizing against the human-alignment metric defined earlier. Instead of trying to infer improvements from a single number, the system can respond to specific patterns, such as underweighting recency relative to the human explanation or overvaluing keyword matches. To make this more concrete, here is a simplified version of how we construct that textual feedback:
# Signed gap between the judge's rating and the human rating
diff = predicted_rating - expected_rating
direction = "higher" if diff > 0 else "lower"
feedback_parts = [
    f"Predicted rating {int(predicted_rating)} but expected {int(expected_rating)}.",
    f"Model rated {abs(diff):.0f} point(s) {direction} than the expected human rating.",
]
# Include human explanation if available
if gold.explanation:
    feedback_parts.append(f"Human rationale: {gold.explanation}")
# Include model's explanation for comparison
if pred.explanation:
    feedback_parts.append(f"Model's reasoning: {pred.explanation}")
# Guardrails against overfitting and against changing the task definition
feedback_parts.append(
    "Remember: when adapting the prompt, avoid overfitting to specific "
    "example(s). Do not include exact examples or keywords from them in the prompt. "
    "Also ensure you do not change the basic parameters of the task (e.g. changing the "
    "rating range to be anything but 1-5). Try to add a general rule to an execution "
    "plan to rate similar documents in the future."
)
feedback = "\n".join(feedback_parts)
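A feedback string like this is typically returned alongside a numeric score, so the optimizer sees both how wrong the judge was and why. The sketch below shows one possible shape for such a metric in plain Python. The `Judgment` dataclass, the `judge_metric` name, the abbreviated feedback text, and the score formula (1 minus the normalized squared gap) are assumptions for illustration; a DSPy GEPA metric can return a score together with textual feedback in a similar way.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """Hypothetical container for a rating and its explanation."""
    rating: int
    explanation: str = ""


def judge_metric(gold: Judgment, pred: Judgment) -> dict:
    """Return a score in [0, 1] plus textual feedback for one example."""
    diff = pred.rating - gold.rating
    # Assumed scoring: 1.0 for exact agreement, 0.0 for a four-point miss.
    score = 1.0 - (diff ** 2) / 16
    if diff == 0:
        return {"score": score, "feedback": "Rating matches the human rating."}
    direction = "higher" if diff > 0 else "lower"
    parts = [
        f"Predicted rating {pred.rating} but expected {gold.rating}.",
        f"Model rated {abs(diff)} point(s) {direction} than the expected human rating.",
    ]
    if gold.explanation:
        parts.append(f"Human rationale: {gold.explanation}")
    if pred.explanation:
        parts.append(f"Model's reasoning: {pred.explanation}")
    return {"score": score, "feedback": "\n".join(parts)}


result = judge_metric(
    gold=Judgment(5, "Directly answers the query."),
    pred=Judgment(3, "Only loosely related."),
)
# result["score"] is 0.75 for this two-point miss.
```

Handling exact agreement separately avoids emitting a misleading "0 point(s) lower" message when there is nothing to correct.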
There were important caveats. In early experiments, we observed that the optimizer could overfit by copying specific keywords, usernames, or verbatim document phrases directly into prompts. That behavior improved performance on the training examples but did not generalize. To address this, we added explicit guardrails to forbid direct inclusion of example-specific content. We also found that candidate prompts sometimes modified key task parameters, such as changing the rating scale from 1-5 to 1-3 or 1-4. Additional constraints ensured that the task definition remained stable throughout optimization.
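Guardrails like these can also be checked mechanically before a candidate prompt is accepted. The following is a hypothetical sketch of one such check rather than a description of the constraints actually used: the function name, the regular expression for spotting a changed rating scale, and the five-word overlap threshold are all assumptions.

```python
import re
from typing import Iterable, List

# Matches scale mentions such as "1-3", "1 to 4", or "1\u20135" (en dash).
SCALE_PATTERN = re.compile(r"\b1\s*(?:-|\u2013|to)\s*(\d)\b")


def guardrail_violations(
    prompt: str, example_texts: Iterable[str], ngram_size: int = 5
) -> List[str]:
    """List reasons a candidate prompt should be rejected (empty if none)."""
    violations = []

    # 1. The rating scale must stay 1-5.
    for match in SCALE_PATTERN.finditer(prompt):
        if match.group(1) != "5":
            violations.append(f"rating scale changed to 1-{match.group(1)}")

    # 2. No long verbatim phrases copied from training examples.
    prompt_words = prompt.lower().split()
    prompt_ngrams = {
        " ".join(prompt_words[i : i + ngram_size])
        for i in range(len(prompt_words) - ngram_size + 1)
    }
    for text in example_texts:
        words = text.lower().split()
        for i in range(len(words) - ngram_size + 1):
            if " ".join(words[i : i + ngram_size]) in prompt_ngrams:
                violations.append("verbatim phrase copied from an example")
                break
    return violations
```

A check this simple will produce false positives (for instance, a prompt that mentions "the top 1-2 results"), so it is better suited to flagging candidates for review than to rejecting them outright.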
With this setup in place, we could move beyond intuition and measure the impact directly. Because the task, dataset, and metric were fixed, we could compare the optimized prompt to our original manually tuned prompt under identical conditions. That gave us a clear view of what changed and by how much.
Comparing the best-performing DSPy-optimized prompt to the original manually written prompt, we reduced NMSE by 45 percent (from 8.83 to 4.86). That means the judge's scores tracked human ratings much more closely, increasing our confidence in using it for evaluation and training signals. Model adaptation time dropped from one to two weeks of manual iteration to one to two days. That allowed us to swap in newly released models with less regression risk and keep the judge aligned with evolving product needs.
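For reference, the 45 percent figure is the relative reduction between the two NMSE values, which can be checked in a couple of lines (variable names are illustrative):

```python
baseline_nmse = 8.83   # original manually written prompt
optimized_nmse = 4.86  # best DSPy-optimized prompt

relative_reduction = (baseline_nmse - optimized_nmse) / baseline_nmse
reduction_percent = round(relative_reduction * 100)  # 45
```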
Because the optimized judge could run on a much cheaper model than our production o3 judge, we were also able to label 10 to 100 times more data at the same cost. That increased coverage and statistical power, enabled larger experiments, and reduced the risk of downstream models overfitting to a small evaluation set. Those results showed that DSPy could preserve quality while dramatically reducing cost.
However, optimizing for cost and human alignment still leaves an important question: can the judge behave reliably when its outputs are consumed programmatically in automated pipelines? In Dash, the relevance judge doesn't run in isolation. It sits inside systems that score large candidate sets, generate training data, and run offline simulations. That means its outputs aren't just read by people; they're parsed and acted on by other components. This introduces a second requirement: operational reliability.
Improving operational reliability
When we talk about judge quality, it's easy to focus only on how closely the model's scores match human ratings. But in practice, the judge also has to consistently produce JSON outputs that downstream systems can read and use.
To stress test this dimension of reliability, we introduced gemma-3-12b, a much smaller and cheaper model. Smaller models reduce cost and enable broader scaling, but they are more brittle about formatting and instruction-following. By adapting our judge to a significantly smaller model, we could measure and directly optimize what was effectively the system's weakest link: whether a low-cost judge could produce valid, machine-readable outputs consistently enough to be usable in Dash's pipelines.
In the baseline configuration, more than 40 percent of gemma-3-12b's responses were malformed JSON. Under our evaluation rules, those responses were treated as fully incorrect. This meant that even before considering alignment with human ratings, the judge was unreliable from an operational standpoint. After DSPy optimization, malformed outputs dropped by more than 97 percent, and NMSE improved substantially:
| Version | NMSE | Valid Response Format | Invalid Response Format |
|---|---|---|---|
| Original Prompt (Baseline) | 46.88 | 498 | 358 |
| DSPy prompt (MIPROv2) | 17.26 | 847 | 9 |
This result showed that DSPy was not only improving alignment with human judgments, but also strengthening structural reliability. Even a smaller, weaker model could become operationally dependable when optimized against the right objective.
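The percentages quoted above follow directly from the counts in the table. The short calculation below reproduces them (variable names are illustrative):

```python
baseline_valid, baseline_invalid = 498, 358
optimized_valid, optimized_invalid = 847, 9

# Both runs cover the same 856 responses.
total = baseline_valid + baseline_invalid
assert total == optimized_valid + optimized_invalid

baseline_invalid_rate = baseline_invalid / total     # about 0.418
optimized_invalid_rate = optimized_invalid / total   # about 0.011
drop_in_malformed = 1 - optimized_invalid / baseline_invalid  # about 0.975
```

Since malformed responses are scored as fully incorrect, the roughly 42 percent baseline failure rate alone accounts for a large share of the baseline NMSE.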
At the same time, this experiment reinforced another benefit of the approach: iteration speed. Although gemma-3-12b was ultimately too weak for our highest-quality production judge paths, DSPy allowed us to reach that conclusion quickly and with measurable evidence. Instead of prolonged debate or manual trial and error, we could test the model directly against our evaluation framework and make a confident decision.
Incrementally improving our o3 model
One finding emerged across our explorations: DSPy let us control the scope of changes, from small prompt edits to broader adjustments. When adapting to a new, cheaper model (like gpt-oss-120b or gemma-3-12b), we were comfortable with full prompt rewrites, prioritizing broad exploration and end-to-end optimization. But when the target was our production o3 judge (already strong and widely depended on), the constraint flipped. Our goal was to make targeted improvements without destabilizing behavior relied on across multiple pipelines.
When it came to optimizing the o3-based judge, we weren't starting from scratch. We already had a high-performing baseline. Large prompt rewrites were too risky; even small wording changes could shift behavior in corner cases, and the blast radius was high. So instead of rewriting the prompt end-to-end, we limited changes to a small, predefined set of safe edits.
We introduced an instruction library layer to make prompt improvement more targeted and easier to control. When we found cases where the judge's score differed substantially from the human rating, humans wrote short explanations describing what the judge misunderstood and what it should have paid attention to instead. We then distilled those explanations into single-line instruction bullets: small, reusable "rules of thumb" the model can follow. In this setup, the optimization module is responsible only for selecting the best instruction bullets. DSPy can't rewrite the entire prompt from scratch; instead, its job is to choose which instruction bullets to include (for example, those that address common error themes) and how to combine them, so the prompt grows by assembling the most helpful additional guidance rather than being constantly rewritten.
This turned optimization into something closer to "small PRs with tests" than a large-scale refactor: improvements were incremental, regressions were easier to diagnose, and we could keep the baseline behavior stable while still pushing agreement upward.
For example, if a disagreement was explained as "the document is older than a year, so it's less relevant for this query," we translated that into a bullet like: "Documents older than a year should be rated at least one point lower unless they are clearly evergreen." DSPy could then learn whether including that bullet improved alignment on the eval set without unintended side effects.
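To make the selection step concrete, here is a minimal sketch that chooses instruction bullets by greedy forward selection: a bullet is added only if it lowers the error on the eval set. This is a stand-in for illustration, not the search procedure DSPy actually runs; `build_prompt`, `select_bullets`, the `evaluate` callback, the toy scoring function, and the second bullet are all hypothetical.

```python
from typing import Callable, List, Sequence


def build_prompt(base_prompt: str, bullets: Sequence[str]) -> str:
    """Append the selected bullets to the unchanged base prompt."""
    return "\n".join([base_prompt, *(f"- {b}" for b in bullets)])


def select_bullets(
    base_prompt: str,
    bullets: Sequence[str],
    evaluate: Callable[[str], float],
) -> List[str]:
    """Greedily keep bullets that lower eval error (lower is better)."""
    chosen: List[str] = []
    best_error = evaluate(base_prompt)
    remaining = list(bullets)
    while remaining:
        # Score each remaining bullet when added to the current selection.
        scored = [
            (evaluate(build_prompt(base_prompt, chosen + [b])), b)
            for b in remaining
        ]
        error, bullet = min(scored, key=lambda pair: pair[0])
        if error >= best_error:
            break  # no remaining bullet helps
        chosen.append(bullet)
        remaining.remove(bullet)
        best_error = error
    return chosen


# Toy evaluator: pretend the recency bullet helps and the other one hurts.
def toy_evaluate(prompt: str) -> float:
    error = 10.0
    if "older than a year" in prompt:
        error -= 3.0
    if "Prefer longer documents" in prompt:
        error += 2.0
    return error


selected = select_bullets(
    "Rate relevance from 1 to 5.",
    [
        "Documents older than a year should be rated at least one point "
        "lower unless they are clearly evergreen.",
        "Prefer longer documents.",  # made-up bullet for the toy example
    ],
    toy_evaluate,
)
# Only the recency bullet is kept.
```

Because the base prompt is never edited, any regression can be traced to a specific bullet and reverted by removing it.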
We can see the cumulative effect of these incremental changes in the evaluation results below:
Each step represents a small, testable change, but together they produce a substantial improvement over the initial prompt.
Conclusion
In Dash, relevance scoring is a core capability that shapes ranking, training data generation, and offline simulation. Because it sits at the center of multiple pipelines, even small changes in how we score relevance can ripple outward. If every new model or prompting idea requires manual prompt surgery, progress becomes slow and risky.
With DSPy, we define the objective (alignment with human relevance judgments) and systematically optimize toward it. With the task and dataset held fixed, we can swap in new models and adapt them quickly, with measurable evidence instead of intuition. The workflow becomes less about rewriting prompts and more about improving against a clear metric. Just as importantly, DSPy lets us choose how to improve depending on our risk tolerance. We can run full end-to-end optimization when exploring new, cheaper models, or apply constrained, incremental updates when stability matters for production systems like o3.
In a system like Dash, where relevance scoring touches ranking, training data generation, offline simulation, and cost-latency tradeoffs, prompt optimization can't be a one-off effort. DSPy turns it into a repeatable loop: define the task, measure against human labels, optimize, and ship changes with confidence as models evolve.
Acknowledgments: This work was made possible by close collaboration across Dropbox. We'd like to thank Eider Moore, Mingming Liu, Stella Xiang, Sean Chang, Prasang Upadhyaya, Hans Sayyadi, and Josh Clemm for their thoughtful reviews, technical feedback, and help shaping both the system and the story.
We're also grateful to the DSPy community for their engagement and support. In particular, we'd like to thank Isaac Miller, Drew Breunig, Lakshya A. Agrawal, and Omar Khattab for their guidance, discussions, and responsiveness as we applied DSPy to real production systems at Dropbox.
~ ~ ~
If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit jobs.dropbox.com to see our open roles.