BigHat Biosciences

TL;DR we compared a bunch of protein language models in the lab. While there are winners (see below!), how you use the model matters about as much as which model you choose, and having even a small amount of data is far better than any zero-shot method.

Protein Language Models and Antibody Drug Engineering

Generative protein language models (pLMs) have become a core part of the modern AI/ML-driven protein engineering workflow, so at BigHat they are one of the many tools we deploy in our weekly design-build-test loop. These models share their architectures and pre-training objectives with the LLMs we have all come to know/love/hate, but they learn to ‘speak’ the language of amino acids.

There has been a proliferation of these models in recent years, with some focusing on antibody-specific training data, some playing the same scaling game that has been so successful in natural-language LLMs, and others tweaking the training process and objectives. Because we’ve often seen a difference between the metrics reported in publications and real-world performance, and because it’s increasingly clear that bigger isn’t always better with pLMs¹, we decided to perform an evaluation of our own.

The Bakeoff

Our goal was to find the model that was best at diversifying existing antibody sequences without ‘breaking’ the molecule: we want to be able to introduce multiple mutations in the CDRs (the regions primarily responsible for binding the target) and still have the resulting variants be producible in the lab and retain binding. This sort of task is useful for ‘hit expansion’ (turning one hit into many), as well as for initial exploration and dataset generation at the beginning of an optimization campaign.

We surveyed the literature, focusing on sequence-only, open-source pLMs, and identified >30 models of interest. From these, we selected 11 models that covered a range of important properties: antibody-specific or general, autoregressive or masked language model, large or small, etc. These models underwent an initial in silico evaluation alongside some baseline models for comparison (one structure model and one lightweight model). We used these models to propose mutations for leads from an optimization campaign, and then used a suite of in-house predictive models to evaluate the resulting variants across 6 metrics (Figure 1). We also tested different methods for generating variants with the same models (e.g., Gibbs sampling vs. beam search; more on this later!).

Figure 1. Our protein language model “bakeoff” started with in silico evaluation against predictive developability and binding models, then took the top models through two rounds of evaluation in the lab.

From this, we selected 7 models to send to the lab, designing ≥20 variants of both scFvs and Fabs for each (along with structure and lightweight model baselines). The best-performing models from the first round of in vitro testing were included in a second round, and the final results are interesting from several perspectives (Figure 2). It’s worth noting that an evaluation like this, including many models and many plates of variants, is an easy, routine task rather than a chore thanks to RADS.

Figure 2. The full results from both rounds of lab validation (a) show a range of performance across models. Focusing on the models from the last round, as well as a lightweight Sapiens baseline (b), we see that Ablang2 has a marginal lead, but that the sampling method (Gibbs vs. beam search) is as important as the model choice.

What did we learn?

Ablang2 has a narrow lead over other models, though the differences are quite small despite the years of pLM development and dozens (hundreds?) of publications.

Sampling method matters as much as or more than the choice of model: every model is better with beam search, and the per-model improvement from sampling method is bigger than the delta between most models. See our publication “How to make the most of your masked language model for protein engineering” for more details on sampling methods.
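To make the Gibbs vs. beam search distinction concrete, here is a minimal sketch of both strategies. It is not the implementation from our publication: `toy_log_probs` is a hypothetical stand-in for a real masked-language-model call, and the seed sequence, editable positions, and beam width are made up for illustration.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_log_probs(seq, pos):
    """Hypothetical stand-in for a masked-LM call.

    Returns log-probabilities over the 20 amino acids at `pos`, here
    simply favoring residues alphabetically close to the left neighbor.
    """
    left = seq[pos - 1] if pos > 0 else "A"
    anchor = AMINO_ACIDS.index(left)
    scores = [-abs(i - anchor) / 4.0 for i in range(len(AMINO_ACIDS))]
    log_z = math.log(sum(math.exp(s) for s in scores))
    return {aa: s - log_z for aa, s in zip(AMINO_ACIDS, scores)}


def beam_search(seq, positions, log_probs_fn, beam_width=3):
    """Fill positions in order, keeping the `beam_width` best partial designs."""
    beams = [(0.0, seq)]  # (cumulative log-probability, sequence)
    for pos in positions:
        candidates = []
        for score, s in beams:
            for aa, lp in log_probs_fn(s, pos).items():
                candidates.append((score + lp, s[:pos] + aa + s[pos + 1:]))
        # Highest score first; ties broken alphabetically for determinism.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        beams = candidates[:beam_width]
    return beams


def gibbs_sample(seq, positions, log_probs_fn, n_sweeps=5, seed=0):
    """Resample each position in turn from the model's conditional distribution."""
    rng = random.Random(seed)
    s = seq
    for _ in range(n_sweeps):
        for pos in positions:
            lp = log_probs_fn(s, pos)
            residues = list(lp)
            weights = [math.exp(lp[aa]) for aa in residues]
            aa = rng.choices(residues, weights=weights, k=1)[0]
            s = s[:pos] + aa + s[pos + 1:]
    return s


seed_seq = "ACDEFG"      # made-up seed sequence
cdr_positions = [3, 4]   # hypothetical editable positions
top_designs = beam_search(seed_seq, cdr_positions, toy_log_probs, beam_width=3)
sampled_design = gibbs_sample(seed_seq, cdr_positions, toy_log_probs)
```

Beam search is deterministic and keeps only the highest-scoring combinations of mutations, while Gibbs sampling draws each residue at random from the model's conditional distribution, so it can wander into lower-likelihood designs. With a real pLM, `log_probs_fn` would mask the position and read off the model's predicted distribution.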

Models vs. Data

So far, all the results we’ve discussed could be described as “zero-shot”: the model is pre-trained on a large public dataset, and then we apply it to sequences from a particular drug development campaign that certainly differ from the public training data. However, at BigHat, campaign-specific data generation is our superpower, so it makes sense to ask how helpful this data might be. Given how expensive training large pLMs can be, this may even be the cheaper option!

Figure 3. Comparing supervision with a small amount of data (Ablang2 + supervision) to the best zero-shot methods from round 2 confirms that having campaign-relevant data beats relying on pre-training by a wide margin.

We find that using even a small amount of data (~400-700 samples², less than a week of BigHat platform capacity) to guide the generative model and filter its outputs yields by far the best-performing model, and provides a performance improvement bigger than that from the past few years of pLM development (Figure 3; see “How to make the most of your masked language model for protein engineering” for more details).
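As a rough illustration of the "filter its outputs" half of this idea, the sketch below fits a very simple supervised model on a handful of labeled sequences and uses it to rank generated candidates. Everything here is an assumption for illustration: the additive per-position model, the function names, and the toy measurements are hypothetical, and this is not the supervision method used in our experiments.

```python
from collections import defaultdict
from statistics import mean


def fit_additive_model(labeled):
    """Fit a per-position, per-residue additive model.

    `labeled` is a list of (sequence, measured_value) pairs. Each residue's
    effect is the mean deviation from the overall average among the
    sequences that carry that residue at that position.
    """
    baseline = mean(y for _, y in labeled)
    buckets = defaultdict(list)
    for seq, y in labeled:
        for pos, aa in enumerate(seq):
            buckets[(pos, aa)].append(y - baseline)
    effects = {key: mean(vals) for key, vals in buckets.items()}
    return baseline, effects


def predict(model, seq):
    """Predicted value; residues never seen in training contribute zero."""
    baseline, effects = model
    return baseline + sum(effects.get((pos, aa), 0.0) for pos, aa in enumerate(seq))


def filter_candidates(model, candidates, keep=2):
    """Keep the `keep` candidates with the highest predicted value."""
    ranked = sorted(candidates, key=lambda s: (-predict(model, s), s))
    return ranked[:keep]


# Toy campaign data (made up): sequence -> measured property, higher is better.
campaign_data = [("ACD", 1.0), ("ACE", 3.0), ("GCD", 0.0), ("GCE", 2.0)]
model = fit_additive_model(campaign_data)

# Candidates would come from a generative pLM; here they are hard-coded.
generated = ["ACE", "GCD", "ACD", "GCE"]
shortlist = filter_candidates(model, generated, keep=2)
```

In practice the supervised model would be trained on campaign measurements and could also steer generation itself (for example by rescoring partial designs during search) rather than only filtering afterwards.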

So what?

There are a few high-level points worth taking away from this (though see below for some more in-the-weeds but interesting findings!):

Choice of model does matter somewhat, and Ablang2 is a great starting point for CDR diversification.
How you use the model matters as much as or more than which model you use! Try beam search with your favorite pLM for a free performance boost.
If you can collect data, do it! Incorporating data from your drug development program is by far your best option and may save you time and money compared to the GPU hours you might burn pre-training a better pLM.

Aside: Structure vs. Sequence

We also evaluated a model (DiffAb+) that includes structural information, for comparison with the sequence-only models. The overall success rate for this model was modest, but perhaps more interestingly, the variants for which it was successful seemed to be biased towards tighter binding. Is this because of the structural information, or could it simply be a difference in the training data, since this model is trained on the PDB instead of, e.g., OAS? Comparing the results to a pLM trained only on the sequences from the PDB (no structures) provides some insight: at least some of the improvement in binding comes solely from the bias in the sequences contained in the PDB (Figure A1)! While only a tantalizing hint, this adds to a growing body of evidence that biases in which molecules people bother to determine structures for can also result in biases in models.

Figure A1. While CDR diversification success rates (left) are similar, we see trends towards higher affinity in the structure-driven model (DiffAb+), and we can replicate these higher affinities in a language model by simply training on only the antibody sequences in the PDB (SAbDab MLM) rather than the broader set of Abs in OAS (Sapiens).

For a more in-depth look at our work, please see our recent publication:

https://arxiv.org/abs/2603.10302


References

¹ See for example:
“Feature Reuse and Scaling: Understanding Transfer Learning with Protein Language Models” (Li*, Amini*, Yue, Yang, Lu), ICML 2024
“Protein Language Models: Is Scaling Necessary?” (Fournier, Vernon, van der Sloot, Schulz, Chandar, Langmead), bioRxiv 2024
² Results shown are supervised using 729 datapoints, but a retrospective in silico analysis indicates that approximately 400 would likely have been sufficient.

Posted: 3/12/2026

Written by: Hunter Elliott & Calvin McCarter
