Rivaling Frontier Models with an Open-Weight System on FrontierScience-Research

Seunghyun Moon, Johyun Park, Hojin Yoo, Suyoung Hwang · 2026-05-28


The application of large language models (LLMs) to scientific research has rapidly matured from a speculative prospect into an active area of development, with a growing number of systems now targeting tasks across scientific research. While empirical validation in real research settings remains the gold standard, its cost, time horizon, and domain specificity make it ill-suited as a primary signal for tracking progress across the rapidly evolving LLM landscape. Standardized benchmarks have therefore become a complementary and indispensable mechanism for comparative assessment.

As a team working on AI for science, we focus both on achieving high benchmark scores and on obtaining wet-lab validation results. In this context, we are pleased to report that Spacer 1.0, our recently developed on-premise, open-weight LLM-based system, has ranked among state-of-the-art proprietary models on FrontierScience-Research, which we regard as one of the most meaningful LLM benchmarks currently available for evaluating genuine scientific reasoning abilities. In this post, we provide a concise overview of the system and some remarks on the benchmark results.

About FrontierScience

FrontierScience is a benchmark released by OpenAI in December 2025, designed to measure expert-level scientific capabilities. It comprises over 700 original problems written and verified by experts in physics, chemistry, and biology.

FrontierScience includes two tracks of questions: the Olympiad track, which measures precise problem-solving capabilities in constrained settings, and the Research track, our focus here, which measures real-world scientific research abilities. The Research track comprises open-ended problems representative of sub-problems encountered in actual research, with 45 PhD scientists involved in both question design and an explicit review process, grounding the benchmark in genuine domain expertise. A notable feature of the Research track is its rubric-based grading. Each problem is scored out of 10 points, with individual scores assigned across multiple evaluation criteria, and a response is marked correct if it earns 7 or more points. As we believe that exact string-match or multiple-choice formats do not sufficiently account for the contextual complexities of the natural sciences (see our previous blog post), we consider rubric-based grading, which allows each reasoning step to be scored independently, a reasonable alternative.
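The grading rule described above can be sketched in a few lines of Python. This is only an illustration of the 10-point scale and the 7-point threshold stated in the text; the `RubricItem` structure, the function names, and the example criteria are hypothetical and are not taken from the benchmark's actual rubric files or grader.

```python
from dataclasses import dataclass

MAX_POINTS = 10.0      # each problem is scored out of 10 points (from the post)
PASS_THRESHOLD = 7.0   # a response counts as correct at 7 or more points (from the post)


@dataclass
class RubricItem:
    """One evaluation criterion. Hypothetical structure, for illustration only."""
    description: str
    max_points: float
    awarded: float


def total_score(items: list[RubricItem]) -> float:
    """Sum the points awarded across all criteria, after basic sanity checks."""
    if abs(sum(item.max_points for item in items) - MAX_POINTS) > 1e-9:
        raise ValueError("rubric items must add up to 10 points")
    for item in items:
        if not 0.0 <= item.awarded <= item.max_points:
            raise ValueError(f"invalid award for: {item.description}")
    return sum(item.awarded for item in items)


def is_correct(items: list[RubricItem]) -> bool:
    """Apply the pass rule: 7 or more points out of 10."""
    return total_score(items) >= PASS_THRESHOLD


# Made-up rubric and awards, not real benchmark data.
example = [
    RubricItem("identifies the governing mechanism", 4.0, 4.0),
    RubricItem("classifies the candidate structures", 3.0, 2.5),
    RubricItem("reaches the correct final answer", 3.0, 1.0),
]
print(total_score(example), is_correct(example))  # 7.5 True
```

Because points are awarded per criterion, a response can earn partial credit for a correct intermediate step even when its final answer is wrong, which is the property the paragraph above highlights.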

As OpenAI themselves acknowledge, the benchmark has clear limitations, such as its omission of novel hypothesis generation, which represents a significant dimension of scientific research. Nevertheless, we consider FrontierScience, especially its Research track, to be the most reasonable option currently available for evaluating practical scientific research capabilities.

Our Approach

Spacer 1.0 represents our most recent approach to designing an LLM-based system capable of complex reasoning and problem-solving relevant to real-world scientific research processes. It is a multi-agent framework built upon open-weight models only, primarily Z.ai’s GLM-5.1.

Spacer 1.0 actively orchestrates its execution according to the demands of each query, grounding its reasoning in relevant evidence. Each model within the system is supplied with appropriate domain corpora and specialized databases, without the use of internet search. This results in a response shaped by the right disciplinary lens at each stage, rather than one drawn solely from what the base model has internalized.

This system is the result of focusing less on what the model inherently knows and more on how to bring it into genuine contact with the context, grounding its reasoning in the right structure at every step. (Please refer to our previous blog post for the philosophy behind this architectural choice.)

Results

We measured the FrontierScience-Research track scores of Spacer 1.0 along with representative open-weight and proprietary models ourselves via Inspect — a framework for large language model evaluations created by the UK AI Security Institute. As officially reported scores for publicly released models are not completely reproducible due to run-to-run variance, we reran the evaluation for all models simultaneously for a fair comparison.

The figure below presents the evaluation results, with the tested models listed in descending order of pass@ep, that is, the single-run pass rate for each model. Spacer 1.0 scored 33.9%, ranking second, right after OpenAI’s GPT-5.4 Pro and ahead of the other proprietary OpenAI models and Claude Opus 4.6, with GLM-5.1 and other open-weight models trailing further behind. The Pass@3 score takes, for each problem, the maximum over 3 independent runs; on this metric, Spacer 1.0 tied with GPT-5.4 Pro for first place. The scores we obtained for OpenAI models are consistent with the officially reported scores and other independently evaluated results.

Bar chart comparing FrontierScience-Research pass rates for tested models in descending order by pass@ep; Spacer 1.0 ranks second at 33.9%, closely trailing GPT-5.4 Pro, while open-weight base models lag further behind.
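As a rough illustration of the two metrics reported in the figure, the sketch below computes a single-run pass rate and a Pass@3 rate from per-problem rubric scores. The data layout, function names, and example scores are hypothetical, and the exact way pass@ep aggregates over runs is not specified in this post, so the first function simply scores one chosen run.

```python
PASS_THRESHOLD = 7.0  # 7 or more points out of 10 counts as correct


def single_run_pass_rate(scores_by_problem, run_index=0, threshold=PASS_THRESHOLD):
    """Fraction of problems passed in one chosen run."""
    passed = [runs[run_index] >= threshold for runs in scores_by_problem.values()]
    return sum(passed) / len(passed)


def pass_at_k(scores_by_problem, k=3, threshold=PASS_THRESHOLD):
    """Fraction of problems whose best score over the first k runs passes."""
    passed = []
    for problem, runs in scores_by_problem.items():
        if len(runs) < k:
            raise ValueError(f"problem {problem!r} has fewer than {k} runs")
        passed.append(max(runs[:k]) >= threshold)
    return sum(passed) / len(passed)


# Made-up rubric scores for four problems over three independent runs.
scores = {
    "p1": [9.0, 6.5, 7.0],
    "p2": [3.0, 7.5, 2.0],
    "p3": [0.5, 3.0, 6.0],
    "p4": [10.0, 8.0, 9.0],
}
print(single_run_pass_rate(scores, run_index=0))  # 0.5
print(pass_at_k(scores, k=3))                     # 0.75
```

Taking the maximum over runs can only keep or raise a problem's outcome, so Pass@3 is never lower than any single-run pass rate on the same data, which is why rankings can differ between the two metrics.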

Despite ranking slightly below GPT-5.4 Pro, Spacer 1.0 has distinct strengths and handles a different spectrum of tasks. As examples, we present two cases where our system scored higher than GPT-5.4 Pro.

Case Study 1

Context: 195 Platinum is the only naturally occurring isotope of platinum that is considered spin-active with a spin number of I = 1/2. In the 1H NMR of platinum complexes, the 195Pt-1H satellite peaks are commonly observed due to the presence of the spin-active 195Pt nucleus.

Question: There are four complexes cis-[Pt(N₃)₂(pyridine)₂], trans-[Pt(pyridine)₂(N₃)₂], [Pt(NH₃)₂(N₃)₂] and [Pt(N₃)₂(pyridine)₂(OH)₂]. Among these four complexes, which complex is expected to have the sharpest and most defined 195Pt-1H satellite peak in 1H NMR obtained from a 600MHz NMR machine under the same temperature in the same solvent? Explain your reasoning in detail; include all necessary mathematical formulas and steps.

This chemistry problem concerns the ¹⁹⁵Pt–¹H satellite peaks commonly observed in platinum complexes. One must determine which among cis-[Pt(N₃)₂(pyridine)₂], trans-[Pt(pyridine)₂(N₃)₂], [Pt(NH₃)₂(N₃)₂], and [Pt(N₃)₂(pyridine)₂(OH)₂] would exhibit the sharpest and most defined ¹⁹⁵Pt–¹H satellite peak in ¹H NMR.

The solution begins by recognizing that the sharpness and resolution of satellite peaks are closely related to the relaxation behavior of the coupled ¹⁹⁵Pt nucleus, which the model answer attributes primarily to the chemical shift anisotropy (CSA) relaxation mechanism. From the principles and formula of CSA relaxation, one can deduce that terms such as the gyromagnetic ratio and the external magnetic field are identical across all four complexes, while the ¹⁹⁵Pt CSA tensor term Δσ², which depends on the symmetry of the electronic shielding environment, can be treated as the principal source of the difference. A more symmetric electronic shielding environment at the platinum center is generally associated with a smaller CSA magnitude, slower CSA-driven relaxation, and consequently narrower and better-resolved satellite peaks. Applying this logic to the given series, [Pt(N₃)₂(pyridine)₂(OH)₂], which is an octahedral Pt(IV) complex with the most symmetric Pt coordination environment among the candidates, is expected to show the sharpest ¹⁹⁵Pt–¹H satellite peaks in the ¹H NMR spectrum.
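For reference, the CSA relaxation formula invoked above is not reproduced in the benchmark excerpt. The block below gives the standard textbook form, which is presumably what the model answer relies on. It assumes an axially symmetric shielding tensor, isotropic tumbling with a single correlation time τ_c, and the extreme-narrowing limit; the labels A and B for two complexes are introduced here for illustration.

```latex
% Assumptions (not stated in the post): axially symmetric shielding tensor,
% isotropic tumbling with correlation time \tau_c, extreme narrowing
% (\omega_{\mathrm{Pt}} \tau_c \ll 1).
\begin{align}
  R_1^{\mathrm{CSA}} = \frac{1}{T_1^{\mathrm{CSA}}}
    &= \frac{2}{15}\,\gamma_{\mathrm{Pt}}^{2}\,B_0^{2}\,(\Delta\sigma)^{2}\,\tau_c \\
  R_2^{\mathrm{CSA}} = \frac{1}{T_2^{\mathrm{CSA}}}
    &= \frac{7}{45}\,\gamma_{\mathrm{Pt}}^{2}\,B_0^{2}\,(\Delta\sigma)^{2}\,\tau_c \\
  % Same nucleus and same field; \tau_c taken as comparable for complexes A and B.
  \frac{R_1^{\mathrm{CSA}}(\mathrm{A})}{R_1^{\mathrm{CSA}}(\mathrm{B})}
    &\approx \left(\frac{\Delta\sigma_{\mathrm{A}}}{\Delta\sigma_{\mathrm{B}}}\right)^{2}
\end{align}
```

With γ_Pt and B₀ fixed by the nucleus and the spectrometer, and τ_c taken as comparable under the same temperature and solvent, the relaxation rate scales with the square of the shielding anisotropy. Faster ¹⁹⁵Pt relaxation shortens the lifetime of the platinum spin states and broadens the satellites, so the complex with the smallest Δσ should show the sharpest ones.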

This is a challenging, research-oriented, graduate-level problem, since a complete solution requires integrating knowledge of the CSA relaxation mechanism, complex geometry, and their relationship to NMR satellite-peak resolution. Such multilayered reasoning is relevant to practical research contexts, where the structural verification of newly synthesized compounds routinely relies on NMR spectroscopy, and where anomalous peak patterns may need to be interpreted through this kind of mechanistic understanding.

GPT-5.4 Pro received 0.5 points. Analyzing its response, we found that the primary source of error was a failure to account for the role of the ¹⁹⁵Pt nucleus: the model instead approached the problem as one of optimizing proton observation in a typical ¹H NMR context. While this framing may be appropriate for routine proton NMR analysis, it neglects the influence of a heavy spin-active nucleus such as ¹⁹⁵Pt on the observed satellite peak characteristics, particularly through CSA-driven relaxation. This omission leaves the response unable to capture the central mechanism emphasized in the rubric and renders the analysis inadequate for this class of problem.

Spacer 1.0, by contrast, achieved a near-perfect score of 9 points. The model correctly classified the geometries of all four complexes, accurately identified CSA relaxation of the ¹⁹⁵Pt nucleus as the principal mechanism underlying the satellite-peak sharpness and resolution, and completed the full logical chain — from the high symmetry of the octahedral Pt(IV) complex, through the reduction of the CSA contribution, to the consequent decrease in peak broadening — satisfying the majority of the rubric criteria.

Case Study 2

Context: The misfolding and aggregation of specific proteins are central to the pathology of many neurodegenerative diseases. While these proteins are often soluble and functional under normal physiological conditions, cellular stress can trigger their accumulation into aberrant assemblies. Biomolecular condensates, formed through liquid-liquid phase separation (LLPS), are increasingly recognized as important cellular compartments, but their precise role in the initiation of pathological protein aggregation remains an area of intense investigation. Understanding the specific microenvironment within these condensates and the molecular events that tip the balance from dynamic, functional assemblies to irreversible, pathological aggregates is critical for developing therapeutic strategies.

Question: The protein TDP-43, implicated in ALS and FTD, is known to partition into stress granules (SGs).

1. Describe the proposed molecular pathway by which TDP-43 transitions to a pathological aggregate within SGs, highlighting the significance of intra-condensate demixing.

2. Consider a motor neuron experiencing chronic oxidative stress (inducing SGs with TDP-43 recruitment). This neuron simultaneously exhibits two specific molecular alterations:

- A genetic variant conferring significantly increased structural stability to TDP-43’s RRM1 domain.

- A sustained five-fold elevation in the intracellular concentration of short-chain (C6-C10) saturated fatty acids.

Analyze and predict the combined consequence of these alterations on TDP-43’s aggregation propensity within SGs, compared to the wild-type scenario. Provide a mechanistic justification for your prediction.

This biology problem addresses the aggregation of the TDP-43 protein within stress granules (SGs) and is structured around two related tasks: first, to delineate the molecular pathway by which TDP-43 transitions from an SG-resident protein into pathological aggregates, and second, to predict the combined consequence of two specified molecular alterations within a hypothetical scenario.

A complete solution proceeds as follows. Following the recruitment of TDP-43 into SGs, two distinct events must occur concurrently to initiate intra-condensate demixing: local up-concentration of TDP-43 within the condensate and the presence of oxidative stress. At the molecular level, the β4–β5 region of RRM1 undergoes partial unfolding under oxidative conditions, exposing the C173 and C175 cysteine residues; these then form intermolecular disulfide bonds. Simultaneously, homotypic α-helical interactions mediated by the hydrophobic patch (HP) of the C-terminal domain (CTD) act in concert with the disulfide bond formation to reinforce phase separation. Once demixing has occurred, the α-helices within the HP region convert into cross-β-sheet structures, driving an irreversible liquid-to-solid transition that culminates in pathological aggregation. In the hypothetical motor neuron scenario, hyper-stabilization of RRM1 would markedly reduce local unfolding and cysteine exposure, thereby substantially decreasing the likelihood or kinetics of intermolecular disulfide-bond formation. Although elevated fatty acids may enhance HP-mediated interactions, the loss of disulfide–HP synergy is sufficient to attenuate the demixing pathway, leading to an overall reduction in pathological aggregation.

The problem demands a precise understanding of intra-condensate demixing, a relatively recently established concept, and of its underlying molecular mechanism. A correct solution requires not merely the recognition that TDP-43 aggregates within SGs, but the reconstruction of a multi-step causal chain: RRM1 partial unfolding → cysteine exposure → disulfide bond formation → synergistic reinforcement by HP interactions. The hypothetical scenario further requires deductive reasoning about the system-level consequences when a specific step in this chain is blocked. This form of reasoning, predicting how a complex molecular system responds to targeted perturbation or altered cellular conditions, is foundational to experimental design and data interpretation in biological research.

GPT-5.4 Pro received 3 points. Although the model recognized the conceptual relevance of intra-condensate demixing and made a qualitatively correct prediction of reduced pathological aggregation, its response lacked the molecular-level detail required by the rubric. In particular, it did not identify the core molecular events underlying demixing, including partial unfolding of RRM1, exposure of C173/C175, and intermolecular disulfide-bond formation. As a result, the response could not adequately justify the specific role of oxidative stress, the effect of RRM1 hyper-stabilization on this cysteine/disulfide bonding pathway, or the loss of disulfide–HP synergy as the causal basis for reduced aggregation. Moreover, the model incorrectly interpreted fatty acids as chemical chaperones and treated both molecular alterations as acting synergistically to reduce aggregation, whereas the expected reasoning requires recognizing that elevated fatty acids may enhance HP-mediated interactions but cannot compensate for the loss of RRM1-mediated disulfide bonding. Thus, while the response captured part of the high-level conceptual framing, it failed to provide the molecular explanation needed for a complete justification.

Spacer 1.0 achieved a perfect score of 10 points. The response reconstructed the relevant mechanistic pathway, accurately covering partial unfolding of the RRM1 β4–β5 region, exposure of C173/C175, intermolecular disulfide bond formation, HP-mediated homotypic α-helical interactions, and the synergistic relationship between these interaction modes. In the hypothetical scenario, the model correctly arrived at an overall reduction in pathological aggregation despite elevated fatty acids, and clearly justified this conclusion by identifying the loss of disulfide–HP synergy as the key determining factor.

Both cases share a common structure: they require not only the recall of deep knowledge, but also the construction of an extended causal chain across a hierarchy of mechanistic phenomena.

Discussion

Spacer 1.0 scored 33.9% on the benchmark, more than three times the 10.0% scored by GLM-5.1, its primary base model, and rivaling current frontier models. This suggests that, in targeting scientific research capabilities, structural design is as important as the model weights or training data. By introducing active agentic workflow orchestration, we enabled the system to construct significantly more complex and multifaceted logic beyond the base model’s capabilities, even pinpointing some highly domain-specific knowledge and methodologies that GPT-5.4 Pro could not find, as shown in the case studies.

Of course, a single benchmark result does not completely represent the scientific research capabilities of a model. Still, we consider the results here a concrete example of how we can guide a system to achieve the reasoning depth required in a specific field of scientific research, while not relying only on the general reasoning capabilities of the base models. This is the direction we intend to keep developing and measuring. Beyond benchmark performance, we remain committed to demonstrating the practical utility of our systems in real-world research — not merely as a system that scores well on standardized evaluations, but as one that delivers meaningful contributions to science.