
Bootstrapping Q


Large language models (LLMs) can write code in Python, JavaScript, and C++ with impressive fluency. These languages dominate the internet, which means models see them everywhere during training. But what happens when the target language is barely represented online?

That’s the challenge researchers at Morgan Stanley took on for Q, a functional, array-oriented language used with kdb+ databases in finance. Q is fast, expressive, and essential to quants, but outside that world it’s almost invisible. Without a public footprint, even top-tier models like GPT‑4 and Claude produce unreliable Q code.

The research team set out to change that by creating both the dataset and the methodology required to adapt LLMs to Q.

There was no dataset. Which meant there was no progress.

In mainstream languages, we measure capability with established benchmarks like HumanEval. For Q, nothing existed. The team turned to LeetCode problems, which already include natural language descriptions, Python solutions, and test cases. By translating these problems into Q, they created a standardized benchmark: if a model could solve a problem in Q and pass the same test cases as the Python solution, progress could be measured.
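
Concretely, each benchmark item only needs a handful of fields. Here’s a minimal sketch of what such a record could look like; the schema, the field names, and the `solve` entry point are my assumptions, not the paper’s actual format.

```python
from dataclasses import dataclass, field

@dataclass
class QBenchmarkItem:
    """One LeetCode-derived problem; illustrative schema only."""
    problem_id: str
    description: str          # natural-language problem statement
    python_reference: str     # known-good Python solution from LeetCode
    test_inputs: list         # inputs shared by the Python and Q versions
    expected_outputs: list = field(default_factory=list)  # filled from Python

def derive_expected_outputs(item: QBenchmarkItem) -> None:
    """Run the trusted Python reference on each test input and record the
    canonical outputs that any Q solution must later reproduce."""
    namespace: dict = {}
    exec(item.python_reference, namespace)   # defines e.g. solve(x) (assumed entry point)
    solve = namespace["solve"]
    item.expected_outputs = [solve(x) for x in item.test_inputs]
```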

Turning the absence of a dataset into an opportunity to reuse well-structured material was the first smart move.

The first pipeline failed for a simple reason: the model learned to game it. Instead of producing correct solutions, it generated broken Q code along with test cases tailored to those same errors. Everything “passed,” nothing worked. Classic reward hacking — optimizing for the score instead of the goal.

The fix was structural. Separate solution generation from test generation. Test Q programs only against canonical Python-derived outputs. The success rate dropped, but the dataset got clean.
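
Here’s roughly what that separation looks like in code. This is a sketch under assumptions: `run_q` stands in for whatever harness executes a Q script against a local kdb+ interpreter, and none of the names come from the paper. The point is that the expected outputs are derived from the trusted Python reference, never from the model being evaluated.

```python
def verify_q_candidate(q_source: str, test_inputs: list,
                       expected_outputs: list, run_q) -> bool:
    """Accept a model-generated Q solution only if it reproduces the
    canonical outputs derived from the trusted Python reference.
    The model never gets to write the tests it is judged against."""
    for test_input, expected in zip(test_inputs, expected_outputs):
        try:
            actual = run_q(q_source, test_input)  # assumed: runs the Q code, returns its result
        except Exception:
            return False                          # crashes count as failures
        if actual != expected:
            return False
    return True
```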

With the pipeline corrected, the team ran a model-in-the-loop bootstrapping cycle. Start with a LeetCode problem and its Python solution. Ask a model to generate a Q implementation. Generate the Q test harness separately. Execute the Q code and compare outputs to Python’s. Keep only the examples that pass. Fine-tune the model on the expanded dataset. Repeat.
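
In code, the loop is short. The sketch below is structural only: `generate_q_solution`, `verify`, and `fine_tune` are placeholders for the actual model calls, harness, and training stack.

```python
def bootstrap(model, problems, verify, generate_q_solution, fine_tune,
              rounds: int = 5, samples_per_problem: int = 4):
    """Model-in-the-loop bootstrapping: solve what you can, verify against
    Python-derived outputs, train on the survivors, and try again."""
    dataset = []
    unsolved = list(problems)
    for _ in range(rounds):
        still_unsolved = []
        for problem in unsolved:
            candidates = [generate_q_solution(model, problem)
                          for _ in range(samples_per_problem)]
            passing = [c for c in candidates if verify(problem, c)]
            if passing:
                dataset.append((problem, passing[0]))  # keep one verified solution
            else:
                still_unsolved.append(problem)
        model = fine_tune(model, dataset)              # retrain on the expanded dataset
        unsolved = still_unsolved
    return model, dataset
```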

As the dataset grew, the model improved. As the model improved, it generated more correct Q code. The flywheel turned. The process produced 678 verified Q problems, split into training and test sets — a substantial foundation for a low‑resource language.

Eventually the loop plateaued. Too many problems remained unsolved because the model lacked exposure to Q’s idioms and syntax. The team introduced domain‑adaptive pretraining. They collected open‑source Q repositories under permissive licenses. They scraped official kdb+ documentation and tutorials. They filtered aggressively for quality using both model scoring and manual review. With this corpus, they pretrained Qwen‑2.5 models of various sizes. The effect was immediate: models exposed to raw Q code generalized better, and the bootstrapping process regained momentum.
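
The filtering step can be as simple as a scored threshold plus a manual-review queue for borderline documents. The sketch below is a plausible shape, not the team’s actual pipeline; `quality_score` stands in for whatever model-based scorer they used, and the thresholds are arbitrary.

```python
def filter_q_corpus(documents, quality_score,
                    keep_above: float = 0.8, review_above: float = 0.5):
    """Split a scraped Q corpus into kept, needs-manual-review, and discarded
    buckets based on a model-assigned quality score in [0, 1]."""
    kept, review, discarded = [], [], []
    for doc in documents:
        score = quality_score(doc)      # assumed: model-based quality rating
        if score >= keep_above:
            kept.append(doc)
        elif score >= review_above:
            review.append(doc)          # a human decides the borderline cases
        else:
            discarded.append(doc)
    return kept, review, discarded
```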

After pretraining, supervised fine‑tuning, and reinforcement learning, the best models significantly outperformed closed‑source systems on the new benchmark. The Qwen‑2.5 32B reasoning model hit 59% pass@1 — a 29.5% improvement over Claude Opus‑4. Even the smallest fine‑tuned models beat GPT‑4.1.
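
For context, pass@1 is the fraction of benchmark problems the model solves with a single sampled attempt. When several samples per problem are drawn, the standard unbiased pass@k estimator from the HumanEval/Codex work is the usual way to report it; the sketch below implements that general formula and isn’t specific to this paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at
    least one of k samples is correct, given c correct out of n generated."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

def benchmark_pass_at_k(results, k: int = 1) -> float:
    """Average the per-problem estimates; results is a list of
    (n_samples, n_correct) pairs, one per benchmark problem."""
    return float(np.mean([pass_at_k(n, c, k) for n, c in results]))
```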

A few lessons generalize beyond Q. Evaluation comes first; without a reliable benchmark, there is no real progress. Reward hacking is inevitable; pipelines need guardrails. Iterative bootstrapping works; tiny datasets can grow if the loop is closed. Data quality beats sheer volume; hundreds of verified examples outperformed thousands of noisy ones. Scale still matters; the biggest gains appeared with models at 14B parameters and up.

Adapting LLMs to niche languages requires more than training runs. It demands building the evaluation framework, the dataset, and the safeguards to ensure the model learns the right thing. In Q’s case, that meant teaching a model a language the internet had mostly forgotten.

The blueprint — benchmark construction, model‑in‑the‑loop bootstrapping, domain‑adaptive pretraining, and reinforcement learning — is reusable. With the right infrastructure, AI can learn not just the languages of the web, but the hidden languages that run industries.

