Prompting has evolved - GPT-5 proves it

GPT-5 makes one thing clear - prompting is not universal. To maintain quality and trust, prompts must evolve with the model and be rigorously evaluated along the way.

Since its release, GPT-5 has generated wide discussion across the developer and research communities. Some describe it as weaker or less engaging than GPT-4o. Others point to shifts in reasoning style and alignment. But beneath these differing views lies a deeper truth: the challenge is not just with the model itself but with how we design and evaluate prompts.

Why GPT-5 feels different

GPT-5 is more than an incremental upgrade. It introduces changes in reasoning depth, safety alignment and adaptive behaviour. In practice, its auto-mode acts like a dynamic router, selecting between instant, thinking and pro models depending on the prompt it’s given.

The implication is clear. Prompts optimised for GPT-4o may no longer perform as expected. Left unchanged, they can produce outputs that feel shallow or generic. To unlock GPT-5’s full capability – from deep analysis to sustained reasoning or creative problem-solving – prompts must be explicit about intent and effort.
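As a concrete illustration, the sketch below makes intent and reasoning effort explicit when calling GPT-5 through the OpenAI Python SDK. The model name, the reasoning effort setting and the wording of the task are assumptions chosen for illustration, not a recommended configuration.

```python
# Minimal sketch: being explicit about intent and effort when targeting GPT-5.
# Assumes the OpenAI Python SDK's Responses API; treat the model name and
# parameters below as illustrative placeholders.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-5",                  # assumed model identifier
    reasoning={"effort": "high"},   # ask for deeper reasoning rather than relying on defaults
    input=(
        "You are reviewing a supplier contract.\n"
        "Goal: identify clauses that shift liability onto the buyer.\n"
        "Think through each clause before answering and cite the clause numbers you relied on."
    ),
)

print(response.output_text)
```

The specific parameters matter less than the habit they encourage: state what the model should do, how much effort it should spend, and what the output must contain, rather than hoping the defaults match your intent.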

Prompting is model-specific

This change underlines a broader truth: prompting is not one-size-fits-all. Just as teams would not use the same schema for a relational database and a key-value store, they cannot expect a prompt built for one model to deliver the same results on another.

Each frontier model – GPT-4o, GPT-5, Claude 4 Sonnet – has different assumptions about verbosity, reasoning depth, context handling and tone. Migration requires reframing instructions, adapting reasoning requests and re-tuning formatting to suit the target model.
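One lightweight way to manage this is to keep a prompt variant per target model rather than one shared template. The sketch below is hypothetical; the model identifiers and instruction wording are illustrative only.

```python
# Hypothetical per-model prompt registry: the same task, phrased to suit each model.
# Model identifiers and instruction wording are illustrative only.
SUMMARISE_PROMPTS = {
    "gpt-4o": (
        "Summarise the document below in five bullet points. Be concise."
    ),
    "gpt-5": (
        "Summarise the document below in five bullet points.\n"
        "Reason carefully about which details matter before writing; "
        "prefer depth over breadth and avoid generic statements."
    ),
    "claude-sonnet-4": (
        "You are a careful analyst. Read the document below, then produce "
        "exactly five bullet points capturing its key claims."
    ),
}

def build_prompt(model: str, document: str) -> str:
    """Select the instruction variant tuned for the target model."""
    instructions = SUMMARISE_PROMPTS[model]
    return f"{instructions}\n\n---\n{document}"
```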

Fortunately, model providers are starting to reflect this reality. OpenAI’s GPT-4/5 Prompting Guide and Anthropic’s Claude 4 Prompting Best Practices are becoming essential references for organisations building LLM-enabled workflows.

Both also provide evaluation tools – OpenAI’s Prompt Evaluator (GPT-5) and Anthropic’s Eval Tool – that help teams systematically test and validate their prompts to ensure these best practices are being met.

Why prompt evaluations matter more than ever

As models evolve, intuition alone is no longer enough to ensure prompts remain effective. At Instil, we have already seen this in practice. For a client developing an Intelligent Document Processing proof-of-concept, we built prompt evaluation suites directly into their CI/CD pipeline.

These suites use an LLM-as-judge approach where a secondary model verifies behaviour across use cases and flags subtle regressions before changes reach production. This gives us confidence that prompts still deliver the right outcomes even as we iterate quickly.
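To make that concrete, here is a minimal sketch of the kind of LLM-as-judge check that could run as a test in a pipeline. The judge model, rubric, fixture and pass criterion are illustrative assumptions, not our client implementation.

```python
# Minimal LLM-as-judge regression check, written as a pytest-style test.
# The judge model name, rubric and pass criterion are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You are evaluating the output of a document-processing prompt.\n"
    "Given the source text and the extracted fields, reply in JSON as\n"
    '{"pass": true or false, "reason": "..."}.\n'
    "Pass only if every field (invoice number, date, total) is present and matches the source."
)

def judge(source: str, candidate_output: str) -> dict:
    """Ask a secondary model to verify the behaviour of the prompt under test."""
    result = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": f"SOURCE:\n{source}\n\nEXTRACTION:\n{candidate_output}"},
        ],
    )
    return json.loads(result.choices[0].message.content)

def test_invoice_extraction_has_not_regressed():
    source = "Invoice INV-1042, dated 2024-03-01, total £1,250.00 ..."
    # In CI this output would come from running the prompt under test against a stored fixture.
    candidate = '{"invoice_number": "INV-1042", "date": "2024-03-01", "total": "£1,250.00"}'
    verdict = judge(source, candidate)
    assert verdict["pass"], f"Prompt regression: {verdict['reason']}"
```

Wired into the pipeline, a failing judge verdict blocks the change, so subtle regressions surface before they reach production.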

Crucially, these evaluations aren’t just for minor tweaks. They become indispensable during bigger shifts, such as migrating from one model generation to another, where regressions are otherwise difficult to detect until it’s too late.

Looking ahead

Prompting has always required adaptation. Even with models like o3 and GPT-4o, you couldn't rely on a single universal style. With GPT-5, the shift is more pronounced still: what works today may fail tomorrow, and intuition alone is no longer sufficient. Continuous benchmarking and rigorous evaluation are now essential to keep pace with the speed of change.

For organisations building with LLMs, the next step is straightforward but urgent: invest in structured prompt evaluation and make it part of your software pipeline. Those that do will safeguard reliability, ease the pain of migrations such as GPT-4o to GPT-5, and accelerate adoption while staying ahead as the LLM landscape evolves.

Key Takeaways

  • Prompts aren’t transferable – a single “universal prompt” won’t work across models. Each new frontier model (GPT-4o, GPT-5, Claude, etc.) demands its own approach.

  • Follow the model’s guidance – use the official prompting best practices (OpenAI, Anthropic, etc.) as your baseline when architecting instructions.

  • Document and evaluate – treat prompts like code: document intended functionality and continuously test them with evaluation suites. These act as regression tests, easing migrations and reducing risk.

Niall Keys

Software Engineer