Something is quietly wrong with how we train neural networks to solve physics equations. Not wrong in a way that triggers alarm bells. Wrong in a way that produces plausible-looking outputs, low training losses, smooth curves, and still gets published.
The field is called scientific machine learning. The core promise is that neural networks can solve partial differential equations: the equations that govern heat flow, fluid dynamics, structural stress, electromagnetic fields, and a dozen other physical systems that classical numerical solvers struggle with at scale. Over the last five years, papers have demonstrated this across an impressive range of problems, and the results are often genuinely good.
But there’s a specific thing almost all of these methods handle badly, and it’s the kind of thing that’s hard to notice unless you’ve spent time with classical numerical analysis. They treat boundary conditions as an optimization target. And boundary conditions are not optimization targets. They’re constraints. The difference matters in ways that the training loss will never show you.
What neural PDE solvers actually do
The standard setup, introduced by Raissi, Perdikaris, and Karniadakis in 2019, works like this. You have a PDE of the form $\mathcal{N}[u](x) = f(x)$ over some domain $\Omega$, with boundary conditions $\mathcal{B}[u](x) = g(x)$ on the boundary $\partial\Omega$. A neural network $u_\theta$ parameterized by weights $\theta$ approximates the solution. You train it to minimize:

$$
\mathcal{L}(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left|\mathcal{N}[u_\theta](x_i) - f(x_i)\right|^2 \;+\; \lambda\,\frac{1}{N_b}\sum_{j=1}^{N_b}\left|\mathcal{B}[u_\theta](x_j) - g(x_j)\right|^2
$$
Random points $\{x_i\}_{i=1}^{N_r}$ sample the interior, $\{x_j\}_{j=1}^{N_b}$ sample the boundary. Automatic differentiation computes whatever derivatives the operator $\mathcal{N}$ requires. The whole setup is mesh-free, flexible, and can in principle handle geometries that would make a finite-element engineer sweat.
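To make the structure concrete, here is a minimal sketch of the penalty formulation in PyTorch, for the toy problem $-u'' = f$ on $(0,1)$ with $u(0) = u(1) = 0$. Every name here (`u_net`, `lam`, the network size, the source term) is an illustrative choice, not something the original papers prescribe:

```python
import torch

# Minimal penalty-form PINN for -u'' = f on (0, 1) with u(0) = u(1) = 0.
u_net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
f = lambda x: torch.ones_like(x)  # example source term
lam = 100.0                       # the BC penalty weight: the knob this essay is about

def loss(n_interior=128):
    x = torch.rand(n_interior, 1, requires_grad=True)  # interior collocation points
    u = u_net(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    pde_term = (-d2u - f(x)).pow(2).mean()   # interior PDE residual
    xb = torch.tensor([[0.0], [1.0]])        # the two boundary points
    bc_term = u_net(xb).pow(2).mean()        # boundary residual, merely penalized
    return pde_term + lam * bc_term

opt = torch.optim.Adam(u_net.parameters(), lr=1e-3)
for step in range(5000):
    opt.zero_grad()
    loss().backward()
    opt.step()
```

Nothing in this training loop obliges the optimizer to drive `bc_term` to zero; it only has to trade it off against `pde_term` at the exchange rate `lam`.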
The paper demonstrated this on Burgers’ equation, the Schrödinger equation, and Navier-Stokes in idealized geometries. The framework was clean enough that an entire research community immediately started extending it. There are now variational versions, domain-decomposed versions, operator-learning versions. Fourier Neural Operators learn solution mappings in spectral space and can generalize across entire PDE families. DeepONet learns to map input functions to output solution fields using two sub-networks: a branch net that encodes the input function and a trunk net that encodes the evaluation coordinates. Physics-informed variants of both exist.
All of these methods share one structural habit: boundary conditions are handled as weighted penalty terms in a loss function.
Why that’s the wrong framing
Here’s something most PINN tutorials skip past:
The PDE alone does not have a unique solution.
Take the simplest possible case: $u''(x) = 0$ on $(0, 1)$. Every linear function $u(x) = ax + b$ satisfies this equation, for any $a$ and $b$. Infinitely many solutions. Add the boundary conditions $u(0) = 0$ and $u(1) = 1$, and now there’s exactly one: $u(x) = x$.
This isn’t a technicality. It’s the entire structure of boundary value problems. Well-posedness (existence, uniqueness, stability) requires the PDE and its boundary conditions together. The Lax-Milgram theorem guarantees unique weak solutions in Sobolev spaces whose very definitions encode the boundary conditions (homogeneous Dirichlet data lives in $H_0^1(\Omega)$, for instance). The Hadamard conditions for well-posedness cannot be checked from the PDE in isolation.
In classical finite element methods, this is treated as non-negotiable. Dirichlet boundary conditions are imposed by modifying the global stiffness matrix. Certain degrees of freedom are pinned to the prescribed values, and the system is solved with no mechanism for violating them. The numerical problem lives in the same function space as the exact solution, with boundary conditions baked in from the start.
Neural PDE methods abandoned this almost entirely when they adopted the penalty loss formulation.
The specific way it fails
The penalty formulation solves a different problem than the one you wrote down. What you want is:

$$
\min_\theta \; \frac{1}{N_r}\sum_{i=1}^{N_r}\left|\mathcal{N}[u_\theta](x_i) - f(x_i)\right|^2 \quad \text{subject to} \quad \mathcal{B}[u_\theta] = g \;\text{ on } \partial\Omega
$$

What actually gets optimized is:

$$
\min_\theta \; \frac{1}{N_r}\sum_{i=1}^{N_r}\left|\mathcal{N}[u_\theta](x_i) - f(x_i)\right|^2 + \lambda\,\frac{1}{N_b}\sum_{j=1}^{N_b}\left|\mathcal{B}[u_\theta](x_j) - g(x_j)\right|^2
$$

These are only equivalent in the limit $\lambda \to \infty$, which is numerically degenerate. For any finite $\lambda$, the network finds a solution to a regularized surrogate, not to the original problem.
The practical consequence is a tuning problem with no good answer. Wang et al. measured the gradient magnitudes of the PDE residual and boundary residual terms during training and found they can differ by several orders of magnitude. The optimizer is spending most of its effort on whichever term happens to produce larger gradients, while the other term floats. They proposed NTK-based adaptive weighting to dynamically rebalance these terms during training. It helped. It didn’t solve the problem. And it added its own tuning overhead.
The deeper issue is that the two terms are not naturally commensurable. The PDE residual involves second-order differential operators evaluated pointwise in the interior. The BC residual involves function values (or first derivatives, for Neumann conditions) at the boundary. These quantities have different units, different scales, different sensitivities to network outputs. Collapsing them into a scalar sum with a single weight is throwing away all of that structure.
There’s a thread on the NeuralPDE.jl forum I keep thinking about. A user trained a PINN on a coupled PDE system. The training loss dropped cleanly. Residuals looked good. When they plotted the solution, one boundary condition was simply not satisfied. The network had found a function that satisfied the PDE in the interior and accidentally matched one boundary value, because that equation forced a linear form that happened to agree with the data at one point. The other boundary condition was treated as negotiable. The optimizer found a lower total loss by partially ignoring it.
Low training loss, wrong answer. No warning.
The fix, and why it’s not being used
The alternative is to satisfy boundary conditions by architecture, not by loss. The idea is to construct a modified network output:

$$
\hat{u}_\theta(x) = G(x) + \varphi(x)\,N_\theta(x)
$$

where $\varphi$ vanishes identically on $\partial\Omega$ (a distance function), $N_\theta$ is the raw network output, and $G$ is any function satisfying $G = g$ on $\partial\Omega$.

Because $\varphi = 0$ on the boundary, the network term contributes nothing there. The entire boundary contribution comes from $G$, which satisfies the BC by construction. So $\hat{u}_\theta$ satisfies the boundary condition identically, for any $\theta$, regardless of what the network outputs. The loss simplifies to:

$$
\mathcal{L}(\theta) = \frac{1}{N_r}\sum_{i=1}^{N_r}\left|\mathcal{N}[\hat{u}_\theta](x_i) - f(x_i)\right|^2
$$
No boundary term. No weight to tune. No tradeoff.
For simple domains (an interval, a disk, a box), $\varphi$ is trivial. For the interval $[0,1]$ with Dirichlet conditions at both endpoints, $\varphi(x) = x(1-x)$ works exactly. For arbitrary geometries, Sukumar and Srivastava developed a systematic approach using R-function theory: algebraic functions that encode geometric predicates using smooth, differentiable formulas, producing a differentiable distance field for essentially any domain described by its boundary curves. For inhomogeneous boundary data, the extension term $G$ does the rest: any smooth function matching $g$ on $\partial\Omega$ works, and for homogeneous data you can take $G = 0$, so the ansatz reduces to $\hat{u}_\theta = \varphi\,N_\theta$.
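Here is the same toy problem with hard enforcement, under the same illustrative assumptions as the earlier sketch, now with boundary data $u(0) = a$, $u(1) = b$:

```python
import torch

# Hard Dirichlet enforcement on (0, 1) with u(0) = a, u(1) = b.
raw_net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
a, b = 0.0, 1.0
f = lambda x: torch.ones_like(x)

# In 2D/3D, distance-like phi's compose via R-functions, e.g. the R0-conjunction
# phi1 + phi2 - sqrt(phi1**2 + phi2**2), the route the Sukumar-Srivastava
# construction takes for general geometries.
def u_hat(x):
    phi = x * (1.0 - x)          # vanishes at both endpoints
    G = a + (b - a) * x          # smooth extension of the boundary data
    return G + phi * raw_net(x)  # satisfies the BCs for ANY network weights

def loss(n_interior=128):
    x = torch.rand(n_interior, 1, requires_grad=True)
    u = u_hat(x)
    du = torch.autograd.grad(u, x, torch.ones_like(u), create_graph=True)[0]
    d2u = torch.autograd.grad(du, x, torch.ones_like(du), create_graph=True)[0]
    return (-d2u - f(x)).pow(2).mean()  # interior residual only: no BC term, no weight
```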
Berrone et al. (2023) compared this approach against the standard penalty method, Nitsche’s variational formulation, and a hybrid on Poisson and diffusion-reaction equations with varying difficulty. The distance-function method produced lower errors, trained faster, and required no hyperparameter tuning for the BC weight. The penalty method required extensive tuning and still underperformed. Nitsche’s method, despite its theoretical elegance, fell between the two and added implementation complexity without commensurate accuracy gains.
So why isn’t everyone doing this?
Partly because computing a differentiable distance function for an arbitrary 3D geometry is nontrivial, even though the theory exists. Partly because most PINN tutorials and codebases default to penalties, and defaults are sticky. DeepXDE, the most widely used PINN library, supports hard enforcement but recommends penalties in its introductory notebooks. Most people learning PINNs learn the penalty approach first and never look further.
Dirichlet conditions are the easy case
The method above handles Dirichlet conditions (prescribed values at the boundary). For Neumann conditions (prescribed normal derivatives) and Robin conditions (a linear combination of both), the situation gets harder and is, honestly, not well-solved.
For Neumann BCs in a variational setting, the weak form does the work for you. For the Poisson equation $-\Delta u = f$ with Neumann data $\partial u/\partial n = h$ on $\partial\Omega$, integration by parts gives:

$$
\int_\Omega \nabla u \cdot \nabla v \, dx \;=\; \int_\Omega f\,v \, dx \;+\; \int_{\partial\Omega} h\,v \, ds \quad \text{for all test functions } v,
$$

so the Neumann condition appears automatically as a boundary integral on the right-hand side. If you’re using variational PINNs or the Deep Ritz method, Neumann conditions come for free. That’s a real advantage of the variational approach, and it’s underappreciated.
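To see why, write down the energy the Deep Ritz method minimizes for this problem (standard, but worth stating):

$$
E(u) = \int_\Omega \left( \tfrac{1}{2}\,|\nabla u|^2 - f\,u \right) dx \;-\; \int_{\partial\Omega} h\,u \, ds.
$$

Minimizing $E$ over unconstrained candidate functions reproduces the weak form above, so $\partial u/\partial n = h$ is the natural boundary condition of the variational problem: any minimizer satisfies it without it ever being imposed.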
For strong-form PINNs, there’s no free lunch. You compute the normal derivative explicitly (doable via automatic differentiation, but more expensive) and penalize the mismatch. The same tuning problems apply, and the normal derivative is harder to represent accurately near boundaries than the function value.
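The autodiff part is short. A minimal 2D sketch, where the boundary points `xb`, outward unit normals `nb`, and prescribed data `h` are assumed to be supplied by the caller:

```python
import torch

def neumann_residual(u_net, xb, nb, h):
    """Mean-square mismatch of du/dn = h at boundary points.

    u_net: network mapping (N, 2) -> (N, 1); xb, nb: (N, 2) tensors of
    boundary points and outward unit normals; h: (N, 1) prescribed data.
    """
    xb = xb.clone().requires_grad_(True)
    u = u_net(xb)
    grad_u = torch.autograd.grad(u, xb, torch.ones_like(u), create_graph=True)[0]
    du_dn = (grad_u * nb).sum(dim=1, keepdim=True)  # directional derivative along n
    return (du_dn - h).pow(2).mean()
```

This residual then joins the loss as one more weighted term, with all the tuning baggage that entails.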
Hard enforcement of Neumann conditions is significantly more complex. You cannot simply multiply by a distance function, because the BC involves the derivative of the solution, not its value. Göschel et al. (2025) address this for physics-informed neural operators, developing specialized solution structures that satisfy piecewise Neumann and Robin conditions by exploiting the local geometry of boundary segments. Their methods outperform penalty enforcement on 3D Darcy flow and Navier-Stokes. But they require careful construction and are not yet available as off-the-shelf toolbox features. There’s no clean recipe here yet.
Robin BCs ($\alpha u + \beta\,\partial u/\partial n = g$) inherit the worst of both worlds. The best current approach seems to be hybrid enforcement: handle the Dirichlet-like part via distance functions, weakly penalize the Neumann-like part. Göschel et al. show this outperforms fully weak approaches while being more tractable than purely hard enforcement. It’s a patch more than a solution.
What the error theory confirms
Müller and Zeinhofer (2022) derived error bounds for the Deep Ritz method on second-order elliptic problems. Their analysis decomposes the total error into three parts: an approximation error (how well the network class can represent the true solution), an optimization error (how far training falls short of the best network in the class), and a boundary penalty error controlled by the weight $\lambda$. Schematically:

$$
\|u_\theta - u^*\| \;\lesssim\; \varepsilon_{\text{approx}} + \varepsilon_{\text{opt}} + \varepsilon_{\text{bc}}(\lambda).
$$

The third term is the formal statement of the tuning dilemma. Larger $\lambda$ reduces BC error but makes optimization harder. With hard enforcement, the third term vanishes entirely, and the total error depends only on how expressive the network is and how well training minimizes the energy.
Lorenza et al. (2026) derived generalization bounds for PINNs on semilinear wave equations and showed that the BC loss value at termination directly controls the error bound. If the boundary loss isn’t close to zero, the bound degrades even when the interior residual is small. You cannot trade BC accuracy for interior accuracy and call it convergence.
I’m less sure about how well these theoretical results transfer to the messy cases: non-smooth domains, mixed boundary conditions, sharp boundary layers. The analyses assume smooth domains and specific PDE types. The community is working to close those gaps, but they remain wide enough that practitioners should treat the theoretical results as motivation rather than guarantees.
The implicit bias problem
Neural networks have a well-documented bias toward smooth, low-frequency solutions. During gradient-based training, networks fit low-frequency components of the target function first. This is useful when your solution is smooth, which many interior PDE solutions are.
Boundary conditions are often where the smooth assumption breaks down. A no-slip wall in a high-Reynolds-number flow. A clamped edge in a plate bending problem. A heated surface adjacent to an insulated one. These create sharp variations that the network’s frequency bias naturally deprioritizes.
Even when the BC is correctly specified in the loss, the frequency principle means the network will focus on smooth interior behavior first. The boundary loss may decrease inconsistently, not because the formulation is wrong but because the network architecture is fighting against it. Hard enforcement sidesteps this entirely. BC satisfaction is guaranteed before training begins, and the frequency bias then affects only the interior solution, where it’s less harmful.
What actually works in practice
If your problem has Dirichlet BCs and you can compute a distance function for your domain, use hard enforcement. It’s not much harder to implement than penalty methods once you have the infrastructure, and every systematic comparison that has been run shows it wins. The PINNacle benchmark (10 PINN variants on 22 PDE tasks) and PDEBench both provide the reference solutions you’d need to verify this on your own problems.
If your problem has Neumann BCs, consider whether a variational formulation (Deep Ritz, VPINNs) is feasible. The weak form handles Neumann conditions for free, which is worth a lot. The limitation is that variational methods only apply to PDEs with an energy form, which rules out most time-dependent and hyperbolic problems.
If you’re stuck with penalty enforcement, monitor boundary residuals separately from your training loss. Not the weighted boundary loss. The actual pointwise error at the boundary, in physical units. A low total loss with a high boundary residual means your model has converged to something, but not to the problem you specified.
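That check costs a few lines. A sketch of the diagnostic worth logging next to the loss, with illustrative names (`xb` boundary points, `g` the prescribed Dirichlet values):

```python
import torch

@torch.no_grad()
def boundary_sup_error(u_net, xb, g):
    """Worst-case pointwise |u - g| over boundary points, in physical units.

    Deliberately unweighted and separate from the training loss.
    """
    return (u_net(xb) - g).abs().max().item()
```

If this number plateaus while the training loss keeps falling, the optimizer has found the tradeoff described above, and it is not in your favor.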
The discipline that got lost
We built neural PDE solvers because classical methods have real limitations. Finite element methods scale badly with dimension. Mesh generation for complex geometries is expensive. Parametric PDEs requiring thousands of solutions make grid-based methods prohibitively slow. These are real problems.
But classical numerical analysis had a discipline that came with it: boundary conditions were structural constraints, not optimization targets. Dirichlet conditions defined the function space the solution lived in. The numerical method was designed so that the approximation couldn’t violate them. This wasn’t a design choice. It was forced by the mathematics of well-posedness.
When PINNs replaced this structure with a weighted sum, they made the method easier to implement and harder to trust. A penalty term can always be balanced out by the other terms in the loss. The optimizer has no obligation to satisfy the constraint, only to weigh it against competing objectives, at a tradeoff determined by a scalar weight with no physical interpretation.
The PDE community spent decades building rigor around this. Well-posedness. Lax-Milgram. Sobolev trace theorems. Inf-sup conditions. All of it exists to ensure that the mathematical problem you’re solving has a unique, stable, physically meaningful answer. Neural PDE solvers don’t inherit any of this automatically. You have to build it back in.
Most people don’t. The defaults are wrong, and the training loss won’t tell you.
The boundary condition isn’t a footnote to the PDE. It’s the half that makes the problem have an answer.