Hypothetical scenarios are a popular alternative to field experiments among scholars interested in nudging behavior change, and they make up a substantial proportion of such studies in the domains of finance, transportation, and sustainability. Yet their validity as proxies for real-world contexts remains unclear. To investigate, we designed four styles of hypothetical scenarios to approximate five recent field studies of nudges in distinct domains, running a total of 20 pre-registered experiments (N=16,071, n>200 per cell). This design allows direct comparison of the original field data with new hypothetical data. We find that hypothetical outcomes are consistently biased upwards: participants engage more in target behaviors, by a median factor of 3.81, than participants in the original field experiments. Estimates of treatment effects, by contrast, are unpredictable: sometimes larger, sometimes smaller, sometimes calibrated. Further, none of our four hypothetical designs reliably reduced estimation error. Without a gold-standard approach to constructing hypothetical scenarios, behavioral researchers and practitioners should exercise caution when employing this low-cost but unreliable tool to evaluate nudge interventions.