#ITP

26 posts · 7 participants · 1 post today

Modelling and verifying neuronal archetypes in Coq. ~ Abdorrahim Bahrami, Rébecca Zucchini, Elisabetta De Maria, Amy Felty. arxiv.org/abs/2505.05362 #ITP #Coq #Rocq

arXiv.org: Modelling and Verifying Neuronal Archetypes in Coq

Formal verification has become increasingly important because of the kinds of guarantees that it can provide for software systems. Verification of models of biological and medical systems is a promising application of formal verification. Human neural networks have recently been emulated and studied as a biological system. In this paper, we provide a model of some crucial neuronal circuits, called "archetypes", in the Coq Proof Assistant and prove properties concerning their dynamic behavior. Understanding the behavior of these modules is crucial because they constitute the elementary building blocks of bigger neuronal circuits. We consider seven fundamental archetypes (simple series, series with multiple outputs, parallel composition, positive loop, negative loop, inhibition of a behavior, and contralateral inhibition), and prove an important representative property for six of them. In building up to our model of archetypes, we also provide a general model of "neuronal circuits", and prove a variety of general properties about neurons and circuits. In addition, we have defined our model with a longer term goal of modelling the composition of basic archetypes into larger networks, and structured our libraries with definitions and lemmas useful for proving the properties in this paper as well as those to be proved as future work.
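To give a flavor of what such a formalization involves, here is a minimal sketch of a hypothetical discrete-time leaky integrate-and-fire neuron, the kind of elementary unit such circuit models are built from. The paper's own development is a Coq library and is not reproduced here; this sketch is in Lean 4 (used for all examples in this feed), and every name in it (Neuron, weightedSum, step) is an illustrative assumption:

```lean
-- Hypothetical sketch only, not the paper's actual Coq development:
-- a discrete-time leaky integrate-and-fire neuron.
structure Neuron where
  threshold : Float        -- firing threshold
  leak      : Float        -- leak factor, assumed in [0, 1]
  weights   : List Float   -- synaptic weights, one per input

-- Weighted sum of one Boolean input vector (true = spike).
def weightedSum (ws : List Float) (ins : List Bool) : Float :=
  (ws.zip ins).foldl (fun acc (w, b) => if b then acc + w else acc) 0

-- One step of the membrane potential: leak the previous potential,
-- add the weighted inputs, and reset to 0 when the threshold is reached.
def step (n : Neuron) (p : Float) (ins : List Bool) : Float × Bool :=
  let p' := n.leak * p + weightedSum n.weights ins
  if n.threshold ≤ p' then (0, true) else (p', false)
```

Float is used here only to keep the sketch self-contained; a verification-oriented development would use exact arithmetic (e.g. rationals) so that properties of the dynamics can actually be proved.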

Beyond theorem proving: Formulation, framework and benchmark for formal problem-solving. ~ Qi Liu et al. arxiv.org/abs/2505.04528 #LLMs #ITP #LeanProver

arXiv.org: Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving

As a seemingly self-explanatory task, problem-solving has been a significant component of science and engineering. However, a general yet concrete formulation of problem-solving itself is missing. With the recent development of AI-based problem-solving agents, the demand for process-level verifiability is rapidly increasing yet underexplored. To fill these gaps, we present a principled formulation of problem-solving as a deterministic Markov decision process; a novel framework, FPS (Formal Problem-Solving), which utilizes existing FTP (formal theorem proving) environments to perform process-verified problem-solving; and D-FPS (Deductive FPS), decoupling solving and answer verification for better human-alignment. The expressiveness, soundness and completeness of the frameworks are proven. We construct three benchmarks on problem-solving: FormalMath500, a formalization of a subset of the MATH500 benchmark; MiniF2F-Solving and PutnamBench-Solving, adaptations of FTP benchmarks MiniF2F and PutnamBench. For faithful, interpretable, and human-aligned evaluation, we propose RPE (Restricted Propositional Equivalence), a symbolic approach to determine the correctness of answers by formal verification. We evaluate four prevalent FTP models and two prompting methods as baselines, solving at most 23.77% of FormalMath500, 27.47% of MiniF2F-Solving, and 0.31% of PutnamBench-Solving.
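As a rough illustration of the distinction the paper formalizes (a hypothetical example, not taken from the FPS benchmarks): in theorem proving, the statement to prove, answer included, is fixed in advance; in problem-solving, the solver must also produce the answer. In Lean 4, the latter can be phrased as exhibiting a witness for an existential and verifying it:

```lean
-- Theorem proving: the statement (including the answer 3) is given up front.
example : 1 + 2 = 3 := rfl

-- Problem solving: "which x satisfies x + 2 = 5?" The solver must both
-- produce the answer (the witness 3) and verify it (the proof `rfl`).
example : ∃ x : Nat, x + 2 = 5 := ⟨3, rfl⟩
```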

FormalMATH: Benchmarking formal mathematical reasoning of large language models. ~ Zhouliang Yu et al. arxiv.org/abs/2505.02735 #LLMs #Autoformalization #Math #ITP #LeanProver

arXiv.org: FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models

Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning from high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in the formal reasoning settings. We believe that FormalMATH provides a robust benchmark for benchmarking formal mathematical reasoning.
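The negation-based disproof filtering step can be pictured with a small hypothetical Lean 4 example (not drawn from the benchmark): when an autoformalization is unfaithful, say because a hypothesis was dropped, the resulting statement is often simply false, so an off-the-shelf prover can refute it by proving its negation, and the candidate is then discarded:

```lean
-- Hypothetical mis-formalization: suppose the informal problem was
-- "for every n ≥ 2, n * n > n", but autoformalization dropped "n ≥ 2".
-- The weakened statement is false, and a proof of its negation flags it.
theorem filteredOut : ¬ ∀ n : Nat, n * n > n := by
  intro h
  exact absurd (h 1) (by decide)  -- counterexample: 1 * 1 > 1 is false
```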