#autoformalization


LOGICPO: Efficient translation of NL-based logical problems to FOL using LLMs and preference optimization. ~ Koushik Viswanadha, Deepanway Ghosal, Somak Aditya. arxiv.org/abs/2506.18383v1 #LLMs #Math #Autoformalization

LOGICPO: Efficient Translation of NL-based Logical Problems to FOL using LLMs and Preference Optimization (arXiv.org)
Logical reasoning is a key task for artificial intelligence due to its role in major downstream tasks such as question answering and summarization. Recent methods for improving the reasoning ability of LLMs fall short in correctly converting a natural language reasoning problem into an equivalent logical formulation, which hinders the framework's overall ability to reason. Towards this, we propose finetuning on a preference optimization dataset to learn to parse and represent a natural language problem as a whole as a consistent logical program, by 1) introducing a new supervised and preference optimization dataset, LogicPO, and 2) adopting popular techniques such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) to finetune open-source LLMs. Our best model, based on Phi-3.5, consistently outperforms GPT-3.5-turbo (8-shot), producing 10% more logically correct formulations with 14% fewer syntax errors. Through the framework and our improved evaluation metrics, we offer a promising direction for improving the logical reasoning of LLMs by better representing problems in their logical formulations.
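For context, the DPO objective adopted here is the standard one (the general DPO loss, not anything LogicPO-specific): given a natural-language problem x, a preferred logical formulation y_w, and a dispreferred one y_l, the policy pi_theta is trained against a frozen reference pi_ref to minimize

    \mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]

where beta controls how far the policy may drift from the reference. In this setting, y_w would be a syntactically valid, semantically faithful FOL program and y_l a flawed parse of the same problem.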

FormalMATH: Benchmarking formal mathematical reasoning of large language models. ~ Zhouliang Yu et al. arxiv.org/abs/2505.02735 #LLMs #Autoformalization #Math #ITP #LeanProver

FormalMATH: Benchmarking Formal Mathematical Reasoning of Large Language Models (arXiv.org)
Formal mathematical reasoning remains a critical challenge for artificial intelligence, hindered by limitations of existing benchmarks in scope and scale. To address this, we present FormalMATH, a large-scale Lean4 benchmark comprising 5,560 formally verified problems spanning high-school Olympiad challenges to undergraduate-level theorems across diverse domains (e.g., algebra, applied mathematics, calculus, number theory, and discrete mathematics). To mitigate the inefficiency of manual formalization, we introduce a novel human-in-the-loop autoformalization pipeline that integrates: (1) specialized large language models (LLMs) for statement autoformalization, (2) multi-LLM semantic verification, and (3) negation-based disproof filtering strategies using off-the-shelf LLM-based provers. This approach reduces expert annotation costs by retaining 72.09% of statements before manual verification while ensuring fidelity to the original natural-language problems. Our evaluation of state-of-the-art LLM-based theorem provers reveals significant limitations: even the strongest models achieve only a 16.46% success rate under practical sampling budgets, exhibiting pronounced domain bias (e.g., excelling in algebra but failing in calculus) and over-reliance on simplified automation tactics. Notably, we identify a counterintuitive inverse relationship between natural-language solution guidance and proof success in chain-of-thought reasoning scenarios, suggesting that human-written informal reasoning introduces noise rather than clarity in formal reasoning settings. We believe that FormalMATH provides a robust benchmark for evaluating formal mathematical reasoning.
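To make the negation-based disproof filtering concrete, here is a minimal Lean 4 sketch (the statement is invented for illustration, not drawn from the benchmark, and assumes Mathlib for norm_num): a mis-formalized candidate is rejected because a prover can close its negation.

    import Mathlib.Tactic

    -- Faulty candidate formalization of "the square of an odd natural is odd"
    -- (the conclusion has the wrong parity):
    --   theorem candidate (n : ℕ) (h : n % 2 = 1) : n ^ 2 % 2 = 0 := sorry

    -- Disproof filter: the negation of the candidate is provable,
    -- so the candidate contradicts true mathematics and is discarded.
    example : ¬ ∀ n : ℕ, n % 2 = 1 → n ^ 2 % 2 = 0 := by
      intro h
      have h1 := h 1 (by norm_num)  -- 1 is odd ...
      norm_num at h1                -- ... yet 1 ^ 2 % 2 = 1, contradiction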

Diverse inference and verification for advanced reasoning. ~ Iddo Drori et al. arxiv.org/abs/2502.09955 #LLMs #ITP #LeanProver #Autoformalization

Diverse Inference and Verification for Advanced Reasoning (arXiv.org)
Reasoning LLMs such as OpenAI o1, o3, and DeepSeek R1 have made significant progress in mathematics and coding, yet still struggle with advanced tasks such as International Mathematical Olympiad (IMO) combinatorics problems, Abstraction and Reasoning Corpus (ARC) puzzles, and Humanity's Last Exam (HLE) questions. We use a diverse inference approach that combines multiple models and methods at test time. We find that verifying mathematics and code problems, and using rejection sampling on other problems, is simple and effective. We automatically verify the correctness of solutions to IMO problems with Lean, and of ARC puzzles with code, and find that best-of-N effectively answers HLE questions. Our approach increases answer accuracy on IMO combinatorics problems from 33.3% to 77.8% and on HLE questions from 8% to 37%, and solves 80% of ARC puzzles that 948 humans could not and 26.5% of ARC puzzles that o3 high compute does not. Test-time simulations, reinforcement learning, and meta-learning with inference feedback improve generalization by adapting agent graph representations and varying prompts, code, and datasets. Our approach is reliable, robust, and scalable, and in the spirit of reproducible research, we will make it publicly available upon publication.
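The test-time recipe is easy to sketch in code. Below is a hypothetical Python rendition of the "diverse generation plus verification" loop (generators and verify are placeholder callables, not the paper's actual interfaces):

    def best_of_n(problem, generators, verify, n=8):
        # Sample candidates from a rotating pool of diverse models/prompts.
        candidates = [generators[i % len(generators)](problem) for i in range(n)]
        # Rejection sampling: keep only answers the automatic verifier accepts,
        # e.g., a Lean checker for IMO proofs or unit tests for ARC programs.
        for candidate in candidates:
            if verify(problem, candidate):
                return candidate
        return None  # no candidate survived verification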

Formalizing complex mathematical statements with LLMs: A study on mathematical definitions. ~ Lan Zhang, Marco Valentino, Andre Freitas. arxiv.org/abs/2502.12065 #LLMs #Math #Autoformalization #ITP #IsabelleHOL

Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions (arXiv.org)
Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge the gap between informal mathematics and formal languages through autoformalization. However, it is still unclear how well LLMs generalize to sophisticated and naturally occurring mathematical statements. To address this gap, we investigate the task of autoformalizing real-world mathematical definitions, a critical component of mathematical discourse. Specifically, we introduce two novel resources for autoformalization, collecting definitions from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically evaluate a range of LLMs, analyzing their ability to formalize definitions into Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs' performance, including refinement through external feedback from proof assistants, and formal definition grounding, where we guide LLMs through relevant contextual elements from formal mathematical libraries. Our findings reveal that definitions present a greater challenge than existing benchmarks such as miniF2F. In particular, we found that LLMs still struggle with self-correction and with aligning to relevant mathematical libraries. At the same time, structured refinement methods and definition grounding strategies yield notable improvements of up to 16% in self-correction capabilities and 43% in the reduction of undefined errors, highlighting promising directions for enhancing LLM-based autoformalization in real-world scenarios.
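The refinement-through-feedback strategy amounts to a generate-check-repair loop. A schematic Python sketch under assumed interfaces (formalize and isabelle_check are hypothetical stand-ins, not the paper's code):

    def refine_definition(nl_definition, formalize, isabelle_check, max_rounds=3):
        # First attempt at an Isabelle/HOL formalization of the definition.
        attempt = formalize(nl_definition, feedback=None)
        for _ in range(max_rounds):
            ok, errors = isabelle_check(attempt)  # type errors, undefined names, ...
            if ok:
                return attempt
            # Feed the proof assistant's error messages back to the LLM.
            attempt = formalize(nl_definition, feedback=errors)
        return attempt  # best effort after max_rounds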

Language models for verifiable mathematical automation (Interaction, integration, and autoformalization). ~ Qiaochu Jiang. repository.cam.ac.uk/items/6cc #ITP #IsabelleHOL #LLMs #Autoformalization

Language models for verifiable mathematical automation: Interaction, integration, and autoformalization (repository.cam.ac.uk)
Stronger automation in formal mathematical reasoning provides scalability, trustworthiness, and accessibility: it enables efficient verification of complex proofs, reduces the likelihood of errors in intricate calculations, allows non-experts to engage with mathematical concepts, and potentially facilitates the discovery of novel mathematical insights through machine-driven exploration. Traditionally, automated reasoners only have formal-language repositories at their disposal and cannot use the much larger natural-language corpora, since natural language comprehension is difficult. While controlled natural language interfaces with formal syntax have been designed (e.g., Naproche), they are ultimately too rigid to be widely useful. Large language models, on the other hand, excel at language comprehension and have been exposed to innumerable mathematical documents humans have accumulated throughout the ages, rendering them an ideal candidate for an automation tool. Modern language models have been trained on vast amounts of data and can be instructed to carry out an impressive number of tasks. The versatility of language models makes them complementary to traditional reasoners, whose rigidity severely limits wider adoption and easier use.

The goal of this thesis is to answer how we can achieve trustworthy automation in mathematical reasoning with language models. To this end, we introduce a novel evaluation of language models to determine their suitable scenarios in mathematical reasoning. With the insights obtained, we then train them to write formal proofs in those scenarios with assistance from automated theorem provers, and use them to translate mathematical theorems and proofs from LaTeX to formal languages. Concretely:

(1) We conduct the first interactive evaluation of language models on informal mathematical reasoning. The evaluation reveals language models' inherent strengths in memorisation and flexible inputs/outputs, and weaknesses in complex reasoning and algebraic manipulations.

(2) With LLMs' strengths and weaknesses in mind, we develop the first approach combining automated theorem provers and language models for proof search: Thor. Thor uses language models for creative proof steps, and calls Sledgehammer (Isabelle's tool with access to many automated theorem provers) to fill in premise selection details. We also develop Magnushammer, a hammer-like system with a transformer-based relevance filter, and show that Magnushammer outperforms Sledgehammer both individually and as a plugin for Thor.

(3) All existing approaches for proof search operate either over natural or formal language, without bringing together their respective expressiveness and rigour. To remedy this dichotomy, we use language models to perform autoformalization, the process of turning informal mathematical theorems and proofs into formal ones. We validate the usefulness of autoformalization by showing that autoformalized theorems can serve as training data to improve proof search systems. We also take repositories of formal theorems in multiple languages and informalise them into natural language. This results in a large parallel dataset of informal and formal mathematical theorems, MMA, which induces a significant boost in autoformalization capability when language models are fine-tuned on it.

(4) Finally, we construct Draft, Sketch, and Prove, a methodology that uses language models to translate informal proofs into formal proof sketches, and then calls automated theorem provers to fill in the gaps. This neuro-symbolic method can endow informal proofs with rigour by effectively translating them into formal ones and verifying them.

The research in this thesis presents a new paradigm of theorem proving research in which language models constitute both a scalable power engine for automating numerous tasks and an accessible bridge between human users and machines for reasoning.
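The Draft, Sketch, and Prove pipeline from point (4) compresses into three stages; a hypothetical Python outline (every callable is a placeholder for an LLM or prover component, not the thesis's actual code):

    def draft_sketch_prove(theorem, llm_draft, llm_sketch, atp_fill, checker):
        informal_proof = llm_draft(theorem)           # 1. informal draft in natural language
        sketch = llm_sketch(theorem, informal_proof)  # 2. formal skeleton with open gaps
        formal_proof = atp_fill(sketch)               # 3. ATPs (e.g., Sledgehammer) close gaps
        return formal_proof if checker(formal_proof) else None  # 4. verified or rejected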

ATLAS: Autoformalizing theorems through lifting, augmentation, and synthesis of data. ~ Xiaoyang Liu, Kangjie Bao, Jiashuo Zhang, Yunqi Liu, Yu Chen, Yuntian Liu, Yang Jiao, Tao Luo. arxiv.org/abs/2502.05567 #LLMs #ITP #LeanProver #Math #Autoformalization

ATLAS: Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data (arXiv.org)
Autoformalization, the process of automatically translating natural language mathematics into machine-verifiable formal language, has advanced with the progress of large language models (LLMs). However, a key obstacle to further progress is the scarcity of paired datasets that align natural language with formal language. To address this challenge, we introduce ATLAS (Autoformalizing Theorems through Lifting, Augmentation, and Synthesis of Data), an iterative data generation framework designed to produce large-scale, high-quality parallel theorem statements. Running ATLAS for 10 iterations, we construct an undergraduate-level dataset comprising 300k theorem statements and develop the ATLAS translator, achieving accuracies of 80.59% (pass@8) and 92.99% (pass@128) on ProofNet, significantly outperforming the base model (23.99% and 47.17%) and InternLM2-Math-Plus-7B (50.94% and 80.32%). Furthermore, the ATLAS translator also achieves state-of-the-art performance on both the high-school-level miniF2F dataset and the graduate-level MathQual dataset introduced in this work. The datasets, model, and code will be released to the public soon.
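For reference, pass@k figures like those above are conventionally computed with the standard unbiased estimator (a general metric, not ATLAS-specific): with n samples drawn per problem, of which c are correct translations,

    \text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]

i.e., the probability that at least one of k drawn samples is correct.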

Formal mathematical reasoning: A new frontier in AI. ~ Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, Dawn Song. arxiv.org/abs/2412.16075 #AI #Math #Reasoning #ITP #Coq #IsabelleHOL #LeanProver #Autoformalization

Formal Mathematical Reasoning: A New Frontier in AI (arXiv.org)
AI for Mathematics (AI4Math) is not only intriguing intellectually but also crucial for AI-driven discovery in science, engineering, and beyond. Extensive efforts on AI4Math have mirrored techniques in NLP, in particular, training large language models on carefully curated math datasets in text form. As a complementary yet less explored avenue, formal mathematical reasoning is grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback. In this position paper, we advocate for formal mathematical reasoning and argue that it is indispensable for advancing AI4Math to the next level. In recent years, we have seen steady progress in using AI to perform formal reasoning, including core tasks such as theorem proving and autoformalization, as well as emerging applications such as verifiable generation of code and hardware designs. However, significant challenges remain to be solved for AI to truly master mathematics and achieve broader impact. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success. At this inflection point for formal mathematical reasoning, we call on the research community to come together to drive transformative advancements in this field.