publications
Papers by members of Diffusion @ IESL, in reverse chronological order.
2026
- Learned Relay Representations for Forward-Thinking Discrete Diffusion Models. Benjamin Rozonoyer, Jacopo Minniti, Dhruvesh Patel, and 4 more authors. In Structured Probabilistic Inference & Generative Modeling (SPIGM) and Frontiers in Generative AI (FoGen) Workshops at ICML, Jul 2026. Accepted to SPIGM @ ICML and FoGen @ ICML.
When Masked Diffusion Models (MDMs) generate sequences through iterative refinement, the rich internal computation over masked positions is discarded—forcing every subsequent refinement step to recompute the valuable internal information stored as model representations. To avoid a hard reset between denoising rounds, we propose Learned Relay Representations (Relay), a method that allows MDMs to be “forward-thinking” when denoising—explicitly learning how to propagate latent information for the benefit of future denoising steps. Relay introduces a differentiable per-token channel that passes information between forward passes and is trained via truncated backpropagation through time (BPTT). We show that this framework can be scaled to state-of-the-art Diffusion Language Models (DLMs), and is seamlessly compatible with techniques like block diffusion and KV caching. We first provide a thorough justification of the design choices in Relay on a challenging Sudoku-based planning task. We then scale Relay to Fast-dLLM v2, a state-of-the-art DLM, outperforming standard supervised finetuning on coding tasks while reducing the inference latency by up to 32%. Our empirical results demonstrate that state-of-the-art DLMs can be explicitly trained to relay latent information forward across decoding steps, advancing the performance-latency Pareto frontier. We provide code for all our experiments.
@inproceedings{rozonoyer2026relay,
  title = {Learned Relay Representations for Forward-Thinking Discrete Diffusion Models},
  author = {Rozonoyer, Benjamin and Minniti, Jacopo and Patel, Dhruvesh and Band, Neil and Bose, Joey and Rudner, Tim G. J. and McCallum, Andrew},
  booktitle = {Structured Probabilistic Inference \& Generative Modeling (SPIGM) and Frontiers in Generative AI (FoGen) Workshops at ICML},
  year = {2026},
  month = jul,
  note = {Accepted to SPIGM @ ICML and FoGen @ ICML},
  url = {https://arxiv.org/abs/2605.22967},
}
- [EACL Demo] xLM: A Python Package for Non-Autoregressive Language Models. Dhruvesh Patel, Durga Prasad Maram, Sai Sreenivas Chintha, and 2 more authors. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations), Mar 2026.
In recent years, there has been a resurgence of interest in non-autoregressive text generation in the context of general language modeling. Unlike the well-established autoregressive language modeling paradigm, which has a plethora of standard training and inference libraries, implementations of non-autoregressive language modeling have largely been bespoke, making it difficult to perform systematic comparisons of different methods. Moreover, each non-autoregressive language model typically requires its own data collation, loss, and prediction logic, making it challenging to reuse common components. In this work, we present the xLM Python package, designed to make implementing small non-autoregressive language models faster, with a secondary goal of providing a suite of small pre-trained models (through a companion package) that can be used by the research community.
@inproceedings{patel2026xlm,
  title = {{xLM}: A {P}ython Package for Non-Autoregressive Language Models},
  author = {Patel, Dhruvesh and Maram, Durga Prasad and Chintha, Sai Sreenivas and Rozonoyer, Benjamin and McCallum, Andrew},
  booktitle = {Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 3: System Demonstrations)},
  year = {2026},
  month = mar,
  pages = {445--456},
  address = {Rabat, Morocco},
  publisher = {Association for Computational Linguistics},
  doi = {10.18653/v1/2026.eacl-demo.31},
  url = {https://aclanthology.org/2026.eacl-demo.31/},
}
- [AISTATS] [🔦 Spotlight] A Continuous Time Markov Chain Framework for Insertion Language Models. Dhruvesh Patel, Benjamin Rozonoyer, Soumitra Das, and 3 more authors. In Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS), 2026.
Insertion Language Models (ILMs) offer several advantages over left-to-right generation and mask-based generation. However, existing formulations of insertion-based generation have largely been ad-hoc. In this paper, we derive a diffusion-style denoising objective for ILMs from first principles by formulating the noising process as a continuous-time Markov chain on the space of variable-length sequences. We show that previous formulations of ILMs can be viewed as special cases of this denoising framework. Through empirical evaluation on a synthetic planning task, we show that the proposed approach retains the benefits of insertion-based generation over left-to-right generation and masked diffusion models. In language modeling, our diffusion-based approach is competitive with left-to-right generation and masked diffusion models, while offering additional flexibility in sampling compared to existing insertion language models.
@inproceedings{patel2026ctmc,
  title = {A Continuous Time {M}arkov Chain Framework for Insertion Language Models},
  author = {Patel, Dhruvesh and Rozonoyer, Benjamin and Das, Soumitra and Naseem, Tahira and Rudner, Tim G. J. and McCallum, Andrew},
  booktitle = {Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS)},
  year = {2026},
  note = {Spotlight},
  url = {https://openreview.net/forum?id=nCyV21FmUI},
}
- Insertion Based Sequence Generation with Learnable Order Dynamics. Dhruvesh Patel, Benjamin Rozonoyer, Gaurav Pandey, and 3 more authors. In Proceedings of the 43rd International Conference on Machine Learning, Jul 2026.
In many domains, generating variable-length sequences through insertions provides greater flexibility than autoregressive models. However, the action space of insertion models is much larger than that of autoregressive models, making learning challenging. To address this, we incorporate trainable order dynamics into the target rates for discrete flow matching, and show that with suitable choices of parameterizations, joint training of the target order dynamics and the generator is tractable without the need for numerical simulation. As the generative insertion model, we use a variable-length masked diffusion model that generates by inserting and filling mask tokens. On graph traversal tasks for which a locally optimal insertion order is known, we explore the choices of parameterization empirically and demonstrate the trade-offs between flexibility, training stability and generation quality. On de novo small molecule generation, we find that the learned order dynamics lead to an increase in the number of valid molecules generated, when compared to uniform order dynamics.
@inproceedings{patel2026learnableorder,
  title = {Insertion Based Sequence Generation with Learnable Order Dynamics},
  author = {Patel, Dhruvesh and Rozonoyer, Benjamin and Pandey, Gaurav and Naseem, Tahira and Fernandez Astudillo, Ram{\'o}n and McCallum, Andrew},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year = {2026},
  month = jul,
  url = {https://arxiv.org/abs/2602.18695},
}
2025
- Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions. Dhruvesh Patel, Aishwarya Sahoo, Avinash Amballa, and 3 more authors. In Structured Probabilistic Inference & Generative Modeling Workshop at NeurIPS, 2025.
Autoregressive models (ARMs) predict subsequent tokens one-by-one “from left to right.” Masked Diffusion Models (MDMs) can generate tokens in arbitrary order, but unmasking multiple tokens simultaneously can introduce incoherence, and MDMs cannot handle arbitrary infilling constraints when the number of tokens to be filled is not known in advance. We introduce Insertion Language Models (ILMs), which learn to insert tokens at arbitrary positions in a sequence—jointly selecting both the position and the vocabulary element to be inserted. By inserting tokens one at a time, ILMs can represent strong dependencies between tokens, and their ability to generate sequences in arbitrary order allows them to accurately model sequences whose token dependencies do not follow a left-to-right sequential structure. To train ILMs, we propose a tailored network parameterization and use a simple denoising objective. Our empirical evaluation demonstrates that ILMs outperform both ARMs and MDMs on common planning tasks. Furthermore, ILMs outperform MDMs and perform on par with ARMs on unconditional text generation while offering greater flexibility than MDMs in arbitrary-length text infilling.
@inproceedings{patel2025ilm,
  title = {Insertion Language Models: Sequence Generation with Arbitrary-Position Insertions},
  author = {Patel, Dhruvesh and Sahoo, Aishwarya and Amballa, Avinash and Naseem, Tahira and Rudner, Tim G. J. and McCallum, Andrew},
  booktitle = {Structured Probabilistic Inference \& Generative Modeling Workshop at NeurIPS},
  year = {2025},
  url = {https://arxiv.org/abs/2505.05755},
}
- Improved Sampling from Masked Diffusion Models with Position Contrastive Guidance. Dhruvesh Patel, Tahira Naseem, Gaurav Pandey, and 3 more authors. In Structured Probabilistic Inference & Generative Modeling Workshop at NeurIPS, 2025.
Masked Diffusion Models (MDMs), which generate multiple tokens at a time, hold the promise of accelerating text generation. However, the performance of MDMs is sensitive to the order in which the tokens are generated. We observe that MDMs are overconfident about the masked positions on the extreme ends of the output sequence. MDMs also express uncertainty by producing similar probability scores for tokens regardless of the query position. Utilizing these insights, we propose Position Contrastive Guidance (PCG), which has two components: a soft order bias that favors left-to-right decoding, and a novel classifier-free guidance scheme that renormalizes the probabilities using position uncertainty to generate more informative tokens earlier in the generation. Our approach can be easily plugged into any existing uncertainty-guided sampling strategy. Experiments on GSM8k, MATH500, and HumanEval show that PCG improves both accuracy and throughput for the base and instruct versions of DREAM-7B and LLaDA-8B models.
@inproceedings{patel2025pcg,
  title = {Improved Sampling from Masked Diffusion Models with Position Contrastive Guidance},
  author = {Patel, Dhruvesh and Naseem, Tahira and Pandey, Gaurav and Sultan, Md Arafat and McCallum, Andrew and Fernandez Astudillo, Ram{\'o}n},
  booktitle = {Structured Probabilistic Inference \& Generative Modeling Workshop at NeurIPS},
  year = {2025},
  url = {https://openreview.net/forum?id=e0WmOrWbtc},
}