MarkItDown
MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to textract, but with a focus on…
An index of datasets, SDKs, APIs and other open source code created by Microsoft researchers and shared with the broader academic community. We also maintain a collection highlighting some of the tools you’ll find here.
Magentic-UI is a research prototype of an agentic web interface for solving complex web tasks. An Orchestrator coordinates four AutoGen agents (WebSurfer, Coder, FileSurfer, and UserProxy) to handle browsing, coding, file management, and user feedback. It…
Code associated with the CVPR 2025 paper “Magma: A Foundation Model for Multimodal AI Agents”
These are the prompts and qrels used for the experiments in Thomas et al., “System Comparison using Automated Generation of Relevance Judgements in Multiple Languages”, SIGIR 2025.
Time series generation technology plays a vital role in alleviating data scarcity, especially in scenarios where collecting real-world data is expensive, time-consuming, or impractical. It also enables privacy-preserving analysis by producing realistic but non-identifiable synthetic…
This is the official code repository for the paper “Unearthing Skill-level Insights for Understanding Tradeoffs of Foundation Models”. All rationales, localized skills, and skill-slices for the 12 datasets studied in the paper can also be accessed…
debug-gym is a text-based, interactive framework for debugging Python programs.
MatterGen is a generative model for inorganic materials design across the periodic table that can be fine-tuned to steer the generation towards a wide range of property constraints.