A unified interface for state-of-the-art multimodal and document AI models. Select a model, upload an image or video, and enter a query to begin.

  • Nanonets-OCR-s: Nanonets-OCR-s is a powerful, state-of-the-art image-to-markdown OCR model that goes far beyond traditional text extraction. It transforms documents into structured markdown with intelligent content recognition and semantic tagging.

  • SmolDocling-256M: SmolDocling is a multimodal Image-Text-to-Text model designed for efficient document conversion. It retains Docling's most popular features while ensuring full compatibility with Docling through seamless support for DoclingDocuments.

  • MonkeyOCR-Recognition: MonkeyOCR adopts a Structure-Recognition-Relation (SRR) triplet paradigm, which simplifies the multi-tool pipeline of modular approaches while avoiding the inefficiency of using large multimodal models for full-page document processing.

  • Typhoon-OCR-7B: A bilingual document-parsing model built specifically for real-world documents in Thai and English, inspired by models like olmOCR and based on Qwen2.5-VL-Instruct. It extracts and interprets embedded text (e.g., chart labels, captions) in Thai or English.

  • Thyme-RL: Think Beyond Images. Thyme transcends traditional "thinking with images" paradigms by autonomously generating and executing diverse image-processing and computational operations through executable code, significantly enhancing performance on high-resolution perception and complex reasoning tasks.

  • ⚠️ Note: Performance on video inference tasks is experimental and may vary between models.
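The models above can also be queried outside this interface. Below is a minimal sketch of the usual Hugging Face `transformers` image-text-to-text flow, using Nanonets-OCR-s as the example; the model id, query string, and generation settings are assumptions here, so check the respective model card for the exact repo id and prompt format before use.

```python
def build_messages(image_path: str, query: str) -> list:
    """Build a chat-template message list pairing one image with a text query."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_path},
                {"type": "text", "text": query},
            ],
        }
    ]


def run_ocr(image_path: str, query: str,
            model_id: str = "nanonets/Nanonets-OCR-s") -> str:
    """Run one image+query turn through the model (heavy: downloads weights)."""
    # Imported lazily so build_messages stays usable without transformers installed.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(model_id)

    inputs = processor.apply_chat_template(
        build_messages(image_path, query),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    # Decode only the newly generated tokens, skipping the echoed prompt.
    return processor.batch_decode(
        output_ids[:, inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )[0]
```

The other models listed here generally follow the same chat-template pattern, though some (e.g., SmolDocling via Docling, MonkeyOCR via its own pipeline) ship dedicated wrappers that handle prompt construction for you.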
