# Model Serving - Active Models

This is the list of AI models currently live and available on the Phoeniqs Model Service. All models are served on Phoeniqs infrastructure through an OpenAI-compatible API and are ready for production use.


# Active Models

Input and Output: credits per million tokens. TPM: tokens per minute. Context window: maximum tokens per API call.

Model Name Model Type Input Output TPM Context window Description
inference-apertus-70b RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16 Chat 0.6195 2.2201 188'040 56'384 Optimized for multilingual dialogue use cases.
inference-deepseek-v32 deepseek-ai/DeepSeek-V3.2 Chat 0.6156 1.8469 358'873 163'840 Optimized for Reasoning chat completions.
inference-glm45-air-110b zai-org/GLM-4.5-Air-FP8 Chat 0.4232 1.6853 242'506 8'192 Optimized for Reasoning chat completions.
inference-glm-51-754b zai-org/GLM-5.1 Reasoning & Coding 1.077 3.386 237'343 163'800 Optimized for agentic engineering and long-horizon coding. Designed for sustained performance across extended agentic sessions with stronger coding, terminal, and software-engineering task capabilities.
inference-gpt-oss-120b openai/gpt-oss-120b Chat 0.1154 0.4617 278'178 131'072 Optimized for powerful reasoning, agentic tasks, and versatile developer use cases.
inference-granite-33-8b ibm-granite/granite-3.3-8b-instruct Chat 0.1539 0.1539 599'909 32'768 Optimized for Reasoning and instruction-following capabilities.
inference-mistral-v03-7b mistralai/Mistral-7B-Instruct-v0.3 Chat 0.1539 0.1539 686'144 32'768 Optimized for multilingual dialogue use cases.
inference-llama4-maverick RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16 Chat and multi-modal 0.2693 1.0773 427'080 1'048'576 Optimized for text and multimodal experiences. Max images per prompt is 4 and no video prompts
inference-llama4-scout-17b RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16 Chat and multi-modal 0.1924 0.6387 280'680 62'256 Optimized for text and multimodal experiences.
inference-qwen3-8b RedHatAI/Qwen3-8B-quantized.w4a16 Reasoning 0.0269 0.1062 582'420 40'960 Optimized for thinking and reasoning.
inference-qwq-32b RedHatAI/QwQ-32B-quantized.w8a8 Reasoning 0.9234 0.9234 498'215 32'768 Optimized for thinking and reasoning.
inference-gemma-12b-it RedHatAI/gemma-3-12b-it-quantized.w4a16 Multimodal 0.2693 0.4309 328'184 131'072 Optimized for handling text and image input and generating text output.
inference-gemma4-31b RedHatAI/gemma-4-31B-it-FP8-block Multimodal 0.118 0.325 271'465 131'072 Optimized for handling text and image input and generating text output.
inference-granite-vision-2b ibm-granite/granite-vision-3.2-2b Multimodal 0.0770 0.0770 459'595 8'192 Optimized for compact and efficient vision-language model
inference-qwen3-vl-235b RedHatAI/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic Multimodal 0.7003 2.0000 486'660 65'536 Optimized for text and multimodal experiences.
inference-bge-m3 BAAI/bge-m3 Embedding 0.4309 --- 63'180 Optimized for Embeddings and sparse retrieval with support for Multi-Functionality, Multi-Linguality, and Multi-Granularity.
inference-granite-emb-278m ibm-granite/granite-embedding-278m-multilingual Embedding 0.0770 --- 54'660 Optimized for Embeddings.
inference-bge-reranker BAAI/bge-reranker-v2-m3 Reranker 0.0077 --- 542'850 Optimized for Reranker to get relevance score.
inference-deepseek-ocr inference-deepseek-ocr ocr 0.3848 1.5391 67'283 Optimized for Contexts Optical Compression.
inference-miner-u25 opendatalab/MinerU2.5-2509-1.2B vision-language 0.38 0.23 54'785 8'192 Optimized for document parsing that achieves state-of-the-art accuracy with high computational efficiency.
inference-whisper-large-v3 openai/whisper-large-v3 Speech to Text 0.006 per minute NA TBD Optimized for automatic speech recognition (ASR) and speech translation. WhisperX model support high throughput, ASR, long audio files support, speaker diarization & attribution and word-level alignments.

Token Throughput Disclaimer : The token-per-minute throughput figures provided are based on controlled testing conditions and are intended for benchmarking and comparison purposes only. Actual performance in production environments may vary significantly depending on workload characteristics, system configuration, model hosting provider, network conditions, and other operational factors. These results should not be interpreted as a guarantee of real-world performance.

Model Updates and Deprecation Disclaimer We reserve the right to modify, upgrade, or replace any AI models used in our services at any time. This may include deprecating older models and introducing newer versions as we deem necessary to maintain performance, security, and service quality. While we aim to provide notice when feasible, changes may occur without prior notification.

NOTE Pricing is subject to change at our discretion.


# Active Models by Use Case

The groups below mirror the Active Models table: every live model in that table appears exactly once, grouped by primary workload.

  • Chat, assistants, and general text (instruction-following)
    Multilingual dialogue, coding help, and broad assistant-style tasks: Apertus 70B, DeepSeek V3.2, GLM 4.5 Air 110B, GLM 4.6 (chat; deployed on demand), GPT OSS 120B, Granite 3.3 8B, Mistral 7B Instruct v0.3.

  • Reasoning and chain-of-thought
    Deliberate, step-heavy tasks: GLM 5.1, Qwen3 8B, QwQ 32B.

  • Multimodal (images and text)
    Image+text input and text output, or compact vision-language: Gemma 3 12B IT, Gemma 4 31B, Granite Vision 3.2 2B, Llama 4 Maverick, Llama 4 Scout 17B, Qwen3 VL 235B.

  • RAG (retrieval-augmented generation)
    Embeddings, reranking, and retrieval stacks: BGE M3 (embedding), BGE Reranker v2 M3, Granite Embedding 278M.

  • OCR, layout, and document parsing
    Optical compression, layout, and structured document workflows: DeepSeek OCR, MinerU 2.5.

  • Speech
    Transcription and speech-centric workflows: Whisper Large v3.

Agents, tool calling, and orchestration usually pick one primary LLM from the chat or reasoning groups, then add other pieces only when the workflow needs them: multimodal models for image inputs; OCR or MinerU to extract text or structure from documents; and RAG (embedding, reranking) when the agent must retrieve relevant passages from content you have already chunked and indexed.


# Active Models by Risk

Risk Level Typical Models Main Risks
Low Risk Embedding models, OCR, translation, sentiment analysis Limited misuse potential
Medium Risk General LLMs, code generation, image generation Hallucinations, insecure code, copyright issues
High Risk Unmoderated LLMs, NSFW models, unsafe diffusion models, executable pickle-based models Malware, harmful content, prompt injection, data leakage

# Using the models

Looking for ready-to-run examples? See the Model Service Guides: