# Model Serving - Active Models

This is the list of AI models currently live and available on the Phoeniqs Model Service. All models are served on Phoeniqs infrastructure through an OpenAI-compatible API and are ready for production use.

For real-time health and availability, see the model status dashboard.

# Active Models

Input and Output: credits per million tokens. TPM: tokens per minute. Context window: maximum tokens per API call.

Model Name	Model	License type	Type	Input	Output	TPM	Context window	Description
`inference-apertus-70b`	RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16	Apache 2.0	Chat	0.6195	2.2201	188'040	56'384	Optimized for multilingual dialogue use cases.
`inference-apertus-v15-70b`	swiss-ai/Apertus-v1.5-70B	Apache 2.0	Chat	0.6195	2.2201	188'040	56'384	Optimized for multilingual dialogue, tool use, and agent workflows.
`inference-deepseek-v32`	deepseek-ai/DeepSeek-V3.2	MIT	Chat	0.6156	1.8469	358'873	163'840	Optimized for Reasoning chat completions.
`inference-glm45-air-110b`	zai-org/GLM-4.5-Air-FP8	MIT	Chat	0.4232	1.6853	242'506	8'192	Optimized for Reasoning chat completions.
`inference-glm5`	zai-org/GLM-5.2	MIT	Reasoning & Coding	1.077	3.386	135'060	131'072	Optimized for agentic engineering and long-horizon coding. Designed for sustained performance across extended agentic sessions with stronger coding, terminal, and software-engineering task capabilities.
`inference-gpt-oss-120b`	openai/gpt-oss-120b	Apache 2.0	Chat	0.1154	0.4617	278'178	131'072	Optimized for powerful reasoning, agentic tasks, and versatile developer use cases.
`inference-granite-33-8b`	ibm-granite/granite-3.3-8b-instruct	Apache 2.0	Chat	0.1539	0.1539	599'909	32'768	Optimized for Reasoning and instruction-following capabilities.
`inference-mistral-v03-7b`	mistralai/Mistral-7B-Instruct-v0.3	Apache 2.0	Chat	0.1539	0.1539	686'144	32'768	Optimized for multilingual dialogue use cases.
`inference-llama4-maverick`	RedHatAI/Llama-4-Maverick-17B-128E-Instruct-quantized.w4a16	Llama 4 Community License (custom)	Chat and multi-modal	0.2693	1.0773	427'080	1'048'576	Optimized for text and multimodal experiences. Max images per prompt is 4 and no video prompts
`inference-llama4-scout-17b`	RedHatAI/Llama-4-Scout-17B-16E-Instruct-quantized.w4a16	Llama 4 Community License	Chat and multi-modal	0.1924	0.6387	280'680	62'256	Optimized for text and multimodal experiences.
`inference-qwen3-8b`	RedHatAI/Qwen3-8B-quantized.w4a16	Apache 2.0	Reasoning	0.0269	0.1062	582'420	40'960	Optimized for thinking and reasoning.
`inference-qwq-32b`	RedHatAI/QwQ-32B-quantized.w8a8	Apache 2.0	Reasoning	0.9234	0.9234	498'215	32'768	Optimized for thinking and reasoning.
`inference-gemma-12b-it`	RedHatAI/gemma-3-12b-it-quantized.w4a16	Gemma License	Multimodal	0.2693	0.4309	328'184	131'072	Optimized for handling text and image input and generating text output.
`inference-gemma4-31b`	RedHatAI/gemma-4-31B-it-FP8-block	Apache 2.0	Multimodal	0.118	0.325	271'465	131'072	Optimized for handling text and image input and generating text output.
`inference-granite-vision-2b`	ibm-granite/granite-vision-3.2-2b	Apache 2.0	Multimodal	0.0770	0.0770	459'595	8'192	Optimized for compact and efficient vision-language model
`inference-qwen3-vl-235b`	RedHatAI/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic	Apache 2.0	Multimodal	0.7003	2.0000	486'660	65'536	Optimized for text and multimodal experiences.
`inference-bge-m3`	BAAI/bge-m3	MIT	Embedding	0.4309	---	63'180	—	Optimized for Embeddings and sparse retrieval with support for Multi-Functionality, Multi-Linguality, and Multi-Granularity.
`inference-granite-emb-278m`	ibm-granite/granite-embedding-278m-multilingual	Apache 2.0	Embedding	0.0770	---	54'660	—	Optimized for Embeddings.
`inference-bge-reranker`	BAAI/bge-reranker-v2-m3	Apache 2.0	Reranker	0.0077	---	542'850	—	Optimized for Reranker to get relevance score.
`inference-deepseek-ocr`	inference-deepseek-ocr	MIT	ocr	0.3848	1.5391	67'283	—	Optimized for Contexts Optical Compression.
`inference-miner-u25`	opendatalab/MinerU2.5-2509-1.2B	AGPL 3.0	vision-language	0.38	0.23	54'785	8'192	Optimized for document parsing that achieves state-of-the-art accuracy with high computational efficiency.
`inference-whisper-large-v3`	openai/whisper-large-v3	Apache 2.0	Speech to Text	0.006 per minute	NA	TBD	—	Optimized for automatic speech recognition (ASR) and speech translation. WhisperX model support high throughput, ASR, long audio files support, speaker diarization & attribution and word-level alignments.

Token Throughput Disclaimer : The token-per-minute throughput figures provided are based on controlled testing conditions and are intended for benchmarking and comparison purposes only. Actual performance in production environments may vary significantly depending on workload characteristics, system configuration, model hosting provider, network conditions, and other operational factors. These results should not be interpreted as a guarantee of real-world performance.

Model Updates and Deprecation Disclaimer We reserve the right to modify, upgrade, or replace any AI models used in our services at any time. This may include deprecating older models and introducing newer versions as we deem necessary to maintain performance, security, and service quality. While we aim to provide notice when feasible, changes may occur without prior notification.

NOTE Pricing is subject to change at our discretion.

# Active Models by Use Case

The groups below mirror the Active Models table: every live model in that table appears exactly once, grouped by primary workload.

Chat, assistants, and general text (instruction-following)
Multilingual dialogue, coding help, and broad assistant-style tasks: Apertus 70B, DeepSeek V3.2, GLM 4.5 Air 110B, GLM 4.6 (chat; deployed on demand), GPT OSS 120B, Granite 3.3 8B, Mistral 7B Instruct v0.3.
Reasoning and chain-of-thought
Deliberate, step-heavy tasks: GLM 5, Qwen3 8B, QwQ 32B.
Multimodal (images and text)
Image+text input and text output, or compact vision-language: Gemma 3 12B IT, Gemma 4 31B, Granite Vision 3.2 2B, Llama 4 Maverick, Llama 4 Scout 17B, Qwen3 VL 235B.
RAG (retrieval-augmented generation)
Embeddings, reranking, and retrieval stacks: BGE M3 (embedding), BGE Reranker v2 M3, Granite Embedding 278M.
OCR, layout, and document parsing
Optical compression, layout, and structured document workflows: DeepSeek OCR, MinerU 2.5.
Speech
Transcription and speech-centric workflows: Whisper Large v3.

Agents, tool calling, and orchestration usually pick one primary LLM from the chat or reasoning groups, then add other pieces only when the workflow needs them: multimodal models for image inputs; OCR or MinerU to extract text or structure from documents; and RAG (embedding, reranking) when the agent must retrieve relevant passages from content you have already chunked and indexed.

# Active Models by Risk

Risk Level	Typical Models	Main Risks
Low Risk	Embedding models, OCR, translation, sentiment analysis	Limited misuse potential
Medium Risk	General LLMs, code generation, image generation	Hallucinations, insecure code, copyright issues
High Risk	Unmoderated LLMs, NSFW models, unsafe diffusion models, executable pickle-based models	Malware, harmful content, prompt injection, data leakage

# Using the models

Looking for ready-to-run examples? See the Model Service Guides:

How to inference an AI model — what you need to make a call (Base URL, Model Name, API Key).
Sample API calls — cURL examples for chat, embeddings, multimodal, OCR, and more.

maas ai model serving