Rate Limits #
Rate limits are restrictions placed on the number of API requests or the amount of data that can be processed within a specified time period. These limits are enforced to ensure fair usage, maintain service quality, and protect our infrastructure from potential abuse or overload.
Types of Rate Limits #
We employ three types of rate limits for our inference API:
- **Concurrent Requests**: The maximum number of simultaneous API requests allowed at any given time.
- **Requests per Minute (RPM)**: The maximum number of API requests that can be made within a 60-second sliding window.
- **Tokens per Minute (TPM)**: The maximum number of tokens (combined input and output) that can be processed within a 60-second sliding window.
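To illustrate how RPM and TPM interact over a 60-second sliding window, here is a minimal client-side limiter sketch. This is not our server-side implementation; the class name and interface are hypothetical, shown only to clarify the sliding-window semantics.

```python
import time
from collections import deque
from typing import Optional


class SlidingWindowLimiter:
    """Tracks request timestamps and token counts over a sliding window.

    Hypothetical client-side helper; mirrors the RPM/TPM semantics described
    above, not the provider's actual enforcement logic.
    """

    def __init__(self, rpm_limit: int, tpm_limit: int, window: float = 60.0):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window = window
        self.events = deque()  # (timestamp, tokens) pairs

    def _prune(self, now: float) -> None:
        # Drop events that have aged out of the 60-second window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def allow(self, tokens: int, now: Optional[float] = None) -> bool:
        """Return True if a request consuming `tokens` fits both limits."""
        now = time.monotonic() if now is None else now
        self._prune(now)
        used_tokens = sum(t for _, t in self.events)
        if len(self.events) >= self.rpm_limit or used_tokens + tokens > self.tpm_limit:
            return False
        self.events.append((now, tokens))
        return True
```

Because the window slides, a request denied now may succeed as soon as the oldest request in the window ages past 60 seconds, rather than at a fixed minute boundary.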
Inference -- Dedicated Deployments #
No rate limits are applied when calling inference API endpoints for dedicated deployments.
Inference -- Serverless #
Rate limits are applied on a per-model basis.
Important: Rate limits are intentionally set low during the Early Access phase and will be raised once the service reaches General Availability.
| Model Name (model string in API calls) | Concurrent requests | Requests per minute (RPM) | Tokens per minute (TPM) |
|---|---|---|---|
| llama-3.1-70b-instruct | 5 | 30 | 8192 |
| llama-3.1-8b-instruct | 5 | 30 | 8192 |
| qwen2.5-coder-32b-instruct | 5 | 30 | 8192 |
| mistral-nemo-instruct-2407 | 5 | 30 | 8192 |
| pixtral-12b-2409 | 5 | 30 | 8192 |
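Since the TPM limit counts input and output tokens together, it helps to budget a request before sending it. The sketch below uses a crude character-based heuristic (roughly 4 characters per token for English text); this ratio is an assumption for illustration, not the models' actual tokenization, so treat the result as an estimate only.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (~4 chars/token for English text).

    Assumption for illustration only; it is NOT the exact tokenizer
    used by any of the models in the table above.
    """
    return max(1, round(len(text) / chars_per_token))


def fits_tpm_budget(prompt: str, max_output_tokens: int, tpm_limit: int = 8192) -> bool:
    """Check whether estimated input tokens plus the requested output
    budget fit within a per-minute token limit (default from the table)."""
    return estimate_tokens(prompt) + max_output_tokens <= tpm_limit
```

For accurate counts, use the tokenizer that matches the model you are calling; the heuristic is only a cheap pre-flight check.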
Enforcement #
Our rate limits are enforced per model and per account. If a limit is exceeded, subsequent requests are denied until your usage falls below the threshold.
If you require higher rate limits for your use case, please contact our sales team: sales@etkos.ai.