Using Dedicated Deployments #
Deploy any model with a dedicated GPU and API endpoint in seconds. Customize your deployment with options to select quantization, context length, and GPU accelerator tailored to your exact needs. Get maximum performance with no rate limits.
Management API #
Dedicated deployments are created, listed and deleted using the HTTP API with the following endpoints:
- List all deployments: GET /deployments.
- Create a new deployment: POST /deployments.
- Retrieve deployment information: GET /deployments/{deploymentID}.
- Delete a deployment: DELETE /deployments/{deploymentID}.
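For example, listing all deployments is a single authenticated request (mirroring the tutorial examples below; `EKTOS_API_KEY` must hold a valid API key):

```bash
# List all dedicated deployments for the current account.
curl https://api.ektos.ai/v1/deployments \
  --header "Authorization: Bearer $EKTOS_API_KEY"
```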
Available models & configuration options #
All models are available for dedicated deployments. The complete list can be found here.
Quantization #
Quantization is a method used to decrease the computational and memory demands of model inference by converting the model’s weights and activations from high-precision data types, such as 32-bit floating point (float32), to lower-precision formats like 8-bit integers (int8).
By reducing the precision, the model becomes more memory-efficient, uses less power, and enables faster operations.
Quantization can improve performance metrics such as time-to-first-token and throughput (tokens per second), typically at the cost of a slight, often unnoticeable, reduction in prediction precision.
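For a concrete sense of scale: an 8-billion-parameter model stored in bf16 (2 bytes per weight) needs roughly 16 GB for its weights alone, while the same model quantized to int8 (1 byte per weight) fits in about 8 GB, leaving more headroom for activations and the KV cache on a given GPU.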
Multiple quantization options are available for each model (see the Available Quantizations column here). When creating a deployment, you can select the desired quantization type using the optional `quantization` parameter. If this parameter is not provided, the model is deployed with its default quantization setting (typically `bf16`: bfloat16).
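For example, to pin the quantization explicitly at creation time (a sketch reusing the tutorial's model and GPU; `bf16` is the documented default, so this is equivalent to omitting the parameter):

```bash
# Create a deployment with an explicit quantization setting.
curl https://api.ektos.ai/v1/deployments \
  --request POST \
  --header "Authorization: Bearer $EKTOS_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "llama-3.1-8b-instruct",
    "accelerator": "NVIDIA L4",
    "quantization": "bf16"
  }'
```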
Context Length #
Context length refers to the maximum number of tokens a model can process in a single input sequence. Tokens are units of text, such as words or subwords, that the model uses to understand and generate responses. Increasing the context length allows the model to handle longer sequences of input data, enabling it to retain more information and better understand complex relationships across the text.
Extending the context length comes with trade-offs. Longer input sequences increase computational and memory requirements, as the model must process and maintain attention over a larger number of tokens. This can result in slower inference times and higher resource usage.
The context length for a model deployment can be configured using the `context_length` parameter. If this parameter is not provided, the model defaults to the maximum context length supported by the selected `quantization` and GPU (`accelerator`) settings.
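For example, to cap the context at 8192 tokens (a hypothetical value chosen for illustration; it must not exceed the maximum supported by the selected quantization and GPU):

```bash
# Create a deployment with an explicit context length.
curl https://api.ektos.ai/v1/deployments \
  --request POST \
  --header "Authorization: Bearer $EKTOS_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "llama-3.1-8b-instruct",
    "accelerator": "NVIDIA L4",
    "context_length": 8192
  }'
```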
Available GPUs #
The following GPUs are available to use for dedicated deployments:
| Name (`accelerator` field) | Number of accelerators (`accelerator_count` field) | VRAM per accelerator |
| --- | --- | --- |
| NVIDIA L4 | 1 | 24 GB |
| NVIDIA L40S | 1 | 48 GB |
| NVIDIA H100 | 1 | 80 GB |
| NVIDIA H100 | 2 | 80 GB |
Important: more GPUs will become available when the service reaches General Availability.
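To target the dual-GPU row of the table, pass the matching accelerator fields (a sketch: `accelerator_count` is shown in deployment responses, and we assume here that it is also accepted in the creation request):

```bash
# Create a deployment on two NVIDIA H100 accelerators.
curl https://api.ektos.ai/v1/deployments \
  --request POST \
  --header "Authorization: Bearer $EKTOS_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "llama-3.1-8b-instruct",
    "accelerator": "NVIDIA H100",
    "accelerator_count": 2
  }'
```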
Lifecycle & Billing #
A deployment follows this lifecycle:

`created` -> `deploying` -> `deployed` -> `application_starting` -> `ready` -> `destroying_registered` -> `destroying` -> `destroyed`
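Because a deployment only serves traffic once it reaches `ready`, a simple pattern is to poll its status after creation (a minimal sketch using `jq`, with the deployment ID from the tutorial below):

```bash
# Poll the deployment status until it is ready to serve requests.
DEPLOYMENT_ID="ea4fa568-eb8d-492b-8f43-7e798f541634"
while true; do
  STATUS=$(curl --silent \
    --header "Authorization: Bearer $EKTOS_API_KEY" \
    "https://api.ektos.ai/v1/deployments/$DEPLOYMENT_ID" | jq -r .status)
  echo "status: $STATUS"
  [ "$STATUS" = "ready" ] && break
  sleep 10
done
```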
- Dedicated deployments are billed per minute.
- Billing begins when the deployment reaches the `ready` status for the first time, indicating it is available to handle inference requests.
- Billing ends when the deployment is marked for deletion (`destroying_registered` status), which occurs when the deletion API call is made.
- We do not charge for startup time or teardown time.
User Guide #
In this tutorial, we will create a dedicated deployment for the Llama 3.1 8B Instruct model on an NVIDIA L4 GPU.
Create the deployment #
Use the POST /deployments endpoint.
```bash
curl https://api.ektos.ai/v1/deployments \
  --request POST \
  --header "Authorization: Bearer $EKTOS_API_KEY" \
  --header "Content-Type: application/json" \
  --data '{
    "model": "llama-3.1-8b-instruct",
    "accelerator": "NVIDIA L4"
  }'
```
The API responds with the newly created deployment object:

```json
{
  "accelerator": "NVIDIA L4",
  "accelerator_count": 1,
  "base_url": "",
  "context_length": 23552,
  "created_at": "2024-12-17T11:37:09.420060869Z",
  "display_name": "",
  "id": "ea4fa568-eb8d-492b-8f43-7e798f541634",
  "model": "llama-3.1-8b-instruct",
  "quantization": "bf16",
  "status": "created",
  "updated_at": "2024-12-17T11:37:09.420060869Z"
}
```
Note that we did not supply the `quantization` and `context_length` parameters, so the default values for this GPU were used: `bf16` and `23552` (23K tokens).
Retrieve deployment information #
Use the GET /deployments/{deploymentID} endpoint.
```bash
curl 'https://api.ektos.ai/v1/deployments/ea4fa568-eb8d-492b-8f43-7e798f541634' \
  --header "Authorization: Bearer $EKTOS_API_KEY"
```
The response now shows a populated `base_url` and the deployment's current status:

```json
{
  "accelerator": "NVIDIA L4",
  "accelerator_count": 1,
  "base_url": "ea4fa568-eb8d-492b-8f43-7e798f541634.deployments.api.ektos.ai",
  "context_length": 23552,
  "created_at": "2024-12-17T11:37:09.42006Z",
  "description": "",
  "display_name": "",
  "id": "ea4fa568-eb8d-492b-8f43-7e798f541634",
  "model": "llama-3.1-8b-instruct",
  "quantization": "bf16",
  "status": "ready",
  "updated_at": "2024-12-17T11:45:07.64663Z"
}
```
Here the deployment `status` is `ready`, indicating that it can receive inference requests.
Let's make a simple chat completions request:
```bash
curl https://ea4fa568-eb8d-492b-8f43-7e798f541634.deployments.api.ektos.ai/v1/chat/completions \
  --request POST \
  --header "Authorization: Bearer $EKTOS_API_KEY" \
  --header "Content-Type: application/json" \
  --silent \
  --data '{
    "model": "llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how can you help me? Very Short answer."
      }
    ]
  }' | jq -r '.choices[0].message.content'
```
```text
I can assist with questions and provide information. How can I help you specifically?
```
Delete the deployment #
Use the DELETE /deployments/{deploymentID} endpoint.
Once you are done using the deployment, you can delete it with a single API call.
```bash
curl --header "Authorization: Bearer $EKTOS_API_KEY" \
  --request DELETE 'https://api.ektos.ai/v1/deployments/ea4fa568-eb8d-492b-8f43-7e798f541634'
```
The billing period ends when this API call is received by the Ektos AI API.
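To confirm the teardown, you can fetch the deployment again and watch its status progress through the destroy states (a sketch; we assume the API continues returning the deployment object while teardown is in progress):

```bash
# Check the deployment status after requesting deletion.
curl --silent \
  --header "Authorization: Bearer $EKTOS_API_KEY" \
  'https://api.ektos.ai/v1/deployments/ea4fa568-eb8d-492b-8f43-7e798f541634' | jq -r .status
```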
Next steps #
- Discover the available models for Inference (Dedicated Deployments).
- Manage dedicated deployments.
- Use text models.
- Use audio models.
- Use embedding models.
- Get in touch and interact with our community on our Discord.