Amazon Bedrock: Inference Options
The path to production and ROI (return on investment) for generative AI applications largely depends on LLM inference challenges around cost, capacity, and latency. Not all inference requests are equal, and not all of them demand a high-performance model.
Until recently, inference options were limited to On-Demand / pay-as-you-go (token-based pricing) and Provisioned Throughput (dedicated capacity, which brings higher cost and capacity limits). This has led customers to build bespoke “Custom Routers” with trade-off logic for managing throughput, cost, and latency.
Custom routers aim to address multiple scenarios, including but not limited to the following:
- Ability to route traffic to a model hosted in another region when you hit a capacity limit for that model in one region
- Ability to route traffic to low-latency endpoints (such as self-hosted models or Provisioned Throughput) to protect the customer experience
- Ability to route traffic to multiple models depending on the request, its latency requirements, and the available capacity
Implementing these custom routers, and then handling errors, monitoring, managing, and upgrading them, adds significant operational overhead.
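To illustrate that overhead, here is a minimal sketch of the kind of fallback logic such a custom router ends up containing. The region list, model id, and error handling below are illustrative assumptions, not a recommended implementation.
import boto3
from botocore.exceptions import ClientError

## Illustrative list of (region, model id) pairs to try in order
ENDPOINTS = [
    ("us-east-1", "anthropic.claude-3-haiku-20240307-v1:0"),
    ("us-west-2", "anthropic.claude-3-haiku-20240307-v1:0"),
]

def route_request(messages):
    last_error = None
    for region, model_id in ENDPOINTS:
        client = boto3.client("bedrock-runtime", region_name=region)
        try:
            return client.converse(modelId=model_id, messages=messages)
        except ClientError as err:
            ## On throttling, fall back to the next endpoint; re-raise anything else
            if err.response["Error"]["Code"] == "ThrottlingException":
                last_error = err
                continue
            raise
    raise last_error
Every piece of this (retry policy, monitoring, new regions and models) is code you have to own and keep current.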
With re:Invent 2024, Amazon Bedrock provides multiple built-in model inference options to address these scenarios. The managed-service nature of Bedrock also gives you stronger integration with observability tools (CloudWatch, Cost Explorer).
Let us dive deeper into each of these inference options and what it solves.
On-Demand Model Invocation: On-demand invocations use endpoints for specific Amazon Bedrock models hosted within a region. For this inference option, you pass the model id of the model. You are billed by token usage (input and output tokens), and prices differ from model to model. Amazon Bedrock has quotas on tokens per minute (TPM) and requests per minute (RPM) for each model. For more information, refer to https://docs.aws.amazon.com/bedrock/latest/userguide/models-supported.html.
In the example below, we are using the Claude 3 Haiku inference endpoint hosted in the us-west-2 region.
import boto3
import json

bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-west-2",
)

## System message
system_message = [{
    "text": "You are an AI assistant."
}]

## User message
user_message = [{
    "role": "user",
    "content": [{"text": "Tell me about Amazon Bedrock in less than 100 words."}]
}]

## Use the on-demand model id
modelId = 'anthropic.claude-3-haiku-20240307-v1:0'

## Invoke the model
response = bedrock_runtime.converse(
    modelId=modelId,
    system=system_message,
    messages=user_message
)

output_message = response['output']['message']
print(output_message['content'][0]['text'])
Provisioned Throughput: Bedrock Provisioned Throughput allows you to reserve dedicated model capacity for consistent, high-performance inference. You purchase a guaranteed number of tokens per minute and requests per minute for a specific model, defined by Model Units (MUs). You get predictable performance, lower inference latency, and the ability to scale your AI applications at a committed throughput rate. This is particularly valuable for production workloads that require reliable, high-speed model access. For more information on Provisioned Throughput, refer to https://docs.aws.amazon.com/bedrock/latest/userguide/prov-throughput.html
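The invocation pattern mirrors the on-demand example: you pass the ARN of your provisioned model as the modelId. Below is a minimal sketch, assuming you have already purchased Provisioned Throughput; the ARN shown is a placeholder for the one returned when you create it.
## Placeholder ARN of the provisioned model created in your account
provisioned_model_arn = 'arn:aws:bedrock:us-west-2:<Account-ID>:provisioned-model/<provisioned-model-id>'

## System message
system_message = [{
    "text": "You are an AI assistant."
}]

## User message
user_message = [{
    "role": "user",
    "content": [{"text": "Tell me about Amazon Bedrock in less than 100 words."}]
}]

## Invoke the dedicated capacity through the provisioned model ARN
response = bedrock_runtime.converse(
    modelId=provisioned_model_arn,
    system=system_message,
    messages=user_message
)
print(response['output']['message']['content'][0]['text'])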
Cross Region Inference: Allows you to invoke an endpoint that can serve the same model from more than one region. Amazon Bedrock provides pre-built cross-region inference endpoints, increasing your overall capacity (TPM and RPM) for the model. When you invoke a cross-region inference endpoint, requests are routed to the primary region and automatically fail over to a secondary region if you run out of capacity in the primary region.
- Amazon Bedrock takes traffic and capacity into account in real time and makes the routing decision on behalf of customers in a fully managed manner, at no extra cost. This gives you the ability to absorb bursts and spiky traffic patterns.
- Note that these cross-region inference endpoints are organized geographically and can support more than two regions. For example, “us.anthropic.claude-3-sonnet-20240229-v1:0” is the cross-region inference endpoint for the US geography for the Claude 3 Sonnet model. Today it supports automated routing to us-east-1 and us-west-2, providing 2x the capacity of the on-demand model endpoint. As more regions get added, you gain additional capacity without any changes on your end.
- You can find more information on cross-region inference profiles here: https://docs.aws.amazon.com/bedrock/latest/userguide/inference-profiles-support.html
In the code below, we are using 'us.anthropic.claude-3-5-haiku-20241022-v1:0', which is the US geography endpoint for Claude 3.5 Haiku.
bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-west-2",
)

## Use the inference profile id
inference_profile_id = 'us.anthropic.claude-3-5-haiku-20241022-v1:0'

## System message
system_message = [{
    "text": "You are an AI assistant."
}]

## User message
user_message = [{
    "role": "user",
    "content": [{"text": "Tell me about Amazon Bedrock in less than 100 words."}]
}]

## Invoke the cross-region inference endpoint
response = bedrock_runtime.converse(
    modelId=inference_profile_id,
    system=system_message,
    messages=user_message
)

output_message = response['output']['message']
print(output_message['content'][0]['text'])
Latency Optimized On-Demand: Latency-optimized inference endpoints serve models hosted on specialized hardware tuned for low-latency requests, providing faster response times and improved responsiveness without compromising quality. To access the latency-optimization capability, set the “latency” parameter to “optimized” when calling the Bedrock runtime API. Requests are automatically re-routed to the standard endpoint if you run out of capacity on the latency-optimized endpoint.
In the code below, we add the performanceConfig parameter and set it to {'latency': 'optimized'} when invoking the model.
bedrock_runtime = boto3.client(
    "bedrock-runtime",
    region_name="us-east-2",
)

## Use the US cross-region inference profile id for Claude 3.5 Haiku
modelId = 'us.anthropic.claude-3-5-haiku-20241022-v1:0'

## Set the performanceConfig parameter to optimized
latency_optimized = {'latency': 'optimized'}

## System message
system_message = [{
    "text": "You are an AI assistant."
}]

## User message
user_message = [{
    "role": "user",
    "content": [{"text": "Tell me about Amazon Bedrock in less than 100 words."}]
}]

## Invoke the model
response = bedrock_runtime.converse(
    modelId=modelId,
    system=system_message,
    messages=user_message,
    performanceConfig=latency_optimized
)

output_message = response['output']['message']
print(output_message['content'][0]['text'])
SageMaker Model Endpoint from Amazon Bedrock: This inference option is a pass-through to a SageMaker model endpoint. Bedrock hosts roughly 50 models for you that you can access in a serverless pattern. For models that are not offered as a managed hosting option, Bedrock provides a Model Marketplace; you can pick a model from this marketplace and self-host it as a SageMaker endpoint. Alternatively, you can go to the JumpStart model hub within SageMaker Studio and search for Bedrock-ready models. The flow looks like this:
- Deploy the model of choice and create a SageMaker endpoint.
- In the Bedrock console, register the SageMaker endpoint so that Bedrock has a pointer to it.
- Use the Bedrock API / SDK to invoke the SageMaker endpoint the same way you would invoke a Bedrock model endpoint.
## Model ID (the registered SageMaker endpoint ARN)
model_id = 'arn:aws:sagemaker:us-east-1:<Account-ID>:endpoint/jumpstart-dft-hf-llm-amazon-mistral-20241213-014140'

## Prompt
prompt = "Tell me about Amazon Bedrock in less than 100 words."
formatted_prompt = f"<s>[INST] {prompt} [/INST]"

native_request = {
    "inputs": formatted_prompt,
    "max_tokens": 512,
    "temperature": 0.5,
}
request = json.dumps(native_request)

## Invoke the model
response = bedrock_runtime.invoke_model(modelId=model_id, body=request)
model_response = json.loads(response["body"].read())

## Extract and print the response text
print(model_response)
Intelligent Router: Intelligent prompt routing enhances efficiency and cost-effectiveness by routing each request to the foundation model best suited for the task. Prompt routing predicts the performance of each model for each request and dynamically routes the request to the model it predicts is most likely to give the desired response at the lowest cost. At preview, you can choose from two prompt routers: one routes requests between Claude 3.5 Sonnet and Claude 3 Haiku, the other between Llama 3.1 8B and Llama 3.1 70B.
In the example below, we are using the Meta Llama router to route the request. The response metadata tells you which model served the request.
# Set your prompt router ARN
MODEL_ID = "arn:aws:bedrock:us-east-1:924155096146:default-prompt-router/meta.llama:1"

# User message to be processed
user_message = "Tell me about Amazon Bedrock in less than 100 words."

# Prepare messages for the API call
messages = [
    {
        "role": "user",
        "content": [{"text": user_message}]
    }
]

# Invoke the model using the prompt router
streaming_response = bedrock_runtime.converse_stream(
    modelId=MODEL_ID,
    messages=messages,
)

# Process and print the response
for chunk in streaming_response["stream"]:
    if "contentBlockDelta" in chunk:
        text = chunk["contentBlockDelta"]["delta"]["text"]
        print(text, end="")
    if "messageStop" in chunk:
        print()
    if "metadata" in chunk:
        if "trace" in chunk["metadata"]:
            print(json.dumps(chunk['metadata']['trace'], indent=2))
Custom Model Import: Custom Model Import enables you to bring your customized models and use them in a serverless pattern through the Amazon Bedrock API. You can fine-tune your model in Bedrock, Amazon SageMaker, or any other platform, and then import the fine-tuned model weights into Amazon Bedrock. You can access your imported custom models on demand, without needing to manage the underlying infrastructure. Supported model architectures include Llama, Mistral, and Flan. For additional details on Custom Model Import, refer to https://aws.amazon.com/blogs/machine-learning/amazon-bedrock-custom-model-import-now-generally-available/
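Invocation follows the same pattern as the other options: you pass the ARN of the imported model as the modelId. Below is a minimal sketch, assuming a Llama-architecture model has already been imported; the ARN is a placeholder and the request body fields are illustrative, since the body schema follows the architecture of the model you imported.
## Placeholder ARN of the model created by Custom Model Import
imported_model_arn = 'arn:aws:bedrock:us-east-1:<Account-ID>:imported-model/<imported-model-id>'

## Illustrative request body for a Llama-style imported model
native_request = {
    "prompt": "Tell me about Amazon Bedrock in less than 100 words.",
    "max_tokens": 512,
    "temperature": 0.5,
}

## Invoke the imported model on demand
response = bedrock_runtime.invoke_model(
    modelId=imported_model_arn,
    body=json.dumps(native_request)
)
print(json.loads(response["body"].read()))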
Closing Thoughts: Your LLM inference requirements differ based on your workload. It could be standing up a platform, a customer-facing generative AI application, an internal-facing tool, and so on. Irrespective of the type of workload, you can now leverage these Bedrock inference options, and even build custom routers on top of them, to address your requirements.