Gateway API Inference Extension
Gateway API Inference Extension is a new extension of the Kubernetes Gateway API that addresses serving large language models (LLMs) inside Kubernetes, with a particular focus on intelligent load-balancing decisions. Envoy AI Gateway, on the other hand, was initially designed to route traffic to AI providers outside a Kubernetes cluster, serving primarily as an egress solution.
To make Envoy AI Gateway a comprehensive solution for all AI traffic management, we support the Gateway API Inference Extension, which can be used together with the Envoy AI Gateway APIs.
This feature is experimental and currently under active development.
Setup
Before you begin, you'll need to complete the installation guide.
To set up the Gateway API Inference Extension, you need to run Envoy AI Gateway with the --enableInferenceExtension=true flag, which can be done by setting the controller.enableInferenceExtension Helm chart value to true like this:
helm upgrade -i aieg oci://docker.io/envoyproxy/ai-gateway-helm \
--version v0.0.0-latest \
--namespace envoy-ai-gateway-system \
--set controller.enableInferenceExtension=true \
--create-namespace
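After the upgrade, you can optionally confirm that the controller pods are running in the envoy-ai-gateway-system namespace (the release name and namespace above are the ones used throughout this guide; adjust them if yours differ):
kubectl get pods -n envoy-ai-gateway-system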
The Inference Extension is essentially a set of custom resources, so you need to install its CRDs in your cluster. You can do this by running the following command:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.2.0/manifests.yaml
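To verify that the CRDs were installed, you can list them and look for the Inference Extension API group:
kubectl get crds | grep inference.networking.x-k8s.io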
How to use
Configure AIGatewayRoute to use InferencePool
In an AIGatewayRoute resource, you specify a destination backend for each routing rule with AIGatewayRouteRuleBackendRef. The reference defaults to the AIServiceBackend resource type, which is the standard backend resource for Envoy AI Gateway. When the Inference Extension is enabled, you can also specify an InferencePool object as a destination backend.
An inference pool is a collection of endpoints that can serve one or more AI models. In our implementation, you can bundle multiple AIServiceBackend resources that serve the same set of "models" into a single InferencePool resource, and Envoy AI Gateway will intelligently load balance the traffic across the endpoints in the pool.
For example, let's say you have the following rules in your AIGatewayRoute:
rules:
  - matches:
      - headers:
          - type: Exact
            name: x-target-inference-extension
            value: "yes"
    backendRefs:
      # The name of the InferencePool that this route will route to.
      - name: inference-extension-example-pool
        # Explicitly specify the kind of the backend to be InferencePool.
        kind: InferencePool
  - matches:
      - headers:
          - type: Exact
            name: x-target-inference-extension
            value: "no"
    backendRefs:
      # The name of the AIServiceBackend that this route will route to.
      - name: my-ai-service-backend
        # This is optional and defaults to AIServiceBackend.
        # kind: AIServiceBackend
When a request comes in with the header x-target-inference-extension: yes, it will be routed to the InferencePool named inference-extension-example-pool, which eventually routes to one of the AIServiceBackend resources that are part of the pool.
On the other hand, when a request comes in with the header x-target-inference-extension: no, it will be routed to the AIServiceBackend named my-ai-service-backend without going through the InferencePool. That means the request is sent directly to the backend service without any AI-specific load balancing.
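As a quick sanity check, you can send a request with each header value and observe which path it takes. The sketch below makes a few assumptions: the gateway is reachable at $GATEWAY_URL, it exposes the OpenAI-compatible /v1/chat/completions endpoint, and a model named my-model is served by the backends; substitute your own values.
# Routed through the InferencePool (AI-aware load balancing across the pool's endpoints).
curl $GATEWAY_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-target-inference-extension: yes" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'

# Routed directly to the AIServiceBackend, bypassing the InferencePool.
curl $GATEWAY_URL/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "x-target-inference-extension: no" \
  -d '{"model": "my-model", "messages": [{"role": "user", "content": "Hello"}]}'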
Configure InferencePool
An InferencePool is defined to bundle multiple AIServiceBackend resources that can serve the same set of models.
The set of models is defined via the InferenceModel resource, which is part of the Inference Extension. Multiple InferenceModel resources can reference the same InferencePool. Please refer to the Inference Extension documentation for more details.
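For illustration, a minimal InferenceModel referencing the pool used in this example could look like the sketch below. The modelName and criticality values are placeholders rather than part of this example; see the Inference Extension documentation for the full field reference.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inference-extension-example-model
spec:
  # The model name that clients put in their requests.
  modelName: my-model
  # Criticality influences how requests for this model are prioritized.
  criticality: Critical
  # Bind this model to the InferencePool.
  poolRef:
    name: inference-extension-example-pool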
To specify multiple AIServiceBackend resources in an InferencePool, you can use the spec.selector field in the InferencePool resource.
For example,
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: inference-extension-example-pool
spec:
  selector:
    # Select multiple AIServiceBackend objects to bind to the InferencePool.
    app: my-backend
This will select all AIServiceBackend resources with the label app: my-backend and bind them to the InferencePool.
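For the selector to match anything, the AIServiceBackend resources need to carry that label. A minimal sketch of one such backend is shown below; the resource name, schema, and backendRef values are hypothetical and based on a typical OpenAI-compatible backend, so adjust them to your actual setup.
apiVersion: aigateway.envoyproxy.io/v1alpha1
kind: AIServiceBackend
metadata:
  name: my-pooled-backend
  labels:
    # Matches the spec.selector of the InferencePool above.
    app: my-backend
spec:
  # The API schema that the backend speaks.
  schema:
    name: OpenAI
  # Reference to the backend that hosts the model server.
  backendRef:
    name: my-model-server
    kind: Backend
    group: gateway.envoyproxy.io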
What's next?
We have a full example configuration that demonstrates how to use the Inference Extension with Envoy AI Gateway. Feel free to check it out and modify it to suit your needs.