rath.llm#

Provider options, request/response types, OpenAI and Anthropic clients, streaming deltas, embedding/VLM clients, retry, budget accounting, and response normalization.

Source#

Module

Source

rath.llm.provider

src/rath/llm/provider.py

rath.llm.base

src/rath/llm/base.py

rath.llm.registry

src/rath/llm/registry.py

rath.llm.embedding

src/rath/llm/embedding.py

rath.llm.vlm

src/rath/llm/vlm.py

rath.llm.openai.client

src/rath/llm/openai/client.py

rath.llm.anthropic.client

src/rath/llm/anthropic/client.py

rath.llm.chat_request

src/rath/llm/chat_request.py

rath.llm.chat_response

src/rath/llm/chat_response.py

rath.llm.openai.create_kwargs

src/rath/llm/openai/create_kwargs.py

rath.llm.openai.normalize

src/rath/llm/openai/normalize.py

rath.llm.anthropic.create_kwargs

src/rath/llm/anthropic/create_kwargs.py

rath.llm.anthropic.normalize

src/rath/llm/anthropic/normalize.py

Public contract#

Provider#

Provider stores OpenAI-compatible client identity plus model, sampling, tool, and provider-specific parameters required by the loop. It does not contain messages or tools; the session loop constructs those.

Field category

Fields

client identity

api_key, base_url, provider_kind

model

model

sampling

temperature, top_p, max_completion_tokens, max_tokens, stop, n, seed

penalties

frequency_penalty, presence_penalty, logit_bias

tools/output

tool_choice, parallel_tool_calls, response_format

OpenAI options

reasoning_effort, verbosity, metadata, user, store, service_tier, extra_create_args

retry/budget

retry_max_attempts, retry_base_seconds, budget_total_tokens, on_budget_exceeded

Provider.from_config(name=None, **overrides) builds a provider from ~/.openrath/config.json; explicit overrides win over the file.

Client#

from rath.llm import Provider, RathOpenAIChatClient, chat_client_for

provider = Provider(api_key="sk-...", base_url=None, model="gpt-5.5")
client = RathOpenAIChatClient(provider)
response = client.complete(request)

anthropic = Provider(provider_kind="anthropic", model="claude-sonnet-4-5")
client = chat_client_for(anthropic)

chat_client_for(provider) dispatches through the registry. Built-in kinds are OpenAI-compatible (None or "openai") and Anthropic ("anthropic"). Third-party adapters can call register_chat_client(kind, factory).

Provider dispatch registry

Provider.provider_kind selects a registered chat-client factory; new provider kinds integrate at the registry boundary instead of changing the session loop.#

Request and response DTOs#

Type

Description

RathLLMMessage

Chat messages[] element.

RathLLMFunctionTool

Function-style tool schema.

RathLLMChatRequest

OpenAI-compatible request kwargs.

RathLLMChatResponse

Normalized completion response.

RathLLMStreamDelta

Normalized streaming delta.

RathLLMChatChoice

Single choice.

RathLLMAssistantMessage

Assistant message, including tool calls.

RathLLMToolCallPart / RathLLMToolCallFunction

Tool call structure.

RathLLMTokenUsage

Usage statistics.

Embeddings and VLM#

v1.2 adds first-class provider wrappers for non-chat model calls. They use the same config style as Provider, but keep their public surface narrow so memory backends and visual tools do not depend on chat-completion internals.

API

Config key

Default behavior

EmbeddingProvider.from_config(name=None, **overrides)

llm.embedding_provider

Falls back through the configured default chat provider credentials and uses text-embedding-3-small when no embedding model is set.

RathOpenAIEmbeddingClient(provider)

OpenAI-compatible embedding endpoint

Returns embedding vectors for text input.

VLMProvider.from_config(name=None, **overrides)

llm.vlm_provider

Requires an explicit VLM provider entry or overrides.

RathOpenAIVLMClient(provider)

OpenAI-compatible vision/chat endpoint

Sends text plus image inputs through a VLM-compatible model.

Create arguments#

to_create_kwargs(req, default_model=...) converts the internal request to non-streaming OpenAI SDK kwargs. RathOpenAIChatClient.complete_stream(...) uses the streaming sibling and yields RathLLMStreamDelta chunks.

Streaming loop deltas

Streaming forwards deltas to on_event while the session loop still appends one durable assistant chunk per completed model round.#

Behavior

Description

model selection

Uses req.model; otherwise uses default_model. Raises ValueError if both are empty.

tool schema

Converts RathLLMFunctionTool to {"type": "function", "function": ...}.

stream

Non-streaming kwargs force stream=False; streaming kwargs force stream=True.

extra args

Merges req.extra_create_args last.

Environment and config fallback#

Client

Resolution order

OpenAI API key

Provider.api_key → Azure-aware env vars → matching config provider.

OpenAI base URL

Provider.base_urlOPENAI_BASE_URLAZURE_OPENAI_ENDPOINT → config.

OpenAI model

Provider.modelOPENAI_DEFAULT_MODEL → config default provider model.

Anthropic API key

Provider.api_keyANTHROPIC_API_KEY → matching config provider.

Anthropic base URL

Provider.base_urlANTHROPIC_BASE_URL → config.

Anthropic model

Provider.modelANTHROPIC_DEFAULT_MODEL → config provider model.

Legacy Azure endpoints are routed through openai.AzureOpenAI; /openai/v1 endpoints use the standard OpenAI client.

LLM retry, usage, and budget guard flow

Retries, usage aggregation, and budget checks sit around provider calls without changing the public Session and Provider API shape.#

Autodoc#

class rath.llm.Provider(*, base_url: str | None = None, api_key: str | None = None, model: str | None = None, temperature: float | None = None, top_p: float | None = None, max_completion_tokens: int | None = None, max_tokens: int | None = None, stop: str | list[str] | None = None, n: int | None = None, seed: int | None = None, frequency_penalty: float | None = None, presence_penalty: float | None = None, tool_choice: ~typing.Literal['auto', 'none', 'required'] | ~typing.Mapping[str, ~typing.Any] | None = None, parallel_tool_calls: bool | None = None, response_format: dict[str, ~typing.Any] | None = None, logit_bias: dict[str, int] | None = None, logprobs: bool | None = None, top_logprobs: int | None = None, reasoning_effort: str | None = None, verbosity: str | None = None, metadata: dict[str, str] | None = None, user: str | None = None, store: bool | None = None, service_tier: str | None = None, extra_create_args: ~typing.Mapping[str, ~typing.Any] = <factory>, retry_max_attempts: int | None = None, retry_base_seconds: float | None = None, budget_total_tokens: int | None = None, on_budget_exceeded: ~typing.Callable[[...], None] | None = None, provider_kind: ~typing.Literal['openai', 'anthropic'] | None = None)[source]#

LLM routing for run_session_loop (no messages / tools).

base_url, api_key, and model configure the HTTP client built from provider_kind (OpenAI-compatible or Anthropic). Other fields mirror RathLLMChatRequest (excluding what the loop fills in).

api_key may be omitted when callers supply a custom executor that never instantiates a default RathOpenAIChatClient or RathAnthropicChatClient.

classmethod from_config(name: str | None = None, *, store: ConfigStore | None = None, **overrides: Any) Provider[source]#

Build a Provider from ~/.openrath/config.json.

Looks up name (or llm.default_provider when name=None) under llm.providers, then constructs a Provider whose fields come from the entry. Any explicit overrides win — pass e.g. Provider.from_config("openai-main", api_key="ad-hoc") to rotate one field without touching the on-disk file.

Lazy-imports rath.config so that import rath.llm never touches the filesystem.

Raises KeyError when the named provider is missing; the message lists what is available.

class rath.llm.RathOpenAIChatClient(provider: Provider)[source]#

Thin client around openai.OpenAI chat completions (sync + streaming).

Empty Provider.api_key / Provider.base_url fall back to environment variables (set them in the shell or via rath.config):

  • base_url: OPENAI_BASE_URL then AZURE_OPENAI_ENDPOINT.

  • api_key: OPENAI_API_KEY for OpenAI-compatible endpoints; for *.azure.com endpoints the order becomes AZURE_OPENAI_API_KEYAZURE_API_KEYOPENAI_API_KEY.

Azure endpoints exposing the new /openai/v1 surface speak plain OpenAI Chat Completions, so the vanilla SDK is used. Legacy Azure endpoints (/openai without /v1) are routed through openai.AzureOpenAI with api_version taken from OPENAI_API_VERSION (default 2024-10-21).

complete(req: RathLLMChatRequest) RathLLMChatResponse[source]#

Run chat.completions.create and normalize the response.

Transient errors (rate limit, connection, timeout, server 5xx) are retried with exponential backoff per Provider.retry_max_attempts and Provider.retry_base_seconds.

complete_stream(req: RathLLMChatRequest) Iterator[RathLLMStreamDelta][source]#

Yield RathLLMStreamDelta for each chunk of a streaming completion.

Transient errors during the initial create call are retried; once the iterator starts producing chunks, retries are no longer possible (the stream is committed).

class rath.llm.RathAnthropicChatClient(provider: Provider)[source]#

Thin client around anthropic.Anthropic messages API (sync + streaming).

complete(req: RathLLMChatRequest) RathLLMChatResponse[source]#

Run messages.create and normalize the response.

Transient errors are retried per Provider.retry_max_attempts / Provider.retry_base_seconds. The retryable set is the Anthropic-flavored quadruple (RateLimitError, APIConnectionError, APITimeoutError, InternalServerError).

complete_stream(req: RathLLMChatRequest) Iterator[RathLLMStreamDelta][source]#

Yield RathLLMStreamDelta for each event from messages.stream.

Transient errors during the initial stream open are retried; once the iterator starts producing events, retries are no longer possible.

class rath.llm.EmbeddingProvider(*, model: str, base_url: str | None = None, api_key: str | None = None, dimensions: int | None = None, retry_max_attempts: int | None = None, retry_base_seconds: float | None = None)[source]#

Routing + credentials for an OpenAI-compatible embeddings endpoint.

The chat Provider (in rath.llm.provider) is intentionally not reused: embedding endpoints frequently live under a different base_url / model namespace even when the api_key is shared.

dimensions: int | None#

When set, request a truncated/projected embedding vector. The OpenAI SDK passes this as dimensions=. None means use the model’s native dimension.

retry_max_attempts: int | None#

Same retry knobs as Provider; None uses built-in defaults.

classmethod from_config(name: str | None = None, *, store: ConfigStore | None = None, **overrides: Any) EmbeddingProvider[source]#

Build an EmbeddingProvider from ~/.openrath/config.json.

Lookup order:

  1. name if given.

  2. llm.embedding_provider if set.

  3. llm.default_provider (chat fallback) — uses its credentials but replaces model with DEFAULT_EMBEDDING_MODEL since the chat model is unsuitable for embeddings.

Raises KeyError only when name is given explicitly and the entry is missing.

class rath.llm.RathOpenAIEmbeddingClient(provider: EmbeddingProvider)[source]#

Thin wrapper around openai.OpenAI().embeddings.create.

Construct once per EmbeddingProvider; the underlying SDK client is created up-front and reused across calls.

embed(texts: Sequence[str]) tuple[tuple[float, ...], ...][source]#

Embed an arbitrary number of texts; returns one vector per input.

An empty texts short-circuits to () without an API call.

embed_one(text: str) tuple[float, ...][source]#

Convenience for the single-text case.

class rath.llm.VLMProvider(*, model: str, base_url: str | None = None, api_key: str | None = None, max_tokens: int | None = 512, temperature: float | None = None, retry_max_attempts: int | None = None, retry_base_seconds: float | None = None)[source]#

Routing + credentials for an OpenAI-compatible vision endpoint.

classmethod from_config(name: str | None = None, *, store: ConfigStore | None = None, **overrides: Any) VLMProvider[source]#

Build a VLMProvider from ~/.openrath/config.json.

Lookup order:

  1. name if given.

  2. llm.vlm_provider if set.

Unlike EmbeddingProvider, there is no fallback to llm.default_provider: a chat model is rarely a vision model, and silently falling back would produce confusing 400 errors at first use. Raises KeyError instead.

class rath.llm.RathOpenAIVLMClient(provider: VLMProvider)[source]#

Thin wrapper turning (image, prompt) -> caption into a chat call.

describe(image_bytes: bytes, *, prompt: str, mime: str = 'image/png') str[source]#

Send a single image + text prompt; return the model’s reply text.

describe_path(path: Path, *, prompt: str) str[source]#

Load an image from disk and call describe().

class rath.llm.ChatClient(*args, **kwargs)[source]#

Minimal synchronous chat-completion contract.

Implementations must keep complete blocking and side-effect-free beyond the network call itself; retries / token accounting / budget handling are layered above in the session loop.

class rath.llm.StreamingChatClient(*args, **kwargs)[source]#

A ChatClient that also supports streaming completions.

run_session_loop() accepts any object satisfying this Protocol when on_event is provided. Both OpenAI and Anthropic adapters implement it.

rath.llm.chat_client_for(provider: Provider) ChatClient[source]#

Return the ChatClient for provider.provider_kind.

provider.provider_kind=None defaults to "openai". Unknown kinds raise ValueError listing what is currently registered.

rath.llm.register_chat_client(kind: str, factory: Callable[[Provider], ChatClient]) None[source]#

Register factory(provider) -> ChatClient under kind.

Overwrites any previous registration silently — late imports therefore win. Built-in kinds ("openai", "anthropic") are registered when their subpackages are imported by rath.llm.

rath.llm.registered_kinds() tuple[str, ...][source]#

Snapshot of currently registered kinds (useful for diagnostics / tests).

rath.llm.to_create_kwargs(req: RathLLMChatRequest, *, default_model: str | None) dict[str, Any][source]#

Map RathLLMChatRequest to OpenAI.chat.completions.create kwargs.

Non-streaming only: stream is forced to False after extra_create_args are merged. stream=True in extras raises ValueError.

rath.llm.normalize_chat_completion(completion: ChatCompletion) RathLLMChatResponse[source]#

Convert an SDK ChatCompletion into RathLLMChatResponse.

rath.llm.build_anthropic_kwargs(req: RathLLMChatRequest, *, default_model: str | None) dict[str, Any][source]#

Translate RathLLMChatRequest into messages.create kwargs.

default_model mirrors to_create_kwargs(): it’s used when neither the request nor the provider supplies a model name.

rath.llm.build_anthropic_stream_kwargs(req: RathLLMChatRequest, *, default_model: str | None) dict[str, Any][source]#

Same kwargs as build_anthropic_kwargs() for messages.stream.

Anthropic’s messages.stream(**kwargs) uses the same shape as messages.create; there is no stream=True flag. Named entrypoint parallel to rath.llm.openai.create_kwargs.to_create_kwargs_stream().

rath.llm.normalize_anthropic_response(payload: Mapping[str, Any]) RathLLMChatResponse[source]#

Map an Anthropic Message-shaped dict to RathLLMChatResponse.

payload is expected to be the result of message.model_dump(mode='json') on the SDK return value (or an equivalent fixture dict). Defending via dict lookups keeps the adapter compatible across minor SDK upgrades.

class rath.llm.RathLLMChatRequest(*, messages: tuple[~rath.llm.chat_request.RathLLMMessage, ...], model: str | None = None, tools: tuple[~rath.llm.chat_request.RathLLMFunctionTool, ...] | None = None, tool_choice: ~typing.Any | None = None, parallel_tool_calls: bool | None = None, response_format: dict[str, ~typing.Any] | None = None, temperature: float | None = None, top_p: float | None = None, max_completion_tokens: int | None = None, max_tokens: int | None = None, stop: str | list[str] | None = None, n: int | None = None, seed: int | None = None, frequency_penalty: float | None = None, presence_penalty: float | None = None, logit_bias: dict[str, int] | None = None, logprobs: bool | None = None, top_logprobs: int | None = None, reasoning_effort: str | None = None, verbosity: str | None = None, metadata: dict[str, str] | None = None, user: str | None = None, store: bool | None = None, service_tier: str | None = None, extra_create_args: ~typing.Mapping[str, ~typing.Any] = <factory>)[source]#

Maps to keyword arguments passed to the vendor chat API.

model=None falls back to model on the Provider held by the chat client.

class rath.llm.RathLLMMessage(role: str, content: str | None = None, name: str | None = None, tool_call_id: str | None = None, tool_calls: tuple[Mapping[str, Any], ...] | None = None)[source]#

One messages[] element for chat completions.create.

tool_calls is set only for assistant turns in tool-using conversations.

class rath.llm.RathLLMFunctionTool(name: str, parameters: dict[str, Any], description: str | None = None, strict: bool | None = None)[source]#

A function-style tool definition (type: function).

class rath.llm.RathLLMChatResponse(id: str, choices: tuple[RathLLMChatChoice, ...], created: int, model: str, object_type: Literal['chat.completion'] = 'chat.completion', service_tier: str | None = None, system_fingerprint: str | None = None, usage: RathLLMTokenUsage | None = None, raw: Mapping[str, Any] | None = None)[source]#

Normalized non-streaming ChatCompletion.

property primary_choice: RathLLMChatChoice#

The first choice (typical when n is 1).

class rath.llm.RathLLMStreamDelta(content_delta: str | None = None, tool_call_index: int | None = None, tool_call_id: str | None = None, tool_call_name_delta: str | None = None, tool_call_args_delta: str | None = None, finish_reason: Literal['stop', 'length', 'tool_calls', 'content_filter', 'function_call'] | None = None, usage: RathLLMTokenUsage | None = None)[source]#

One chunk emitted by a streaming completion.

Fields are independent and any subset may be populated:

  • content_delta carries an assistant text fragment.

  • tool_call_index / tool_call_id / tool_call_name_delta / tool_call_args_delta extend an in-progress assistant tool_call. Multiple tool calls in one stream are distinguished by tool_call_index.

  • finish_reason is set on the terminal chunk for a choice.

  • usage is populated only on the final stream event (and only when the underlying API agreed to report it, e.g. OpenAI’s stream_options={"include_usage": True}).

class rath.llm.RathLLMChatChoice(index: int, finish_reason: Literal['stop', 'length', 'tool_calls', 'content_filter', 'function_call'], message: RathLLMAssistantMessage, logprobs: Mapping[str, Any] | None = None)[source]#

One element of choices.

class rath.llm.RathLLMAssistantMessage(role: Literal['assistant'] = 'assistant', content: str | None = None, refusal: str | None = None, reasoning_content: str | None = None, tool_calls: tuple[RathLLMToolCallPart, ...] | None = None, function_call: Mapping[str, Any] | None = None, annotations: tuple[Mapping[str, Any], ...] | None = None)[source]#

Assistant message on a choice (content, optional tool calls, provider extras).

class rath.llm.RathLLMToolCallPart(id: str, type: str, function: RathLLMToolCallFunction)[source]#

One entry from message.tool_calls.

class rath.llm.RathLLMToolCallFunction(name: str, arguments: str, arguments_parsed: dict[str, Any] | None, arguments_parse_error: bool)[source]#

function payload inside a tool call (name + arguments string).

class rath.llm.RathLLMTokenUsage(prompt_tokens: int, completion_tokens: int, total_tokens: int, completion_tokens_details: Mapping[str, Any] | None = None, prompt_tokens_details: Mapping[str, Any] | None = None)[source]#

Token counts from usage; optional detail dicts stay JSON-shaped.

rath.llm.add_usage(a: RathLLMTokenUsage | None, b: RathLLMTokenUsage | None) RathLLMTokenUsage | None[source]#

Sum two token usages.

Returns None only when both inputs are None (so callers can detect that no provider in the chain reported usage). Detail dicts are not merged - they are dropped on the accumulated total because per-call breakdowns don’t sum cleanly.

exception rath.llm.BudgetExceededError[source]#

Raised by user code from Provider.on_budget_exceeded to abort a loop.

The session loop itself does not raise this automatically when budget_total_tokens is exceeded — it only invokes the callback (or logs a warning if no callback is set). Raising this from the callback is the documented way to stop the loop on overrun.

← API Reference