Privacy-Routed LLM Inference: Keeping Sensitive Data Out of the Cloud

I spent three hours debugging a “hallucination” in my agent’s daily briefing only to realize the agent wasn’t hallucinating at all. It had simply failed to access my local financial spreadsheets because of a tool denylist I’d configured for security, and instead of admitting it couldn’t see the data, it had tried to “guess” based on a few fragments it had previously cached in a cloud-based session. Even worse, I discovered that a fallback trigger in my orchestration layer had sent a summarized snippet of my private data to a cloud API because the local inference node had a momentary timeout.

If you’re building AI agents that touch real-world data, the “happy path” is usually just a prompt and an API key. The reality is a minefield of data leaks, prompt injections, and silent failures that send your private keys or bank statements to a third-party server because a local GPU pod decided to restart.

This is a problem for anyone running autonomous agents that have read or write access to a local filesystem. If your routing logic is flawed, your privacy isn’t a policy; it’s a coin flip.

The Wrong Way: Trusting the Orchestrator

My first attempt at “privacy” was naive. I used a simple conditional in my agent’s logic: if the query contained words like “bank,” “password,” or “private,” route it to a local Ollama instance. Otherwise, send it to GPT-4o.

This failed immediately for three reasons. First, keyword filtering is a joke. A user (or a prompt injection) can easily bypass “bank” by asking about “financial liquidity instruments.” Second, I assumed the orchestrator was a neutral party. In reality, the orchestrator often handles the context window, meaning the sensitive data is already in the prompt before the routing decision is even made. Third, I had no fail-safe. When the local model timed out, the system defaulted to the cloud provider to ensure “high availability.” In a privacy-first system, unavailability is better than exposure.
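To make the third point concrete, here is a minimal sketch of what “fail closed” could look like in a router. The names run_inference, local_call, cloud_call, and LocalInferenceUnavailable are hypothetical and are not taken from my actual orchestrator; the point is only that the sensitive branch has no path to the cloud.

```python
from typing import Callable


class LocalInferenceUnavailable(RuntimeError):
    """Raised when a sensitive request cannot be served locally."""


def run_inference(
    prompt: str,
    sensitive: bool,
    local_call: Callable[[str], str],  # hypothetical client for the local model
    cloud_call: Callable[[str], str],  # hypothetical client for the cloud model
) -> str:
    """Route a prompt; sensitive prompts fail closed instead of falling back."""
    if not sensitive:
        return cloud_call(prompt)
    try:
        return local_call(prompt)
    except (TimeoutError, ConnectionError) as exc:
        # Deliberately no cloud fallback here: unavailable beats exposed.
        raise LocalInferenceUnavailable("local model unavailable") from exc
```

The caller has to handle LocalInferenceUnavailable explicitly, for example by skipping that section of the briefing, which is the behavior I wanted in the first place.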

I also hit a wall with tool access. I had disabled sandbox.mode to let my agents actually do work, but I quickly found that built-in tools like read and edit can be manipulated to bypass exec allowlists. I saw a specific instance where a prompt injection convinced the agent to use a read-chunk command (a hidden utility in some knowledge base scripts) to dump raw data from a file that should have been summarized first.

The Actual Solution: Two-Tier Privacy Routing

The only way to actually guarantee privacy is to move the routing logic as close to the data as possible and treat the cloud LLM as an untrusted guest. I implemented a two-tier architecture: a local “Privacy Gate” and a reference-only knowledge base.

1. The Reference-Only Knowledge Base

Instead of feeding raw files to the LLM, I use a system where the LLM never sees the original document. I use poppler-utils for PDF extraction and a local embedding model to populate a Qdrant vector store. The agent queries the vector store, but the results are filtered through a local script before being sent to any inference engine.

This prevents the “context stuffing” problem where you send 10k tokens of a private document to a cloud model just to get a three-sentence summary. By keeping the retrieval local, the only thing that ever leaves the cluster is the final, synthesized answer.
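As an illustration of the local filtering step, the sketch below reduces search hits to references before anything is handed to an inference engine. The hit shape (a dict with score and payload), the SAFE_FIELDS allowlist, and the to_references name are assumptions made for this example; they are not the real filter script or the Qdrant client API.

```python
from typing import Any

# Hypothetical allowlist of payload fields that may leave the retrieval step.
SAFE_FIELDS = ("doc_id", "title")


def to_references(hits: list[dict[str, Any]], top_k: int = 3) -> list[dict[str, Any]]:
    """Keep the top_k hits by score and strip everything except safe fields."""
    ranked = sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]
    refs = []
    for hit in ranked:
        payload = hit.get("payload", {})
        # Copy only allowlisted fields; raw chunk text is never copied.
        ref = {name: payload[name] for name in SAFE_FIELDS if name in payload}
        ref["score"] = round(hit["score"], 3)
        refs.append(ref)
    return refs
```

An allowlist is used rather than a denylist so that a new payload field added later is withheld by default.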

2. The Privacy Gate (Routing Layer)

I wrote a wrapper, knowledge.sh, that handles the routing. It doesn’t rely on keywords. It relies on the data source. If the data comes from a “Sensitive” tagged volume in my cluster, the request is hard-pinned to the local GPU node.

Here is a simplified version of how I handle a private query:

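#!/bin/bash
# knowledge.sh query - Local-first routing
set -euo pipefail

QUERY="${1:?usage: knowledge.sh QUERY}"
MODEL="qwen2.5:14b"
# The local endpoint is a dedicated GPU node in my K8s cluster
LOCAL_ENDPOINT="http://ollama-gpu-node.internal/v1/chat/completions"

# Check if the query requires sensitive data access
if [[ "$QUERY" == *"--private"* ]]; then
    # Status goes to stderr so stdout carries only the model response
    echo "Routing to local inference..." >&2
    # Build the JSON with jq so quotes or newlines in the query cannot break the payload
    PAYLOAD=$(jq -n --arg model "$MODEL" --arg content "$QUERY" \
        '{model: $model, messages: [{role: "user", content: $content}], stream: false}')
    # We use a local model and a local endpoint. No cloud fallback:
    # --fail turns HTTP errors into a non-zero exit and set -e stops the script there.
    # The 120-second limit is an arbitrary example value.
    curl --fail --silent --show-error --max-time 120 \
         -X POST "$LOCAL_ENDPOINT" \
         -H "Content-Type: application/json" \
         -d "$PAYLOAD"
else
    # Non-sensitive queries can go to the cloud orchestrator
    ./route-to-cloud.sh "$QUERY"
fi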
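The script above keys on a --private flag. A decision based on the data source itself, as described earlier, could be sketched like this. The /mnt/sensitive mount point and the function names are hypothetical; in the cluster the tag lives on the volume, not in a path constant.

```python
import posixpath

# Hypothetical mount point for the volume tagged "Sensitive".
SENSITIVE_ROOTS = ("/mnt/sensitive",)


def is_sensitive(path: str) -> bool:
    """True if the normalized path is at or below a sensitive root."""
    # normpath collapses ".." segments so "/mnt/public/../sensitive/x" is caught.
    # It does not resolve symlinks; that would need a real filesystem check.
    normalized = posixpath.normpath(path)
    return any(
        normalized == root or normalized.startswith(root + "/")
        for root in SENSITIVE_ROOTS
    )


def pick_route(source_paths: list[str]) -> str:
    """Return "local" if any source is sensitive, otherwise "cloud"."""
    return "local" if any(is_sensitive(p) for p in source_paths) else "cloud"
```

A single sensitive source pins the whole request to the local node, because the sources end up mixed in one context window.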

3. Hardening the Execution

To prevent the “hallucination via missing data” problem, I stopped letting the LLM handle the final delivery of sensitive reports. I use a pattern where the LLM generates a template or a summary, but a local Python script handles the actual data insertion and delivery.
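A minimal sketch of that pattern, assuming the LLM returns a template with $-style placeholders and never sees the values. render_briefing and the field names are made up for this example.

```python
from string import Template


def render_briefing(llm_template: str, local_values: dict[str, str]) -> str:
    """Fill an LLM-written template with values that never left the machine."""
    # substitute() raises KeyError on a missing field, so a gap in the data
    # fails loudly instead of being papered over with a guess.
    return Template(llm_template).substitute(local_values)


if __name__ == "__main__":
    template = "Balance: $balance\nUpcoming bills: $bills"
    print(render_briefing(template, {"balance": "1,204.50 EUR", "bills": "2"}))
```

Using substitute() rather than safe_substitute() is the important choice: a missing value stops the run rather than producing a plausible-looking report.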

For my daily briefings, I use a wrapper script that ensures the data collection is isolated from the cloud inference:

#!/bin/bash
# life-briefing-run.sh
# Stop at the first failure so a failed collection step is never followed by a send
set -euo pipefail

# 1. Collect raw data locally (Private)
./daily-briefing.sh --collect-only

# 2. Format the data using a local script (No LLM involved here)
# This prevents the LLM from accidentally leaking raw data in its output
python3 /opt/scripts/format-and-send-briefing.py

And the Python script handles the delivery via a secure API (like Telegram) without ever sending the raw content to a third-party LLM for “polishing”:

import requests

def send_telegram_message(message):
    # Tokens are managed via SealedSecrets in K8s (values anonymized here)
    bot_token = 'ANONYMIZED_TOKEN'
    chat_id = 'ANONYMIZED_ID'
    url = f'https://api.telegram.org/bot{bot_token}/sendMessage'
    payload = {
        'chat_id': chat_id,
        'text': message,
        'parse_mode': 'Markdown'
    }
    # Time out instead of hanging, and raise on HTTP errors so a failed send is visible
    response = requests.post(url, json=payload, timeout=10)
    response.raise_for_status()

# Load the locally generated briefing
with open('/tmp/daily_briefing.txt', 'r', encoding='utf-8') as f:
    content = f.read()

send_telegram_message(content)

Deep Dive: Why This Architecture Works

The shift from “keyword filtering” to “source-based routing” is the critical change here. In a standard agentic workflow, the agent decides which tool to use. If the agent is running in a cloud-hosted environment, that decision process (and the data retrieved by the tool) is already exposed.

By implementing the routing at the shell/wrapper level, I’ve created a hard separation (a logical one, not a physical air gap) between the sensitive data and the cloud API. The knowledge.sh script acts as a proxy. If the --private flag is present, the request never even reaches the cloud-based orchestrator’s networking stack.

Also, the “Reference-Only” approach addresses the vector database leak. Most people assume that because a vector DB stores embeddings (numbers), it’s private. It’s not. If an attacker (or a malicious prompt) can trigger a read-chunk or a raw metadata dump, they can reconstruct the original text. By removing the read-chunk utility from my knowledge.sh and restricting the Qdrant MCP to only return top-k similarity results, I’ve limited the blast radius.
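The same idea can be expressed as an explicit allowlist in front of the wrapper. This is a sketch under assumptions: validate_request, ALLOWED_COMMANDS, and the MAX_TOP_K cap of 5 are hypothetical and do not describe how knowledge.sh or the Qdrant MCP are actually configured.

```python
# Only similarity search is exposed; read-chunk is deliberately absent.
ALLOWED_COMMANDS = frozenset({"query"})
# Hypothetical upper bound on how many results one request may return.
MAX_TOP_K = 5


def validate_request(command: str, top_k: int) -> int:
    """Reject unknown commands and clamp top_k; returns the effective top_k."""
    if command not in ALLOWED_COMMANDS:
        raise PermissionError(f"command not allowed: {command!r}")
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    return min(top_k, MAX_TOP_K)
```

Clamping top_k matters as much as blocking read-chunk: an unbounded similarity search can be used to page through an entire collection.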

I’ve coupled this with the setup from “Agent Credential Management: Two-Tier Service Accounts” to ensure that the local inference pod has the permissions to read the encrypted Longhorn volumes, while the cloud-facing agent pod has zero access to those keys.

Troubleshooting and Edge Cases

Setting this up isn’t without its headaches. I ran into several issues that aren’t mentioned in the Ollama or Qdrant docs.

The Qdrant Version Mismatch

While implementing the vector search, I hit a frustrating error: Error: Not existing vector name

This happened after updating the Qdrant MCP server without updating the underlying Qdrant instance. The new MCP was attempting to query a vector name that didn’t match the collection schema on the server. The fix was to explicitly define the vector name in the environment variables of the MCP pod rather than relying on the default.

The SessionKey Trap

I spent an afternoon wondering why my cron jobs were still hitting the “main” session instead of the “isolated” one, despite setting a sessionKey. I discovered that sessionKey is merely a routing hint for the orchestrator. If you want actual isolation for sensitive data processing, you must explicitly set --session "isolated".

GPU Memory Deadlocks

When running qwen2.5:14b alongside an embedding model on a single GPU, I encountered what looked like a deadlock: the pod would enter CrashLoopBackOff with an OOM error, but only during the first 5 minutes of a query. This was a result of GPU memory not being released fast enough during a pod restart under the Recreate strategy in Kubernetes. I had to implement a specific preStop hook to ensure the Ollama process was killed cleanly:

lifecycle:
  preStop:
    exec:
      # -TERM is the POSIX spelling; some /bin/sh builds reject -SIGTERM
      command: ["/bin/sh", "-c", "kill -TERM $(pgrep ollama)"]

Operational Hardening

For this to be production-ready in a homelab, you can’t just run a script. I’ve integrated this into a wider GitOps pipeline.

Storage: All sensitive knowledge base data is stored on Longhorn volumes with encryption at rest. This ensures that even if a node is compromised, the raw data remains encrypted.
Secrets: I use SealedSecrets to manage the Telegram bot tokens and Qdrant API keys. No plaintext secrets exist in my Git repo.
Resource Pinning: I use node selectors to ensure the ollama-gpu-node is the only place where the private models run.

nodeSelector:
  hardware-type: gpu-node
  gpu-model: rtx-4090

Lessons Learned

If I had to do this over, I would have started with the “Reference-Only” architecture instead of trying to build a “smart” router. Trying to make an LLM decide if a query is private is a losing battle; the LLM is the very thing you’re trying to protect the data from.

The biggest surprise was how much “slop” exists in agent frameworks regarding tool safety. The fact that read-chunk existed as a hidden command in a knowledge base script is a reminder that you cannot trust third-party wrappers with sensitive data.

The tradeoff here is latency. Local inference on a 14B model is slower than GPT-4o. But when the alternative is leaking your financial history to a cloud provider, a 5-second delay is a price I’m happy to pay. If you need more speed, you can look into “Building Karpathy’s LLM Wiki” for ways to optimize local model deployment.

For those looking to implement similar privacy-first AI agents or secure infrastructure, I offer consulting on AI agent orchestration and predictive maintenance to help bridge the gap between a demo and a secure production system.