# LLM Proxy Service

**Port:** 7007
**Purpose:** Multi-provider LLM gateway for DAARION agents

## Features

✅ **Multi-provider support:**
- OpenAI (GPT-4, GPT-4-turbo, etc.)
- DeepSeek (DeepSeek-R1)
- Local LLMs (Ollama, vLLM, llama.cpp)

✅ **Model routing:**
- Logical model names → physical provider models
- Config-driven routing (`config.yaml`)

✅ **Usage tracking:**
- Token counting
- Latency monitoring
- Cost estimation
- Per-agent/microDAO tracking

✅ **Rate limiting:**
- Per-agent limits (10 req/min default)
- In-memory (Phase 3), Redis-backed (Phase 4)

✅ **Security:**
- Internal-only API (`X-Internal-Secret` header)
- API key management via env vars

## API

### POST /internal/llm/proxy

**Request:**

```json
{
  "model": "gpt-4.1-mini",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"}
  ],
  "max_tokens": 100,
  "temperature": 0.7,
  "metadata": {
    "agent_id": "agent:sofia",
    "microdao_id": "microdao:7",
    "channel_id": "channel-uuid"
  }
}
```

**Response:**

```json
{
  "content": "Hello! How can I help you today?",
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 10,
    "total_tokens": 35
  },
  "provider": "openai",
  "model_resolved": "gpt-4-1106-preview",
  "latency_ms": 1234.5
}
```

### GET /internal/llm/models

List available models:

```json
{
  "models": [
    {
      "name": "gpt-4.1-mini",
      "provider": "openai",
      "physical_name": "gpt-4-1106-preview",
      "max_tokens": 4096
    },
    ...
  ]
}
```

### GET /internal/llm/usage?agent_id=agent:sofia

Get usage statistics:

```json
{
  "total_requests": 42,
  "total_tokens": 12345,
  "avg_latency_ms": 987.6,
  "success_rate": 0.98
}
```

## Configuration

Edit `config.yaml`:

```yaml
providers:
  openai:
    base_url: "https://api.openai.com/v1"
    api_key_env: "OPENAI_API_KEY"
  local:
    base_url: "http://localhost:11434"

models:
  gpt-4.1-mini:
    provider: "openai"
    physical_name: "gpt-4-1106-preview"
    cost_per_1k_prompt: 0.01
    cost_per_1k_completion: 0.03
```

## Environment Variables

```bash
OPENAI_API_KEY=sk-...              # OpenAI API key
DEEPSEEK_API_KEY=sk-...            # DeepSeek API key
LLM_PROXY_SECRET=dev-secret-token  # Internal auth token
```

## Setup

### Local Development

```bash
cd services/llm-proxy

# Install dependencies
pip install -r requirements.txt

# Set API keys
export OPENAI_API_KEY="sk-..."

# Run
python main.py
```

### Docker

```bash
docker build -t llm-proxy .
docker run -p 7007:7007 \
  -e OPENAI_API_KEY="sk-..." \
  llm-proxy
```

### With docker-compose

```bash
docker-compose -f docker-compose.phase3.yml up llm-proxy
```

## Testing

### Test OpenAI

```bash
curl -X POST http://localhost:7007/internal/llm/proxy \
  -H "Content-Type: application/json" \
  -H "X-Internal-Secret: dev-secret-token" \
  -d '{
    "model": "gpt-4.1-mini",
    "messages": [
      {"role": "user", "content": "Say hello!"}
    ],
    "metadata": {
      "agent_id": "agent:test"
    }
  }'
```

### Test Local LLM

```bash
# Start Ollama (in a separate terminal)
ollama serve
ollama pull qwen2.5:8b

# Test
curl -X POST http://localhost:7007/internal/llm/proxy \
  -H "Content-Type: application/json" \
  -H "X-Internal-Secret: dev-secret-token" \
  -d '{
    "model": "dagi-local-8b",
    "messages": [
      {"role": "user", "content": "Test"}
    ]
  }'
```

## Adding New Providers

1. Create `providers/my_provider.py` (a fuller sketch follows this list):

   ```python
   class MyProvider:
       def __init__(self, config: ProviderConfig):
           self.config = config

       async def chat(self, messages, model_name, **kwargs) -> LLMResponse:
           # Implement provider logic
           ...
   ```

2. Register it in `config.yaml`:

   ```yaml
   providers:
     my_provider:
       base_url: "https://api.myprovider.com"
       api_key_env: "MY_PROVIDER_KEY"

   models:
     my-model:
       provider: "my_provider"
       physical_name: "my-model-v1"
   ```

3. Initialize it in `main.py`:

   ```python
   from providers.my_provider import MyProvider

   providers["my_provider"] = MyProvider(provider_config)
   ```
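For reference, here is a minimal, self-contained sketch of what a fleshed-out provider might look like for an OpenAI-compatible chat API. The `ProviderConfig` and `LLMResponse` definitions below are assumptions that mirror the shapes used elsewhere in this README (`base_url`/`api_key_env` from `config.yaml`, the response fields from `POST /internal/llm/proxy`); substitute the service's actual types from the llm-proxy codebase.

```python
# Hypothetical fleshed-out provider for an OpenAI-compatible chat API.
# ProviderConfig and LLMResponse are assumed stand-ins for the service's
# real types; adjust field names to match the actual code.
import os
import time
from dataclasses import dataclass, field

import httpx


@dataclass
class ProviderConfig:
    base_url: str
    api_key_env: str = ""


@dataclass
class LLMResponse:
    content: str
    usage: dict = field(default_factory=dict)
    provider: str = ""
    model_resolved: str = ""
    latency_ms: float = 0.0


class MyProvider:
    def __init__(self, config: ProviderConfig):
        self.config = config
        # Resolve the API key from the environment, as with OPENAI_API_KEY.
        self.api_key = os.environ.get(config.api_key_env, "")

    async def chat(self, messages, model_name, **kwargs) -> LLMResponse:
        headers = {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {}
        payload = {"model": model_name, "messages": messages, **kwargs}
        started = time.perf_counter()
        async with httpx.AsyncClient(base_url=self.config.base_url, timeout=60.0) as client:
            response = await client.post("/chat/completions", headers=headers, json=payload)
            response.raise_for_status()
            data = response.json()
        return LLMResponse(
            content=data["choices"][0]["message"]["content"],
            usage=data.get("usage", {}),
            provider="my_provider",
            model_resolved=model_name,
            latency_ms=(time.perf_counter() - started) * 1000,
        )
```

Resolving the key via `api_key_env` keeps secrets out of `config.yaml`, matching how the OpenAI provider is configured above.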
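The logical → physical routing driven by `config.yaml` (see Configuration above) amounts to a small lookup table. The sketch below is illustrative only; `ModelRoute` and `load_routing` are hypothetical names, not part of the service's actual code.

```python
# Illustrative sketch of config-driven model routing (hypothetical names).
from dataclasses import dataclass

import yaml  # pip install pyyaml


@dataclass
class ModelRoute:
    provider: str
    physical_name: str


def load_routing(path: str = "config.yaml") -> dict[str, ModelRoute]:
    """Build a logical-name -> route table from the `models:` section."""
    with open(path) as f:
        config = yaml.safe_load(f)
    return {
        logical: ModelRoute(provider=m["provider"], physical_name=m["physical_name"])
        for logical, m in config.get("models", {}).items()
    }


# Example: resolving the logical model name from the request body above.
routes = load_routing()
route = routes["gpt-4.1-mini"]  # provider "openai", physical "gpt-4-1106-preview"
```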
## Integration with agent-runtime

In `agent-runtime`, call the proxy over the internal network (the secret comes from `LLM_PROXY_SECRET`):

```python
import os

import httpx


async def call_llm(agent_blueprint, messages):
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://llm-proxy:7007/internal/llm/proxy",
            headers={
                "X-Internal-Secret": os.environ.get("LLM_PROXY_SECRET", "dev-secret-token")
            },
            json={
                "model": agent_blueprint.llm_model,
                "messages": messages,
                "metadata": {
                    "agent_id": agent_blueprint.id,
                    "microdao_id": agent_blueprint.microdao_id
                }
            },
            timeout=60.0
        )
        response.raise_for_status()
        return response.json()
```

## Roadmap

### Phase 3 (Current)
- ✅ Multi-provider support
- ✅ Basic rate limiting
- ✅ Usage logging
- ✅ OpenAI + DeepSeek + Local

### Phase 3.5
- 🔜 Streaming responses
- 🔜 Response caching
- 🔜 Function calling support
- 🔜 Redis-backed rate limiting

### Phase 4
- 🔜 Database-backed usage logs
- 🔜 Cost analytics
- 🔜 Billing integration
- 🔜 Advanced routing (fallbacks, load balancing)

## Troubleshooting

**Provider not working?**

```bash
# Check API key
docker logs llm-proxy | grep "api_key"

# Test directly
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```

**Rate limit issues?**

```bash
# Check current usage for an agent
curl "http://localhost:7007/internal/llm/usage?agent_id=agent:sofia" \
  -H "X-Internal-Secret: dev-secret-token"
```

**Local LLM not responding?**

```bash
# Check Ollama
curl http://localhost:11434/api/version

# Check logs
docker logs llm-proxy | grep "local"
```

## Architecture

```
agent-runtime
  ↓ POST /internal/llm/proxy
  ↓
llm-proxy:
  ├─ Rate limiter      (check agent quota)
  ├─ Model router      (logical → physical)
  ├─ Provider selector (OpenAI/DeepSeek/Local)
  └─ Usage tracker     (log tokens, cost, latency)
  ↓
[OpenAI API | DeepSeek API | Local Ollama]
  ↓
Response → agent-runtime
```

## License

Internal DAARION service

---

**Status:** ✅ Phase 3 Ready
**Version:** 1.0.0
**Last Updated:** 2025-11-24