Silver Owl

Research

How we think about building with AI.

Six principles, a stack we can defend, and a short list of things we refuse to do — all field-tested across finance, operations, IoT, and enterprise advisory. Not theory. This is what we've learned shipping real systems.

Engineering principles

Six things we've learned the hard way. Each one came from watching something break in production or watching a client waste six figures on the wrong approach.

01

Start with the workflow. The model is a detail.

Every failed AI project we've seen started by picking a model and working backward. The right question is: what decision is a human making today that takes too long, costs too much, or produces inconsistent results? Map that workflow, find the choke point, then pick the cheapest model that clears the bar. We've shipped features with Claude Haiku that clients assumed required Opus — because the task was actually narrow and the prompt was tight.
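One way to make "the cheapest model that clears the bar" concrete is to score each candidate on the task's eval set and take the lowest-cost one above the threshold. The sketch below is illustrative only: the `Candidate` shape, tier names, prices, and scores are all assumptions, not figures from a real project.

```typescript
// Hypothetical candidate record; names, prices, and scores are made up.
interface Candidate {
  name: string;
  costPerMTokUsd: number; // assumed blended cost per million tokens
  evalScore: number; // fraction of eval cases passed, 0..1
}

// Return the cheapest candidate whose eval score meets the bar, if any.
function cheapestThatClears(candidates: Candidate[], bar: number): Candidate | undefined {
  return candidates
    .filter((c) => c.evalScore >= bar)
    .sort((a, b) => a.costPerMTokUsd - b.costPerMTokUsd)[0];
}

const candidates: Candidate[] = [
  { name: "small", costPerMTokUsd: 1, evalScore: 0.93 },
  { name: "medium", costPerMTokUsd: 6, evalScore: 0.96 },
  { name: "large", costPerMTokUsd: 30, evalScore: 0.97 },
];

const pick = cheapestThatClears(candidates, 0.92);
console.log(pick?.name); // "small"
```

The bar comes from the workflow (how often can this step be wrong before a human notices?), which is why it has to be set before the model is chosen.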

02

RAG beats fine-tuning for almost every business problem.

Fine-tuned models are expensive to train, painful to version, and stale within six months when your data changes. For knowledge-intensive tasks (policy lookups, contract Q&A, product documentation search, internal procedures), a well-built retrieval pipeline with pgvector and a strong base model will outperform a fine-tuned snapshot nine times out of ten. We default to RAG. The exceptions are narrow: style transfer at scale, structured classification with thousands of labeled examples, or latency requirements that rule out a retrieval round-trip.
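The retrieval step itself is small. Here is a minimal in-memory sketch of it, assuming toy three-dimensional embeddings and a hypothetical `Chunk` shape; a production pipeline would get embeddings from an embedding model and run the ranking inside the database rather than in application code.

```typescript
// Hypothetical document chunk; real embeddings would have far more dimensions.
interface Chunk {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two equal-length vectors (0 if either is all zeros).
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Rank chunks by similarity to the query embedding and keep the best k.
function topK(query: number[], chunks: Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}

// Assemble the retrieved text into the prompt sent to the base model.
function buildPrompt(question: string, context: Chunk[]): string {
  const sources = context.map((c) => `[${c.id}] ${c.text}`).join("\n");
  return `Answer using only these sources:\n${sources}\n\nQuestion: ${question}`;
}

const corpus: Chunk[] = [
  { id: "refunds", text: "Refunds are issued within 14 days.", embedding: [1, 0, 0] },
  { id: "shipping", text: "Orders ship in 2 business days.", embedding: [0, 1, 0] },
  { id: "warranty", text: "Hardware carries a 1-year warranty.", embedding: [0.7, 0.7, 0] },
];

const hits = topK([0.9, 0.1, 0], corpus, 2);
console.log(buildPrompt("How long do refunds take?", hits));
```

When the policy changes, only the stored chunk changes; nothing is retrained, which is the freshness argument in code form.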

03

Multi-agent systems fail when agents hallucinate about each other.

Emergent orchestration sounds elegant. In practice it's a debugging nightmare. Every agent we build has a typed input schema, a typed output schema, a single narrow responsibility, and explicit handoff logic. Orchestration is code, not conversation. When a 22-agent system like EAS/Veridia has a failure, we can isolate exactly which agent produced the bad output and why — because the contract between agents is explicit, not inferred. If you can't unit-test an agent in isolation, your architecture has a problem.
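A minimal sketch of what "typed input, typed output, explicit handoff" can look like. The `Agent` interface, the `riskAgent` example, and the hand-rolled validation are assumptions for illustration; a real agent would call a model asynchronously and might use a schema library instead.

```typescript
// Hypothetical agent contract: one input type, one output type, one job.
interface Agent<I, O> {
  readonly name: string;
  parseInput(raw: unknown): I; // throws if raw violates the input schema
  run(input: I): O; // a real agent would call a model here
}

interface Invoice {
  amountUsd: number;
}

interface RiskReport {
  amountUsd: number;
  risk: "low" | "high";
}

const riskAgent: Agent<Invoice, RiskReport> = {
  name: "risk",
  parseInput(raw) {
    if (typeof raw === "object" && raw !== null && "amountUsd" in raw) {
      const amount = (raw as { amountUsd: unknown }).amountUsd;
      if (typeof amount === "number" && Number.isFinite(amount)) {
        return { amountUsd: amount };
      }
    }
    throw new Error("expected { amountUsd: number }");
  },
  run(input) {
    return { amountUsd: input.amountUsd, risk: input.amountUsd > 10000 ? "high" : "low" };
  },
};

type StepResult<O> =
  | { ok: true; agent: string; output: O }
  | { ok: false; agent: string; error: string };

// Orchestration is plain code: every failure is attributed to a named agent.
function runStep<I, O>(agent: Agent<I, O>, raw: unknown): StepResult<O> {
  try {
    return { ok: true, agent: agent.name, output: agent.run(agent.parseInput(raw)) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, agent: agent.name, error: message };
  }
}

console.log(runStep(riskAgent, { amountUsd: 50000 }));
console.log(runStep(riskAgent, { amount: "a lot" }));
```

Because `riskAgent` takes plain data in and returns plain data out, it can be unit-tested without any other agent running.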

04

Set your latency budget before writing a line of inference code.

We've watched teams spend weeks optimizing an AI feature that users had already stopped using because it was too slow. Latency is a product constraint, not an afterthought. Before any inference work starts, we set a target: under 800ms for autocomplete paths, under 2s for interactive responses, under 10s for background analysis where we can show a loading state. That budget determines the model tier, whether to stream, whether to cache, and whether to parallelize. Change the budget later and you rebuild the architecture.
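As a rough sketch of how a budget can drive the other choices, the snippet below encodes the three targets above and picks a model tier from them. The `ModelTier` shape, the p95 figures, and the streaming rule are assumptions for illustration.

```typescript
type Path = "autocomplete" | "interactive" | "background";

// Budgets from the text above, in milliseconds.
const BUDGET_MS: Record<Path, number> = {
  autocomplete: 800,
  interactive: 2000,
  background: 10000,
};

// Hypothetical tier with a measured p95 latency; the numbers below are made up.
interface ModelTier {
  name: string;
  p95LatencyMs: number;
}

interface Plan {
  budgetMs: number;
  tier: ModelTier | undefined; // undefined means no tier fits the budget
  stream: boolean;
}

// Tiers are assumed to be ordered from most to least capable.
function planFor(path: Path, tiers: ModelTier[]): Plan {
  const budgetMs = BUDGET_MS[path];
  const tier = tiers.find((t) => t.p95LatencyMs <= budgetMs);
  // Assumption: stream whenever a person is watching the response arrive.
  const stream = path !== "background";
  return { budgetMs, tier, stream };
}

const tiers: ModelTier[] = [
  { name: "large", p95LatencyMs: 4500 },
  { name: "mid", p95LatencyMs: 1500 },
  { name: "fast", p95LatencyMs: 600 },
];

console.log(planFor("autocomplete", tiers).tier?.name); // "fast"
console.log(planFor("background", tiers).tier?.name); // "large"
```

Moving a feature from `background` to `interactive` changes the selected tier, which is the "change the budget and you rebuild" effect in miniature.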

05

Human-in-the-loop is architecture, not apology.

The goal isn't to remove humans — it's to make the decisions humans make faster, better-informed, and less exhausting. We design escalation paths as first-class features: what confidence threshold triggers a human review, what context does the AI pass along when it escalates, and how does the human's correction feed back into the system? A CRM that auto-qualifies leads but flags the ambiguous 15% for a 10-second human review is more useful than one that tries to classify everything and gets 25% wrong silently.
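The three questions above map onto a small amount of code. This is a sketch under assumptions: the `LeadScore` shape, the 0.85 threshold, and the in-memory `corrections` list are hypothetical stand-ins for whatever the real system stores.

```typescript
interface LeadScore {
  leadId: string;
  label: "qualified" | "unqualified";
  confidence: number; // 0..1, as reported by the classifier
  evidence: string[]; // context a reviewer needs to decide quickly
}

type Decision =
  | { kind: "auto"; leadId: string; label: LeadScore["label"] }
  | { kind: "review"; leadId: string; suggested: LeadScore["label"]; context: string[] };

// Hypothetical threshold; in practice it would be tuned against an eval set.
const REVIEW_THRESHOLD = 0.85;

// Confident results pass through; ambiguous ones escalate with their evidence.
function route(score: LeadScore, threshold: number = REVIEW_THRESHOLD): Decision {
  if (score.confidence >= threshold) {
    return { kind: "auto", leadId: score.leadId, label: score.label };
  }
  return { kind: "review", leadId: score.leadId, suggested: score.label, context: score.evidence };
}

// A reviewer's correction, kept so it can be folded into the eval set later.
interface Correction {
  leadId: string;
  suggested: LeadScore["label"];
  final: LeadScore["label"];
}

const corrections: Correction[] = [];

function recordCorrection(
  decision: Extract<Decision, { kind: "review" }>,
  final: LeadScore["label"],
): void {
  corrections.push({ leadId: decision.leadId, suggested: decision.suggested, final });
}

const ambiguous = route({
  leadId: "lead-42",
  label: "qualified",
  confidence: 0.61,
  evidence: ["Asked about pricing", "No budget mentioned"],
});
console.log(ambiguous.kind); // "review"
```

The escalation carries the evidence with it, so the reviewer does not have to reconstruct why the system was unsure.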

06

Evals are the only honest measure of whether a prompt change helped.

We maintain a golden test set for every AI feature we ship — a set of real or representative inputs with expected outputs, scored by both automated metrics and human judgment where needed. Before any model upgrade, prompt revision, or retrieval change goes to production, it runs against the eval suite. "It felt better in my three manual tests" is not a deployment criterion. A measured improvement on 200 real cases is. This is the single practice that separates teams that regress in production from teams that don't.
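A lightweight version of that gate fits in a few lines. The sketch below assumes exact-match scoring and a hypothetical four-case golden set; real suites are larger and often mix automated metrics with human review.

```typescript
interface GoldenCase {
  input: string;
  expected: string;
}

// Stand-in for "prompt + model + retrieval"; a real system would be async.
type System = (input: string) => string;

// Fraction of golden cases where the system's output matches exactly.
function passRate(cases: GoldenCase[], system: System): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => system(c.input) === c.expected).length;
  return passed / cases.length;
}

// Deploy only if the candidate beats the baseline by more than minGain.
function shouldDeploy(
  cases: GoldenCase[],
  baseline: System,
  candidate: System,
  minGain: number = 0,
): boolean {
  return passRate(cases, candidate) - passRate(cases, baseline) > minGain;
}

const golden: GoldenCase[] = [
  { input: "refund my card", expected: "billing" },
  { input: "app crashes on login", expected: "bug" },
  { input: "charge appeared twice", expected: "billing" },
  { input: "button does nothing", expected: "bug" },
];

const baseline: System = () => "billing";
const candidate: System = (input) => (/refund|charge/.test(input) ? "billing" : "bug");

console.log(passRate(golden, baseline)); // 0.5
console.log(shouldDeploy(golden, baseline, candidate)); // true
```

The same function also catches regressions: swap the arguments and `shouldDeploy` returns false.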

What we've ruled out

Deliberate constraints are engineering decisions. Here are four things we don't do, and why.

Not us

Single-model lock-in

We've never built a system that hard-codes one provider. Model routing at the orchestration layer costs a day of work and saves you from a bad week when a provider has an outage or doubles its prices. Every system we ship can swap the underlying model without touching product code.
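A sketch of the routing idea: product code asks for a role, and a route table decides which provider answers. The `ModelClient` interface and the stub providers are hypothetical, and the calls are synchronous only to keep the example short; real provider SDK calls are asynchronous.

```typescript
type Role = "reasoning" | "fast";

interface ModelClient {
  id: string;
  complete(prompt: string): string; // may throw on outage or rate limit
}

type RouteTable = Record<Role, ModelClient[]>;

interface Completion {
  modelId: string;
  text: string;
}

// Try each client configured for the role, in order, until one succeeds.
function completeWithFallback(routes: RouteTable, role: Role, prompt: string): Completion {
  const failures: string[] = [];
  for (const client of routes[role]) {
    try {
      return { modelId: client.id, text: client.complete(prompt) };
    } catch (err) {
      failures.push(`${client.id}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error(`no provider available for "${role}" (${failures.join("; ")})`);
}

// Stub clients standing in for real provider SDK calls.
const primary: ModelClient = {
  id: "provider-a",
  complete(): string {
    throw new Error("503 from provider");
  },
};

const backup: ModelClient = {
  id: "provider-b",
  complete(prompt) {
    return `echo: ${prompt}`;
  },
};

const routes: RouteTable = { reasoning: [primary, backup], fast: [backup] };

console.log(completeWithFallback(routes, "reasoning", "hello").modelId); // "provider-b"
```

Swapping a provider means editing `routes`; callers that ask for `"reasoning"` never change.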

Not us

Fine-tuning when RAG works

Fine-tuning is the right tool for a narrow set of problems. It is not the right tool for "make the model know about our product." We've seen clients burn $40k on a fine-tuning project that a $200/month RAG pipeline would have solved with better latency and fresher data.

Not us

GPT wrappers without evals

Wrapping an LLM API and calling it a product is fast to build and fast to break. If there is no evaluation harness, no regression test, and no monitoring for output quality, you will regress in production and not know it until a user complains. We don't ship without at least a lightweight eval suite.

Not us

Prompt engineering as a substitute for architecture

A clever system prompt can paper over a bad architecture for a few weeks. It will not hold. When a 2,000-token system prompt is doing structural work that belongs in code (routing, validation, error handling), it's a sign the system wasn't designed; it was prompted into existence. We invest in architecture first.

Our AI stack

What we reach for and why. Each choice has a reason — not inertia, not trend-following.

Primary LLMs

Claude 3.5 Sonnet · Claude 3 Haiku · GPT-4o

Claude 3.5 Sonnet for reasoning-heavy tasks and long context; Claude 3 Haiku for latency-sensitive paths where cost per token matters; GPT-4o as a fallback and for vision tasks.

Embeddings

text-embedding-3-small (OpenAI)

1536-dim, cheap, fast, and consistently strong on semantic retrieval benchmarks. We haven't found a reason to switch.

Vector store

pgvector on Supabase

Keeps embeddings co-located with relational data. No separate vector DB to operate. HNSW indexes cover our scale.
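For readers who haven't used pgvector, the setup looks roughly like this. The table and column names are hypothetical; the SQL uses pgvector's `vector` type, `hnsw` index method, and `<=>` cosine-distance operator, wrapped in TypeScript constants for use with any Postgres client.

```typescript
// Hypothetical schema: the table and column names are illustrative.
const SETUP_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
  id bigserial PRIMARY KEY,
  customer_id bigint NOT NULL,      -- relational data lives beside the vector
  content text NOT NULL,
  embedding vector(1536) NOT NULL   -- dimension of the embedding model in use
);

CREATE INDEX IF NOT EXISTS documents_embedding_hnsw
  ON documents USING hnsw (embedding vector_cosine_ops);
`;

// <=> is pgvector's cosine distance operator; smaller means more similar.
const SEARCH_SQL = `
SELECT id, content
FROM documents
WHERE customer_id = $1
ORDER BY embedding <=> $2::vector
LIMIT $3;
`;

// pgvector accepts vectors as text literals such as '[0.1,0.2,0.3]'.
function toVectorLiteral(embedding: number[]): string {
  if (!embedding.every((x) => Number.isFinite(x))) {
    throw new Error("embedding must contain only finite numbers");
  }
  return `[${embedding.join(",")}]`;
}

console.log(toVectorLiteral([0.1, 0.2, 3])); // "[0.1,0.2,3]"
```

The `WHERE customer_id = $1` clause is the co-location benefit: an ordinary relational filter and the similarity search run in the same query.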

Orchestration

Custom typed agent loops · Vercel AI SDK

Vercel AI SDK for streaming UI and standard chat patterns. Custom loops for multi-agent systems where we need explicit contracts and debuggable handoffs.

Storage

Supabase Postgres · Vercel Blob

Supabase for structured data, RLS policies, and auth. Blob for binary assets. Single vendor, fewer moving parts.

Compute

Vercel (web + serverless) · VPS (persistent agents + cron)

Vercel for anything request-driven. VPS for agents that need to run continuously, long-running jobs, and scheduled tasks that can't tolerate cold-start latency.

Local inference

Ollama · qwen2.5-coder:14b · llama3.1

Free tier for dev tooling, code review agents, and jobs that don't need cloud-class output. Runs on WSL2 and hal9000 with zero per-token cost.

Frontend

Next.js 15 · TypeScript · Tailwind · shadcn/ui

App Router for streaming RSC. TypeScript strict mode, no exceptions. shadcn for accessible components without a design-system rebuild on every project.

From the work

Three architecture problems from products we've shipped. The interesting part is usually the constraint, not the model.

EAS / Veridia

Persona isolation in a 22-agent advisory system

Each of the 22 advisors in EAS needs a distinct, stable point of view. A CFO and a CISO will have genuinely different risk tolerances on the same question. We solved this with role-scoped system prompts, separate memory namespaces per advisor, and an explicit cross-agent escalation protocol — so the CEO advisor can call the CISO advisor for input without either contaminating the other's base persona. The hard part wasn't prompting; it was the typed handoff contract.
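A stripped-down sketch of that isolation pattern follows. It is not the EAS implementation: the `Advisor`, `ConsultRequest`, and `ConsultReply` shapes, the in-memory `Map`, and the stub answer function are all assumptions chosen to show the idea.

```typescript
interface Advisor {
  role: string;
  systemPrompt: string; // role-scoped; never merged with another advisor's prompt
}

interface ConsultRequest {
  from: string;
  to: string;
  question: string;
}

interface ConsultReply {
  from: string;
  to: string;
  answer: string;
}

// One memory namespace per advisor role.
const memory = new Map<string, string[]>();

function remember(role: string, note: string): void {
  const notes = memory.get(role) ?? [];
  notes.push(note);
  memory.set(role, notes);
}

function recall(role: string): readonly string[] {
  return memory.get(role) ?? [];
}

// Stand-in for a model call: it sees one advisor's prompt and memory, nothing else.
type Answerer = (advisor: Advisor, question: string, notes: readonly string[]) => string;

function consult(
  req: ConsultRequest,
  advisors: Map<string, Advisor>,
  answer: Answerer,
): ConsultReply {
  const target = advisors.get(req.to);
  if (!target) throw new Error(`unknown advisor: ${req.to}`);
  const text = answer(target, req.question, recall(req.to));
  // Only the caller's namespace records the exchange; the target's stays untouched.
  remember(req.from, `${req.to} said: ${text}`);
  return { from: req.to, to: req.from, answer: text };
}

const advisors = new Map<string, Advisor>([
  ["ceo", { role: "ceo", systemPrompt: "You weigh growth against risk." }],
  ["ciso", { role: "ciso", systemPrompt: "You minimize security exposure." }],
]);

remember("ciso", "Prefers least-privilege access.");

const stubAnswer: Answerer = (advisor, question, notes) =>
  `${advisor.role} view on "${question}" given ${notes.length} note(s)`;

const reply = consult({ from: "ceo", to: "ciso", question: "Adopt vendor X?" }, advisors, stubAnswer);
console.log(reply.answer);
```

The consulted advisor never sees the caller's prompt or memory, and its own memory is unchanged by being consulted; the only thing that crosses the boundary is the typed reply.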


APEX Terminal

Sub-2s signal generation on live options flow

Streaming real-time options market data into an LLM inference pipeline without blowing latency budgets required a layered caching strategy: pre-computed sector summaries refreshed at 60-second intervals, a thin normalization layer to convert raw tick data into model-ready context, and streaming output so the UI renders before inference completes. We hit p95 under 1.8s on heavy-volume days with the live feed.
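The summary-cache layer can be approximated with a time-to-live wrapper. Note the simplification: this sketch refreshes lazily on read once 60 seconds have passed, rather than on a background timer, and the `refreshEvery` helper and fake clock are hypothetical names used only for this example.

```typescript
// Wrap an expensive computation so it reruns at most once per ttlMs.
// The clock is injectable so the behaviour can be checked deterministically.
function refreshEvery<T>(
  ttlMs: number,
  compute: () => T,
  now: () => number = Date.now,
): () => T {
  let value: T | undefined;
  let computedAt = 0;
  let fresh = false;
  return () => {
    const t = now();
    if (!fresh || t - computedAt >= ttlMs) {
      value = compute();
      computedAt = t;
      fresh = true;
    }
    return value as T;
  };
}

// Demo with a fake clock and a counter standing in for the real summary job.
let fakeNow = 0;
let computations = 0;

const sectorSummary = refreshEvery(
  60000,
  () => {
    computations += 1;
    return `summary v${computations}`;
  },
  () => fakeNow,
);

console.log(sectorSummary()); // "summary v1"
```

Requests inside the 60-second window reuse the stored summary, so the per-request cost is a lookup rather than a recomputation.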


FlockIQ

Edge-resident anomaly detection for commercial poultry

Commercial poultry operations don't have reliable internet. FlockIQ's custom PCB runs anomaly detection locally against calibrated thresholds for temperature, humidity, ammonia, and motion variance. Cloud sync and AI-assisted pattern analysis happen opportunistically. The architecture forces you to be disciplined: what genuinely needs cloud inference and what can be a rules engine on a microcontroller?
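The rules-engine half of that split is simple enough to show. The sketch below is in TypeScript for consistency with the rest of this page, not because the firmware is written in it, and every limit is a placeholder rather than a calibrated value.

```typescript
interface Reading {
  temperatureC: number;
  humidityPct: number;
  ammoniaPpm: number;
  motionVariance: number;
}

type Metric = keyof Reading;

interface Limits {
  min: number;
  max: number;
}

// Placeholder limits for illustration only; not calibrated values.
const LIMITS: Record<Metric, Limits> = {
  temperatureC: { min: 18, max: 30 },
  humidityPct: { min: 40, max: 75 },
  ammoniaPpm: { min: 0, max: 25 },
  motionVariance: { min: 0.05, max: 5 },
};

// Return every metric whose value falls outside its configured range.
function detectAnomalies(
  reading: Reading,
  limits: Record<Metric, Limits> = LIMITS,
): Metric[] {
  return (Object.keys(limits) as Metric[]).filter((metric) => {
    const value = reading[metric];
    const { min, max } = limits[metric];
    return value < min || value > max;
  });
}

const sample: Reading = {
  temperatureC: 24,
  humidityPct: 60,
  ammoniaPpm: 40,
  motionVariance: 0.01,
};

console.log(detectAnomalies(sample)); // ["ammoniaPpm", "motionVariance"]
```

Nothing here needs a network connection or a model, so it can raise an alert immediately and leave pattern analysis for the next successful sync.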


Want to go deeper?

Book a technical discovery call. We'll walk through your use case, tell you where the real complexity lives, and what we'd actually build — including what we'd talk you out of.

Book a discovery call →