Research
How we think about building with AI.
Six principles, a stack we can defend, and a short list of things we refuse to do — all field-tested across finance, operations, IoT, and enterprise advisory. Not theory. This is what we've learned shipping real systems.
Engineering principles
Six things we've learned the hard way. Each one came from watching something break in production or watching a client waste six figures on the wrong approach.
01
Start with the workflow. The model is a detail.
Every failed AI project we've seen started by picking a model and working backward. The right question is: what decision is a human making today that takes too long, costs too much, or produces inconsistent results? Map that workflow, find the choke point, then pick the cheapest model that clears the bar. We've shipped features with Claude Haiku that clients assumed required Opus — because the task was actually narrow and the prompt was tight.
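A minimal sketch of "pick the cheapest model that clears the bar", assuming each candidate has already been scored against an eval set. The tier names, costs, and scores below are made up for illustration and do not refer to real model pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str                 # hypothetical tier label, not a real model ID
    cost_per_1k_calls: float  # illustrative cost in dollars
    eval_score: float         # fraction of eval cases passed, 0.0 to 1.0


def cheapest_that_clears(candidates: list[Candidate], bar: float) -> Optional[Candidate]:
    """Return the lowest-cost candidate whose eval score meets the bar, or None."""
    passing = [c for c in candidates if c.eval_score >= bar]
    return min(passing, key=lambda c: c.cost_per_1k_calls, default=None)


# Illustrative numbers only.
candidates = [
    Candidate("small", 0.40, 0.93),
    Candidate("medium", 2.50, 0.95),
    Candidate("large", 12.00, 0.97),
]

choice = cheapest_that_clears(candidates, bar=0.92)
print(choice.name if choice else "no model clears the bar")
```

The bar comes from the workflow (how wrong the output can be before a human notices), so it is set first and the model falls out of it.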
02
RAG beats fine-tuning for almost every business problem.
Fine-tuned models are expensive to train, painful to version, and stale within six months as your data changes. For knowledge-intensive tasks (policy lookups, contract Q&A, product documentation search, internal procedures), a well-built retrieval pipeline with pgvector and a strong base model will, in our experience, outperform a fine-tuned snapshot nine times out of ten. We default to RAG. The exceptions are narrow: style transfer at scale, structured classification with thousands of labeled examples, or latency requirements that rule out a retrieval round-trip.
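The retrieval step can be sketched in a few lines. This is an in-memory stand-in, assuming embeddings have already been computed elsewhere; the three-dimensional vectors and the sample passages are toy values. In a pgvector-backed pipeline the ranking would typically be done in SQL, where the `<=>` operator orders rows by cosine distance.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k passages whose embeddings are closest to the query."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Put retrieved passages in front of the question for the base model."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy corpus: (passage, embedding). Real embeddings have hundreds of dimensions.
chunks = [
    ("Refunds are issued within 14 days.", [0.9, 0.1, 0.0]),
    ("VPN access requires a hardware key.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]

passages = top_k([1.0, 0.2, 0.0], chunks, k=2)
print(build_prompt("How do refunds work?", passages))
```

When the policy changes, only the stored passages change; the model stays as it is.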
03
Multi-agent systems fail when agents hallucinate about each other.
Emergent orchestration sounds elegant. In practice it's a debugging nightmare. Every agent we build has a typed input schema, a typed output schema, a single narrow responsibility, and explicit handoff logic. Orchestration is code, not conversation. When a 22-agent system like EAS/Veridia has a failure, we can isolate exactly which agent produced the bad output and why — because the contract between agents is explicit, not inferred. If you can't unit-test an agent in isolation, your architecture has a problem.
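A minimal sketch of what a typed agent contract could look like, assuming a single-purpose risk agent. The schema names, the allowed levels, and the `call_model` parameter are hypothetical; the point is that the model call is injected, so the agent can be unit-tested with a stub.

```python
from dataclasses import dataclass
from typing import Callable

ALLOWED_LEVELS = {"low", "medium", "high"}


@dataclass(frozen=True)
class RiskQuestion:
    """Typed input schema (hypothetical)."""
    topic: str
    exposure_usd: float


@dataclass(frozen=True)
class RiskAssessment:
    """Typed output schema (hypothetical); rejects anything outside the contract."""
    agent: str
    level: str

    def __post_init__(self) -> None:
        if self.level not in ALLOWED_LEVELS:
            raise ValueError(f"{self.agent} returned invalid level {self.level!r}")


def risk_agent(question: RiskQuestion, call_model: Callable[[str], str]) -> RiskAssessment:
    """One narrow responsibility: turn a question into a validated risk level."""
    prompt = (
        f"Rate the risk of {question.topic} with "
        f"${question.exposure_usd:,.0f} of exposure. Answer low, medium, or high."
    )
    return RiskAssessment(agent="risk", level=call_model(prompt).strip().lower())


# Unit test in isolation: a stub stands in for the real model.
result = risk_agent(RiskQuestion("vendor outage", 250_000), lambda prompt: " High\n")
print(result)
```

If the model drifts and answers "probably fine", the failure surfaces as a `ValueError` at this agent's boundary instead of as a confusing result three agents downstream.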
04
Set your latency budget before writing a line of inference code.
We've watched teams spend weeks optimizing an AI feature that users had already stopped using because it was too slow. Latency is a product constraint, not an afterthought. Before any inference work starts, we set a target: under 800ms for autocomplete paths, under 2s for interactive responses, under 10s for background analysis where we can show a loading state. That budget determines the model tier, whether to stream, whether to cache, and whether to parallelize. Change the budget later and you rebuild the architecture.
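One way to make the budget enforceable is to write it down as data and check measured latencies against it. The sketch below uses the three targets named above; the path names, the choice of p95 as the statistic, and the sample latencies are assumptions for illustration.

```python
# Budgets in milliseconds, taken from the targets stated above.
BUDGET_MS = {"autocomplete": 800, "interactive": 2000, "background": 10000}


def p95(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of the observed latencies."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def within_budget(path: str, samples_ms: list[float]) -> bool:
    """True when the p95 latency for this path is strictly under its budget."""
    return p95(samples_ms) < BUDGET_MS[path]


# Made-up measurements: 100 ms, 200 ms, ..., 2000 ms.
samples = [float(ms) for ms in range(100, 2100, 100)]

print(p95(samples))
print(within_budget("interactive", samples))
print(within_budget("autocomplete", samples))
```

A check like this can run in CI or against production traces, so a model or prompt change that breaks the budget is caught before users feel it.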
05
Human-in-the-loop is architecture, not apology.
The goal isn't to remove humans — it's to make the decisions humans make faster, better-informed, and less exhausting. We design escalation paths as first-class features: what confidence threshold triggers a human review, what context does the AI pass along when it escalates, and how does the human's correction feed back into the system? A CRM that auto-qualifies leads but flags the ambiguous 15% for a 10-second human review is more useful than one that tries to classify everything and gets 25% wrong silently.
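The three escalation questions map onto a small amount of code. This is a sketch under stated assumptions: the 0.85 threshold, the `Decision` fields, and the in-memory corrections log are all hypothetical, and a real system would tune the threshold against its eval data.

```python
from dataclasses import dataclass

REVIEW_THRESHOLD = 0.85  # hypothetical; below this a human takes a look


@dataclass(frozen=True)
class Decision:
    lead_id: str
    label: str
    confidence: float
    needs_review: bool
    context: str  # what the reviewer sees when the case is escalated


def route(lead_id: str, label: str, confidence: float, evidence: str) -> Decision:
    """Auto-accept confident predictions; escalate the rest with context attached."""
    needs_review = confidence < REVIEW_THRESHOLD
    context = ""
    if needs_review:
        context = f"Model suggested {label!r} at {confidence:.0%} confidence. Evidence: {evidence}"
    return Decision(lead_id, label, confidence, needs_review, context)


# Corrections are kept so they can later become eval cases or few-shot examples.
corrections: list[tuple[str, str, str]] = []


def record_correction(decision: Decision, human_label: str) -> None:
    """Store (lead_id, model_label, human_label) whenever the human disagrees."""
    if human_label != decision.label:
        corrections.append((decision.lead_id, decision.label, human_label))


confident = route("lead-1", "qualified", 0.97, "budget confirmed by email")
ambiguous = route("lead-2", "qualified", 0.60, "no budget mentioned")
record_correction(ambiguous, "unqualified")
print(ambiguous.context)
```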
06
Evals are the only honest measure of whether a prompt change helped.
We maintain a golden test set for every AI feature we ship — a set of real or representative inputs with expected outputs, scored by both automated metrics and human judgment where needed. Before any model upgrade, prompt revision, or retrieval change goes to production, it runs against the eval suite. "It felt better in my three manual tests" is not a deployment criterion. A measured improvement on 200 real cases is. This is the single practice that separates teams that regress in production from teams that don't.
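A lightweight version of such a gate might look like the following. It assumes exact-match scoring, which only suits classification-style outputs; the golden cases and both classifiers are toy stand-ins, and a real suite would use far more cases and richer scoring.

```python
from typing import Callable

GoldenSet = list[tuple[str, str]]  # (input, expected output)


def pass_rate(predict: Callable[[str], str], golden: GoldenSet) -> float:
    """Fraction of golden cases where the prediction matches exactly (case-insensitive)."""
    if not golden:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for text, expected in golden
        if predict(text).strip().lower() == expected.strip().lower()
    )
    return passed / len(golden)


def safe_to_ship(baseline: Callable[[str], str], candidate: Callable[[str], str],
                 golden: GoldenSet) -> bool:
    """Block the release if the candidate scores below the current baseline."""
    return pass_rate(candidate, golden) >= pass_rate(baseline, golden)


golden = [
    ("where is my refund", "billing"),
    ("reset my password", "account"),
    ("invoice is wrong", "billing"),
    ("app crashes on login", "technical"),
]


def baseline(text: str) -> str:
    return "billing" if "refund" in text or "invoice" in text else "account"


def candidate(text: str) -> str:
    if "crash" in text:
        return "technical"
    return baseline(text)


print(pass_rate(baseline, golden), pass_rate(candidate, golden))
print(safe_to_ship(baseline, candidate, golden))
```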
What we've ruled out
Deliberate constraints are engineering decisions. Here are four things we don't do, and why.
Not us
Single-model lock-in
We've never built a system that hard-codes one provider. Model routing at the orchestration layer costs a day of work and saves you from a bad week when a provider has an outage or doubles its prices. Every system we ship can swap the underlying model without touching product code.
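A minimal sketch of routing with fallback, assuming each provider is wrapped behind the same callable signature. `ProviderError`, the provider names, and the stub functions are hypothetical; real adapters would translate each vendor's SDK errors into the shared exception.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider adapter when a call fails (hypothetical shared error type)."""


Provider = Callable[[str], str]


def complete(prompt: str, providers: list[tuple[str, Provider]]) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stubs standing in for real provider adapters.
def flaky(prompt: str) -> str:
    raise ProviderError("timeout")


def steady(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = complete("hello", [("primary", flaky), ("backup", steady)])
print(used, text)
```

Product code calls `complete` and never imports a vendor SDK directly, so swapping or reordering providers is a change to one list.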
Not us
Fine-tuning when RAG works
Fine-tuning is the right tool for a narrow set of problems. It is not the right tool for "make the model know about our product." We've seen clients burn $40k on a fine-tuning project that a $200/month RAG pipeline would have solved with better latency and fresher data.
Not us
GPT wrappers without evals
Wrapping an LLM API and calling it a product is fast to build and fast to break. If there is no evaluation harness, no regression test, and no monitoring for output quality, you will regress in production and not know it until a user complains. We don't ship without at least a lightweight eval suite.
Not us
Prompt engineering as a substitute for architecture
A clever system prompt can paper over a bad architecture for a few weeks. It will not hold. When a 2,000-token system prompt is doing structural work that belongs in code (routing, validation, error handling), it's a sign the system wasn't designed; it was prompted into existence. We invest in architecture first.
Our AI stack
What we reach for and why. Each choice has a reason — not inertia, not trend-following.
From the work
Three architecture problems from products we've shipped. The interesting part is usually the constraint, not the model.
EAS / Veridia
Persona isolation in a 22-agent advisory system
Each of the 22 advisors in EAS needs a distinct, stable point of view. A CFO and a CISO will have genuinely different risk tolerances on the same question. We solved this with role-scoped system prompts, separate memory namespaces per advisor, and an explicit cross-agent escalation protocol — so the CEO advisor can call the CISO advisor for input without either contaminating the other's base persona. The hard part wasn't prompting; it was the typed handoff contract.
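The isolation idea can be illustrated with a toy memory store. This is not the EAS implementation; the class, the `consult` function, and the advisor callables are hypothetical names used to show the shape of a handoff in which only the question and the answer cross the boundary.

```python
from typing import Callable

# An advisor sees the question plus its own notes, nothing else.
Advisor = Callable[[str, list[str]], str]


class AdvisorMemory:
    """One private list of notes per advisor name (a memory namespace)."""

    def __init__(self) -> None:
        self._store: dict[str, list[str]] = {}

    def remember(self, advisor: str, note: str) -> None:
        self._store.setdefault(advisor, []).append(note)

    def recall(self, advisor: str) -> list[str]:
        return list(self._store.get(advisor, []))  # copy, so callers cannot mutate it


def consult(asker: str, expert: str, question: str,
            advisors: dict[str, Advisor], memory: AdvisorMemory) -> str:
    """Ask another advisor for input without sharing either side's memory."""
    answer = advisors[expert](question, memory.recall(expert))
    # The asker stores the answer as attributed input, not as its own view.
    memory.remember(asker, f"[{expert}] {answer}")
    return answer


def ciso(question: str, notes: list[str]) -> str:
    return f"Security view on {question!r}, drawing on {len(notes)} prior note(s)"


memory = AdvisorMemory()
memory.remember("ciso", "Board prefers low tolerance for data exposure")
reply = consult("ceo", "ciso", "new vendor onboarding", {"ciso": ciso}, memory)
print(reply)
```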
APEX Terminal
Sub-2s signal generation on live options flow
Streaming real-time options market data into an LLM inference pipeline without blowing latency budgets required a layered caching strategy: pre-computed sector summaries refreshed at 60-second intervals, a thin normalization layer that converts raw tick data into model-ready context, and streamed output so the UI renders before inference completes. We hit a p95 under 1.8s on heavy-volume live-feed days.
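The summary-caching layer can be sketched as a small time-to-live cache. This version refreshes lazily on read, which is a simplification; a scheduled background refresh would keep the read path free of recomputation. The class name, the `summarize` function, and the injected clock are assumptions for illustration.

```python
import time
from typing import Callable


class TTLCache:
    """Cache computed values per key and recompute once they are older than the TTL."""

    def __init__(self, compute: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests do not have to sleep
        self._entries: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is None or now - entry[0] >= self._ttl:
            entry = (now, self._compute(key))
            self._entries[key] = entry
        return entry[1]


# Demo with a fake clock and a stand-in for the expensive summary step.
fake_now = [0.0]
calls: list[str] = []


def summarize(sector: str) -> str:
    calls.append(sector)
    return f"{sector} summary #{len(calls)}"


cache = TTLCache(summarize, ttl_seconds=60.0, clock=lambda: fake_now[0])
first = cache.get("tech")
fake_now[0] = 30.0
second = cache.get("tech")  # still fresh, served from cache
fake_now[0] = 61.0
third = cache.get("tech")   # older than 60s, recomputed
print(first, second, third, sep=" | ")
```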
FlockIQ
Edge-resident anomaly detection for commercial poultry
Commercial poultry operations don't have reliable internet. FlockIQ's custom PCB runs anomaly detection locally against calibrated thresholds for temperature, humidity, ammonia, and motion variance. Cloud sync and AI-assisted pattern analysis happen opportunistically. The architecture forces you to be disciplined: what genuinely needs cloud inference, and what can be a rules engine on a microcontroller?
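A threshold rules engine of this kind is small enough to show in full. The sketch below is in Python for readability only; firmware on a microcontroller would more likely be written in C. The sensor keys and every limit value are placeholders, not FlockIQ's calibrated thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    low: float
    high: float


# Placeholder limits for illustration; real values come from calibration.
LIMITS = {
    "temperature_c": Limit(18.0, 30.0),
    "humidity_pct": Limit(50.0, 70.0),
    "ammonia_ppm": Limit(0.0, 25.0),
    "motion_variance": Limit(0.2, 5.0),
}


def out_of_range(reading: dict[str, float]) -> list[str]:
    """Return the names of sensors that are missing or outside their limits."""
    flagged = []
    for name, limit in LIMITS.items():
        value = reading.get(name)
        if value is None or not (limit.low <= value <= limit.high):
            flagged.append(name)
    return flagged


normal = {"temperature_c": 24.0, "humidity_pct": 60.0, "ammonia_ppm": 10.0, "motion_variance": 1.0}
spike = dict(normal, ammonia_ppm=40.0)

print(out_of_range(normal))
print(out_of_range(spike))
```

Nothing here needs a network connection, which is the point: alerts fire locally, and the readings can be queued for cloud analysis whenever a link is available.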
Want to go deeper?
Book a technical discovery call. We'll walk through your use case, tell you where the real complexity lives, and what we'd actually build — including what we'd talk you out of.
Book a discovery call →