DevelopmentPlaybook9 min readPublished June 29, 2026

Used RTX 3090 from ~$700 · 0 open router ports · OpenAI-compatible /v1 endpoint

Build a Home AI Server That Runs Open-Weight LLMs 24/7

You don’t need a data center to run private AI. A repurposed gaming PC with a used RTX 3090, or a near-silent Mac, serves capable open-weight models on your own network — and Tailscale reaches it from anywhere with zero open ports. This is the full build-and-operate playbook for a 24/7 home AI server in 2026.

DA
Digital Applied Team
Senior engineers · Published Jun 29, 2026
PublishedJun 29, 2026
Read time9 min
SourcesNVIDIA · Apple · Ollama · Tailscale
Entry GPU (used)
$700+
RTX 3090 · 24 GB VRAM
Always-on power
~$52/mo
RTX 4090 @ $0.12/kWh
Open router ports
0
remote over Tailscale
70B on one card
48GB
RTX 6000 Ada · Q4_K_M

A home AI server is a spare computer that runs open-weight large language models on your own network, all day, with no cloud account in the loop. In 2026 the entry ticket is genuinely modest: a used RTX 3090 with 24 GB of VRAM costs roughly $700–$900, pairs with 64 GB of system RAM and a clean Linux install, and serves capable models through an OpenAI-compatible endpoint your existing tools already understand.

The reason to bother is no longer ideology. Self-hosting buys you three concrete things: data that never leaves your premises, predictable cost instead of metered per-token billing, and an endpoint with no rate limits or surprise deprecations. The catch is that most guides stop at “here’s how to run a model” — they never cover how to run one as infrastructure that stays up without babysitting.

This playbook closes that gap. We cover hardware tiers and the VRAM math that decides what actually fits, the two serving stacks worth knowing (Ollama and vLLM), the always-on layer (Proxmox, a process supervisor, Open WebUI, and a UPS), remote access over Tailscale with zero open ports, the one security rule you must not break, and the real monthly electricity cost. If you are still weighing whether to host at all, our buy-vs-rent-vs-cloud inference decision guide frames that call first.

Key takeaways
  1. 01
    A used RTX 3090 is still the best-value entry point.24 GB of GDDR6X for roughly $700–$900 runs 32B-class models at Q4_K_M and generates around 50 tokens/sec on a 7B model — the same VRAM as a new RTX 4090 at a fraction of the price.
  2. 02
    VRAM, not raw speed, decides what you can run.Budget about 0.6 GB per billion parameters at Q4_K_M. A 70B model needs roughly 38–48 GB, so it fits one 48 GB RTX 6000 Ada — but not a 32 GB RTX 5090, which tops out near a 32B model.
  3. 03
    Ollama and vLLM both speak the OpenAI API.Point any existing client at Ollama’s http://localhost:11434/v1 and nothing in your agent or IDE changes. Reach for vLLM when you need high concurrency, multi-GPU tensor parallelism, and PagedAttention throughput.
  4. 04
    Tailscale gives remote access with zero open ports.A WireGuard mesh assigns each device a private 100.x.y.z address that works behind NAT. The free Personal plan covers 6 users and unlimited devices — no port forwarding, no public DNS record.
  5. 05
    Never expose raw Ollama to the public internet.Port 11434 has no built-in authentication. Keep it inside the tailnet, or put it behind an authenticated reverse proxy over HTTPS. A public, open endpoint lets anyone enumerate and run your models.

01Why Self-HostPrivate inference is now an infrastructure decision, not a hobby.

Running a model locally used to be a weekend novelty. In 2026 it is a defensible operational choice, because open-weight models are good enough for a large slice of everyday work and the hardware to run them keeps getting cheaper on the used market. The framing that matters is not “can I run a model” — it is “can I run one as a service that stays up.”

Three motivations drive most home AI servers. They rarely arrive alone; privacy and cost predictability tend to show up together, with freedom from rate limits as the reason the box stays on after the novelty fades.

Data residency
Prompts stay on your network
100%

Nothing you send to a local model leaves your premises — prompts, documents, and model weights all stay on hardware you control. For client work, regulated data, or anything you would not paste into a public chatbot, that is the whole point.

No cloud account in the loop
Cost shape
Capex, not metered billing
0/token

You pay once for hardware and a predictable electricity bill, instead of per-token charges that scale with usage. For steady, high-volume workloads the amortized cost of a used-GPU box undercuts metered API spend within months.

Predictable monthly cost
Control
No rate limits or deprecations

Your endpoint has no request ceilings, no quiet model retirements, and no usage policy changes. You pin the exact model version, run batch jobs overnight, and keep an agent loop hammering the API without throttling.

You own the uptime

The honest counterweight: a local model is not GPT-5.5 or Claude Opus. You trade some frontier capability for control and cost predictability, and you take on the operational work this guide describes. For the privacy and compliance side of that trade-off, our earlier privacy-first local deployment guide goes deeper on data-handling patterns. What changed in 2026 is that the capability gap narrowed enough that the trade is worth making for real workloads, not just experiments.

02HardwarePick the tier that matches your model size and your electricity bill.

There are two hardware families for a home AI server: NVIDIA GPUs (fast, power-hungry, VRAM-capped) and Apple Silicon (slower per token, near-silent, unified memory, remarkably power-efficient). The table below puts the tradeoffs side by side, using a consistent method for the running-cost column so the numbers are comparable.

Home AI server hardware tiers in 2026 — VRAM, largest model at Q4_K_M, full-load system power, estimated monthly electricity at $0.12 per kWh, and street price as of June 2026.
Tier / exampleVRAMMax model @ Q4_K_MFull-load powerEst. elec. / moStreet price (Jun 2026)
Consumer & prosumer GPUs
RTX 3090 (used)24 GB GDDR6X32B (~22–24 GB)~550 W system~$48$700–$900 (GPU)
RTX 409024 GB GDDR6X32B (~22–24 GB)~600 W system~$52$1,200–$1,600 (new)
RTX 509032 GB GDDR7≤32B — not 70B at Q4~750 W system~$65$2,500–$4,000 (street)
RTX 6000 Ada48 GB GDDR670B (~38–48 GB)~450 W system~$39$6,000–$10,000
Apple Silicon (unified memory)
Mac mini M4 Pro 64 GB64 GB unified32B comfortably~30–40 W system~$3$1,599+
Mac Studio M4 Max 128 GB128 GB unified70B comfortably~40–60 W system~$4$2,499+
Monthly electricity = representative full-load system watts × 24 h × 30 d × $0.12/kWh. GPU prices are for the card only; add a host platform. Apple wattage figures are approximate — verify with a smart plug under your own load. Mac Studio pricing may have moved in 2026; confirm at apple.com before you buy.

Two cells deserve a flag. The RTX 5090 ships with 32 GB, which a 70B model at Q4 (roughly 38 GB plus context) overflows — so despite its speed it tops out around a 32B-class model, not 70B. And its often- quoted memory bandwidth of ~1.79 TB/s and a “3.2× throughput versus the RTX 4090” claim come from secondary, batch-throughput benchmarks; treat both as approximate and do not assume that multiplier carries over to single-user, interactive chat. The RTX 6000 Ada is the one single card here that holds a full 70B model at Q4_K_M in VRAM.

For most first builds, the decision collapses to four archetypes. For a deeper price-bracket breakdown across every tier, see our hardware price-brackets guide.

Budget · first build
Used RTX 3090, 24 GB

Roughly $700–$900 for the same VRAM as a new RTX 4090. Community consensus sweet spot: an Intel i5-14500, 128 GB DDR4, a 2 TB NVMe, and a 1000 W PSU, with the GPU passed through to a Linux VM. It has narrower memory bandwidth than the 4090, but for 7B–32B models it is the value leader.

Best value
Silent · low-power
Mac mini or Mac Studio

Unified memory lets a Mac Studio M4 Max 128 GB hold a 70B model, and the whole box draws tens of watts — roughly 10–15× less than a GPU tower. If you pay European electricity rates, this is often the cheapest machine to keep on 24/7.

Lowest running cost
70B in one box
RTX 6000 Ada, 48 GB

The only single consumer-or-prosumer card that fits a 70B model at Q4_K_M entirely in VRAM, at 300 W and ~$6,000–$10,000. Worth it if you need a single 70B endpoint without juggling two GPUs and PCIe-bound tensor parallelism.

Single-card 70B
Max GPU speed
RTX 4090 / RTX 5090

When tokens-per-second matters more than running cost. The 4090 delivers strong 7B–32B speeds at ~1 TB/s bandwidth; the 5090 is faster still and adds 8 GB of VRAM, at a meaningfully higher power draw and price.

Fastest consumer GPU
“The best first local LLM PC build in 2026 is still refreshingly simple: buy a used RTX 3090 with 24GB of VRAM, pair it with 64GB of system RAM, and run the machine on one clean Linux install.”— Popular AI · Best Budget Local LLM PC, 2026

03VRAM MathThe number that decides everything: how much fits in memory.

Before you buy anything, do the VRAM arithmetic. A model has to fit in memory — weights plus the KV cache that holds your context — or it spills to system RAM and slows to a crawl. The useful rule of thumb at Q4_K_M quantization is about 0.6 GB per billion parameters for the weights, then add headroom for context. The KV cache grows with the context window: a 70B model needs roughly 38–48 GB at a short context, but around 56 GB once you push it to a 32K window.

Approximate VRAM requirements at Q4_K_M quantization by model size, at a 2K and a 32K context window, with a representative GPU that fits each tier.
Model classParamsVRAM @ 2K ctxVRAM @ 32K ctxFits on
Gemma 3 4B / Phi-4 Mini~4B~3 GB~5 GBRTX 3060 8 GB
Llama 3.2 8B / Qwen2.5 7B~7–8B~5–6 GB~9 GBRTX 4060 8 GB
Llama 3.1 13B / Mistral 7B-Instruct~13–14B~9–10 GB~14 GBRTX 3060 12 GB
Qwen2.5 32B / DeepSeek-R1 32B~32B~22–24 GB~38 GBRTX 3090 / 4090 (24 GB)
Llama 3.3 70B / Qwen2.5 72B~70B~38–48 GB~56 GBRTX 6000 Ada or 2× RTX 3090
Figures are approximate and assume Q4_K_M weights. A 32B model fits a 24 GB card at a short context but overflows once you raise the window — which is exactly why the KV cache, not just the weights, governs your real ceiling.
The one rule to internalize
At Q4_K_M, budget roughly 0.6 GB per billion parameters for weights, then add context headroom on top. A 24 GB card lives happily at 32B for short prompts; a full 70B needs a 48 GB card or two 24 GB cards. Note that consumer RTX 3090s have no NVLink, so a dual-GPU setup talks over PCIe and pays a parallelism penalty — plan for it.

04Serving StackTwo servers worth knowing: Ollama and vLLM.

The serving layer is what turns model weights into an API. Two stacks cover almost every home setup, and the good news is that both expose an OpenAI-compatible endpoint — so the client code in your agent, IDE, or app does not change when you switch the base URL away from the cloud.

Ollama is the dead-simple default. Its REST API at http://localhost:11434/v1 implements /v1/chat/completions, /v1/completions, /v1/embeddings and /v1/models, so it drops into any client built for the OpenAI API by changing one base URL. vLLM is the throughput engine: its OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server) adds PagedAttention, continuous batching, and multi-GPU tensor parallelism for serving many concurrent requests.

Dead-simple
Ollama
http://localhost:11434/v1

One model at a time, one command to pull and run, OpenAI-compatible out of the box. Critical knob for agents: set OLLAMA_CONTEXT_LENGTH (default is only 2048 tokens) to 32000–65536, and OLLAMA_ORIGINS="*" so an IDE like Cursor can call it. Pick a model trained for JSON tool-calling — Qwen3, Llama 3.3 70B, or Mistral Nemo — or your agent loops will break.

Solo / small team · easiest
High-throughput
vLLM
vllm.entrypoints.openai.api_server

PagedAttention stores the KV cache in non-contiguous blocks, and continuous batching keeps the GPU busy across concurrent users. Flags worth knowing: --tensor-parallel-size N for multi-GPU, --gpu-memory-utilization 0.95, --dtype float16. Needs Python 3.10+ (3.12+ recommended) and a minute of CUDA setup. The V1 rewrite landed in January 2025.

Many users / batch · fastest

vLLM’s own documentation cites 14–24× higher throughput than HuggingFace Transformers on the same hardware; treat that as a vendor-stated figure, since independent 2026 benchmarks show real but narrower advantages. The practical rule: start with Ollama, graduate to vLLM only when you actually have concurrency to serve. If you want the full runtime comparison — including LM Studio and llama.cpp — read our Ollama vs LM Studio vs vLLM runtime guide. One field-tested gotcha worth repeating: setting OLLAMA_CONTEXT_LENGTH before ollama serve fixes the most common “my agent goes senile” complaint — an agent carrying memory, skill definitions, and tool schemas blows through a 2048-token default almost immediately.

05Always-OnThe layer most guides skip: staying up without babysitting.

A demo runs while you watch it; infrastructure runs while you sleep. Four pieces turn a model into a service: a hypervisor or process supervisor that restarts it on crash or reboot, a friendly front end, and a UPS so a power blip does not corrupt a model mid-write.

Virtualization
Proxmox GPU passthrough
1GPU/VM

Enable IOMMU (Intel VT-d or AMD-Vi) in BIOS, bind the GPU to vfio-pci, then add the PCI device to a VM. In 2026 the recommended practice is one GPU per VM (not an LXC container) for full passthrough — the Proxmox host runs on integrated graphics while the discrete card serves the model.

VM, not LXC, for full passthrough
Front end
Open WebUI over Docker
1UI

A self-hostable, ChatGPT-style interface for any OpenAI-compatible endpoint: multi-user with LDAP/SSO, conversation history with semantic search, RAG over uploaded documents, voice I/O, and MCP support. Deploy it with a single Docker pull and point it at your Ollama or vLLM server.

ghcr.io/open-webui/open-webui
Power safety
Pure sine-wave UPS
10min

A 1500 VA pure sine-wave UPS (around $190) buys roughly 10–12 minutes of runtime — enough to shut the server down gracefully during an outage. Modern GPU PSUs with Active PFC require a pure sine-wave unit, not a stepped-approximation model; replacement batteries are cheap and user-swappable.

Active-PFC PSUs need pure sine

06Remote AccessReach your server from anywhere without opening a port.

You want to use the server from your laptop at a café, not just from your living room — but you do not want to forward a port and expose anything to the open internet. Tailscale solves this cleanly. It builds a WireGuard-based private mesh network, called a tailnet, where each device gets a stable 100.x.y.z address that works behind NAT with no router configuration. Your phone, laptop, and AI server simply see each other as if they were on the same LAN.

To share the model itself, Tailscale Serve proxies a local service — Ollama on port 11434 — over TLS to other devices inside your tailnet, with no separate reverse proxy or public DNS record. As an alternative, set OLLAMA_HOST=<tailnet_ip> to bind Ollama directly to the tailnet interface. Either way, the endpoint stays private to devices you have explicitly added.

Tailscale free tier (Personal, 2026)
The free Personal plan covers 6 users and unlimited devices, with MagicDNS, exit nodes, subnet routing, and up to 50 tagged resources — enough for a household or a small team, with no open router ports required. Paid Standard plans start at $8 per user per month if you outgrow it.

07SecurityThe one rule: never expose raw Ollama to the internet.

This is the part most cheerleading guides leave out, and it is the part that gets people compromised. Ollama has no built-in authentication. If you forward port 11434 to the public internet — or use Tailscale Funnel, which is the feature that exposes a service publicly — anyone who finds it can enumerate your models, run inference on your hardware, and pull your local weights. Tailscale Serve (private, within the tailnet) is safe; Tailscale Funnel (public) is not, for an unauthenticated endpoint.

There are exactly three safe ways to reach a self-hosted LLM endpoint. Pick one and never skip it.

Option A · simplest
Tailnet-only access
private mesh, no public surface

Keep Ollama bound to the tailnet and reachable only by devices you have added to your Tailscale account. There is no public attack surface at all — the endpoint does not exist on the open internet. This is the recommended default for a home setup.

Tailscale Serve · not Funnel
Option B · self-managed
Reverse proxy + auth
Nginx + HTTP Basic Auth over HTTPS

If you must publish on a domain, put an Nginx reverse proxy in front with HTTP Basic Auth, terminate TLS with a real certificate, and never let raw 11434 face the internet. The proxy, not Ollama, enforces who gets in.

TLS-terminated, authenticated
Option C · richer auth
Identity-aware proxy
Authentik / OAuth2 proxy

For team access with real accounts, front the endpoint with Authentik or an OAuth2 proxy so each request carries a verified identity. Heavier to set up than Basic Auth, but you get SSO, per-user revocation, and an audit trail.

SSO · per-user control

The pattern underneath all three is the same: authentication and transport security live in front of the model, because the model server provides neither. Treat port 11434 the way you would treat an unauthenticated database port — something that should never be directly reachable from outside your trusted network.

08Running CostWhat it actually costs to keep the lights on.

The recurring cost of a home AI server is electricity, and it varies wildly by hardware. An RTX 4090 system pulling around 600 W and running 24/7 costs roughly $52 a month at $0.12/kWh; the same math on an Apple Silicon box drawing tens of watts lands near $3–$5. That is a 10–15× gap, and it is the single biggest lever in the total cost of ownership. Our self-hosting total-cost-of-ownership analysis works the full amortization, hardware plus power, against cloud API spend.

Always-on electricity by hardware tier · 24/7 at $0.12/kWh

Est. at $0.12/kWh · (system watts ÷ 1000) × 24 × 30 × $0.12
RTX 5090 system~750 W · highest-draw consumer GPU
~$65/mo
RTX 4090 system~600 W · mainstream GPU tower
~$52/mo
RTX 3090 system~550 W · best-value used build
~$48/mo
RTX 6000 Ada system~450 W · 48 GB, power-efficient pro card
~$39/mo
Mac Studio M4 Max~40–60 W · 128 GB unified
~$4/mo
Mac mini M4 Pro~30–40 W · 64 GB unified
~$3/mo

Two adjustments change the picture materially. First, you can power-limit a GPU: dropping an RTX 4090 from its 450 W TDP to 350 W saves around 40% on its electricity for only about a 10% speed loss — an easy win for an always-on inference box where latency is rarely the binding constraint. Second, electricity prices are not universal. At European rates of roughly $0.25–$0.35/kWh, every figure above runs two to three times higher, which is precisely where Apple Silicon’s efficiency advantage flips the buy decision.

That regional split is the part worth thinking forward on. As open-weight models keep improving and GPU and DRAM prices stay volatile through 2026, the cleanest economics for many teams will be a hybrid: amortize a low-power local box for steady, private, high-volume workloads, and burst to cloud inference for spikes or for the occasional frontier-grade task a local model can’t handle. The home server is not an all-or-nothing replacement for the cloud — it is the always-on baseline you stop paying per token for.

09ConclusionA spare PC, two open-source servers, and zero open ports.

The state of home AI servers, mid-2026

A used GPU plus Ollama and Tailscale turns a spare computer into private AI infrastructure.

The build is genuinely accessible now. A used RTX 3090 with 64 GB of system RAM, a clean Linux install, Ollama or vLLM serving an OpenAI-compatible endpoint, Open WebUI as the front door, and Tailscale for remote access gives you a private model server you can reach from anywhere — for an entry cost under $1,000 and an electricity bill you can predict to the dollar.

Get the VRAM math right before you spend, choose the tier that matches your model size and your power rates, and treat the security rule as non-negotiable: a raw Ollama endpoint never belongs on the public internet. Keep it inside the tailnet or behind an authenticated proxy, and the rest of the stack is straightforward, well-trodden homelab work.

The broader shift is that private inference has matured from a weekend experiment into real infrastructure — something a team can depend on for client-confidential work, predictable cost, and an endpoint nobody else can rate-limit or deprecate. If you want help standing one up or wiring it into your agents and applications, our AI transformation engagements start with exactly this kind of build.

Stand up private AI infrastructure

A private model server your data never leaves.

We help teams design, build, and operate private AI infrastructure — local model servers, secure remote access, and the agent and application integrations that run on top — delivered in days, not quarters.

Free consultationExpert guidanceTailored solutions
What we work on

Self-hosted AI engagements

  • Hardware sizing — GPU vs Apple Silicon for your workload
  • Ollama / vLLM serving with OpenAI-compatible endpoints
  • Secure remote access — Tailscale and authenticated proxies
  • Open WebUI, RAG, and agent integration on a private endpoint
  • Hybrid routing — local baseline plus cloud burst
FAQ · Home AI server guide

The questions we get every week.

The best-value starting point in 2026 is a used NVIDIA RTX 3090 with 24 GB of VRAM, which costs roughly $700–$900 and runs 7B to 32B models well. Pair it with about 64 GB of system RAM, a 2 TB NVMe drive, and a clean Linux install. If you prefer a near-silent, low-power machine, a Mac mini M4 Pro with 64 GB of unified memory (from $1,599) or a Mac Studio M4 Max with 128 GB (from $2,499) runs the same model sizes at a fraction of the electricity. Match the tier to the largest model you intend to run and to your local electricity rate.
Related dispatches

Continue exploring local AI and self-hosting.