AI photo upscaling explained — when 2× actually beats 4×
Tomoda Hinata · Tool author & maintainer · Published Apr 26, 2026 · 13 min read
AI upscaling uses a neural network trained on millions of low-resolution → high-resolution pairs to invent plausible detail that bicubic interpolation cannot recover. It rescues old thumbnails, salvages cropped phone shots, and prepares small archive photos for print. It does not recover detail that was never captured, and it can hallucinate uncanny faces — choose 2× over 4× unless you know why.
What does AI upscaling actually do?
Classic resampling (bicubic, Lanczos) interpolates new pixels from the existing ones — it cannot invent detail that is not there. AI upscaling runs a convolutional or transformer neural network trained on low-res / high-res image pairs; given a 256×256 input, it outputs a 512×512 image whose extra pixels are statistically consistent with the training data. Real-ESRGAN, the most-deployed model family, was trained on degraded versions of high-quality images so it specifically learns to invert blur, JPEG compression, and downsampling.
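For a concrete baseline, this is roughly what classic resampling looks like in the browser: a minimal sketch using the canvas resampler (the exact filter is implementation-defined, typically bilinear or bicubic-like at the "high" setting), which interpolates existing pixels and nothing more.

```ts
// Baseline, non-AI upscale via the browser's built-in resampler.
// It can only interpolate between existing pixels; no new detail appears.
function classicUpscale(source: HTMLImageElement, scale: number): HTMLCanvasElement {
  const canvas = document.createElement('canvas');
  canvas.width = Math.round(source.naturalWidth * scale);
  canvas.height = Math.round(source.naturalHeight * scale);
  const ctx = canvas.getContext('2d')!;
  ctx.imageSmoothingEnabled = true;
  ctx.imageSmoothingQuality = 'high'; // implementation-defined filter
  ctx.drawImage(source, 0, 0, canvas.width, canvas.height);
  return canvas;
}
```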
When is upscaling the right tool?
Three classic wins. Old thumbnail rescue: a 200×200 avatar from 2010 needs to fit a 500×500 social card. Archive scans: a 720p photo from a CD-era backup needs print-ready resolution. Cropped phone shot: digital zoom on the iPhone before the shutter fired dropped the effective resolution; an AI upscale recovers usable detail. Three classic losses: synthesising new viewpoints (impossible — AI cannot rotate the camera), recovering text from blurry photos (some success, but rarely production-grade), and upscaling already-AI-generated images (compounds artefacts).
Why is 2× the safe default?
At 2× the network's hallucination is constrained: it has 4 input pixels per output pixel and can rely heavily on local edge information. At 4× there is only 1 input pixel for every 16 outputs — the network must invent 15 of every 16 pixels, which works for textures but produces uncanny faces, garbled text, and hallucinated patterns in flat areas. If you need 4×, run the model twice at 2× back-to-back rather than once at 4× — the intermediate result anchors the second pass.
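If you treat one 2× pass as a black box, the chained version is trivial; a minimal sketch, where `upscale2x` is a hypothetical single-pass wrapper (not part of any published API), passed in as a parameter to keep the sketch self-contained:

```ts
// Two anchored 2x passes instead of one 4x pass: the intermediate image
// constrains what the second pass is allowed to invent.
// `upscale2x` is a hypothetical wrapper around one run of the 2x model.
async function upscale4xInTwoPasses(
  image: ImageData,
  upscale2x: (img: ImageData) => Promise<ImageData>,
): Promise<ImageData> {
  const intermediate = await upscale2x(image); // e.g. 512x512 -> 1024x1024
  return upscale2x(intermediate);              // 1024x1024 -> 2048x2048
}
```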
How does the browser-side model work?
The tool ships an ONNX-format Real-ESRGAN-x2 model, ~50 MB, downloaded once and cached. Inference runs through ONNX Runtime Web compiled to WebAssembly, with WebGPU acceleration when available (Chrome 113+, Edge 113+, Safari 17+). On a 14" MacBook Pro M2 a 1024×1024 → 2048×2048 upscale takes about 4 seconds; on an iPhone 14 it takes about 12 seconds. Nothing is uploaded — the photo, the model, and the inference all stay in the browser tab.
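A minimal sketch of that setup with onnxruntime-web; the model URL and the tensor names `input`/`output` are assumptions (Real-ESRGAN exports vary), and depending on the onnxruntime-web version the WebGPU backend may live in the separate `onnxruntime-web/webgpu` bundle:

```ts
import * as ort from 'onnxruntime-web';

// Hypothetical model location; the ~50 MB file is fetched once and cached.
const MODEL_URL = '/models/realesrgan-x2.onnx';

// Prefer WebGPU where the browser supports it, fall back to WebAssembly.
async function createSession(): Promise<ort.InferenceSession> {
  return ort.InferenceSession.create(MODEL_URL, {
    executionProviders: ['webgpu', 'wasm'],
  });
}

// Real-ESRGAN-style exports usually take NCHW float32 RGB in [0, 1].
// The tensor names 'input' and 'output' are assumptions; check your export.
async function runUpscale(
  session: ort.InferenceSession,
  rgb: Float32Array,
  width: number,
  height: number,
): Promise<ort.Tensor> {
  const feeds = { input: new ort.Tensor('float32', rgb, [1, 3, height, width]) };
  const results = await session.run(feeds);
  return results.output; // [1, 3, 2 * height, 2 * width] for the x2 model
}
```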
What are the realistic limits?
Faces: even Real-ESRGAN can hallucinate slightly different facial features at 4×. Text: the network learns letterforms, but tiny print on a phone-shot receipt rarely upscales to something readable. Already-compressed images: heavy JPEG block artefacts get treated as features and amplified. AI-generated source: compounds the original AI's artefacts. When in doubt, run 2× and stop — the gains beyond that are usually paid for in artefacts.
Steps (about 1 min)
Drop the source image
Drag a JPG/PNG/WebP under 4 MP onto the tool. Larger inputs work, but inference time grows with pixel count, so doubling both dimensions roughly quadruples the wait.
Pick the scale factor
Default 2× is right for almost everything. Choose 4× only if you have a specific reason and a 2-pixel-tall face is not in the frame.
Wait for inference
First run downloads the ~50 MB model (cached after). Inference: ~4 s on M2, ~12 s on iPhone 14, longer on older devices; the pixel-format conversion this step performs is sketched below the benchmark table.
Compare and download
Use the side-by-side compare slider to verify the result. If a face looks uncanny, drop to 2× or skip the upscale.
| Input → Output | Apple M2 (WebGPU) | iPhone 14 (Wasm) | Output file size |
|---|---|---|---|
| 256×256 → 512×512 | 0.4 s | 1.1 s | +150% bytes |
| 512×512 → 1024×1024 | 1.3 s | 3.8 s | +220% bytes |
| 1024×1024 → 2048×2048 | 4.1 s | 12.0 s | +280% bytes |
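For completeness, the only plumbing around those numbers is converting the canvas's interleaved RGBA pixels into the planar float32 layout the model consumes and back again; a minimal sketch, assuming the NCHW RGB-in-[0, 1] layout used in the inference sketch above (the exact normalisation depends on the export):

```ts
// Interleaved RGBA ImageData -> planar NCHW float32 RGB in [0, 1].
function imageDataToTensor(img: ImageData): Float32Array {
  const { width, height, data } = img;
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i] = data[i * 4] / 255;                 // R plane
    out[plane + i] = data[i * 4 + 1] / 255;     // G plane
    out[2 * plane + i] = data[i * 4 + 2] / 255; // B plane (alpha is dropped)
  }
  return out;
}

// Planar NCHW float32 RGB -> ImageData, clamped back to [0, 255].
function tensorToImageData(rgb: Float32Array, width: number, height: number): ImageData {
  const plane = width * height;
  const out = new Uint8ClampedArray(4 * plane);
  for (let i = 0; i < plane; i++) {
    out[i * 4] = rgb[i] * 255;
    out[i * 4 + 1] = rgb[plane + i] * 255;
    out[i * 4 + 2] = rgb[2 * plane + i] * 255;
    out[i * 4 + 3] = 255; // opaque alpha
  }
  return new ImageData(out, width, height);
}
```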
Frequently asked questions
Can I upscale beyond 4×?
Stack two 2× passes for an effective 4×. Beyond 4× the artefacts dominate — AI cannot invent detail that was never captured. If you need 8× with meaningful detail, re-shoot the source or accept the limit.
Why do faces sometimes look uncanny?
Real-ESRGAN was trained on photos of real people but the network has no semantic understanding of which face it is upscaling — it picks the most plausible facial features for the local context, which can make a 5-year-old child look subtly older or shift the bridge of a nose. Drop to 2×, or accept the artefact.
Does the model upload my photos?
No. The ~50 MB ONNX model downloads to your browser cache on first use; after that, inference runs entirely on your device via WebAssembly + WebGPU.
Can I batch-process multiple files?
Yes — the tool queues files and runs them sequentially on the same loaded model. The first file pays the cold-start; subsequent ones run immediately.
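A sketch of that queue, with the same hypothetical `upscale2x` wrapper as above standing in for one model pass:

```ts
// Sequential batch: one inference at a time on the same loaded session,
// so the ~50 MB model and its GPU/Wasm memory are paid for only once.
async function upscaleBatch(
  images: ImageData[],
  upscale2x: (img: ImageData) => Promise<ImageData>,
): Promise<ImageData[]> {
  const results: ImageData[] = [];
  for (const image of images) {
    results.push(await upscale2x(image)); // later files skip the cold start
  }
  return results;
}
```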
Is upscaling the right tool to remove JPEG artefacts?
Sometimes — Real-ESRGAN was trained partly on JPEG-degraded inputs and can clean up mild compression. Heavy block artefacts are usually amplified instead. Try it; if the output looks worse than the source, the artefacts are too severe.
Will it work on AI-generated images?
Often poorly. AI-generated images already contain hallucinated detail; running another AI upscaler compounds the artefacts. For Stable Diffusion outputs, prefer the model's own latent upscaler over a generic upscale.
Try it now
AI Upscale (Real-ESRGAN) — 100% in-browser
ONNX Runtime Web + WebGPU · 2× / 4× super-resolution