Local-only URL shortener

Encodes the URL into the URL — no server, no database, no lookup. Every shortened link is a self-contained, lossless, decodable string.

How it works

Read the code + full explanation at https://github.com/shabda/local-only-link-shortener

Every short link this page produces is fully self-contained: the encoded URL is the URL. The server doesn't see the encoded data, because the encoded part lives in the URL fragment (after the #) and browsers don't send fragments in HTTP requests. Open the HTML file straight from disk and the shortener still works.

1. The big lever: a denser visible alphabet (~75% of the savings)

Base64 packs 6 bits per visible character. Base32768 packs 15. Its alphabet is 32,768 carefully chosen BMP code points (CJK Unified, CJK Extension A, Hangul Syllables — all 3 UTF-8 bytes, no surrogates, no right-to-left, no formatting). Each visible character carries 2.5× more information than base64.

Just swapping ASCII for base32768 — no compression at all — already drops visible URL length to 0.540× on real URLs. That's about three-quarters of the total visible-char win this page ever achieves; the entire compression stack underneath only adds the remaining 0.16. The alphabet is corpus-independent: it's a property of Unicode and UTF-8, not of which URLs we measured.
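The 0.540× figure is just bits-per-visible-character arithmetic; a back-of-envelope sketch (not the encoder itself):

```javascript
// Visible characters needed to carry N input bytes at a given bit density.
// base64 carries 6 bits per visible char; base32768 carries 15.
function visibleChars(nBytes, bitsPerChar) {
  return Math.ceil((8 * nBytes) / bitsPerChar);
}

const n = 75;         // typical URL length from the benchmark corpora
visibleChars(n, 6);   // base64:    100 visible chars
visibleChars(n, 15);  // base32768:  40 visible chars, 40/75 ≈ 0.533
```

The measured 0.540 sits just above the ideal 0.533 once the partially-filled final 15-bit group is averaged over real URL lengths.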

We also offer basE91 — 91 ASCII chars, ~6.5 bits/char, all URL-fragment-safe under WHATWG. It's only 8% denser than base64 visually, but it stays ASCII so wire bytes stay cheap. Toggle the mode at the top to see both.

2. The flip side: wire bytes (why the alphabet alone isn't enough)

Each base32768 character is 3 UTF-8 bytes on the wire. So base32768-alone expands bytes-on-the-wire to 1.619× the original on real URLs — a 60% blowup. The visible URL is short; the bytes flying through HTTP are not. Closing that gap is what the compression stack underneath is for.
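The 1.6× blowup falls straight out of the UTF-8 cost: each 3-byte character carries 15 bits, i.e. 5 bits per wire byte versus 8 for the raw URL. Again a sketch of the arithmetic, not the encoder:

```javascript
// Wire bytes for N input bytes under base32768:
// ceil(8N/15) visible chars × 3 UTF-8 bytes each.
function wireBytesBase32768(nBytes) {
  return Math.ceil((8 * nBytes) / 15) * 3;
}

wireBytesBase32768(75);  // 120 wire bytes for a 75-byte URL: 120/75 = 1.6×
```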

3. Compression stack: prefix table → dict-deflate → structural preprocessor

Prefix table. 85 hand-picked URL prefixes (https://www.youtube.com/watch?v=, https://github.com/, etc.) shared by encoder and decoder. The longest matching prefix becomes a 1-byte index; 0xFF means "no match, raw URL follows." A single byte reconstructs ~30 bytes of URL.
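A minimal sketch of the mechanism, with a hypothetical two-entry table standing in for the real 85-entry one:

```javascript
// Hypothetical mini-table; the live encoder ships 85 entries.
const PREFIXES = [
  'https://www.youtube.com/watch?v=',
  'https://github.com/',
];

// Longest matching prefix becomes a 1-byte index; 0xFF means "raw URL follows".
function encodePrefix(url) {
  let best = -1;
  for (let i = 0; i < PREFIXES.length; i++) {
    if (url.startsWith(PREFIXES[i]) &&
        (best < 0 || PREFIXES[i].length > PREFIXES[best].length)) {
      best = i;
    }
  }
  return best < 0 ? [0xff, url] : [best, url.slice(PREFIXES[best].length)];
}

function decodePrefix([index, rest]) {
  return index === 0xff ? rest : PREFIXES[index] + rest;
}
```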

Deflate with a pre-shared dictionary. Plain deflate doesn't help on a 75-byte URL — its LZ77 window starts empty, so there's nothing to back-reference. We seed the window with a hand-curated 1.5 KB dictionary of common URL fragments (?utm_source=&utm_medium=, /wiki/, common TLDs, query keys, file extensions, …). Hot strings live near the end of the dictionary because shorter LZ77 distances encode in fewer Huffman bits. Both encoder and decoder load the same dictionary at init time, so it doesn't take a single byte of wire space.

Structural preprocessor. Deflate can't compress high-entropy character runs (commit hashes, content hashes, large integer IDs) and still pays Huffman literal cost per character. So before deflate sees the URL, we scan for four URL-grammar patterns and replace each with a marker byte sequence (markers are byte values below 0x20 — never valid in URL strings, so unambiguous):

  • 0x01 LL … — digit run (≥ 6 chars). A 19-char Twitter status ID packs to 8 bytes.
  • 0x02 LL … — lowercase hex run (≥ 8 chars, ≥ 1 a-f). A 40-char Git SHA packs to 20 bytes.
  • 0x03 + 3 bytes — /YYYY/MM/DD/ path. Year (14 bits) · month (4) · day (5) packed into 3 bytes. Used by every CMS / news platform — fires on 10% of real URLs and accounts for almost all of the preprocessor's contribution.
  • 0x04 + 16 bytes — RFC 4122 UUID (8-4-4-4-12 hex). 36 chars → 17 bytes.
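As an illustration of the first packer, a digit run can be re-encoded as a binary integer. This sketch (marker + length byte + minimal big-endian value bytes, an assumed layout) shows why a 19-digit ID needs only 8 payload bytes: 10^19 < 2^64.

```javascript
// Pack a decimal digit run as: 0x01, length byte, big-endian value bytes.
function packDigits(run) {
  let v = BigInt(run);
  const bytes = [];
  do { bytes.unshift(Number(v & 0xffn)); v >>= 8n; } while (v > 0n);
  return [0x01, run.length, ...bytes];
}

function unpackDigits(packed) {
  let v = 0n;
  for (const b of packed.slice(2)) v = (v << 8n) | BigInt(b);
  return v.toString().padStart(packed[1], '0'); // length byte restores leading zeros
}
```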

Plus an RFC 3986 canonicalisation pass: %XX sequences for unreserved chars (§6.2.2.2), lowercase scheme + host (§3.1, §3.2.2), strip default port (:80/:443), strip trailing / on host-only URLs, strip empty trailing ?/#. All lossless at URI-equivalence level. In practice these almost never fire on real URLs — modern parsers normalise URLs everywhere they touch them — but they're correct by spec.
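A rough regex-based sketch of that canonicalisation pass (not the real implementation, which may order and scope these steps differently):

```javascript
function canonicalise(url) {
  // Lowercase scheme and host (RFC 3986 §3.1, §3.2.2).
  url = url.replace(/^([A-Za-z][A-Za-z0-9+.-]*):\/\/([^/?#]*)/,
    (_, scheme, host) => scheme.toLowerCase() + '://' + host.toLowerCase());
  // Strip default ports (:80 / :443).
  url = url.replace(/^(https:\/\/[^/?#]*):443(?=[/?#]|$)/, '$1')
           .replace(/^(http:\/\/[^/?#]*):80(?=[/?#]|$)/, '$1');
  // Percent-decode unreserved characters only (§6.2.2.2).
  url = url.replace(/%([0-9A-Fa-f]{2})/g, (m, hex) => {
    const ch = String.fromCharCode(parseInt(hex, 16));
    return /[A-Za-z0-9\-._~]/.test(ch) ? ch : m;
  });
  // Strip trailing "/" on host-only URLs and empty trailing "?" / "#".
  return url.replace(/^(\w+:\/\/[^/?#]+)\/$/, '$1').replace(/[?#]$/, '');
}
```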

None of the structural packers come from looking at URL data. They're properties of URL syntax (RFC 3986/3987/4122) and universal internet conventions, so they help on any URL that happens to contain them — whether we've heard of the host or not.

4. Variable-width base32768 tail — saves wire bytes only

Base32768 encodes the input bit stream in 15-bit chunks. When the input doesn't divide evenly into 15-bit groups, the trailing 1–14 bits get padded out to a full 15 bits and the last char is still drawn from the main BMP alphabet (3 UTF-8 bytes).

For trailing bits in 1..7 we instead pull the last char from a 254-codepoint alphabet carved out of Latin Extended A/B (U+00C0..U+01BD) — all 2-byte UTF-8, all URL-fragment-safe, none combining or RTL. Sub-ranges within that alphabet encode both the value and the bit count B, so the decoder reads exactly B bits from the tail char with no padding waste. Saves ~1 wire byte on the ~half of URLs whose post-deflate length lands in this regime.

No effect on visible char count — that's ceil(8N/15) regardless. Wire bytes only.
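One consistent way to carve the 254 codepoints into sub-ranges (the exact layout here is an assumption; the real encoder's mapping may differ) is to give the B-bit values the 2^B codepoints starting at offset 2^B - 2, since 2^1 + 2^2 + … + 2^7 = 254 fills U+00C0..U+01BD exactly:

```javascript
const TAIL_BASE = 0x00c0; // first codepoint of the 254-char tail alphabet

// Encode B trailing bits (value `bits`, 1 <= B <= 7) as one 2-byte UTF-8 char.
function encodeTail(bits, B) {
  const offset = (1 << B) - 2;           // start of the B-bit sub-range
  return String.fromCodePoint(TAIL_BASE + offset + bits);
}

// Recover both the bit count B and the value from the codepoint alone.
function decodeTail(ch) {
  const index = ch.codePointAt(0) - TAIL_BASE;
  let B = 1;
  while (index >= (1 << (B + 1)) - 2) B++; // find the sub-range holding index
  return { B, bits: index - ((1 << B) - 2) };
}
```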

5. Free dispatch — 0 bits of mode marker

basE91 output is ASCII printable (U+0021…U+007E); base32768 output is CJK / Hangul (≥ U+3400). Disjoint Unicode ranges, so the decoder just inspects the first character:

if (s.codePointAt(0) >= 0x3400) decodeBase32768(s);
else                            decodeBasE91(s);

No marker char, no marker byte, no length prefix. The alphabet itself is the signal.

What didn't work

Brotli (with its 120 KB built-in static dictionary, tuned for HTML) lost to our 1.5 KB URL-tuned dictionary on URL-shaped input — its dictionary is full of English words and HTML tags that don't appear in URLs, and its per-stream overhead is heavier than deflate's on ~75-byte inputs.

URL grammar decomposition (parse the URL into scheme · subdomain · base · TLD · path, encode each piece with its own table) lost by 0.01%. A hand-curated prefix table reduces a known (scheme, host) pair to a 1-byte index. Grammar decomposition needs ~10 bytes of structural overhead (marker + scheme + subdomain + base length + TLD index + …) to describe the same thing. The simpler mechanism is already doing the work.

A universal-only dictionary (drop all popular-site entries, keep only RFC + framework conventions) made things worse, not better: 0.383 → 0.412 chars on real URLs. Real internet traffic actually does hit popular sites a lot, and the dict's github.com/ · youtube.com/watch?v= · wikipedia.org/wiki/ entries earn their bytes even on data the encoder hasn't seen.

zstd with a trained 16 KB dictionary (v15) did work — about 1% better than v14 on real URLs, with the dictionary trained on a fully held-out HN/Reddit snapshot disjoint from the bench corpus. But for ~50-byte inputs zstd's frame format eats most of the headroom, and shipping zstd to the browser would more than double the bundle (~62 KB → ~140 KB with zstd-wasm) for ~1% gain. Lives in the bench as a documented experiment; v14 stays the live encoder.

Benchmark

Two corpora, both averaging ~75 characters:

  • Synthetic — 1,000 URLs generated from common shapes (YouTube, GitHub, Wikipedia, news, etc.). Stresses the prefix table and dict.
  • Real — 4,313 URLs from Hacker News + Reddit across 30 subreddits and several time windows. Stresses the long-tail (indie blogs, substacks, news sites the dict has never heard of).

The two columns answer different questions: synthetic asks "how well does each trick do when the input matches the encoder's assumptions?", real asks "how well does each trick generalise?". Lower ratio = shorter.

Version  Trick                                            Synth chars  Real chars  Synth bytes  Real bytes
v1       Passthrough                                      1.000        1.000       1.000        1.000
v2       Plain base64 (anchor)                            1.351        1.350       1.351        1.350
α        basE91 alphabet only (no compression)            1.236        1.236       1.236        1.236
α        base32768 alphabet only (no compression)         0.540        0.540       1.619        1.619
v3       Deflate + base64url                              1.240        1.183       1.240        1.183
v4       + shared dictionary                              0.861        0.985       0.861        0.985
v5       + prefix table                                   0.781        0.957       0.781        0.957
v6       brotli (control, lost)                           0.839        0.996       0.839        0.996
v7       + basE91 alphabet                                0.720        0.882       0.720        0.882
v8       + base32768 alphabet                             0.317        0.386       0.952        1.159
v9       + free dispatch (no marker)                      0.317        0.386       0.720        0.882
v10      + digit/hex run packing                          0.307        0.383       0.696        0.875
v11      grammar decomp (control, ~tied)                  0.307        0.383       0.696        0.875
v12      universal-dict only (control, lost)              0.417        0.412       1.252        1.236
v13      + date / UUID / canonicalise                     0.306        0.379       0.695        0.866
v14      + variable-width tail / RFC-canonicalise (LIVE)  0.306        0.379       0.694        0.860
v15      + zstd-trained dict (Node-only experiment)       0.346        0.376       0.783        0.854

Reading the table:

  • The two α rows isolate the alphabet from everything else. Base32768 alone gets visible chars to 0.540 on real URLs; the entire stack adds 0.16 to land at 0.379. The alphabet does most of the work.
  • The dict (v3 → v4) and prefix table (v4 → v5) save much more on synthetic than on real — they're knowledge-based, and that knowledge applies more cleanly to URLs the generator templated around them. They still help real URLs (just less).
  • The structural preprocessor (v9 → v13) is corpus-independent and helps both, with most of the lift coming from the /YYYY/MM/DD/ packer alone.
  • v15 swaps deflate-with-hand-curated-dict for zstd-with-corpus-trained-dict (held-out training set, fully disjoint from the real bench). It wins on real URLs (-1.0% chars, -1.0% bytes) but regresses on synthetic — exactly the signature of "stop using corpus-tuned knowledge, gain generalisation." It's a Node-only experiment for now: the browser has no native zstd in 2026 and adding zstd-wasm would more than double the bundle size for ~1% gain.

Stack

Pure HTML + JS + pako for deflate. No build step, no server. The whole shortener — encoder, decoder, redirect — is one file you can open from disk.