How hash-based visitor identification works (and why it's privacy-safe) -- Web-Tracking.eu

The problem: counting unique visitors without tracking them

Counting unique visitors is one of the oldest problems in web analytics. You want to know how many different people came to your site yesterday, how many came back today, how many arrived for the first time from a specific campaign. These are reasonable questions for anyone running a website. Answering them requires some way of saying "these two page views came from the same person, and these two did not."

For decades, the standard answer was to give each visitor a permanent identifier — usually stored in a cookie — and to look that identifier up on every request. A new visitor gets a fresh ID. A returning visitor reuses theirs. Counting uniques is then just a question of counting distinct IDs.

That approach works beautifully for the engineer and terribly for the visitor. A permanent identifier, stored on the device, tied to every request, is exactly the kind of thing the ePrivacy Directive was written to control. It requires consent in the EU, and even where consent is not required it creates a genuine privacy problem: any party with access to that identifier can build a behavioural profile from it.

Modern cookieless analytics solves the same counting problem without storing anything on the device. The mechanism is hash-based visitor identification, and it works by computing a short-lived server-side identifier from data the server already has.

Why cookies and localStorage solved it (and created legal problems)

The appeal of cookies and localStorage for analytics was obvious: they give you a stable identifier that survives across requests, across tabs, and often across sessions, without any server-side state. The browser does all the work. You set a cookie on the first visit, you read it on every subsequent request, you never have to think about identity again.

The trouble is that this mechanism is legally regulated. Article 5(3) of the ePrivacy Directive treats "storing information" and "gaining access to information already stored" on the user's device as the trigger for consent. Cookies are the canonical example. So is localStorage. So is sessionStorage. So is IndexedDB. So is any kind of device fingerprinting that reads stable properties of the hardware or software.

That means every analytics tool that relies on device storage for counting uniques either needs a consent banner or needs to qualify for one of the narrow exemptions, which for analytics typically requires strictly first-party use, no cross-site tracking, limited retention, and a supervisory authority willing to accept the configuration. The compliance envelope is narrow, and the consequences of getting it wrong have been severe. Austrian, French, and Italian DPAs have all ruled against websites running Google Analytics, although those decisions turned mainly on transfers of personal data to the US rather than on cookie consent.

The hash-based approach sidesteps this entirely by moving the identifier off the device.

The hash-based alternative

The core idea is simple: instead of storing an identifier on the device and sending it back to the server on every request, the server computes an identifier from scratch on every request using attributes that every request already carries (the source IP address and the standard HTTP headers). The browser's storage is never involved. Nothing is stored. Nothing is read.

For this to work as a visitor counter, the computed identifier has to be:

Stable across the time period you want to count (typically a day)
Unpredictable to anyone without access to the server-side secret
Unlinkable across time periods or across sites, so it can't be used to build a long-term profile
Irreversible, so the server cannot (and attackers cannot) recover the original inputs from the stored hash

A cryptographic hash function gives you all of these properties when you feed it the right inputs and rotate the right components.

How it works

Here is the general shape of the algorithm used by Web-Tracking.eu and similar cookieless analytics tools. The exact details vary between vendors, but the building blocks are the same.

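const { createHash } = require('node:crypto');

// dailySalt() is assumed to exist elsewhere and to return a Buffer of
// random bytes that is regenerated once per UTC day.
function visitorId(request, siteId) {
    const ip     = request.remoteAddr;
    const ua     = request.headers['user-agent'] || '';
    const date   = new Date().toISOString().slice(0, 10);   // UTC, YYYY-MM-DD
    const salt   = dailySalt().toString('hex');             // hex-encode so the bytes concatenate cleanly

    const input  = [ip, ua, siteId, date, salt].join('|');
    const hash   = createHash('sha256').update(input).digest('hex');

    // raw inputs are discarded here, only the hash leaves this function
    return hash;
}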

That is the whole mechanism. On every incoming request to our tracker endpoint, the server:

Takes the IP address and User-Agent header from the request
Appends the site ID, the current UTC date, and the per-day secret salt
Runs the whole thing through SHA-256
Uses the resulting hash as the visitor ID for that request
Discards the original IP and User-Agent; only the hash is stored (the salt stays on the server until the next rotation and is then deleted)

The browser never sees this hash. It is not written to a cookie, not returned in a Set-Cookie header, not stored in localStorage, not echoed back in any response. It lives entirely in our database as a 64-character hex string that identifies "the visitor who sent these requests on this day."
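To make the storage side concrete, here is a minimal sketch of how a tracker endpoint might use the hash to count daily uniques. The `handleHit` and `dailyUniques` functions, the in-memory `uniques` map, and the shape of the `request` object are illustrative assumptions, not our production code; a real deployment would write to a database instead of a `Map`.

```javascript
const { createHash } = require('node:crypto');

// Illustrative in-memory store: "siteId|date" -> Set of visitor hashes.
const uniques = new Map();

function visitorId(ip, ua, siteId, date, salt) {
  const input = [ip, ua, siteId, date, salt.toString('hex')].join('|');
  return createHash('sha256').update(input).digest('hex');
}

// Hypothetical handler: `request` is a plain object with `remoteAddr` and `headers`.
function handleHit(request, siteId, salt, now = new Date()) {
  const date = now.toISOString().slice(0, 10); // UTC, YYYY-MM-DD
  const id = visitorId(request.remoteAddr, request.headers['user-agent'], siteId, date, salt);

  const key = `${siteId}|${date}`;
  if (!uniques.has(key)) uniques.set(key, new Set());
  uniques.get(key).add(id); // only the hash is kept

  // The response carries no Set-Cookie header and no copy of the hash.
  return { status: 204, headers: {} };
}

// Counting uniques is just counting distinct hashes for that site and day.
function dailyUniques(siteId, date) {
  const set = uniques.get(`${siteId}|${date}`);
  return set ? set.size : 0;
}
```

Repeated hits from the same IP and User-Agent on the same day land on the same hash, so the set size is the unique-visitor count for that day.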

Why daily rotation matters

The date component of the input is the single most important piece of privacy engineering in this scheme. Without it, every request from the same (IP, User-Agent, site) triple would produce the same hash forever. The hash would effectively be a persistent, global identifier — exactly the thing we are trying to avoid.

With daily rotation, the hash for a given visitor changes at midnight UTC every day. A visitor who comes back tomorrow produces a completely different hash, indistinguishable from a brand-new visitor. We cannot link their Monday visits to their Tuesday visits, because we do not have any mechanism to bridge the two hashes.
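The effect is easy to check with a small sketch. The `visitorId` helper below follows the input layout described earlier; the IP address (from a documentation range), the User-Agent string, and the dates are invented for the example.

```javascript
const { createHash, randomBytes } = require('node:crypto');

function visitorId(ip, ua, siteId, date, salt) {
  const input = [ip, ua, siteId, date, salt.toString('hex')].join('|');
  return createHash('sha256').update(input).digest('hex');
}

const ip = '203.0.113.7';            // documentation-range address
const ua = 'Mozilla/5.0 (example)';  // made-up User-Agent

// Each day gets its own date string and its own random salt.
const monday  = { date: '2024-05-06', salt: randomBytes(32) };
const tuesday = { date: '2024-05-07', salt: randomBytes(32) };

const mondayMorning = visitorId(ip, ua, 'site-1', monday.date, monday.salt);
const mondayEvening = visitorId(ip, ua, 'site-1', monday.date, monday.salt);
const tuesdayVisit  = visitorId(ip, ua, 'site-1', tuesday.date, tuesday.salt);

console.log(mondayMorning === mondayEvening); // true: stable within the day
console.log(mondayMorning === tuesdayVisit);  // false: nothing links the two days
```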

This sacrifices some analytical power. We cannot compute a 7-day unique visitor count by counting distinct hashes; we have to estimate it from daily counts and some assumption about repeat rate, or present daily uniques and let the operator interpret them. The trade-off is deliberate: we get to measure audience without building a tracking infrastructure.
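One way to make such an estimate: the true 7-day figure can be no lower than the busiest day and no higher than the sum of the daily counts, and an assumed repeat rate picks a point in between. The sketch below is purely illustrative. `estimateWeeklyUniques` and its `repeatRate` parameter (the share of each later day's visitors assumed to have already visited earlier in the week) are hypothetical, the daily counts are invented, and the function expects a non-empty array.

```javascript
function estimateWeeklyUniques(dailyCounts, repeatRate) {
  const upper = dailyCounts.reduce((sum, n) => sum + n, 0); // nobody ever returns
  const lower = Math.max(...dailyCounts);                   // the week has at least the busiest day's visitors

  // Day 1 is all new; on later days only (1 - repeatRate) of visitors are assumed new.
  const [first, ...rest] = dailyCounts;
  const raw = first + rest.reduce((sum, n) => sum + n * (1 - repeatRate), 0);

  // Keep the estimate inside the bounds that hold regardless of the assumption.
  const estimate = Math.round(Math.min(upper, Math.max(lower, raw)));
  return { lower, estimate, upper };
}

const daily = [100, 120, 90, 110, 130, 80, 70]; // invented example counts
console.log(estimateWeeklyUniques(daily, 0.5));
// { lower: 130, estimate: 400, upper: 700 }
```

The bounds are facts about the data; the point estimate is only as good as the repeat-rate assumption behind it.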

Some vendors rotate on a different schedule (weekly, monthly). Our view is that daily is the right default. Weekly and monthly rotation give slightly better analytics but weaker privacy guarantees, and any DPA challenging a cookieless approach is more likely to accept a short rotation window.

Why the salt matters

The per-day secret salt does two things. First, it prevents rainbow-table attacks: without the salt, an attacker who got hold of a list of our hashes could precompute hashes for common (IP, User-Agent) combinations and recover inputs. With a secret salt that we never disclose, the hash space is effectively randomised and pre-computation is useless.
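A toy version of that attack shows the difference. In the sketch below the attacker knows the User-Agent and only has to search one /24 of documentation-range addresses; a real attacker would enumerate far more candidates, but the principle is the same. The `bruteForce` helper and all values are invented for illustration.

```javascript
const { createHash, randomBytes } = require('node:crypto');

const sha256 = (s) => createHash('sha256').update(s).digest('hex');

const ua = 'Mozilla/5.0 (example)';
const realIp = '203.0.113.42'; // documentation-range address

// An attacker who knows the input layout tries every address in a /24.
function bruteForce(targetHash, makeInput) {
  for (let last = 0; last < 256; last++) {
    const candidate = `203.0.113.${last}`;
    if (sha256(makeInput(candidate)) === targetHash) return candidate;
  }
  return null;
}

// Unsalted: the attacker can rebuild the exact input and recover the IP.
const unsalted = sha256(`${realIp}|${ua}`);
console.log(bruteForce(unsalted, (ip) => `${ip}|${ua}`)); // 203.0.113.42

// Salted: the attacker would also have to guess 32 random bytes, and cannot.
const secretSalt = randomBytes(32).toString('hex');
const salted = sha256(`${realIp}|${ua}|${secretSalt}`);
const attackerGuess = randomBytes(32).toString('hex');
console.log(bruteForce(salted, (ip) => `${ip}|${ua}|${attackerGuess}`)); // null
```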

Second, the salt means that even if the exact algorithm and the full input structure were published (as they are in this post), no one outside our servers can compute the visitor ID for a known (IP, User-Agent) combination. You would need the salt, and the salt is rotated daily and never leaves the server.

The salt is long (32 bytes of cryptographically random data), stored in a secret manager separate from the analytics database, and rotated every 24 hours. Old salts are deleted at rotation, so even someone holding a full database dump and today's salt could not recompute yesterday's hashes.
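The rotation logic itself is small. The sketch below is a simplified, single-process stand-in for a secret manager: `DailySaltStore` is a hypothetical class that keeps the current salt in memory only, whereas a multi-server deployment would need a shared secret store so that every server uses the same salt for the day.

```javascript
const { randomBytes } = require('node:crypto');

class DailySaltStore {
  constructor() {
    this.date = null; // UTC date the current salt belongs to
    this.salt = null; // Buffer of 32 random bytes
  }

  // Returns the salt for the UTC day containing `now`, rotating if the day changed.
  saltFor(now = new Date()) {
    const date = now.toISOString().slice(0, 10);
    if (date !== this.date) {
      if (this.salt) this.salt.fill(0); // overwrite the old salt before dropping it
      this.salt = randomBytes(32);
      this.date = date;
    }
    return this.salt;
  }
}

const store = new DailySaltStore();
const today = store.saltFor();
console.log(today.length); // 32
```

Because the old buffer is overwritten at rotation, nothing in the process can recompute the previous day's hashes once the day has changed.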

What data is hashed

The ingredients of our hash are deliberately minimal:

IP address — identifies the network origin of the request. Inside the hash, never stored raw.
User-Agent — identifies the browser and OS in broad terms. Inside the hash, never stored raw as an identifier.
Site ID — scopes the hash to a single website. A visitor appearing on two sites tracked by Web-Tracking.eu produces two unrelated hashes.
Current UTC date — rotates the identifier daily.
Secret salt — rotates the identifier daily and prevents reverse engineering.

We do not hash in any device properties that would require reading from the browser (screen size, language, plugins, fonts, canvas rendering, etc.). Anything the browser sends voluntarily in the HTTP headers is fair game; anything that would require a JavaScript read from the device is not.

This is an important line. It means our "fingerprint" is only as stable as the IP address and User-Agent. A visitor who switches networks mid-day gets a new hash. A visitor who updates their browser gets a new hash after the update. A visitor who uses a VPN that rotates exit IPs gets a new hash on every rotation. We accept this imprecision as the cost of not building a real fingerprint.
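A short sketch illustrates how sensitive the hash is to each ingredient. The `visitorId` helper follows the input layout described earlier; the site names, addresses, and `ExampleBrowser` User-Agent strings are made up for the example.

```javascript
const { createHash, randomBytes } = require('node:crypto');

function visitorId(ip, ua, siteId, date, salt) {
  const input = [ip, ua, siteId, date, salt.toString('hex')].join('|');
  return createHash('sha256').update(input).digest('hex');
}

const salt = randomBytes(32);
const date = '2024-05-06';
const base = { ip: '203.0.113.7', ua: 'ExampleBrowser/124.0', siteId: 'shop.example' };

const idFor = (v) => visitorId(v.ip, v.ua, v.siteId, date, salt);

const original     = idFor(base);
const otherSite    = idFor({ ...base, siteId: 'blog.example' });   // same visitor, second site
const otherNetwork = idFor({ ...base, ip: '198.51.100.23' });      // switched networks mid-day
const afterUpdate  = idFor({ ...base, ua: 'ExampleBrowser/125.0' }); // browser updated

const distinct = new Set([original, otherSite, otherNetwork, afterUpdate]);
console.log(distinct.size);            // 4: every variation yields an unrelated hash
console.log(idFor(base) === original); // true: unchanged inputs stay stable
```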

Why this doesn't identify users

The hash output is a 64-character hex string that looks like:

a3f2c9b4e8d1f7a5c6b9e2d4f8a1c3b5e7d9f2a4c6b8e1d3f5a7c9b2e4d6f8a0

On its own, that string cannot be linked to any individual. It cannot be reversed to recover the IP or User-Agent. It cannot be correlated across days because the date and salt change. It cannot be correlated across sites because the site ID changes.

The raw inputs (IP and User-Agent) exist in memory for the milliseconds it takes to compute the hash. They are never written to disk, never copied to logs, never sent to any third party, never retained in any form. The only thing that persists is the hash, and the hash has none of the properties that would make it personal data under GDPR Article 4(1): it does not identify, it cannot single out, and it cannot be linked to an identifiable person without access to data we do not have.

The Article 29 Working Party's Opinion 05/2014 on anonymisation techniques sets out three risks to test against: singling out, linkability, and inference. Our daily-rotating, site-scoped, salted hash is designed to be useless as an identifier of individuals on all three counts, which is precisely the point.

Limitations: same-household users, shared IPs, VPNs

Hash-based identification is not a perfect counter. It trades accuracy for privacy, and the ways it is imperfect are worth understanding.

Same-household users sharing an IP address and using the same browser version will produce the same hash. Two people on the same home Wi-Fi, both using Chrome on Windows, both on the same site on the same day, will be counted as one visitor. Cookie-based analytics would distinguish them (each gets their own cookie); hash-based analytics will not.

Corporate networks behind a NAT can collapse many real visitors into a single hash. An office of 200 people accessing your site from the same public IP, all running the same IT-pushed browser build, will look like one visitor. Cookie-based analytics would distinguish them; hash-based analytics will not.

VPN and Tor users have the opposite problem: a single person whose exit IP rotates will produce multiple hashes and be counted multiple times. Cookie-based analytics would see them as one visitor; hash-based analytics will overcount.

Users on multiple devices will always be counted separately, because the IP + User-Agent combination is different. This is the same limitation as cookie-based analytics without cross-device identity linking.

The net effect is that cookieless analytics tends to under-count uniques in dense shared environments and over-count uniques for privacy-conscious users with rotating IPs. Both effects wash out in aggregate for most websites, but if your audience is concentrated in corporate networks or heavily skews towards VPN users, the numbers may drift from cookie-based counts by several percent in either direction.
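Both effects can be reproduced with a handful of invented hits. In the sketch below, the three people, their addresses (documentation ranges), and the abbreviated User-Agent string are all made up; `visitorId` follows the input layout described earlier.

```javascript
const { createHash, randomBytes } = require('node:crypto');

function visitorId(ip, ua, siteId, date, salt) {
  const input = [ip, ua, siteId, date, salt.toString('hex')].join('|');
  return createHash('sha256').update(input).digest('hex');
}

const salt = randomBytes(32);
const date = '2024-05-06';
const chrome = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/124.0'; // abbreviated

const hits = [
  { person: 'alice', ip: '198.51.100.10', ua: chrome }, // home Wi-Fi
  { person: 'bob',   ip: '198.51.100.10', ua: chrome }, // same Wi-Fi, same browser
  { person: 'carol', ip: '203.0.113.1',   ua: chrome }, // VPN exit 1
  { person: 'carol', ip: '203.0.113.2',   ua: chrome }, // VPN exit 2
  { person: 'carol', ip: '203.0.113.3',   ua: chrome }, // VPN exit 3
];

const people = new Set(hits.map((h) => h.person));
const hashes = new Set(hits.map((h) => visitorId(h.ip, h.ua, 'site-1', date, salt)));

console.log(people.size); // 3 real visitors
console.log(hashes.size); // 4 hashes: alice and bob collapse into 1, carol splits into 3
```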

We think this is an acceptable trade-off, and so do most of our customers. Analytics is not census data. What matters is the shape of the curve, the comparison between days, the impact of a campaign, the ranking of pages. All of those are preserved perfectly well by hash-based counting.

Comparison with other cookieless approaches

Several analytics vendors have landed on variations of the same basic architecture.

Plausible uses a per-site, daily-rotating hash of IP + User-Agent + site domain + salt, which is very close to what is described above.
Fathom uses a similar hash-based approach with server-side computation and no client-side storage.
Pirsch uses a daily-rotating fingerprint based on salted IP + User-Agent.
Simple Analytics is the outlier: its documentation describes inferring unique visits from the referrer rather than hashing IP + User-Agent, which also avoids client-side storage.

The differences are mostly in:

Rotation frequency (daily is typical, but not universal)
Whether the salt is shared across customers or per-site
Which exact header fields are included in the hash input
Whether IP addresses are truncated (IPv4: last octet zeroed, IPv6: last 80 bits zeroed) before hashing, for additional privacy
How long the resulting hash is retained in aggregated form
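For the truncation option in the list above, a minimal sketch might look like this. `truncateIp` is a hypothetical helper, not any vendor's actual code; it zeroes the last octet of an IPv4 address or the last 80 bits of an IPv6 address, and deliberately rejects forms it does not handle (IPv4-mapped IPv6 addresses and zone IDs).

```javascript
const net = require('node:net');

function truncateIp(ip) {
  if (net.isIPv4(ip)) {
    const octets = ip.split('.');
    octets[3] = '0'; // drop the last 8 bits
    return octets.join('.');
  }

  if (net.isIPv6(ip) && !ip.includes('.') && !ip.includes('%')) {
    // Expand "::" so that there are always eight 16-bit groups.
    const [head, tail] = ip.split('::');
    const headGroups = head ? head.split(':') : [];
    const tailGroups = tail ? tail.split(':') : [];
    const zeros = Array(8 - headGroups.length - tailGroups.length).fill('0');
    const groups = [...headGroups, ...zeros, ...tailGroups];

    // Keep the first 48 bits (three groups); the remaining 80 bits become zero.
    const prefix = groups.slice(0, 3).map((g) => parseInt(g, 16).toString(16));
    return prefix.join(':') + '::';
  }

  throw new TypeError(`unsupported address: ${ip}`);
}

console.log(truncateIp('203.0.113.42'));                          // 203.0.113.0
console.log(truncateIp('2001:db8:85a3:8d3:1319:8a2e:370:7348'));  // 2001:db8:85a3::
```

Truncating before hashing makes neighbouring addresses collide on purpose, which trades a little more counting accuracy for a larger anonymity set.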

What is remarkable is the convergence. Five years ago, privacy-first analytics was a messy space with many competing architectures. Today, most of the serious players use server-side hashing with some form of rotation, because it cleanly avoids the ePrivacy trigger without compromising on the core analytical task.

The bottom line

Hash-based visitor identification is not a hack or a workaround. It is a principled solution to a real problem: counting unique visitors without storing anything on their devices and without retaining raw identifiers on the server. Done properly, with daily rotation, a rotating secret salt, and discarded raw inputs, it gives you usable analytics with strong privacy properties and no ePrivacy consent requirement.

Done improperly — with long rotation windows, shared salts, retained IPs, or additional device fingerprinting — it degrades into just another tracking scheme. The devil is in the engineering details, which is why it is worth understanding what your analytics vendor actually does under the hood.

Web-Tracking.eu is a cookieless analytics platform. Learn more about our approach in the no-cookie-banner legal explainer.