Test 26: Preventing Data Leaks in Documentation: Stop Using Real Domains

jb01455
Jan 26

Technical documentation often aims to be helpful by showing realistic URLs, email addresses, and API endpoints. The problem is that “realistic” too easily becomes “real.” A single live domain name in a public guide can expose internal structure, reveal vendor relationships, and hand attackers a clean starting point for reconnaissance.

The simplest fix is also the most reliable: never use real, non-reserved domains in examples. Use the domain names that were set aside specifically for documentation, testing, and screenshots.

Why real domains in docs are risky

A domain name is rarely just a string. It is an identifier that connects to DNS records, TLS certificates, public scan data, historical ownership records, and sometimes working services. When documentation includes a real domain, it publishes a durable breadcrumb that is easy to copy, index, and probe.

This is not limited to public marketing docs. API references, troubleshooting guides, runbooks mistakenly shared outside the intended audience, sample configuration snippets in repos, and slide decks all get scraped and re-shared. Once the domain is out, retracting it rarely removes it from caches and archives.

A subtle issue is “accidental activation.” Readers frequently copy and paste sample commands. If the sample uses a real host, a harmless experiment can become an unwanted request to a production system.

What can leak from a simple hostname

Even when no secrets are shown, domain structure reveals context. Subdomains and hostnames often encode environment, region, role, and ownership, which can help an adversary map systems faster than scanning alone.

After you describe the risk in your writing standards, a concrete checklist helps authors self-review. Common leak categories include:

Internal topology
Partner relationships
Product code names
Environment naming patterns
Employee naming conventions in email examples

Those are not abstract concerns. A hostname built from labels such as db, dev, and corp communicates “database,” “development,” and “corporate.” That can drive targeted password spraying, phishing themes, and service fingerprinting attempts.

How attackers and bots use documentation breadcrumbs

Attackers routinely read documentation because it is curated reconnaissance. Docs tell them which endpoints matter, which protocols are supported, how authentication is meant to work, and what error messages look like. A real domain inside that context reduces effort.

Automation amplifies the exposure. Search engines index docs and preserve snapshots. Crawlers extract links and hostnames, then feed them into asset discovery tooling that correlates DNS, certificates, and open ports. Even if a domain is not linked as a clickable URL, it is still easy to harvest from HTML, PDFs, and code blocks.

If a real domain later expires, the risk can worsen. Expired domains have been used in multiple classes of attacks, including taking over trust relationships that were built around domain ownership.

Safer alternatives: reserved domains and special-use names

The internet standards community anticipated this problem. RFC 2606 reserves example.com, example.net, and example.org for documentation. These domains are intentionally stable and widely recognized as placeholders. Google’s developer documentation guidance also recommends them for generic examples.

There are also special-use names intended for other safe contexts; the reserved top-level names below are defined in RFC 2606 as well:

Placeholder	Best use	Why it is safer
example.com / example.net / example.org	Public docs, screenshots, sample URLs	Reserved for documentation and predictable
.example (TLD)	Fictional organizations in multi-domain scenarios	Signals “not real,” avoids collisions
.test (TLD)	Automated tests, QA examples	Intended for testing, avoids the public DNS namespace
.invalid (TLD)	Negative examples, “this must fail” cases	Guaranteed invalid, useful in validation docs
localhost	Local-only services in tutorials	Loops back to the user’s machine

When you need a domain that “looks real,” a safe pattern is to keep it under example.com with meaningful subdomains that name a role, such as an API host or an authentication host. It reads naturally while staying inside reserved space.

Example Domain exists for this exact purpose: it is a neutral, stable target that can appear in documentation and screenshots without coordinating with a brand owner or risking a collision with an operational service.
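
As a rough illustration of the reserved space described above, the following Python sketch checks whether a hostname falls under the RFC 2606 names or localhost. The helper name is_reserved_host and the two constant sets are assumptions made for this example, not part of any standard library.

```python
# Reserved second-level domains and top-level names from RFC 2606.
# The constant and function names here are hypothetical, chosen for this sketch.
RESERVED_SECOND_LEVEL = {"example.com", "example.net", "example.org"}
RESERVED_TLDS = {"example", "test", "invalid", "localhost"}


def is_reserved_host(host: str) -> bool:
    """Return True if host sits inside reserved placeholder space."""
    name = host.strip().rstrip(".").lower()
    if not name:
        return False
    labels = name.split(".")
    # Anything ending in .example, .test, .invalid, or .localhost
    # (including the bare word "localhost") is reserved.
    if labels[-1] in RESERVED_TLDS:
        return True
    # example.com / example.net / example.org and any subdomain beneath them.
    return ".".join(labels[-2:]) in RESERVED_SECOND_LEVEL
```

Matching on whole labels rather than on a plain string suffix is deliberate: a name such as notexample.com ends with the characters "example.com" but is not a subdomain of it, so it is rejected.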

Making examples feel real without being real

Some teams avoid placeholders because they worry examples will feel artificial. That usually happens when placeholders are used without context. A reader benefits most from examples that show consistent structure: auth host, API host, and a few paths that match the narrative.

A practical approach is to define a small, reusable fictional system model and apply it across docs: one identity provider host, one API host, one web app host. Keep the names consistent across pages so readers can track what is happening.

A short internal convention document can cover patterns like these:

Use one fixed hostname under a reserved domain for API base URLs
Use a second fixed hostname for OAuth and OIDC examples
Use a third fixed hostname for upload and download flows

If you need to demonstrate multi-tenant routing, keep the tenant identifier in the path or in a subdomain under a reserved domain such as example.com. The realism comes from structure, not from using a live corporate domain.
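
One way to keep such a convention consistent is to generate example URLs from a single lookup table instead of typing them by hand. The sketch below assumes three role names and the hostnames api.example.com, auth.example.com, and files.example.com; these are hypothetical choices for illustration, and a team would substitute its own roles.

```python
from typing import Optional
from urllib.parse import urlunsplit

# Hypothetical fictional system model: one host per role, all under example.com.
DOC_HOSTS = {
    "api": "api.example.com",      # API base URLs
    "auth": "auth.example.com",    # OAuth and OIDC examples
    "files": "files.example.com",  # upload and download flows
}


def doc_url(role: str, path: str = "/", tenant: Optional[str] = None) -> str:
    """Build a documentation URL for a role, optionally with a tenant subdomain."""
    host = DOC_HOSTS[role]  # raises KeyError for roles outside the convention
    if tenant is not None:
        # The tenant label is prepended, so the result stays under example.com.
        host = f"{tenant}.{host}"
    if not path.startswith("/"):
        path = "/" + path
    return urlunsplit(("https", host, path, "", ""))
```

For instance, doc_url("api", "/v1/orders", tenant="tenant-a") yields https://tenant-a.api.example.com/v1/orders, which shows tenant routing without leaving reserved space.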

Process controls: style rules, review, and automation

Policy is the guardrail, but process is what makes it stick. The goal is to make “no real domains” the default path that requires no extra effort from authors.

A lightweight policy section in the writing style guide can be explicit and testable:

Allowed domains: example.com, example.net, example.org, .example, .test, .invalid, and localhost
Disallowed content: company-owned domains, partner domains, customer domains, and any domain that resolves to a real service
Exception handling: documented approval path for the rare cases where a real domain must appear
Scope clarity: stricter rules for public docs, labeled rules for internal-only docs, and clear access control expectations

This is also an editorial workflow issue. Put domain checks on the same checklist as “no secrets” and “no personal data.” Reviewers should search for http://, https://, and common TLDs, then confirm that every occurrence is from the allowlist.

CI and repository scanning patterns

Automation is the fastest way to prevent regressions, especially when documentation lives next to code and changes frequently.

A common implementation is an allowlist scan in CI that fails the build when it finds a domain not on the list. This can be as simple as a script that extracts hostnames from Markdown, HTML, and code blocks, then compares them against approved patterns.
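
A minimal sketch of such a scan, assuming a loose regular expression is good enough for hostname extraction. The names HOST_RE, is_allowed, find_disallowed, and main are hypothetical. The pattern will also match file names such as config.yaml, so a real job would likely need an ignore list on top of this.

```python
import re

# Allowlist of reserved names; a host passes if it equals one of these
# or is a subdomain of one.
ALLOWED_SUFFIXES = (
    "example.com", "example.net", "example.org",
    "example", "test", "invalid", "localhost",
)

# Deliberately loose: dotted labels ending in an alphabetic label of 2+ letters.
HOST_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}\b", re.IGNORECASE
)


def is_allowed(host):
    host = host.lower()
    return any(host == s or host.endswith("." + s) for s in ALLOWED_SUFFIXES)


def find_disallowed(text):
    """Return sorted (line_number, hostname) pairs that are not on the allowlist."""
    hits = set()
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in HOST_RE.finditer(line):
            host = match.group(0).lower()
            if not is_allowed(host):
                hits.add((lineno, host))
    return sorted(hits)


def main(paths):
    """Scan each file and return a process exit code (1 if anything was found)."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as fh:
            for lineno, host in find_disallowed(fh.read()):
                print(f"{path}:{lineno}: disallowed domain {host}")
                failed = True
    return 1 if failed else 0
```

A CI wrapper could call sys.exit(main(sys.argv[1:])) so that any finding fails the build. Scanning the raw text line by line means code blocks, Markdown links, and HTML attributes are all covered without format-specific parsing.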

Teams often combine multiple controls:

A pre-commit hook for fast feedback to authors
A CI job that scans the full repository on every pull request
A periodic scheduled job that scans generated site output, since build steps can introduce new text

If you already run secret scanners like Gitleaks or TruffleHog, adding a custom rule for disallowed domains can consolidate tooling. The point is not perfect parsing. It is catching the common, expensive mistakes early.

Handling screenshots, PDFs, and generated docs

Screenshots and exported documents are frequent sources of “hidden” real domains because the text is not always obvious in the source Markdown. Address this by treating images and PDFs as first-class documentation content, with review requirements and automated checks when possible.

Practical measures include OCR scanning for PDFs and screenshots, and a rule that forbids capturing browser address bars unless they show a placeholder domain. For UI walkthroughs, the safest pattern is to use a mocked environment that displays URLs under a reserved placeholder domain, or to blur and replace the address bar consistently.

Generated docs deserve special attention. OpenAPI specs, Postman collections, and SDK reference generators can pull server URLs from configuration files that were never written for publication. Put the allowlist scan on the generated output, not only on the source.
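
For generated OpenAPI output specifically, a check can target the server URLs directly. The sketch below assumes the spec has already been loaded into a Python dict (for example with json.load) and that server entries follow the OpenAPI 3 shape of a "servers" list of objects with a "url" field. The function names are hypothetical, and the made-up host api.corp.internal in the usage note stands in for a leaked internal name.

```python
from urllib.parse import urlsplit

ALLOWED_SUFFIXES = (
    "example.com", "example.net", "example.org",
    "example", "test", "invalid", "localhost",
)


def iter_server_urls(node):
    """Yield every servers[].url string found anywhere in the spec."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "servers" and isinstance(value, list):
                for server in value:
                    if isinstance(server, dict) and isinstance(server.get("url"), str):
                        yield server["url"]
            else:
                yield from iter_server_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_server_urls(item)


def disallowed_server_urls(spec):
    """Return server URLs whose host is outside the allowlist."""
    bad = []
    for url in iter_server_urls(spec):
        host = urlsplit(url).hostname  # None for relative URLs such as "/v1"
        if host is None:
            continue
        if not any(host == s or host.endswith("." + s) for s in ALLOWED_SUFFIXES):
            bad.append(url)
    return bad
```

Given a spec whose top-level servers list contains https://api.example.com/v1 and https://api.corp.internal, only the second URL is returned. Relative server URLs are skipped because they carry no hostname to leak.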

When a real domain is unavoidable

Sometimes documentation must name a real domain because the domain is the product surface, or because a standards document requires a canonical reference. In those cases, reduce risk by limiting what you reveal and by separating “identity” from “reachability.”

Tactics that work well:

Mention only the public, intended hostname, not internal subdomains or environment variants.
Avoid showing real query strings, tokens, or tenant identifiers in URLs, even as “fake” examples.
Keep any real domains out of copy-paste command blocks when a placeholder can demonstrate the mechanics.
Consider a dedicated public sandbox hostname that is meant to be touched by readers, with strict isolation from production.
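
When a captured URL has to be turned into a copy-paste example, part of the cleanup can be mechanical. The following sketch swaps the host for a placeholder and drops credentials, the query string, and the fragment. The function name to_placeholder_url and the default host api.example.com are assumptions for illustration. The path is left untouched, so tenant identifiers embedded in it still need a manual look.

```python
from urllib.parse import urlsplit, urlunsplit


def to_placeholder_url(url, placeholder_host="api.example.com"):
    """Rewrite an absolute URL onto a placeholder host, dropping query and fragment."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        # Without a scheme, urlsplit treats the host as part of the path,
        # which would leak it into the output.
        raise ValueError("expected an absolute URL with scheme and host")
    netloc = placeholder_host
    if parts.port is not None:
        netloc = f"{placeholder_host}:{parts.port}"
    # Replacing netloc wholesale also removes any user:password@ prefix.
    return urlunsplit((parts.scheme, netloc, parts.path, "", ""))
```

For example, to_placeholder_url("https://svc.corp.internal:8443/v2/items?token=abc#top") returns https://api.example.com:8443/v2/items.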

The guiding principle is simple: documentation should teach patterns without publishing infrastructure. Reserved example domains make that the default, and they remove an entire class of avoidable data leaks from the writing process.