What Healthcare's Web Topology Reveals About AI Readiness | Industry Report #001

The Headline

Across eleven Fortune 500 healthcare sites we crawled in Q1 2026, totaling 7,125 pages and 11.1 million tokens of public content, roughly a quarter of all content cannot be reached by following internal editorial links from the homepage. The XML sitemap rescues most of those pages for search-engine crawlers that read sitemaps, but a residual one page in twelve is structurally invisible: no editorial path from the homepage and not in the sitemap. It exists as a URL on the site but is not advertised anywhere an unaided crawler can find it.

The implication is direct. When an AI agent that follows links is asked about a major healthcare brand, somewhere around one in four of that brand’s pages is not in the answer space the agent reaches on its own. A meaningful slice of that gap is content the site itself does not advertise to crawlers either.

The Three-Tier Reachability Model

Every page on a site falls into one of three reachability tiers:

Reachable via internal links. A path exists from the homepage that follows editorial (in-content) links to the page. These are the pages a link-following AI agent can find on its own.
Sitemap-only. The page is listed in the site’s XML sitemap, but no internal editorial link path leads to it. Crawlers that read sitemaps will index it. Agents that do not read sitemaps will not find it.
Structurally unreachable. The page is neither reachable through editorial links nor present in the sitemap. It exists as a URL on the site but is not advertised anywhere an unaided crawler can find it. Whether it surfaces in a search at all depends on external inbound links from other sites.
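
The three tiers above can be sketched as a breadth-first traversal plus a sitemap lookup. This is a minimal illustration, not the pipeline used for this report: the function name, the dict-of-sets input shape, and the tier labels are all assumptions, and the sketch omits the report's exclusion of pages that the site marks non-indexable.

```python
from collections import deque

def classify_reachability(homepage, editorial_links, sitemap_urls, all_urls):
    """Assign each URL to one of the three reachability tiers.

    editorial_links: dict mapping a URL to the set of internal URLs it links
    to in-content (hypothetical input shape, not the report's schema).
    """
    # Breadth-first traversal from the homepage over editorial links only.
    reached = {homepage}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in editorial_links.get(page, ()):
            if target not in reached:
                reached.add(target)
                queue.append(target)

    tiers = {}
    for url in all_urls:
        if url in reached:
            tiers[url] = "reachable"
        elif url in sitemap_urls:
            tiers[url] = "sitemap_only"
        else:
            tiers[url] = "structurally_unreachable"
    return tiers
```

Note that a page can carry outbound links and still be sitemap-only: what matters is whether any path from the homepage arrives at it, not whether it links elsewhere.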

Averaged across our 11-site healthcare sample:

Sample mean across 11 healthcare F500 sites, Q1 2026

Segment	Value
Reachable via internal links	72.8%
Sitemap-only	19.3%
Structurally unreachable	7.9%

The averages obscure how unevenly the failure is distributed. On the per-site dot strip below, four of eleven sites have a structurally unreachable rate at or near zero: their editorial graph and their sitemap agree on what the site is. The remaining seven range from 0.4% to 34.7%, with one site (SH8) leaving more than a third of its pages outside any advertised path.

Each dot is one site. Sample mean 7.9% (n=11).

Site	Structurally-unreachable page rate (% of site's pages)
SH1	0%
SH3	0%
SH6	0%
SH9	0%
SH2	0.4%
SH5	2.9%
SH11	7%
SH4	11.1%
SH10	11.3%
SH7	20%
SH8	34.7%
Sample mean	7.9%

Health Metrics: The Failure Modes Don’t Align

The conventional intuition is that orphans (pages with no inbound editorial links) are the dominant structural problem. The healthcare F500 data tells a more interesting story: the sites with the worst orphan rates are not the same sites with the worst dead-end rates.

Pages with zero inbound editorial links. Sample mean 13.0% (n=11).

Site	Orphan rate (% of site's pages)
SH2	0.4%
SH1	2.3%
SH11	2.8%
SH8	3.6%
SH6	4.2%
SH10	12.3%
SH4	13.4%
SH5	19.1%
SH7	20.2%
SH9	22.4%
SH3	42.6%
Sample mean	13%

Pages that receive inbound links but link out to nothing. Sample mean 14.6% (n=11), pulled upward by one extreme site (SH11 at 72%).

Site	Dead-end rate (% of site's pages)
SH2	0%
SH6	0%
SH5	0.5%
SH3	1.2%
SH4	1.8%
SH1	4.5%
SH9	14.7%
SH10	15.8%
SH7	20.5%
SH8	29.1%
SH11	72%
Sample mean	14.6%

Compare the two strips. The site with the highest orphan rate (SH3 at 42.6%) has nearly the lowest dead-end rate (1.2%). The site with the highest dead-end rate (SH11 at 72.0%) has nearly the lowest orphan rate (2.8%). These are independent failure modes, not two faces of the same problem.
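
The comparison can be checked directly from the two per-site tables. The sketch below only re-enters the published per-site rates and recomputes the sample means and the two extreme sites; the variable names are illustrative.

```python
from statistics import mean

# Per-site rates copied from the two tables above (% of each site's pages).
orphan = {
    "SH1": 2.3, "SH2": 0.4, "SH3": 42.6, "SH4": 13.4, "SH5": 19.1, "SH6": 4.2,
    "SH7": 20.2, "SH8": 3.6, "SH9": 22.4, "SH10": 12.3, "SH11": 2.8,
}
dead_end = {
    "SH1": 4.5, "SH2": 0.0, "SH3": 1.2, "SH4": 1.8, "SH5": 0.5, "SH6": 0.0,
    "SH7": 20.5, "SH8": 29.1, "SH9": 14.7, "SH10": 15.8, "SH11": 72.0,
}

orphan_mean = round(mean(orphan.values()), 1)
dead_end_mean = round(mean(dead_end.values()), 1)

# The site that is worst on one metric, and how it fares on the other.
worst_orphan = max(orphan, key=orphan.get)
worst_dead_end = max(dead_end, key=dead_end.get)

print(f"orphan mean {orphan_mean}%, dead-end mean {dead_end_mean}%")
print(f"{worst_orphan}: orphan {orphan[worst_orphan]}%, dead-end {dead_end[worst_orphan]}%")
print(f"{worst_dead_end}: dead-end {dead_end[worst_dead_end]}%, orphan {orphan[worst_dead_end]}%")
```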

What the dot strips also show is that dead-end rates have wider variance than orphan rates. Orphans cluster between 0.4% and 22.4% on ten of the eleven sites, with one site pushing past 40%. Dead-end rates spread more uniformly between 0% and 30% on ten of the eleven, then jump to 72% on the one extreme case.

The takeaway for an AI agent that traverses link structure: an orphan is a missing entry point, but a dead-end is a circulation terminus. Both fail the agent’s expectation that arriving at a page means discovering more pages. On these healthcare sites, the failure is rarely just one or the other. Most sites have at least one of the two pushed into the red.
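
For readers who want to reproduce the two metrics on their own link graph, here is a minimal sketch based on the definitions given in the captions above: an orphan has zero inbound editorial links, and a dead-end receives inbound links but links out to nothing. The function name and input shapes are assumptions. Ignoring self-links and counting an unlinked homepage as an orphan are also choices of this sketch, not confirmed details of the report's method.

```python
def orphan_and_dead_end_rates(pages, edges):
    """Return (orphan_rate, dead_end_rate) as percentages of all pages.

    pages: iterable of URLs.
    edges: iterable of (source, target) editorial links between pages.
    """
    pages = set(pages)
    if not pages:
        return 0.0, 0.0

    has_inbound = set()
    has_outbound = set()
    for source, target in edges:
        # Assumption: self-links and links to unknown pages are ignored.
        if source in pages and target in pages and source != target:
            has_outbound.add(source)
            has_inbound.add(target)

    orphans = pages - has_inbound
    # Dead-ends: linked to by some page, but linking out to nothing.
    dead_ends = has_inbound - has_outbound
    total = len(pages)
    return 100 * len(orphans) / total, 100 * len(dead_ends) / total
```

Under these definitions the two sets are disjoint: a page with no inbound links cannot be a dead-end, which is one reason the two rates can move independently.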

The Industry Scorecard

The five-lens analysis assigns each site a green/amber/red score on each of the five lenses. Across the sample:

Five-lens scorecard for each of the 11 healthcare sites (heatmap). Rows: sites SH1 through SH11. Columns: Skeleton, Circulation, Organs, Health, Nervous Sys. Cell colors: green = healthy, amber = moderate concern, red = critical.

The pattern in the heatmap: skeletons and circulation are mostly fine. Health and organs are the soft spots. Ten of eleven sites earn green skeletons and the remaining one earns amber, indicating that link density is healthy at the body level. But three of eleven are red on health and only two are green, while four of eleven are red on organs. The picture is consistent: healthcare F500 sites are well-built at the surface and broken at the level of how editorial signal travels between topical communities.

Click Depth: The Reachability Tax

Even pages that are technically reachable can sit far from the homepage. The cohort’s average max click depth is 7.3 clicks; one site has pages 14 clicks deep. Across the sample, 21% of pages live at click depth 4 or greater.

Click-depth metric	Sample mean (n=11)	Range
Max click depth	7.3 clicks	3–14
Avg click depth	3.2 clicks	1.7–4.2
Pages at depth ≥4	21.0%	0%–37%

Two regimes coexist. Five sites cap out at a max depth of three to five clicks: either the homepage reaches little via editorial links, or the entire link graph is a flat fan rather than a hierarchy. The other extreme is SH6 at depth 14, a deep but linear graph where some pages require fourteen editorial hops from the entry point. AI agents and traditional crawlers both bias toward shallow paths. A page that needs seven or more editorial hops to reach is, in practice, only as discoverable as its sitemap line.
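
Click depth as described in this report is breadth-first distance from the homepage over internal editorial links. The sketch below shows one way to derive the three metrics in the table from such a traversal. The function name and input shape are hypothetical. Computing the averages over reachable pages only, with the homepage at depth 0, is an assumption, since the report does not state its denominator.

```python
from collections import deque

def click_depth_stats(homepage, editorial_links, deep_threshold=4):
    """Compute max depth, average depth, and share of pages at or beyond a threshold.

    editorial_links: dict mapping a URL to an iterable of internal URLs
    it links to in-content (hypothetical input shape).
    """
    depth = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in editorial_links.get(page, ()):
            if target not in depth:
                # First visit in BFS order is the shortest click path.
                depth[target] = depth[page] + 1
                queue.append(target)

    depths = list(depth.values())
    deep = sum(d >= deep_threshold for d in depths)
    return {
        "max_depth": max(depths),
        "avg_depth": sum(depths) / len(depths),
        "pct_deep": 100 * deep / len(depths),
    }
```

A purely linear chain of five pages, for example, yields a max depth of 4 with one page in five at depth 4 or greater, which is the "deep but linear" shape described above in miniature.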

Content Quality at Scale

Across all eleven sites combined: 7,125 pages, 11.1 million tokens of body text, and roughly 257,000 internal editorial links between them.

Content metric	Sample mean (n=11)	Notes
Pages per site	648	Median 699, range 239–998
Avg word count per page	874	Median 605; long tail of long pages
Avg token count per page	~1,470	Character-based estimate (length / 4)
Pages with thin content (under 200 words)	17.2%	Range 0%–37%
Pages with zero content	~1%	The “redirect / shell page” rate
Internal share of links (internal vs. external)	82% internal	Healthcare sites mostly link to themselves
Title tag coverage	99.5%	Near-universal
Meta description coverage	80.5%	One in five pages missing a description

The thin-content figure is the one to watch. One in six pages on the average site in this cohort has fewer than 200 words. These are the pages that exist as much for the sitemap as for the reader: redirect targets, navigation stubs, location pages, single-claim landing pages, archived event recaps. From an AI agent’s perspective they are weak signals: too little text to anchor a confident summary, but enough to clutter a knowledge graph. Some thin pages are legitimate (a category landing page that delegates to its children). The 17.2% sample mean is the systemic fraction worth flagging, not the individual page.
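
The two content heuristics in the table are simple enough to sketch. The token estimate follows the stated rule (character length divided by four) and the thin-content test uses the 200-word cutoff. The function names are illustrative, and whitespace splitting as the word counter is an assumption, since the report does not say how words were counted.

```python
def estimate_tokens(text: str) -> float:
    # Character-based estimate as described in the report: length / 4.
    # This is not tokenizer output.
    return len(text) / 4

def is_thin(text: str, min_words: int = 200) -> bool:
    # Assumption: words are counted by splitting on whitespace.
    return len(text.split()) < min_words

def thin_content_rate(page_texts: list[str], min_words: int = 200) -> float:
    """Percentage of pages whose body text falls under the word cutoff."""
    if not page_texts:
        return 0.0
    thin = sum(is_thin(text, min_words) for text in page_texts)
    return 100 * thin / len(page_texts)
```

Under this sketch a page with zero words also counts as thin; whether the report's 17.2% includes the roughly 1% of zero-content pages is not stated.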

What Every Healthcare F500 Site Shares

Looking at each site’s single largest URL section reveals a consistent shape. Six of the eleven sites have a press-release / news / newsroom section as their dominant content type; on the sites where that section exists it routinely accounts for between 40% and 80% of all crawled pages.

Archetype	Avg % of site's pages on the dominant section, when it exists	Sites and example paths
News / newsroom variant	50.6%	6 of 11 sites: appears under names like /news, /newsroom, /knowledge-center, /media-center
Therapeutic / product area	41.2%	3 of 11 sites: e.g. /innovativemedicine, /life-and-science, /public-policy-institute
Brand / about	27.3%	2 of 11 sites: /who-we-are, /about-us
Single-locale wrapper	100%	1 of 11 sites: entire site lives under /en-us with no further section structure

Each site's largest URL-path section, grouped by archetype across the 11 sites.

The dominant pattern across the cohort is some variant of press releases / news / newsroom. Six of the eleven sites carry this archetype, and on those sites it accounts on average for about half (50.6%) of all crawled pages. A majority of the healthcare F500 sites in this batch are, structurally, news archives with a thin product layer on top.
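
One way to find a site's largest URL-path section is to group crawled URLs by their first path segment. This is a sketch under that assumption: the report does not define "section" more precisely, and the function name is hypothetical. It expects a non-empty list of URLs.

```python
from collections import Counter
from urllib.parse import urlparse

def dominant_section(urls):
    """Return (section, share_percent) for the most common first path segment."""
    counts = Counter()
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # Assumption: a "section" is the first path segment; the bare
        # homepage is grouped under "/".
        section = ("/" + segments[0]) if segments else "/"
        counts[section] += 1
    section, count = counts.most_common(1)[0]
    return section, 100 * count / sum(counts.values())
```

On a site that lives entirely under one locale prefix, this grouping returns that prefix with a 100% share, which matches the single-locale wrapper row in the table. A deeper grouping, for instance by the second path segment, would be needed to see structure inside it.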

That single architectural choice cascades into most of the topology metrics above. Press-release content is high-volume, fast-decaying, and rarely cross-linked outside its own date silo. A 2018 quarterly earnings note is unlikely to link to a 2024 product launch, and vice versa. The result is exactly the topology we measured: deep click-depth, high orphan and dead-end rates concentrated in the news silo, low cross-section linking. Most healthcare F500 sites are not architecturally broken in some unusual way. They are press-release archives that happen to also host product pages.

What This Means for AI Search Readiness

Two distinct AI consumption patterns interact with the three reachability tiers differently. Indexer crawlers (GPTBot, ClaudeBot, PerplexityBot, and similar) consume sitemap.xml and reach tier-2 sitemap-only pages just as Googlebot does; for them, the structurally-unreachable 7.9% is the relevant gap. Agentic browsing at query time is the other case: an LLM following internal links in real time to answer a specific question (“what does this company offer in oncology?”). That mode is link-bound. It does not pre-load the sitemap; it traverses what is editorially linked from where it lands. For the agentic case, the full 27% unreachable-via-links figure is in play.

The structural problem matters in both regimes, but more acutely in the second. The remainder of this section concerns the link-following case.

For an AI agent that uses internal link structure to discover and rank content, three implications follow from this data.

1. About a quarter of the content is invisible to a link-following agent. The 27% average unreachable-via-links rate means that for every three pages an agent learns about by traversing, there is roughly one more it never sees. Roughly 19% of the typical site is rescued by the sitemap (for crawlers that read sitemaps); the residual 8% is invisible to anything that doesn’t already know the URL.

2. Even reachable pages often go nowhere. The dead-end rate isn’t a missing-page problem; it’s a circulation problem. The agent arrives at the page but learns nothing about what’s adjacent. Topical neighborhoods don’t get built because the links that would build them don’t exist. This affects how an agent answers questions like “what else does this company offer in this space?” The answer space gets unnecessarily narrow.

3. Press-release content drowns the product story. When a news / newsroom section averages around half of a site’s crawlable pages on the six sites where that pattern dominates, and the editorial linking structure mostly stays within the date silo, an AI agent asked “what does this company do?” gets answered with “it issues press releases.” The product story that the marketing team would want surfaced gets relatively under-linked compared to the news-archive content that exists by default.

The fix is not more content, more pages, or a redesign. It is structural: closing dead-ends with outbound links to topical neighbors, rescuing orphans into their communities, building the cross-section bridges that turn a press-release archive into a topology with a real product spine. This is the entire premise of the Digital MRI service.

Methodology

This report aggregates topology and content data from eleven anonymized Fortune 500 healthcare websites. All sites went through the same pipeline: Playwright-based content crawl, main-content extraction (filtering navigation, header, and footer links), graph construction with a global-nav-link filter, and the five-lens topology analysis (skeleton, circulation, organs, health, nervous system). All identifying information has been replaced with neutral codenames (SH1 through SH11); the mapping is not exposed in this report. A site-level data appendix is available on request.
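
The global-nav-link filter can be sketched using the rule stated in the disclaimer at the end of this report: any link target that appears on more than 80% of pages is treated as global navigation and removed before analysis. The function name and the dict-of-sets input are assumptions for illustration, not the actual pipeline code.

```python
from collections import Counter

def strip_global_nav(page_links, threshold=0.8):
    """Remove link targets that appear on more than `threshold` of pages.

    page_links: dict mapping a page URL to the set of link targets found
    on that page (hypothetical input shape).
    """
    n_pages = len(page_links)
    target_counts = Counter()
    for targets in page_links.values():
        # Count each target once per page, however often it repeats there.
        target_counts.update(set(targets))

    global_nav = {
        target for target, count in target_counts.items()
        if count / n_pages > threshold
    }
    return {page: set(targets) - global_nav for page, targets in page_links.items()}
```

Because the comparison is strict, a target linked from exactly 80% of pages is kept as an editorial link in this sketch.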

The eleven sites are a curated cohort, not a random sample of the healthcare industry. Aggregate figures (sample means, ranges) describe this batch and are presented as exploratory benchmarks for F500 healthcare topology, not as point estimates for the broader sector. A larger cross-industry baseline is in progress.

A twelfth Fortune 500 healthcare site originally targeted for this batch was excluded after repeated TCP-level network unreachability from our crawl host (geo-blocking at the site’s edge). It is not part of the figures above.

A note on what “structurally unreachable” measures

Throughout this report, structurally unreachable refers to pages that have no internal editorial link path from the homepage and are not present in the site’s XML sitemap. It is a measure of a page’s structural advertisement by the site itself, not a measure of its search-engine indexing status. Pages the site itself flags as non-indexable are excluded from the count.

The homepage is the anchor because it is the page crawlers and agents almost always reach first: it is the canonical entry point linked from the domain root, from search results, from external citations, and from the brand’s own marketing. Discovery then proceeds outward through whatever editorial links the homepage exposes. A page that cannot be reached by following those links is, in practical terms, a page the site is not advertising to anyone who arrives at the front door.

Disclaimer. This analysis was performed using web topology crawling and network science methods including PageRank, Louvain community detection, and betweenness centrality. The crawler respects site-level robots directives; disallowed pages were never fetched and are not part of this dataset. Navigation, header, and footer links were excluded automatically: any link target that appeared on more than 80% of pages was treated as global navigation and stripped from the graph before topology analysis. Only in-content (editorial) links remain. Pages the site itself marks non-indexable are subtracted from the structurally-unreachable count. All data represents publicly accessible page structure only. No content, metadata, or user data was collected or stored. All identifying information has been anonymized; the codename-to-domain mapping is intentionally not published. Token counts are character-based estimates (page length divided by four), not tokenizer output. Click depth is measured as breadth-first distance from the homepage following internal editorial links. Statistical patterns are presented for educational purposes only and do not constitute advice about any specific site or company.