Industry Report #001
What Healthcare's Web Topology Reveals About AI Readiness
An analysis of eleven Fortune 500 healthcare sites finds that roughly one in four pages cannot be reached by following internal editorial links from the homepage, and that the topology failure modes which cripple one site rarely cripple the next in the same way.
The Headline
Across eleven Fortune 500 healthcare sites we crawled in Q1 2026, totaling 7,125 pages and 11.1 million tokens of public content, roughly a quarter of all content cannot be reached by following internal editorial links from the homepage. The XML sitemap rescues most of those pages for search-engine crawlers that read sitemaps, but a residual one page in twelve is structurally invisible: no editorial path from the homepage and not in the sitemap. It exists as a URL on the site but is not advertised anywhere an unaided crawler can find it.
The implication is direct. When an AI agent that follows links is asked about a major healthcare brand, somewhere around one in four of that brand’s pages is not in the answer space the agent reaches on its own. A meaningful slice of that gap is content the site itself does not advertise to crawlers either.
The Three-Tier Reachability Model
Every page on a site falls into one of three reachability tiers:
- Reachable via internal links. A path exists from the homepage that follows editorial (in-content) links to the page. These are the pages a link-following AI agent can find on its own.
- Sitemap-only. The page is listed in the site’s XML sitemap, but no internal editorial link path leads to it. Crawlers that read sitemaps will index it. Agents that don’t will not find it.
- Structurally unreachable. The page is neither reachable through editorial links nor present in the sitemap. It exists as a URL on the site but is not advertised anywhere an unaided crawler can find it. Whether it surfaces in a search at all depends on external inbound links from other sites.
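As a rough illustration of the three tiers, the classification can be sketched as a breadth-first traversal followed by a sitemap lookup. This is a minimal sketch, not the report's pipeline: the function name, the input shapes, and the toy URLs are all assumptions, and the exclusion of non-indexable pages is not modeled.

```python
from collections import deque


def classify_reachability(homepage, editorial_links, sitemap_urls, all_urls):
    """Assign every URL in all_urls to one of the three reachability tiers.

    editorial_links maps a URL to the URLs linked from its main content
    (hypothetical input shape).
    """
    # Breadth-first traversal from the homepage over editorial links only.
    reached = {homepage}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in editorial_links.get(page, ()):
            if target in all_urls and target not in reached:
                reached.add(target)
                queue.append(target)

    tiers = {}
    for url in all_urls:
        if url in reached:
            tiers[url] = "reachable_via_links"
        elif url in sitemap_urls:
            tiers[url] = "sitemap_only"
        else:
            tiers[url] = "structurally_unreachable"
    return tiers


# Toy site with hypothetical URLs.
all_urls = {"/", "/products", "/products/a", "/news/2018-q3", "/legacy/tool"}
editorial_links = {"/": ["/products"], "/products": ["/products/a"]}
sitemap_urls = {"/", "/products", "/products/a", "/news/2018-q3"}
tiers = classify_reachability("/", editorial_links, sitemap_urls, all_urls)
```

In the toy site, the news page is listed in the sitemap but not linked, so it lands in the sitemap-only tier; the legacy page is in neither and is structurally unreachable.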
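Averaged across our 11-site healthcare sample, roughly 73% of pages are reachable via internal links, roughly 19% are sitemap-only, and 7.9% are structurally unreachable.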
The averages obscure how unevenly the failure distributes. In the per-site breakdown below, four of eleven sites have a structurally-unreachable rate of zero: their editorial graph and their sitemap agree on what the site is. The remaining seven spread between 0.4% and 34.7%, with one site leaving just over a third of its pages outside any advertised path.
| Site | Structurally-unreachable page rate (% of site's pages) |
|---|---|
| SH1 | 0% |
| SH3 | 0% |
| SH6 | 0% |
| SH9 | 0% |
| SH2 | 0.4% |
| SH5 | 2.9% |
| SH11 | 7% |
| SH4 | 11.1% |
| SH10 | 11.3% |
| SH7 | 20% |
| SH8 | 34.7% |
| Sample mean | 7.9% |
Health Metrics: The Failure Modes Don’t Align
The conventional intuition is that orphans (pages with no inbound editorial links) are the dominant structural problem. The healthcare F500 data tells a more interesting story: the sites with the worst orphan rates are not the same sites with the worst dead-end rates.
| Site | Orphan rate (% of site's pages) |
|---|---|
| SH2 | 0.4% |
| SH1 | 2.3% |
| SH11 | 2.8% |
| SH8 | 3.6% |
| SH6 | 4.2% |
| SH10 | 12.3% |
| SH4 | 13.4% |
| SH5 | 19.1% |
| SH7 | 20.2% |
| SH9 | 22.4% |
| SH3 | 42.6% |
| Sample mean | 13% |
| Site | Dead-end rate (% of site's pages) |
|---|---|
| SH2 | 0% |
| SH6 | 0% |
| SH5 | 0.5% |
| SH3 | 1.2% |
| SH4 | 1.8% |
| SH1 | 4.5% |
| SH9 | 14.7% |
| SH10 | 15.8% |
| SH7 | 20.5% |
| SH8 | 29.1% |
| SH11 | 72% |
| Sample mean | 14.6% |
Compare the two distributions. The site with the highest orphan rate (SH3 at 42.6%) has one of the lowest dead-end rates (1.2%). The site with the highest dead-end rate (SH11 at 72.0%) has one of the lowest orphan rates (2.8%). In this sample the two behave as independent failure modes, not two faces of the same problem.
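One way to check that the two rates do not move together is a rank correlation over the eleven per-site pairs in the tables above. The sketch below is an illustrative check, not part of the report's method: the helper names are assumptions, and with only eleven points the result says nothing beyond this cohort. On these figures the coefficient comes out close to zero.

```python
def average_ranks(values):
    """Rank values ascending (1-based); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Per-site rates from the two tables above, in the order SH1 .. SH11.
orphan_rate = [2.3, 0.4, 42.6, 13.4, 19.1, 4.2, 20.2, 3.6, 22.4, 12.3, 2.8]
dead_end_rate = [4.5, 0.0, 1.2, 1.8, 0.5, 0.0, 20.5, 29.1, 14.7, 15.8, 72.0]
rho = spearman(orphan_rate, dead_end_rate)
```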
The per-site figures also show that dead-end rates vary more widely than orphan rates. Orphan rates fall between 0.4% and 22.4% on ten of the eleven sites, with one site pushing past 40%. Dead-end rates spread between 0% and 30% on ten of the eleven, then jump to 72% on the one extreme case.
The takeaway for an AI agent that traverses link structure: an orphan is a missing entry point, but a dead-end is a circulation terminus. Both fail the agent’s expectation that arriving at a page means discovering more pages. On these healthcare sites, the failure is rarely just one or the other. Most sites have at least one of the two pushed into the red.
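The two failure modes can be sketched as in-degree and out-degree checks on the editorial link graph. This is a minimal sketch under stated assumptions: an orphan is taken to be a page with no inbound editorial links (as defined above), a dead-end is assumed to be a page with no outbound editorial links to other crawled pages, and the homepage is assumed never to count as an orphan. The function name and toy URLs are hypothetical.

```python
def orphans_and_dead_ends(pages, editorial_links, homepage):
    """Return (orphans, dead_ends) as sets of URLs.

    pages: set of crawled URLs.
    editorial_links: dict mapping a URL to the URLs linked from its main content.
    """
    has_inbound = set()
    has_outbound = set()
    for source, targets in editorial_links.items():
        if source not in pages:
            continue
        for target in targets:
            # Ignore self-links and links that leave the crawled set.
            if target in pages and target != source:
                has_inbound.add(target)
                has_outbound.add(source)
    # Assumption: the homepage is the entry point, so it is never an orphan.
    orphans = {p for p in pages if p not in has_inbound and p != homepage}
    dead_ends = {p for p in pages if p not in has_outbound}
    return orphans, dead_ends


# Toy site with hypothetical URLs.
pages = {"/", "/a", "/b", "/c"}
editorial_links = {"/": ["/a"], "/a": ["/b"], "/c": ["/a"]}
orphans, dead_ends = orphans_and_dead_ends(pages, editorial_links, "/")
orphan_rate = len(orphans) / len(pages)
dead_end_rate = len(dead_ends) / len(pages)
```

Here `/c` links out but nothing links to it (an orphan), while `/b` is linked but links nowhere (a dead-end): the same small graph can carry one failure without the other.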
The Industry Scorecard
The five-lens analysis assigns each site a green/amber/red score on each of the five lenses. Across the sample:
| Lens | What it measures |
|---|---|
| Skeleton | Size, density, and average path length. How big and how connected the site is at the body level. |
| Circulation | PageRank distribution and structural bottlenecks. How importance flows between pages, and which hubs hold it all together. |
| Organs | Community detection. Whether the site's topical clusters cleanly separate, or whether one mega-cluster dominates everything. |
| Health | Islands, orphans, and dead-ends. Where content is structurally dying: unreachable, unlinked, or terminating. |
| Nervous system | Click depth, bridges, and cross-community linking. Whether the site is a well-designed building or a pile of disconnected rooms. |
[Heatmap: one green/amber/red cell per lens for each site, SH1 through SH11.]
- Green: healthy
- Amber: moderate concern
- Red: critical
The pattern in the heatmap: skeletons and circulation are mostly fine. Health and organs are the soft spots. Ten of eleven sites earn green skeletons and the remaining one earns amber, indicating that link density is healthy at the body level. But three of eleven are red on health and only two are green, while four of eleven are red on organs. The picture is consistent: healthcare F500 sites are well-built at the surface and broken at the level of how editorial signal travels between topical communities.
Click Depth: The Reachability Tax
Even pages that are technically reachable can sit far from the homepage. The cohort’s average max click depth is 7.3 clicks; one site has pages 14 clicks deep. Across the sample, 21% of pages live at click depth 4 or greater.
| Click-depth metric | Sample mean (n=11) | Range |
|---|---|---|
| Max click depth | 7.3 clicks | 3–14 |
| Avg click depth | 3.2 clicks | 1.7–4.2 |
| Pages at depth ≥4 | 21.0% | 0%–37% |
Two regimes coexist. Five sites cap out at a max depth of three to five clicks: either the homepage reaches little via editorial links, or the entire link graph is a flat fan rather than a hierarchy. The other extreme is SH6 at depth 14, a deep but linear graph where some pages require fourteen editorial hops from the entry point. AI agents and traditional crawlers both bias toward shallow paths. A page that needs seven or more editorial hops to reach is, in practice, only as discoverable as its sitemap line.
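The disclaimer at the end of this report defines click depth as breadth-first distance from the homepage over internal editorial links. A minimal sketch of that measurement follows; the function names and the toy graph are assumptions, and whether the homepage itself (depth 0) is included in the average is also an assumption of this sketch.

```python
from collections import deque


def click_depths(homepage, editorial_links):
    """Breadth-first distance (in clicks) from the homepage to each reachable page."""
    depth = {homepage: 0}
    queue = deque([homepage])
    while queue:
        page = queue.popleft()
        for target in editorial_links.get(page, ()):
            if target not in depth:
                depth[target] = depth[page] + 1
                queue.append(target)
    return depth


def depth_summary(depth, deep_threshold=4):
    """Max depth, average depth, and share of pages at or beyond the threshold."""
    values = list(depth.values())
    return {
        "max_depth": max(values),
        "avg_depth": sum(values) / len(values),
        "share_deep": sum(v >= deep_threshold for v in values) / len(values),
    }


# Toy graph with hypothetical URLs: / -> /a -> /b -> /c -> /d, plus / -> /e.
links = {"/": ["/a", "/e"], "/a": ["/b"], "/b": ["/c"], "/c": ["/d"]}
summary = depth_summary(click_depths("/", links))
```

Pages that the traversal never reaches have no depth at all, which is why click depth only describes the reachable tier.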
Content Quality at Scale
Across all eleven sites combined: 7,125 pages, 11.1 million tokens of body text, and roughly 257,000 internal editorial links between them.
| Content metric | Sample mean (n=11) | Notes |
|---|---|---|
| Pages per site | 648 | Median 699, range 239–998 |
| Avg word count per page | 874 | Median 605; long tail of long pages |
| Avg token count per page | ~1,470 | Character-based estimate (length / 4) |
| Pages with thin content (under 200 words) | 17.2% | Range 0%–37% |
| Pages with zero content | ~1% | The “redirect / shell page” rate |
| Internal share of all links | 82% | Healthcare sites mostly link to themselves |
| Title tag coverage | 99.5% | Near-universal |
| Meta description coverage | 80.5% | One in five pages missing a description |
The thin-content figure is the one to watch. One in six pages on the average site in this cohort has fewer than 200 words. These are the pages that exist as much for the sitemap as for the reader: redirect targets, navigation stubs, location pages, single-claim landing pages, archived event recaps. From an AI agent’s perspective they are weak signals: too little text to anchor a confident summary, but enough to clutter a knowledge graph. Some thin pages are legitimate (a category landing page that delegates to its children). The 17.2% sample mean is the systemic fraction worth flagging, not the individual page.
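The two content measures used here are simple enough to sketch directly: the token estimate is character length divided by four, and a page is thin if it has fewer than 200 words. The sketch below assumes whitespace-delimited word counting and integer division, neither of which the report specifies; the function names and sample pages are hypothetical.

```python
def estimate_tokens(text):
    """Character-based token estimate: length divided by four (rounded down)."""
    return len(text) // 4


def is_thin(text, min_words=200):
    """True if the page body has fewer than min_words whitespace-delimited words."""
    return len(text.split()) < min_words


def thin_share(page_bodies):
    """Fraction of pages whose body text counts as thin."""
    return sum(is_thin(body) for body in page_bodies.values()) / len(page_bodies)


# Hypothetical pages: a short stub and a 250-word body.
page_bodies = {"/stub": "Contact us today.", "/guide": "word " * 250}
share = thin_share(page_bodies)
guide_tokens = estimate_tokens(page_bodies["/guide"])
```

A page with zero content also falls under the thin threshold in this sketch, so the two rows in the table above would overlap unless shell pages are counted separately.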
What Every Healthcare F500 Site Shares
Looking at each site’s single largest URL section reveals a consistent shape. Six of the eleven sites have a press-release / news / newsroom section as their dominant content type; on the sites where that section exists it routinely accounts for between 40% and 80% of all crawled pages.
Six of eleven is roughly half the sample, and on those six sites the news section accounts on average for about half of all crawled pages. The healthcare F500 sites in this batch are, structurally, news archives with a thin product layer on top.
That single architectural choice cascades into most of the topology metrics above. Press-release content is high-volume, fast-decaying, and rarely cross-linked outside its own date silo. A 2018 quarterly earnings note is unlikely to link to a 2024 product launch, and vice versa. The result is exactly the topology we measured: deep click-depth, high orphan and dead-end rates concentrated in the news silo, low cross-section linking. Most healthcare F500 sites are not architecturally broken in some unusual way. They are press-release archives that happen to also host product pages.
What This Means for AI Search Readiness
Two distinct AI consumption patterns interact with the three reachability tiers differently. Indexer crawlers (GPTBot, ClaudeBot, PerplexityBot, and similar) consume sitemap.xml and reach tier-2 sitemap-only pages just as Googlebot does; for them, the structurally-unreachable 7.9% is the relevant gap. Agentic browsing at query time is the other case: an LLM following internal links in real time to answer a specific question (“what does this company offer in oncology?”). That mode is link-bound. It does not pre-load the sitemap; it traverses what is editorially linked from where it lands. For the agentic case, the full 27% unreachable-via-links figure is in play.
The structural problem matters in both regimes, but more acutely in the second. The remainder of this section concerns the link-following case.
For an AI agent that uses internal link structure to discover and rank content, three implications follow from this data.
1. About a quarter of the content is invisible to a link-following agent. The 27% average unreachable-via-links rate means that for every three pages an agent learns about by traversing, there is roughly one more it never sees. Roughly 19% of the typical site is rescued by the sitemap (for crawlers that read sitemaps); the residual 8% is invisible to anything that doesn’t already know the URL.
2. Even reachable pages often go nowhere. The dead-end rate isn’t a missing-page problem; it’s a circulation problem. The agent arrives at the page but learns nothing about what’s adjacent. Topical neighborhoods don’t get built because the links that would build them don’t exist. This affects how an agent answers questions like “what else does this company offer in this space?” The answer space gets unnecessarily narrow.
3. Press-release content drowns the product story. When a news / newsroom section averages around half of a site’s crawlable pages on the half of sites where that pattern dominates, and the editorial linking structure mostly stays within the date silo, an AI agent asked “what does this company do?” gets answered with “it issues press releases.” The product story that the marketing team would want surfaced gets relatively under-linked compared to the news-archive content that exists by default.
The fix is not more content, more pages, or a redesign. It is structural: closing dead-ends with outbound links to topical neighbors, rescuing orphans into their communities, building the cross-section bridges that turn a press-release archive into a topology with a real product spine. This is the entire premise of the Digital MRI service.
Methodology
This report aggregates topology and content data from eleven anonymized Fortune 500 healthcare websites. All sites went through the same pipeline: Playwright-based content crawl, main-content extraction (filtering navigation, header, and footer links), graph construction with a global-nav-link filter, and the five-lens topology analysis (skeleton, circulation, organs, health, nervous system). All identifying information has been replaced with neutral codenames (SH1 through SH11); the mapping is not exposed in this report. A site-level data appendix is available on request.
The eleven sites are a curated cohort, not a random sample of the healthcare industry. Aggregate figures (sample means, ranges) describe this batch and are presented as exploratory benchmarks for F500 healthcare topology, not as point estimates for the broader sector. A larger cross-industry baseline is in progress.
A twelfth Fortune 500 healthcare site originally targeted for this batch was excluded because it was repeatedly unreachable at the TCP level from our crawl host (geo-blocking at the site’s edge). It is not part of the figures above.
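The global-nav-link filter in the pipeline is described in the disclaimer at the end of this report: any link target that appears on more than 80% of pages is treated as global navigation and stripped before topology analysis. A minimal sketch of that frequency rule follows. It covers only the threshold step, not main-content extraction, and the function name, input shape, and toy URLs are assumptions.

```python
from collections import Counter


def strip_global_nav(page_links, threshold=0.8):
    """Remove link targets that appear on more than `threshold` of all pages.

    page_links: dict mapping each page URL to the list of link targets found on it.
    Returns (filtered_links, global_nav_targets).
    """
    n_pages = len(page_links)
    pages_linking = Counter()
    for targets in page_links.values():
        # Count each target at most once per page.
        pages_linking.update(set(targets))
    global_nav = {t for t, count in pages_linking.items() if count / n_pages > threshold}
    filtered = {
        page: [t for t in targets if t not in global_nav]
        for page, targets in page_links.items()
    }
    return filtered, global_nav


# Toy site with hypothetical URLs: every page links to /contact.
page_links = {
    "/": ["/contact", "/a"],
    "/a": ["/contact", "/b"],
    "/b": ["/contact", "/c"],
    "/c": ["/contact"],
    "/contact": ["/contact"],
}
filtered, global_nav = strip_global_nav(page_links)
```

A frequency rule like this needs no knowledge of the page template, which is presumably why it travels well across eleven differently built sites; the trade-off is that a genuinely editorial link repeated on nearly every page would be stripped too.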
A note on what “structurally unreachable” measures
Throughout this report, structurally unreachable refers to pages that have no internal editorial link path from the homepage and are not present in the site’s XML sitemap. It is a measure of a page’s structural advertisement by the site itself, not a measure of its search-engine indexing status. Pages the site itself flags as non-indexable are excluded from the count.
The homepage is the anchor because it is the page crawlers and agents almost always reach first: it is the canonical entry point linked from the domain root, from search results, from external citations, and from the brand’s own marketing. Discovery then proceeds outward through whatever editorial links the homepage exposes. A page that cannot be reached by following those links is, in practical terms, a page the site is not advertising to anyone who arrives at the front door.
Disclaimer. This analysis was performed using web topology crawling and network science methods including PageRank, Louvain community detection, and betweenness centrality. The crawler respects site-level robots directives; disallowed pages were never fetched and are not part of this dataset. Navigation, header, and footer links were excluded automatically: any link target that appeared on more than 80% of pages was treated as global navigation and stripped from the graph before topology analysis. Only in-content (editorial) links remain. Pages the site itself marks non-indexable are subtracted from the structurally-unreachable count. All data represents publicly accessible page structure only. No content, metadata, or user data was collected or stored. All identifying information has been anonymized; the codename-to-domain mapping is intentionally not published. Token counts are character-based estimates (page length divided by four), not tokenizer output. Click depth is measured as breadth-first distance from the homepage following internal editorial links. Statistical patterns are presented for educational purposes only and do not constitute advice about any specific site or company.