AI Agent Discovery: How LLMs Find, Index, and Recommend Software Tools End to End
The complete pipeline from LLM web crawl to product recommendation — how structured data, indexing, query decomposition, and agent selection work together to surface software tools.
> TL;DR
> - LLMs build product knowledge by crawling the web and extracting structured data — this happens before any user query
> - The pipeline: crawl → schema extraction → indexing → query decomposition → filtering → ranking → recommendation/purchase
> - Every stage has a failure point. This guide covers them all.
> - Make your product crawlable and indexable →
Updated: April 21, 2026
---
Stage 0: The Crawl (Before Any Query)
AI product discovery starts long before a user types a query. It starts when LLM crawlers visit your site.
Active crawlers as of 2026:
- GPTBot — OpenAI (powers ChatGPT recommendations)
- ClaudeBot — Anthropic (powers Claude's web knowledge)
- PerplexityBot — Perplexity AI (powers Perplexity's product index)
- Google-Extended — Google AI (powers AI Overviews)
- Applebot — Apple Intelligence
- Bytespider — TikTok / ByteDance AI
- Meta-ExternalAgent — Meta AI

These crawlers visit your site and extract two types of data:
Structured data (JSON-LD schema) — Parsed directly. Clean, high-confidence, directly usable in product indexes. Fields like name, price, featureList, applicationCategory are extracted with certainty.
Unstructured HTML — Parsed with NLP. Lower confidence. The crawler has to infer what's a price, what's a feature, what's a description. Errors are common. Often not used for product indexes at all.
Products with structured data get clean, accurate records. Products without it get noisy, incomplete records — or no record at all.
What to check right now: Open your server access logs and search for GPTBot, ClaudeBot, PerplexityBot. They're almost certainly already visiting your site. What are they finding?
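The check above can be scripted. A minimal sketch in Python, assuming a common access-log format where the user-agent string appears somewhere on each line (the bot list and sample lines are illustrative):

```python
# Sketch: scan server access-log lines for known AI crawler user agents.
# The log format and bot list here are assumptions; adapt to your setup.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
               "Applebot", "Bytespider", "Meta-ExternalAgent"]

def find_ai_crawler_hits(log_lines):
    """Return (bot, line) pairs for every AI-crawler request found."""
    hits = []
    for line in log_lines:
        for bot in AI_CRAWLERS:
            if bot in line:
                hits.append((bot, line))
                break  # one bot match per line is enough
    return hits

sample = [
    '203.0.113.7 - - [18/Apr/2026] "GET /pricing HTTP/1.1" 200 "GPTBot/1.2"',
    '198.51.100.9 - - [18/Apr/2026] "GET / HTTP/1.1" 200 "Mozilla/5.0"',
]
for bot, line in find_ai_crawler_hits(sample):
    print(bot, "requested:", line.split('"')[1])
```

Run it against a few days of logs and you will usually see which pages the crawlers are fetching, and how often.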
---
Stage 1: Schema Extraction and Indexing
When a crawler finds a JSON-LD block on your page, it parses the JSON and extracts a structured product record. For a SoftwareApplication schema, it builds an index entry that looks roughly like:
```
product_id: "yourproduct.com"
type: SoftwareApplication
name: "YourProduct"
category: DeveloperApplication
subcategory: API Monitoring
price_min: 0
price_max: 99
price_unit: per month
free_trial: true
features: [webhooks, REST API, Slack integration, alerting, uptime monitoring]
audience: developers, devops teams, SRE engineers
rating: 4.7 (from 2,340 reviews)
last_crawled: 2026-04-18
```
This record lives in the LLM's product knowledge base. When a user asks a product query, this record is what gets retrieved and evaluated — not your live website.
Critical insight: Schema affects this record at crawl time. A schema update today takes 1–3 weeks to propagate across all LLM indexes. This is why schema is infrastructure — you're not optimizing for today's queries, you're building the foundation for all future queries.
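For reference, a minimal JSON-LD markup that could produce roughly the index record above might look like this (the values are illustrative, not a recommendation for your actual pricing or ratings):

```json
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "YourProduct",
  "applicationCategory": "DeveloperApplication",
  "applicationSubCategory": "API Monitoring",
  "featureList": "webhooks, REST API, Slack integration, alerting, uptime monitoring",
  "offers": {
    "@type": "AggregateOffer",
    "lowPrice": "0",
    "highPrice": "99",
    "priceCurrency": "USD",
    "priceSpecification": {
      "@type": "UnitPriceSpecification",
      "unitText": "per month"
    }
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.7",
    "reviewCount": "2340"
  },
  "dateModified": "2026-04-18"
}
```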
---
Stage 2: Query Decomposition
When a user asks an AI agent for a product recommendation, the agent first breaks the natural language query into structured evaluation criteria.
Example query: "I need an uptime monitoring tool for our API. Must have webhook alerts, works with PagerDuty, under $50/month, and free tier to start."
Decomposed criteria:
category: API monitoring / uptime monitoring
required_features: ["webhooks", "PagerDuty integration"]
price_max: 50
price_unit: per month
free_tier: required
Each criterion maps to a schema field. Criteria that map cleanly to schema fields get evaluated with high confidence. Criteria that don't map to any schema field require NLP inference from unstructured text — less reliable, more likely to produce uncertain or incorrect results.
Design principle: Write your featureList and additionalProperty to match how buyers describe requirements. Not how your marketing team describes your product.
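The mapping from decomposed criteria to schema fields can be sketched as a lookup table. This is an illustration of the principle, not any vendor's internal representation; the field names follow the table in Stage 4:

```python
# Sketch: which schema field each decomposed criterion would be checked
# against. Criteria with no mapping fall back to low-confidence NLP.
CRITERIA_TO_SCHEMA = {
    "category":          "applicationCategory",
    "required_features": "featureList",
    "price_max":         "offers[].price",
    "price_unit":        "priceSpecification.unitText",
    "free_tier":         "additionalProperty[Free Trial]",
}

criteria = {
    "category": "API monitoring",
    "required_features": ["webhooks", "PagerDuty integration"],
    "price_max": 50,
    "price_unit": "per month",
    "free_tier": True,
}

for criterion in criteria:
    field = CRITERIA_TO_SCHEMA.get(criterion)
    confidence = "high (schema field)" if field else "low (NLP inference)"
    print(f"{criterion} -> {field or 'no mapping'} [{confidence}]")
```

Every criterion in the example query lands on a schema field, which is exactly the situation you want your markup to create.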
---
Stage 3: Candidate Retrieval
The agent queries its product index for candidates that match the primary criteria — usually applicationCategory and key features. This is a broad retrieval step that returns a candidate pool (typically 5–20 products).
Products that don't appear in the agent's index at all — because they have no schema or were never crawled — can't be retrieved at this stage. They're invisible before any evaluation begins.
Products with schema but filed under the wrong applicationCategory may also be missed. Getting your category right matters as much as having one at all.
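The retrieval step can be sketched with a toy index. The point of the sketch: a product with no entry, or an entry under the wrong category, simply never enters the pool (index contents are invented for illustration):

```python
# Sketch: broad first-pass retrieval from a toy product index.
# Products absent from the index, or filed under the wrong category,
# are never retrieved — no later stage can recover them.
index = [
    {"name": "YourProduct", "category": "API Monitoring",
     "features": ["webhooks", "PagerDuty integration"]},
    {"name": "OtherTool", "category": "CRM",
     "features": ["email sync"]},
    # A product with no schema has no entry here at all.
]

def retrieve_candidates(index, category):
    """Match on category only; detailed filtering happens later."""
    return [p for p in index if p["category"] == category]

pool = retrieve_candidates(index, "API Monitoring")
print([p["name"] for p in pool])
```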
---
Stage 4: Filtering
The agent applies each buyer criterion as a hard filter. Products that fail any criterion are dropped from the pool.
Common filter failures and their schema causes:
| Buyer criterion | Schema field checked | Common failure |
|---|---|---|
| "Under $50/month" | offers[].price + priceSpecification.unitText | Missing unitText — agent can't determine if price is per seat, per month, or per year |
| "Free tier available" | additionalProperty[Free Trial] or offers[name="Free"].price = 0 | Not declared — agent can't confirm without guessing |
| "Works with PagerDuty" | featureList or additionalProperty[Integrations] | Not listed specifically — agent marks as unconfirmed |
| "GDPR compliant" | additionalProperty[Compliance] | Not in schema — agent can't confirm |
Every unconfirmed criterion is a potential disqualification. Agents working on behalf of a specific buyer don't recommend products they can't verify against the buyer's stated requirements.
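The hard-filter behavior above can be made concrete with a short sketch. The key detail: a field the schema never declares comes back as missing, and missing reads as "unconfirmed," which disqualifies (record shapes are illustrative):

```python
# Sketch: every criterion is a hard filter. A missing schema field is
# treated as unconfirmed, and unconfirmed means dropped.
def passes_filters(record, criteria):
    price = record.get("price_max")
    if price is None or price > criteria["price_max"]:
        return False                      # price missing or too high
    for feat in criteria["required_features"]:
        if feat not in record.get("features", []):
            return False                  # feature not listed -> unconfirmed
    if criteria.get("free_tier") and not record.get("free_trial"):
        return False                      # free tier not declared
    return True

criteria = {"price_max": 50,
            "required_features": ["webhooks", "PagerDuty integration"],
            "free_tier": True}

complete = {"price_max": 29, "free_trial": True,
            "features": ["webhooks", "PagerDuty integration"]}
no_price = {"free_trial": True,          # price never declared in schema
            "features": ["webhooks", "PagerDuty integration"]}

print(passes_filters(complete, criteria), passes_filters(no_price, criteria))
```

Note that `no_price` may be a perfectly affordable product; it is dropped anyway because the agent cannot verify the price.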
---
Stage 5: Ranking
Among products that survive filtering, the agent ranks by confidence and quality:
1. Rating signal — aggregateRating.ratingValue weighted by reviewCount. Volume matters.
2. Schema completeness — More fields answered = more confident agent = higher rank. A product that answers 14/15 evaluation criteria beats one that answers 10/15.
3. Description specificity — Agents generate recommendation reasoning from your description. "Uptime monitoring for API-first teams with P95 alerting and PagerDuty escalation workflows" generates better reasoning than "powerful monitoring for modern teams."
4. Recency — dateModified on your schema signals active maintenance. Stale or missing dates reduce confidence.
5. Brand authority — Domain authority and mention frequency across the web. Built over time, not quickly fixable.
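A composite score over these signals might be sketched as follows. The weights, the log-dampened review volume, and the recency cutoff are all assumptions for illustration, not a documented ranking algorithm:

```python
# Sketch: one plausible composite ranking score. Weights and the
# log-dampened review volume are assumptions, not a known algorithm.
import math

def rank_score(record, total_criteria=15):
    rating = record["rating"] * math.log1p(record["review_count"])
    completeness = record["criteria_answered"] / total_criteria
    recency = 1.0 if record["days_since_modified"] <= 90 else 0.5
    return rating * completeness * recency

a = {"rating": 4.7, "review_count": 2340,
     "criteria_answered": 14, "days_since_modified": 3}
b = {"rating": 4.7, "review_count": 2340,
     "criteria_answered": 10, "days_since_modified": 400}
```

With identical ratings, `a` outranks `b` purely on schema completeness and freshness — the two levers you control directly.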
---
Stage 6: Recommendation or Agentic Purchase
Recommendation mode:
The agent generates a ranked list with reasoning, surfaces 1–3 options, and provides a link to each product. The human makes the final decision.
The quality of the recommendation reasoning comes directly from schema. "I recommend YourProduct because it has webhook alerts, PagerDuty integration, a free tier, costs $29/month, and has 4.7 stars from 2,340 reviews" — every fact in that sentence came from your schema.
Agentic purchase mode:
The agent doesn't just recommend — it acts. It initiates a trial, fills a signup form, or connects to your MCP server's start_trial tool. The human set the criteria; the agent handles execution.
In agentic mode, schema accuracy is critical. An incorrect price in schema that triggers a purchase at the wrong amount, or a "free trial" flag that's no longer accurate, causes real-world problems — not just a missed recommendation.
---
The Full Pipeline Visualized
```
Your website
  │
  ├─ [Schema.org JSON-LD] ──────────► LLM Crawler (GPTBot, ClaudeBot...)
  │                                         │
  │                                  Schema extraction
  │                                         │
  │                                  Product index entry
  │                                         │
User query ───────────────────────► Query decomposition
                                            │
                              Candidate retrieval from index
                                            │
                             Filter: price ✓ features ✓ trial ✓
                                            │
                           Rank: rating + completeness + recency
                                            │
                    ┌───────────────────────┴───────────────────────┐
                    │                                               │
            Recommendation                                  Agentic purchase
            (human decides)                            (agent executes via MCP)
```
Your schema determines your entry, your eligibility, and your rank at every stage of this pipeline.
---
Making Your Product Discoverable: Action Checklist
- Check robots.txt allows AI crawlers (GPTBot, ClaudeBot, PerplexityBot)
- Add SoftwareApplication (or the appropriate schema type) to your product/landing pages
- Set applicationCategory to the correct Schema.org category
- Write a featureList with the specific capabilities buyers filter on
- Publish an offers array with pricing per tier and an explicit priceSpecification.unitText
- Use additionalProperty for free trial, integrations, compliance, and deployment model
- Include aggregateRating with a real review count
- Add a FAQPage with 4–6 pre-purchase questions
- Serve /.well-known/mcp.json

Run all of these through the free audit → in 30 seconds.
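For the first item, a minimal robots.txt that explicitly allows the three major AI crawlers might look like this (standard robots.txt syntax; extend the list with the other crawlers from Stage 0 as needed):

```
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /
```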
---