
Agentic Commerce Merchandising: Catalog Optimization

AI-driven catalog merchandising — PLP re-ranking agents, attribute enrichment, visual tagging, and 90-day lift study methodology for eCommerce agencies.

Digital Applied Team
April 15, 2026
12 min read

Key Takeaways

Static Merchandising Is the Bottleneck: Weekly revenue-ranked category pages ignore cohort intent, inventory velocity, and real-time signals — the single largest easy-win in agentic commerce.
Four Agents, Not One Model: Production catalog optimization splits into PLP re-ranking, attribute enrichment, visual tagging, and a personalization orchestrator — each with its own evals and rollback behavior.
Cohort-Aware Ranking Wins: Re-ranking by cohort and intent signal (new vs returning, search query, device, source) consistently lifts PLP conversion 8-18% in controlled tests, more than any single-lever tweak.
Attribute Gaps Are the Hidden Tax: Most catalogs ship 40-60% of SKUs with missing facet values — agents can fill gaps from source descriptions, supplier data, and imagery in days rather than quarters.
90-Day Lift Studies Build Trust: Geo-holdout or cohort-holdout designs with pre-registered metrics turn agentic merchandising from pitch deck claim into auditable business case.
Platform Pattern, Not Platform Problem: Shopify, BigCommerce, and Salesforce Commerce Cloud each expose the primitives needed — the deployment shape differs but the agent architecture is portable.

Most merchandising is static — category pages ranked by revenue once a week, maybe refreshed mid-week if a merchandiser has time. Agentic merchandising reads traffic, intent, and inventory signals in real time: it re-ranks the PLP per cohort, enriches missing attributes on the fly, and tags products from imagery faster than any human team could. Because agencies can prove the gains with 90-day lift studies, this is the easiest agentic commerce win available right now.

This framework walks through the four-agent architecture we ship to eCommerce clients on Shopify Plus, BigCommerce, and Salesforce Commerce Cloud. You will get the agent specs, the platform-by-platform deployment patterns, the 90-day lift study methodology that turns the work into a defensible business case, and the metrics map that keeps conversion gains from eroding margin.

Why Static Merchandising Loses to Agentic

Walk into any eCommerce team and ask how the homepage category pages were ranked this week. The answer is almost always a variation of the same pattern: export revenue or units sold from last week, sort descending, push the top N to the top of the PLP, pin a few hero SKUs for campaign reasons, repeat. The problem is not that the logic is bad — revenue-ranked PLPs are a reasonable baseline. The problem is that the logic is static, identical across cohorts, and updated far slower than customer behavior actually shifts.

A returning customer from paid search for "waterproof hiking boots" and a new visitor from a broad brand display ad should see different orderings of the hiking-boots PLP. The first has a sharp, constrained intent and high cart-ready probability; the second needs brand context, social proof, and range. One static ranking cannot serve both without leaving conversion on the table. Multiply that by every source, every cohort, every inventory-velocity change across the week, and the gap between what static merchandising delivers and what is achievable grows sharply.

The Three Gaps Agentic Merchandising Closes
  • Cadence gap. Weekly manual ranking versus per-session or per-cohort ranking continuously updated from live behavior.
  • Signal gap. Revenue alone versus the full signal stack — intent, device, source, inventory velocity, margin, return rate, and freshness.
  • Completeness gap. Human-curated facets and attributes on a subset of SKUs versus agent-enriched attributes across the full catalog, including imagery-derived tags no human could produce at scale.

The four-agent architecture below closes each of those gaps. It is also why every major platform — Shopify, BigCommerce, Salesforce Commerce Cloud — now exposes the primitives needed to run this pattern. The industry knows where this is going; the question is who captures the lift first.

Agent 1: PLP Re-Ranking

The PLP re-ranking agent is the headline component — the one that moves conversion numbers in the lift study. Its job is to produce a per-cohort, per-intent ordering of products within a category or search result page, updated continuously against live signals, with session-level stability so the same user on the same visit does not see shuffled cards between scrolls.

Signal Inputs

  • Intent signals. Source (paid search, organic, direct, email), query text if any, referring campaign, device type, session depth.
  • Cohort signals. New vs returning, previous category affinity, LTV band, loyalty tier, geography.
  • Product signals. 7/30/90-day conversion, inventory depth, margin, return rate, recency of listing, price band, discount status.
  • Business signals. Campaign pins, end-of-life suppressions, over-stock promotions, launch-window boosts.

Architecture: Learned Ranker + LLM Policy Agent

The production pattern is a hybrid. A lightweight learned ranker — gradient-boosted trees are fine, small transformers if volume justifies — scores every product in the category for the current cohort and intent. An LLM agent then reviews the top slate for policy compliance (campaign pins honored, end-of-life items suppressed, diversity constraints satisfied) and either approves or requests adjustments. The agent is also where exception handling lives: a low-inventory blockbuster that the ranker wants to feature but that would frustrate users post-click needs to be down-ranked with an explanation the team can audit.
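The slate-review step can be sketched as follows. This is a minimal illustration, not the production harness: `Product`, `rank_slate`, and the 0.5 down-rank factor are hypothetical names and values chosen for the example, and the LLM policy review is reduced to deterministic rules so the control flow (suppress, down-rank with logged reason, honor pins) is visible.

```python
from dataclasses import dataclass

@dataclass
class Product:
    sku: str
    score: float = 0.0          # learned-ranker score for this cohort/intent
    end_of_life: bool = False
    inventory: int = 100

def rank_slate(products, pins=(), min_inventory=5):
    """Order by ranker score, then apply merchandising policy:
    pins forced to the top, end-of-life SKUs suppressed, and
    low-inventory SKUs down-ranked with an auditable reason."""
    audit, eligible = [], []
    for p in products:
        if p.end_of_life:
            audit.append((p.sku, "suppressed: end-of-life"))
            continue
        if p.inventory < min_inventory:
            # Hypothetical penalty factor; real systems tune this per category.
            p = Product(p.sku, p.score * 0.5, p.end_of_life, p.inventory)
            audit.append((p.sku, "down-ranked: low inventory"))
        eligible.append(p)
    ranked = sorted(eligible, key=lambda p: p.score, reverse=True)
    pinned = [p for sku in pins for p in ranked if p.sku == sku]
    rest = [p for p in ranked if p.sku not in pins]
    return [p.sku for p in pinned + rest], audit
```

The audit list is the point: every deviation from raw ranker order carries a reason the merchandising team can review later.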

Session Stability and Rollback

Re-ranking should stabilize within a session. The standard approach hashes the session ID plus the cohort key into the scoring function so the same user sees the same order on the same visit even if the underlying model updates. Rollback is handled by storing the last known-good collection snapshot on the platform side; if eval metrics breach the alert threshold, the agent reverts to the snapshot within minutes rather than waiting for a human to notice and intervene.
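A minimal sketch of the session-stable hash, assuming SHA-256 over the session ID plus cohort key (the function names are hypothetical). The seed is used only to break ties, so model scores still dominate while equal-scored cards stop shuffling between renders of the same visit.

```python
import hashlib

def stable_rank_seed(session_id: str, cohort_key: str) -> int:
    """Deterministic seed: same session + cohort always yields the
    same value, even if the underlying model updates mid-session."""
    digest = hashlib.sha256(f"{session_id}:{cohort_key}".encode()).hexdigest()
    return int(digest[:8], 16)

def tie_break(skus, session_id, cohort_key):
    # Order equal-scored SKUs by a session-stable hash rather than
    # model output, so repeated renders within a visit agree.
    seed = stable_rank_seed(session_id, cohort_key)
    return sorted(skus, key=lambda s: hashlib.sha256(f"{seed}:{s}".encode()).hexdigest())
```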

Agent 2: Attribute Enrichment

Most catalogs ship with 40-60% of SKUs missing at least one material facet value — color family, material, fit, room, occasion, the specific attributes customers actually filter on. The reason is mundane: merchant-of-record workflows prioritize launch, not backfill. Humans fill the gaps when they have time, which is rarely. The attribute enrichment agent fills those gaps from authoritative sources: supplier feeds, long-form product descriptions, spec sheets, and product imagery.

Source Hierarchy

  1. Supplier feeds and PIM data (highest authority). If the color field is populated in the supplier feed but empty in the storefront catalog, pull and copy — no inference required.
  2. Long-form description text (second authority). Parse descriptions for structured attributes; most "Deep navy blue with contrast stitching" language is safe to convert to `color: navy` when flagged with source citation.
  3. Spec sheets and PDFs (third authority). Useful for technical categories — electronics, appliances, industrial — where the PIM often lacks granularity but the spec sheet is comprehensive.
  4. Product imagery (last resort, handled by the visual tagging agent). Use only when no authoritative text source exists, and always with lower confidence weighting.
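The hierarchy above reduces to a simple resolution rule: take the value from the highest-authority source that has one, and record its provenance. A sketch, with hypothetical source names and confidence weights chosen for illustration:

```python
# Hypothetical per-source confidence weights; real values are tuned per catalog.
CONFIDENCE = {"supplier_feed": 1.0, "description": 0.8,
              "spec_sheet": 0.7, "imagery": 0.4}
PRIORITY = ["supplier_feed", "description", "spec_sheet", "imagery"]

def resolve_attribute(candidates):
    """candidates maps source name -> proposed value, e.g.
    {"description": "navy", "imagery": "blue"}. Returns the value
    from the highest-authority source, with provenance attached."""
    for source in PRIORITY:
        value = candidates.get(source)
        if value is not None:
            return {"value": value, "source": source,
                    "confidence": CONFIDENCE[source]}
    return None
```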

Writing Rules

Every write by the enrichment agent must cite its source, log the confidence score, and route through a review queue when confidence falls below the per-category threshold. Critical facets like size, fit, and material get stricter thresholds than descriptive facets like "aesthetic" or "style". The agent never overwrites a human-set value; disagreements are surfaced to the review queue rather than resolved autonomously. This is the single most important guardrail against attribute hallucination.
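Those writing rules are small enough to express directly. A sketch of the gate in front of every write (function name and return shape are hypothetical): human-set values are never overwritten, and low-confidence proposals route to review.

```python
def propose_write(existing_value, proposal, threshold, human_set):
    """Decide what happens to an enrichment proposal.
    Returns ("write" | "review", reason)."""
    if human_set and existing_value is not None:
        # Never overwrite a human-set value; surface the disagreement.
        return ("review", "conflicts with human-set value")
    if proposal["confidence"] < threshold:
        return ("review", "confidence below per-category threshold")
    return ("write", f"source={proposal['source']}")
```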

Agent 3: Visual Tagging

Visual tagging unlocks facet expansion no human team could produce at scale. A frontier vision model reads every product image and extracts structured tags: primary and secondary colors, patterns, materials, silhouettes, occasions, stylistic descriptors, and increasingly specific category-dependent tags like sleeve length, neckline, heel height, or leg cut. The output feeds the facet system, the search index, and the PLP re-ranker as additional cohort signals.

What Visual Tagging Does Well (and Doesn't)
  • Strong: color families, dominant patterns, visual style clusters, detectable accessories, background/lifestyle cues.
  • Medium: exact material identification (silk vs satin from a single image), precise fit tags requiring garment-on-body context.
  • Weak: attributes that require touch or weight (fabric hand, softness, weight), brand-specific sizing nuances, regulated claims (organic, fair-trade).

Taxonomy Alignment and Human Review

Visual tags only create business value if they align with the facet taxonomy customers actually use. The agent's output is normalized against the existing facet vocabulary and anything that falls outside is flagged for taxonomy expansion rather than silently dropped or force-mapped to an adjacent value. This is how you discover that a growing "cottagecore" cohort exists in the data before a merchandiser spots it. For the broader orchestration patterns between these vision, enrichment, and ranking agents, see our multi-agent orchestration patterns guide.
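The normalization step can be sketched in a few lines. The facet vocabulary below is a made-up example; the important behavior is that out-of-vocabulary tags are flagged for taxonomy review rather than dropped or force-mapped:

```python
# Hypothetical facet vocabulary for illustration.
FACET_VOCAB = {"color": {"navy", "black", "olive"},
               "style": {"minimalist", "bohemian"}}

def normalize_tags(raw_tags):
    """Map raw vision-model tags onto the facet vocabulary.
    Anything outside it goes to the taxonomy-expansion queue."""
    accepted, flagged = [], []
    for facet, value in raw_tags:
        if value in FACET_VOCAB.get(facet, set()):
            accepted.append((facet, value))
        else:
            flagged.append((facet, value))
    return accepted, flagged
```

A recurring flagged value like `("style", "cottagecore")` is exactly the signal that the taxonomy, not the model, needs to grow.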

Agent 4: Personalization Orchestrator

The personalization orchestrator is the coordinator. The three specialist agents — re-ranking, enrichment, visual tagging — each own their domain. The orchestrator decides which signals flow to which agent, resolves conflicts, enforces global policy (minimum-representation rules for new arrivals, campaign pins that override ranker output, brand-safety suppressions), and handles cross-surface consistency so the customer sees a coherent experience across PLP, search, recommendation rail, and email.

Orchestrator Responsibilities

  • Conflict resolution. When the ranker wants to feature a SKU that the enrichment agent has flagged for incomplete attributes, the orchestrator holds the feature until enrichment completes or down-weights with reason logged.
  • Cross-surface consistency. A cohort that just saw a hero banner for a campaign should see that campaign's products ranked coherently on the PLP they land on, without the PLP re-ranker independently deciding a different product is the right hero.
  • Global policy enforcement. Margin floors, return-rate ceilings, new-arrival representation minimums, regional inventory constraints.
  • Budget and cost awareness. Rate-limiting the specialist agents so enrichment passes don't saturate inference capacity during peak traffic windows.
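The first responsibility, conflict resolution, can be sketched as a gating rule (names and the 90% coverage floor are hypothetical): the orchestrator holds a ranker-proposed feature until the enrichment agent reports adequate critical-facet coverage.

```python
def gate_feature(sku, ranker_wants_feature, facet_coverage, min_coverage=0.9):
    """Orchestrator rule: a SKU the ranker wants to feature is held
    until critical-facet coverage clears the floor, with the reason
    logged for audit. Returns ("pass" | "hold" | "feature", reason)."""
    if not ranker_wants_feature:
        return ("pass", None)
    if facet_coverage < min_coverage:
        return ("hold", f"coverage {facet_coverage:.0%} < {min_coverage:.0%}")
    return ("feature", None)
```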

The orchestrator is the natural place to apply production agent patterns around reliability and cost — see our Claude Agent SDK production patterns guide for the underlying harness.

Shopify Deployment Pattern

Shopify (including Shopify Plus) is the most common deployment target because of the combination of Admin API coverage, Storefront API for custom frontends, and the metafields system for storing agent-derived attributes. The typical stack keeps Shopify as the system of record and runs the agents in a separate compute layer that reads and writes via API.

Shopify Integration Surface
  • Admin GraphQL API for reading products and collections and writing collection sort orders, metafields, and tags.
  • Webhooks on product create/update, collection update, and inventory levels to keep the agent layer in sync.
  • Metafields and metaobjects for storing agent-derived attributes with provenance (source, timestamp, confidence).
  • Storefront API for custom Hydrogen or headless implementations that render per-cohort PLPs at the edge.
  • Shopify Functions for lightweight sort and filter logic that needs to run inside the Shopify platform rather than in an external service.

Deployment Shape for Shopify Plus

For theme-based storefronts, the re-ranker writes manual sort orders to collections via the Admin API and honors Shopify's native collection caching. For Hydrogen or headless storefronts, the re-ranker is called at request time from the edge layer with cohort keys, bypassing Shopify's collection ordering entirely. Attribute enrichment writes to metafields or metaobjects so theme and headless layers both benefit. Visual tagging runs as a background job keyed on product-image webhooks.
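A sketch of the metafield write with provenance, assuming the Admin GraphQL `metafieldsSet` mutation (the `agent_enrichment` namespace and the JSON value shape are our own conventions, not Shopify's). Only the payload construction is shown; the actual API call is omitted.

```python
import json

# Assumes Shopify's Admin GraphQL metafieldsSet mutation.
METAFIELDS_SET = """
mutation SetAgentAttributes($metafields: [MetafieldsSetInput!]!) {
  metafieldsSet(metafields: $metafields) {
    metafields { id }
    userErrors { field message }
  }
}
"""

def enrichment_metafield(product_gid, facet, value, source, confidence):
    """Build one MetafieldsSetInput entry. Provenance (source,
    confidence) is stored inside the JSON value so theme and
    headless layers can both read and audit it."""
    return {
        "ownerId": product_gid,
        "namespace": "agent_enrichment",   # hypothetical namespace
        "key": facet,
        "type": "json",
        "value": json.dumps({"value": value, "source": source,
                             "confidence": confidence}),
    }
```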

BigCommerce Deployment Pattern

BigCommerce pairs well with agentic merchandising because its Catalyst storefront framework and GraphQL Storefront API are designed for headless, edge-rendered experiences where per-cohort logic can live naturally. The integration surface is different from Shopify in naming but comparable in capability.

BigCommerce Integration Surface

  • V3 Catalog API for reading and writing products, categories, sort orders, and custom fields.
  • Custom fields and metafields for agent-derived attributes with provenance metadata.
  • Webhooks across products, categories, inventory, and orders for signal collection.
  • Catalyst on Next.js as the storefront framework, giving natural edge-layer hooks for per-cohort rendering.
  • BigCommerce B2B Edition primitives when the merchant runs a mixed B2C/B2B motion and the agent needs to respect buyer-group pricing and catalog visibility.

The deployment shape mirrors the Hydrogen pattern on Shopify: keep BigCommerce as the system of record, run the agent layer externally, and have the edge layer (Catalyst or a custom Next.js storefront) call the re-ranker at request time with cohort context. BigCommerce's stronger native support for multi-storefront makes the orchestrator pattern especially natural — a single agent set serves multiple brand fronts with per-brand policy overrides.

Salesforce Commerce Cloud Deployment Pattern

Salesforce Commerce Cloud B2C is the enterprise deployment target — larger catalogs, more brands per instance, more integration complexity, and a native Einstein personalization layer already in place. Agentic merchandising on SFCC is less about replacing Einstein and more about augmenting it: Einstein handles the native personalization surface, and the agent layer handles the PLP re-ranking, enrichment, and visual tagging that Einstein doesn't natively own.

SFCC Integration Surface
  • OCAPI and SCAPI for catalog reads and writes, including product, category, and custom attribute operations.
  • B2C Commerce custom attributes for agent-derived enrichment and visual tag outputs.
  • Einstein recommendations consumed as a signal into the orchestrator rather than a competing ranking system.
  • PageDesigner for merchandiser-facing controls that expose agent state and allow pinning, overriding, and reviewing proposed changes.
  • Data Cloud integration for unified cohort signals across storefront, service, and marketing touchpoints.

The deployment takes six to eight weeks longer than Shopify or BigCommerce primarily because of release cadence and change-management overhead, not because the architecture differs in substance. For an enterprise client with brand sensitivity and stakeholder complexity, that additional runway is usually worth it — the eventual governance, auditability, and cross-brand leverage are stronger than on smaller platforms.

90-Day Lift Study Methodology

The 90-day lift study is what turns agentic merchandising from pitch-deck claim into defensible business case. The core design principles: pre-register the primary metric before launch, design for adequate statistical power at a realistic effect size, hold the treatment stable for the full 90 days, and lock analysis decisions before opening the data. This is experimentation discipline, not marketing.

Study Design Choices

Three designs work in practice, in rough order of preference depending on platform and traffic mix:

  • Cohort-split (preferred). Randomize at the session or user level within a stable hash; control sees the current ranking, treatment sees agent-driven ranking. Cleanest design when session identity is reliable.
  • Geo-holdout. Hold out specific geographies (e.g. three DMAs in the US, two countries in EMEA) as control while treatment runs everywhere else. Good when cohort-split is hard (heavy direct traffic, fuzzy identity).
  • Switchback. Alternate treatment and control across time windows. Weakest design for catalog work because of seasonality and day-of-week confounds; use only when the other two are impossible.

Pre-Registration Checklist

  • Primary metric: typically revenue per session or PLP conversion rate, fixed before launch.
  • Guardrail metrics: return rate, margin per session, search-to-PLP exit rate, page latency.
  • Minimum detectable effect: the smallest lift you would act on, locked before data review.
  • Stopping rules: when and why the test would be stopped early (safety or guardrail breach), and when it would be extended.
  • Analysis plan: specific statistical tests, subgroup analyses, and how multiple comparisons will be handled.
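The minimum-detectable-effect line in the checklist implies a sample-size commitment. A rough sketch using the standard two-proportion approximation at alpha = 0.05 and 80% power (the function name is hypothetical, and real studies should use a proper power tool rather than this back-of-envelope form):

```python
from math import ceil

def sessions_per_arm(baseline_cvr, mde_relative, z_alpha=1.96, z_beta=0.84):
    """Approximate sessions per arm for a two-proportion test.
    z_alpha=1.96 ~ alpha 0.05 two-sided; z_beta=0.84 ~ 80% power."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + mde_relative)   # relative MDE, e.g. 0.10 = +10%
    p_bar = (p1 + p2) / 2
    delta = p2 - p1
    n = 2 * (z_alpha + z_beta) ** 2 * p_bar * (1 - p_bar) / delta ** 2
    return ceil(n)
```

Running the arithmetic makes the 90-day window concrete: detecting a +10% relative lift on a 3% baseline PLP conversion rate needs tens of thousands of sessions per arm, which is why low-traffic categories get pooled or excluded at pre-registration time.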

Metrics That Matter

The metric set below is what we ship on every engagement. Each metric covers a failure mode that conversion rate alone misses, and every one should be reported on the same dashboard so stakeholders can see the full picture rather than a cherry-picked slice.

Each metric below is listed with its owning agent and the failure mode it catches:
  • PLP conversion rate (re-ranker). Headline lift; direct read on ranking quality.
  • Revenue per session (orchestrator). Margin and price-mix shifts hidden by CVR alone.
  • Search-intent alignment (re-ranker). Clicks that match the query's dominant intent category.
  • Category discovery rate (orchestrator). Whether agents help customers find breadth, not just close the first sale.
  • ATC-to-purchase rate (re-ranker). Top-of-funnel lifts that break the bottom of the funnel.
  • 30-day return rate (visual tagger, enrichment). Visual-tag and attribute errors that drive wrong-product purchases.
  • Facet coverage (enrichment, visual tagger). Catalog health; percent of SKUs with complete critical facets.
  • Ranking stability (re-ranker). Session-level thrash that annoys users and hurts trust.
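Ranking stability is the least standard metric in the set, so a concrete definition helps. One simple formulation (a sketch; the function name and pairwise definition are ours) is the share of SKU pairs that keep their relative order between two renders of the same PLP:

```python
def ranking_stability(order_a, order_b):
    """Share of SKU pairs (among SKUs present in both renders) that
    keep their relative order. 1.0 = perfectly stable, 0.0 = reversed."""
    pos_b = {sku: i for i, sku in enumerate(order_b)}
    common = [s for s in order_a if s in pos_b]
    pairs = concordant = 0
    for i in range(len(common)):
        for j in range(i + 1, len(common)):
            pairs += 1
            if pos_b[common[i]] < pos_b[common[j]]:
                concordant += 1
    return concordant / pairs if pairs else 1.0
```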

Pair this metric set with the eCommerce SEO checklist to make sure the ranking changes don't erode organic category performance, and with the agentic commerce protocol guide for the adjacent story on how AI shopping agents on the buyer side will shape these surfaces next.

Conclusion

Agentic commerce merchandising is not a far-future bet. The platform primitives exist on Shopify, BigCommerce, and Salesforce Commerce Cloud today. The four-agent architecture — PLP re-ranking, attribute enrichment, visual tagging, and a personalization orchestrator — is portable across all three and covers the gaps that static merchandising leaves open. The 90-day lift study methodology is what converts the work from anecdote to auditable business case.

The agencies and in-house teams moving fastest are the ones treating this as a sequenced program: enrichment and visual tagging first to clean the catalog, PLP re-ranking with lift-study scaffolding next, personalization orchestration last once the specialist agents are stable. For a longer view of the platform landscape this sits inside, the 2026 eCommerce platform comparison matrix is the companion read.

Ready to Turn Your Catalog Into an Agentic Advantage?

We design and ship PLP re-ranking agents, attribute enrichment pipelines, visual tagging systems, and the 90-day lift studies that prove the work pays for itself — on Shopify Plus, BigCommerce, and Salesforce Commerce Cloud.

