Testing ChatGPT's Non-Ranking Nature: Espresso Machine Brand Visibility Analysis

Our Hypothesis

We hypothesized that, unlike Google's search results, which provide clear and consistent rankings, ChatGPT doesn't actually "rank" brands in the traditional sense. A brand might be mentioned first in one answer but appear later (or not at all) in others, suggesting that ChatGPT's responses are driven more by contextual relevance than by strict ranking.

Inspiration: Elie Berreby's Analysis

This experiment was inspired by Elie Berreby's article "Rankings Don't Apply to AI Search", which argues that traditional SEO ranking concepts don't apply to AI-powered search. Berreby suggests that AI responses are inherently variable and context-dependent, challenging the notion of consistent rankings in AI-generated content.

Methodology

To test this hypothesis, we conducted a comprehensive experiment totaling 400 runs across four distinct rounds (100 runs per round):

  • Round 1 (Baseline): 100 runs with search mode disabled, using the exact same question to establish a baseline for ChatGPT's response patterns
  • Round 2: 100 runs with slight variations in question phrasing to test Elie Berreby's point about response consistency
  • Round 2.2: 100 runs with more significant question variations, focusing on different aspects like reliability, best taste, and quality to understand how context affects brand visibility
  • Round 3: 100 runs with search mode enabled to analyze how web search capabilities impact response variability
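Each run's response was reduced to an ordered brand list before aggregation. A minimal sketch of that extraction step, assuming a fixed watchlist of brand names and that position in the answer text approximates "rank" (both assumptions of ours, not details from the original write-up):

```python
# Assumed watchlist; the actual experiment may have discovered brands
# dynamically from the responses rather than using a fixed list.
KNOWN_BRANDS = ["De'Longhi", "Breville", "Gaggia", "Casabrews", "Mr. Coffee"]

def parse_ranking(response_text: str) -> list[str]:
    """Return known brands in order of first mention in the response.

    For answers shaped like "1. De'Longhi ... 2. Breville ...", the
    character offset of a brand's first mention is a reasonable proxy
    for its position in ChatGPT's recommendation list.
    """
    positions = []
    for brand in KNOWN_BRANDS:
        idx = response_text.find(brand)
        if idx != -1:  # brand appears somewhere in the answer
            positions.append((idx, brand))
    # Sort by offset so earlier mentions come first.
    return [brand for _, brand in sorted(positions)]
```

Running `parse_ranking` over all 100 responses of a round yields the per-run rankings that the comparisons below aggregate.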

Key Findings

Round 1: Baseline Analysis

[Figure: Round 1 Rankings Comparison]

Our baseline analysis of 100 identical queries asking "what's the best espresso machines under $250" revealed a surprising level of consistency in ChatGPT's responses. De'Longhi held the #1 position with almost no exceptions, while positions 2–5 showed minor variations in their relative ordering. This pattern of consistency in top positions, with some flexibility in lower rankings, is actually quite similar to what we observe in Google search results. In only a few runs (3–4 of the 100) did ChatGPT return a top 3 instead of a top 5, and even then the top 3 remained consistent.

This finding suggests that ChatGPT does maintain a form of ranking consistency for certain queries, particularly when the question is specific and the product category has clear market leaders. The consistency in De'Longhi's #1 position across 100 runs indicates that ChatGPT's responses aren't entirely random or context-dependent for well-established product categories.
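The consistency claim above can be quantified by counting how often each brand holds the #1 slot across runs. A minimal sketch, where the run data is illustrative rather than the actual experiment logs:

```python
from collections import Counter

# Illustrative per-run rankings (NOT the real 100-run logs).
runs = [
    ["De'Longhi", "Breville", "Gaggia"],
    ["De'Longhi", "Gaggia", "Breville"],
    ["De'Longhi", "Breville", "Casabrews"],
]

def top_position_share(runs: list[list[str]], brand: str) -> float:
    """Fraction of runs in which `brand` appears in position #1."""
    firsts = Counter(run[0] for run in runs if run)
    return firsts[brand] / len(runs)
```

A share near 1.0 for De'Longhi, as in Round 1, is the kind of stability the baseline exhibited; the same counter extended over all positions gives the full rankings-comparison chart.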

Round 2: Question Variation Analysis

[Figure: Round 2 Brand Visibility]

In Round 2, we tested Elie Berreby's challenge about question phrasing by introducing subtle variations to the original question. Despite these variations, the results remained remarkably consistent, suggesting that with minor phrasing changes, it's still reasonable to discuss rankings in ChatGPT's responses. The core brands maintained their relative positions, indicating a level of stability in ChatGPT's recommendations for this product category.

Round 2.2: Contextual Variation Analysis

[Figure: Brand Visibility Heatmap]
[Figure: Brand Visibility Comparison]

Round 2.2 introduced stronger contextual variations by incorporating terms like "most reliable," "best taste," and "most professional." This round better aligns with Elie's fundamental challenge about question phrasing: it demonstrates how emphasizing different aspects of the same product category can lead to significantly different brand recommendations. While the original brands held roughly similar positions, the total number of distinct brands mentioned increased substantially, and some brands showed strong visibility for specific concepts while remaining nearly absent for others. This confirmed Elie's point about established brands like De'Longhi maintaining strong associations with key concepts like reliability. However, based on our dataset of real ChatGPT conversations, Round 2's results may better reflect the natural variation in question phrasing that users typically employ.
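The heatmap above boils down to a brand-by-concept mention matrix. A sketch of how such a matrix can be built, with made-up tagged runs standing in for the Round 2.2 logs:

```python
from collections import defaultdict

# Each run is tagged with the concept its question emphasized.
# Data is illustrative; the real Round 2.2 logs are not reproduced here.
tagged_runs = [
    ("reliability", ["De'Longhi", "Breville"]),
    ("reliability", ["De'Longhi", "Gaggia"]),
    ("taste",       ["Gaggia", "Rancilio"]),
]

def visibility_matrix(tagged_runs):
    """Count mentions per (brand, concept) pair -- the heatmap's raw data."""
    matrix = defaultdict(lambda: defaultdict(int))
    for concept, brands in tagged_runs:
        for brand in brands:
            matrix[brand][concept] += 1
    return matrix
```

A brand with a high count in one column and zeros elsewhere is exactly the "strong visibility for specific concepts while nearly absent for others" pattern described above.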

Round 3: Search Mode Impact Analysis

[Figure: Brand Mentions Evolution]

Round 3 addressed the challenge raised by John Campbell, Itay Malinski, and Paul DeMott regarding the impact of ChatGPT's Search feature. With Search enabled (increasingly the default setting), we observed what could be described as a "fuzzy ranking" system. While the core group of brands remained similar to the non-Search results, approximately half a dozen new brands appeared, sometimes even claiming top positions. This introduces more noise into the rankings while maintaining some consistency with the original results.
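The "fuzzy ranking" effect can be summarized by comparing the set of brands seen with Search enabled against the baseline set. A sketch using set overlap (Jaccard similarity) on made-up brand sets, not the actual Round 3 data:

```python
# Illustrative brand sets; the real rounds surfaced their own lists.
baseline = {"De'Longhi", "Breville", "Gaggia", "Casabrews", "Mr. Coffee"}
search_on = {"De'Longhi", "Breville", "Gaggia", "Rancilio", "Flair", "Solis"}

def brand_churn(baseline: set[str], search_mode: set[str]):
    """Overlap and novelty between non-Search and Search-enabled brand sets.

    Returns the Jaccard similarity of the two sets (1.0 = identical,
    0.0 = disjoint) plus the sorted list of brands that only appear
    once Search is enabled.
    """
    overlap = baseline & search_mode
    new_brands = search_mode - baseline
    jaccard = len(overlap) / len(baseline | search_mode)
    return jaccard, sorted(new_brands)
```

A mid-range Jaccard score with a handful of new entrants matches what Round 3 showed: a stable core plus roughly half a dozen Search-introduced brands adding noise.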

Conclusion

Our comprehensive experiment with 100 runs per round revealed several key insights about ChatGPT's ranking behavior:

  • With identical or subtly varied questions (Rounds 1 and 2), ChatGPT shows remarkable consistency in its rankings, particularly for established market leaders
  • Strong contextual variations (Round 2.2) introduce more brand diversity while maintaining core brand associations with specific concepts
  • Search mode (Round 3) creates a "fuzzy ranking" system that maintains core brand presence while introducing additional variability
  • The results suggest that ChatGPT's responses can be discussed in terms of rankings, but with important caveats about context and search settings
  • Established brands maintain strong positions across variations, particularly when associated with key concepts like reliability

These findings suggest that while ChatGPT's responses aren't as rigid as traditional search rankings, they do maintain a degree of consistency that makes ranking-based analysis meaningful, especially for well-established product categories and market leaders.