How Vinted Serves Personalised Search Autocomplete (9 minute read)
Vinted rebuilt its autocomplete system using edge-ngram indexing on Vespa and a LightGBM re-ranking model, growing autocomplete usage from 8% to over 20% of search sessions while serving 4,700 QPS at 31ms P99.
What: A detailed technical breakdown of Vinted's search autocomplete rebuild, covering candidate generation from product metadata and search logs (125M suggestions across 50+ markets), offline heuristic scoring, edge-ngram indexing for sub-30ms matching, fuzzy typo handling, and a 63-feature LightGBM Learning-to-Rank model for personalised re-ranking that runs inside Vespa on every keystroke.
Why it matters: Shows how investing in foundational retrieval infrastructure delivered bigger wins than ML alone, and demonstrates that engagement metrics are better autocomplete indicators than downstream revenue—even Amazon reports only 0.13% revenue lifts from autocomplete improvements despite strong usage gains.
Takeaway: Consider moving matching cost to index time with edge-ngrams if prefix queries are too slow; run LTR only on exact matches, not fuzzy fallbacks; and test debounce timing: Vinted saw a 12% usage lift after dropping the delay from 350ms to 100ms.
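The debounce point is easy to picture in code. Below is a minimal sketch of keystroke debouncing using Python's asyncio; the `Debouncer` class and the timings are illustrative, not Vinted's client implementation:

```python
import asyncio

class Debouncer:
    """Fire a callback only after input has been quiet for `wait` seconds.

    Generic sketch of the debounce pattern; a new keystroke cancels the
    previously scheduled suggestion fetch.
    """
    def __init__(self, wait: float):
        self.wait = wait
        self._pending = None

    def call(self, fn, *args):
        # Cancel the fetch scheduled by the previous keystroke, if any.
        if self._pending is not None:
            self._pending.cancel()
        self._pending = asyncio.create_task(self._fire(fn, *args))

    async def _fire(self, fn, *args):
        await asyncio.sleep(self.wait)
        fn(*args)

async def demo():
    fetched = []
    debounce = Debouncer(wait=0.1)  # 100 ms, the setting Vinted landed on
    for prefix in ("z", "za", "zar"):
        debounce.call(fetched.append, prefix)
        await asyncio.sleep(0.02)  # keystrokes arrive faster than the window
    await asyncio.sleep(0.2)       # let the last scheduled fetch fire
    return fetched
```

Running `asyncio.run(demo())` returns `["zar"]`: only the final prefix survives the quiet window, so only one suggestion request is issued instead of three.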
Deep dive
- Vinted generates 125 million autocomplete candidates from two sources: product metadata combinations (brand+category+color) and actual user search queries, with query-based suggestions comprising only 2% of the pool but driving 50% of clicks
- Offline scoring uses a multi-objective heuristic combining sell-through rate, sold item count, suggestion usage, and CTR—normalized per country, language, and first letter so suggestions compete within their context, not globally
- Edge-ngram indexing moved matching cost from query time to index time by pre-splitting each suggestion into all of its prefixes when indexing ("apple" → ["a", "ap", "app", "appl", "apple"]), dropping P99 latency from 220ms to 25ms
- Accent handling uses a multiplexer to index both original and ASCII-folded tokens, so typing "z" matches both "Zara" and "Žalgiris" but typing "ž" returns only "Žalgiris"—preserving intent when users deliberately type accents
- Progressive query relaxation cascades through three tiers (exact prefix → fuzzy edit distance 1 → fuzzy edit distance 2), stopping as soon as 10 deduplicated suggestions are found, with 62% of requests never leaving the exact tier
- The LightGBM LTR model uses 63 features across four groups (query/suggestion properties, popularity signals, user behavior like click history and category preferences, and contextual factors), optimizing for NDCG@1 with LambdaRank
- Top features by importance are input length, the input length at which users typically click a given suggestion, prefix-level click frequency, and suggestion CTR, validating that the model builds on the heuristic baseline rather than replacing it
- Vespa runs two-phase ranking: the first phase uses the offline SLS heuristic score to select the top 1,000 candidates per content node, then the second phase re-ranks the top 20 with LightGBM using user features fetched in real time from Vinted's Feature Store
- Over 35 A/B tests yielded key lessons: cleaning noisy training labels from short prefixes (where users are still typing) immediately improved ranking quality, and restricting LTR to exact matches only (not fuzzy) gave a clear relevance boost
- The cumulative SLS impact measured +49% suggestion CTR and +42% suggestion usage; adding LTR personalisation on top delivered another +8% CTR and +4% usage, with up to +16% CTR on longer queries and stronger effects in non-clothing verticals like sports (+0.91% transactions)
- Tests on richer UI features (capitalisation, category scopes) consistently lost to plain lowercase suggestions—industry defaults exist for a reason, and novelty in autocomplete UX rarely beats user familiarity with the basic pattern
- Infrastructure runs on Vespa clusters with 6 content nodes per datacenter (AMD EPYC 64-core, 512GB RAM), averaging 2% search CPU and peaking at 4.5% during evening traffic, with substantial headroom for growth
- Key architectural decision: Vespa was chosen over Elasticsearch for native ML inference support despite weaker lexical analysis—the team contributed Lucene Linguistics to Vespa to bridge the gap and bring edge-ngram tokenization into the platform
- Future roadmap includes session-aware re-ranking using previous queries as context, surfacing users' past searches directly in autocomplete, and exploring LLM-based suggestion generation for long-tail queries once latency constraints can be met
- Biggest learnings: get retrieval foundations right first (most usage lift came before ML), real user queries beat generated metadata combinations when volume exists, personalisation pays off in the long tail rather than in aggregate metrics, and engagement metrics (CTR, usage) are more sensitive indicators than downstream revenue
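The per-context normalisation in the offline scoring step can be sketched in a few lines. The dict schema and the min-max scaling below are assumptions for illustration; Vinted's actual heuristic combines sell-through rate, sold counts, usage, and CTR into the raw score:

```python
from collections import defaultdict

def normalize_per_context(rows):
    """Min-max normalize a raw heuristic score within each
    (country, language, first-letter) bucket, so suggestions compete
    locally rather than globally. `rows` are dicts with 'country',
    'language', 'text', and 'raw_score' keys (illustrative schema)."""
    buckets = defaultdict(list)
    for r in rows:
        key = (r["country"], r["language"], r["text"][0].lower())
        buckets[key].append(r)
    for group in buckets.values():
        lo = min(r["raw_score"] for r in group)
        hi = max(r["raw_score"] for r in group)
        span = (hi - lo) or 1.0  # avoid divide-by-zero in single-item buckets
        for r in group:
            r["score"] = (r["raw_score"] - lo) / span
    return rows
```

Bucketing by first letter matters because a hugely popular "nike" in one market would otherwise crowd out every reasonable suggestion for other prefixes and locales.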
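The edge-ngram and accent-multiplexer ideas above combine naturally. This is an in-memory sketch of the indexing step, not Vespa's implementation:

```python
import unicodedata
from collections import defaultdict

def ascii_fold(text: str) -> str:
    # Strip combining marks after NFD decomposition: "Žalgiris" -> "Zalgiris".
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def edge_ngrams(term: str):
    # "apple" -> ["a", "ap", "app", "appl", "apple"]
    return [term[:i] for i in range(1, len(term) + 1)]

def build_index(suggestions):
    """Map every prefix of both the original and the ASCII-folded form
    to its suggestions, so prefix lookup is a dict hit at query time."""
    index = defaultdict(set)
    for s in suggestions:
        lower = s.lower()
        variants = {lower, ascii_fold(lower)}  # "multiplexer": index both forms
        for v in variants:
            for prefix in edge_ngrams(v):
                index[prefix].add(s)
    return index

index = build_index(["Zara", "Žalgiris"])
```

Here `index["z"]` contains both "Zara" and "Žalgiris" (the folded form was indexed), while `index["ž"]` contains only "Žalgiris": typing the accent deliberately narrows the match, exactly the intent-preserving behaviour described above.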
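The three-tier relaxation cascade can be sketched in plain Python with a classic Levenshtein distance; this is an in-memory stand-in for what Vinted evaluates inside Vespa, and the function names are illustrative:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def relax(query: str, suggestions: list, limit: int = 10) -> list:
    """Cascade through exact prefix -> fuzzy (distance 1) -> fuzzy
    (distance 2), stopping once `limit` deduplicated suggestions
    are collected."""
    q = query.lower()
    results, seen = [], set()
    for max_dist in (0, 1, 2):
        for s in suggestions:
            if s in seen:
                continue
            # Compare the query against a same-length prefix of the suggestion.
            if levenshtein(q, s.lower()[:len(q)]) <= max_dist:
                seen.add(s)
                results.append(s)
                if len(results) == limit:
                    return results
    return results
```

Because most requests fill the limit in the exact tier (62% at Vinted), the expensive fuzzy comparisons never run for the common case.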
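Two-phase ranking can be mimicked outside Vespa as follows. `heuristic_score` and `model_score` are caller-supplied placeholders standing in for the SLS score and the LightGBM model; the structure, not the scoring, is the point:

```python
import heapq

def two_phase_rank(candidates, heuristic_score, model_score,
                   first_phase_k=1000, second_phase_k=20):
    """Vespa-style phased ranking sketch: a cheap first-phase score
    selects the top candidates, then an expensive model re-ranks only
    the head of that list."""
    # Phase 1: cheap heuristic over everything, keep the best first_phase_k.
    head = heapq.nlargest(first_phase_k, candidates, key=heuristic_score)
    # Phase 2: expensive model over the top slice only.
    reranked = sorted(head[:second_phase_k], key=model_score, reverse=True)
    return reranked + head[second_phase_k:]
```

The design choice this illustrates: the model's cost is bounded by `second_phase_k`, not by corpus size, which is what makes per-keystroke ML inference affordable at 4,700 QPS.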
Decoder
- Learning-to-Rank (LTR): Machine learning approach that trains models to optimize the ordering of search results by learning from user interactions, rather than using hand-tuned scoring formulas
- Edge-ngram: Indexing technique that pre-generates all prefix substrings of a term at index time, turning expensive prefix queries into fast exact lookups (e.g., "apple" becomes ["a", "ap", "app", "appl", "apple"])
- Vespa: Open-source search and ranking engine that supports native ML model inference in the query path, allowing real-time personalization without leaving the search layer
- NDCG: Normalized Discounted Cumulative Gain, a ranking quality metric that rewards placing highly relevant results at the top of the list, with position importance decaying logarithmically
- LightGBM: Fast, memory-efficient gradient boosting framework that builds decision tree ensembles, popular for production ranking systems due to speed and native categorical feature support
- LambdaRank: A pairwise learning-to-rank algorithm that optimizes ranking metrics like NDCG directly by comparing pairs of documents and learning which should rank higher
- P99 latency: 99th percentile latency—the response time threshold that 99% of requests complete under, a standard SLA metric for high-traffic services
- Sell-through rate (STR): Percentage of listed items that actually sell, indicating real demand rather than just inventory volume
- ASCIIFolding: Text normalization filter that converts accented Unicode characters to their ASCII equivalents (ž→z, é→e), enabling accent-insensitive matching
- Levenshtein edit distance: Measure of string similarity based on minimum number of single-character edits (insertions, deletions, substitutions) needed to transform one string into another
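As a concrete companion to the NDCG entry, here is a small pure-Python implementation (a textbook version, using the linear-gain formulation):

```python
import math

def dcg(relevances):
    # Position i (0-indexed) is discounted by log2(i + 2): rank 1 -> log2(2).
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given ordering divided by DCG of the ideal
    (descending-relevance) ordering, yielding a score in [0, 1]."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg else 0.0
```

Optimizing NDCG@1, as Vinted's LTR model does, means only the relevance of the single top suggestion counts, which matches how users actually interact with autocomplete.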
Original article
Vinted rebuilt its search autocomplete system, moving from static, generic suggestions to a hybrid approach combining a strong heuristic scoring model with a Learning-to-Rank (LTR) model. They score suggestions offline using popularity, sell-through rate, and usage signals, index them with clever prefix and fuzzy matching techniques, then apply a LightGBM model in real time that incorporates user behavior and context to re-rank results.