Smarter URL Normalization at Scale: How MIQPS Powers Content Deduplication at Pinterest (10 minute read)
Pinterest built MIQPS to solve URL deduplication at scale by normalizing URLs into canonical forms while avoiding false positives that would merge distinct content.
What: MIQPS is Pinterest's URL normalization system. It strips tracking parameters, formatting variations, and other noise from URLs so that variants of the same content collapse into a single canonical representation, letting Pinterest recognize when multiple URLs point to the same underlying content.
Why it matters: URL deduplication is critical for platforms that aggregate external content because the same article or product can appear with dozens of URL variants (different tracking codes, session IDs, URL capitalization, etc.), leading to fragmented engagement metrics, redundant processing, and degraded user experience from seeing duplicates in feeds.
Takeaway: When building systems that ingest external URLs, implement URL normalization early, and pair automated rules with continuous precision monitoring so that you avoid duplicate content without accidentally merging distinct pages.
Deep dive
- Pinterest faces massive URL variation because users share links with different tracking parameters, capitalization, and formatting that all point to the same content
- MIQPS normalizes URLs by applying rule-based transformations that strip known noise patterns
- The system creates equivalence groups where multiple URL variants map to a single canonical URL representing the true underlying content
- Precision is the critical metric: over-aggressive normalization can incorrectly merge distinct content (like different product SKUs or article pages)
- Pinterest implements safeguards to prevent precision loss, likely through allowlists, domain-specific rules, or conservative parameter stripping
- Continuous evaluation loops measure normalization accuracy in production, catching cases where rules merge content incorrectly
- The feedback loop allows Pinterest to adjust normalization rules over time as new URL patterns emerge or existing rules prove too aggressive
- This approach balances recall (catching most duplicates) with precision (not creating false positives) in a production environment
- URL normalization at scale requires domain expertise because what counts as "noise" varies by website: some query parameters change the content served, while others only carry tracking data
- The system likely handles edge cases like URL encoding differences, trailing slashes, default ports, and protocol variations (http vs https)
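The normalization and grouping steps described above can be illustrated with a minimal Python sketch. This is not Pinterest's implementation: the `TRACKING_PARAMS` denylist, the function names `normalize_url` and `group_urls`, and the specific rules chosen are all assumptions made for illustration. It deliberately stays conservative: it lowercases only the scheme and host (paths can be case-sensitive), keeps `http` and `https` distinct, and drops only parameters on the denylist.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical denylist of parameters assumed to be pure tracking noise.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid", "gclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` under the illustrative rules above."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()  # userinfo is dropped in this sketch
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Treat "/shoes/" and "/shoes" as the same path; keep path case as-is.
    path = parts.path.rstrip("/") or "/"
    # Drop denylisted parameters and sort the rest so ordering is irrelevant.
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    )
    # The fragment is discarded because it is not sent to the server.
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))


def group_urls(urls):
    """Map each canonical URL to the list of variants that normalize to it."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize_url(url)].append(url)
    return dict(groups)


variants = [
    "HTTPS://Example.com:443/Shoes/?utm_source=pin&sku=42",
    "https://example.com/Shoes?sku=42&fbclid=abc",
    "https://example.com/Shoes?sku=43",
]
groups = group_urls(variants)
```

In this example the first two variants fall into one equivalence group, while the `sku=43` URL stays separate because `sku` is not on the denylist. A real system would need per-domain rules, since a parameter that is noise on one site can select content on another.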
Decoder
- URL normalization: Process of transforming different URL representations into a standard canonical form by removing irrelevant differences
- Canonical form: The single authoritative representation chosen from multiple equivalent URLs
- Equivalence groups: Sets of URLs that are considered duplicates and map to the same canonical URL
- Precision vs recall tradeoff: Precision is the share of merged URLs that really are duplicates (low precision means distinct content is being merged), while recall is the share of true duplicates that the system actually merges
- Tracking parameters: Query string parameters added to URLs for analytics (utm_source, fbclid, etc.) that don't change the underlying content
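To make the precision and recall definitions concrete, here is a small Python sketch that scores a normalizer against hand-labeled data by comparing pairs of URLs. The function names, the pairwise scoring method, and the sample labels are assumptions for illustration; the summary does not say how Pinterest computes these metrics.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set, Tuple


def duplicate_pairs(group_of: Dict[str, str]) -> Set[FrozenSet[str]]:
    """Return every unordered pair of URLs assigned to the same group."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(group_of), 2)
        if group_of[a] == group_of[b]
    }


def pairwise_precision_recall(
    predicted: Dict[str, str], labeled: Dict[str, str]
) -> Tuple[float, float]:
    """Score predicted groupings against labeled groupings, pair by pair."""
    predicted_pairs = duplicate_pairs(predicted)
    true_pairs = duplicate_pairs(labeled)
    correct = len(predicted_pairs & true_pairs)
    # With no pairs to judge, treat the score as perfect by convention.
    precision = correct / len(predicted_pairs) if predicted_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall


# Hypothetical labels: url_a and url_b are the same product; c and d differ.
labeled = {"url_a": "shoe-42", "url_b": "shoe-42", "url_c": "shoe-43", "url_d": "hat-7"}
# An over-aggressive normalizer that also merged url_c into the first group.
predicted = {"url_a": "g1", "url_b": "g1", "url_c": "g1", "url_d": "g2"}

precision, recall = pairwise_precision_recall(predicted, labeled)
```

Here the normalizer finds the one true duplicate pair, so recall is 1.0, but two of its three merged pairs are wrong, so precision is one third. This is the over-merging failure mode that precision monitoring is meant to catch.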
Original article
Pinterest's MIQPS system normalizes URLs by stripping noise, such as tracking parameters and formatting differences, so that many variant URLs map to a single canonical form and can be clustered into equivalence groups. It includes safeguards for precision (to avoid over-merging distinct content) and continuous evaluation loops that measure accuracy and adjust rules over time.