OpenRights Entertainment Rights & Catalog Intelligence
Data Cleaning intermediate

Deduplicate Titles by EIDR/ISAN + Fuzzy Title Match

Apps Script that finds duplicate titles via EIDR/ISAN exact match AND Levenshtein fuzzy matching. Catches typo-induced duplicates that pure-key matching misses.

By OpenRights Team · · 11 min read
titlesdata cleaningdeduplicationapps script
On this page

Duplicate titles in a catalog are quietly catastrophic. Two entries for “Ember Lanes” mean royalties split across two rows when they should accumulate on one; an avails query against “title contains ‘Ember Lanes’” returns both, and the platform integration picks whichever one the API returned first. The duplicate that didn’t get royalties looks inactive; someone archives it; now you can’t reconstruct the revenue history.

EIDR and ISAN should make this a solved problem — they’re unique identifiers. In practice, half the catalog has no EIDR yet, and when two people enter the same title on different days, one of them typos “Embur Lanes” and the fuzzy half goes undetected.

This tutorial walks through an Apps Script that catches both: exact match on EIDR/ISAN first (fast, high confidence), then Levenshtein fuzzy matching on normalized titles (slower, needs human review) to surface the typos.

Step 1: Set Up the Titles Sheet

The companion template ships with these columns: Title | Year | Type | EIDR | ISAN | Normalized Key | EIDR Dupe? | Key Dupe? | Notes. Paste your catalog in, then add this formula for Normalized Key (row 2):

=LOWER(TRIM(SUBSTITUTE(REGEXREPLACE(A2, "^(the|a|an) ", ""), ".", "")))&"|"&B2

This strips leading articles, lowercases, and appends the year. “The Ember Lanes (2024)” becomes ember lanes|2024.

Step 2: Run the Exact-Match Pass in Formulas

The helper columns catch EIDR and key duplicates with simple COUNTIFs:

=IF(AND(D2<>"", COUNTIF(D:D,D2)>1), "YES", "")   // EIDR Dupe?
=IF(COUNTIF(F:F, F2)>1, "YES", "")               // Key Dupe?

Conditional formatting on YES = red background. Now you’ve caught the obvious cases. The hard ones — typos — will still slip through.

Step 3: Paste the Apps Script

Open Extensions > Apps Script in your sheet and paste the openrights-title-deduplicator.gs file. The core function:

function findDuplicateTitles() {
  const sheet = SpreadsheetApp.getActive().getSheetByName("Titles Input");
  const data = sheet.getRange(2, 1, sheet.getLastRow()-1, 9).getValues();
  const results = [];

  for (let i = 0; i < data.length; i++) {
    for (let j = i+1; j < data.length; j++) {
      const a = normalizeTitle(data[i][0]);
      const b = normalizeTitle(data[j][0]);
      if (data[i][1] !== data[j][1]) continue; // different years — not a dupe
      const dist = levenshtein(a, b);
      if (dist > 0 && dist <= 2) {
        results.push([data[i][0], data[j][0], data[i][1], dist]);
      }
    }
  }
  writeResults(results);
}

The dist <= 2 threshold catches 1-2 character typos while avoiding false positives on legitimately different titles that happen to share letters.

Step 4: Review the Dedupe Candidates

The script writes its output to a “Dedupe Candidates” sheet: one row per candidate pair, with both titles, the shared year, and the edit distance. Sort by edit distance ascending — distance-1 pairs are almost always real duplicates; distance-2 pairs need human judgment.

Step 5: Handle Remakes

Same title, different year = not a duplicate. The algorithm already filters these out via the year check. Remakes share titles legitimately; “The Italian Job” (1969) and “The Italian Job” (2003) both exist, both have separate EIDRs. Don’t merge.

Common Issues

  • Subtitle collapse. “Avengers” and “Avengers: Endgame” are distance 9 — the algorithm won’t flag them, but they’re the same franchise and sometimes the same title with a subtitle dropped. Manual review catches this.
  • International titles. “Amélie” and “Le Fabuleux Destin d’Amélie Poulain” are the same film. Pure string matching can’t catch cross-language dupes; use the EIDR pass for these.
  • Episode titles. Two episodes named “Pilot” from different series aren’t dupes. Always include Series Title in the key when dealing with episodic content.

Download the Template + Apps Script

Grab the Title Deduplication Worksheet and the openrights-title-deduplicator.gs Apps Script. Paste the script into a fresh Apps Script project bound to your workbook and run findDuplicateTitles from the editor.

OpenRights Weekly

Free templates, tutorials, and data quality tips. Every week.

Need help with your catalog data?

Share your challenge and we'll help — by email or callback. 100% free, no sales pitch. Part of our mission to help rights and catalog professionals.

Get Free Help