Smartly detect and clean duplicates from your dataset (CSV or Excel).

This function scans your data to find:

🔁 Exact duplicates — identical rows or repeated entries.
🤖 Fuzzy duplicates — similar rows with small differences
(typos, spacing, casing, or minor text variations).
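To make the idea of a fuzzy duplicate concrete, here is a minimal sketch of a 0–100 similarity score. The function's actual matching algorithm is not documented here, so this uses Python's standard `difflib.SequenceMatcher` as a stand-in; the helper names `normalize` and `similarity` are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so casing/spacing differences vanish."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two text values (assumed metric)."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Differs only in casing and spacing: identical after normalization.
spacing_score = similarity("Acme Corp", "acme  corp")
# One missing letter: a high score, but below 100.
typo_score = similarity("Jon Smith", "John Smith")
# Unrelated values: a low score.
unrelated_score = similarity("Jon Smith", "Maria Lopez")
```

With a score like this, a pair would count as a fuzzy duplicate when its score meets or exceeds the chosen threshold.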

It automatically keeps the first valid occurrence of each duplicate
and exports everything neatly organized in a single downloadable ZIP.
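The keep-first behavior for exact duplicates can be sketched as follows. This is an illustration under assumptions, not the tool's implementation: rows are plain dictionaries, the helper name `split_exact_duplicates` is hypothetical, and the sketch does not model whatever the tool means by a "valid" occurrence.

```python
def split_exact_duplicates(rows, subset=None):
    """Split rows into (kept, removed), keeping the first occurrence of each key."""
    seen = set()
    kept, removed = [], []
    for row in rows:
        # Compare on the chosen columns, or on every column if none are given.
        columns = subset or list(row)
        key = tuple(row[col] for col in columns)
        if key in seen:
            removed.append(row)
        else:
            seen.add(key)
            kept.append(row)
    return kept, removed


rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Ada", "email": "ada@example.com"},  # exact repeat of the first row
]
kept, removed = split_exact_duplicates(rows)
```

Here `kept` plays the role of the cleaned dataset and `removed` the role of the dropped-rows report.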

📦 Inside the ZIP you’ll get:

- `deduplicated_<name>.csv`: your cleaned dataset (duplicates removed)
- `duplicates_removed_<name>.csv`: all duplicate rows that were dropped
- `fuzzy_pairs_<name>.csv`: pairs of rows that look alike (based on similarity)
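After downloading, you may want to confirm that the archive holds the three reports. The sketch below uses Python's standard `zipfile` module. The dataset name `customers` is hypothetical, and the archive is built in memory with placeholder contents only so the example runs without the real tool.

```python
import io
import zipfile

name = "customers"  # hypothetical value for <name>
expected = [
    f"deduplicated_{name}.csv",
    f"duplicates_removed_{name}.csv",
    f"fuzzy_pairs_{name}.csv",
]

# Build a stand-in archive in memory (placeholder contents, not real output).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    for filename in expected:
        archive.writestr(filename, "placeholder\n")

# Reading side: the same calls would apply to a real downloaded ZIP file.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    found = sorted(archive.namelist())
    missing = [f for f in expected if f not in found]
```

For a real download, replace `buffer` with the path to the ZIP file; an empty `missing` list means all three reports are present.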

Args:
    file (FilePath): The uploaded CSV or Excel file to analyze.
    subset (str): Optional. Comma-separated list of column names to check.
        If left empty, all columns are analyzed.
    similarity_threshold (int): Optional. How strict fuzzy matching should be (0–100).
        Higher = only very similar values are flagged.
        Default = 90 (good balance).

Returns:
    str: Generated ZIP archive containing the cleaned dataset
        and detailed duplicate reports.
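As a rough illustration of how the two optional arguments could be interpreted, the sketch below parses a comma-separated `subset` string and checks that `similarity_threshold` falls in the documented 0–100 range. The helper names `parse_subset` and `check_threshold` are hypothetical, and the tool's real validation may differ.

```python
def parse_subset(subset: str) -> list[str]:
    """Turn 'name, email' into ['name', 'email']; an empty result means all columns."""
    return [col.strip() for col in subset.split(",") if col.strip()]


def check_threshold(similarity_threshold: int = 90) -> int:
    """Reject thresholds outside the documented 0-100 range."""
    if not 0 <= similarity_threshold <= 100:
        raise ValueError("similarity_threshold must be between 0 and 100")
    return similarity_threshold


columns = parse_subset(" name, email ,")  # stray spaces and commas are ignored
all_columns = parse_subset("")            # empty: analyze every column
threshold = check_threshold()             # falls back to the default of 90
```

Stripping whitespace around each name makes inputs like `name, email` and `name,email` behave the same way.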

Clean Your Data: Remove Duplicates Instantly
