Mar 10 2008, 5h37

So, if, like me, you have thousands of mp3s, live with someone else who has thousands of mp3s, have stored backups of your music elsewhere multiple times when rebuilding your laptop or computer, and have tried to use MusicBrainz to fix your tags, only to find that it has successfully nuked half your collection so you're forced to copy over that backup again, and it's all stored on one central machine in your house... okay, so I'm probably the only person in the world who has this issue, but moving on.

The end result of the above is that there are probably 2-3 copies of every mp3 mxcl and I have stored on our media center, in various directories. Max was going to write a program that would sift through this mess, discard the mp3 headers and metadata, md5 the actual music itself, and store that in a DB along with the path and filename so I could fix it all later. Unfortunately, he's a busy man, so I tried yet again to find something already in existence that does this... I mean, c'mon, we can't be the ONLY people who've ended up with duplicates in our music collection. Perhaps everyone else just doesn't care, and working at Last.fm has turned me into a complete elitist when it comes to library cleanliness? I entirely blame sharevari!
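For the curious, the idea Max had in mind can be roughed out in a few lines of bash. This is only my sketch of that idea, not anything he actually wrote: the function name audio_md5 is made up, it assumes bash plus GNU coreutils (od, head, tail, md5sum), and it only skips an ID3v2 tag at the start of the file and an ID3v1 tag at the end. ID3v2 footers, APE tags and anything more exotic are ignored, so treat it as a starting point at best.

```shell
#!/usr/bin/env bash
# audio_md5 FILE
# Hypothetical helper: print "md5<TAB>path" for FILE, hashing only the bytes
# between a leading ID3v2 tag and a trailing ID3v1 tag (if either is present).
audio_md5() {
    local f=$1 size start=0 end b1 b2 b3 b4
    size=$(wc -c < "$f")
    end=$size

    # ID3v2 header: "ID3", 2 version bytes, 1 flags byte, then a 4-byte
    # "syncsafe" size (7 bits per byte) that excludes the 10-byte header.
    if [ "$size" -ge 10 ] && [ "$(head -c 3 "$f" | tr -d '\0')" = "ID3" ]; then
        read -r b1 b2 b3 b4 < <(od -An -tu1 -j6 -N4 "$f")
        start=$(( 10 + (b1 << 21) + (b2 << 14) + (b3 << 7) + b4 ))
        # A bogus size would point past the end of the file; hash everything then.
        if [ "$start" -gt "$end" ]; then start=0; fi
    fi

    # ID3v1 tag: the last 128 bytes of the file, starting with "TAG".
    if [ $(( end - start )) -ge 128 ] &&
       [ "$(tail -c 128 "$f" | head -c 3 | tr -d '\0')" = "TAG" ]; then
        end=$(( end - 128 ))
    fi

    # Hash bytes start..end-1 and print hash and path, tab-separated.
    printf '%s\t%s\n' \
        "$(tail -c +"$(( start + 1 ))" "$f" | head -c "$(( end - start ))" | md5sum | cut -d' ' -f1)" \
        "$f"
}
```

Run over every file and sorted, the output would put byte-identical audio next to each other regardless of tags. It would not catch the same song encoded twice at different bitrates, which is exactly where a fingerprinting tool earns its keep.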

That aside, after a couple of hours of searching, I found DuMP3. The premise behind this program is that it does some cool analysis on all the files in the folders you specify, stores the fingerprint data in a DB, then tells you which ones are duplicates. It /is/ Java, which means it's a complete CPU and memory whore, and also means that configuring it makes you want to cleave your eyes out with a blunt (rust optional) spork, but after a bit of fiddling around, it does work. I'm using Linux; I have no idea if it runs on Windows, nor do I particularly care. It's Java, so I assume it will, but good luck handling the output. (Just thought I'd mention that before anyone asks me without bothering to visit its homepage.)

One of the biggest challenges with the output, however, is the oddities of filenames: spaces, apostrophes, brackets, curly braces, pretty much everything you could hope for if you wanted to uncover where your one-liner falls short:

Found a duplicate:
/share/Music/mxcl/The Beatles/Help!/Yesterday.mp3
+ /share/Music/mxcl/The Beatles/1962-1966 (CD 1)/13 Yesterday.mp3 (80.59896%)

Found a duplicate:
/share/Music/mxcl/The Beatles/Unknown Album/01 When Im Sixty-four.mp3
+ /share/Music/mxcl/The Beatles/Sgt. Pepper's Lonely Hearts Club Band/08 When I'm Sixty-Four.mp3 (89.583336%)

Found a duplicate:
/share/Music/mxcl/The Beatles/Hey Jude/02 I Should Have Known Better.mp3
+ /share/Music/mxcl/The Beatles/A Hard Day's Night/I Should Have Known Better.mp3 (92.578125%)

Found a duplicate:
/share/Music/mxcl/The Beatles/Hey Jude/01 Cant Buy Me Love.mp3
+ /share/Music/mxcl/The Beatles/A Hard Day's Night/Can't Buy Me Love.mp3 (92.447914%)

Found a duplicate:
/share/Music/mxcl/The Beatles/Abbey Road/Something.mp3
+ /share/Music/mxcl/The Beatles/One/23 Something.MP3 (81.90104%)

Found a duplicate:
/share/Music/mxcl/The Beatles/Past Masters Vol. 2/Let It Be.mp3
+ /share/Music/mxcl/The Beatles/One/25 Let It Be.MP3 (80.078125%)

Found a duplicate:
/share/Music/mxcl/The Beatles/Abbey Road/Come Together.mp3
+ /share/Music/mxcl/The Beatles/One/24 Come Together.MP3 (83.333336%)

Found a duplicate:
/share/Music/mxcl/The Beatles/Past Masters Vol. 1/From Me To You.mp3
+ /share/Music/mxcl/The Beatles/One/01 From Me To You.MP3 (80.33854%)

You need to be a little circumspect about the results too, for instance:

Found a duplicate:
/share/laptopmusictomerge/music/Portishead/Dummy/02-Sour Times.mp3
+ /share/laptopmusictomerge/music/massive attack mix) (1) (1/Unknown/00-nobody loves me.mp3 (93.75%)
+ /share/Music/mxcl/Portishead/Dummy/02 Sour Times.mp3 (95.703125%)

The track in the first line, 02-Sour Times.mp3, and the one in the last line, 02 Sour Times.mp3, are indeed the same track. The massive attack mix version isn't; it does, however, sound very much alike, hence the high percentage score.
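Since anything short of 100% can be a false positive like the one above, it can help to pull out the near-misses and check them by ear rather than deleting them blind. Here's a rough sketch of how that might look. It assumes the report has been saved to a file first, and that every match line looks like the samples above ("+ path (score%)"); the function name matches_between is made up for the example.

```shell
#!/usr/bin/env bash
# matches_between MIN MAX REPORT
# Hypothetical helper: print "score<TAB>path" for every "+ path (score%)"
# line in REPORT whose score is >= MIN and < MAX.
matches_between() {
    awk -v min="$1" -v max="$2" '
        /^\+ \/.* \([0-9.]+%\)$/ {
            # The score is whatever sits inside the last pair of parentheses.
            score = $0
            sub(/^.* \(/, "", score)
            sub(/%\)$/, "", score)
            # The path is the line minus the leading "+ " and trailing " (score%)".
            path = $0
            sub(/^\+ /, "", path)
            sub(/ \([0-9.]+%\)$/, "", path)
            if (score + 0 >= min && score + 0 < max)
                printf "%s\t%s\n", score, path
        }
    ' "$3"
}
```

Something like "matches_between 90 100 report.txt | sort -n" would then list the closest non-exact matches, least similar first. Only the trailing " (score%)" is stripped, so paths that themselves contain brackets come through intact.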

So, for me, handling the results is a simple case of doing something like this:

$ ./finddups.sh /share/Music/ /share/laptopmusictomerge/ | grep '^+ /.*(100\.0%)$' | sed -e 's/^+ //' -e 's/ (100\.0%)$//' | tr '\n' '\0' > 100pcmatchlist

The above, simply put, keeps only the lines that are definitely a 100% match (I'm treating anything with a + at the front as the duplicate), strips the "+ " from the front and the " (100.0%)" from the end, swaps every linebreak for \0, which is what xargs -0 expects, and puts the result in a file called 100pcmatchlist. The sed expressions are anchored to the start and end of the line so that brackets inside a filename, like "1962-1966 (CD 1)", are left alone. The linebreak swap is done with tr because sed works a line at a time and never gets to see the newline.

Then, when I want to kill the dupes, I do**:

$ cat 100pcmatchlist | xargs -0 rm

Because xargs is sexy and clever, and because we've passed it -0 (which tells it that the separator is \0; if you're using find, add -print0 to your find command and it will use that separator too), it doesn't trip over spaces and such, and it seamlessly handles odd characters like brackets and braces.
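If you'd like to see -print0 and -0 working together before trusting them with real files, here's a small self-contained demonstration. It works entirely in a throwaway directory created with mktemp, and the file names are invented for the example.

```shell
#!/usr/bin/env bash
# Build a scratch directory with awkward file names.
scratch=$(mktemp -d)
touch "$scratch/01 When I'm Sixty-Four.mp3" \
      "$scratch/13 Yesterday (CD 1).mp3" \
      "$scratch/{live} [bootleg].MP3"

# find -print0 ends each path with \0 instead of a newline, and xargs -0
# splits on \0 only, so spaces, quotes and brackets pass through untouched.
find "$scratch" -type f -print0 | xargs -0 rm --

# Count what is left; all three files should be gone.
remaining=$(find "$scratch" -type f | wc -l)
echo "files remaining: $remaining"
rmdir "$scratch"
```

Without -0, xargs splits on whitespace and treats quotes as special, so the apostrophe in the first file name alone would be enough to make it bail out.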

So yeah, I just freed up 13 GB of space by deleting dupes. I am a happy man.

As a sidenote, RJ did suggest using our own fingerprinter for this. However, its output was a little too ambiguous for what I was trying to achieve (matching my music to my music); in general, I just found DuMP3 to be a lot better for this particular task.

**I will not be held responsible if you destroy your music collection by doing anything that I've written about above.
As always, YMMV, and as always, you should test your one-liners at least 3 times, using echo in place of any irreversible command, before running them for real.
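In that spirit, a dry run of the delete step might look something like this. It's a sketch that assumes 100pcmatchlist is \0-separated as described above; the function name dry_run_rm is made up.

```shell
#!/usr/bin/env bash
# dry_run_rm LIST
# Hypothetical helper: print the rm command that would run for each
# \0-separated path in LIST, one per line, without deleting anything.
dry_run_rm() {
    xargs -0 -n 1 echo rm < "$1"
}
```

Calling "dry_run_rm 100pcmatchlist | less" lets you page through exactly what would be removed. If a line looks mangled or several paths appear glued together, the list isn't separated the way xargs -0 expects, and it's time to stop and fix that first.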


