Machine learning algorithm for data classification.

Posted by twk on Stack Overflow
Published on 2010-06-03T15:49:33Z Indexed on 2010/06/05 17:22 UTC

Hi all,

I'm looking for guidance on which techniques/algorithms I should research to solve the following problem. I currently have an algorithm that clusters similar-sounding MP3s using acoustic fingerprinting. Each cluster contains the metadata (song/artist/album) from every file in it. For each cluster, I'd like to pick the "best" song/artist/album metadata that matches an existing row in my database or, if there is no good match, decide to insert a new row.
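To make the match-or-insert decision concrete, here is a minimal sketch using only the Python standard library. The function names, the tuple layout (artist, album, song), and the 0.85 threshold are all assumptions made for illustration; they are not taken from any existing code, and the threshold would need tuning on real data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # Case-insensitive similarity between two strings, in [0.0, 1.0].
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def match_or_insert(candidate, rows, threshold=0.85):
    """Compare a candidate (artist, album, song) tuple against database rows.

    `rows` is a hypothetical in-memory list of (artist, album, song) tuples.
    Returns ("match", row) if the best average field similarity reaches the
    threshold, otherwise ("insert", candidate).
    """
    best_row, best_score = None, 0.0
    for row in rows:
        # Average the per-field similarities into one score for this row.
        score = sum(similarity(c, r) for c, r in zip(candidate, row)) / len(candidate)
        if score > best_score:
            best_row, best_score = row, score
    if best_score >= threshold:
        return ("match", best_row)
    return ("insert", candidate)
```

A linear scan like this only suits small tables; a real database would need some kind of candidate lookup first.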

For a cluster, there is generally some correct metadata, but individual files have many types of problems:

  • the artist/song is completely misnamed, or just slightly misspelled
  • the artist/song/album is missing, but the rest of the information is there
  • the song is actually a live recording, but only some of the files in the cluster are labeled as such
  • there may be very little metadata, in some cases just the file name, which might be artist - song.mp3, or artist - album - song.mp3, or another variation
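As an illustration of the file-name case, the two patterns mentioned above could be parsed roughly as follows. This is a sketch under the assumption that fields are separated by " - " (space, hyphen, space); `parse_filename` is a hypothetical helper, and any other variation simply falls back to treating the whole name as the song title.

```python
import re

def parse_filename(name):
    """Split 'artist - song.mp3' or 'artist - album - song.mp3' into fields."""
    # Drop the .mp3 extension, whatever its case.
    stem = re.sub(r"\.mp3$", "", name, flags=re.IGNORECASE)
    parts = [p.strip() for p in stem.split(" - ")]
    if len(parts) == 2:
        artist, song = parts
        return {"artist": artist, "album": None, "song": song}
    if len(parts) == 3:
        artist, album, song = parts
        return {"artist": artist, "album": album, "song": song}
    # Unknown layout: keep the raw name as the song and leave the rest empty.
    return {"artist": None, "album": None, "song": stem.strip()}
```

Artists or titles that themselves contain " - " will be split incorrectly, which is one reason a trained model might do better than fixed patterns.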

A simple voting algorithm works fairly well, but I'd like to have something I can train on a large set of data that might pick up more nuances than what I've got right now. Any links to papers or similar projects would be greatly appreciated.
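For reference, a voting baseline of this kind might look like the sketch below. It is an illustration only, not the actual implementation: the record layout (a list of dicts per cluster) and the names `normalize` and `vote` are assumptions. Values are normalized before counting so that differences in case and spacing do not split the vote.

```python
from collections import Counter

def normalize(value):
    # Lowercase and collapse whitespace so trivial variants vote together.
    return " ".join(value.lower().split())

def vote(cluster, field):
    """Return the most common value of `field` across a cluster of records.

    `cluster` is a list of dicts such as {"artist": ..., "album": ..., "song": ...}.
    Missing or empty values do not vote. Returns None if nothing voted.
    """
    counts = Counter()
    spellings = {}
    for record in cluster:
        raw = record.get(field)
        if not raw:
            continue
        key = normalize(raw)
        counts[key] += 1
        # Remember the original spellings so the winner can be reported as written.
        spellings.setdefault(key, Counter())[raw] += 1
    if not counts:
        return None
    winner, _ = counts.most_common(1)[0]
    return spellings[winner].most_common(1)[0][0]
```

A scheme like this treats every file as equally trustworthy and cannot merge misspellings, which is the kind of nuance a trained model would be expected to pick up.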

Thanks!

© Stack Overflow or respective owner