Contributor
Prathamesh Ghatole

Clean Up The Music Listening Histories Dataset


Mentors
Alastair Porter
Organization
MetaBrainz Foundation Inc
Technologies
python, postgresql, apache spark, Typesense
Topics
data science, data analytics, Data Preprocessing, Data Engineering
The Music Listening Histories Dataset (MLHD) is pretty damn impressive: it contains ~27 billion logs of real-world last.fm scrobbles, distributed across 18 chunks that add up to ~611.39 GB of compressed text files. In total, it covers 583k users, 555k unique artists, 900k albums, and 7M tracks. Each scrobble is represented in the following format: timestamp, artist-MBID, release-MBID, recording-MBID. (Source: https://simssa.ca/assets/files/gabriel-MLHD-ismir2017.pdf) Unfortunately, this data has some significant shortcomings: last.fm’s out-of-date algorithms for matching against the MusicBrainz DB produce frequent mismatches and errors in the recording-MBID data, which lowers the quality of the available dataset. The goal of this project is to create an updated version of the MLHD in the same format as the original, but with incorrect data resolved and invalid data removed.
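To make the scrobble format concrete, here is a minimal Python sketch of how one line could be parsed and checked. It is an illustration only, not the project's actual pipeline: the tab separator, the Unix-epoch integer timestamp, and the names `Scrobble`, `clean_mbid`, and `parse_scrobble` are all assumptions. It checks only that each MBID is a well-formed UUID; resolving incorrect MBIDs against MusicBrainz is out of scope here.

```python
import uuid
from typing import NamedTuple, Optional


class Scrobble(NamedTuple):
    """One listening event (hypothetical container, not from the MLHD tooling)."""
    timestamp: int
    artist_mbid: Optional[str]
    release_mbid: Optional[str]
    recording_mbid: Optional[str]


def clean_mbid(value: str) -> Optional[str]:
    """Return the MBID in canonical UUID form, or None if it is empty or malformed."""
    value = value.strip()
    if not value:
        return None
    try:
        return str(uuid.UUID(value))
    except ValueError:
        return None


def parse_scrobble(line: str, sep: str = "\t") -> Optional[Scrobble]:
    """Parse one line; the separator is an assumption about the file layout."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 4:
        return None  # wrong number of columns: treat the row as invalid
    try:
        timestamp = int(fields[0])  # assumed to be a Unix epoch in seconds
    except ValueError:
        return None
    return Scrobble(timestamp, *(clean_mbid(f) for f in fields[1:]))
```

A row with a missing or malformed MBID still parses, with that field set to `None`, so a later step can decide whether to repair or drop it.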