Clean Up The Music Listening Histories Dataset
- Mentors
- Alastair Porter
- Organization
- MetaBrainz Foundation Inc
- Technologies
- python, postgresql, apache spark, Typesense
- Topics
- data science, data analytics, Data Preprocessing, Data Engineering
The Music Listening Histories Dataset (MLHD) is pretty impressive: it contains ~27 billion logs of real-world last.fm scrobbles, distributed across 18 chunks that total ~611.39 GB of compressed text files. Together these logs cover 583k users, 555k unique artists, 900k albums, and 7M tracks.
Each scrobble is represented as one record with four fields:
timestamp, artist-MBID, release-MBID, recording-MBID.
(Source: https://simssa.ca/assets/files/gabriel-MLHD-ismir2017.pdf)
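To make the record format concrete, here is a minimal Python sketch of reading such records. It is an illustration only, not the dataset's official reader: the `Scrobble` type and `parse_scrobbles` function are hypothetical names, the delimiter is a parameter because the exact on-disk separator is an assumption here, and the MBIDs in the sample are made up.

```python
import csv
from typing import Iterable, Iterator, NamedTuple, Optional


class Scrobble(NamedTuple):
    """One listening event: a timestamp plus three MusicBrainz IDs (hypothetical type)."""
    timestamp: int
    artist_mbid: Optional[str]
    release_mbid: Optional[str]
    recording_mbid: Optional[str]


def parse_scrobbles(lines: Iterable[str], delimiter: str = ",") -> Iterator[Scrobble]:
    """Yield a Scrobble per well-formed line; skip lines that do not fit the format.

    The delimiter is an assumption; pass "\t" if the files are tab-separated.
    """
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != 4:
            continue  # wrong number of fields
        ts, artist, release, recording = (field.strip() for field in row)
        if not ts.isdigit():
            continue  # header line or corrupt timestamp
        # Empty MBID fields are kept as None rather than as empty strings.
        yield Scrobble(int(ts), artist or None, release or None, recording or None)


# Made-up sample data, for illustration only.
sample = [
    "1325376000,11111111-1111-4111-8111-111111111111,,22222222-2222-4222-8222-222222222222",
    "not-a-timestamp,a,b,c",
    "1325376300,only,three",
]
scrobbles = list(parse_scrobbles(sample))
```

Reading lazily through a generator matters at this scale: with billions of rows, each chunk should be streamed rather than loaded into memory at once.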
Unfortunately, this data has some significant shortcomings. last.fm matched scrobbles against the MusicBrainz database using out-of-date algorithms, which resulted in frequent mismatches and errors in the recording-MBID data and lowers the quality of the available dataset.
Overall, the goal of this project is to create an updated version of the MLHD in the same format as the original, but with incorrect data resolved and invalid data removed.
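As a rough sketch of what "resolved" and "removed" could mean for a single record, the Python snippet below checks that a recording-MBID is a well-formed UUID, maps it through a lookup of outdated-to-current IDs, and drops the row if the result is unknown. This is one possible approach under stated assumptions, not the project's actual pipeline: `is_valid_mbid`, `clean_row`, `redirects`, and `canonical` are hypothetical names, and in practice the lookups would come from the MusicBrainz database rather than from in-memory literals.

```python
import uuid
from typing import Dict, Optional, Set, Tuple

Row = Tuple[int, str, str, str]  # timestamp, artist-MBID, release-MBID, recording-MBID


def is_valid_mbid(value: Optional[str]) -> bool:
    """Return True if value is a UUID written in the canonical hyphenated form."""
    try:
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, TypeError, AttributeError):
        return False


def clean_row(row: Row, redirects: Dict[str, str], canonical: Set[str]) -> Optional[Row]:
    """Return a corrected row, or None if the row should be removed.

    redirects: hypothetical mapping of outdated recording MBIDs to current ones.
    canonical: hypothetical set of recording MBIDs known to exist today.
    """
    timestamp, artist, release, recording = row
    if not is_valid_mbid(recording):
        return None  # invalid data is removed
    recording = redirects.get(recording, recording)  # incorrect data is resolved
    if recording not in canonical:
        return None  # no such recording: remove
    return (timestamp, artist, release, recording)  # same format as the input


# Made-up IDs, for illustration only.
OLD = "aaaaaaaa-aaaa-4aaa-8aaa-aaaaaaaaaaaa"
NEW = "bbbbbbbb-bbbb-4bbb-8bbb-bbbbbbbbbbbb"
redirects = {OLD: NEW}
canonical = {NEW}

fixed = clean_row((1325376000, "artist-id", "release-id", OLD), redirects, canonical)
dropped = clean_row((1325376000, "artist-id", "release-id", "garbage"), redirects, canonical)
```

Returning a row in the original four-field layout keeps the cleaned output compatible with tools written for the original dataset.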