Using the Spotify API for data-driven analysis of my playlists (Part 2/2)

Hey there and welcome to the second part of using the Spotify API for a data-driven analysis of playlists. Last time we went through connecting to the API programmatically, and how to process the necessary data. This time I will show how I used the outputs to build a Power BI report that analysed various aspects of my own playlists.

I will cover the core aspects of any report, namely the data model, purpose of each report page, cross filter techniques, and core measures. This won’t be a step-by-step guide on how to build the report, rather a highlight of the main aspects and/or challenges I faced during development.

If you just want to interact with the report, you can find it on my GitHub repository here. Oh and don’t worry about downloading any additional files for the report — the queries are loading the Parquet files hosted on GitHub from the previous article : )

The goal

Let’s start by looking at what everything is building towards: the report pages. It’s easier to understand the technical decisions made if you have in mind what the end-result should look like.

There are a total of 3 pages:

Playlists, Artists, Albums & Tracks Overview

Playlists, Artists, Albums & Tracks Overview
Playlists, Artists, Albums & Tracks Overview

This is meant as an initial overview about the “music inventory” in my playlists. It has the standard layout across the report to showcase a title at the top, followed by a header of KPIs and a slicer panel on the left. The remaining canvas is used for visuals with navigational arrows in the footer.

By default, the bar charts are already drilled down to the artist level of the “good music” playlist — my largest playlist — but the full hierarchy is Playlist > Artist > Album > Track.

The visuals break down the chosen hierarchy level:

  • By number of tracks to understand the most popular e.g. artists
  • By a given audio feature to e.g. identify the longest tracks
  • By the correlation between all the audio features, which is interesting to see change by cross-filtering for different artists or albums

The audio features bar chart on the right uses the recently-released field parameters, controlled by the same-titled slicer on the left side. Note the correlation matrix always displays all the features.

Playlist Decomposition

Playlist Decomposition
Playlist Decomposition

This is a fun page to interactively analyse which artists, albums or tracks I have on the playlists. You see how it starts from a total number of tracks, but then is interactively broken down by different variables. There isn’t a lot going on, but I found it fun to click around and investigate the funnel of all the variables available.

Duplicate Track Exploration

Duplicate Track Exploration
Duplicate Track Exploration

To be honest this page was the main motivation for me to build the report to start with. I am paranoid with keeping duplicate versions of the same track in my playlists, e.g. initially the track released as a single, but later comes out in an album. Spotify itself stores these as different tracks, so I built this page to find same-titled tracks for the same artist(s) that have been released in multiple albums, and so, find what is potentially duplicated tracks in my playlists.

The table on the left holds the complete list of potential duplicates, and clicking one of the tracks displays all its instances in the second table. The screenshot above shows I had All Good Thing’s “For The Glory” from two albums in my “good music” playlist: the original single release and the later album release.

Data Model

Now that you’re aware of what I was building toward, let’s get on with the data model.

Data Model
Data Model

The report is built on top of a star schema with four additional tables created for specific visual needs.

The dimensions are straightforward lists of unique playlists, albums, artists, and tracks and additional information. The fact table connects all these dimensions so that each row represents a combination of audio feature, corresponding value, and the four dimensions mentioned. I could have and maybe should’ve split the audio features into another dimension, but it was helpful to keep them in the fact for the sake of the correlation matrix — just a bit annoyed I had to leave a text column on my fact.

Below is a sample of the fact table:

Fact sample

The Audio Features Corr X and Y tables were used specifically for the correlation matrix visual. Audio Features FP was generated by Power BI when creating the audio features field parameters. _measures holds DAX created measures.

Data Modelling

Now let’s have a look at the data modelling in Power Query — though it was a light exercise to be honest.

The dimensions were created by duplicating the track playlists file into four different queries: playlists, albums, artists and tracks. For each query I kept only columns relevant to that dimension, removed duplicate rows and added an automatic index.

The fact was modelled by joining with all other queries to grab the dimension IDs from the automatic index column. The starting point of the fact was the file of audio features which contains the track id, the audio feature and the respective value. I joined this to the original tracks_playlists file (TrackPlaylists query if you download the report) on the track id, and from that point I had both audio features and all necessary values to join with the dimensions.

With the joins complete, the data modelling was also complete. Some choices made while processing the raw data with Python really simplified the modelling in Power Query, namely unpivoting the audio features and keeping all the IDs in a single file (track_playlists.parquet).

Cross-dimension filters

One really neat trick I had to learn for this report was how to implement cross-filter across dimensions. By that I mean clicking e.g. in a data point for Dance Gavin Dance (an artist) should only display the number of albums I have of them in the playlist(s) selected. If you think about this requirement in terms of a SQL query, you’ll realise this is not straightforward in Power BI by default. Have a look at the diagram below

The filter direction of the Tracks-Fact relationship makes it so Tracks can filter Fact, but not vice-versa. In the relationship Albums-Fact only Albums can filter Fact. A filter of Artitsts = “Dance Gavin Dance” definitely filters the rows in Fact to their rows only, but then Fact can’t propagate the filter to Albums and display only DGD’s albums.

However, we do know that we are interested only in those rows that have a) the ID for DGD, and the albums whose IDs are in those rows.

The solution for this was done at the measure-level, but this is especially noteworthy so I am explaining it separately from the measures.

# Albums =
DISTINCTCOUNT( 'Fact'[Album ID] ),
NOT(ISBLANK( 'Fact'[Playlist ID]))

The above is the basic measure I used to count the number of albums. Aside from the good old DISTINCTCOUNT, look at the filter applied in the CALCULATE: it checks the playlist id is not blank.

If you think in SQL, when you join two tables you get back a NULL result if the join condition is not fulfilled for a row. That’s what happens with Power BI calculations as well. In this case in particular, if the playlist id is blank for a particular row of the fact table, then don’t bother considering it for calculations. If Dance Gavin Dance was clicked in a bar chart, then don’t even bother counting any rows with a blank playlist id, i.e. rows that were not about DGD.

It seems odd the filter condition for albums uses playlist id, but honestly in this report it could be the id of any dimension. All rows must have a playlist, artist, album and track associated, i.e. all foreign keys exists for all rows, then checking for the playlist id is just a way of making Power BI ask “is this row relevant for my current calculations”? You can find other solutions that handle these kind of blank checks in the filter pane at the visual-level, but in my case it was simpler to build it into the measures directly.

Core measures

I separate the measures created for this report into three logical groups:

  • Base measures, i.e., counts of dimensions in this particular case including the filter logic explained previously
  • Field parameter measures for the audio features bar chart in the first page — all six measures are a CALCULATE AVERAGE of the audio features, each measure filtered for an individual audio feature
  • Other measures, which are just two

Base measures and field parameters don’t require further explanation, so let’s tackle the other measures.

“Audio Corr Coef” is the measure used to calculate the correlation for the audio features in the correlation matrix. BI Elite has a detailed video explaining the logic behind this visualisation, so refer to his video. The differences in my implementation are the use of a single measure instead of creating multiple measures for each individual calculation, and the showing only the bottom half of the matrix to avoid displaying duplicate values. The hiding part was achieved by changing the return value of the measure to be

RETURN IF(    CurrentX < CurrentY,    result,    BLANK())

Afterwards I used conditional formatting to display the actual correlation value in each cell of the matrix, colour code the values with a gradient from red to green and completely hide that top half.

The last measure has to do with the duplicate track count. And yes this is the last measure. Most of the report was achieved with a handful of measures because these visuals revolve around similar calculations but sliced by different dimensions.

So, my definition for a duplicate track was a track name that appeared multiple times for the same artist in the same playlist. This calculation required a virtual table, i.e., a table calculated during the measure.

VAR FactWithNames = ADDCOLUMNS(    'Fact',    "Track Name",    RELATED(Tracks[track_name]),    "Artist Name",    RELATED(Artists[artist_name]),    "Playlist Name",    RELATED(Playlists[playlist_name]))

ADDCOLUMNS returns a virtual table, stored in the the variable FactWithNames. The virtual table uses the “Facts” table as its source, and adds three columns for the track, artists and playlist names, by using the relationship with the dimensions in the data model.

Next, I can use an aggregation to calculate the number of times each trio of track + artists + playlist appears, using GROUPBY, and COUNTX for the aggregation measure.

VAR TrackArtist = GROUPBY(    FactWithNames,    Tracks[track_name],    Artists[artist_name],    Playlists[playlist_name],    "Duplicate Count", COUNTX(CURRENTGROUP(), 'Fact'[Track ID]))

Note how GROUPBY uses the virtual table as the input, but references the original columns from the dimensions. DAX continues referencing the original columns in the background and as a consequence GROUPBY also needs to reference the original source, not the virtual source.

VAR NumberAudioFeatures = DISTINCTCOUNT( 'Fact'[Audio Feature] )RETURN DIVIDE(AVERAGEX(TrackArtist, [Duplicate Count]), NumberAudioFeatures)

Finally, the number of times the track exists in the playlist can be derived by dividing the number the result of the GROUPBY by how many audio features there are for the track. Couple points to explain here:

  • [Duplicate Count] is the syntax to reference the column calculated by GROUPBY
  • Each track+artist+playlist trio is duplicated in the “Fact” table by nature, so the result of GROUPBY needs to be divided by how many audio features there are
  • Keep in mind this measure is used in a table visual along with the playlist, artist and track names, so it is indeed calculated for each trio

The last thing to do was to add the measure as a visual-level filter to the table in the slicer panel.

The left table shows only those tracks that are identified as possibly duplicate entries in the playlist, and clicking one of them highlights all the duplicate instances in the right table.

Track tooltip

The last thing I have to discuss before wrapping up is the track tooltip. This is a tooltip that lists the artists, playlists, albums, and audio features for a specific track. The goal is to further contextualise a track when hovering with the mouse in the visuals across the report. To be honest it is more experimental than maybe useful to the user, but well here is the tooltip with the information spread across two table visuals to avoid it becoming too wide.

Closing thoughts & Further work

And we’ve reached the end of the walkthrough for the report. This version of the report covers all the requirements I had in mind, but writing this article showed me it could be enhanced further.

For example, I could add drill throughs for playlists, artists, and albums; or the first page could probably be better laid out not just visually but also in terms of how the hierarchies and cross filters play off each other

At any rate, I am pretty satisfied with the outcome and there were a few things that were really worth highlighting in this article like the technique to deal with filters across dimensions and the calculation of duplicate tracks using a virtual table.

Finally, if you want a copy of the .pbix file, head over to my GitHub repository : )



Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
José Fernando Costa

I write about data science to help other people who might come across the same problems