What exactly is K-Pop as a music genre?

An exploratory data analysis of top K-pop tracks using R, Spotify Web API, and lyrics from Genius.

Ashley Yihan Fang
11 min read · Nov 22, 2020

As a music fan living in 2020, you have probably asked at least one of the two following questions: “Why is K-pop so addictive?” or, “Why is K-pop so popular?”

Whether you like it or not, K-pop, short for Korean popular music, is rapidly gaining popularity in the U.S. and countries all around the globe. With the rise of superstar groups like BTS and BLACKPINK, K-pop is also becoming one of the top genres on music streaming services. Even though I am a die-hard K-pop fan, my jaw still dropped when I saw Dynamite by BTS top the Billboard Hot 100. K-pop has become a dark horse and an increasingly fascinating music genre to explore.

Since K-pop is a genre with appealing visual components such as choreography, we tend to pay less attention to the music itself. I knew K-pop drew heavy influence from genres such as pop, hip hop, dance, rock, and R&B, but I found it hard to describe in words what K-pop music actually sounds like. Therefore, as soon as I learned about Spotify Web API and the types of data it offers, I decided to start analyzing the audio features of K-pop songs (I was so excited when I got my first string of Spotify data I cried in front of my laptop at 3AM).

In this exploratory data analysis (EDA), I will examine the audio features and lyrics of 88 K-pop tracks from 2019. I will also be analyzing K-pop alongside 10 other major genres, with tracks obtained from Spotify’s “Best of 2019” playlists: pop, rock, hip hop, dance, R&B, indie, country, Afropop, Latin, and gospel. The top K-pop tracks are a combination of Spotify’s “Top K-Pop Artists of 2019” playlist and “K-Pop 2019” by NewMusicFriday. I will use the spotifyr package to extract data from Spotify Web API, and the genius package to obtain K-pop lyrics from Genius. Spotify offers data on audio features such as “valence,” “danceability,” “acousticness,” and “speechiness,” all of which will be introduced later in detail.
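As a rough illustration of what the data-pulling step looks like with spotifyr (the playlist ID below is a placeholder, and you need your own Spotify developer credentials):

```r
library(spotifyr)

# Spotify developer credentials (placeholders) must be set before authenticating
Sys.setenv(SPOTIFY_CLIENT_ID = "your_client_id")
Sys.setenv(SPOTIFY_CLIENT_SECRET = "your_client_secret")
access_token <- get_spotify_access_token()

# Pull the tracks of a playlist, then request their audio features
playlist_tracks <- get_playlist_tracks("37i9dQZF1DX...")  # placeholder playlist ID
audio_features <- get_track_audio_features(playlist_tracks$track.id)
```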

You can find all my code and datasets here.

After some extensive data cleaning, let’s take a look at the top 10 K-pop tracks in 2019, based on the “popularity” rating provided by Spotify:

kpop %>% 
  arrange(desc(popularity)) %>%
  select(artist.name, track.name, popularity) %>%
  head(n = 10) %>%
  kbl(caption = "Top 10 K-Pop Tracks in 2019") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1, bold = T) %>%
  row_spec(1:10, color = "black", background = "#ECFFF0")

Unsurprisingly, “Boy With Luv” by BTS tops the chart, followed by the virtual League of Legends girl group K/DA and multiple BLACKPINK tracks.

Since girl and boy groups play the dominant role in K-pop, I thought it would be interesting to look at the types of artists who made up the top list of 2019.

kpop %>% 
  mutate(type = fct_infreq(type)) %>%
  ggplot() +
  geom_bar(aes(x = type, fill = type), show.legend = F) +
  theme(axis.text.x = element_text(angle = 30, size = 12),
        panel.background = element_rect(fill = "#F4F6F7")) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Types of Artists in K-Pop Top List")

While girl and boy groups make up the majority of the list, we still have a decent number of solo artists, as well as two boy bands (who are mainly instrumentalists) and one mixed-gender group, KARD.

Next, I will compare some audio features among different types of K-pop artists. I chose “valence,” “energy,” and “danceability” because K-pop is generally perceived as upbeat and intense, and choreography is an indispensable part of K-pop performances. Spotify describes these three audio features as follows:

Valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Energy: A measure from 0.0 to 1.0 which represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.

Danceability: Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.

In addition, I included track tempo, measured in beats per minute (BPM), as a fourth audio feature. I omitted boy bands and mixed-gender groups due to the small sample size. Since type is a categorical variable, I used geom_jitter() to add random noise so observations don’t overlap. The code for “valence” is included as an example:

valence <- kpop %>% 
  filter(!(type == "boy band" | type == "mixed group")) %>%
  ggplot(aes(x = reorder(type, popularity), y = valence, color = type)) +
  geom_jitter(alpha = 0.5, show.legend = F) +
  stat_summary(fun = mean, geom = "crossbar",
               width = 0.5, show.legend = F) +
  scale_color_brewer(palette = "Set2") +
  theme(axis.text.x = element_text(angle = 30, size = 10),
        panel.background = element_rect(fill = "#F4F6F7")) +
  labs(title = "Valence", x = element_blank())
Audio Features by Artist Type

On average, tracks by girl groups are the most positive, energetic, and danceable. Boy groups have the fastest songs, and male solo artists tend to have songs that are softer and slower. However, it is worth noting that the observations are quite dispersed, indicating that my sample of K-pop tracks varies in style and emotion.

My friend used to joke that whenever K-pop idols pull off a high note, it’s always that A4 note for male idols and E5 for female idols. Therefore, I thought it would also be interesting to look at the most common key and mode (major or minor) used in top K-pop tracks:
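Spotify encodes key as a pitch-class integer (0 = C, 1 = C#/Db, …, 11 = B) and mode as 1 for major, 0 for minor. A minimal sketch of how such a key/mode count can be built (the toy tibble below stands in for the real kpop data frame):

```r
library(dplyr)

# Toy stand-in for the kpop data frame pulled from Spotify
kpop_keys <- tibble(
  key  = c(7, 0, 7, 2),  # 7 = G, 0 = C, 2 = D
  mode = c(1, 1, 1, 0)   # 1 = major, 0 = minor
)

# Spotify's pitch-class integers mapped to note names
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

key_counts <- kpop_keys %>%
  mutate(key.name  = pitch_classes[key + 1],
         mode.name = ifelse(mode == 1, "major", "minor")) %>%
  count(key.name, mode.name, sort = TRUE)
# key_counts has one row per key/mode pair, sorted by frequency,
# which can then be fed to geom_col() for a bar chart
```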

The most frequently used keys are G-major and C-major, which are also quite common in other music genres because they are easy to play on instruments such as the piano and strings.

Next, let’s take a look at lyrics. What are some of the main themes and topics covered in K-pop tracks? In order to gather a larger sample, I scraped lyrics for not only the 88 top tracks, but also the EPs and albums in which those tracks are featured. Using the genius package, I obtained lyrics for a total of 464 K-pop songs that were popular in 2019. Since K-pop songs often contain English lyrics, I counted the frequency of both English and Korean words. After removing stop words (such as “a,” “the,” “than,” and “ooh”), I created word clouds for the most common words in K-pop lyrics:
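The word counts that feed the word clouds can be produced with the tidytext package. Here is a minimal sketch for the English side (the two-line toy tibble stands in for the real scraped lyrics):

```r
library(dplyr)
library(tidytext)

# Toy stand-in for lyrics scraped with the genius package
lyrics <- tibble(line = c("love you in the moonlight",
                          "dance all night with you"))

lyrics_en <- lyrics %>%
  unnest_tokens(word, line) %>%            # one word per row, lowercased
  anti_join(stop_words, by = "word") %>%   # drop "a", "the", "you", ...
  count(word, sort = TRUE)
# Only content words such as "love", "dance", and "moonlight" remain
```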

wordcloud(words = lyrics_ko$word, freq = lyrics_ko$n,
          max.words = 40, random.order = FALSE, rot.per = 0,
          colors = RColorBrewer::brewer.pal(4, "Pastel1"))
Most Frequently Used Words in K-Pop Lyrics

For me, these words seem to be very present-focused, with love and bodily experiences as central themes. I also see a subtle contrast between the types of emotions portrayed in Korean and English lyrics. While the English words tend to be upbeat and party-themed, the Korean words seem to be more emotional. For example, “순간” is “moment,” “함께” is “together,” “시간” is “time,” “괜찮아” means “it’s okay,” and “눈빛” means “expression in one’s eyes.” All of these words are quite different from those in the English word cloud.

I also wanted to compare the audio features of K-pop with other major music genres, so I obtained data for 50 tracks from each of 10 different genres. After removing tracks that appeared in multiple playlists, I ended up with a total of 532 songs from 11 genres (including K-pop).

One interesting finding from my data exploration is the average loudness of genres. Spotify records loudness in decibels (dB), so loudness values are negative, and louder tracks have smaller absolute values (e.g. -5dB is louder than -20dB). Therefore, I obtained the plot below by taking the negative reciprocal of loudness values:
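To see why the negative reciprocal preserves the loudness ordering, consider two decibel values:

```r
loudness <- c(-5, -20)   # -5 dB is louder than -20 dB
-1 / loudness            # 0.20 vs 0.05: the louder track maps to the larger value
```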

alltracks %>% 
  group_by(genre) %>%
  summarize(mean = mean(-1 / loudness)) %>%  # one value per genre, so geom_col() draws one bar each
  ggplot(aes(x = reorder(genre, mean), y = mean)) +
  geom_col(aes(fill = genre), show.legend = FALSE) +
  coord_flip() +
  scale_fill_brewer(palette = "Set3") +
  labs(x = "Genre", y = "Loudness", title = "Average Relative Loudness by Genre") +
  theme(panel.background = element_rect(fill = "#F4F6F7"),
        axis.text.y = element_text(size = 13))
Note: “cng” is short for Christian and gospel

I was surprised to find that K-pop tracks are actually the loudest among all 11 genres. Mastering engineers sometimes like to put a “brick wall” limiter on tracks to make them louder, since louder tracks are supposedly more captivating (I am not a fan of this mixing technique, though).

I always thought of K-pop as a genre that thrives on singles and EPs instead of full-length albums. K-pop groups usually have more than one comeback (a K-pop-specific word for releasing new music) per year, and they like to experiment with different styles, concepts, and choreography, so it is more realistic for them to work on shorter and more focused projects. Since Spotify provides data on the type of album a track belongs to, I calculated, for each genre, the proportion of tracks released as singles.

singlepercent <- alltracks %>%
  group_by(genre) %>%
  filter(album.type == "single") %>%
  summarize(single.no = n(), percentage = single.no / 50)  # each playlist contributed 50 tracks

singlepercent %>%
  select(genre, percentage) %>%
  rename(singles = percentage) %>%
  arrange(desc(singles)) %>%
  kbl() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = F) %>%
  column_spec(1, bold = T) %>%
  row_spec(1:11, color = "black", background = "#ECFFF0")
Proportion of singles by genre

In fact, dance is the genre with the highest proportion of singles; only 4% of the top 50 dance tracks belong to an album. As expected, K-pop also has a high proportion of singles; 64% of the top 50 K-pop tracks are released as singles instead of a part of a larger project.

Last, but not least, I will compare the audio features of all 11 genres using radar charts. I chose six audio features: “danceability,” “energy,” and “valence,” which were described previously, as well as “liveness,” “acousticness,” and “speechiness,” which Spotify describes as follows:

Liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.

Acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

Speechiness: Detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.

To create radar charts, I calculated the average attribute values for all audio features, as well as the maximum and minimum values across genres, which were used to define boundaries for the radar charts:

# Calculating mean values of audio features for each genre
features <- alltracks %>%
  select(danceability, energy, speechiness, acousticness, liveness, valence, genre) %>%
  group_by(genre) %>%
  mutate(danceability = mean(danceability),
         energy = mean(energy),
         speechiness = mean(speechiness),
         acousticness = mean(acousticness),
         liveness = mean(liveness),
         valence = mean(valence)) %>%
  distinct(genre, .keep_all = TRUE) %>%
  ungroup()

# Writing helper functions to find the max and min of each feature
colMax <- function(x) { apply(x, MARGIN = 2, max) }
colMin <- function(x) { apply(x, MARGIN = 2, min) }

# Dropping the character genre column first, so apply() doesn't coerce everything to character
numeric_features <- features %>% select(-genre)
maxmin <- data.frame(max = colMax(numeric_features), min = colMin(numeric_features)) %>%
  t() %>% as_tibble() %>% mutate(genre = "NA")

# Joining data frames and preparing data for the radar charts
features2 <- rbind(maxmin, features) %>% as.data.frame()
rownames(features2) <-
  c("max", "min", "pop", "rnb", "hiphop", "indie", "cng", "latin", "afropop", "rock", "country", "dance", "kpop")

Finally, I plotted radar charts for each of the 11 genres. The code for K-pop is shown here as an example:

# radarchart() comes from the fmsb package
features2 %>% 
  filter(genre %in% c("kpop", "NA")) %>%  # the "NA" rows carry the max/min boundaries
  select(-genre) %>%
  sapply(as.numeric) %>%
  as.data.frame() %>%
  radarchart(pcol = "#F1948A", pfcol = scales::alpha("#F1948A", 0.4), plwd = 3,
             cglcol = "grey", cglty = 1, axislabcol = "black", cglwd = 0.8,
             vlcex = 1, cex.main = 1.5)

Just by looking at the shape of the charts above, K-pop is perhaps the most similar to dance: both are high on “energy,” low on “speechiness” and “acousticness,” and middle-of-the-road on “danceability” and “valence.” Isn’t it funny that dance and K-pop are actually not that “danceable”?

Although this data exploration project is supposed to be about K-pop, I also enjoyed looking at the characteristics of other genres. In fact, these radar charts confirmed my impressions of many genres: hip hop has the highest “speechiness,” rock and dance are the most energetic, and Latin and Afropop are the “happiest” genres. As a musician, I discourage the use of over-simplistic approaches when defining genres, but as an amateur data scientist, I had so much fun exploring music data and making visualizations of audio features that I could not put into words.

Thank you for spending the time to read through my EDA! I hope you found it at least somewhat interesting and informative. This project was inspired by Simran Vatsa’s blog post on Taylor Swift (one of my all-time favorites). Please check out her blog here:

I am still new to R and data analytics, so please let me know if there’s anything I can improve upon. Thank you so much! (๑•̀ㅂ•́)و✧
