Music Analysis
About
I listen to so many songs that I run out of new ones pretty fast. This project was born as a solution to that.
I collected my personal music data and used the spotipy library to get each song's audio components. From these I analysed my listening history and built criteria for my personal music preference. Applying these criteria to a bigger dataset (about half a million rows), I predicted 2000+ new songs that I would like.
Go to..
Dataset
Project Code
Libraries

Head

Body
Input
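print(df1.head(2))
# Most frequent artist(s) in the listening history
print(df1['Artist'].mode())
# Number of songs by BTS (0 if the artist is absent)
print(df1['Artist'].value_counts().get('BTS', 0))
# Total occurrences of 'Post' across all artist names
total_count = df1['Artist'].str.count('Post').sum()
print(total_count)
print(len(df1))
# Average value of each numeric audio feature
avg={}
valueColumns=['Danceability','Energy','Loudness','Speechiness','Acousticness','Instrumentalness','Liveness','Tempo','Duration_ms']
avg_values=df1[valueColumns].mean()
for a,b in zip(avg_values,valueColumns):
    avg[b]=a
# Key is categorical, so take the most common value instead of the mean
avg['Key']=df1['Key'].mode()[0]
#avg['Artist']=df1['Artist'].mode()
print('avg: ')
print(avg)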
Output
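The profile built above can be illustrated on a toy table. This is only a sketch: the `toy` DataFrame and its values are made up for illustration, and only two of the numeric columns are included. It shows the same idea (mean for continuous features, mode for Key, `value_counts()` for artists) using `.to_dict()` in place of the explicit loop.

```python
import pandas as pd

# Hypothetical miniature of the listening history (the real df1 has more rows and columns).
toy = pd.DataFrame({
    "Artist": ["A", "B", "A", "C"],
    "Danceability": [0.50, 0.70, 0.60, 0.80],
    "Energy": [0.40, 0.60, 0.80, 0.60],
    "Key": [1, 5, 5, 7],
})

numeric_columns = ["Danceability", "Energy"]

# Mean for the continuous features, mode for the categorical Key.
profile = toy[numeric_columns].mean().to_dict()
profile["Key"] = int(toy["Key"].mode()[0])

# Songs per artist, most frequent first.
artist_counts = toy["Artist"].value_counts()

print(profile)
print(artist_counts.head(3))
```

`value_counts()` already sorts by frequency, so `.head(n)` gives a top-n artist ranking directly.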

Input
#Analysing dataset attributes
for i in valueColumns:
    print(i,":")
    sns.catplot(x=df1.index,y=df1[i],data=df1)
    plt.show()
Output
Input
#Analysing dataset attributes
#choose plot style before any figure is created
plt.style.use('fivethirtyeight')
for i in valueColumns:
    print(i,':')
    #font={'size':10}
    #plt.rc('font',**font)
    x = np.arange(min(df1[i]), max(df1[i]), 0.001)
    #create range of y-values that correspond to a normal pdf with the column's own mean and sd
    y = norm.pdf(x,df1[i].mean(),stat.stdev(df1[i]))
    #define plot and display the bell curve
    fig, ax = plt.subplots(figsize=(5,5))
    ax.plot(x,y)
    plt.show()
Output
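Each curve above is a normal distribution fitted to a column's mean and sample standard deviation, not a histogram of the raw values. A minimal sketch of that calculation without the plotting, assuming a made-up `tempo` column:

```python
import statistics as stat

import numpy as np
import pandas as pd
from scipy.stats import norm

# Hypothetical tempo values, for illustration only.
tempo = pd.Series([90.0, 100.0, 110.0, 120.0, 130.0])

mu = tempo.mean()
sigma = stat.stdev(tempo.tolist())  # sample standard deviation (divides by n - 1)
sigma_pandas = tempo.std()          # pandas uses the same n - 1 convention by default

# Evaluate the fitted normal density on an evenly spaced grid.
x = np.linspace(tempo.min(), tempo.max(), 201)
y = norm.pdf(x, loc=mu, scale=sigma)

# The density is highest at the grid point closest to the mean.
peak_x = x[np.argmax(y)]
print(mu, sigma, peak_x)
```

`np.linspace` fixes the number of grid points regardless of the column's range, which keeps the grid small for wide-ranging columns such as Duration_ms.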
Input
#Printing artists Ranking:
print(df1['Artist'].value_counts().idxmax())
a=0
df1_1=df1[df1['Artist']!=df1['Artist'].value_counts().idxmax()]
while a<5:
    a=a+1
    print(df1_1['Artist'].value_counts().idxmax())
    df1_2=df1_1[df1_1['Artist']!=df1_1['Artist'].value_counts().idxmax()]
    print(df1_2['Artist'].value_counts().idxmax())
    df1_1=df1_2[df1_2['Artist']!=df1_2['Artist'].value_counts().idxmax()]
#Using the current data about me and trying to find songs that i might like
#Upper bound: average plus one standard deviation
crit_max=copy.deepcopy(avg)
for i in crit_max.keys():
    crit_max[i]=crit_max[i]+stat.stdev(df1[i])
print("crit_max:")
print(crit_max)
print('')
#Lower bound: average minus one standard deviation
crit_min=copy.deepcopy(avg)
for i in crit_min.keys():
    crit_min[i]=crit_min[i]-stat.stdev(df1[i])
print('crit_min')
print(crit_min)
#Plotting correlation between variables
def Pearson_correlation(X,Y):
    #Pearson's r: co-variation of X and Y divided by the product of their spreads
    if len(X)==len(Y):
        Sum_xy = sum((X-X.mean())*(Y-Y.mean()))
        Sum_x_squared = sum((X-X.mean())**2)
        Sum_y_squared = sum((Y-Y.mean())**2)
        corr = Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)
        return corr
#Correlation for every pair of attributes, keyed as 'first-second'
ideal_correlation={}
for idx_i, i in enumerate(valueColumns):
    for z in valueColumns[idx_i + 1:]:
        x=copy.deepcopy(pd.Series(df1[i].values))
        y=copy.deepcopy(pd.Series(df1[z].values))
        correlation = y.corr(x)
        plt.scatter(x, y)
        #least-squares trend line
        plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='red')
        plt.xlabel(i)
        plt.ylabel(z)
        plt.show()
        cor=Pearson_correlation(x, y)
        print(cor)
        ideal_correlation[i+'-'+z]=cor
Output

*Only some of the graphs are shown
These are not all of the correlations, only a sample. The highest is 0.7, between loudness and danceability, which makes sense; the lowest is almost 0, between duration and liveness, which also makes sense.
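As a sanity check, a hand-written Pearson correlation should agree with the one built into pandas. The sketch below uses made-up values in a hypothetical `toy` DataFrame. The function `pearson` is a compact restatement of the formula, r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²), and not necessarily identical to the project's own helper.

```python
import numpy as np
import pandas as pd


def pearson(x: pd.Series, y: pd.Series) -> float:
    """Pearson's r computed directly from the definition."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Hypothetical feature values, for illustration only.
toy = pd.DataFrame({
    "Loudness": [-12.0, -9.0, -7.0, -5.0, -4.0],
    "Danceability": [0.40, 0.55, 0.60, 0.72, 0.70],
    "Duration_ms": [210000.0, 180000.0, 240000.0, 200000.0, 195000.0],
})

r_manual = pearson(toy["Loudness"], toy["Danceability"])
r_pandas = toy["Loudness"].corr(toy["Danceability"])

# DataFrame.corr() returns every pairwise coefficient at once.
matrix = toy.corr()

# A perfectly linear relationship gives r = 1.
r_linear = pearson(toy["Loudness"], 2 * toy["Loudness"] + 1)

print(r_manual, r_pandas)
print(matrix.round(3))
```

`DataFrame.corr()` is a convenient way to scan all pairs before deciding which scatter plots are worth drawing.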
Input
#Working on prediction
#First attempt: filter each column on its own range (rows are not kept aligned across columns)
df_predict=pd.DataFrame()
for i in valueColumns:
    condition = (df[i] > crit_min[i]) & (df[i] < crit_max[i])
    df_predict[i] = df.loc[condition, i].reset_index(drop=True)
print(df_predict)
print(len(df_predict))
print(len(df))
#Second attempt: keep only the rows that fall inside the range for every column
condition = pd.Series(True, index=df.index)
for i in valueColumns:
    condition &= (df[i] > crit_min[i]) & (df[i] < crit_max[i])
df_predict=df.loc[condition].reset_index(drop=True)
# Remove rows from df_predict that are present in df1
df_predict = df_predict.merge(df1, how='left', indicator=True).query('_merge == "left_only"').drop(columns=['_merge']).reset_index(drop=True)
print(df_predict.head())
#Exporting df_predict to a csv
df_predict.to_csv('predictedsongs.csv',index=False)
print('df_predict: ')
print(len(df_predict))
print(len(df))
Output
This applies the criteria to the larger dataset, keeps the songs that fall inside the range for every attribute, removes the ones already in my own data, and writes the result to a new CSV file.
It produces about 5500 songs.
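The filtering step can be sketched end to end on toy data. Everything here is illustrative: `history`, `catalog`, the song names, and the two features are made up, and the band is assumed to be the mean plus or minus one sample standard deviation of the listening history.

```python
import pandas as pd

# Hypothetical listening history and larger catalog, for illustration only.
history = pd.DataFrame({
    "Name": ["s1", "s2", "s3"],
    "Energy": [0.5, 0.6, 0.7],
    "Tempo": [100.0, 110.0, 120.0],
})
catalog = pd.DataFrame({
    "Name": ["s2", "c1", "c2", "c3"],
    "Energy": [0.6, 0.65, 0.95, 0.55],
    "Tempo": [110.0, 105.0, 112.0, 140.0],
})
features = ["Energy", "Tempo"]

# Acceptance band per feature: mean plus or minus one sample standard deviation.
center = history[features].mean()
spread = history[features].std()
low, high = center - spread, center + spread

# A song qualifies only if every feature lies strictly inside its band.
inside = catalog[features].gt(low, axis="columns") & catalog[features].lt(high, axis="columns")
candidates = catalog[inside.all(axis=1)]

# Anti-join: drop candidates that already appear in the listening history.
# With no `on` argument, merge matches on all shared columns.
recommended = (
    candidates.merge(history, how="left", indicator=True)
    .query('_merge == "left_only"')
    .drop(columns="_merge")
    .reset_index(drop=True)
)

print(recommended)
```

Here `c2` and `c3` each fail one band, `s2` passes but is already in the history, and only `c1` remains. Widening or narrowing the band (for example, using 0.5 or 2 standard deviations) is the main knob for trading off the number of results against how closely they match.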

Skills Demonstrated in This Project
Data Analysis
- Data Cleaning & Preprocessing: pandas (read_csv, filtering, .mean(), .mode(), .value_counts())
- Statistical Analysis: numpy, statistics (mean, stdev, describe)
- Correlation & Similarity Analysis: Custom Pearson correlation function, .corr(), deepcopy
Data Visualization
- Distribution & Trend Analysis: seaborn.catplot, matplotlib.pyplot
- Scatter Plots with Trend Lines: plt.scatter, np.polyfit
- Probability Distribution: scipy.stats.norm.pdf
Libraries Used
- pandas, numpy, seaborn, matplotlib, scipy.stats, statistics, copy
Conclusions
This project was all about using data analytics to explore my music preferences and predict songs that might be a good match. By breaking down song features like danceability, energy, and loudness, I identified trends and patterns. Using statistical methods like averages, standard deviations, and correlation analysis, I built a model to find songs that closely align with my taste. I love data because it solves real problems like mine. Beyond that, this project helped me test and develop valuable skills in data wrangling, predictive modeling, and statistical analysis.
There’s plenty of room for improvement, such as integrating machine learning models to refine predictions and incorporating more behavioral data.
Overall, this was a fun and insightful dive into how data analytics can turn raw information into meaningful, actionable insights. In part II, I analyze another dataset, find songs both parties would like and finally make a visual dashboard.
Code