Project Music I | Ansh Vyas Portfolio

Music Analysis

About

I listen to so much music that I run out of new songs pretty fast. This project was born as a solution to that.
I collected my personal music data and used spotipy (a Python client for the Spotify Web API) to get each song's audio features. With these I analysed my listening and built a set of criteria describing my personal music preference. Applying those criteria to a bigger dataset (about half a million rows), I predicted 2000+ new songs that I would like.

Input

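#Imports used throughout the code below
import copy
import statistics as stat
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

#df1 (my listening data) and df (the larger dataset) are assumed to be loaded earlier with pd.read_csv
print(df1.head(2))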

print(df1['Artist'].mode())

#Number of songs by BTS (0 if the artist is not present)
print(df1['Artist'].value_counts().get('BTS', 0))

total_count = df1['Artist'].str.count('Post').sum()

print(total_count)

print(len(df1))

avg={}
valueColumns=['Danceability','Energy','Loudness','Speechiness','Acousticness','Instrumentalness','Liveness','Tempo','Duration_ms']

#Mean of every numeric feature, stored by feature name
avg_values=df1[valueColumns].mean()
for a,b in zip(avg_values,valueColumns):
    avg[b]=a
#Key is categorical, so use the most common value instead of the mean
avg['Key']=df1['Key'].mode()[0]
#avg['Artist']=df1['Artist'].mode()
print('avg: ')
print(avg)

Output

Input

#Analysing dataset attributes

for i in valueColumns:
    print(i,":")
    sns.catplot(x=df1.index,y=df1[i],data=df1)
    plt.show()

Output

Input

#Analysing dataset attributes: normal curve fitted to each feature

#choose plot style before creating any figure so that it applies to all of them
plt.style.use('fivethirtyeight')

for i in valueColumns:
    print(i,':')
    #font={'size':10}
    #plt.rc('font',**font)
    #1000 evenly spaced x-values across the observed range of the feature
    x = np.linspace(min(df1[i]), max(df1[i]), 1000)

    #y-values of the normal pdf with the feature's own mean and standard deviation
    y = norm.pdf(x,df1[i].mean(),stat.stdev(df1[i]))

    #define plot and display the bell curve
    fig, ax = plt.subplots(figsize=(5,5))
    ax.plot(x,y)
    plt.show()

Output

Input

#Printing artists Ranking: the 11 most frequent artists, most played first
for artist in df1['Artist'].value_counts().head(11).index:
    print(artist)

#Using the current data about me and trying to find songs that I might like
#Upper bound for each feature: mean + one standard deviation
crit_max=copy.deepcopy(avg)
for i in crit_max.keys():
    crit_max[i]=crit_max[i]+stat.stdev(df1[i])
print("crit_max:")
print(crit_max)
print('')
#Lower bound for each feature: mean - one standard deviation
crit_min=copy.deepcopy(avg)
for i in crit_min.keys():
    crit_min[i]=crit_min[i]-stat.stdev(df1[i])
print('crit_min')
print(crit_min)

#Plotting correlation between variables

def Pearson_correlation(X,Y):
    #Pearson's r computed from deviations about the mean
    if len(X)!=len(Y):
        raise ValueError('X and Y must have the same length')
    Sum_xy = sum((X-X.mean())*(Y-Y.mean()))
    Sum_x_squared = sum((X-X.mean())**2)
    Sum_y_squared = sum((Y-Y.mean())**2)
    corr = Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)
    return corr

ideal_correlation={}
#Every unordered pair of features, each plotted once
for idx_i, i in enumerate(valueColumns):
    for z in valueColumns[idx_i + 1:]:
        x=pd.Series(df1[i].values)
        y=pd.Series(df1[z].values)

        #pandas built-in correlation, kept as a cross-check
        correlation = y.corr(x)

        #scatter plot with a least-squares trend line
        plt.scatter(x, y)
        plt.plot(np.unique(x), np.poly1d(np.polyfit(x, y, 1))(np.unique(x)), color='red')
        plt.xlabel(i)
        plt.ylabel(z)
        plt.show()

        cor=Pearson_correlation(x, y)
        print(cor)
        ideal_correlation[i+'-'+z]=cor

Output

*Only some of the graphs are shown

These are only some of the correlations. The highest is about 0.7, between loudness and danceability, which makes sense; the lowest is almost 0, between duration and liveness, which also makes sense.
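To make the correlation values above easier to read, here is a minimal sketch of the same Pearson formula using only the standard library. The numbers are made-up toy values, not rows from my dataset, and the function name `pearson` is just for illustration.

```python
import math
import statistics as stat

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    mean_x, mean_y = stat.mean(xs), stat.mean(ys)
    # Sum of products of deviations, and sums of squared deviations
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Hypothetical toy values (assumption: not taken from the real data)
loudness = [-9.0, -7.5, -6.0, -5.0, -4.0]
danceability = [0.45, 0.55, 0.60, 0.72, 0.80]

r_pos = pearson(loudness, danceability)                # rises together: close to +1
r_neg = pearson(loudness, [-d for d in danceability])  # mirrored: same size, opposite sign
print(round(r_pos, 3), round(r_neg, 3))
```

A value near +1 or -1 means the two features move together almost linearly, while a value near 0 (like duration and liveness) means knowing one tells you very little about the other.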

Input

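#Working on prediction
#First attempt: each feature is filtered on its own, so the values in one row
#of df_predict no longer belong to the same song
df_predict=pd.DataFrame()
for i in valueColumns:
    condition = (df[i] > crit_min[i]) & (df[i] < crit_max[i])
    df_predict[i] = df.loc[condition, i].reset_index(drop=True)
print(df_predict)
print(len(df_predict))
print(len(df))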

#Combine the conditions: a song is kept only if every feature falls inside its range
condition = pd.Series(True, index=df.index)

for i in valueColumns:
    condition &= (df[i] > crit_min[i]) & (df[i] < crit_max[i])

df_predict=df[condition].reset_index(drop=True)
# Remove rows from df_predict that are present in df1
df_predict = df_predict.merge(df1, how='left', indicator=True).query('_merge == "left_only"').drop(columns=['_merge']).reset_index(drop=True)

print(df_predict.head())

#Exporting df_predict to a csv

df_predict.to_csv('predictedsongs.csv',index=False)

print('df_predict: ')
print(len(df_predict))
print(len(df))

Output

This applies the criteria to the larger dataset to predict new songs, and finally exports them to a new CSV file.

It produces about 5500 songs.
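The filtering idea can be seen end to end on a tiny example. This is only a sketch under assumptions: the two small tables and the `Track` column are hypothetical stand-ins for my listening data and the large dataset, and only two features are used.

```python
import pandas as pd

# Hypothetical stand-ins for the personal data and the large dataset
liked = pd.DataFrame({
    "Track": ["a", "b", "c", "d"],
    "Energy": [0.60, 0.70, 0.80, 0.90],
    "Tempo": [100.0, 110.0, 120.0, 130.0],
})
catalogue = pd.DataFrame({
    "Track": ["b", "e", "f", "g"],
    "Energy": [0.70, 0.75, 0.20, 0.78],
    "Tempo": [110.0, 118.0, 115.0, 180.0],
})
features = ["Energy", "Tempo"]

# Window per feature: mean +/- one sample standard deviation
low = liked[features].mean() - liked[features].std()
high = liked[features].mean() + liked[features].std()

# Keep rows where every feature is strictly inside its window
inside = ((catalogue[features] > low) & (catalogue[features] < high)).all(axis=1)
candidates = catalogue[inside]

# Drop songs that are already in the personal data
new_songs = candidates[~candidates["Track"].isin(liked["Track"])].reset_index(drop=True)
print(new_songs)
```

Here track "f" fails on energy and "g" fails on tempo, "b" passes but is already known, so only "e" is left as a new suggestion. Because a song has to pass every feature at once, adding more features shrinks the result quickly.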

Skills Demonstrated in This Project

Data Analysis

Data Cleaning & Preprocessing: pandas (read_csv, filtering, .mean(), .mode(), .value_counts())
Statistical Analysis: numpy, statistics (mean, stdev, describe)
Correlation & Similarity Analysis: Custom Pearson correlation function, .corr(), deepcopy

Data Visualization

Distribution & Trend Analysis: seaborn.catplot, matplotlib.pyplot
Scatter Plots with Trend Lines: plt.scatter, np.polyfit
Probability Distribution: scipy.stats.norm.pdf

Libraries Used

pandas, numpy, seaborn, matplotlib, scipy.stats, statistics, copy

Conclusions

This project was all about using data analytics to explore my music preferences and predict songs that might be a good match. By breaking down song features like danceability, energy, and loudness, I identified trends and patterns. Using statistical methods like averages, standard deviations, and correlation analysis, I built a model to find songs that closely align with my taste. I love data because it solves real problems like mine. Beyond that, this project helped me test and even develop valuable skills in data wrangling, predictive modeling, and statistical analysis.

There's plenty of room for improvement, such as integrating machine learning models to refine predictions and incorporating more behavioral data.

Overall, this was a fun and insightful dive into how data analytics can turn raw information into meaningful, actionable insights. In part II, I analyze another dataset, find songs both parties would like and finally make a visual dashboard.

Code

- Code Link
