Effective URL:
https://www.mongodb.com/developer/code-examples/python/song-recommendations-example-app/
Submission: On April 27 via manual (April 27th 2023, 9:45:50 am UTC) from US — Scanned from DE

A SPOTIFY SONG AND PLAYLIST RECOMMENDATION ENGINE

Rachelle Palmer • Published Jun 23, 2022 • Updated Jul 13, 2022
Spark • MongoDB • Data Visualization • Python
FULL APPLICATION



CREATORS

Lucas De Oliveira, Chandrish Ambati, and Anish Mukherjee from the University of
San Francisco contributed this amazing project.

BACKGROUND TO THE PROJECT

In 2018, Spotify organized an Association for Computing Machinery (ACM) RecSys
Challenge where they posted a dataset of one million playlists, challenging
participants to recommend a list of 500 songs given a user-created playlist.
As both music lovers and data scientists, we were naturally drawn to this
challenge. Right away, we agreed that combining song embeddings with some
nearest-neighbors method for recommendation would likely produce very good
results. Importantly, we were curious about how we could solve this
recommendation task at scale: Spotify hosts over 4 billion user-curated
playlists, and that number keeps growing. This raised serious questions about
how to train a decent model, since all that data would likely not fit in memory
or on a single server.

WHAT WE BUILT

This project resulted in a scalable ETL pipeline built with:
 * Apache Spark
 * MongoDB
 * Amazon S3
 * Databricks (PySpark)

We used these to train a Word2Vec neural network model that builds song and
playlist embeddings for recommendation. We followed up with data visualizations
created with TensorFlow’s Embedding Projector.

THE PROCESS

COLLECTING LYRICS

The most tedious task of this project was collecting as many lyrics for the
songs in the playlists as possible. We began by isolating the unique songs in
the playlist files by their track URI; in total we had over 2 million unique
songs. Then, we used the track name and artist name to look up the lyrics on the
web. Initially, we used simple Python requests to pull in the lyrical
information, but this proved too slow for our purposes. We then used asyncio,
which allowed us to make requests concurrently. This sped up the process
significantly, reducing the download time for the lyrics of 10,000 songs from
15 minutes to under a minute. Ultimately, we were only able to collect lyrics
for 138,000 songs.
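The concurrency pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `fetch_lyrics` is a hypothetical stand-in that sleeps instead of calling a lyrics site, and the `max_concurrency` limit is an assumed safeguard against flooding the remote server.

```python
import asyncio


async def fetch_lyrics(track_name: str, artist_name: str) -> str:
    """Hypothetical lookup; a real version would issue an async HTTP request."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"lyrics for {track_name} by {artist_name}"


async def fetch_all(songs, max_concurrency=100):
    # The semaphore caps how many lookups are in flight at the same time.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(track_name, artist_name):
        async with semaphore:
            try:
                return await fetch_lyrics(track_name, artist_name)
            except Exception:
                return None  # many songs will simply have no lyrics available

    # gather() runs the lookups concurrently and preserves the input order.
    return await asyncio.gather(*(bounded(t, a) for t, a in songs))


songs = [("A Day In The Life", "The Beatles"), ("New Romantics", "Taylor Swift")]
lyrics = asyncio.run(fetch_all(songs))
```

Because the lookups spend most of their time waiting on the network, overlapping them is where the speed-up comes from; the total time approaches that of the slowest batch rather than the sum of all requests.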

PRE-PROCESSING

The original dataset contains 1 million playlists spread across 1,000 JSON
files totaling about 33 GB of data. We used PySpark in Databricks to preprocess
these separate JSON files into a single SparkSQL DataFrame and then joined this
DataFrame with the lyrics we saved.
Because the data collection and preprocessing steps are time-consuming, and the
model needs to be retrained and re-evaluated often, it is critical to store the
data in a scalable database. We also wanted a schemaless database that supports
various data types, to allow for future expansion of the data sets. Considering
these needs, we concluded that MongoDB would be the optimal solution as a data
and feature store.
Check out the Preprocessing.ipynb notebook to see how we preprocessed the data.
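To make the flatten-and-join step concrete, here is a small plain-Python sketch of the same logic on toy data. It is not the PySpark code from the notebook. The field names (`playlists`, `pid`, `tracks`, `track_uri`, and so on) are assumptions about the playlist file layout, the lyrics lookup is hypothetical, and a left join is assumed so that songs without lyrics are kept.

```python
import json

# Hypothetical slice in the assumed playlist-file layout.
raw_slice = json.loads("""
{"playlists": [
  {"pid": 0, "name": "road trip", "tracks": [
    {"pos": 0, "track_uri": "spotify:track:aaa",
     "track_name": "Song A", "artist_name": "Artist 1"},
    {"pos": 1, "track_uri": "spotify:track:bbb",
     "track_name": "Song B", "artist_name": "Artist 2"}
  ]}
]}
""")

# Hypothetical lyrics lookup keyed by track URI; most songs have no entry.
lyrics_by_uri = {"spotify:track:aaa": "la la la"}


def flatten(slice_doc):
    """Yield one row per (playlist, track) pair."""
    for playlist in slice_doc["playlists"]:
        for track in playlist["tracks"]:
            yield {"pid": playlist["pid"], "playlist_name": playlist["name"], **track}


def left_join_lyrics(rows, lyrics):
    """Attach lyrics where available, keeping songs that have none."""
    return [{**row, "lyrics": lyrics.get(row["track_uri"])} for row in rows]


rows = left_join_lyrics(flatten(raw_slice), lyrics_by_uri)
```

Each resulting row is a flat record that maps naturally onto a document in a schemaless store.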

TRAINING SONG EMBEDDINGS

For our analyses, we read our preprocessed data from MongoDB into a Spark
DataFrame and grouped the records by playlist id (pid), aggregating all of the
songs in a playlist into a list under the column song_list. Using the Word2Vec
model in Spark MLlib we trained song embeddings by feeding lists of track IDs
from a playlist into the model much like you would send a list of words from a
sentence to train word embeddings. As shown below, we trained song embeddings in
only 3 lines of PySpark code:
    from pyspark.ml.feature import Word2Vec

    # Each row of df_play is one playlist; song_list holds its track IDs.
    word2Vec = Word2Vec(vectorSize=32, seed=42, inputCol="song_list").setMinCount(1)
    word2Vec.setMaxIter(10)  # number of training iterations
    model = word2Vec.fit(df_play)




We then saved the song embeddings to MongoDB for later use. Below is a snapshot
of the song embeddings DataFrame that we saved:
[Figure: snapshot of the saved song embeddings DataFrame]

Check out the Song_Embeddings.ipynb notebook to see how we train song
embeddings.
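The grouping step that produces `song_list` can be illustrated in plain Python. This is only a sketch of the idea on toy rows, not the Spark aggregation used in the notebook; the `pos` field (a track's position in its playlist) is an assumption, used here to keep songs in playlist order.

```python
from collections import defaultdict

# Toy (playlist, track) rows, deliberately out of order.
rows = [
    {"pid": 1, "pos": 1, "track_uri": "spotify:track:bbb"},
    {"pid": 0, "pos": 0, "track_uri": "spotify:track:aaa"},
    {"pid": 1, "pos": 0, "track_uri": "spotify:track:aaa"},
    {"pid": 0, "pos": 1, "track_uri": "spotify:track:ccc"},
]


def build_song_lists(rows):
    """Group rows by playlist id and collect track IDs in playlist order."""
    grouped = defaultdict(list)
    for row in sorted(rows, key=lambda r: (r["pid"], r["pos"])):
        grouped[row["pid"]].append(row["track_uri"])
    return [{"pid": pid, "song_list": songs} for pid, songs in grouped.items()]


playlists = build_song_lists(rows)
```

Each `song_list` then plays the role of a sentence, with track IDs as its words.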

TRAINING PLAYLIST EMBEDDINGS

Finally, we extended our recommendation task beyond simple song recommendations
to recommending entire playlists. Given an input playlist, we would return the k
closest or most similar playlists. We took a “continuous bag of songs” approach
to this problem by calculating playlist embeddings as the average of all song
embeddings in that playlist.
This workflow started by reading the song embeddings back from MongoDB into a
SparkSQL DataFrame. We then computed each playlist's embedding as the average
of its song embeddings and saved the results to MongoDB.
Check out the Playlist_Embeddings.ipynb notebook to see how we did this.
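The "continuous bag of songs" average is simple enough to show in a few lines of plain Python. This is a sketch on toy 3-dimensional vectors rather than the project's 32-dimensional Spark output, and skipping songs with no trained embedding is an assumption about how gaps would be handled.

```python
def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


# Toy song embeddings keyed by track URI (hypothetical values).
song_embeddings = {
    "spotify:track:aaa": [1.0, 0.0, 2.0],
    "spotify:track:bbb": [3.0, 2.0, 0.0],
}


def playlist_embedding(song_list, embeddings):
    """Average the embeddings of the songs that have one."""
    known = [embeddings[uri] for uri in song_list if uri in embeddings]
    return mean_vector(known)


vec = playlist_embedding(
    ["spotify:track:aaa", "spotify:track:bbb", "spotify:track:zzz"],
    song_embeddings,
)
```

Here `vec` is the midpoint of the two known songs, and the unknown track is ignored.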

TRAINING LYRICS EMBEDDINGS

Are you still reading? Whew! We trained lyrics embeddings by loading in a song's
lyrics, separating the words into lists, and feeding those words to a Word2Vec
model to produce 32-dimensional vectors for each word. We then took the average
embedding across all words as that song's lyrical embedding. Ultimately, our
analytical goal here was to determine whether users create playlists based on
common lyrical themes by seeing if the pairwise song embedding distance and the
pairwise lyrical embedding distance between two songs were correlated.
Unsurprisingly, it appears they are not.
Check out the Lyrical_Embeddings.ipynb notebook to see our analysis.
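The correlation check can be sketched as follows. The choice of Euclidean distance and Pearson correlation is an assumption made for illustration; the notebook may use different measures. The toy vectors are constructed so that the two sets of distances are perfectly correlated, which only verifies the arithmetic and says nothing about the real data.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical embeddings for three songs in each space.
song_emb = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0]}
lyric_emb = {"a": [0.0, 0.0], "b": [2.0, 0.0], "c": [0.0, 4.0]}

# Compare the same song pairs in both embedding spaces.
pairs = list(combinations(sorted(song_emb), 2))
song_dist = [euclidean(song_emb[p], song_emb[q]) for p, q in pairs]
lyric_dist = [euclidean(lyric_emb[p], lyric_emb[q]) for p, q in pairs]
r = pearson(song_dist, lyric_dist)
```

A value of `r` near zero on real data would correspond to the finding above that the two distances are not correlated.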

NOTES ON OUR APPROACH

You may be wondering why we used a language model (Word2Vec) to train these
embeddings. Why not use a Pin2Vec or custom neural network model to predict
implicit ratings? For practical reasons, we wanted to work exclusively in the
Spark ecosystem and deal with the data in a distributed fashion. This was a
constraint set on the project ahead of time and challenged us to think
creatively.
However, we found Word2Vec an attractive candidate model for theoretical reasons
as well. The Word2Vec model uses a word’s context to train static embeddings by
training the input word’s embeddings to predict its surrounding words. In
essence, the embedding of any word is determined by how it co-occurs with other
words. This had a clear mapping to our own problem: by using a Word2Vec model,
the distance between song embeddings would reflect the songs’ co-occurrence
across the 1 million playlists, making it a useful measure for distance-based
recommendation (nearest neighbors). It would effectively model how people
grouped songs together, using user behavior as the determinant factor in
similarity.
Additionally, the Word2Vec model accepts input in the form of a list of words.
For each playlist we had a list of track IDs, which made working with the
Word2Vec model not only conceptually but also practically appealing.
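To see what "context" means for a playlist, the sketch below lists the (song, neighboring song) pairs that a skip-gram style model would learn from. It is a simplified illustration with a fixed window size, not Spark's internal implementation, and the track IDs are placeholders.

```python
def context_pairs(song_list, window=2):
    """Pair each song with the songs at most `window` positions away from it."""
    pairs = []
    for i, center in enumerate(song_list):
        lo = max(0, i - window)
        hi = min(len(song_list), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, song_list[j]))
    return pairs


pairs = context_pairs(["t1", "t2", "t3"], window=1)
# pairs == [("t1", "t2"), ("t2", "t1"), ("t2", "t3"), ("t3", "t2")]
```

Songs that repeatedly show up in each other's windows across many playlists are pushed toward nearby embeddings, which is why distance works as a similarity measure.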

DATA VISUALIZATIONS WITH TENSORFLOW AND MONGODB

After all of that, we were finally ready to visualize our results and make some
interactive recommendations. We decided to represent our embedding results
visually using TensorFlow’s Embedding Projector, which maps the 32-dimensional
song and playlist embeddings into an interactive visualization of a 3D embedding
space. You have the choice of using PCA or t-SNE for dimensionality reduction
and cosine similarity or Euclidean distance for measuring distances between
vectors.
We published a song embeddings projector for the full 2 million songs, as well
as a less crowded version with a random sample of 100k songs.
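The Embedding Projector can load tab-separated files: one with a vector per line and one with matching metadata rows. The sketch below shows one way such files could be produced from an embeddings dictionary. The helper name and the metadata columns are assumptions for illustration, and this is not necessarily how the published projector data was generated.

```python
import csv
import io


def to_projector_tsv(embeddings, titles):
    """Return (vectors_tsv, metadata_tsv) strings with rows in matching order."""
    vectors_buf, meta_buf = io.StringIO(), io.StringIO()
    vec_writer = csv.writer(vectors_buf, delimiter="\t", lineterminator="\n")
    meta_writer = csv.writer(meta_buf, delimiter="\t", lineterminator="\n")
    # With more than one metadata column, the first row names the columns.
    meta_writer.writerow(["track_uri", "title"])
    for uri, vector in embeddings.items():
        vec_writer.writerow(vector)
        meta_writer.writerow([uri, titles.get(uri, "")])
    return vectors_buf.getvalue(), meta_buf.getvalue()


vectors_tsv, metadata_tsv = to_projector_tsv(
    {"spotify:track:aaa": [0.1, 0.2], "spotify:track:bbb": [0.3, 0.4]},
    {"spotify:track:aaa": "Song A", "spotify:track:bbb": "Song B"},
)
```

Writing both files in a single pass keeps the vector rows and metadata rows aligned, which the projector relies on to label points.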

The neat thing about using TensorFlow’s projector is that it gives us a
beautiful visualization tool and distance calculator all in one. Try searching
in the right panel for a song; if the song is part of the original dataset,
you will see the “most similar” songs appear under it.
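The "most similar" lookup amounts to a nearest-neighbor search over the embeddings. A brute-force version with cosine similarity might look like the sketch below; the identifiers and 2-dimensional vectors are toy values, and a dataset of millions of songs would likely need an approximate or distributed search instead.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_id, embeddings, k=2):
    """Return the k items whose embeddings are closest to the query's."""
    query = embeddings[query_id]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings: two similar songs and one unrelated song.
embeddings = {"rock_1": [1.0, 0.1], "rock_2": [0.9, 0.2], "carol_1": [0.0, 1.0]}
neighbors = most_similar("rock_1", embeddings, k=2)
```

In this toy example `rock_2` ranks ahead of `carol_1` because its vector points in nearly the same direction as the query.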

USING MONGODB FOR ML/AI

We were impressed by how easy it was to use MongoDB to reliably store and load
our data. Because we were using distributed computing, it would have been
infeasible to run our pipeline from start to finish any time we wanted to update
our code or fine-tune the model. MongoDB allowed us to save our incremental
results for later processing and modeling, which collectively saved us hours of
waiting for code to re-run.
It worked well with the tools we use every day and with the tooling we chose
for this project; we didn't hit any areas of friction.
We were shocked by how well this method of training embeddings worked. While
the 2 million song embedding projector is visually crowded, the recommendations
it produces are quite good at grouping songs together.
Consider the embedding recommendation for The Beatles’ “A Day In The Life”:
[Figure: projector recommendations for “A Day In The Life”]
Or the recommendation for Jay Z’s “Heart of the City (Ain’t No Love)”:
[Figure: projector recommendations for “Heart of the City (Ain’t No Love)”]
Fan of Taylor Swift? Here are the recommendations for “New Romantics”:
[Figure: projector recommendations for “New Romantics”]
We were delighted to find naturally occurring clusters in the playlist
embeddings. Most notably, we see a cluster containing mostly Christian rock, one
with Christmas music, one for reggaeton, and one large cluster where genres span
its length rather continuously and intuitively.
Note also that when we select a playlist, we have many recommended playlists
with the same names. This in essence validates our song embeddings. Recall that
playlist embeddings were created by taking the average embedding of all its
songs; the name of the playlists did not factor in at all. The similar names
only conceptually reinforce this fact.

NEXT STEPS?

We were happy with how this project concluded, but there is more that could be
done here.
 1. We could use these trained song embeddings in other downstream tasks and see
    how effective they are. You can also download the song embeddings we
    trained here: Embeddings | Meta Info
 2. We could look at other methods of training these embeddings, such as
    recurrent neural networks or an enhanced implementation of this Word2Vec
    model.


© 2023 MongoDB, Inc.
