ML Interview Q Series: How would you build a system that recommends music tracks to users?
Short Compact solution
A common approach is to rely on collaborative filtering, which takes into account how different users interact with various songs. This is similar to how movie recommendations work, but there are some distinctions. Unlike a typical 1–5 star rating system in movie platforms, music ratings often do not exist explicitly, so one might collect data on how often users listen to each track or whether they skip it quickly. Because people may return to a song many times, the feedback dynamics differ from those of one-time movie views. The overall music catalog also tends to be much larger and more diverse than a movie catalog.
A practical strategy is to form a matrix of users versus songs (or users versus artists) based on listening counts or binary indicators of whether a user has streamed a particular track. One can then apply matrix factorization, for instance by solving

$$\min_{P,\,Q}\ \sum_{(u,i)\,\in\,\text{observed}} \big(R_{ui} - p_u^\top q_i\big)^2 \;+\; \lambda\,\big(\lVert P\rVert_F^2 + \lVert Q\rVert_F^2\big)$$

where the rows $p_u$ of P represent user embeddings, the rows $q_i$ of Q represent song embeddings, and $\lambda$ controls regularization. A standard technique for discovering these latent vectors is alternating least squares (ALS), which can be executed in a distributed computing environment if the dataset is large. Once the vectors are learned, their dot products predict the user's interest in new tracks. Sorting those predictions per user can drive personalized recommendations. Those embeddings can also help compare similar items or similar users by applying methods like k-nearest neighbors.
Comprehensive Explanation
Constructing a system to recommend music generally involves modeling user preferences and matching them to songs or artists that are most relevant. The main challenge is that music services handle very large catalogs, often without explicit rating signals. Instead of star ratings, one can observe data about how many times a user has streamed a given track, how long they listened before skipping, or whether they added it to a playlist. This feedback can be viewed as implicit ratings that reflect user engagement and enjoyment.
The first step is to gather training data. Each row might represent a user, and each column might represent a song. The value in the matrix can be the count of how many times that user played the song, or a binary marker if the user has listened at least once. This forms the matrix R. However, R is often very sparse, since each user may only listen to a small fraction of the entire catalog.
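As a concrete illustration, here is a minimal sketch of how such a matrix could be assembled from raw play logs, assuming a hypothetical list of (user_id, song_id, play_count) triples and using SciPy's sparse format to cope with the sparsity:

import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical play log: (user_id, song_id, play_count) triples.
plays = [(0, 10, 3), (0, 42, 1), (1, 10, 7), (2, 7, 2)]
num_users, num_songs = 3, 50

rows, cols, counts = zip(*plays)

# Sparse user-song matrix of raw play counts.
R_counts = csr_matrix((counts, (rows, cols)), shape=(num_users, num_songs))

# Binary variant: 1 if the user has streamed the track at least once.
R_binary = R_counts.copy()
R_binary.data = np.ones_like(R_binary.data)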
To produce recommendations, a common strategy is to factorize this matrix into lower-dimensional embeddings for users and songs. If there are M songs and N users, R is an N×M matrix. One can approximate R by the product P Qᵀ of two matrices, P (size N×k) and Q (size M×k), where k is the latent embedding dimension. Once learned, the preference score of user u for song i can be inferred by the dot product of the corresponding embedding vectors. One popular algorithm for finding these embeddings at large scale is alternating least squares (ALS). After learning P and Q, new tracks can be recommended to each user by sorting the songs the user has not listened to, in descending order of predicted preference score.
An additional insight is that repeated consumption changes how one interprets feedback. A user might listen to a favorite song 50 times, which strongly signals that the user loves that track. This pattern is unlike movies, where a person rarely re-watches the same film many times. Designing the objective function for learning P and Q often involves weighting repeated plays differently from single plays.
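One common way to encode this, following the standard implicit-feedback formulation (Hu, Koren, and Volinsky), is to separate a binary preference from a confidence weight that grows with play count; the log transform below is one illustrative choice, and alpha is a tunable hyperparameter:

import numpy as np

play_counts = np.array([0, 1, 5, 50])
alpha = 40.0  # confidence scaling; illustrative value, tuned in practice

# Binary preference: did the user engage with the track at all?
preference = (play_counts > 0).astype(float)

# Confidence grows with repeated plays, but sub-linearly via log1p
# so that the 50th listen adds less weight than the first few.
confidence = 1.0 + alpha * np.log1p(play_counts)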
An essential consideration for real-world systems is scalability. Music platforms can have hundreds of millions of users and tens of millions of tracks, so the matrix can be extremely large. Distributing the factorization process (for example, across a cluster with Spark or another large-scale framework) becomes essential. Once the learned embeddings are stored, real-time retrieval of recommendations is often done by efficient approximate nearest-neighbor search in the embedding space or simply by computing dot products with the user’s embedding.
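For modest catalogs, the retrieval step can be as simple as a brute-force dot product against all song embeddings; the sketch below assumes P and Q as learned above and masks out tracks the user has already played:

import numpy as np

def top_n_for_user(P, Q, u, already_played, n=10):
    # Score every song by dot product with the user's embedding.
    scores = Q @ P[u]
    # Mask out tracks the user has already streamed.
    scores[list(already_played)] = -np.inf
    # Indices of the n highest-scoring remaining tracks.
    return np.argsort(-scores)[:n]

At real catalog scale, this linear scan is replaced by an approximate nearest-neighbor index built over Q.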
After the embeddings are available, the system can leverage them beyond simple ranking by predicted score. For instance, it might cluster users with similar embeddings or find related songs for a certain track. This can be further refined by merging content-based features, metadata, or user demographic data to alleviate cold start situations for new users or new tracks.
Below is a very brief illustration of how one might implement a simplified ALS-style collaborative filtering loop in Python; it is a conceptual sketch on dense random data rather than production code:
import numpy as np

# Toy dimensions; in practice R is built from user listening logs.
num_users, num_songs = 100, 500

# Suppose R is the user-song matrix: shape (num_users, num_songs).
# Random data here purely for illustration.
R = np.random.rand(num_users, num_songs)

# Dimension of the latent factors
k = 50

# Randomly initialize user and song embeddings
P = np.random.rand(num_users, k)
Q = np.random.rand(num_songs, k)

num_iterations = 10
lambda_reg = 0.1  # L2 regularization strength

# Simplified ALS: alternate closed-form regularized least-squares solves.
# (Real implementations operate on sparse data and weight observations.)
for _ in range(num_iterations):
    # Solve for P given Q: each user row solves (Q^T Q + lambda I) p_u = Q^T r_u
    P = np.linalg.solve(Q.T @ Q + lambda_reg * np.eye(k), Q.T @ R.T).T
    # Solve for Q given P, symmetrically
    Q = np.linalg.solve(P.T @ P + lambda_reg * np.eye(k), P.T @ R).T

# At the end, P[u] and Q[i] represent embeddings for user u and song i.
# Predicted preference for user u and song i:
# score = np.dot(P[u], Q[i])
In a production environment, one would use established libraries and distributed data processing pipelines. That allows scaling to massive user-song matrices without manually coding each step.
Handling the absence of explicit ratings
When there is no star-like rating for music, it can be beneficial to rely on implicit feedback such as listening counts, skip rates, and streaming duration. These signals convey user engagement but also bring a risk of noise. For instance, if someone casually plays a song as background music but does not skip it, the system might overestimate how much they like that track. Mitigating this risk involves carefully weighting repeated listens or normalizing by overall user activity to avoid bias toward users who listen to music all day.
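A minimal sketch of such a normalization, assuming a dense count matrix for brevity: dividing each user's counts by their total plays keeps heavy listeners from dominating the objective.

import numpy as np

R = np.array([[10, 0, 5],
              [200, 100, 700]], dtype=float)  # two users, three songs

# Normalize each row by that user's total plays so that heavy listeners
# do not dominate; a small epsilon guards against empty rows.
user_totals = R.sum(axis=1, keepdims=True)
R_normalized = R / (user_totals + 1e-9)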
Addressing large-scale data and distributed training
Many recommendation algorithms become computationally expensive as the dataset grows. Matrix factorization can involve significant matrix operations that do not fit into memory for extremely large matrices. Distributed frameworks like Spark’s MLlib can run an ALS-based approach in parallel, splitting the user-song matrix across multiple workers. Each iteration of ALS alternates between solving for user embeddings and for item embeddings, which can also be parallelized by subdividing the matrix. Communication overhead must be minimized, often with a partitioning strategy that keeps data for the same subset of users or songs on the same node.
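As a sketch of what this looks like in practice, Spark MLlib ships an ALS implementation with an implicit-feedback mode; the snippet below uses a toy in-memory DataFrame, and the column names and hyperparameter values are illustrative:

from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("music-als").getOrCreate()

# Toy interaction data: one row per (user, song, play_count).
plays = spark.createDataFrame(
    [(0, 10, 3.0), (0, 42, 1.0), (1, 10, 7.0), (2, 7, 2.0)],
    ["userId", "songId", "playCount"],
)

als = ALS(
    rank=50,              # latent dimension k
    maxIter=10,
    regParam=0.1,
    implicitPrefs=True,   # treat play counts as implicit feedback
    alpha=40.0,           # confidence scaling for implicit feedback
    userCol="userId",
    itemCol="songId",
    ratingCol="playCount",
    coldStartStrategy="drop",
)
model = als.fit(plays)

# Top-10 track recommendations per user, computed in parallel.
recs = model.recommendForAllUsers(10)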
Handling repeated consumption
When users listen to tracks multiple times, it demonstrates ongoing preference. One approach is to treat each count beyond a certain threshold as increasing confidence in that preference. Another approach is to view repeated consumption over time and model temporal dynamics. This can help the system recommend fresh content that still matches the user’s tastes while potentially rotating older favorites. It is helpful to keep track of negative feedback signals, such as skipping a track after 10 seconds, to balance the repeated consumption data.
Cold start considerations
A newly registered user has no data, so we cannot directly factorize the matrix to find their preferences. Possible mitigation strategies include presenting default popular songs, prompting the user to pick from a short list of artists or genres, or incorporating demographic or contextual signals. For a new track with no plays, the system can rely on content features (genre tags, acoustics) or information about the artist. Hybrid approaches that combine collaborative filtering with metadata can reduce the impact of cold starts.
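One concrete hybrid trick for new tracks is to learn a mapping from content features into the collaborative latent space, fit on tracks that have both; the ridge-regression sketch below assumes a hypothetical feature matrix X and existing embeddings Q_known:

import numpy as np

def fit_content_to_latent(X, Q_known, lam=1.0):
    # Ridge regression from content features (genre tags, acoustic
    # descriptors) to learned CF embeddings, fit on tracks with both.
    # X: (n_songs, d) features; Q_known: (n_songs, k) embeddings.
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Q_known)
    return W  # (d, k) projection matrix

# A brand-new track with feature vector x_new gets a provisional
# embedding q_new = x_new @ W until real listening data accumulates.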
Improving beyond simple collaborative filtering
Matrix factorization based on implicit feedback is a strong baseline. However, more advanced setups can incorporate features such as:
- Deep learning models that predict engagement from audio embeddings or textual metadata. A neural network might combine collaborative filtering with track content features (e.g., waveforms or mel spectrograms).
- Contextual signals such as time of day, user location, or device type, which can be critical for suggesting the right music at the right moment.
- Social network features indicating that the user's friends or influencers listen to certain artists.
Incorporating such signals typically boosts accuracy, but also increases complexity, requiring specialized data pipelines and real-time models.
Balancing exploration and exploitation
A recommendation engine can exploit the embeddings to deliver the most relevant content. However, to keep the user engaged with new music and to gather valuable data, the system often injects some diversity or novelty. This can happen via random sampling within a preferred genre or by offering new song releases the user might enjoy based on latent similarity. The feedback from these exploratory recommendations further refines the embeddings.
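A very simple way to implement this is an epsilon-greedy slate, where a small fraction of slots is reserved for exploratory picks; a minimal sketch, with the candidate pool and epsilon value as assumptions:

import numpy as np

def epsilon_greedy_slate(scores, epsilon=0.1, n=10, rng=None):
    # Fill the slate with the highest-scoring tracks, but with
    # probability epsilon replace the last slot with a random candidate
    # to gather feedback on unexplored content.
    rng = rng or np.random.default_rng()
    slate = np.argsort(-scores)[:n].copy()
    if rng.random() < epsilon:
        slate[-1] = rng.integers(len(scores))
    return slate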
Evaluating the recommendation system
A robust evaluation might rely on offline metrics and online experiments:
- Offline evaluation involves splitting user listening history into a training set and a validation or test set. One can compute how accurately the model ranks held-out tracks (see the recall@K sketch below).
- Online A/B testing measures user engagement changes (streams, skips, time on platform) when comparing a new algorithm to a baseline in a live environment.
- Diversity, coverage, and novelty are also important. A system that recommends only top hits may achieve decent accuracy but fail to introduce variety or specialized content.
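For the offline side, ranking metrics such as recall@K are common; a minimal sketch, assuming recommended is an ordered list of track IDs and held_out is the user's withheld listening history:

def recall_at_k(recommended, held_out, k=10):
    # Fraction of held-out tracks that appear in the top-k slate.
    top_k = set(recommended[:k])
    relevant = set(held_out)
    return len(top_k & relevant) / max(len(relevant), 1)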
Potential follow-up question: How to incorporate metadata or content-based features?
It is sometimes helpful to enrich collaborative filtering with data about each song’s genre, mood, or other relevant properties. This can be done by appending features that capture textual tags or acoustic embeddings to the factorization. Alternatively, a neural network could produce embeddings for each track from its audio waveform, then be fine-tuned with collaborative signals. Integrating these features allows recommendations even when user interaction data is sparse, and it also enables the system to push content that is sonically or thematically related to songs the user already enjoys.
Potential follow-up question: Could we apply more sophisticated deep learning architectures?
Yes. Neural collaborative filtering architectures use neural networks in place of or in combination with standard matrix factorization. Autoencoders can be used to reconstruct user preferences from partially observed data, or multi-task learning can combine user classification tasks with preference prediction. Transformers can help model sequences of user listening events, capturing the order in which songs are played. These approaches often outperform simpler methods if enough training data is available, but they require more computation, hyperparameter tuning, and careful engineering.
Potential follow-up question: What are the trade-offs in using kNN for similarity?
Nearest-neighbor methods can be straightforward for finding songs similar to a user’s already-enjoyed track by comparing embeddings. However, pure kNN-based recommendation does not always generalize well across the entire user base because it can be slow to query at scale and can miss broader patterns across many users and items. Usually, kNN is combined with matrix factorization or used as a second-stage method for refining or explaining the recommendations.
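As an illustration of the second-stage use, cosine similarity over the learned song embeddings gives a simple "more like this" lookup; this brute-force version would be replaced by an ANN index (e.g., a library such as Faiss or Annoy) at production scale:

import numpy as np

def similar_songs(Q, song_id, n=10):
    # Cosine similarity between one song's embedding and all others.
    norms = np.linalg.norm(Q, axis=1) + 1e-9
    sims = (Q @ Q[song_id]) / (norms * norms[song_id])
    sims[song_id] = -np.inf  # exclude the query track itself
    return np.argsort(-sims)[:n]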
Potential follow-up question: How to handle real-time adaptation?
Users’ musical preferences can shift quickly. If someone suddenly starts listening to a new genre, the system should respond in near-real-time rather than waiting for the next batch training cycle. Real-time adaptation can be tackled through incrementally updated factorization approaches or shallow online learning methods that adjust embeddings as new consumption data arrives. Another approach is to have frequent partial retraining steps on the newest data or keep a short-term preference model that can be combined with a more stable long-term preference model.
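One lightweight incremental option is the classic ALS "fold-in": keep the item embeddings fixed and re-solve a single user's least-squares problem against their latest history, which is cheap enough to run per session. A minimal sketch, assuming a frozen Q:

import numpy as np

def refresh_user_embedding(Q, song_ids, counts, lam=0.1):
    # Re-solve one user's regularized least-squares problem against the
    # frozen item embeddings, using only recent listening history.
    Q_u = Q[song_ids]                      # (n_recent, k)
    r_u = np.asarray(counts, dtype=float)  # (n_recent,)
    k = Q.shape[1]
    return np.linalg.solve(Q_u.T @ Q_u + lam * np.eye(k), Q_u.T @ r_u)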
Potential follow-up question: How to ensure fairness and reduce bias?
Bias can appear if recommendation algorithms over-promote major label artists or popular genres, marginalizing niche artists. Fairness can be partially addressed by calibrating recommendations, encouraging diversity, or even applying constraints that balance exposure among different content providers. The system design might also regularly audit the recommendations to detect skew toward certain subgroups of users or certain types of music.
These considerations become vital for a system serving hundreds of millions of users worldwide, each with varied tastes.
Below are additional follow-up questions
What if the user’s listening patterns are highly influenced by external events (e.g., a live concert they just attended)?
External events can abruptly shift a user’s interests. For instance, a user might have historically listened mostly to pop music but then attends a rock concert and begins exploring similar rock tracks. If your recommendation system does not adapt quickly, it could continue to serve mostly pop songs. One approach is to employ session-based or sequence-aware models that capture short-term changes in user behavior alongside long-term preferences. A potential pitfall is overfitting to the new pattern and neglecting past favorites. Carefully balancing short-term and long-term signals is key. You could maintain two sets of embeddings: one updated more frequently to capture sudden changes (session embeddings), and one more stable to represent the user’s general taste (long-term embeddings). A subtle pitfall is that external events often do not appear in the user’s listening data until a spike in plays occurs, so real-time detection of these changes might require continuously monitoring streaming data for anomalies or surges in new song categories.
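A minimal sketch of the two-embedding idea: score each candidate with a weighted blend of a stable long-term vector and a fast-moving session vector, where w_short (an assumed tunable) controls responsiveness to recent behavior:

import numpy as np

def blended_score(q_song, e_long, e_short, w_short=0.3):
    # Blend stable long-term taste with the current session signal;
    # raising w_short makes recommendations react faster to shifts.
    return (1 - w_short) * (e_long @ q_song) + w_short * (e_short @ q_song)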
How can we handle the situation when a user’s listening history is extremely skewed to a few specific genres or artists?
Certain users might focus heavily on just a handful of artists or genres, resulting in extremely sparse and biased data. A pitfall is that the model might over-recommend those same few artists, causing a feedback loop. One solution is to add diversity constraints or re-rank results to ensure the user is exposed to a broader set of content that still aligns with their general taste profile. Another option is to leverage content-based features—e.g., musical embeddings derived from audio—to suggest artists that share acoustic or stylistic qualities with the user’s favorites, but are not exactly the same. An edge case arises when the user truly only wants that single artist or genre. The system must differentiate between a user with a genuinely narrow interest and a user who would be open to new discoveries if introduced properly. Qualitative user surveys, or an explicit diversity-promoting user interface, may be needed to resolve that ambiguity.
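One standard re-ranking recipe for this is maximal marginal relevance (MMR), which penalizes candidates that are too similar to tracks already placed in the slate; a brute-force sketch over a score shortlist, with lam trading off relevance against diversity:

import numpy as np

def mmr_rerank(scores, item_embs, n=10, lam=0.7, shortlist=200):
    # Greedily build a slate: each pick maximizes relevance minus its
    # maximum cosine similarity to tracks already selected.
    norms = np.linalg.norm(item_embs, axis=1) + 1e-9
    candidates = list(np.argsort(-scores)[:shortlist])
    selected = []
    while candidates and len(selected) < n:
        def mmr_value(i):
            if not selected:
                return scores[i]
            sims = (item_embs[selected] @ item_embs[i]) / (norms[selected] * norms[i])
            return lam * scores[i] - (1 - lam) * sims.max()
        best = max(candidates, key=mmr_value)
        selected.append(best)
        candidates.remove(best)
    return selected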
What approaches can be used when some songs or user interactions raise privacy concerns?
In certain regions or contexts, storing explicit user-song interaction data could pose privacy or regulatory problems. For instance, some musical choices may reveal religious or political leanings. One potential solution is to aggregate or anonymize data before it is fed into the recommendation model. Differential privacy can also be employed, injecting controlled noise into interaction logs so that it becomes difficult to trace a specific user’s behavior. A pitfall is that too much anonymization may degrade recommendation accuracy, especially for niche interests. You might adopt federated learning, sending model updates to user devices rather than collecting raw data centrally. However, federated approaches bring new challenges in communication overhead and ensuring consistent model convergence across many devices with intermittent connectivity.
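As one concrete flavor of this, the Laplace mechanism adds noise calibrated to the query's sensitivity and a privacy budget epsilon before counts leave the raw logs; a toy sketch, with epsilon illustrative and per-user sensitivity assumed to be 1:

import numpy as np

def privatize_counts(counts, epsilon=1.0, rng=None):
    # Laplace mechanism: noise scale = sensitivity / epsilon, with
    # sensitivity 1 assumed for simple per-song play-count queries.
    rng = rng or np.random.default_rng()
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=np.shape(counts))
    return np.clip(noisy, 0, None)  # counts cannot be negative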
How do we handle extremely long-tail tracks that have minimal listening data?
The catalog in music platforms often has a massive tail of tracks that receive very few streams. Classic collaborative filtering might fail to learn meaningful embeddings for these underrepresented songs. A potential fix is to use metadata and content-based signals. For example, you can compute acoustic features or text embeddings from lyrics, then train a model that combines these embeddings with collaborative signals. This allows the system to infer that a new or obscure track is similar to a more popular track. A subtle pitfall arises if the metadata is incorrect or incomplete, possibly associating these tracks with the wrong cluster, so thorough data validation is important. Another edge case is if the music is so unique that no other track shares its characteristics, making the standard similarity-based approach less effective. The system might have to rely on marketing push strategies or specialized playlists to give these truly unique tracks some initial exposure and gather more explicit interactions.
How might we incorporate multi-objective optimization, such as balancing user satisfaction with content provider goals?
Sometimes the recommendation system must serve multiple objectives, including user satisfaction, promoting certain artists, or satisfying contractual obligations. This leads to a multi-objective optimization challenge. One might define a weighted combination of metrics like user engagement (play counts, skip rates) and coverage for different labels or genres, then adjust the weighting based on business rules. A risk is that purely optimizing for coverage can degrade user experience if too many forced recommendations are unappealing. Conversely, focusing exclusively on personalization might result in an echo chamber that disadvantages smaller creators. A practical approach is to use a two-phase ranking: first generate a broad set of candidates that includes content from multiple objectives, then re-rank to ensure user relevance is not overly compromised. Edge cases occur when one objective conflicts with another (e.g., user preference for explicit content vs. brand safety requirements), so developing a robust and flexible constraint-handling mechanism is important.
How do we prevent the recommendation quality from degrading over time when user behavior shifts gradually?
User preferences can evolve subtly. If the training pipeline updates embeddings only occasionally, the system may grow stale. A pitfall is that incremental changes may not trigger alarms if they are small but accumulate to significantly alter listening habits. Scheduling frequent model updates or implementing an online training loop can mitigate this. Monitoring drift is crucial: you might track the distribution of user listening vectors or compare predicted scores with actual consumption. If the model’s performance dips, you retrain or adjust hyperparameters. A subtle edge case is that some users’ tastes remain stable, so frequent updates may produce unnecessary churn for them. A personalized refresher schedule—where the system updates more frequently for users who exhibit high novelty-seeking behavior—can be a refined solution.
How do we ensure that large-scale negative sampling strategies reflect realistic user preferences?
In implicit feedback settings, the absence of a stream is often treated as a “negative” example, but it might just mean the user never encountered that track. Overzealous negative sampling can incorrectly penalize songs that the user simply did not discover. One approach is to limit negative sampling to tracks the user has had at least some chance of seeing (e.g., songs shown on their recommended pages or in a curated playlist). Additionally, weighting the negative samples to reflect confidence (e.g., a direct skip is a stronger negative signal than never having seen the song) can reduce mislabeling. An edge case arises when the user had a brief exposure to a track but decided to skip it not because they disliked it, but because they were busy or had a phone call. Accounting for context is crucial—skip data can be misleading if not interpreted alongside session information or user environment.
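A minimal sketch of impression-constrained negative sampling, assuming per-user sets of shown tracks (impressions) and streamed tracks (played):

import numpy as np

def sample_negatives(impressions, played, n_neg=5, rng=None):
    # Only tracks the user was actually shown but never streamed are
    # eligible as negatives, avoiding penalties for mere non-discovery.
    rng = rng or np.random.default_rng()
    candidates = np.array([s for s in impressions if s not in played])
    if candidates.size == 0:
        return candidates.astype(int)
    return rng.choice(candidates, size=min(n_neg, candidates.size), replace=False)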
What if the model’s recommendations lead to high user satisfaction but discourage users from exploring new music?
It is possible for a highly accurate system to consistently recommend exactly what the user already likes, limiting exposure to new artists or genres. This “filter bubble” or “echo chamber” effect can reduce the user’s long-term satisfaction or stifle platform growth. One solution is to introduce a controlled degree of novelty or diversity in each recommendation session, perhaps letting the user choose how adventurous they want to be. Another approach is to track longer-term satisfaction signals, such as whether users eventually remove tracks from their library or skip them after repeated plays, in order to reevaluate stale recommendations. A subtle pitfall is that pushing too much diversity might annoy users who just want the familiar. Fine-tuning the diversity injection and using A/B tests to measure resulting user behaviors is usually necessary to find the right balance.
How do we address the possibility of malicious behavior, such as bots artificially inflating the popularity of certain songs?
Malicious actors might use automated scripts or bot farms to stream certain tracks repeatedly, aiming to manipulate recommendation algorithms. This can distort the popularity metrics and degrade the user experience. A robust anomaly detection system can flag suspiciously high play counts from a small set of IP addresses or user accounts. One method is to maintain a baseline distribution of legitimate behavior (e.g., typical session length, skip patterns) and compare new usage data against these profiles. A pitfall is that real users in certain regions or with unique tastes might generate unusual patterns, so abrupt banning or penalizing them would be unfair. Therefore, combining anomaly detection with manual review or a multi-step verification can mitigate false positives. The model should also degrade gracefully if suspicious items remain in the dataset, perhaps by weighting extremely high-play counts less to reduce the impact of potential gaming.
How to manage platform-level constraints such as licensing and territory restrictions in recommendations?
Some songs might only be licensed for certain countries or might not be available on certain user plans. If a recommendation system does not account for these constraints, it might suggest content users cannot access, leading to poor user experience. Hence, each user’s accessible catalog must be well-defined and integrated into the recommendation pipeline. A tricky edge case arises if licenses expire or if a user travels across regions with different content availability. The system may need dynamic re-ranking or fallback recommendations. Another subtlety is that continuously updating region-based constraints can be technically complex when you have large catalogs and real-time model serving. Therefore, designing an architecture that can quickly identify a user’s available music and filter out or replace blocked tracks at the final ranking stage becomes critical.