Designing a Data Product, From Scratch

Prashant Jain
8 min read · Dec 3, 2019

Having the right product roadmap is crucial for any product, yet it is a challenge to prioritize the right features and requirements. In this article, I talked about how we can de-risk a product so as to maximize its success once it is released in the market. But before we reach that stage, it is critical that we know which items we should add to the list and why we need to add them at all.

Here, I discuss how we can design a data product by taking the example of a content-based recommendation system, of the kind commonly used by media streaming companies like Netflix, Amazon Prime Video, Hulu, and most recently Disney+. I will also talk about ways to optimize our model that will augment the product’s usability, something that is very commonly skipped by industry practitioners.

Let’s start from the beginning!

Defining Product Vision

First and foremost, we need to have a product vision in place. For that we should be able to confidently identify the following:

  1. Why our users would need the product we are planning to build,
  2. How the product would solve the problem,
  3. What the product will do to solve the problem,
  4. Whether the product will create value for us, i.e. will it increase revenue, lead to better brand positioning, or reduce the cost of achieving the same results.

Let’s have a look at our example. We consider a movies dataset and try to recommend the top 10 movies that are most similar to the movie the user wants to watch or has watched previously. For this example, the product vision would be: “The product will empower our users to easily discover and watch the type of movies they like, and to instantly create a curated list of movies from a simple hint about what they want to watch at the moment”.

The product vision is then essentially a fifteen-second version of our product that describes the problem we see and our solution for it. The idea is to demonstrate that there is a clearly defined market with viable revenue streams. The audience, in this way, will get a taste of our product but will be hungry for more.

Caveat: We are assuming that the user has not rated any movie in the database yet. If the user had given ratings to some movies, the model will need to be calibrated in a way that it is able to take the user-movie interactions into account too.

Building Product Roadmap

Example product roadmap for our content recommender system

Designing a data product requires product managers to engage with data scientists much more directly on models, predictability, how products work in production, how and why users interact with our products, and how success is measured by our end users. At the same time, a product manager needs to have a good sense of how accurate the application needs to be and what exceptions should be handled, and should be able to clearly communicate this to the development team. Simultaneously, a product manager should be able to facilitate technical discussions between front-end developers and data scientists to ensure the product’s aesthetics, usability, and scalability. A product roadmap is built while keeping all of this in mind. According to Roman Pichler, a product roadmap communicates how a product is likely to evolve across several major releases.

For simplicity, let’s keep our discussion within the bounds of the initial release of our content recommender system product. Ideally, a product roadmap should tell a coherent story about the product's upcoming features and releases. The product roadmap for our product will look like the image above.

Developing the Data Product

Let’s wear the hat of a data scientist and develop an actual content-based recommender system that would help media organizations showcase their content to users based on demand. The dataset and source code can be downloaded from the link below:

https://github.com/pj263/pj263.github.io/blob/master/Content%20based%20recommender%20systems.rar

We start by importing all the necessary libraries and loading our dataset. Our model is based on natural language processing, so we only select the relevant columns from the dataset that give us a good overview and description of each movie. To decrease the complexity, we merge the text columns into a single column that will be used for further analysis.
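
A minimal sketch of this step is below; the file name ‘movies_metadata.csv’ and the column names are assumptions on my part, so adjust them to match the dataset in the repository.

```python
import pandas as pd

# Load the dataset (file name and column names here are illustrative)
df = pd.read_csv('movies_metadata.csv')

# Keep only the columns that give a good description of each movie
df = df[['title', 'overview', 'tagline']]

# Merge the text columns into one 'description' column (NaNs propagate,
# so the null check in the next step still catches incomplete rows)
df['description'] = df['overview'].str.cat(df['tagline'], sep=' ')
```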

We need to ensure that there are no null values in either the ‘title’ or ‘description’ column, and if there are any, we should know what proportion of the data they represent. This helps us determine whether our model will still be of any use if we drop the rows with null values.
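
Something like the following would do, assuming the df from the sketch above:

```python
# Count the nulls and the share of rows they represent
nulls = df[['title', 'description']].isnull().sum()
print(nulls)
print(nulls / len(df) * 100)  # percentage of rows affected
```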

Next, we drop the rows that have null values, and at the same time make sure that we do not have duplicate titles or descriptions. Duplicate titles are troublesome later on: looking up a movie by title then returns multiple rows, and a boolean check on that result raises the ‘truth value of an array is ambiguous’ ValueError. All these steps fall under the category of data cleaning.
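
A sketch of the cleaning step:

```python
# Drop rows with null values, then remove duplicate titles and descriptions
df = df.dropna(subset=['title', 'description'])
df = df.drop_duplicates(subset='title').drop_duplicates(subset='description')
df = df.reset_index(drop=True)  # row positions will index the similarity matrix later
```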

Given that we have text data in our dataframe, we need to process the words that are there in the movie descriptions. For this purpose, we use PyStemmer. According to pypi.org, PyStemmer provides access to efficient algorithms for calculating a “stemmed” form of a word. This is a form with most of the common morphological endings removed; hopefully representing a common linguistic base form. This is most useful in building search engines and information retrieval software; for example, a search with stemming enabled should be able to find a document containing “cycling” given the query “cycles”.
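
For illustration, here is that pypi example in code (note that the package installs as PyStemmer but imports as Stemmer):

```python
import Stemmer  # pip install PyStemmer

stemmer = Stemmer.Stemmer('english')
print(stemmer.stemWord('cycling'))               # 'cycl'
print(stemmer.stemWords(['cycles', 'cycling']))  # ['cycl', 'cycl']
```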

TF-IDF is an occurrence-based numeric representation of text, composed of two terms:

  1. Term Frequency (TF) measures how frequently a term appears in a document
  2. Inverse Document Frequency (IDF) measures the relative importance of a word in a corpus. Commonly appearing words get a low IDF score, whereas rarer words (which are potentially more informative) get a higher IDF score, as the toy example below shows
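
A toy, from-scratch illustration of the two terms (real vectorizers add smoothing and normalization on top of this):

```python
import math

docs = [['a', 'superhero', 'saves', 'the', 'city'],
        ['a', 'superhero', 'fights', 'a', 'villain'],
        ['a', 'detective', 'solves', 'a', 'mystery']]

def tf(term, doc):
    # Share of the document's words that are this term
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # Log of (number of documents / documents containing the term)
    return math.log(len(corpus) / sum(1 for d in corpus if term in d))

# 'superhero' appears in 2 of 3 docs, 'mystery' in only 1, so 'mystery' scores higher
for term in ('superhero', 'mystery'):
    print(term, idf(term, docs))
```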

Essentially, the TF-IDF vectorizer generates a ‘vectorized bag of words’ model. Since the function processes words, we need to ensure that all words are in a common form, i.e. lowercased and stemmed, to account for variations of the same word in our dataset. Getting rid of stop words (words that are filtered out before processing natural language data; generally the most common words in a language) is a good idea here, as they just add noise without creating any value.

The final pieces of the puzzle are ‘min_df’ and ‘ngram_range’. With ‘min_df’, we tell the function to consider only those terms that occur in the descriptions of at least two movies. Since we are building a content-based recommender system that measures the similarity between any two movies based on their descriptions, words that occur in only one movie’s description are not very helpful. ‘ngram_range’ defines whether we want to consider combinations of two, three, or more words in addition to single words. For a range of (1,3), we are telling the function to consider all one-, two-, and three-word combinations in our dataset. It makes sense to have this range, as we want to capture a wider sense of a movie’s content, which might lead to recommendations that are complements rather than substitutes.
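
Putting the pieces together, here is one way to wire stemming, stop-word removal, ‘min_df’, and ‘ngram_range’ into scikit-learn’s TfidfVectorizer. The wrapper approach below is my sketch, not necessarily the exact code from the repository:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
import Stemmer

stemmer = Stemmer.Stemmer('english')

# The default analyzer lowercases, drops English stop words, and builds
# 1- to 3-word n-grams; we wrap it so every token gets stemmed as well
base_analyzer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    ngram_range=(1, 3),
).build_analyzer()

def stemmed_analyzer(text):
    return [' '.join(stemmer.stemWords(ngram.split())) for ngram in base_analyzer(text)]

# min_df=2 keeps only terms that appear in at least two movie descriptions
tfidf = TfidfVectorizer(analyzer=stemmed_analyzer, min_df=2)
tfidf_matrix = tfidf.fit_transform(df['description'])
```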

Finally, we use the cosine_similarity function from the sklearn package to find the similarity between every pair of item vectors in the dataset.
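
In code, this is essentially a one-liner:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Similarity between every pair of movie vectors; row i holds movie i's
# similarity to every other movie in the dataset
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
```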

After we have computed the pairwise cosine similarities, we need to assign each similarity metric to its associated movie title so that the most similar movies to a given movie can be determined.
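
A reverse lookup from title to row position does the trick, again assuming the df and column names from the earlier sketches:

```python
import pandas as pd

# Map each movie title to its row position in the similarity matrix
indices = pd.Series(df.index, index=df['title'])
```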

We are almost done here; the last step is to ask the model for the top ten movies most similar to the movie the user wants to watch. Since our database is not updated in real time, a user may enter a movie that does not yet exist in it, and we need to let the user know when that happens. This is called exception handling in technical terms and an edge case in product terms. Providing the best possible user experience is what decides whether our product will be a hit or a miss.
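
A sketch of this final step; the function name and message are mine, but the shape follows the description above:

```python
def recommend(title, n=10):
    """Return the n movies most similar to `title`."""
    # Edge case: the movie is not (yet) in our database
    if title not in indices:
        return f"Sorry, we could not find '{title}' in our database."
    idx = indices[title]
    # Rank every movie by its similarity to the given one, skipping itself
    scores = sorted(enumerate(cosine_sim[idx]), key=lambda x: x[1], reverse=True)
    top = [i for i, _ in scores if i != idx][:n]
    return df['title'].iloc[top].tolist()

print(recommend('Superman'))
```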

At last, the moment we have all been waiting for! Does this thing really work? Well, see for yourself: the model has recommended the top 10 most similar movies based on the movie (Superman) we told it we want to watch.

The hard part is done; now all we need to do is develop a couple of APIs and web elements to render this list on a beautiful webpage. I will cover this in the next blog in the series, where I will demonstrate how we can save a model as a ‘pickle’ and use it any time we want, without re-running the complete model every time we need predictions or recommendations.

Advantages of Content-Based Recommendation System

  1. Content-based recommender systems do not use a lot of user data.
  2. They only require item data, so you can start giving recommendations to users right away.
  3. No cold-start problem: since the system does not depend on lots of user data, it can give recommendations to even your first customer, as long as you have sufficient data to build their profile.
  4. Cheaper to build and maintain.

Challenges in Content-Based Recommendation System

  1. The item data needs to be well distributed.
  2. Recommendations to the user will mostly be substitutes, not complements.
  3. Less dynamic.

Takeaways: Building a content-based recommender system is easy. We just need to be user-obsessed and always prioritize user experience. This means doing everything possible to augment user engagement with our product and drive higher user satisfaction.

If you liked this, then Clap/Share!


Prashant Jain

Cornell Computing and Information Sciences | Product Manager | Tech Builder | Machine Learning Enthusiast