Reading “between the lines” of product reviews on a large scale
When people consume in the digital space, they not only click and buy, but also comment on products, brands and services on social media, on platforms and on the sites of online stores. The enormous amount of textual data consumers produce online is a treasure box that hasn’t yet been fully opened. But as data volumes grow, so do the algorithms available to process and analyze unstructured data. Artificial Intelligence (AI) is one of the domains that can help open this treasure box further and better understand consumer decision-making.
In a GfK research project, we tested whether we can learn consumer preferences and predict choices from publicly available social media and review data, relating the results to sales data. The common AI tool of word embeddings proved to be a powerful way to analyze the words people use: it enabled us to reveal consumers’ preferred brands, favorite features and main benefits. Language biases uncovered by the analysis can indicate preferences, and they matched actual brand sales reasonably well across various categories. Especially when data volumes were large, the method produced very accurate results, and it is completely passive (see Box 1): we used free, widespread online data without affecting respondents or leading them to rank or answer questions they would otherwise not even have thought of. The analysis is fast to run, and no fancy processing power is needed.
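To make the idea concrete, here is a minimal sketch of how a language bias in embeddings can be turned into a brand ranking. Everything in it is hypothetical: the brand names, the evaluative words and the tiny hand-crafted vectors stand in for embeddings that would, in practice, be trained on the review text itself (e.g. with word2vec). A brand’s score is simply how much closer it sits to positive words than to negative ones in the embedding space.

```python
import math

# Hypothetical toy embeddings (3 dimensions). In a real analysis these
# vectors would be learned from the review corpus, not written by hand.
embeddings = {
    "brand_a":   [0.9, 0.8, 0.1],
    "brand_b":   [0.7, 0.6, 0.2],
    "brand_c":   [0.1, 0.2, 0.9],
    "excellent": [1.0, 0.9, 0.0],
    "great":     [0.8, 1.0, 0.1],
    "poor":      [0.0, 0.1, 1.0],
    "faulty":    [0.1, 0.0, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def preference_score(brand, positive, negative):
    """Language bias: mean similarity to positive words minus
    mean similarity to negative words."""
    pos = sum(cosine(embeddings[brand], embeddings[w]) for w in positive) / len(positive)
    neg = sum(cosine(embeddings[brand], embeddings[w]) for w in negative) / len(negative)
    return pos - neg

positive, negative = ["excellent", "great"], ["poor", "faulty"]
ranking = sorted(["brand_a", "brand_b", "brand_c"],
                 key=lambda b: preference_score(b, positive, negative),
                 reverse=True)
print(ranking)  # brands ordered from most to least positively biased
```

With these toy vectors, `brand_a` and `brand_b` (which point toward the positive words) rank above `brand_c`; in the study, such rankings were then compared against actual sales ranks.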
Predicting the most preferred brands in one category
To test if brand preferences could be derived from online reviews, we first ran the AI-based text analysis for one category (TVs) and different amounts of data and compared the outcome of the analysis to actual sales data.
Specifically, we ran three experiments: one using data from a single online retailer, encompassing 3,000 reviews in total; one using data from multiple retailers, totalling 4,500 reviews (a random subsample of the full dataset); and one using the entire dataset of 53,000 reviews.
The results are displayed in Figure 1. The first column shows the sales ranks of the five brands in the category. It is important to note that the sales differences between Brands C, D and E were quite small, so we had expected some confusion among them. The second column shows the results for the 3,000 reviews scraped from a single online retailer. With this limited amount of data, the ranking is clearly wrong: the best-selling Brands A and B land on ranks 3 and 4 instead of 1 and 2. The third column introduces multiple retailers with a random subsample of 4,500 reviews. In this experiment, Brand A is now in the correct position (1), but there is confusion between Brand B and the others. The fourth column, using the complete dataset of 53,000 reviews, shows the correct ranking for Brands A and B - the major volume drivers in the category - with confusion remaining only among Brands C, D and E.