Workshop on Recommender Systems:

Algorithms and Evaluation

Ian Soboroff, Charles Nicholas, and Michael Pazzani

Introduction

The world of recommender systems has expanded considerably since the Communications of the ACM published its feature issue on the topic two years ago. Projects such as GroupLens have gone on to be successful commercial ventures, and recommendation systems are de rigueur for Internet commerce. The basic technology has very quickly moved from the research world to popular applications.

Several methods for implementing recommender systems have emerged, including approaches that base recommendations on correlations of groups of users and methods that learn about individual users. However, the architectural issues of cold-start, sparse ratings, and scalability continue to dominate the field.

The state of the art in recommender systems will be enhanced by the development of evaluation methodologies for recommender systems. User studies are difficult to conduct and generalize from, and issues of presentation and relevance make traditional IR evaluation measures not entirely suited to the domain. Furthermore, test collections such as DEC SRC's EachMovie data set are becoming standard tools, but the need for larger collections in different domains is great.

Thus, the goal of the workshop was to discuss moving to the next phase of recommender systems research, from the basic "how do we do it" to "how can we do it better", and "how do we know that it's better".

SIGIR was a good forum for such a workshop. The recommender systems community has existed at the crossroads of information retrieval, machine learning, computer-human interfaces, digital libraries, and the World-Wide Web. The information retrieval community has long wrestled with questions of experiment and evaluation, and we hoped that for our particular goals we would benefit from the SIGIR environment.


The basic format of the workshop was three technical sessions with talks on submitted papers. These were bracketed by a keynote at the start of the day and an animated panel discussion at the end.

Keynote Address

Our keynote speaker was Joseph Konstan of the University of Minnesota, who is well known for his involvement in the GroupLens project. GroupLens produced some of the earliest work in collaborative recommendation, and many of the underlying algorithmic approaches used throughout the field come from GroupLens. GroupLens itself has grown into a successful commercial company, Net Perceptions, whose software is used in several popular Internet storefronts.

Historical Perspective

Dr. Konstan first gave a brief history of recommender systems, from their beginnings around 1992. He observed that in the "early days", the expectations were fairly low by today's standards: 10,000 users and 100 predictions per second was good performance, and "better than random" was effective. This received a lot of laughs, but Konstan emphasized that people used these systems and seemed to come back. That is not much of an evaluation measure from the point of view of researchers, but for a web portal it might really be the only measure that matters.

At the 1994 workshop in Berkeley, several big questions were identified: tackling scalability, sparse ratings, handling implicit ratings, adding content to collaboration, and whether the idea was even economically viable. That last question didn't take long to answer, as many groups quickly went commercial. Net Perceptions attacked the scalability problem by hiring Cray employees when Cray was bought by SGI. Morita and Shinoda's 1992 paper on implicit ratings had a lot of skeptics, but now buying histories drive web recommendation. More advanced use of implicit ratings and the integration of content are still open questions.

In 1997, the CACM article gave us widespread use of the label of "Recommender Systems." Konstan remarked that this reflects a change of focus from the algorithm to the interface; now, of course, a recommender system might not involve any collaborative aspects at all.

The summer of 1997 also brought us the first community corpus, DEC's EachMovie. EachMovie has increased the number of CF experiments being done, and it means that a research group doesn't necessarily need a working system with many users to have enough data to work with. In general the field has experienced a vast "mushrooming" in the last year and a half, with workshops and papers in the IR, AI/ML, CHI, CSCW, and Agents communities. Growth is good, but it is now much harder for anyone to know everything that is being done.

Opportunities

While commercial adoption has been high, it's hard to gauge what this will mean in the research community. All the studies we have so far indicate that people like believing that they are getting personalized recommendations. However, it might not even be worthwhile to work hard to make good recommendations or even to do real personalization: CDNow at one point tried sending random recommendations, and found that people liked what they got anyway.

However, Konstan's sense was that there is a lot of opportunity now for the community. Information overload is a huge problem, and personalization is a very hot topic. There is a substantial interest in leveraging human expertise and knowledge. Finally, the core technology is mature, so the interesting questions are broadening.

To Konstan, these interesting questions are (a) dealing with massive scale and sparsity, (b) dealing with "recurring startup" problems, such as a newspaper that generates new, unrated articles every day, (c) actually making users more effective, that is, whether the system improved users' decision-making process, and (d) building and leveraging recommendation communities that users identify and interact with.

Obstacles

Konstan also identified several obstacles standing in the way of these opportunities. One is that it is hard to do interesting recommender systems research: user experiments require users, and only large groups have the resources to construct systems which can attract enough users to get useful data. But a larger problem is that recommender systems science is fairly disorganized. Systems are mostly incomparable, metrics and data collections aren't standard, and there is no infrastructure for controlled experiments across research groups. There are widespread anecdotal feelings about what should work, but no well-defined "best practices". It has to be possible to do both "off-line" research on algorithms, and "on-line" research with user interfaces.

Solution

Given these opportunities and obstacles, Konstan presented a five-point solution:

Data Sets
We need more than EachMovie, largely to get past recommending movies. These datasets should include, at a minimum, ratings, content, and demographic information on users. Moreover, the datasets need to have defined extracts with known properties: size, number of ratings per user, etc.
Recommendation Engine
A free or cheap recommendation engine is possibly the most often-heard request on the mailing list. It should be research-focused: flexible, controllable, and customizable. It should support both on-line and off-line analysis.
Analysis Tools and Metrics
The metrics being used need rigorous analysis: mean absolute error, mean squared error, area under an ROC curve, reversal rate, error rate, coverage, ranked accuracy in top k. The experimental methods are also varied: temporal order, randomly sampled, last five recommendations as most recent. Agreement on some small subset of these is essential for comparable research. Scripts and tools for working with the data also need to be available.
On-line Systems
On-line systems can provide a source of "borrowable" users. This is a long way down the road, and the issues of consent, privacy, and an experimental infrastructure are all hard.
Resources
System-building templates, construction tools, mailing lists, links, research bibliography, etc.
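
Several of the metrics named under "Analysis Tools and Metrics" are simple enough to state in a few lines. The following is a minimal sketch only: the function names are ours, a missing prediction is represented as None, and "ranked accuracy in top k" is read here as precision among the top k ranked items, which is one possible interpretation rather than a definition agreed on at the workshop.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - rating| over the test ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def mean_squared_error(predicted, actual):
    """Average of (prediction - rating)^2; penalizes large misses more heavily."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def coverage(predictions):
    """Fraction of requested items for which the system produced any prediction."""
    return sum(p is not None for p in predictions) / len(predictions)


def precision_at_k(ranked_items, liked_items, k):
    """Fraction of the top-k recommended items that the user actually liked."""
    return sum(item in liked_items for item in ranked_items[:k]) / k


# Toy data, invented for illustration.
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]

mae = mean_absolute_error(predicted, actual)
mse = mean_squared_error(predicted, actual)
cov = coverage([4.0, None, 3.5, None])
p_at_2 = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 2)
print(mae, mse, cov, p_at_2)
```

Note that the two error measures rank systems differently when one system makes a few large misses and another makes many small ones, which is part of why agreement on a small common subset matters for comparability.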

Konstan said that GroupLens can offer some of these things in the near future, such as some data sets, a small engine, and statistical tools. They also have a lot of experience running experimental, on-line sites, and can help with network hosting and seeking funding for management and workshops.

For the workshop, Konstan said that the focus should be on brainstorming some of these ideas, and consolidating the list of available resources. He also said that the critical mass is there for a global standalone workshop. In the medium term, the need is for a steering committee and sustained funding in the area. In the long term, the need is to maintain and document resources, hold user workshops, and support more research.

David Evans asked about the relationship of recommender systems to data mining and statistical modeling. Konstan said that recommender systems are real-time, on-line, model-free associations among people, and are easily explainable. Evans pointed out that the current practice is not model-free, and that marketing people have the same goals. Konstan replied that data mining is presently very analyst-intensive.

Heinrich Schuetze asked why there is a need to move away from movies. Konstan said that the right domains are human-created media with a target audience that you expect to cluster people by tastes. With movies, the community knowledge is high but user investment is low. Can we work with mutual funds? Eric Glover also pointed out that movies have relatively stable value, and that other artifacts have issues of temporal decay and portfolio effects.

Another question was asked on the value of exposing the rules and enhancing explainability. Konstan pointed to Jon Herlocker's forthcoming dissertation. There are different ways of explaining the reason for what is recommended: what you rated that was most pivotal, success rate, simple reverse engineering of the computation ("You liked these other movies...").

Recommendation Algorithms

Our first session consisted of three papers on improving the standard correlation-based prediction algorithms. Joaquin Delgado first presented a prediction algorithm which combines the correlation prediction with a weighted-majority voting scheme. In his paper, he was able to prove a bound for prediction errors in his algorithm, based on the size of the pool of predictions and the error of the best voting predictor. He used a subset of EachMovie, and simulated a chronological ordering. He reported precision and recall results, and compared them to those reported by Billsus and Pazzani. Questions asked included whether we can capture users' changes as they're exhibited, and whether the technique could be used to weight communities rather than users. Doug Oard pointed out that the metric was really relative recall.
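
For readers less familiar with the baseline that this session's papers set out to improve, the following is a minimal sketch of a correlation-based predictor in the commonly described neighborhood style: a user's predicted rating is their own mean plus the correlation-weighted deviations of other users from their means. It is not Delgado's algorithm and does not include the weighted-majority voting step. The function names and the toy ratings are invented for illustration, and details such as whether means are taken over all ratings or only co-rated items vary between implementations.

```python
from math import sqrt


def mean_rating(user_ratings):
    """Mean of all ratings a user has given."""
    return sum(user_ratings.values()) / len(user_ratings)


def pearson(a, b):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    common = [item for item in a if item in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def predict(ratings, user, item):
    """Active user's mean plus correlation-weighted deviations of the other raters."""
    active = ratings[user]
    numerator = 0.0
    denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        weight = pearson(active, other_ratings)
        numerator += weight * (other_ratings[item] - mean_rating(other_ratings))
        denominator += abs(weight)
    if denominator == 0:
        return None  # no usable neighbors: this item is not covered
    return mean_rating(active) + numerator / denominator


# Toy ratings on a 1-5 scale, invented for illustration.
ratings = {
    "ann": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 3, "m4": 5},
    "cat": {"m1": 1, "m2": 5, "m3": 2, "m4": 1},
}
prediction = predict(ratings, "ann", "m4")
print(prediction)
```

The sketch does not clip predictions to the rating scale, so a strongly agreeing neighbor can push a prediction above the maximum rating. Each prediction scans every other user, which is the cost the scalability work in this session tries to reduce.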

Second, Ken Goldberg presented a technique to reduce the time needed to compute predictions. Rather than computing a full correlation matrix at the time of prediction, his group computes the principal components of the ratings matrix off-line, and uses them to compute predictions quickly. Their test system, Jester, recommended jokes and was also a featured SIGIR demo. Jester has 30,000 registered users and around 1,000,000 ratings, but a small pool of jokes. Their evaluation focused on high-variance jokes.
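
The split between off-line and on-line work can be illustrated with a small sketch. This is a simplification under our own assumptions, not the Jester implementation: it assumes a dense matrix in which every user has rated the same small set of items, uses NumPy's SVD to obtain principal components, and predicts by averaging the ratings of the nearest users in the reduced space. The function names and the toy data are invented for illustration.

```python
import numpy as np


def fit_offline(dense_ratings, n_components=2):
    """Off-line step: principal components of a dense users-by-items matrix."""
    mean = dense_ratings.mean(axis=0)
    centered = dense_ratings - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    projected = centered @ components.T  # every known user in the reduced space
    return mean, components, projected


def predict_online(new_ratings, mean, components, projected, target_ratings, k=2):
    """On-line step: project the new user, then average the k nearest users' ratings."""
    point = (new_ratings - mean) @ components.T
    distances = np.linalg.norm(projected - point, axis=1)
    nearest = np.argsort(distances)[:k]
    return float(target_ratings[nearest].mean())


# Four users who all rated the same three items (toy data).
dense_ratings = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 2.0],
    [1.0, 2.0, 5.0],
    [2.0, 1.0, 4.0],
])
# Those users' ratings of a fourth item that the new user has not seen.
target_ratings = np.array([5.0, 4.0, 1.0, 2.0])

mean, components, projected = fit_offline(dense_ratings)
prediction = predict_online(np.array([5.0, 5.0, 1.0]), mean, components,
                            projected, target_ratings)
print(prediction)
```

The expensive decomposition runs once, ahead of time. At prediction time the work is one small matrix product and a distance computation in a space of n_components dimensions rather than over the full ratings matrix.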

Jon Herlocker presented work-in-progress on applying clustering and partitioning algorithms to the ratings matrix, and computing predictions within the clusters. Graph partitioning algorithms had a bias towards equal-sized partitions, while average link and ROCK produced a lot of single-item clusters. While there are scalability gains, the results were mixed on prediction accuracy. For evaluation, they used data from the MovieLens project. The approach might be suitable for parallelizing the problem without harming accuracy and coverage too much.

Combining Content and Collaborative Recommendation

In the second session, three authors presented techniques for integrating document content with collaborative information. Mark Claypool presented his group's work in generating an online newspaper, which uses a weighted average of content and collaborative predictions. The weight is tunable on a per-user basis. Content-based predictions were initially better than collaborative ones, but collaboration eventually overtook content. It was asked if the misses might be useful for topic detection.
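
The weighted-average idea can be sketched as follows. This is an illustration under our own assumptions rather than the scheme from Claypool's paper: the update rule, the step size, and all names are hypothetical. The weight is the share given to the content-based prediction, and it is nudged toward whichever predictor was closer to the rating the user actually gave.

```python
def combine(content_pred, collab_pred, weight):
    """Weighted average; weight is the share given to the content-based prediction."""
    return weight * content_pred + (1.0 - weight) * collab_pred


def update_weight(weight, content_pred, collab_pred, actual, step=0.25):
    """Shift the weight toward whichever predictor had the smaller absolute error."""
    content_err = abs(content_pred - actual)
    collab_err = abs(collab_pred - actual)
    if content_err < collab_err:
        weight += step
    elif collab_err < content_err:
        weight -= step
    return min(1.0, max(0.0, weight))  # keep the weight in [0, 1]


# A new user starts out leaning on content, since few ratings exist yet.
weights = {"new_user": 0.75}
before = combine(4.0, 2.0, weights["new_user"])

# Feedback as (content prediction, collaborative prediction, actual rating).
# In this toy history the collaborative prediction is the more accurate one.
history = [(4.0, 2.0, 2.0), (5.0, 3.0, 3.0)]
for content_pred, collab_pred, actual in history:
    weights["new_user"] = update_weight(
        weights["new_user"], content_pred, collab_pred, actual
    )

after = combine(4.0, 2.0, weights["new_user"])
print(before, weights["new_user"], after)
```

Keeping one weight per user lets the mix drift from content toward collaboration as that user accumulates ratings, which matches the pattern reported in the talk.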

Michelle Condliff presented a Bayesian model which integrates content and collaborative information. The model fits a naive Bayes classifier for each user based on item features, and then combines the classifiers using a regression model to approximate maximum covariance. There was a set of user covariates: age, gender, and geographic region (zip codes, which didn't work very well for university undergraduates), integrated into an information matrix with document features and ratings. She evaluated the model using a small example dataset and EachMovie.

Finally, Raymond Mooney presented a book recommender which used text categorization methods on information extracted from Amazon.com book descriptions and reviews. The content information was modeled as a semistructured bag of words. Mooney posed two questions: whether it is possible to use content-based learning based on the content of collaborative recommendations, and whether it is useful to let users control the training samples and avoid the "rate these 100 sample items" approach.

Models for Users and Information Need

The last session featured two papers with rather different subjects. In the first presentation, David McDonald discussed ongoing work in building recommender systems to support "expertise networks" in organizations. In the particular project he discussed, they conducted a field study of how people go about finding an expert in some domain in their organization. Based on this study, a recommender system was designed to support this process electronically.

Lastly, Eric Glover presented a World-Wide Web metasearch engine which models categories of user search needs. These prototypes are translated into engine-specific search queries, and are also used to filter and organize the results from each engine.

Global Questions

Each author was asked to address in their presentation how they evaluated their system, and what evaluation approaches were available to them.

Several researchers enhancing traditional collaborative filtering, and also exploring combinations of collaboration and content, were able to test their systems using the EachMovie database from Digital. This was generally agreed to have helped broaden our understanding of the dataset and its limitations. Many projects are looking at prediction error as a metric, but there was not a lot of discussion of metrics and their pros and cons.

Additionally, there are lots of domains which we think of as recommendation, but which are fairly distinct. For example, the problem of recommending experts, both in McDonald's work and in Henry Kautz's classic ReferralWeb project, has different requirements and data than a collaborative filtering system. Glover's work in metasearch engines is perhaps illustrative of one way of overcoming privacy issues, by asking users to assume a prototype need rather than providing personal information.

Panel Session

At the close of the workshop, a panel discussion was held to try to address some of these overarching issues. The panelists were Ali Kamal of Tivo, Inc., Clifford Lynch of the Coalition for Networked Information, and presenters Joseph Konstan and Raymond Mooney. Kamal brought some industry experiences in recommending television programs, and Lynch illustrated several important social issues in collaborative systems, such as privacy and reliability. The panel was started off with a kind of open-ended, get-your-cards-on-the-table question, but the workshop participants quickly threw in a wide variety of interesting issues.

What are good problem areas for researchers/grad students to focus on?
McDonald considered wants and needs together in a social network...
Game theory can solve a lot of problems, and is orthogonal.
What about CF as information brokers?
What about other sources of user data, such as "usage"?
A comment about privacy... part of the motivation of movie recommenders was to build a community; they wanted real neighbors. Here we have the narrow view of a classification problem... ML has really come down to this, where everything being done is classification.
What about customer relations/management in retail... most are based on rules on how they fill the cart, and suggest purchases. They also have even bigger numbers!
But commerce is a choice scale - "do you want fries with that?" It's all constrained to the situation.
The commercial interest is for you to have a good time, to feel good. [Not really interested in predictive accuracy...]
What about "distance" [of a user to a product, goal, artifact] ... you can take secondary data.
Often people say they want something but really want something else... they say they want good news, but everyone watches bad news.
Why would we want to extend CF to other tasks? Some tasks CF isn't good at (e.g. news), some it's too risky (e.g. medical). What are the good and bad applications?
What about the algorithmic approach of similarity of users using all features. Input features are item dependent.
Taste occurs in a social setting... social data as implicits is good. Privacy is important but the social aspects can't be cut out for it...
At this point, we had extended our schedule quite a bit, and discussions tended to continue off-line.

Conclusion

It's clear that even the overload of conference and workshop venues has not diminished interest. Three important levels of discussion plainly emerged at the workshop. At the first level, there are the essential computer science questions of building fast and effective systems. At the second level, there are community issues of how to support the burgeoning research effort despite fragmentation among many focus areas such as machine learning and computer-human interfaces. At the last level are broad issues of privacy and the other "side-effects" of collaborative, community-oriented software.
The problems in the first level are easy to point to, and are probably well defined at the state of the art. There are issues we have known about for years, such as utilizing content and user data, implicit ratings, sparsity and scale; some of these only get bigger with time. There are also low-level issues pointed to by several of the research papers and also many of Clifford Lynch's comments in the panel, where CF and recommender systems are best seen as a part of a larger solution.
Joe Konstan gave a superb roadmap for the second level. Most importantly, this structure can help support and focus the research on the first. It only needs manpower to implement it.
The last level has some well-known hard problems, such as balancing privacy and community, and it's certain that there are others waiting in the wings. Amazon.com was recently sharply criticized for publicizing actual "information gatekeeper" communities, such as corporations or university departments, where anyone could see what they were buying as a whole. Many were concerned with Microsoft's purchase of Firefly, and its subsequent apparent demise, and what this would mean for privacy and user control of personal data. The fact is that these issues will only be answered alongside running systems, which need the solid foundation of the other two levels to exist.
Acknowledgements: We would like to thank Doug Oard for his invaluable notes taken at the workshop, SIGIR conference chair Frederic Gey for not being too annoyed when more than half of the workshop participants registered on-site and we couldn't fit in the room, and all the workshop participants for making such a successful and fascinating meeting!