Beyond Bag-of-Words: Machine Learning for Query-Document Matching in Web Search

Bio | Summary

Hang Li is a senior researcher and research manager at Microsoft Research Asia. He is also an adjunct professor at Peking University, Nanjing University, Xi'an Jiaotong University, and Nankai University. His research areas include information retrieval, natural language processing, statistical machine learning, and data mining. He graduated from Kyoto University in 1988 and earned his PhD from the University of Tokyo in 1998. He worked at the NEC lab in Japan from 1991 to 2001. He joined Microsoft Research Asia in 2001 and has worked there ever since. Hang has about 100 publications in top international journals and conferences, including SIGIR, WWW, WSDM, ACL, EMNLP, ICML, NIPS, and SIGKDD. Papers by him and his colleagues received the SIGKDD'08 best application paper award and the SIGIR'08 best student paper award. Hang has also contributed to the development of several products, including Microsoft SQL Server 2005, Microsoft Office 2007 and Office 2010, Microsoft Live Search 2008, and Microsoft Bing 2009 and Bing 2010. He has also been very active in the research community and has served, or is serving, at top conferences and journals. For example, in 2011 he is PC co-chair of WSDM'11; area chair of SIGIR'11, AAAI'11, and NIPS'11; PC member of WWW'11, ACL-HLT'11, SIGKDD'11, ICDM'11, and EMNLP'11; and editorial board member of Journal of the American Society for Information Science and Journal of Computer Science & Technology.

Jun Xu is an Associate Researcher at Microsoft Research Asia. He received his PhD in computer science from Nankai University, China, in 2006 and then joined Microsoft Research Asia. His research interests focus on information retrieval and text mining. Jun has published extensively in prestigious conferences and journals, including SIGIR, WWW, JMLR, ECML, and ECIR. Jun is very active in the research community and has served, or is serving, at top conferences and journals. He developed the learning-to-rank algorithms IR-SVM and AdaRank, as well as the LETOR dataset, and released the AdaRank algorithm and the LETOR dataset to the academic community. Jun has also contributed to the development of Microsoft products, including Microsoft Bing 2010 and Office 2011.


Dealing with mismatch between query and document is one of the most critical research problems in web search. Researchers have recently devoted significant effort to this grand challenge. The major approach is to conduct deeper query and document understanding and to perform matching between enriched query and document representations. With the availability of large amounts of log data and advanced machine learning techniques, this approach has become more feasible, and significant progress has been made. In this tutorial, we will give a systematic and detailed survey of newly developed machine learning technologies for query-document matching in web search. We will focus on describing the fundamental problems as well as the novel solutions. Matching between query and document is not limited to search: similar problems arise in online advertising, recommender systems, and other applications, where they take the form of matching between objects from two spaces.