SIGIR 2004
The University of Sheffield
Sheffield, July 25-29

Morning Tutorials (08:30-12:15)

XML Information Retrieval
Ricardo Baeza-Yates (University of Chile) & Norbert Fuhr (University of Duisburg-Essen)

XML Standards - This part introduces the major concepts of the relevant standards, discusses their significance for IR, and surveys available XML software tools.

Structured text models, algorithms and data structures - This part covers structured text models and their relation to XML. Structured text models predate XML and were originally based on SGML databases. We compare their expressiveness and efficiency and show how some of these models can be used to implement XML query languages.
We continue with specific indices for XML as well as the associated searching algorithms. The solutions come from different viewpoints: databases, where in most cases XML is mapped to relational databases; and IR, where the index structure is combined with a full-text inverted index. The content of this part is the basis for developing new XML IR approaches.
We also mention other XML retrieval problems, in particular searching XML streams.
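To make the IR-style approach above concrete, the following toy sketch (invented for illustration, not code from the tutorial) combines a full-text inverted index with element paths, so that a term can be matched only within a given part of the XML structure. All class and path names here are hypothetical:

```python
from collections import defaultdict

class XMLInvertedIndex:
    """Toy index mapping each term to postings of (doc_id, element_path)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, path, text):
        # Index every term of an element's text under its path.
        for term in text.lower().split():
            self.postings[term].add((doc_id, path))

    def search(self, term, path_prefix=""):
        # Return doc ids where `term` occurs inside an element whose
        # path starts with `path_prefix` (e.g. "/article/title").
        return {doc for doc, path in self.postings[term.lower()]
                if path.startswith(path_prefix)}

index = XMLInvertedIndex()
index.add(1, "/article/title", "XML retrieval")
index.add(1, "/article/sec/p", "inverted indexes for XML")
index.add(2, "/article/sec/p", "relational mapping of XML")

print(index.search("xml"))                          # matches in both documents
print(index.search("retrieval", "/article/title"))  # only document 1
```

Real XML IR systems refine this idea considerably, but the sketch shows how structural constraints can be evaluated directly on an inverted index.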

Approaches for XML IR - In principle, all standard IR tasks can be performed with XML as well. So far, research has mainly focused on ad-hoc retrieval and clustering of XML documents. Due to the properties of XML, the standard methods have to be extended and/or modified in order to deal with XML structure as well as with different types of content occurring in an XML document. In this part, we will first outline the different types of tasks that are considered in XML IR and then present examples for the different approaches.

Machine Learning For Text Classification Applications
David Lewis

A wide variety of information access and data mining tasks can be viewed as text classification problems. This perspective allows machine learning techniques to be used to reduce manual effort. Attendees of this tutorial will learn what machine learning can and can't do, how to choose learning techniques and software, and processes and techniques for improving their effectiveness. Examples will be drawn from areas such as knowledge management, customer service, web directories, alerting and news services, filtering, bioinformatics, information security, and survey research. I will end by discussing areas where research progress could greatly aid text classification in operational settings.
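To make the learning perspective concrete, here is a minimal multinomial Naive Bayes text classifier written from scratch with add-one smoothing; it is an illustrative sketch with invented toy data, not code or data from the tutorial:

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)  # class frequencies
        self.term_counts = {c: Counter() for c in self.classes}
        for text, label in zip(docs, labels):
            self.term_counts[label].update(text.lower().split())
        self.vocab = {t for c in self.classes for t in self.term_counts[c]}

    def predict(self, text):
        def log_score(c):
            # Unnormalised log posterior: log prior + sum of smoothed
            # log term likelihoods.
            total = sum(self.term_counts[c].values()) + len(self.vocab)
            score = math.log(self.prior[c])
            for t in text.lower().split():
                score += math.log((self.term_counts[c][t] + 1) / total)
            return score
        return max(self.classes, key=log_score)

docs = ["cheap meds buy now", "meeting agenda attached",
        "win cash now", "project report draft"]
labels = ["spam", "ham", "spam", "ham"]
clf = NaiveBayesTextClassifier()
clf.fit(docs, labels)
print(clf.predict("buy cash now"))  # classified as 'spam'
```

With labeled examples in hand, the same few lines of training code serve any binary or multi-class text classification task, which is precisely what makes the machine learning view of these applications attractive.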

David D. Lewis is a consultant on information retrieval, data mining, and natural language processing, as well as co-founder of Ornarose, Inc. He works with organizations of all sizes on the design, implementation, acquisition, and operation of systems for manipulating and mining text data. Lewis has published more than forty scientific papers and holds six patents on information retrieval and text mining technology. He helped organize several U.S. government evaluations of language processing technologies, and created several widely used test collections. Prior to setting up a consulting practice, he held positions at AT&T Labs, Bell Labs, and the University of Chicago.

Multilingual Information Retrieval
Fred Gey (University of California at Berkeley)

The growth of the Internet and the World Wide Web has made available vast written and spoken resources on a global scale from almost all countries in the world. The languages represented on the web reflect this diversity of resources and, to the serious searcher, documents in languages other than English may provide unique news, cultural insight and altogether different perspectives on our electronic world. Moreover, most of the world's peoples speak a native tongue other than English, a fact that will increasingly be felt on the Internet.

During the past decade rapid progress has been made in developing techniques for Multilingual Information Retrieval. The use of electronic bilingual dictionaries and machine translation software has been augmented by lexicons assembled from aligned bilingual parallel corpora of translated documents, and by techniques for query expansion, phrase recognition and translation disambiguation. Most of these resources, however, have been developed for and applied to the major European and Asian languages.

This half-day tutorial will cover aspects of Multilingual Information Retrieval such as cross-language search and retrieval, machine translation and statistical machine translation, multilingual search of the WWW and electronic digital library catalogues, evaluation strategies, and the evaluation campaigns and test collections for cross-language search effectiveness in the United States (TREC), Japan (NTCIR) and Europe (CLEF).
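As a concrete illustration of the dictionary-based approach mentioned above, the sketch below performs naive bilingual query translation by replacing each source-language term with all candidate target-language translations. The tiny English-French lexicon is invented for illustration; real systems add phrase handling and translation disambiguation on top of this baseline:

```python
# Toy bilingual lexicon: each source term maps to candidate translations.
LEXICON = {
    "wine":   ["vin"],
    "red":    ["rouge", "roux"],
    "region": ["région"],
}

def translate_query(terms, lexicon):
    """Dictionary-based query translation: substitute every candidate
    translation for each term, passing untranslatable terms through
    unchanged (useful for names and loanwords)."""
    target = []
    for t in terms:
        target.extend(lexicon.get(t.lower(), [t]))
    return target

print(translate_query(["red", "wine"], LEXICON))  # ['rouge', 'roux', 'vin']
```

Note that ambiguous entries ("red" above) inflate the target query with spurious senses, which is exactly why translation disambiguation and parallel-corpus lexicons matter.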


User Interfaces and Visualisation for Information Access
Marti Hearst (University of California at Berkeley)

This tutorial discusses user interfaces for information retrieval systems.

A search user interface should aid in the understanding and expression of users' information needs. It should also help users formulate their queries, select among available information sources, understand search results, and keep track of the progress of their search. Designing user interfaces for search systems is more difficult than for most other kinds of application, because of the huge variety of content that appears in queries and the collection.

The tutorial will center on how to design good search user interfaces, and what is currently known to work well. It will include a brief introduction to User-Centered Design from Human-Computer Interaction (HCI), which is key for developing usable interfaces, and a discussion of the information seeking process.

Tutorial topics will include Web search interface solutions, interfaces for query specification, query modification, viewing retrieval results, question answering, and viewing document collections. Where possible, ideas will be illustrated with examples from current live interfaces. Discussion will also cover techniques that, although of great interest, have so far not shown strong usability results in practice for search interfaces. These techniques include automatic clustering and information visualization.

Dr. Marti Hearst is an Associate Professor in the School of Information Management and Systems at the University of California, Berkeley. She obtained a PhD in Computer Science at UC Berkeley and was a member of the research staff at Xerox PARC for three years.

Dr. Hearst's research interests include user interfaces, visualization, and robust language analysis for information access systems. She wrote the chapter on User Interfaces and Visualization for the textbook "Modern Information Retrieval" (Baeza-Yates & Ribeiro-Neto, Eds.). Her research projects include the Flamenco project for incorporating hierarchical faceted categories into site search interfaces, the TileBars and Cat-a-Cone visualizations for search interfaces, applying Scatter/Gather clustering interactively to retrieval results, the TextTiling discourse segmentation algorithm, and the Cha-Cha web intranet search interface.

Dr. Hearst was program co-chair of SIGIR'99 and is a member of the editorial board of ACM Transactions on Information Systems and ACM Transactions on Computer-Human Interfaces.

Afternoon Tutorials (13:15-17:00)

Internet Search
Marti Hearst (University of California at Berkeley) & Knut Magne Risvik (Yahoo!)

Search has become a backbone service for the Internet. Over 250M queries per day are issued worldwide, each resolved against indices of billions of documents, with highly relevant results returned in a few milliseconds. A number of technologies make this possible, drawing on distributed systems theory, computational linguistics, data mining, and information retrieval. This tutorial will survey these technologies, including large-scale WWW crawling strategies, Web page content analysis, result set ranking algorithms, and evaluation methodologies. The tutorial will assume familiarity with IR concepts, such as vector-space matching, but will otherwise develop from the ground up all that is required to gain an operational understanding of Internet search engine fundamentals. The tutorial will also touch on active areas of research for Third Generation Internet Search Engines.
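One widely known ingredient of web result ranking is link analysis in the style of PageRank. The following power-iteration sketch is illustrative only, not a description of any particular engine; the three-page "web" is invented:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of page -> list of outlinks.
    Each page shares damped rank among its outlinks; dangling pages
    spread their rank uniformly over all pages."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}  # teleport mass
        for p, outs in links.items():
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:  # dangling page
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # 'c' attracts the most rank
```

Production systems combine such query-independent link evidence with content-based scores, which is one reason result set ranking is treated as its own topic in the tutorial.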

High Performance Indexing and Query Evaluation For Information Retrieval
Justin Zobel (RMIT University)

Text search is a key technology of modern computing. Search engines that index the web provide a breadth and ease of access to information that was inconceivable a decade ago. Text search has also grown in importance at the other end of the size spectrum; the help services built into operating systems, for example, rely on it. Since 1990, a range of new query evaluation algorithms and index representations have appeared -- including those developed by the presenter of this tutorial -- that allow information retrieval queries to be efficiently resolved on document collections containing terabytes of text.

The challenges presented by text search have led to development of a wide range of algorithms and data structures. These include compact representations for text indexes, innovative index construction techniques, and efficient algorithms for evaluation of text queries. Systems with indexes based on these techniques can resolve queries with a small fraction of the resources required by traditional approaches to text indexing, and they allow the rapid response expected of web search engines.

The tutorial concerns the practical problems of indexing, querying, storing, and updating large text databases, including those that are the result of web-based information harvesting. While some of these developments are well known and have been consolidated in textbooks, many specific innovations are not widely known, and the textbook descriptions are incomplete. This tutorial introduces the key techniques in the area, describing both a core implementation and how that core can be enhanced through a range of extensions and innovations. The main elements of these innovations include fast index construction techniques, novel index representations, and efficient query evaluation strategies. The tutorial consolidates recent innovations in search engine implementation, and provides attendees with a basis for further research in the area.
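A concrete example of a compact index representation is variable-byte coding of docid gaps, a standard technique in this literature. The sketch below is an illustration written for this summary, not code from the tutorial:

```python
def vbyte_encode(docids):
    """Gap-encode a sorted docid list with variable-byte codes: each gap
    is written in base-128, most significant byte first, with the high
    bit set on the final byte as a terminator."""
    out = bytearray()
    prev = 0
    for d in docids:
        gap, chunk = d - prev, []
        prev = d
        while True:
            chunk.insert(0, gap % 128)
            if gap < 128:
                break
            gap //= 128
        chunk[-1] += 128  # flag the last byte of this gap
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    """Reverse the encoding, re-accumulating gaps into docids."""
    docids, n, prev = [], 0, 0
    for b in data:
        if b < 128:
            n = n * 128 + b
        else:
            prev += n * 128 + (b - 128)
            docids.append(prev)
            n = 0
    return docids

postings = [3, 7, 130, 1000, 100000]
encoded = vbyte_encode(postings)
assert vbyte_decode(encoded) == postings
print(len(encoded), "bytes instead of", 4 * len(postings))
```

Because postings lists are sorted, the gaps are mostly small and compress well, which is what lets such indexes answer queries with a fraction of the I/O of uncompressed representations.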

Professor Justin Zobel works in the School of Computer Science and Information Technology at RMIT University, where he leads the RMIT Search Engines research group. He is best known in the research community for his work on efficient indexing techniques, and has published widely in the areas of information retrieval, text databases, algorithms and data structures, genomics, and compression. His most recent text is the second edition of "Writing for Computer Science". He is active in the information retrieval community; he is currently Treasurer of SIGIR, an editor of the Kluwer journal Information Retrieval, and an associate editor of ACM Transactions on Information Systems.

Bioinformatics and Genomics For IR
Bill Hersh (Oregon Health & Science University)

The goal of this tutorial is to present an overview of bioinformatics and genomics for information retrieval researchers and developers. By the end of the tutorial, the attendee should have a better understanding of basic molecular biology as well as where to find resources to learn more. He or she will also obtain an overview of the work done in the field of bioinformatics.

The tutorial begins with definitions of terms used in bioinformatics. It then provides an overview of basic molecular biology, covering cell structure and function, basic genetics (DNA structure and function, genes, chromosomes, protein production, and the “central dogma” of biology), and the Human Genome Project and its accomplishments.

This is followed by a description of the genome technologies that have promoted the most scientific advancement, including DNA microarrays. The data-intensive nature of these technologies and how they have altered biological science are discussed.

With this foundation of biological science in place, attention then turns to the myriad genomic databases that have become available, most of them free of charge. Specific resources covered include:

  • The genomics resources produced by the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM)
  • Sequence databases and their significance
  • Textual databases and their linkages to other databases
  • The PubMed/Entrez interfaces to all of this data and how they are used by bioinformatics and genomics researchers for scientific discovery

The tutorial then provides a more detailed discussion of information retrieval in this domain and where it is likely to play a significant role. It concludes by summarizing all that has been taught and leaves time for general discussion of how information retrieval systems will be most effectively developed, used, and evaluated for bioinformatics and genomics data.

William Hersh is Professor and Chair of the Department of Medical Informatics & Clinical Epidemiology at Oregon Health & Science University (OHSU) in Portland, Oregon, USA. Dr. Hersh has been at OHSU since 1990, where he has developed research and educational programs in biomedical informatics. Dr. Hersh’s main research focus is on the development and evaluation of information retrieval (IR) systems. His current focus is on IR systems for bioinformatics and genomics researchers. He initiated and currently serves as Chair of the Text Retrieval Conference (TREC) Genomics Track.

Text Summarisation
Dragomir Radev (University of Michigan)

With the explosion in the quantity of online text and multimedia information in recent years, demand for text summarisation technology is growing. The goal of automatic text summarisation is to take one or more source documents, extract their information content, and present the most important content in a condensed form, in a manner sensitive to the needs of the user and task. In this tutorial, the principal challenges, approaches, solutions and remaining problems in summarisation will be described. The tutorial will provide an overview of the latest developments in automatic summarisation, including new problem areas, approaches, and evaluation methods.
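A simple baseline for the extractive flavour of this task is to score each sentence by the frequency of its words in the whole document and keep the top scorers. The sketch below (invented for illustration, not the tutorial's method) does exactly that:

```python
import re
from collections import Counter

def summarise(text, n_sentences=1):
    """Frequency-based extractive summarisation: score each sentence by
    the average document frequency of its words and return the top
    scorers in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(s):
        tokens = re.findall(r"\w+", s.lower())
        return sum(freq[t] for t in tokens) / (len(tokens) or 1)
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return [s for s in sentences if s in top]

doc = ("Summarisation condenses documents. "
       "Extractive summarisation selects the most central sentences. "
       "The weather was pleasant yesterday.")
print(summarise(doc))
```

Sentence extraction of this kind is only a starting point; handling redundancy across multiple documents, abstraction, and user- and task-sensitive condensation are among the harder problems the field addresses.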
