Tutorials
Morning Tutorials (08:30-12:15)
XML Information Retrieval
Ricardo Baeza-Yates (University of Chile) & Norbert Fuhr (University
of Duisburg-Essen)
XML Standards - In this part, the major concepts
of the different standards will be introduced, and their relevance for IR
will be discussed, as well as available XML software tools.
Structured text models, algorithms and data structures - This part
covers structured text
models and their relation to XML. Structured text models predate XML and
were based on SGML databases. We compare their expressivity and efficiency
and show how some of those models can be used to implement XML query
languages.
We continue with specific indices for XML as well as the associated
searching algorithms. The solutions come from different viewpoints:
databases, where in most cases XML is mapped to relational databases; and
IR, where the index structure is combined with a full-text inverted index.
The content of this part is the basis for developing new XML IR approaches.
We also mention other XML retrieval problems, in particular searching XML streams.
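One of the IR-side index structures described above can be sketched in a few lines: a full-text inverted index whose postings also record the element path of each occurrence, so that queries can be restricted by XML structure. This is an illustrative toy, not code from the tutorial; the input shape and function names are assumptions.

```python
# Toy structure-aware inverted index: postings carry (doc_id, element_path),
# so a term query can be filtered by a structural path prefix.
from collections import defaultdict

def index_docs(docs):
    """docs: {doc_id: [(element_path, text), ...]} -- hypothetical input shape."""
    index = defaultdict(list)
    for doc_id, elements in docs.items():
        for path, text in elements:
            for term in text.lower().split():
                index[term].append((doc_id, path))
    return index

def search(index, term, path_prefix=""):
    """Return postings for `term` whose element path starts with `path_prefix`."""
    return [(d, p) for d, p in index.get(term.lower(), [])
            if p.startswith(path_prefix)]

docs = {
    "d1": [("/article/title", "XML retrieval models"),
           ("/article/body/sec", "inverted index structures")],
    "d2": [("/article/body/sec", "XML query languages")],
}
idx = index_docs(docs)
print(search(idx, "XML", "/article/title"))   # only the title occurrence
```

Real systems replace the path strings with numbering schemes that support ancestor/descendant tests efficiently, but the combination of term postings with structural positions is the common idea.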
Approaches for XML IR - In
principle, all standard IR tasks can be performed with XML as well. So far,
research has mainly focused on ad-hoc retrieval and clustering of XML
documents. Due to the properties of XML, the standard methods have to be extended
and/or modified in order to deal with XML structure as well as with
different types of content occurring in an XML document. In this part, we
will first outline the different types of tasks that are considered in XML
IR and then present examples for the different approaches.
Machine Learning For Text Classification Applications
David Lewis
A wide variety of
information access and data mining tasks can be viewed as text
classification problems. This perspective allows machine learning
techniques to be used to reduce manual effort. Attendees of this tutorial
will learn what machine learning can and can't do, how to choose learning
techniques and software, and processes and techniques for improving their
effectiveness. Examples will be drawn from areas such as knowledge
management, customer service, web directories, alerting and news services,
filtering, bioinformatics, information security, and survey research. I
will end by discussing areas where research progress could greatly aid text
classification in operational settings.
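The framing of information-access tasks as text classification can be made concrete with a toy example (an illustration, not Lewis's own methods): a multinomial Naive Bayes classifier, one of the standard learning techniques the tutorial surveys, trained on a handful of labelled documents with add-one smoothing.

```python
# Toy multinomial Naive Bayes text classifier with Laplace smoothing.
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> term counts
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for w in text.lower().split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def classify(model, text):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best, best_lp = None, float("-inf")
    for label in label_counts:
        lp = math.log(label_counts[label] / total)    # class prior
        n = sum(word_counts[label].values())
        for w in text.lower().split():
            # add-one (Laplace) smoothing over the vocabulary
            lp += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("stock market falls", "finance"),
               ("bank raises rates", "finance"),
               ("team wins final match", "sport"),
               ("player scores goal", "sport")])
print(classify(model, "market rates fall"))   # -> finance
```

Production systems differ mainly in feature engineering, the choice of learner, and the evaluation process, which is where the manual effort the tutorial discusses is spent.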
David D. Lewis is a
consultant on information retrieval, data mining, and natural language
processing, as well as co-founder of Ornarose, Inc. He works with
organizations of all sizes on the design, implementation, acquisition, and
operation of systems for manipulating and mining text data. Lewis has
published more than forty scientific papers and holds six patents on
information retrieval and text mining technology. He helped organize
several U.S. government evaluations of language processing technologies,
and created several widely used test collections. Prior to setting up a
consulting practice, he held positions at AT&T Labs, Bell Labs, and the
University of Chicago.
Multilingual Information Retrieval
Fred Gey (University of California at Berkeley)
The growth of the Internet
and the World Wide Web has made available vast written and spoken resources
on a global scale from almost all countries in the world. The languages represented
on the web are a reflection of this diversity of resources and, to the
serious searcher, documents in languages other than English may provide
unique news, cultural insight and altogether different perspectives on our
electronic world. Moreover, most of the world's peoples speak a native
tongue other than English. This fact will increasingly be felt on the
Internet. During the past decade rapid progress has been made in developing
techniques for Multilingual Information Retrieval. Use of electronic
bilingual dictionaries and machine translation software has been augmented
by lexicons assembled from aligned bilingual parallel corpora of translated
documents, techniques for query expansion, phrase recognition and
translation disambiguation. On the other hand, most of these resources have
been developed and applied to the major European and Asian languages. This
half-day tutorial will cover aspects of Multilingual Information Retrieval
such as cross language search and retrieval, machine translation and
statistical machine translation, multilingual search of the WWW and
electronic digital library catalogues, evaluation strategies, evaluation
campaigns and test collections for cross-language search effectiveness in
the United States (TREC), Japan (NTCIR) and Europe (CLEF).
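The dictionary-based approach mentioned above can be sketched very simply: each source-language query term is replaced by its target-language translations from a bilingual lexicon, and the expanded query is run against the target collection. The lexicon here is a made-up toy; real systems add the disambiguation, phrase translation, and corpus-derived lexicons the tutorial covers.

```python
# Minimal dictionary-based cross-language query translation.
DE_EN = {  # toy German->English lexicon (illustrative only)
    "bank": ["bank", "bench"],
    "geld": ["money"],
    "fluss": ["river"],
}

def translate_query(terms, lexicon):
    """Expand each source-language term to all of its dictionary translations."""
    translated = []
    for t in terms:
        # Untranslatable terms (names, numbers) are passed through unchanged.
        translated.extend(lexicon.get(t.lower(), [t]))
    return translated

print(translate_query(["Geld", "Bank"], DE_EN))  # ['money', 'bank', 'bench']
```

The ambiguous entry ("Bank" -> bank/bench) shows why translation disambiguation matters: without it, every sense of every term is added to the query.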
User Interfaces and Visualisation for Information Access
Marti Hearst (University of California at Berkeley)
This tutorial discusses user
interfaces for information retrieval systems.
A search user interface
should aid in the understanding and expression of users' information needs.
It should also help users formulate their queries, select among available
information sources, understand search results, and keep track of the
progress of their search. Designing user interfaces for search systems is
more difficult than for most other kinds of application, because of the
huge variety of content that appears in queries and the collection.
The tutorial will center on
how to design good search user interfaces, and what is currently known to
work well. It will include a brief introduction to User-Centered Design
from Human-Computer Interaction (HCI), which is key for developing usable
interfaces, and a discussion of the information seeking process.
Tutorial topics will include
Web search interface solutions, interfaces for query specification, query
modification, viewing retrieval results, question answering, and viewing
document collections. Where possible, ideas will be illustrated with
examples from current live interfaces. Discussion will also cover
techniques that, although of great interest, have so far not shown strong
usability results in practice for search interfaces. These techniques
include automatic clustering and information visualization.
Dr. Marti Hearst is an
Associate Professor in the School of Information Management and Systems at
the University of California Berkeley. She obtained a PhD in Computer
Science at UC Berkeley and was a member of the research staff at Xerox PARC
for three years.
Dr. Hearst's research
interests include user interfaces, visualization, and robust language
analysis for information access systems. She wrote the chapter on User
Interfaces and Visualization for the textbook "Modern Information
Retrieval" (Baeza-Yates & Ribeiro-Neto, Eds.). Her research projects
include the Flamenco project for incorporating hierarchical faceted
categories into site search interfaces, the TileBars and Cat-a-Cone
visualizations for search interfaces, applying Scatter/Gather clustering
interactively to retrieval results, the TextTiling discourse segmentation
algorithm, and the Cha-Cha web intranet search interface.
Dr. Hearst was program
co-chair of SIGIR'99 and is a member of the editorial board of ACM
Transactions on Information Systems and ACM Transactions on Computer-Human
Interaction.
Afternoon Tutorials (13:15-17:00)
Internet Search
Marti Hearst
(University of California at Berkeley) & Knut Magne Risvik
(Yahoo!)
Search has become a backbone
service for the internet. Over 250M queries per day are issued worldwide,
each resolved against indices of billions of documents, with highly
relevant results returned in a few milliseconds. A number of technologies
make this possible, ranging from distributed systems theory and
computational linguistics to data mining and information retrieval. This
tutorial will
survey these technologies, including large-scale WWW crawling strategies,
Web page content analysis, result set ranking algorithms, and evaluation
methodologies. The tutorial will assume a familiarity with IR concepts,
such as vector-space matching, but will otherwise develop from the
ground-up all that is required to gain an operational understanding of
Internet Search Engine fundamentals. The tutorial will also touch on active
areas of research for Third Generation Internet Search Engines.
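Among the ranking ingredients such a tutorial typically covers, link-based scores like PageRank are the best known. A toy power-iteration sketch (illustrative only, not any engine's actual algorithm) shows the idea on a three-page web:

```python
# Toy PageRank by power iteration over an adjacency dict.
def pagerank(links, d=0.85, iters=50):
    """links: {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in pages}   # teleportation mass
        for p, outs in links.items():
            if not outs:                        # dangling page: spread evenly
                for q in pages:
                    new[q] += d * rank[p] / n
            else:
                for q in outs:
                    new[q] += d * rank[p] / len(outs)
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))   # "c" collects the most link mass
```

Page "c" scores highest because it receives links from both "a" and "b"; real engines combine such link evidence with the content analysis and query-dependent ranking the tutorial surveys.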
High Performance Indexing and Query Evaluation For Information Retrieval
Justin Zobel (RMIT University)
Text search is a key
technology of modern computing. Search engines that index the web provide a
breadth and ease of access to information that was inconceivable a decade
ago. Text search has also grown in importance at the other end of the size
spectrum; the help services built into operating systems, for example, rely
on it. Since 1990, a range of new query evaluation algorithms and index
representations have appeared -- including those developed by the
presenter of this tutorial -- that allow information retrieval queries to
be efficiently resolved on document collections containing terabytes of
text.
The challenges presented by
text search have led to development of a wide range of algorithms and data
structures. These include compact representations for text indexes,
innovative index construction techniques, and efficient algorithms for
evaluation of text queries. Systems with indexes based on these techniques
can resolve queries with a small fraction of the resources required by
traditional approaches to text indexing, and they allow the rapid response
expected of web search engines.
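One widely used compact index representation in this line of work is to store each posting list as d-gaps (differences between successive document numbers) and compress each gap with variable-byte coding. The sketch below is illustrative, not the presenter's implementation:

```python
# Variable-byte compression of an inverted list's d-gaps.
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)    # 7 payload bits per byte
            n >>= 7
            if n == 0:
                break
        chunk[0] |= 0x80              # flag the final (low-order) byte
        out.extend(reversed(chunk))
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for b in data:
        n = (n << 7) | (b & 0x7F)
        if b & 0x80:                  # final byte of this number
            numbers.append(n)
            n = 0
    return numbers

postings = [3, 7, 21, 150, 152]       # ascending document numbers
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
encoded = vbyte_encode(gaps)          # gaps [3, 4, 14, 129, 2] fit in 6 bytes
assert vbyte_decode(encoded) == gaps
```

Because gaps are small where a term is frequent, most postings compress to a single byte, which is one reason such indexes need only a fraction of the space of uncompressed ones.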
The tutorial concerns the
practical problems of indexing, querying, storing, and updating large text
databases, including those that are the result of web-based information
harvesting. While some of these developments are well known and have been
consolidated in textbooks, many specific innovations are not widely known,
and the textbook descriptions are incomplete. This tutorial introduces the
key techniques in the area, describing both a core implementation and how
that core can be enhanced through a range of extensions and innovations.
The main elements of these innovations include fast index construction
techniques, novel index representations, and efficient query evaluation
strategies. The tutorial consolidates recent innovations in search engine
implementation, and provides attendees with a basis for further research in
the area.
Professor Justin Zobel works
in the School of Computer Science and Information Technology at RMIT
University, where he leads the RMIT Search Engines research group. He is
best-known in the research community for his work on efficient indexing
techniques, and has published widely in the areas of information retrieval,
text databases, algorithms and data structures, genomics, and compression.
His most recent text is the second edition of "Writing for Computer
Science". He is active in the information retrieval community; he is
currently Treasurer of SIGIR, an editor of the Kluwer journal Information
Retrieval, and an associate editor of ACM Transactions on Information Systems.
Bioinformatics and Genomics For IR
Bill Hersh (Oregon Health & Science University)
The goal of this tutorial is
to present an overview of bioinformatics and genomics for information
retrieval researchers and developers. By the end of the tutorial, the
attendee should have a better understanding of basic molecular biology as
well as where to find resources to learn more. He or she will also obtain
an overview of the work done in the field of bioinformatics.
The tutorial begins with
definitions of terms used in bioinformatics. It then provides an overview
of basic molecular biology, covering cell structure and function, basic
genetics (DNA structure and function, genes, chromosomes, protein
production, and the “central dogma” of biology), and the Human Genome
Project and its accomplishments.
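The "central dogma" mentioned above (DNA is transcribed to RNA, which is translated to protein) can itself be illustrated in a few lines of code; the codon table here is a tiny illustrative subset of the standard genetic code, not a usable bioinformatics tool.

```python
# Central dogma in miniature: DNA -> mRNA -> protein.
CODON_TABLE = {  # partial standard genetic code
    "AUG": "M", "UUU": "F", "GGC": "G", "UAA": "*",  # * = stop
}

def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(mrna):
    protein = []
    for i in range(0, len(mrna) - 2, 3):          # read codons (triplets)
        aa = CODON_TABLE.get(mrna[i:i + 3], "?")
        if aa == "*":                             # stop codon ends translation
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGTTTGGCTAA")
print(mrna, translate(mrna))   # AUGUUUGGCUAA MFG
```

The sequence databases the tutorial covers store exactly this kind of data at genome scale, which is what makes the field so data-intensive.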
This is followed by a
description of the genome technologies that have promoted the most
scientific advancement, including DNA microarrays. The data-intensive
nature of these technologies and how they have altered biological science
are discussed.
With this foundation of
biological science, attention is then turned to the myriad of genomic
databases that have become available, most of which are freely available.
Specific resources covered include:
- The genomics resources produced by the
National Center for Biotechnology Information (NCBI) of the National
Library of Medicine (NLM)
- Sequence databases and their significance
- Textual databases and their linkages to
other databases
- The PubMed/Entrez interfaces to all of this
data and how they are used by bioinformatics and genomics researchers
for scientific discovery
The tutorial then
provides a more detailed discussion of information retrieval in this domain
and where it is likely to play a significant role. It concludes by
summarizing all that has been taught and leaves time for general discussion
of how information retrieval systems will be most effectively developed,
used, and evaluated for bioinformatics and genomics data.
William Hersh is Professor
and Chair of the Department of Medical Informatics & Clinical
Epidemiology at Oregon Health & Science University (OHSU) in Portland,
Oregon, USA. Dr. Hersh has been at OHSU since 1990, where he has developed
research and educational programs in biomedical informatics. Dr. Hersh’s
main research focus is on the development and evaluation of information
retrieval (IR) systems. His current focus is on IR systems for
bioinformatics and genomics researchers. He initiated and currently serves
as Chair of the Text Retrieval Conference (TREC) Genomics Track.
Text Summarisation
Dragomir Radev (University of Michigan)
With the explosion in the
quantity of online text and multimedia information in recent years, demand
for text summarisation technology is growing. The goal of automatic text
summarisation is to take a source document or documents, extract their
information content, and present the most important content in a condensed
form, in a manner sensitive to the needs of the user and the task. In
this tutorial, the principal challenges, approaches, solutions and
remaining problems in summarisation will be described. The tutorial will
provide an overview of the latest developments in automatic summarisation,
including new problem areas, approaches, and evaluation methods.
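Extractive summarisation, the most common family of approaches, can be shown in miniature (an illustration, not Radev's systems): score each sentence by the document-wide frequency of its content words and keep the highest-scoring sentences.

```python
# Toy extractive summariser: frequency-based sentence scoring.
from collections import Counter

STOP = {"the", "a", "of", "is", "in", "and", "to"}

def summarise(text, n_sentences=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    words = [w for s in sentences
             for w in s.lower().split() if w not in STOP]
    freq = Counter(words)
    def score(s):
        return sum(freq[w] for w in s.lower().split() if w not in STOP)
    ranked = sorted(sentences, key=score, reverse=True)
    return ". ".join(ranked[:n_sentences]) + "."

doc = ("Summarisation systems condense documents. "
       "Systems select sentences. "
       "The weather was pleasant.")
print(summarise(doc))   # -> Summarisation systems condense documents.
```

The off-topic sentence scores lowest because its words occur nowhere else in the document; the remaining challenges the tutorial describes (redundancy, coherence, multi-document input, evaluation) all begin where this toy ends.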