|
Instructor |
David Maier, maier at cs dot pdx dot edu, 115-14 FAB. Note: Please put ‘cs510’ at the beginning of the subject line. |
|
Phone: |
Susan: 503-725-2419; Dave: 503-725-2406 |
|
Class Meeting |
Tuesday, Thursday, OND 220 |
|
Office Hours |
Dave: Thursdays 2 – 3 (can stay longer if necessary)
Susan: Mondays 2 – 3 (can stay later if necessary)
You are welcome to ask questions by e-mail or phone. (Note: Susan checks e-mail much more frequently than phone messages.) |
|
Guest Lecturers |
William Hersh (OHSU), Melanie Mitchell (PSU) |
|
Wk |
Date |
Topic |
|
Slides |
Homework Due Tuesdays (at 10 AM) |
|
1a |
Jan 9 SP |
Introduction to Information Retrieval |
|
Homework 1 |
|
|
1b |
Jan 11 SP |
Text Processing Overview; Introduction to Lucene |
Article by Martin Porter |
Homework 3a (Lucene Project I) assigned (corrected due date: Jan 30) |
|
|
2a |
Jan 16 SP |
PSU CLOSED DUE TO SNOW |
|
|
PSU CLOSED DUE TO SNOW |
|
2b |
Jan 18 SP |
Models in IR |
|
|
Homework 1; Homework 2 & 7 (Experimental Project I & II) assigned |
|
3a |
Jan 23 SP |
Queries |
Chapter 10 of Ultraseek Admin. Guide (pp 233-248) Recommended: |
|
Homework 2 |
|
3b |
Jan 25 SP DM |
Introduction to indexing: manual and automated |
Recommended: look at Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies |
||
|
Indexing structures |
Ch 8: 8.1 – 8.3, 8.8; Inverted files for text search engines (you can skip Section 13) |
|
|
||
|
4a |
Jan 30 DM |
Similarity; Query processing |
|
Lecture 7 (rev) |
Homework 4 assigned; Homework 3a due (10 points) |
|
4b |
Feb 1 DM |
Similarity; Query processing, cont. |
|
|
Homework 3b due (10 points total for parts a and b of homework 3) |
|
5a |
Feb 6 SP |
Evaluation and Relevance in IR |
Ch. 3: 3.1 – 3.31; Recommended: Common Evaluation Measures (Appendix to TREC 2005 Proceedings) |
Lecture 8 (rev) |
Homework 4; Homework 5 assigned; Homework 9 (Lucene Project II) assigned |
|
5b |
Feb 8 SP |
Issues in ranking; Query expansion; Relevance Feedback |
Ch 5: 5.1 – 5.4 |
|
|
|
DM |
Classification and clustering |
Ch. 3 of van Rijsbergen; Machine Learning in Automated Text Categorization (read Sections 1-4 carefully, skim 5 & 6) |
Homework 6 assigned |
||
|
6 |
Feb 13, 15 DM |
Collection building on the Internet; Challenges of Web Content |
Ch 13 up through 13.6 |
Homework 5 |
|
|
7a |
Feb 20 SP |
Digital Libraries |
Ch 15 |
|
|
|
7b |
Feb 22 DM |
Making Information Findable |
Homework 6 |
||
|
8a |
Feb 27 SP |
IR-Related Tasks: Segmentation, Summarization, and Information Extraction |
|
Homework 8 assigned; Homework 7 |
|
|
8b |
Mar 1 DM |
Information Extraction in the Large |
Required: Lydia: A System for Large-Scale News Analysis |
|
|
|
9a |
Mar 6 |
Multimedia I: |
Required: Advancing Biomedical Image Retrieval; Recommended: A Review of Content-Based Image Retrieval Systems in Medical Applications (Sections 1 & 3) |
Homework 8 due (8 points) |
|
|
9b |
Mar 8 |
Multimedia II: Dr. Melanie Mitchell |
Required: Content-Based Image Retrieval at the End of the Early Years Recommended: |
|
|
|
10 |
Mar 13, 15 DM |
Database-Style Information Extraction; Personal Information Management; Semantic Components |
|
Homework 9 |
|
|
11 |
Mar 20, 22 |
Final exam week: no final exam |
|
|
|
The e-mail list for this class will be cs510iri@cs.pdx.edu. It will be used for announcements from the instructors. You can also send questions and answers to this list. You can subscribe at https://mailhost.cecs.pdx.edu/mailman/listinfo/cs510iri.
The Internet has seen the most extensive application of information retrieval (IR) techniques to date. At the same time, the Internet has often stressed traditional IR methods to the breaking point. This course introduces classical IR concepts, but also discusses how they are stressed when applied in the large, distributed, and dynamic setting of the Internet, and covers some of the techniques used to get around the limitations. The first half of the course will address standard IR topics: keyword-based retrieval and indexing, classification, clustering, and evaluation metrics for different approaches. The second half of the course will cover several different challenges for IR on the Internet, and technologies being developed to address them, such as:
• Collection building: crawling the web, duplicate detection, sampling, digital libraries
• Providing context: annotation, metadata
• Distribution: scalable architectures
• Heterogeneity: scraping, wrapping, translation
The course will also examine particular systems for searching and intermediation on the Internet. The main assignments for the course will be a series of projects, done individually and in groups. This course may be used in the Databases track of the CS MS.
REQUIRED: Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto. ACM Press/Addison Wesley, 1999. ISBN 0-201-39829-X.
There will be written assignments and project assignments. The written assignments will generally involve a manual exercise followed by a short write-up, possibly including answers to specific questions.
There will also be two projects. The first will involve formulating a hypothesis about the behavior of a search engine (such as Google), devising a way to test the hypothesis, conducting that test, and analyzing your results. The second will involve collecting, indexing, and searching web content using the Lucene search-engine library.
Students registered for CS 510 (rather than CS 410) will have an additional section of each assignment to complete.