Winter 2007 CS 510 Information Retrieval on the Internet

This document is stored at www.cs.pdx.edu/~maier/cs510iri

Announcements (Last update 15 March, 2:55pm):

·        Lecture notes 22 online

·        Lecture notes 21 online

·        Notes from Dr. Hersh’s lecture on image retrieval are online

Instructors

David Maier maier at cs dot pdx dot edu, 115-14 FAB.
Susan Price prices at cs dot pdx dot edu, 115-09 FAB.

 Note: Please put ‘cs510’ at the beginning of the subject line.

Phone:

Susan:  503-725-2419

Dave: 503-725-2406

Class Meeting

Tuesday, Thursday  OND 220

Office Hours

Dave: Thursdays 2 – 3 (can stay longer if necessary)

Susan: Mondays 2 – 3 (can stay later if necessary)

You are welcome to ask questions by e-mail or phone.

(Note: Susan checks e-mail much more frequently than phone messages.)

Guest Lecturers

William Hersh (OHSU), Melanie Mitchell (PSU)

Weekly Schedule

[This schedule is preliminary and subject to change]

Assignments due Tuesdays

Each entry lists the week (Wk), date, and instructor initials, then Topic, Reading (will be refined), Slides, and Homework (due Tuesdays at 10 AM).

Wk 1a, Jan 9, SP
Topic: Introduction to Information Retrieval
Reading: Ch. 1: 1.1 – 1.4
Slides: Lecture 1
Homework: Homework 1 assigned

Wk 1b, Jan 11, SP
Topic: Text Processing Overview; Introduction to Lucene
Reading: Ch. 7: 7.2.1 – 7.2.4
Reading (suggested): article by Martin Porter; Lancaster Stemming Algorithm site
Slides: Lecture 2a, Lecture 2b
Homework: Homework 3a (Lucene Project I) assigned (date corrected: due Jan 30)

Wk 2a, Jan 16, SP
Topic: PSU CLOSED DUE TO SNOW

Wk 2b, Jan 18, SP
Topic: Models in IR
Reading: Ch. 2: 2.1 – 2.5, 2.10
Slides: Lecture 3
Homework: Homework 1 due (8 points); Homework 2 & 7 (Experimental Project I & II) assigned

Wk 3a, Jan 23, SP
Topic: Queries
Reading: Ch. 4: 4.1 – 4.3; Chapter 10 of Ultraseek Admin. Guide (pp 233-248)
Reading (recommended): A taxonomy of web search
Slides: Lecture 4
Homework: Homework 2 due (10 points)

Wk 3b, Jan 25, SP and DM
Topic: Introduction to indexing: manual and automated
Reading (recommended): look at Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies
Slides: Lecture 5
Homework: Homework 3b assigned
Topic: Indexing structures
Reading: Ch 8: 8.1 – 8.3, 8.8; Inverted files for text search engines (you can skip Section 13)
Slides: Lecture 6 (rev)

Wk 4a, Jan 30, DM
Topic: Similarity; Query processing
Slides: Lecture 7 (rev), Lecture 7a (rev)
Homework: Homework 4 assigned; Homework 3a due (10 points)

Wk 4b, Feb 1, DM
Topic: Similarity; Query processing, cont.
Homework: Homework 3b due (10 points total for parts a and b of homework 3)

Wk 5a, Feb 6, SP
Topic: Evaluation and Relevance in IR
Reading: Ch. 3: 3.1 – 3.31
Reading (recommended): Common Evaluation Measures (Appendix to TREC 2005 Proceedings); The Philosophy of Information Retrieval Evaluation
Slides: Lecture 8 (rev), Lecture 9
Homework: Homework 4 due (8 points); Homework 5 assigned; Homework 9 (Lucene Project II) assigned

Wk 5b, Feb 8, SP and DM
Topic (SP): Issues in ranking; Query expansion; Relevance Feedback
Reading: Ch 5: 5.1 – 5.4
Slides: Lecture 10
Topic (DM): Classification and clustering
Reading: Ch. 3 of van Rijsbergen; Machine Learning in Automated Text Categorization (Read Sections 1-4 carefully, skim 5 & 6)
Slides: Lecture 11, Lecture 12
Homework: Homework 6 assigned

Wk 6, Feb 13, 15, DM
Topic: Collection building on the Internet; Challenges of Web Content
Reading: Ch 13 up through 13.6; Searching the Web
Slides: Lecture 13, Lecture 13a
Homework: Homework 5 due (8 points)

Wk 7a, Feb 20, SP
Topic: Digital Libraries
Reading: Ch 15
Slides: Lecture 14

Wk 7b, Feb 22, DM
Topic: Making Information Findable
Reading: RSS Tutorial
Slides: Lecture 15, Lecture 15a
Homework: Homework 6 due (8 points)

Wk 8a, Feb 27, SP
Topic: IR-Related Tasks: Segmentation, Summarization, and Information Extraction
Slides: Lecture 16
Homework: Homework 8 assigned; Homework 7 due (20 points)

Wk 8b, Mar 1, DM
Topic: Information Extraction in the Large
Reading (required): Lydia: A System for Large-Scale News Analysis
Reading (recommended): The Google Similarity Distance
Slides: Lecture 17, Lecture 17a

Wk 9a, Mar 6
Topic: Multimedia I: Dr. William Hersh
Reading (required): Advancing Biomedical Image Retrieval
Reading (recommended): A Review of Content-Based Image Retrieval Systems in Medical Applications (Sections 1&3)
Slides: Lecture 18
Homework: Homework 8 due (8 points)

Wk 9b, Mar 8
Topic: Multimedia II: Dr. Melanie Mitchell
Reading (required): Content-Based Image Retrieval at the End of the Early Years
Reading (recommended): Image Retrieval from the World Wide Web
Slides: Lecture 19

Wk 10, Mar 13, 15, DM
Topic: Database-Style Information Extraction; Personal Information Management; Semantic Components
Slides: Lecture 20, Lecture 21, Lecture 22
Homework: Homework 9 due (20 points)

Wk 11, Mar 20, 22
Topic: Final exam week. No final exam.

Class E-mail

The e-mail list for this class is cs510iri@cs.pdx.edu.  It will be used for announcements from the instructors.  You can also send questions and answers to this mailing list.  You can subscribe to the list at https://mailhost.cecs.pdx.edu/mailman/listinfo/cs510iri.

Catalog Description

The Internet has seen the most extensive application of information retrieval (IR) techniques to date. At the same time, the Internet has often stressed traditional IR methods to the breaking point. This course introduces classical IR concepts, but also discusses how they are stressed when applied in the large, distributed and dynamic setting of the Internet, and covers some of the techniques used to get around the limitations. The first half of the course will address standard IR topics: keyword-based retrieval and indexing, classification, clustering and evaluation metrics for different approaches. The second half of the course will cover several different challenges for IR on the Internet, and technologies being developed to address them, such as

 • Collection building: crawling the web, duplicate detection, sampling, digital libraries

 • Providing context: annotation, meta-data

 • Distribution: scalable architectures

 • Heterogeneity: scraping, wrapping, translation

 • Enterprise and organizational issues: standards, interoperation

The course will also examine particular systems for searching and intermediation on the Internet. The main assignments for the course will be a series of projects, done individually and in groups. This course may be used in the Databases track of the CS MS.

Textbooks

REQUIRED:
Modern Information Retrieval. By Ricardo Baeza-Yates and Berthier Ribeiro-Neto, ACM Press/Addison Wesley, 1999, ISBN 0-201-39829-X. 

Reading

Readings will come both from the textbook and from supplementary materials.

Assignments

There will be written assignments and project assignments. The written assignments will generally involve some manual exercise, with a short write-up due (possibly with answers to specific questions).

There will also be two projects. The first will involve formulating a hypothesis about the behavior of a search engine (such as Google), devising a way to test the hypothesis, conducting that test and analyzing your results. The second will involve collecting, indexing and searching web content using the Lucene search-engine library.

 

Students registered for CS 510 (rather than 410) will have an additional section of each assignment to complete.

 

Grading
Assignments: There are 9 assignments, worth a total of 100 points (100% of your grade). Five are written assignments worth a total of 40 points (8 points each).  There are two projects, each consisting of two assignments: the first part of each project is worth 10 points and the second part is worth 20 points. Some assignments are to be done individually; others may be done individually or in groups of 2 or 3.  If you work in a team, turn in one paper with the names of all team members on it.  Make sure your assignments are legible. You may seek help from your partner (if you have one), the instructors, and the class mailing list, but otherwise work independently.  Assignments are due on TUESDAYS at the beginning of class.

 

There will be no exams in this course.

Information

The Google Garden is here.

A second planting is here.

Links planted to help with project 7:

http://www.geocities.com/kimyj0823/

http://cse564.hostfreeweb.info/

trorfy trorfy1 trorfy2 trorfy3

Policies

Students are responsible for anything that transpires during class. If you miss a class, get notes from someone else (not the instructors).

Assignments are due at the beginning of the class period. 

Late homework and projects need prior approval from one of us.  Work turned in late without prior approval automatically loses 50% of its points, or receives no credit if the assignment has already been discussed in class.

Requests for regrading must be submitted in writing within one week of the time the graded assignment was returned.  You must be specific in saying why you feel your answer deserves additional credit. 

Students with disabilities who need academic accommodations should contact us as soon as possible to arrange the support they need.  Students are also encouraged to contact the Disability Resource Center (DRC) at 503-725-4240 or 503-725-4150 for additional information on support services and available accommodations.

Academic Integrity

[Excerpt from the 2004-2005 PSU Catalog, pages 29-30]
The policies of the University governing the rights, freedoms, responsibilities, and conduct of students are set forth in the Statement of Student Rights, Freedoms, and Responsibilities, as supplemented and amended by the Portland State University Student Conduct Code, which has been issued by the President under authority of the Administrative Rules of the Oregon State Board of Higher Education. The code governing academic honesty is part of the Student Conduct Code. Students may consult these documents in the Office of Student Affairs, 433 Smith Memorial Student Union or by visiting the OSA Web site.  Observance of these rules, policies, and procedures helps the University to operate in a climate of free inquiry and expression and  assists it in protecting its academic environment and educational purpose.

Academic honesty: Academic honesty is a cornerstone of any meaningful education and a reflection of each student’s maturity and integrity. The Office of Student Affairs is responsible for working with University faculty to address complaints of academic dishonesty.  The Student Conduct Code, which applies to all students, prohibits all forms of academic cheating, fraud, and dishonesty.  These acts include, but are not limited to, plagiarism, buying and selling of course assignments and research papers, performing academic assignments (including tests and examinations) for other persons, unauthorized disclosure and receipt of academic information, and other practices commonly understood to be academically dishonest.  For a copy of the Student Code of Conduct see the OSA Web site.  Allegations of academic dishonesty may be addressed by the instructor, may be referred to the Office of Student Affairs for action, or both. Allegations referred to the Office of Student Affairs are investigated following the procedures outlined in the Student Conduct Code.  Acts of academic dishonesty may result in one or more of the following sanctions: a failing grade on the exam or assignment for which the dishonesty occurred, disciplinary reprimand, disciplinary probation, loss of privileges, required community service, suspension from the University for a period of up to two years, and/or dismissal from the University.  Questions regarding academic honesty should be directed to the Office of Student Affairs, 433 Smith Memorial Student Union.

Supplementary Readings

Online textbook by C. J. van Rijsbergen: http://www.dcs.gla.ac.uk/Keith/Preface.html

Useful IR Resources

·        Lucene:

o       Overview: http://lucene.apache.org/java/docs/index.html

o       Downloads: http://www.apache.org/dyn/closer.cgi/lucene/java/

o       API: http://lucene.apache.org/java/docs/api/

·        Nutch (open source project for writing web crawlers; sibling project to Lucene)

o       Home page: http://lucene.apache.org/nutch/index.html

o       Download: http://www.apache.org/dyn/closer.cgi/lucene/nutch/

·        Stemming:

o       Article by Martin Porter

o       Lancaster Stemming Algorithm site

·        Probabilistic model

o       A probabilistic model of information retrieval: development and status, by K. Sparck Jones, S. Walker, and S.E. Robertson

·        Controlled vocabularies/Thesauri

o       ANSI/NISO Z39.19-2005 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies

o       Medical Subject Headings (MeSH)

o       Art & Architecture Thesaurus

o       Agrovoc

·        Indexing

o       Inverted files for text search engines. J. Zobel and A. Moffat. ACM Computing Surveys Volume 38,  Issue 2  (2006).

o       The Term Vector Database: Fast access to indexing terms for Web pages. R. Stata, K. Bharat, F. Maghoul. Computer Networks 33(1-6), June 2000.

·        Sources of Advanced Readings

o       Recommended Reading for IR Research Students

o       Readings in Information Retrieval, edited by Karen Sparck Jones and Peter Willett.  Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1997.

·        Evaluation

o       Cumulated Gain-Based Evaluation of IR Techniques.  K. Järvelin and J. Kekäläinen.  ACM Transactions on Information Systems, Vol. 20, No. 4, October 2002, pp 422-446.