Assignment 6

CS410/510 Information Retrieval on the Internet

Due: Thursday, February 22, 2007

8 points

 

Goal: The goal of this assignment is to become familiar with some of the association measures presented in Chapter 3 of van Rijsbergen, and to gain experience with the graph-based clustering methods presented there.

Details: You will compute association measures for the documents listed at the end of this assignment. For the first three questions, the features of each document are its authors. Ignore lexical variations when deciding whether two authors are the same. For example, treat “J. Widom” and “Jennifer Widom” as the same author.
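One possible way to handle the author-matching rule above is sketched here in Python. This is only an illustration under a stated assumption: two names are treated as the same author when they share a last name and a first initial. The helper names author_key and author_set are hypothetical and not part of the assignment, and the rule is deliberately crude (it would merge two different people who share a surname and initial, and it does not handle “Last, First” ordering), so check its results by hand.

```python
def author_key(name):
    """Reduce a name to (last name, first initial), lowercased.

    Assumption: names are written "First Last" or "F. Last".
    """
    parts = name.replace(".", " ").split()
    return (parts[-1].lower(), parts[0][0].lower())


def author_set(names):
    """Build the feature set for one document from its list of author names."""
    return {author_key(n) for n in names}


# The example from the assignment text: both spellings map to one key.
same = author_key("J. Widom") == author_key("Jennifer Widom")
```

With this normalization, each document becomes a set of author keys, and the association measures in Question 1 can be computed with ordinary set operations.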

1. For each of the association measures below, determine the 10 pairs of documents with the highest values. (Note: these measures are defined in the section “Measures of Association” near the beginning of the chapter.)

a. Simple matching coefficient

b. Jaccard’s coefficient

c. Overlap coefficient
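As a rough aid for checking hand calculations, the three measures can be sketched in Python over sets of features. The definitions follow the set-based forms in van Rijsbergen: the simple matching coefficient is the number of shared features, Jaccard's coefficient divides that by the size of the union, and the overlap coefficient divides it by the size of the smaller set. The document identifiers (d1, d2, d3), the single-letter features, and the helper top_pairs are hypothetical placeholders, not the real document set.

```python
from itertools import combinations


def simple_matching(x, y):
    """Number of features the two sets share: |X ∩ Y|."""
    return len(x & y)


def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y|, taken as 0.0 when both sets are empty."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def overlap(x, y):
    """|X ∩ Y| / min(|X|, |Y|), taken as 0.0 when either set is empty."""
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


def top_pairs(docs, measure, k=10):
    """Return the k highest-scoring pairs as (score, doc_a, doc_b) tuples.

    Ties are broken by document identifier, which is an arbitrary choice.
    """
    scored = [(measure(docs[a], docs[b]), a, b)
              for a, b in combinations(sorted(docs), 2)]
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return scored[:k]


# Hypothetical toy data: each document is a set of author features.
docs = {"d1": {"a", "b", "c"}, "d2": {"b", "c"}, "d3": {"c", "d"}}
best = top_pairs(docs, simple_matching, k=2)
```

Because several pairs may tie at the tenth position, a fixed cutoff of 10 depends on how ties are ordered; the sketch breaks them by identifier only to make the output deterministic.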

2. Draw the graph that results among all the documents using the simple matching coefficient and a threshold of 2. (That is, two documents are linked if their simple matching coefficient is 2 or more.)

3. Give the clusters that result if each connected component of the graph in Part 2 is a cluster.
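The procedure in Parts 2 and 3 can be sketched as follows: link two documents when they share at least a threshold number of features, then read off each connected component as a cluster. This is a minimal illustration using a depth-first traversal; the function names and the toy documents d1 through d5 are hypothetical and stand in for the real document set.

```python
from itertools import combinations


def threshold_graph(docs, threshold=2):
    """Adjacency sets: link two documents sharing >= threshold features."""
    graph = {d: set() for d in docs}
    for a, b in combinations(docs, 2):
        if len(docs[a] & docs[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return each connected component as a sorted list of document ids."""
    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical toy data: each document is a set of author features.
docs = {
    "d1": {"a", "b", "c"},
    "d2": {"b", "c"},
    "d3": {"c", "d"},
    "d4": {"c", "d", "e"},
    "d5": {"z"},
}
clusters = connected_components(threshold_graph(docs, threshold=2))
```

In this toy example d1 and d2 share two features, as do d3 and d4, while d5 shares none with any other document, so it forms a cluster by itself. A document with no links is still a connected component of the graph.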

4. Propose an alternative feature space and association measure for these documents. What clusters result under your choice? Note: You are permitted to “eyeball” the similarity of documents under your measure.

Document set: When referring to documents below in your assignment, use their document numbers. For example, the first document below has document number 2007-4.

http://dbpubs.stanford.edu:8090/pub/2007-4

http://dbpubs.stanford.edu:8090/pub/2007-3

http://dbpubs.stanford.edu:8090/pub/2007-2

http://dbpubs.stanford.edu:8090/pub/2006-21

http://dbpubs.stanford.edu:8090/pub/2006-1

http://dbpubs.stanford.edu:8090/pub/2006-3

http://dbpubs.stanford.edu:8090/pub/2006-4

http://dbpubs.stanford.edu:8090/pub/2004-42

http://dbpubs.stanford.edu:8090/pub/2004-24

http://dbpubs.stanford.edu:8090/pub/2004-35

http://dbpubs.stanford.edu:8090/pub/2004-2

http://dbpubs.stanford.edu:8090/pub/2005-30