CS410/510
Information Retrieval on the Internet
Due: Thursday, February 22, 2007
8 points
Goal: The goal of this assignment is to become familiar with some of the
association measures presented in Chapter 3 of van Rijsbergen, and to gain
some experience with the graph-based clustering methods presented there.
Details:
You will compute association measures for the documents listed at the end of
this assignment. For Questions 1 through 3, the features of each document are
its authors. Ignore lexical variations when deciding whether two authors are
the same. For example, treat “J. Widom” and “Jennifer Widom” as the same
author.
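One way to ignore lexical variations is to reduce every author name to a canonical key before comparing. The sketch below is only an illustration: the helper name `author_key` and the rule it applies (first initial plus last name, lowercased) are assumptions, not part of the assignment. The rule handles the “J. Widom” example, but it would also merge two different authors who share an initial and a surname, so check the results by hand.

```python
def author_key(name):
    """Reduce an author name to (first initial, surname), lowercased.

    Hypothetical heuristic: it assumes the surname is the last token
    and that the first token is a given name or an initial.
    """
    parts = name.replace(".", " ").split()
    if not parts:
        return ("", "")
    return (parts[0][0].lower(), parts[-1].lower())


def author_set(names):
    """Build the feature set for one document from its list of author names."""
    return {author_key(n) for n in names}
```

With this rule, `author_key("J. Widom")` and `author_key("Jennifer Widom")` produce the same key, so the two spellings count as one feature.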
1. For each of the association measures below, determine the 10 pairs of
documents with the highest value of that measure. (Note: these measures are
defined in the section “Measures of Association” near the beginning of the
chapter.)
a. Simple matching coefficient
b. Jaccard’s coefficient
c. Overlap coefficient
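The three measures can be sketched on author sets as follows. This is a minimal illustration that assumes the usual set-based definitions: simple matching is |X ∩ Y|, Jaccard is |X ∩ Y| / |X ∪ Y|, and overlap is |X ∩ Y| / min(|X|, |Y|). Confirm these against the definitions in the chapter before relying on them. The function names, the `top_pairs` helper, and the document ids and author labels in `example_docs` are all hypothetical and do not describe the real document set.

```python
from itertools import combinations


def simple_matching(x, y):
    """Number of shared features: |X ∩ Y|."""
    return len(x & y)


def jaccard(x, y):
    """Shared features divided by all distinct features: |X ∩ Y| / |X ∪ Y|."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def overlap(x, y):
    """Shared features divided by the smaller set size."""
    smaller = min(len(x), len(y))
    return len(x & y) / smaller if smaller else 0.0


def top_pairs(docs, measure, k=10):
    """Return the k document pairs with the highest value of `measure`.

    Ties are kept in sorted pair order, so the cut at k is arbitrary
    when several pairs share the same score.
    """
    scored = [
        ((a, b), measure(docs[a], docs[b]))
        for a, b in combinations(sorted(docs), 2)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Made-up data for illustration only.
example_docs = {
    "D1": {"a", "b", "c"},
    "D2": {"b", "c"},
    "D3": {"c", "d"},
}
```

Calling `top_pairs(example_docs, jaccard)` ranks all pairs by Jaccard's coefficient. Because many pairs can tie, especially under simple matching, the tenth place in a ranking may not be unique.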
2. Draw the graph that results among all the
documents using the simple matching coefficient and a threshold of 2. (That is,
two documents are linked if their simple matching coefficient is 2 or more.)
3. Give the clusters that result if each
connected component of the graph in Part 2 is a cluster.
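Questions 2 and 3 can be checked with a short sketch that builds the threshold graph and then collects its connected components. It assumes the simple matching coefficient is the number of shared authors, which is consistent with a threshold of 2. The function names and the document ids and author labels in `example_docs` are hypothetical and chosen only to show the mechanics.

```python
from itertools import combinations


def threshold_graph(docs, threshold=2):
    """Link two documents when they share at least `threshold` features."""
    graph = {doc_id: set() for doc_id in docs}
    for a, b in combinations(docs, 2):
        if len(docs[a] & docs[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Return one set of document ids per connected component."""
    seen = set()
    clusters = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        # Depth-first traversal from `start`.
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Made-up data for illustration only.
example_docs = {
    "D1": {"a", "b", "c"},
    "D2": {"b", "c"},
    "D3": {"b", "c", "d"},
    "D4": {"e"},
}
example_clusters = connected_components(threshold_graph(example_docs))
```

A document that is linked to nothing, such as `D4` in the made-up data, forms a cluster of its own.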
4. Propose an alternate feature space and
association measure for these documents. What clusters result under your
choice? Note: You are permitted to “eyeball” the similarity of
documents under your measure.
Document set: When referring to the documents below in your assignment, use
their document numbers. For example, the first document below has document
number 2007-4.
http://dbpubs.stanford.edu:8090/pub/2007-4
http://dbpubs.stanford.edu:8090/pub/2007-3
http://dbpubs.stanford.edu:8090/pub/2007-2
http://dbpubs.stanford.edu:8090/pub/2006-21
http://dbpubs.stanford.edu:8090/pub/2006-1
http://dbpubs.stanford.edu:8090/pub/2006-3
http://dbpubs.stanford.edu:8090/pub/2006-4
http://dbpubs.stanford.edu:8090/pub/2004-42
http://dbpubs.stanford.edu:8090/pub/2004-24
http://dbpubs.stanford.edu:8090/pub/2004-35
http://dbpubs.stanford.edu:8090/pub/2004-2
http://dbpubs.stanford.edu:8090/pub/2005-30