CS 430
Information Discovery
Fall 2001

Assignments


Submission Instructions (Revised 10/9/01)

Reports should be carefully written and well formatted, using a good word processor.  Use spelling and grammar checkers to remove errors.  

All assignments are to be submitted online. When you have completed your work, submit is as follows: 

For example

    \\goose.csuglab.cornell.edu\courses\cs430-fall01
        \Assignment 2
            \abc123
                \abc123-1
                    MySecondAssignment.doc
                    MyPerlFile.pl
                \abc123-2
                    MySecondAssignment.doc
                    MyPerlFile.pl
                    AnotherPerlFile.pl

For those not in the CSUGLAB, please see http//www.csuglab.cornell.edu/Info/access-files.html.  Note that if you do an Explore-Map Drive from outside of the CSUGLAB, you will need to use, when referring to a Windows Share, \\goose.csuglab.cornell.edu\courses.

For those outside of the CSUGLAB, you may also use ftp to ftp.csuglab.cornell.edu.

Additional CSUGLAB help can be found in http//www.csuglab.cornell.edu.

 

Assignment 1, Market Research
Due Friday, September 21, at 5:00 p.m.

The objective of this assignment is to explore a number of information retrieval systems and to compare them. Explore the following indexes and catalogs:

  1. Google – a web search engine (http://www.google.com/).

  2. ACM Digital Library – a collection of journal articles and related materials (http://www.acm.org/dl/).

  3. The Library of Congress catalog – a very large bibliographic catalog (http://catalog.loc.gov/).

  4. Inspec using OCLC's First Search system – an indexing and abstracting service (access through the Cornell University Library gateway http://campusgw.library.cornell.edu/).

  5. American Memory at the Library of Congress – a digital library of materials converted from physical artifacts (http://memory.loc.gov/).

Analyze each search service in three ways.

Technical
Describe the services offered from a technical viewpoint. Does the service search full text or surrogates? Are fielded searched offered? What Boolean operators are supported? What regular expressions? How does it handle non-Roman character sets? What is the stop list? How are results ranked? Are they sorted, if so in what order?
User interface
What style of user interface(s) is provided? What training or help services? If there are basic and advanced user interfaces, what does each offer?
Experience
Experiment with a number of searches on each system. Try some long queries and some short ones. Can you find a search that is very slow?  Or one that fails?  Try searches with simple and complex syntax. How effective is each service? Do you find it easy to use? What do you consider its strengths and its weaknesses?

The assignment has 5 search services, with three questions about each. Please answer each question separately for each service. The amount of information that you can discover about each question varies greatly. You may find that you write several paragraphs to answer one question for one service, or have only brief information. If you have been unable to find much information about a question, list where you searched.

Hints.  Some of these questions can be answered by reading information provided at the search site.  Others may require detective work.  Search the computer science literature to see if you can find out about the underlying search engines.  There may be a technical paper that the creators have written or you may find a review article that compares search systems.  Some information can be discovered by experiment.  Some information is trade secret and you will not be able to answer all of the questions.  Remember to cite your sources.

 

Assignment 2, Building an Index
Due, Friday, October 12, at 5 p.m.

The object of this assignment is to write a suite of programs, in Perl, that take raw text and prepare the entries for an inverted file.

Test data

The text that you will use to test your programs is four source HTML files from the first issue of D-Lib Magazine, in July 1995. The files are:

    http://www.dlib.org/dlib/July95/07weibel.html
    http://www.dlib.org/dlib/July95/07birmingham.html
    http://www.dlib.org/dlib/July95/07arms.html
    http://www.dlib.org/dlib/July95/07clips.html

Programs

1. Lexical analysis and stop list.

Use the stop list in the file stoplist.txt and reject all HTML tags.  This program should read the input data and output two files:

    (1a) Sorted word list with frequencies.
    (1b) Postings file, sorted in alphabetic order

2. Term weighting, using TF.IDF weights.

Use the standard weighting scheme on Slide 11 of Lecture 9.  The program should read the files written by the first program and output one file:

    (2a) Postings file with term weightings, sorted in alphabetic order.

Report

Write a brief report (not more than one page) that describes the algorithms and data structures that you used for each program.

Submission

You should submit the following:

    Programs 1 and 2
    Files 1a, 1b, 2a using the test data
    Report

10/2/01


Assignment 3, A Fielded Search Engine 
Due:   Monday, November 5, 2001 at 5:00 p.m.
*** Revised Information, October 29, 2001 ***

******** HW 3 EXTENSION ********

******** HOMEWORK 3 is now due at 5 p.m. on Friday, November 9 ********

The aim of this assignment is to implement a Boolean information retrieval system for fielded searching.

Input data

The input data that is to be indexed is a file of tagged ASCII text of the following format:

Each record and each field begins on a new line. Fields may extend over more than one line. There may be blank lines.

See the test data file for examples of this format. Your search engine does not need to handle data in any other format, but must be prepared to recognize other field tags for different metadata types.

A file of test data test3.txt has been provided.  This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file.  Your program can ignore the data before the first <metadata> tag.)

Query

A query is a string of printable ASCII text.  It may contain the following:

Here are some examples of queries, with interpretation:

Task

A.  Write two programs in Perl.

  1. The first program should be called build.pl.  It reads a test file in this format and builds an inverted index of the search terms. It should use the same stop list as for Assignment 2, stoplist.txt, and stemming using Porter's algorithm.  The program should create:
  2. The second program should be called search.pl.  It allows a user to search the indexes. It should have the following features:

B.  Provide a brief report that describes the algorithms that you use (not more than one page).

C.  Run your program with about five queries.  Your report should list the queries and indicate what function each is testing.

Submission

In your submission folder, the files listed below should be present. No fewer and no extra files should be there. Conforming to the name and path conventions will allow us to run automated tests on the code.  If your files are named differently from the ones below, we may deduct points.  

The following files need to be in your submission folder (e.g., abc123-1\stoplist.txt):

  1. stoplist.txt
  2. test3.xml
  3. build.pl 
  4. search.pl
  5. Your report, which should be named report.txt, report.doc, report.ps or report.pdf. 

The following files need to be in a subdirectory named "output" (e.g., abc123-1\output\word-list.txt):

  1. word-list.txt
  2. results.html

We will be running your code using a Perl driver.  The output of your programs should be written to the same directory level as the programs (e.g., abc123-1\word-list.txt).  Since we do not want to overwrite the output files that you have submitted, please submit the output files from your test runs in the "output" subdirectory. 

We will be testing your second program against a set of queries. These queries will be fed to your program one at a time through standard input. You need to ensure that an entire query string is read by your program in a single stdin read and that no other user input is required in order to display the results. After the results are displayed, your program should exit rather than loop.

10/30/01

Assignment 4
Due Friday, December 7 at 5:00 p.m.

*** Optional assignment for extra credit ***

Because of the length of Assignment 3, this is an optional assignment for extra credit.  It you do not submit this assignment your overall grade for the course will not suffer.  This assignment is offered so that (a) those people who submitted incomplete solutions to Assignment 3 have a second opportunity, and (b) people who have special concern about their grade have a chance to show special commitment (e.g., if you hope to turn an A into an A+, or if your Midterm examination was very poor.)

Task

Your task is to extend the system that you wrote for Assignment 3 to support the near operator.  Here are two examples of  queries that include a near operator:

Modify your programs build.pl (if necessary) and search.pl from Assignment 3.  Write a short report explaining the changes that you have made and describing how you have incorporated the near operator into the query syntax.  Test your work with suitable test queries.

The submission instructions are the same as for Assignment 3.

Because this is an optional assignment, absolutely no extensions will be allowed.

11/21/01


[CS 430 Home Page]

William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 21, 2001