CS 430: Information Discovery: Assignments

CS 430
Information Discovery
Fall 2001

Assignments

Submission Instructions (Revised 10/9/01)

Reports should be carefully written and well formatted, using a good word processor. Use spelling and grammar checkers to remove errors.

All assignments are to be submitted online. When you have completed your work, submit is as follows:

There is a shared network folder on \\goose.csuglab.cornell.edu\courses\cs430-fall01\. Students can submit to this folder, but not read it.
To submit your homework (in the CSUGLAB), Map Network Drive to \\goose.csuglab.cornell.edu\courses and use your CSUGLAB account to login. You will notice a cs430-fall01 folder in there. If you do not have a CSUGLAB account, please email cs430@cs.cornell.edu or go to Upson 311.
You will find a folder called with the number of the assignment. Within it there is a folder with your NetID. Create a folder called xxx-1, where xxx is your NetID, and copy your folder into your assignment folder. (For example, if your NetID is abc123, create a folder called abc123-1.) If you want to revise your assignment, submit the entire assignment again in a new folder called xxx-2, etc. We will grade the one with the highest sequence number.

For example

    \\goose.csuglab.cornell.edu\courses\cs430-fall01
        \Assignment 2
            \abc123
                \abc123-1
                    MySecondAssignment.doc
                    MyPerlFile.pl
                \abc123-2
                    MySecondAssignment.doc
                    MyPerlFile.pl
                    AnotherPerlFile.pl

For those not in the CSUGLAB, please see http//www.csuglab.cornell.edu/Info/access-files.html. Note that if you do an Explore-Map Drive from outside of the CSUGLAB, you will need to use, when referring to a Windows Share, \\goose.csuglab.cornell.edu\courses.

For those outside of the CSUGLAB, you may also use ftp to ftp.csuglab.cornell.edu.

Additional CSUGLAB help can be found in http//www.csuglab.cornell.edu.

Assignment 1, Market Research
Due Friday, September 21, at 5:00 p.m.

The objective of this assignment is to explore a number of information retrieval systems and to compare them. Explore the following indexes and catalogs:

Google – a web search engine (http://www.google.com/).
ACM Digital Library – a collection of journal articles and related materials (http://www.acm.org/dl/).
The Library of Congress catalog – a very large bibliographic catalog (http://catalog.loc.gov/).
Inspec using OCLC's First Search system – an indexing and abstracting service (access through the Cornell University Library gateway http://campusgw.library.cornell.edu/).
American Memory at the Library of Congress – a digital library of materials converted from physical artifacts (http://memory.loc.gov/).

Analyze each search service in three ways.

Technical: Describe the services offered from a technical viewpoint. Does the service search full text or surrogates? Are fielded searched offered? What Boolean operators are supported? What regular expressions? How does it handle non-Roman character sets? What is the stop list? How are results ranked? Are they sorted, if so in what order?
User interface: What style of user interface(s) is provided? What training or help services? If there are basic and advanced user interfaces, what does each offer?
Experience: Experiment with a number of searches on each system. Try some long queries and some short ones. Can you find a search that is very slow? Or one that fails? Try searches with simple and complex syntax. How effective is each service? Do you find it easy to use? What do you consider its strengths and its weaknesses?

The assignment has 5 search services, with three questions about each. Please answer each question separately for each service. The amount of information that you can discover about each question varies greatly. You may find that you write several paragraphs to answer one question for one service, or have only brief information. If you have been unable to find much information about a question, list where you searched.

Hints. Some of these questions can be answered by reading information provided at the search site. Others may require detective work. Search the computer science literature to see if you can find out about the underlying search engines. There may be a technical paper that the creators have written or you may find a review article that compares search systems. Some information can be discovered by experiment. Some information is trade secret and you will not be able to answer all of the questions. Remember to cite your sources.

Assignment 2, Building an Index
Due, Friday, October 12, at 5 p.m.

The object of this assignment is to write a suite of programs, in Perl, that take raw text and prepare the entries for an inverted file.

Test data

The text that you will use to test your programs is four source HTML files from the first issue of D-Lib Magazine, in July 1995. The files are:

    http://www.dlib.org/dlib/July95/07weibel.html
    http://www.dlib.org/dlib/July95/07birmingham.html
    http://www.dlib.org/dlib/July95/07arms.html
    http://www.dlib.org/dlib/July95/07clips.html

Programs

1. Lexical analysis and stop list.

Use the stop list in the file stoplist.txt and reject all HTML tags. This program should read the input data and output two files:

(1a) Sorted word list with frequencies.
(1b) Postings file, sorted in alphabetic order

2. Term weighting, using TF.IDF weights.

Use the standard weighting scheme on Slide 11 of Lecture 9. The program should read the files written by the first program and output one file:

(2a) Postings file with term weightings, sorted in alphabetic order.

Report

Write a brief report (not more than one page) that describes the algorithms and data structures that you used for each program.

Submission

You should submit the following:

    Programs 1 and 2
    Files 1a, 1b, 2a using the test data
    Report

10/2/01

Assignment 3, A Fielded Search Engine
Due: Monday, November 5, 2001 at 5:00 p.m.
* Revised Information, October 29, 2001 *

******** HW 3 EXTENSION ********

******** HOMEWORK 3 is now due at 5 p.m. on Friday, November 9 ********

The aim of this assignment is to implement a Boolean information retrieval system for fielded searching.

Input data

The input data that is to be indexed is a file of tagged ASCII text of the following format:

The input data is delimited by a pair of tags: <metadata>, </metadata>.
Each record within the file is delimited by a pair of tags: <record>, </record>.
Each record consists of a sequence of fields, where each field is delimited by a pair of tags: <field name>, </field name>. The field name can be any alphabetic string.

Each record and each field begins on a new line. Fields may extend over more than one line. There may be blank lines.

See the test data file for examples of this format. Your search engine does not need to handle data in any other format, but must be prepared to recognize other field tags for different metadata types.

A file of test data test3.txt has been provided. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag.)

Query

A query is a string of printable ASCII text. It may contain the following:

One or more search terms, consisting of an optional field name enclosed in <>, followed by a search term. If no field name is given, the query is applied to all fields.
The Boolean, and and or operators. If the operator is omitted, the query is assumed to have an implied and. When a query has several operators, and has precedence over or.
Parentheses need not be supported.
Phrases need not be supported.
The truncation symbol, *, at the end of a search term to indicate 0 or more characters. (When a query ends with the * symbol, do not apply stemming to the query.)

Here are some examples of queries, with interpretation:

<creator> moll (The creator field includes the term "moll".)
<title> digit* (The title field includes a term that begins with the letters "digit".)
<creator> moll and <title> digit* (The creator field includes the term "moll" and the title field includes a term that begins with the letters "digit".)
digital or library (Some field(s) contains the term "digital" or the term "library" .)

Task

A. Write two programs in Perl.

The first program should be called build.pl. It reads a test file in this format and builds an inverted index of the search terms. It should use the same stop list as for Assignment 2, stoplist.txt, and stemming using Porter's algorithm. The program should create:

a file called word-list.txt which is a list of the terms in the index, after removal of words from the stop list and stemming. This list should be in alphabetical order with one term per line.
the index file(s) to be used as input to the second program.

The second program should be called search.pl. It allows a user to search the indexes. It should have the following features:

The program is run from a command line interface. The user types in a query and receives a response, which is either an error message or a message , "x records match your query", where x is a count, 0 or greater.
The results found should be written in html format to a file results.html. The results should contain the following:

For a single run of the program, the results of each search should be appended to the file in sequence.
For each query the results should begin with a line containing the query and should end with a horizontal rule (html tag <hr>).
For each hit, the program should print out a catalog record with the search terms highlighted.
If the record contains a URL field, the catalog record should have a clickable link to the actual item.

The interface should allow a search on any of the field names that are tagged in the metadata records. Searching on other field names is an error.

B. Provide a brief report that describes the algorithms that you use (not more than one page).

C. Run your program with about five queries. Your report should list the queries and indicate what function each is testing.

Submission

In your submission folder, the files listed below should be present. No fewer and no extra files should be there. Conforming to the name and path conventions will allow us to run automated tests on the code. If your files are named differently from the ones below, we may deduct points.

The following files need to be in your submission folder (e.g., abc123-1\stoplist.txt):

stoplist.txt
test3.xml
build.pl
search.pl
Your report, which should be named report.txt, report.doc, report.ps or report.pdf.

The following files need to be in a subdirectory named "output" (e.g., abc123-1\output\word-list.txt):

word-list.txt
results.html

We will be running your code using a Perl driver. The output of your programs should be written to the same directory level as the programs (e.g., abc123-1\word-list.txt). Since we do not want to overwrite the output files that you have submitted, please submit the output files from your test runs in the "output" subdirectory.

We will be testing your second program against a set of queries. These queries will be fed to your program one at a time through standard input. You need to ensure that an entire query string is read by your program in a single stdin read and that no other user input is required in order to display the results. After the results are displayed, your program should exit rather than loop.

10/30/01

Assignment 4
Due Friday, December 7 at 5:00 p.m.

*** Optional assignment for extra credit ***

Because of the length of Assignment 3, this is an optional assignment for extra credit. It you do not submit this assignment your overall grade for the course will not suffer. This assignment is offered so that (a) those people who submitted incomplete solutions to Assignment 3 have a second opportunity, and (b) people who have special concern about their grade have a chance to show special commitment (e.g., if you hope to turn an A into an A+, or if your Midterm examination was very poor.)

Task

Your task is to extend the system that you wrote for Assignment 3 to support the near operator. Here are two examples of queries that include a near operator:

digital near4 library (The terms "digital" and "library" appear in the document, in either order, within 4 words of each other.)
<title> digital near5 library (The terms "digital" and "library" appear in a <title> field of the document, in either order, within 5 words of each other.)

Modify your programs build.pl (if necessary) and search.pl from Assignment 3. Write a short report explaining the changes that you have made and describing how you have incorporated the near operator into the query syntax. Test your work with suitable test queries.

The submission instructions are the same as for Assignment 3.

Because this is an optional assignment, absolutely no extensions will be allowed.

11/21/01

[CS 430 Home Page]

William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 21, 2001

Submission Instructions (Revised 10/9/01)

Assignment 1, Market Research Due Friday, September 21, at 5:00 p.m.

Assignment 2, Building an Index Due, Friday, October 12, at 5 p.m.

Test data

Programs

Report

Submission

Assignment 3, A Fielded Search Engine Due: Monday, November 5, 2001 at 5:00 p.m. *** Revised Information, October 29, 2001 ***

Input data

Query

Task

Submission

Assignment 4 Due Friday, December 7 at 5:00 p.m.

Task

Assignment 1, Market Research
Due Friday, September 21, at 5:00 p.m.

Assignment 2, Building an Index
Due, Friday, October 12, at 5 p.m.

Assignment 3, A Fielded Search Engine
Due: Monday, November 5, 2001 at 5:00 p.m.
* Revised Information, October 29, 2001 *

Assignment 4
Due Friday, December 7 at 5:00 p.m.