![]() |
CS
430 Assignments |
Reports should be carefully written and well formatted, using a good word processor. Use spelling and grammar checkers to remove errors.
All assignments are to be submitted online. When you have completed your work, submit is as follows:
For example
\\goose.csuglab.cornell.edu\courses\cs430-fall01
\Assignment 2
\abc123
\abc123-1
MySecondAssignment.doc
MyPerlFile.pl
\abc123-2
MySecondAssignment.doc
MyPerlFile.pl
AnotherPerlFile.pl
For those not in the CSUGLAB, please see http//www.csuglab.cornell.edu/Info/access-files.html. Note that if you do an Explore-Map Drive from outside of the CSUGLAB, you will need to use, when referring to a Windows Share, \\goose.csuglab.cornell.edu\courses.
For those outside of the CSUGLAB, you may also use ftp to ftp.csuglab.cornell.edu.
Additional CSUGLAB help can be found in http//www.csuglab.cornell.edu.
The objective of this assignment is to explore a number of
information retrieval systems and to compare them. Explore the following indexes and catalogs:
Google – a web search engine (http://www.google.com/).
ACM Digital Library – a collection of journal articles and related materials (http://www.acm.org/dl/).
The Library of Congress catalog – a very large bibliographic catalog (http://catalog.loc.gov/).
Inspec using OCLC's First Search system – an indexing and abstracting service (access through the Cornell University Library gateway http://campusgw.library.cornell.edu/).
American Memory at the Library of Congress – a digital library of materials converted from physical artifacts (http://memory.loc.gov/).
Analyze each search service in three ways.
The assignment has 5 search services, with three questions about each. Please answer each question separately for each service. The amount of information that you can discover about each question varies greatly. You may find that you write several paragraphs to answer one question for one service, or have only brief information. If you have been unable to find much information about a question, list where you searched.
Hints. Some of these questions can be answered by reading information provided at the search site. Others may require detective work. Search the computer science literature to see if you can find out about the underlying search engines. There may be a technical paper that the creators have written or you may find a review article that compares search systems. Some information can be discovered by experiment. Some information is trade secret and you will not be able to answer all of the questions. Remember to cite your sources.
The object of this assignment is to write a suite of programs, in Perl, that take raw text and prepare the entries for an inverted file.
The text that you will use to test your programs is four source HTML files from the first issue of D-Lib Magazine, in July 1995. The files are:
http://www.dlib.org/dlib/July95/07weibel.html
http://www.dlib.org/dlib/July95/07birmingham.html
http://www.dlib.org/dlib/July95/07arms.html
http://www.dlib.org/dlib/July95/07clips.html
1. Lexical analysis and stop list.
Use the stop list in the file stoplist.txt and reject all HTML tags. This program should read the input data and output two files:
(1a) Sorted word list with frequencies.
(1b) Postings file, sorted in alphabetic order
2. Term weighting, using TF.IDF weights.
Use the standard weighting scheme on Slide 11 of Lecture 9. The program should read the files written by the first program and output one file:
(2a) Postings file with term weightings, sorted in alphabetic order.
Write a brief report (not more than one page) that describes the algorithms and data structures that you used for each program.
You should submit the following:
Programs 1 and 2
Files 1a, 1b, 2a using the test data
Report
10/2/01
******** HW 3 EXTENSION ********
******** HOMEWORK 3 is now due at 5 p.m. on Friday, November 9 ********
The aim of this assignment is to implement a Boolean information retrieval system for fielded searching.
The input data that is to be indexed is a file of tagged ASCII text of the following format:
Each record and each field begins on a new line. Fields may extend over more than one line. There may be blank lines.
See the test data file for examples of this format. Your search engine does not need to handle data in any other format, but must be prepared to recognize other field tags for different metadata types.
A file of test data test3.txt has been provided. This data is the catalog records from one year of articles in D-Lib Magazine, lightly edited. (This test data is actually an XML file. Your program can ignore the data before the first <metadata> tag.)
A query is a string of printable ASCII text. It may contain the following:
Here are some examples of queries, with interpretation:
A. Write two programs in Perl.
B. Provide a brief report that describes the algorithms that you use (not more than one page).
C. Run your program with about five queries. Your report should list the queries and indicate what function each is testing.
In your submission folder, the files listed below should be present. No fewer and no extra files should be there. Conforming to the name and path conventions will allow us to run automated tests on the code. If your files are named differently from the ones below, we may deduct points.
The following files need to be in your submission folder (e.g., abc123-1\stoplist.txt):
The following files need to be in a subdirectory named "output" (e.g., abc123-1\output\word-list.txt):
We will be running your code using a Perl driver. The output of your programs should be written to the same directory level as the programs (e.g., abc123-1\word-list.txt). Since we do not want to overwrite the output files that you have submitted, please submit the output files from your test runs in the "output" subdirectory.
We will be testing your second program against a set of queries. These queries will be fed to your program one at a time through standard input. You need to ensure that an entire query string is read by your program in a single stdin read and that no other user input is required in order to display the results. After the results are displayed, your program should exit rather than loop.
10/30/01
*** Optional assignment for extra credit ***
Because of the length of Assignment 3, this is an optional assignment for extra credit. It you do not submit this assignment your overall grade for the course will not suffer. This assignment is offered so that (a) those people who submitted incomplete solutions to Assignment 3 have a second opportunity, and (b) people who have special concern about their grade have a chance to show special commitment (e.g., if you hope to turn an A into an A+, or if your Midterm examination was very poor.)
Your task is to extend the system that you wrote for Assignment 3 to support the near operator. Here are two examples of queries that include a near operator:
Modify your programs build.pl (if necessary) and search.pl from Assignment 3. Write a short report explaining the changes that you have made and describing how you have incorporated the near operator into the query syntax. Test your work with suitable test queries.
The submission instructions are the same as for Assignment 3.
Because this is an optional assignment, absolutely no extensions will be allowed.
11/21/01
William Y. Arms
(wya@cs.cornell.edu)
Last changed: November 21, 2001