Phase III: Fault-tolerance Through Message-Logging
Due: 8:00am Tuesday, Oct. 30.
General Instructions.
Students are required to work together in teams.
You may work in the team you used for Phase II or you may form a new team.
An assignment submitted on behalf of a "team" having fewer than 3 or
more than 5 students will receive a grade of F.
All members of the team are responsible for understanding the entire
assignment.
This assignment builds on work that was done in Phase I.
Feel free to use any team's solution to Phase I as the basis for your
solution to Phase III.
No late assignments will be accepted.
Academic Integrity. Collaboration between groups is
prohibited and will be treated as a violation of the University's
academic integrity code.
Background: Disaster Recovery for Bank Branches
In the distributed banking system built for Phase I,
account balances are stored in
the volatile memory of processors --- not on secondary storage.
If the processor at a branch crashes, then access to information about those balances
is not only lost while the processor remains unavailable
but can be lost forever.
Neither the bank nor its customers are likely to be happy about
such a loss of account information.
Use of a message logging protocol is one way that a branch server could recover
information concerning that branch's accounts after a faulty processor has been repaired.
Customers might not be happy about losing access to their accounts while
"the computer is down" but at least there would be no irrevocable consequences of
a processor failure.
In this phase of the cs514 project, you will program a message-logging
protocol (of your choice) for the branch servers of
the distributed bank implemented for phase I.
This exercise should:
-
Help you to master ideas we have been discussing in class
(i.e., message-logging protocols).
-
Give you an opportunity to contemplate and evaluate various design options
a designer must confront:
-
Is additional difficulty in programming a more complex protocol worth the gain?
-
Is a given protocol too costly for the intended application (and what cost metrics
are the most relevant)?
What to Build
For this phase, assume:
- Processor failures are benign (i.e. failstop) and not Byzantine.
- Each communication channel is bidirectional.
- Communication channels are not necessarily reliable and are not necessarily FIFO.
- Hosts in the banking system do not have access to local disks for storing checkpoints
or for storing any other information needed for implementing recovery.
Branch GUI and Branch Server.
Extend the branch GUI from Phase I
with a "button" that instigates the failure of the corresponding branch server.
Activating this button should cause the branch server to simulate a failstop failure
and recovery by doing the following.
-
Send a message to each of neighbor, announcing that this branch server has
failed.
-
Cause the branch server application to terminate (perhaps by sending a suitable
message to a modified version of that branch server).
-
After some time has passed (say 30 seconds) since the branch server has terminated,
cause a new copy of the branch server to start and run some "recovery code".
Message Logging Protocol Implementation.
Also, write "recovery code" for the branch server.
Execution of this code enables the branch server to obtain correct
and current account balances before that
branch server commences processing subsequent requests
for the various operations it supports.
Use message logs stored in the volatile memories of the other branch servers.
(An industrial-strength implementation might also employ checkpoints to allow the truncation
of message logs.)
As mentioned in class, descriptions of some message-logging protocols and
citations to relevant literature
can be found in:
-
Mootaz Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson.
A survey of rollback-recovery protocols in message-passing systems.
Technical Report.
Available from
http://www.cs.utexas.edu/users/lorenzo/papers/Pap6.ps or
locally postscript or pdf.
Because this paper was written using MicroSoft Word,
windows ghostview (and not UNIX ghostview) must be used as the
Helper Application for reading.
The scheme you implement must be one of the schemes that is
discussed in this survey or in some other published paper on
the subject.
Submission Procedure.
Create a directory containing the files you wish us to grade.
Call this directory xxxxx0,
where xxxxx is the Cornell network id
of the team member whose net id is alphabetically smallest of your team.
Then copy this directory to the following folder:
\\Goose\courses\cs514-fall01\proj03.submit
Don't be disturbed by warnings informing you that the file cannot be accessed
after it has been copied.
Should you wish to revise your submission after you have copied it to our folder,
then simply correct the files and re-copy the entire directory---but this
time use the name xxxxx1.
Revisions to that should be named xxxxx2,
and so on.
We will grade only the largest-numbered file of a series.
No late submissions will be accepted.
Your directory should contain the following files (at least):
-
TEAM which contains the names (and net-ids) for all team members.
Also, for each team member give a 1 or 2 paragraph description of the tasks
this team member performed and the number of hours this required.
-
README which contains
- The names and a description of the contents for the other files in the directory.
- Instructions for installing, compiling, and running your software on our
Windows-NT system.
- A tutorial that the grader can follow to start your software and to convince
himself that your system implements the required functionality. Expect the grader
to spend at most 10 minutes on this task.
-
LOGIC which contains
- A brief description of the specific message-logging
protocol that was implemented and a pointer to a publication (paper or textbook)
where this protocol is discussed.
- An explanation of what other protocols were considered and
why this protocol was selected.
Discuss those aspects of the application or setting that impact the choice of protocol.
Give strengths and weaknesses of the protocol you selected as
compared with alternatives.
- A characterization of how many processor failures your protocol can
tolerate (relate this to network topology if appropriate).
-
TOPO should specify an interesting interconnection topology for a
multi-branch bank
that will be used to illustrate the operation of your system.
You may restrict consideration to systems in which
all links are bidirectional
-
TestPlan should describe the process and any tools (i.e. additional programs)
you wrote in order to test your system. This file should also explain what tests you
ran and why this was a reasonable set of tests to have run.
-
JAVA source file that contain the Java source needed to compile and test your system.
Grading.
Your grade will be based on the following elements:
-
Does your system correctly implement some message-logging protocol.
- Higher grades will go to protocols that are more sophisticated in their operation,
meaning that their correctness argument is more subtle.
- Higher grades will go to protocols that can tolerate larger numbers of failures.
- Higher grades will to to protocols that work in larger classes of topologies.
Be warned:
It is tricky to go beyond tolerating a single
failure and a topology where every branch is the neighbor
of every other branch.
Don't try this until you have a working solution for the case where
(i) there is at most one branch that has failed and not restarted and
(ii) all branches are neighbors of each other.
-
How easy is it to follow the README file installation and sample-execution script.
-
How thorough was your testing procedure and how creative were you in building
sufficient scaffolding (test drivers etc) to test your system.
-
Is the source code easy to understand and does it exhibit good structure?