Phase IV: Availability Through Active Replication
Due: 9:00am Monday, Nov. 19.
General Instructions.
Students are required to work together in teams.
You may work in the team you used for an earlier phase or you may form a new team.
An assignment submitted on behalf of a "team" having fewer than 3 or
more than 5 students will receive a grade of F.
All members of the team are responsible for understanding the entire
assignment.
This assignment builds on Phase I and some parts of Phase III.
Feel free to use any team's solutions to previous phases as the basis for your
solution to Phase IV.
No late assignments will be accepted.
Academic Integrity. Collaboration between groups is
prohibited and will be treated as a violation of the University's
academic integrity code.
Background: Making Bank Branches Available
In the distributed banking system built for Phase III, bank accounts (and the
funds they store) at crashed
branches are unavailable until the faulty host has been repaired and restarted.
Few bank customers would be willing to tolerate such outages often, if at all.
More generally,
the increasing dependence by business on computers for daily operations means that
high-availability is no longer a requirement only for
life-critical control settings.
Implementing availability in our banking systems is thus not an atypical concern.
The state machine approach is one way that an application can be made highly available.
Services and data are replicated;
requests are processed by enough replicas so that processor failures can be masked.
In this phase of the cs514 project, you will program such active replication.
This will give you an opportunity to master and get hands-on
experience with the state machine approach.
What to Build
For this phase, assume:
- Processor failures are benign (i.e. failstop) and not Byzantine.
- No branch GUI ever fails.
- Each communication channel is bidirectional.
- Communication channels are reliable and FIFO.
- Every bank branch is directly connected to every other. Thus, the network topology
is a completely connected graph (i.e., with N*(N-1) bi-directional edges).
Simulating Failstop Failures.
Extend the branch GUI from Phase I
with a "button" that instigates the failure of the branch server.
Activating this button should cause behavior that simulates the failstop failure
for the branch server, as follows.
-
Send a message to each of neighbor, announcing that this branch server has
failed.
-
Abruptly terminate the branch server application and any other software written for this
phase that is running on the failed host.
(Do not cause the branch GUI to terminate, because the branch GUI
will serve to simulate the
control panel for the host at that branch.)
-
After some time has passed (say 30 seconds) since the branch server has terminated,
cause a new copy of the branch server to start and run some "recovery code".
State Machine Protocols.
A branch server can be replicated by running a copy of it on other hosts
in the banking system.
There are two ways to orchestrate
running a replica of branch server for branch
B1 (say) on the host for branch B2 (say):
- Modify the branch server at branch B2.
The modified branch server not only processes the operations (commands)
that it used to
(e.g. Deposit, Withdraw, Query, Transfer for
that branch's accounts) but this modified branch server also handles these operations for
branch B1.
Effectively, we are merging new branch server replicas with the branch server
that already exists at each host.
-
Instantiate at branch B2
a copy of the branch server for branch B1, and associate this copy with a new socket
at the host running branch server B2.
Note that you may have to modify your "network wrapper class" to accommodate
any additional
sockets now in use.
As far as the state machine approach is concerned,
each branch GUI should be viewed as a client.
This means that the branch GUI will now be receiving multiple responses
for each operation it initiates and, therefore, it must
convert those responses into a single one (which is then presented to the human
user).
Because the mechanism for combining responses
resides in the branch GUI, it never experiences failures.
The state machine approach is mostly concerned with orchestrating the delivery of
client commands to various replicas.
Since we assume that failures are fail-stop,
the choice of stability test
should reflect that.
For additional credit, feel free to
- add periodic broadcasts by various system components in order
to speed-up request processing, and/or
- employ assumptions about message delivery delays and clock synchrony in order
to speed-up request processing.
Note that both of these additional-credit options require programming as well
as making design decisions.
What components should issue additional broadcasts and how often?
What assumptions about message delivery delays and clock synchrony
are reasonable for the actual hardware and software you are working with?
The amount of credit you receive will be based not only on your answers
to these questions but also on your discussion of how you obtained those answers.
In the state machine approach, each client request is broadcast to all state machine
replicas.
An agreement protocol is usually employed for this.
But for the purposes of this phase,
you may simulate the agreement protocol by coding a naive broadcast routine.
Your routine should simply send a single copy of the message being
agreed upon to each site (using the reliable FIFO channels that connect sites);
assume that failures never occur while the broadcast routine is executing, and
orchestrate any system tests accordingly.
Your simulation of a failstop failure has a branch server restart and run
"recovery code" 30 seconds after a failure.
For this phase, the "recovery code" should do nothing;
in the next phase, you will develop recovery protocols.
Configuration Management.
The fault-tolerance of a replicated state machine is defined by the degree of replication.
There are actually two design questions:
- How many replicas of each state machine to run.
- Which hosts should run these replicas (assuming that there are
fewer replicas than hosts).
For answers,
your implementation should read from a new file, called CONFIG.
This file should state
how many replicas to create for each branch server replica and which hosts will run
those replicas.
You may assume that CONFIG is readable by all branch servers.
Submission Procedure.
Create a directory containing the files you wish us to grade.
Call this directory xxxxx0,
where xxxxx is the Cornell network id
of the team member whose net id is alphabetically smallest of your team.
Then copy this directory to the following folder:
\\Goose\courses\cs514-fall01\proj04.submit
Don't be disturbed by warnings informing you that the file cannot be accessed
after it has been copied.
Should you wish to revise your submission after you have copied it to our folder,
then simply correct the files and re-copy the entire directory---but this
time use the name xxxxx1.
Revisions to that should be named xxxxx2,
and so on.
We will grade only the largest-numbered file of a series.
No late submissions will be accepted.
Your directory should contain the following files (at least):
-
TEAM which contains the names (and net-ids) for all team members.
Also, for each team member give a 1 or 2 paragraph description of the tasks
this team member performed and the number of hours this required.
-
README which contains
- The names and a description of the contents for the other files in the directory.
- Instructions for installing, compiling, and running your software on our
system.
- A tutorial that the grader can follow to start your software and to convince
himself that your system implements the required functionality. Expect the grader
to spend at most 10 minutes on this task.
-
LOGIC which contains
a brief description of the specific protocols that were implemented.
Include an analysis of how many and what kinds of failures your distributed banking system
can tolerate.
Do you make assumptions whose violations constitute failures of system components?
-
CONFIG which contains a description of the numbers and locations of
state machine replicas.
-
-
JAVA source file that contain the Java source needed to compile and test your system.
Grading.
Your grade will be based on the following elements:
-
Does your system correctly implement active replication with the state machine approach?
-
How easy is it to follow the README file installation
and sample-execution script?
-
Is the source code easy to understand and does it exhibit good structure?