Homework assignment 1: Due in 3 weeks, on Sept 22. Hand in
solutions via CMS. START SOON!
A common problem in data center settings is to implement a service
that has some form of leader. The hardest part is bootstrapping: the first
server to be launched should become the initial leader, while subsequent servers
should be backups, waiting to ascend to the throne if the leader fails.
But once a system is running, the issue of monitoring leader status and managing
the fail-over is also potentially tricky (see class notes on "split brain"
problems).
Implement a solution to this problem in Windows .NET using C# or C++ in
Visual Studio. Use the Windows socket interface and UDP message passing
for all communication. Design your solution to have two "sides": a library
that could be reused by others, and an application that has a user interface
showing the system state, and that uses the library. Think
hard about the best way to handle the initial rendezvous, and make sure to
address races in which two servers are launched simultaneously.
Note: Actually, we won't be upset if you work on a Linux platform or use
Java. But we won't be able to provide nearly as much help if people
deviate from our main recommendation.
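One way to think about the simultaneous-launch race mentioned above is to have every candidate announce an identity and apply the same fixed ordering to the identities it hears, so that all candidates agree on a single winner. The sketch below is only an illustration under that assumption: the CandidateId fields and the functions outranks and shouldLead are hypothetical names, not part of any required API.

```cpp
#include <cstdint>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical identity that a starting server announces during rendezvous.
struct CandidateId {
    std::uint64_t startMicros;  // launch time as measured by the candidate itself
    std::uint64_t nonce;        // random value chosen once at startup
    std::string   endpoint;     // "ip:port" the candidate listens on
};

// Strict total order: earlier announced launch time wins, then the smaller
// nonce, then the smaller endpoint string. Every candidate compares the same
// announced tuples, so clock skew changes who wins but not the agreement.
inline bool outranks(const CandidateId& a, const CandidateId& b) {
    return std::tie(a.startMicros, a.nonce, a.endpoint) <
           std::tie(b.startMicros, b.nonce, b.endpoint);
}

// True when `self` should claim leadership given every identity heard so far.
inline bool shouldLead(const CandidateId& self,
                       const std::vector<CandidateId>& others) {
    for (const CandidateId& other : others) {
        if (outranks(other, self)) return false;
    }
    return true;
}
```

This ordering is meant only for candidates that are all still starting up. A server that hears from an already established leader would presumably defer to it regardless of the ordering.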
Some rules of the game:
- Your solution will take the form of a library of routines that can be
used by various applications. (In Windows, a library is often referred
to as a "DLL").
- An application using the DLL will need to tell it its "name", like "lock service" or "current
inventory". A file-pathname style of name would be best (e.g. "/amazon/us-nw/inventory").
Each distinct application would be handled by the DLL separately, having its own leader. The
DLL itself should support large numbers of applications, and it should be
possible to run many applications on a single machine for purposes of
testing.
- You will need to convince yourself that the solution comes as close as
possible to ensuring that there is always a single leader, unless the
service shuts down entirely. If your implementation might sometimes
have no leader, or multiple leaders, convince yourself that the condition
can't last long. Ideally, come up with a time limit, like "2 seconds",
such that if a problem of this nature arises, it will resolve within that
time limit.
- Your API matters. Design your DLL as if it will become a
widely used standard on which the entire fate of Google, Amazon, or the
world financial system might depend. Think hard about elegance,
simplicity, and clarity.
- Decide how the DLL will notify the application when status changes,
e.g. when a new member joins, or when the leader role changes.
- Protect your solution against multithreaded access. Threads are
common in modern systems and your solution should be thread-safe. In
your own case, threads arise (at least in Visual Studio C#) because
once the user presses the "run" button, the application will be
unresponsive to console input unless you launch a separate thread to run
the code. A warning though: launching a thread makes the application
harder to debug and introduces some annoyances, such as difficulty
accessing Windows Forms controls like text boxes (they can only be
updated from the thread that created them, or through the control's
"Invoke" method with a call-back function as an argument: not hard,
but irritating to get right).
.... plus one very important rule about doing your own work:
- This assignment is to be done individually. You may
discuss your approach with others in the class, but
all aspects of the implementation must be
entirely your own work, except for code
cut-and-pasted from Visual Studio itself. You must not show your code to
others in the class, and if you help someone debug their solution, limit
yourself to suggestions: do not touch the keyboard!
We expect you to implement and test your solution, evaluate its performance
and scalability, and document the whole story.
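To make the rules about per-application names, status notification, and thread safety more concrete, here is a minimal sketch of what a library surface could look like. It is an assumption-laden illustration, not the expected design: ElectionLibrary, StatusEvent, join, setLeader, and roleIn are all hypothetical names, and the network protocol that would call setLeader is left out entirely.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

enum class Role { Backup, Leader };

// Hypothetical event delivered to the application on every status change.
struct StatusEvent {
    std::string group;   // e.g. "/amazon/us-nw/inventory"
    Role        role;    // this process's role after the change
    std::string leader;  // endpoint of the current leader
};

using StatusCallback = std::function<void(const StatusEvent&)>;

class ElectionLibrary {
public:
    // Register a callback for a named group; each name has its own leader.
    void join(const std::string& group, StatusCallback cb) {
        assert(cb != nullptr);  // an empty callback could never be notified
        std::lock_guard<std::mutex> lock(mutex_);
        groups_[group].callbacks.push_back(std::move(cb));
    }

    // Would be called by the protocol code when leadership changes.
    void setLeader(const std::string& group, const std::string& leader,
                   bool isSelf) {
        StatusEvent event{group, Role::Backup, leader};
        std::vector<StatusCallback> toNotify;
        {
            std::lock_guard<std::mutex> lock(mutex_);
            auto it = groups_.find(group);
            if (it == groups_.end()) return;  // nobody joined this group
            it->second.role = isSelf ? Role::Leader : Role::Backup;
            it->second.leader = leader;
            event.role = it->second.role;
            toNotify = it->second.callbacks;  // copy so callbacks run unlocked
        }
        for (const auto& cb : toNotify) cb(event);
    }

    // Current role in a group; Backup if the group was never joined.
    Role roleIn(const std::string& group) const {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = groups_.find(group);
        return it == groups_.end() ? Role::Backup : it->second.role;
    }

private:
    struct GroupState {
        Role role = Role::Backup;
        std::string leader;
        std::vector<StatusCallback> callbacks;
    };
    mutable std::mutex mutex_;
    std::map<std::string, GroupState> groups_;
};
```

Callbacks are invoked after the lock is released so that an application handler can call back into the library without deadlocking. Whether callbacks, events, or polling is the clearest notification style is exactly the kind of design decision the assignment asks you to make yourself.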
A few hints: You will probably be confused about what is and isn't permitted
for the very first messages your application sends, because at that initial
step, the program has no idea whether there are other instances of the same
application running or not. Here are options to consider. You can
pick one, or rule them all out and do something of your own. The more
flexible and general the better.
- UDP multicast. In effect, when starting up the program can
multicast "hello out there!" and if anyone receives the message, the
receiver can reply "hi, welcome to the group". If you do use this
approach, think about the issue of class-D (IP multicast) addresses and port numbers:
how will you pick them? As a quick comment, Professor Birman
really likes this approach because no special extra services are needed and
you only need to implement one application. So if all else is equal,
this is what he recommends. Use a small
value for the TTL field when you send your UDP multicast, like TTL=1, to
ensure that messages can't "flood" the CSUG lab!!! Also, pick an
unusual, randomly chosen port number and even so, think about how to deal with
incoming UDP multicast messages sent by some other student (who accidentally
picked the same port number). If junk of that sort shows up, your
program needs to filter it out. For example, you can send messages
with your netid in a header and reject incoming messages that don't start
with your netid -- anything like this would protect against surprises.
Also do keep in mind that UDP multicast is unreliable and packets can be
dropped. This is very rare, but it argues for sending two or three
times before assuming that your program is the first service instance to be
started up.
- Rendezvous through a web service. You could build a little service
to help out just at the startup. Applications would "touch base"
through it, and it would hand out suggestions along the lines of "If <IP,port>
is still running, it's in the group." Think about races where
processes run for very short periods of time (like fractions of a second).
Could someone have trouble joining because the service is always very out of
date? Comment: Professor Birman thinks this is not as good a
solution because it forces you to build two applications. But you
would get experience with web services this way, so he's ok with this if you
understand it better. You would need to hard-wire the location of the
service into your main application.
- Rendezvous using DNS in much the same role as in option 2. Hard
subproblem: gaining program control over DNS entries in Windows isn't as
easy as you might wish. Second (easy) problem: application group names
would need to look like and behave like machine names, so that DNS can
understand them. This is ok -- it reuses an existing technology
that definitely has to be working for the Internet to be available at all.
But talking to the DNS at this level could be a whole project in itself.
- User types in rendezvous information through a console interface.
Think about the challenge of convincing the TA that this is actually a
useful option. What about the issue raised under option 2?
This is kind of cheating but could be a good way to get started, so that
you can FIRST do the leader selection/monitoring technology and only then
add in the startup code...
- Fixed command-line arguments or configuration files. See comments
on option 4.
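The netid-header filter and the "send two or three times" advice from the multicast hint above can be sketched without any socket code. This is only one possible shape: the wire format "netid|group|payload", the Message struct, and the functions encode, decodeIfOurs, and nextStep are assumptions made up for illustration, and the sketch assumes that neither the netid nor the group name contains a '|' character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical wire format: "<netid>|<group>|<payload>".
struct Message {
    std::string netid;
    std::string group;
    std::string payload;
};

inline std::string encode(const Message& m) {
    // The separator must not appear in the two header fields.
    assert(m.netid.find('|') == std::string::npos);
    assert(m.group.find('|') == std::string::npos);
    return m.netid + '|' + m.group + '|' + m.payload;
}

// Parses `datagram` into `out`. Returns false unless it is well formed and
// carries our netid and our group, so stray traffic on the port is dropped.
inline bool decodeIfOurs(const std::string& datagram,
                         const std::string& myNetid,
                         const std::string& myGroup,
                         Message& out) {
    const std::size_t first = datagram.find('|');
    if (first == std::string::npos) return false;
    const std::size_t second = datagram.find('|', first + 1);
    if (second == std::string::npos) return false;
    Message m{datagram.substr(0, first),
              datagram.substr(first + 1, second - first - 1),
              datagram.substr(second + 1)};
    if (m.netid != myNetid || m.group != myGroup) return false;
    out = m;
    return true;
}

enum class StartupDecision { SendAnotherHello, JoinExisting, BecomeLeader };

// Decides what to do after a probe interval ends. Because multicast can drop
// packets, silence only counts once `maxHellos` probes went unanswered.
inline StartupDecision nextStep(bool heardWelcome, int hellosSent,
                                int maxHellos) {
    if (heardWelcome) return StartupDecision::JoinExisting;
    if (hellosSent < maxHellos) return StartupDecision::SendAnotherHello;
    return StartupDecision::BecomeLeader;
}
```

A receive loop would pass every incoming datagram through decodeIfOurs and ignore anything for which it returns false. A binary header would be more compact, but a text format is easier to inspect while debugging.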
More hints that aren't related to startup:
- If you use the UDP multicast approach (option 1 in the first set of
hints), your entire solution can take the form of a single DLL plus a single
application that demos the technology. The application would be a
"Windows forms application" with a nice user interface. Some ideas for
the interface:
  - It can have a few "textBox" controls into which one enters the
    application name, the port number it will use, timeout parameters, etc.
  - A "button" can tell it to "Join". This way you can launch lots
    of copies and then type in the needed parameter values on each, and then
    tell them to join at the frequency and in the order that you find
    convenient. So, you would start a few instances up, perhaps on a
    single machine, and then click "Join" on a few of them.
  - You can either have an "exit" button, or just click the "x" in the
    top-right corner to kill the application off.
  - It can show who the leader is using a nice color-coded message
    displayed in a textbox (maybe "Red" means "not me" and "Green" means
    "I'm the leader"). You just need to set the background color for
    the textbox control. Change the color when the status changes.
  - You'll need to check periodically for input. Although you can
    do this with threads, it may be easier to just set up a system timer
    event and have it fire perhaps ten times per second. Then you can
    do any time-based actions when the timer event handler is called, such
    as checking for incoming messages, sending messages, etc.
- Use the "WinSock" socket interface to create a socket, associate an IP
multicast (IPMC) address with it (if you use option 1), and to
send, receive, and poll for messages.
- If you DO use threads, be aware that only the main thread can access the
controls in your form application. You need to use something called
the "invoke" method to access a control from a thread other than the one
that created it. This is a pain, although not to the point of being a
crippling show-stopper. Use the Visual Studio help to get to the code
you need to cut and paste if you wander down this path...
- Again, use a small TTL value when sending a UDP multicast. The
value 1 will be fine. Do not ignore the advice or your application
could behave in a way that would disrupt the Cornell network!
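The timer-driven approach in the hints above, combined with the idea of a fixed time limit such as "2 seconds", suggests a small piece of bookkeeping that the timer handler could consult on every tick. The sketch below assumes a heartbeat-based design, which the assignment does not mandate. LeaderMonitor and its methods are hypothetical names, and all times are milliseconds from any monotonic clock.

```cpp
#include <cassert>
#include <cstdint>

// Tracks leader liveness for a handler that fires about ten times a second.
class LeaderMonitor {
public:
    LeaderMonitor(std::uint64_t heartbeatMs, std::uint64_t timeoutMs,
                  std::uint64_t startMs)
        : heartbeatMs_(heartbeatMs), timeoutMs_(timeoutMs),
          lastHeardMs_(startMs), lastSentMs_(startMs) {
        // Several heartbeats should fit inside one timeout window, because
        // individual UDP packets can be lost.
        assert(timeoutMs > heartbeatMs);
    }

    // Backup side: record that a leader heartbeat arrived at `nowMs`.
    void onHeartbeat(std::uint64_t nowMs) { lastHeardMs_ = nowMs; }

    // Backup side: true once the leader has been silent for the full timeout.
    bool leaderSuspected(std::uint64_t nowMs) const {
        return nowMs - lastHeardMs_ >= timeoutMs_;
    }

    // Leader side: true when the next heartbeat should be sent; also records
    // `nowMs` as the time of that send.
    bool heartbeatDue(std::uint64_t nowMs) {
        if (nowMs - lastSentMs_ < heartbeatMs_) return false;
        lastSentMs_ = nowMs;
        return true;
    }

private:
    std::uint64_t heartbeatMs_;
    std::uint64_t timeoutMs_;
    std::uint64_t lastHeardMs_;
    std::uint64_t lastSentMs_;
};
```

With a 500 ms heartbeat and a 2000 ms timeout, a backup tolerates three consecutive lost heartbeats before suspecting the leader. A leader that was merely slow rather than dead can still reappear after being suspected, which is the transient-fault case the writeup is asked to discuss.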
Hand in via CMS (the CS Course Management System): a single document, ideally
in PDF format (you can create multiple PDF files and then combine them if you
like), with your name and netid on the front page. It should include:
- A short writeup explaining how the solution works. No more than 1
1/2 pages in length, not counting illustrations. If you include them,
they should be drawn to the same quality standards you would expect from
documentation of a library you might find yourself using.
- Documentation of the library API, mimicking the style of documentation
used by Visual Studio when you use the help interface
- A summary of the performance evaluation, giving the latency associated
with launching a new member of an application group graphed as a function of
the number of applications running on the platform, and the latency of
fail-over when the leader crashes. Note: If evaluating
scalability as a function of the number of application groups turns out to
be too hard, you can just measure performance for a single group and report
the amount of time for launching the first member, for launching additional
members, and for handling failures. So: one group is enough, but
evaluations for large numbers of groups would be way cool and would impress
us. A real library would need to work for large numbers of groups, but
you'll do fine in CS5140 even if you can't evaluate that case...
- A short discussion of the behavior of your solution if the leader has a
transient fault but then resumes "normal" operations
- A printout of the library code
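For the performance summary requested above, join and fail-over latencies have to be timed and then condensed into a few numbers per configuration. The helper below is a minimal sketch of that step. The names LatencySummary, summarize, and elapsedMs are hypothetical, and reporting minimum, median, and maximum is just one reasonable choice of statistics.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

struct LatencySummary {
    double minMs;
    double medianMs;
    double maxMs;
};

// Summarizes latency samples given in milliseconds.
inline LatencySummary summarize(std::vector<double> samples) {
    assert(!samples.empty());  // a summary of zero runs is meaningless
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : (samples[n / 2 - 1] + samples[n / 2]) / 2.0;
    return LatencySummary{samples.front(), median, samples.back()};
}

// Milliseconds between two readings of the monotonic clock.
inline double elapsedMs(std::chrono::steady_clock::time_point start,
                        std::chrono::steady_clock::time_point end) {
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

A test harness would read std::chrono::steady_clock::now() just before killing the leader and again when a backup reports that it has taken over, then feed the differences from repeated runs into summarize.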
Our course TA, Jonathan, may request a demo of your solution or pose other
questions about it.
Can't finish on time? Make sure to ask Jonathan for help!