The world, how it works, surroundings, myself, etc.

Wednesday, August 02, 2006

Server Room Correction #1

If you have been following this blog, you might have seen the sequence of posts labelled "Server Room Tragedy..". Well, things got out of hand, so some of us decided to raise the issue at a higher level. In a brief meeting with the director, we had scheduled a meeting just after the convocation. This meeting was held yesterday (Aug 1). However, since Dr Bruhadeshwar was not available then, he asked us to list out the issues and then discuss. I sent him the mail which is shown below. Since the mail is long, I'll discuss the meeting in the next post.


From: Nirnimesh
To: bezawada[at]iiit.ac.in
Cc: sangal[at]iiit.ac.in
Subject: System Administration related issues

Dear Sir,

I'm listing out the issues which we wanted to discuss and which we feel
that need attention. Some of these are very critical and apply at the
fundamental level of system administration. I first list out specific
examples of concern. This list can practically be endless so I list out
only a few. Then I list out some general issues. In the end I list out
solutions which in my opinion would be required to solve the problems.

Specific examples:
1. Backup disks: In a CRC meeting once, it was decided that a backup
system be set up, which would back up data incrementally. 4 hard disks (2
TB in total) were approved to be purchased for the purpose. The hard
disks never got purchased. The current backup server is improper (if at
all). Checkpoint backups on to DVDs were planned but even the first
checkpoint never materialized.

2. DNS misconfiguration: Drastic server failures take place if the DNS
is misconfigured, since it forms the basis for delivering all mails,
etc. Recently, when the ISP was changed, the DNS was misconfigured
(ideally, it should be a planned transition from one set of IPs to the
other). Not only did several mails bounce, some of our servers got
classified as spammers by some prominent DNS black-lists, due to which
our mails were tagged as spam. This was reported even by some faculty
members (Dr Madhav Krishna, Dr PJN lately) when they discovered
undelivered mails. I'm sure there are more such instances which have
gone unnoticed.

3. Wireless LAN: The wireless LAN is in an abysmal state. Not only does
it go down without any warning, it works at poor speeds, sometimes even
lower than the internet b/w! Given that the number of wlan users on
campus is now more than the wired lan ones, this issue is all the more
important. I do feel that it is high time the wlan became stable and
reliable rather than bearing the experimental tag forever.

4. OBH LAN: OBH LAN has been non-functional with proper settings for
more than 3 weeks now. This has been reported to the server room several
times but in vain. Some random network setting is working currently, but
it is non-optimal.

5. SPAMS: Faculty reported bombardment of spams from infected systems.
Methods to identify and block these infected systems are required apart
from proper configuration of the mail server's spam checkers.

6. Internet: The internet has been unreliable ever since the ISP was
changed, even though the b/w was increased and the internet graph shows
good throughput. I feel that no amount of b/w can circumvent the need
for proper monitoring of where the b/w really goes. My assessment is
that there are infected systems on the network which are hogging up the
b/w due to which the browsing speed has gone down. These should be
monitored, identified and treated. Besides, IIIT is dependent on a
single ISP link now, which means that mails cannot be routed through an
alternate path if this link is down.

7. Phishing: Proper monitoring of logs is required to make sure that
our servers don't end up getting phished. The last phishing event went
unnoticed for 2 days, before I accidentally discovered it. No such
security-related logs are monitored currently.

8. Proxy crash: I had known from the proxy server logs that its hard
disk was about to crash in due course. I had reported this to the server
room and had prepared a backup system to be used. Unfortunately, this
backup system was recklessly formatted. Days later, proxy really
crashed, and it took quite some time before we set things up from scratch.

9. CDROMs in teaching labs: Whereas the systems in teaching lab #333
don't have cdrom drives even though they have hard disks, the disk-less
thin clients in the new teaching labs have cdroms each even though they
will never ever be used. For the 100 thin-client systems this adds up to
a straight misuse of Rs 1 lakh.

10. Gateway server: A gateway firewall was put in place to prevent
research server labs from getting hacked (LTRC server was hacked once
because of this reason). Besides, it also improves the internet
bandwidth. This server is missing now thereby posing security threats.

I will conclude by saying that the server room currently lacks a system.
Things happen ad hoc. I can best describe the current system as:
clueless, irresponsible and acting in damage-control mode.


General issues:

Mails: If configured properly, mail protocol is so versatile that
there's absolutely no possibility of mails getting lost or even getting
delayed. But proper configuration is required nevertheless, and slight
DNS misconfiguration can wreak havoc for all mails. Students have
started shifting to gmail. Not only is this shameful for an institute
such as ours, it causes a waste of costly internet bandwidth. (Blocking
gmail is not the solution, though. Instead, our mail servers should
function reliably enough.)
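
To make the blacklist problem concrete: DNS black-lists are queried over
DNS itself, by reversing the octets of the mail server's IPv4 address and
appending the list's zone. An answer in 127.0.0.0/8 means the address is
listed. The following is only a minimal sketch of such a check; the zone
name and the address in the usage note are placeholders, not real IIIT
servers or a real black-list.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name used to look up an IPv4 address in a DNSBL zone."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on a malformed address
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip, zone):
    """Return True if the black-list zone answers with a 127.x.x.x record."""
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except OSError:
        # No such name (not listed) or the lookup itself failed.
        return False
    return answer.startswith("127.")
```

For example, dnsbl_query_name("192.0.2.10", "dnsbl.example.org") gives
"10.2.0.192.dnsbl.example.org". Running is_listed() periodically for each
outgoing mail server would catch a listing before the mails start bouncing.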

ISP/Internet: IIIT should have at least two ISPs. The second one can be
a low-cost, low b/w one. This is critical for 24x7 mail or internet
access. Besides, the internet traffic from the ISPs needs to be
monitored, since they often end up cheating. At the 4 Mbps b/w that IIIT
has, it is possible to have an excellent browsing and download speed, if
monitored properly to weed-out b/w-hogging infected systems.
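
The kind of monitoring meant here can be sketched roughly as follows. The
sketch assumes that per-host byte counts are already available as
(host, bytes) pairs, for instance exported from the proxy or gateway logs;
that input format and the 20% threshold are assumptions made for
illustration, not how the server room currently records traffic.

```python
from collections import Counter


def top_talkers(records, min_share=0.2):
    """Return (host, total_bytes) for hosts using at least min_share of all traffic.

    records is an iterable of (host, nbytes) pairs; results are sorted by
    total bytes, largest first.
    """
    totals = Counter()
    for host, nbytes in records:
        totals[host] += nbytes

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no traffic recorded, nothing to flag

    return [
        (host, nbytes)
        for host, nbytes in totals.most_common()
        if nbytes / grand_total >= min_share
    ]
```

A host that keeps showing up in this list without an obvious reason is a
candidate for a closer look, since an infected system often stands out by
its share of the traffic.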

Responsibility: Several institutes (even in India) have well-managed
system administration, along with websites as help-pages. We lack such a
thing altogether. Students and faculty often don't know how to configure
things and they don't know where to get help from. Having a web-page not
only eases things out, it ensures that the solutions can be reused for
recurring problems. This is not the case with the current server room
functioning.


Solutions:
I see the following 3 steps as necessary to ensure a reliable,
responsible and smart system administration.

1. System

Set up a planned system along which the server room works. The staff
need to be allocated jobs and be held accountable. I see three types of
jobs to be dealt with:
a) Routine: this involves all common monitoring and related activities
like managing accounts, managing network, checking logs, backups,
grievances, etc. This works in damage-control mode currently.
b) Development: this involves developing and automating ways to make the
working of the server room more efficient. This is missing currently.
c) Security: this involves maintaining the security of the network, the
servers and privacy-related issues. This requires expert-level
understanding. This is missing currently.

2. Have Student Sysadmins

There has to be at least one expert-level administrator who understands
the system throughout and can take quick actions. IIIT students are (and
will always be) better than any of the sysadmin staff or perhaps even
more costly sysadmins. This is why the model of having student-sysadmins
was ideal for IIIT. I feel that getting rid of student sysadmins is
suicidal because it leaves the server room with no one who understands
the system. All servers till date have been configured by student
sysadmins only. The notion of the baton being passed on to the successor
student-sysadmins has been carried out effectively in the past.

3. Strict accounting mechanism overseen by a faculty (that is CRC)

A faculty's involvement is the most critical to ensuring that the server
room functions properly, instead of in a clueless damage-control mode.
This would be too much pressure on the faculty, which is where the
student sysadmins prove helpful. But accounting from a faculty is
necessary to ensure that the decisions are implemented by the server
room staff.


I have in the past tried several measures with the server room staff to
streamline the activities, but in vain. My current activities are like a
last-resort effort. If corrective actions are not taken immediately, I
can foresee severe inconsistencies and problems arising in the immediate
future.

I hope I could be helpful.

Thank You