
ic_temp

Interconnected's temporary home

Saturday, June 26, 2004

Unbelievable. It's gone down again. At about 05.30 this morning (2004-06-26). What's going on? What changes? What can cause a freeze? Insane. I've no idea what to do. It must be a hardware problem, mustn't it?

posted by genmon  # 6/26/2004 10:35:00 AM

 

Thursday, June 24, 2004

Of course,

I bet it bloody breaks again even before I get back to the office.

posted by genmon  # 6/24/2004 11:01:00 AM

 

More:

The rsync thing worked. Well, it didn't work, but it failed due to a problem with rsync (the file list was too large, or something; exit code 23) rather than the box falling over.

Taking tcp-env and tcpd out of inetd produced weird errors in the logs, so I looked at what I had before (on the old drive) and put tcp-env back in. Why there were errors in the logs to do with tcpd when that was taken out, I don't know.

I rebooted.

Email notifications from the wiki still work, despite the fact I needed to put tcpd in inetd to make that work. hosts.allow still contains allowed IP addresses, but I've no idea how that's being accessed because I can't see where tcpd is being called. Maybe that happens automatically?

Current conclusion: Some weird collision of qmail, tcpd, the new stuff in hosts.allow (and perhaps the hostname in there), and possibly some changes in firewall rules upstream. Have they just started allowing IPv6, for example? Could that be the problem?

Dunno. It seems to be working now, but I won't be able to tell unless I sit here for a week, checking.
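Rather than sitting here for a week, the checking could be left to a small script on another machine. What follows is only a rough sketch, assuming that "down" means "refuses TCP connections", which matches what has been seen so far. The host name, the port (25, picked because qmail is involved) and the log file name are all placeholders, not anything from the real setup.

```python
import socket
import time
from datetime import datetime


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name lookups.
        return False


def watch(host, port, interval=60, log_path="uptime.log"):
    """Append one timestamped status line per check; runs until interrupted."""
    while True:
        status = "up" if port_is_open(host, port) else "DOWN"
        stamp = datetime.now().isoformat(timespec="seconds")
        with open(log_path, "a") as log:
            log.write(f"{stamp} {host}:{port} {status}\n")
        time.sleep(interval)


# Hypothetical usage, with a placeholder host:
# watch("www.example.org", 25)
```

The first DOWN line in the log then gives the time the box stopped answering to within a minute, which can be lined up against the last entry in the server's own logs.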

posted by genmon  # 6/24/2004 10:56:00 AM

 

Okay, this is quite weird. I take the slave hard drive out and the keyboard works. Of course, I can't do anything to the system (like edit /etc/fstab) because it's got scared and mounted the master drive in a read-only state. So I rebooted with both drives in again to edit fstab... and the keyboard doesn't work again.

My power supply says it's rated for 145W. I can't see power ratings on the 2 drives. There's a heat sink fan, a fan on the case, a network card and a video card. I wonder whether it's drawing too much power and that's why the keyboard doesn't work? (When the keyboard doesn't work, the lights on it don't come on either.) Oh, and 2 more fans behind the PSU, but they're only 2.5W each.

So I reboot, and this time the keyboard works fine. I've taken all references to tcpd/tcp-env out of inetd.conf and rebooted. Again again! Again again.

I've a feeling the keyboard thing is a red herring. As is whether the power supply is buggered or not -- I'm sure I got it rated for 2 drives when I first bought it.

Could it be tcpd? That's what's generating the last line of the log the last time it crashed, and possibly the time before too. [checks the logs] Yup, two times on the 16th too.

Given nothing else really gets logged, that's not a surprise -- but it does seem very close in timing. But then these getaddrinfo() warnings are happening every 8 minutes or so. No idea.

I'll try running the rsync thing again, see if that kills it.

posted by genmon  # 6/24/2004 10:56:00 AM

 

It's happened again! God damn. Turns out tcp-env is still doing getaddrinfo(), and it happened in the middle of a big rsync. A drive thing then? Let's go look at the box...

posted by genmon  # 6/24/2004 09:33:00 AM

 

Curious afterthought.

The server's died in the middle of two jobs I've left running -- or possibly after the jobs because I think they were running in screen. One was compiling X, another was rsyncing all the data from one drive to the other (about 6 GB).

What's odd is that both jobs appear to be complete, now I look at the end products. The reason that's odd is if the server fell down at the same time, there's a chance they would be incomplete. Since they're complete, that means the server could be running but refusing any kind of network connection.

Which doesn't explain why it works okay when reset. And means I'm now even more annoyed about having a dodgy keyboard this morning.

posted by genmon  # 6/24/2004 09:26:00 AM

 

[at the data centre]

Okay, that wasn't what I expected. The machine had error messages on the screen, last one

Jun 23 19:44:12 www tcp-env[19546]: warning: can't verify hostname: getaddrinfo(dsl-209-183-5-172.tor.primus.ca, AF_INET) failed

(which is also the last in the log), which I presume is because hosts.allow has to look up the hostname every time it does filtering so that localhost can get through. I've taken localhost out now, and it just relies on IP address.
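To see how often these warnings turn up, and how close the last one sits to each hang, the log could be picked apart with something like the sketch below. The pattern is written against the single line quoted above, so it is an assumption that every tcp-env warning has the same shape. The function names are made up for illustration, and syslog leaves the year out, so it has to be passed in.

```python
import re
from datetime import datetime

# Matches the tcp-env warning quoted above; any other log line gives None.
WARNING_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ tcp-env\[\d+\]: "
    r"warning: can't verify hostname: "
    r"getaddrinfo\((?P<host>[^,]+), AF_INET\) failed"
)


def parse_warning(line, year=2004):
    """Return (timestamp, hostname) for a getaddrinfo warning, else None."""
    match = WARNING_RE.match(line)
    if match is None:
        return None
    # syslog omits the year, so the caller supplies it.
    stamp = datetime.strptime(f"{year} {match['stamp']}", "%Y %b %d %H:%M:%S")
    return stamp, match["host"]


def gaps_in_minutes(lines, year=2004):
    """Minutes between consecutive warnings found in an iterable of log lines."""
    parsed = (parse_warning(line, year) for line in lines)
    stamps = [item[0] for item in parsed if item is not None]
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]
```

Feeding the open log file to gaps_in_minutes() would show whether the warnings really are regular, or whether they bunch up just before the machine stops answering.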

What I *had* expected is for the machine to be turned off, but it turns out everyone said it was off because I couldn't figure out how to hook up the power light on the front when I built the thing, and because the hard drive wasn't active that light wasn't flashing either. And the ambient noise + aircon means it's too loud to hear the fan.

So no thermal cutout.

What makes a machine hang at 19.44 BST, or in the few minutes after that? I did think the machine had completely hung because I couldn't type anything onto the screen, but then I tried the keyboard on the login prompt and it didn't work there either (plus I initially had the keyboard plugged into the wrong machine, I think. Jeez, what sort of sysadmin am I. I'll tell you: One who's only had 1 cup of tea so far today and got up more than an hour early).

Upshot: I'm not completely sure whether the machine hung or not. I know it wasn't accepting network connections, and I know when this happens it doesn't even answer to ping.

I can't rule out that it hung.

Possibilities, all of which are intermittent:

- network card failure
- hard drive problems
- software problem related to getaddrinfo()

If it's the last, then there won't be any more problems and we're home free.

If it's one of the first two, or related to there being a heat problem (which I think is less probable now), then there are going to be repeated bad and weird problems.

posted by genmon  # 6/24/2004 09:25:00 AM

 

Wednesday, June 23, 2004

To catalogue: Today (2004-06-23) at 19.50; today (2004-06-23) at 01.05 (coming back at 12.26).

Also, various coming-back times of:

- 15 Jun 2004 17:09:00 (down I think from about 15.45, and continuing to be down because of a powercut (perhaps?))
- 16 Jun 2004 11:56:31 (down from the night before I think. Because of a powercut, that time, while I was rsyncing.)
- 16 Jun 2004 15:14:58 (down from about 12.40ish)

On the upside this means BIND is running fine, as I thought it might be a software fault (how a software fault would turn the machine off, I don't know). And it's not a weekly thing, I don't think.

Here are two more coming-back times, from before:

- 18 May 2004 09:21:00
- 18 May 2004 09:42:32

Those times are all BST, and they're there because I've got an in-progress file being edited, and nvi tells me about it every time the box boots. Messy, but I love it (and no idea how it happened in the first place).

The two close together in May are when it went down early-ish in the morning and I went to poke around with the box. That's me rebooting it a couple of times.
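Working out the gaps between those coming-back times is a quick way to look for a pattern (weekly or otherwise). A minimal sketch, using only the times listed in this post; the 23 Jun one is only given to the minute, so its seconds are assumed to be zero.

```python
from datetime import datetime

# Coming-back times listed in this post (BST). The 23 Jun entry is only
# given to the minute, so its seconds are assumed to be :00.
COME_BACK = [
    "18 May 2004 09:21:00",
    "18 May 2004 09:42:32",
    "15 Jun 2004 17:09:00",
    "16 Jun 2004 11:56:31",
    "16 Jun 2004 15:14:58",
    "23 Jun 2004 12:26:00",
]


def gaps(stamps):
    """Return the timedelta between each consecutive pair of timestamps."""
    times = sorted(datetime.strptime(s, "%d %b %Y %H:%M:%S") for s in stamps)
    return [later - earlier for earlier, later in zip(times, times[1:])]


for gap in gaps(COME_BACK):
    print(gap)
```

These are reboot times rather than failure times, so the gaps only bound how long the box stayed up between incidents.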

It went down the first time when I was doing work on the server (can't remember what), the second time when I was in the middle of compiling X, and then again in the middle of running a back-up. There's something going on, I don't know what.

Thermal cut-out say some people. I might try lifting out the back-up drive and seeing what happens. The 15th was the hottest day of the year, but not that that matters (the hosting centre is air-conditioned).

It could be the fan is bust and the thing is overheating. It could be the power supply is dodgy. It could be it doesn't like the two hard drives in there and it is indeed a thermal thing. Jeez. I'll have to go there tomorrow morning and have a look, but I'm not going to be able to tell just by looking.

posted by genmon  # 6/23/2004 08:22:00 PM

 

Back here? Back at ic_temp?

Why are you posting at ic_temp?, you ask.

Why fucking indeed. Why fucking indeed.

ps. My Powerbook hasn't been returned from Apple yet. Three weeks waiting for a keyboard, two weeks to pick it up. A keyboard? It needed the hard drive replaced.

posted by genmon  # 6/23/2004 08:15:00 PM

 

