Outages


The StartCom SSL certificates will expire soon. For the last 10 hours web pages were served using a new set of certificates. However, Firefox reports the new certificates have been revoked. Reverted to old certificates for the time being.

Update: The problem was that Firefox does not trust newly issued StartCom SSL certificates!

sec_error_ocsp_unknown_cert

All SSL access to hosted websites is currently broken using Firefox and Safari. The problem is that OCSP lookups are failing due to a misconfiguration by StartSSL, the certificate authority providing our SSL certificates. Other browsers are not affected because they have not implemented OCSP lookups. A temporary work-around is to disable the checks. In Firefox:

  1. Enter about:config into the Firefox address bar
  2. Accept the warning
  3. Search for security.ssl.enable_ocsp_stapling
  4. Double-click to change to false

StartSSL does not seem to have any blog or social media presence to indicate status on the outage, but a Twitter search yields a steady stream of corroborations. Hopefully they’ll fix it soon.

UPDATE 2015-04-05 11:45 pm: OCSP is working properly again as of a couple hours ago. Outage/technical problems on StartSSL’s part.

UPDATE 2016-11-26: OCSP responses have had bad signatures for the past two days. I believe this is once again an issue on the StartSSL side. Can be worked around the same way.

Updated kernel, added disk space. Brief outages around 12 am PST.

Updated server to Ubuntu 14.04 and bumped up RAM. Brief outage around 11 pm PDT.

Brief outages on kiwi over the weekend (website, blogs, mail) for opsys and disk space upgrades. For most clients, any SSL connections will now have perfect forward secrecy. One more brief outage coming soon to move to a cheaper server in the data center, which will require an IP address change.

Updated 2013-11-15: Moved to new server. IP addresses changed, DNS updated. Everything seems to be back up and running. Down for about an hour.

Two outages this month! Kiwi was down for about an hour Saturday evening for a biennial upgrade to a new Ubuntu LTS release. Everything should be back to normal now.

Web pages (including blogs) and email were intermittently unavailable between 8:15 and 9:15 pm (PST). Kiwi was moving to a new host and getting more RAM and disk space. IP address and DNS are unchanged.

kiwi was inaccessible for about an hour this evening due to a router failure. Don’t worry, it’s not the router failure the SOPA-supporting elephant killer had. RimuHosting runs the show here.

kiwi was down for about 3 minutes this afternoon to upgrade disk space. I also applied the latest WordPress security patch.

Just before 10 pm PST on 2012-02-16 Apache memory usage got out of control and made kiwi unresponsive for nearly an hour. Everything back to normal now but root cause remains elusive.

Update 2012-02-21: Most likely cause is that I simply allowed Apache to spawn too many threads. If correct, things are under control now.
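As a rough illustration of the sizing arithmetic behind that fix, here is a minimal sketch. The function name and all figures are hypothetical, not kiwi's actual numbers; it only shows how a worker cap can be derived from available memory so Apache cannot outgrow the machine.

```python
def max_apache_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Estimate how many Apache workers fit in RAM without swapping.

    total_ram_mb:  physical memory on the server
    reserved_mb:   memory set aside for everything else (MySQL, mail, OS)
    per_worker_mb: typical resident size of one Apache worker
    All three are assumptions you would measure on your own machine.
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")
    usable_mb = total_ram_mb - reserved_mb
    # Never return a negative cap if the reservation exceeds total RAM.
    return max(usable_mb // per_worker_mb, 0)


# Hypothetical figures: 1 GB server, 384 MB reserved, 40 MB per worker.
print(max_apache_workers(1024, 384, 40))  # prints 16
```

The result would then be used as the upper bound for the worker limit in the Apache MPM configuration.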

Kiwi (hosting everything except the Gallery) was down for an hour starting 2012-01-09 13:45 (PST) due to RAID issues on the host machine. Alas, for a while now MySQL hasn’t been starting automatically on reboot. I forgot about this and therefore WordPress was out of commission for two days. Mea culpa. I have now finally fixed the MySQL configuration so it should be able to start automatically again.

Kiwi (web presence other than Gallery; email) moved to new hardware in the RimuHosting Dallas data center. Outage was 12:51 – 1:24 am Pacific time. IP addresses changed from 72.249.17.232 to 74.50.48.111.

In unrelated news, I’ve taken down the public nerdylorrin.net wiki. It turns out not many people were excited about poking around my notes.

Kiwi ran completely out of memory and crashed hard just before noon (PDT) today. Back up around 9:15 pm (PDT).

Gallery, wiki unaffected (they run on different servers).

Update 2010-08-06:

I’ve belatedly taken some steps to reduce memory usage. SpamAssassin and Apache will now fork fewer child processes, which will hopefully keep things under control.

It looks to me like Dreamhost did an upgrade that broke my PHP installation. Gallery was returning HTTP 500 errors from 9 am to 11 pm PDT. Fixed.

(belated notification)

Two weekends ago I upgraded kiwi to the latest Ubuntu release. I think the only visible effect is a major update to the WordPress administrative interface. For those of you using webmail (oh wait, that’s no one outside of this house) the upgrade paves the way for major updates there as well. Blogs were down for a couple hours, everything else for a few minutes.

Kiwi ran out of disk space and memory (perhaps running out of disk space resulted in no room for the swap file to grow?) around 1 am PST this morning. Upgraded both and brought back online around 6:45pm PST.

Web sites (except Gallery and Wiki) and mail were down. Everything should be back to normal now.

Module misconfiguration on kiwi brought Apache down from 9:15 – 11:00 PM (PST). Web sites (except Gallery and Wiki) were down. Everything should be back to normal now.

An upgrade on kiwi didn’t go as smoothly as one might hope. Web sites (except Gallery and Wiki) were down starting around 9 pm PST. Blogs in particular were down for about three hours. Everything should be back to normal now.

A brief power outage exceeded the UPS’ capacity. Time to get new batteries! Services were down approximately 11:40 p.m. to midnight PDT.

Websites (other than wiki and gallery) and mail were out a few hours this evening (approx 4:00 – 7:15 p.m. PDT). They’re hosted on kiwi, which locked up and needed to be restarted. Unlike the December outage, this time it was just kiwi and not the parent Xen host. Root cause unclear. SpamAssassin was churning through a bunch of spam at the time, but that could be a coincidence.

Websites (other than wiki and gallery) and mail were out a couple times this afternoon. The parent Xen host crashed twice and was rebooted by RimuHosting. Sigh. Hopefully it’ll stay up this time.

Update 12-25: I must have jinxed it. Host continued to have issues and was out overnight. (Sleigh riding with Santa?) Seems to be back up now after a Xen upgrade.

Another four months, another outage. Comcast was down for a couple hours, presumably due to snow. Gallery and wiki, which are hosted in the basement on carrot and tomato, were inaccessible.

There was also a gallery and wiki outage about two weeks ago due to an unexpected IP address change.

As usual, mail, blogs, and other websites were unaffected. They’re hosted on kiwi, which is located in a data center in Dallas.

I’ll be moving the servers rhubarb, carrot, and tomato back down to the basement today (basement renovation is nearly done!). Consequently the Gallery and Wiki will be down for a few hours.

Mail, blogs, and other websites will be unaffected. They’re hosted on kiwi, which is located in a data center in Dallas.

Update 7:35 p.m. PDT: Back online!

While we were out of town a power loss took down the gallery and wiki. They’re on UPSes but apparently not big enough ones!

A disk failure crashed carrot around 6:00 PDT. Everything is back up as of 16:30 PDT and the RAID array is rebuilding in the background.

I’ll be moving the servers rhubarb, carrot, and tomato today (basement renovation!). Consequently the Gallery and Wiki will be down for a few hours.

Mail, blogs, and other websites will be unaffected. They’re hosted on kiwi, which is located in a data center in Dallas.

Update 1:30 PDT: Move complete!

RimuHosting is moving our main server, kiwi, to a different cabinet on Monday, January 21st @ 4 PM PST. Services will be down for approximately half an hour.

The Gallery and Wiki are unaffected. Inbound email will be accepted but not delivered until the outage is over.

Update: Maintenance was delayed by 24 hours.

The main server, kiwi, now has a little more than twice the memory it used to. This should improve website responsiveness and put an end to the sporadic outages of the email spam filter. The upgrade involved a few reboots over the weekend.

Gallery and Wiki were down for ~ 10 hours. The usual drill: Comcast changed my IP address and everything hosted in the basement became inaccessible. I didn’t notice right away because I had forgotten to update the health checks to specifically test the Gallery and Wiki now that they’re hosted separately from everything else. Then again, Esmae keeps us so busy I probably wouldn’t have noticed the alert emails pouring in…
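A minimal sketch of a per-service health check follows. It is not the actual monitoring in use here: the URLs and function names are hypothetical placeholders. The point is that each separately hosted service gets its own probe, so an outage on one host cannot hide behind a healthy response from another.

```python
import urllib.error
import urllib.request

# Hypothetical URLs; one entry per separately hosted service.
SERVICES = {
    "blog": "https://example.org/blog/",
    "gallery": "https://example.org/gallery/",
    "wiki": "https://example.org/wiki/",
}


def http_ok(url, timeout=10):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts, HTTP errors.
        return False


def failing_services(services, probe=http_ok):
    """Return the sorted names of services whose probe fails."""
    return sorted(name for name, url in services.items() if not probe(url))
```

Calling `failing_services(SERVICES)` from a scheduled job and emailing any non-empty result would cover the Gallery and Wiki independently of everything else.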

On July 25th rhubarb, the firewall machine in the basement, had a spectacular hard disk crash. It sounded like a cat fight and startled us out of bed! 😯

Rhubarb kept routing traffic, so I was lazy and didn’t get around to rebuilding on a new disk until today. Carrot, which hosts the Gallery and the Wiki, was inaccessible during rhubarb’s rebuild. Kiwi, which hosts everything else and lives in a data center in Texas, was unaffected.

The gap in the traffic graph below reflects the time during which rhubarb had no working hard disk and couldn’t record log files. If you had been attempting to hack in to my network, that would have been a good time to do so surreptitiously!

red-month.png

All services are now back up and running. I revived carrot (that’s the server in the basement). Everything that couldn’t be migrated to kiwi (that’s the new hosted server) is once again running on carrot. This includes the gallery and all Tomcat webapps (Wiki, Calendar).
I had liked the idea of no longer maintaining a server in the house, so I’m still looking into ways of migrating those.

While reviving carrot, I learned that there was no disk corruption. Carrot had long ago become unbootable and I just never noticed! In September 2006 the boot menu was incorrectly written. I don’t have logs to confirm it, but I assume this was the first time carrot had rebooted since then.

On Friday morning the server in my basement overheated and became completely unresponsive. It failed to properly reboot afterwards, perhaps due to slight corruption of the files needed for booting or a latent misconfiguration. I opted to accelerate the move to the new hosted server rather than try to bring the old server back to life. Mail and static web content were migrated by Saturday March 3rd. Blogs and some smaller features followed the next day. I’ve been ironing out the kinks and bringing remaining pieces back online throughout the week. Still down are:

  • Gallery
  • Web mail
  • Wiki

Web mail will be restored before long. The others will need more time. 🙁 If you see anything else amiss, please assume I’m unaware of the issue and let me know.

2007-Feb-08 Outage

All services down from 2:45 to 6:15 PST (plus up to fifteen minutes for DNS propagation). Comcast changed my IP address (again!). Faster recovery this time because I set up some monitoring and shortened the DNS timings. But I would have rather stayed in bed. Still looking into options for solving the problem permanently.
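One way such monitoring could detect an address change is to compare the published DNS record with the connection's current public address. The sketch below is an assumption about the approach, not the monitoring actually set up here; the function name is hypothetical and the caller must supply the current public IP from whatever source they trust.

```python
import socket


def dns_is_stale(hostname, current_ip, resolve=socket.gethostbyname):
    """Return True if hostname's A record no longer matches current_ip.

    current_ip is the address the ISP has currently assigned; how it is
    obtained (router status page, external lookup) is left to the caller.
    """
    try:
        return resolve(hostname) != current_ip
    except OSError:
        # Resolution failed outright; flag it so a human takes a look.
        return True
```

With a short DNS TTL, the time clients keep seeing the old address after a fix is bounded by that TTL, which matches the drop from two hours to fifteen minutes between the two February outages.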

2007-Feb-06 Outage

All services down from 1:45 to 17:06 PST (plus up to two hours for DNS propagation). Comcast changed my IP address (again). It looks like they offer static IPs now; I’ll look into getting one.

http/https (web) services down from 7:28 to 9:07 PDT. Same as last week.

Update 2006-Jun-07: As a work-around, Tomcat is now being bounced on an automatic basis. This is keeping Apache stable at the price of a few minutes of unresponsiveness each night for the Java webapps (Calendar, Wiki, etc.). This was implemented in early May and seems to be working.

2006-Apr-28 Outage

http/https (web) services down from 12:24 to 20:53 PDT. Somehow Apache’s getting completely tied up waiting for Tomcat. Added timeouts and periodic connection recycling. This happened once earlier this year and I forgot to log it.