cmmiller – Princeton CS Status

Email Service Outage – Unplanned Delay

The email server upgrade scheduled for this morning has run into unexpected issues. As a result, email service is not working properly. We are working to correct the situation as quickly as possible, and will update here as new information becomes available.

Update 08:57 – We continue to work to recover the mail systems, but they will not be ready in the original scheduled window. We apologize for the inconvenience.

Update 10:02 – We are working with the vendor to recover the mail systems. We apologize for the inconvenience.

Update 15:18 – We now believe the service is back to normal operation. Most incoming emails were likely queued and have probably been delivered by now. If you have ongoing issues, please reach out to CS Staff. Thank you for your patience!

Widespread Outage of CS Services – 2016-05-03

All / By cmmiller

At this time, CS Department services are experiencing a widespread outage. We are aware of the problem and are working to restore service. More updates will be posted as we learn more.

Cancelled: CS Network Downtime, Wed, November 27, 2013, 06:30-08:30

All / By cmmiller

Due to a failure of our LDAP servers, the work announced below has been postponed to Tuesday, December 10, 2013, from 7:30am-8:00am (10 Gb/sec link) and Thursday, December 19, 2013, from 6:30am-8:30am (switch updates).

Who is affected:
All users of CS Department wired networks, including network users in the CS Building, 221 Nassau St, Sherrerd Hall, Friend Center, and 151 Forrestal Road (the HPCRC).

This outage will also affect the OIT wireless network ONLY in the CS Building.

What is happening:
During this time, the Department’s network uplink (to the main campus, and onward to the greater internet) will be upgraded from its current 1Gbps link to a 10Gbps link.

Also during this downtime, several network switches will be updated to newer versions of their operating systems.

Actual outages are expected to be much shorter than the full window, but outages local to any particular location will occur at various times throughout the window for 5 to 10 minutes at a time.

Why is it happening:
The campus uplink upgrade will enable greater opportunities for collaboration by increasing the rate at which data can be moved across the network. This will enable high-speed data transfer between CS Department systems and systems in other departments on campus and, in some circumstances, other remote locations.

The switch software updates will address a few minor bugs which have come up lately.

CS Network Emergency Maintenance Downtime, Thu, September 26, 2013, 07:00-08:00

All / By cmmiller

Who is affected:
All users of the CS Department wired network or network services (including e-mail, web service, etc.)

What is happening:
Our core network router will be patched and rebooted to address a bug in its operating system that is causing some minor connectivity issues. During this maintenance, all traffic between CS subnets will be interrupted. Existing TCP connections may be disconnected.

Actual network downtime is expected to be on the order of 10-15 minutes, but a full hour has been scheduled in case of unexpected trouble.

Why is it happening:
As mentioned above, a router bug is causing some minor connectivity problems. We hope to address the problem before it becomes more widespread or causes bigger issues.

CS File System Maintenance, Mon, August 12, 2013, 09:00-16:00

All / By cmmiller

This maintenance is the work rescheduled from July 25, 2013.

Who is affected:
Potentially, all users of CS Department computing and storage infrastructure.

What is happening:
On Monday, the storage cluster that provides file service for all CS Department public services will undergo upgrades to its network interfaces.

During this maintenance, connections to Samba/CIFS file shares will experience occasional, temporary interruptions. Users may need to re-establish connections to the file share in some cases.

We DO NOT anticipate actual downtime for other services, but since all maintenance involves the risk of broader problems, this announcement serves as a heads-up. In the event of problems with this maintenance, impacts could range up to outages of ALL CS Department public services that rely on the storage cluster (file service, public login hosts, e-mail, web service, etc). Note that wired and wireless networking connectivity will not be affected.

Why is it happening:
This upgrade will provide 10Gbit network connectivity to our storage cluster, getting us one step closer to end-to-end 10Gbit capability for our services.

Update – 14:12 – The upgrade has run into issues that are causing some services to be slow to respond or unavailable. We apologize for the trouble, and are working with the vendor to get things back in full working order.

Update – 16:40 – All services should be back to normal operation now. Please contact CS Staff if you continue to have problems.

Wireless networking in all Administrative and Academic Buildings on both Main Campus and Forrestal B Site 6/15/2013 from 06:00 to 10:00

All / By cmmiller

This is a repost of an OIT message. This downtime is being performed by OIT but affects wireless networking in the CS Building.

Service(s) affected: Wireless networking in all Administrative and Academic Buildings on both Main Campus and Forrestal B Site 6/15/2013 from 06:00 to 10:00.
Date/time of outage: 06/15/2013 6:00 am
Duration of outage: 4 hours 0 minutes

There will be a wireless networking outage in all administrative and academic buildings on both Main Campus and Forrestal B Site on 6/15/2013 from 06:00 to 10:00 AM. This is needed for a system reconfiguration.

CS Penguins / Cycles System Downtime / Replacement, Tuesday, June 18, 2013, 06:00-08:00

All / By cmmiller

Who is affected:

All users of the CS Department “cycles” or “penguins” systems (soak, wash, rinse, spin, opus or tux).

What is happening:

On Tuesday morning, the cycles systems will be replaced with newer, faster hardware with more memory. The existing cycles machines will later be added to the ionic compute cluster. At the same time, opus and tux will have OS updates applied.
SPECIAL NOTE: As we are replacing the hardware, the operating systems on the cycles systems will also be reinstalled, so all crontabs will be deleted. You will need to back up your crontabs before the downtime, and put them back after.
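
For anyone who would rather script the crontab backup and restore described in the note above, here is a minimal sketch in Python. It relies only on the standard `crontab -l` (print the current crontab) and `crontab FILE` (install FILE as the crontab) commands. The function names and the backup location are hypothetical, and the sketch assumes the backup file is kept somewhere that survives the reinstall, such as your home directory.

```python
import subprocess
from pathlib import Path


def backup_crontab(backup_path, run=subprocess.run):
    """Save the output of `crontab -l` to backup_path.

    Returns True if a crontab was saved, False if `crontab -l` failed
    (it exits non-zero when the user has no crontab installed).
    The `run` parameter is a hypothetical hook that makes the function
    easy to test; by default it is subprocess.run.
    """
    result = run(["crontab", "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        return False
    Path(backup_path).write_text(result.stdout)
    return True


def restore_crontab(backup_path, run=subprocess.run):
    """Install the saved file as the current user's crontab.

    Note that `crontab FILE` replaces any crontab already installed.
    """
    run(["crontab", str(backup_path)], check=True)
```

Before the downtime, call `backup_crontab(Path.home() / "crontab.backup")` on each cycles machine where you have a crontab, using a different file name per machine if the crontabs differ. After the downtime, call `restore_crontab` with the same path on the replacement machine.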

Why is it happening:

As part of routine maintenance, and to provide enhanced capabilities, the hardware running our systems is periodically replaced, and OS updates are periodically applied.
Further, OS updates are urgent at this time for all of our Linux systems because of recent kernel vulnerabilities that we have seen being actively exploited.

IMAP / Webmail Outage – Unplanned – 2011/08/26

All / By cmmiller

Due to a hardware problem with our VMware infrastructure, the IMAP and Webmail service for the department is currently offline. We are actively working to remedy the situation, but do not presently have an ETA for a fix. More information will be posted as we have it.

Update (23:54): The broken host has been successfully removed from operation, and things should be working on another host now. IMAP and Webmail service should be restored in the next few minutes.

Update (00:00): IMAP and Webmail service have been restored. It may take a short while for queued e-mails to be delivered, but no messages should be lost.

Further Emergency Downtime & Update

All / By cmmiller

In response to the file server problem we have been working on since Tuesday, a vendor field engineer will be coming out tonight to help troubleshoot and diagnose the problem. He is expected to arrive around 20:00 (8:00 PM) tonight, Thursday night.

It is very likely that downtime of the server will be required while the engineer is here. This will mean that all services provided by CS Staff, including public cycle servers, clusters, email service, web service, DNS, etc., will be shut down, as they all depend on the file server. Please make sure to regularly save any data you are working on to protect against losing data when services are shut down.

We appreciate your patience throughout these last few days, and apologize for any inconvenience.

Update 2007/07/13 @ 03:20 – After over 7 grueling hours of emergency downtime and troubleshooting, the network and systems are again up and running, and initial signs look good. While we hesitate to declare victory, we would ask that you please report any instability you notice with as much detail as possible about what you were doing and what failed.

We thank you again for your patience, especially those of you working toward deadlines. If we have indeed licked this issue, look forward to some exciting announcements in the coming weeks.
