Updated November 21, 2013

Background

There are almost 1200 computers in instructional labs throughout the College of Engineering. Just over 70% of these run Windows and the rest run Linux. The name “EWS” originally referred solely to the systems in the College-wide labs, which make up not quite a third of the 1200. Today, “EWS” often gets used to refer to any of the labs, college-wide or departmental, as they are all now run by the same staff and use the same infrastructure. To students, they largely all look the same except for physical location.

Summary of Short Term Solution Changes

Setup Prior to Short Term Solution Deployment

Common home directories across both Windows and Linux workstations.

One very large file server running OpenIndiana, a community-source version of Solaris.

ZFS filesystem used to manage quotas and provide snapshots (allowing for quick file restores).

Server provides file shares via both NFS (for Linux clients) and CIFS (for Windows clients).

Server also provides shares for courses and project groups.

16TB of backend storage provided by four high-end Dell EqualLogic PS6010XV storage arrays.

All issues have been with the server and OpenIndiana, not with the EqualLogic storage.

Initial Intended Short Term Solution

Leave Linux home directories on existing server.

Move Windows home directories to four new servers (distributed roughly evenly by last name).

Move course and project group space used by Windows clients to a fifth new server.

Users will be able to transfer files between their Windows and Linux home directories.

Implemented Short Term Solution

Windows home directories were migrated to four new servers.

Linux home directories were migrated from the existing servers to two new servers.

Course and project group space was moved to a new server.

The short term solution that was implemented added additional servers and resulted in Linux home directories being migrated to new servers. This was necessary due to continued performance problems discovered on the existing Linux server even after the Windows home directories were migrated.

The new servers are Dell PowerEdge R720xd’s, each with dual 8-core Intel Xeon processors, 384GB of RAM and 14.4TB of raw storage (6TB usable). They were spec’d and purchased to be part of a very large virtual machine farm. All will be running Windows Server.

Individually, each new server is more powerful than the existing file server. More importantly for us, we have a much higher degree of confidence with Windows Server than we do with OpenIndiana at this point, especially for just serving Windows file shares and with the load being spread across four systems.

This does leave the Linux clients using the existing problematic server. We’re hopeful that removing the load of serving the Windows clients will be sufficient to stabilize it but this remains a point of concern and something we will be closely monitoring after the changes are complete. There are two factors that, when combined, give us the hope that the changes we are implementing will let the existing server adequately handle just the Linux clients. First, over 2/3 of the load on the server is typically from the Windows clients. Second, some of the issues we see on the existing server appear to be triggered by a cumulative load effect rather than just a heavy point-in-time load.

Implementation Plan

This plan for implementing the short-term solution has been approved.

Migration Scheme

All labs will close at 9pm on Friday, October 11. This includes all remote access services.

Windows file sharing will be turned off on the existing file server.

Logins from workstations in all departmental and college-wide instructional labs will be disabled.

All Windows clients will be changed to mount home directories from the new file servers.

For students who have exclusively used Windows this semester, all files except a Linux-specific set (e.g. .bashrc, .mozilla cache directories) will be transferred to the new Windows home directory servers.

Students who have exclusively used Linux this semester will remain on the Linux file server.

Students who have used both platforms this semester will be polled in advance as to whether they want their files migrated or left on the Linux home directory server. Students who do not respond and who have used Windows at least twice as much as Linux will have their home directory files migrated to the new servers.

All students who do not have their files migrated to the new file servers will have a default profile set up as their new Windows home directory.

All labs, including remote access services, will be available again no later than 10am on Sunday, October 13.
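The per-student rules in the migration scheme above can be summarized as a small decision function. This is only an illustrative sketch: the names `Plan` and `choose_plan`, and the idea of measuring usage as a single number per platform, are assumptions made for the example and do not describe the actual migration tooling. The fallback for students who used both platforms, did not respond, and did not meet the two-to-one threshold is also an assumption (files stay on the Linux server).

```python
from enum import Enum
from typing import Optional


class Plan(Enum):
    MIGRATE_TO_WINDOWS = "move files to the new Windows home directory servers"
    STAY_ON_LINUX = "leave files on the Linux server; Windows home gets a default profile"


def choose_plan(windows_use: float, linux_use: float,
                poll_answer: Optional[Plan] = None) -> Plan:
    """Pick a plan from this semester's usage (hypothetical units, e.g. hours)."""
    if windows_use > 0 and linux_use == 0:
        # Exclusively Windows: files move to the new servers.
        return Plan.MIGRATE_TO_WINDOWS
    if linux_use > 0 and windows_use == 0:
        # Exclusively Linux: files stay where they are.
        return Plan.STAY_ON_LINUX
    if windows_use > 0 and linux_use > 0:
        # Used both platforms: the student's poll response wins if there is one.
        if poll_answer is not None:
            return poll_answer
        # No response: migrate only if Windows use is at least twice Linux use.
        if windows_use >= 2 * linux_use:
            return Plan.MIGRATE_TO_WINDOWS
    # Assumption: everyone else (including unused accounts) is left in place.
    return Plan.STAY_ON_LINUX


if __name__ == "__main__":
    print(choose_plan(windows_use=12, linux_use=5).value)
```

Keeping the poll response ahead of the usage threshold mirrors the order in which the plan states the rules: the threshold only applies to students who do not respond.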

Background on the Migration Scheme

Work was also done on alternative migration schemes that would not require any down time for the labs. One scheme would migrate users in batches over a series of evenings until everyone had been migrated to the new setup. A second scheme would switch all users over to empty Windows home directories and then provide for migrating files from user home directories while the labs were up and running. Information regarding why neither of these options was chosen is included in the next section.

We have always believed that the best scheme would be to migrate everyone’s files in one evening while the labs are normally closed. The number of files and volume of data to be moved has been the barrier to being able to do this. The first estimate we made was that it would take two weeks to complete the migration. We now believe we can do this much faster thanks to two things.

A technical scheme has been developed to lower the amount of time to transfer the files from the old to new home directories and have the files be immediately usable (which would not be the case with one of the alternative migration schemes). This is being done by very careful and actively controlled load balancing of data transfers to all four new servers simultaneously. It cannot be done while the labs are open as it maxes out what capacity we do have on the old server.

In addition to the technical improvement, an analysis of lab usage patterns revealed two facts of which we can take advantage. First, thousands of students have yet to log in to their accounts this semester. That is likely due to reasons such as not being enrolled in a course that uses any computer labs this semester or being in a course that only uses the labs later in the semester. Regardless of the reason, we have far greater flexibility in dealing with these accounts since they aren’t currently active.

The usage analysis also showed that 90% of the users this semester have exclusively used either Linux or Windows. The remaining 10% have used both platforms at some point during the semester, with half of that number using one of the two more than three times as often as the other.

Waiting to deal with accounts that haven’t been used this semester and the identification of thousands of accounts that have exclusively used Linux and should be fine with an empty Windows home directory allows us to dramatically drop the number of home directories that we have to migrate. In addition, while there are twice as many Windows-only users compared to Linux-only users, the average home directory size of the Linux-only users is dramatically larger. This further reduces the amount of data we need to transfer.

Alternative Migration Schemes Not Recommended

Staged Migrations

We could avoid an outage of the labs by migrating home directories in stages over multiple nights. We did explore moving users only between roughly 1am and 7am (an hour after close and an hour before open on weekdays). The original estimate was two weeks to complete all migrations in this manner. Even though we have reduced the amount of time required overall to transfer files, there is still some amount of startup and cleanup work required during any given migration window. This startup/cleanup overhead means that only a portion of each migration window can be used to actually move files.

Even with our improved transfer scheme and improved scoping of what we must migrate while the labs are offline, it would still take at least 3, and likely 4, evenings to complete the migrations. There are multiple reasons we prefer not to do this:

This increases the technical complexity of the home directory structure until the migrations are complete, as both Windows home directory services are in operation. This increases the chances of unexpected issues and would make troubleshooting more difficult.

The Windows workstations should pick up the new home directory paths automatically but, based on past experience, some set of them won’t. The only way to guarantee they do is to reboot them. That would require us to potentially reboot over 850 workstations every morning.

The user experience would be inconsistent during the transition between those who had been migrated and those who had yet to be moved.

The remote access servers would have to be offline every evening so staging over multiple nights doesn’t truly avoid all service disruptions.

It extends the period of time before we remove all Windows file sharing load from the server that will still be serving the Linux home directories.

The risk of human error is greatly increased. The migration process is going to require manual oversight even though much of the work is automated. The staff who need to do this work are already working extended hours to ensure that issues that do occur with the current file server are corrected as soon as possible. The more overnight work shifts that are required to do the migrations, the greater the chance of mistakes occurring.

Delayed Migrations

In this scheme there would be a cut-over date when we would switch everyone over to having new Windows home directories after close on a Saturday evening. The remote access servers would only be offline for a single evening. All labs would open as scheduled on Sunday with a catch: most of the new home directories would be empty. The ones that weren’t empty would have a single ZIP file that would have all of the person’s files from their old home directory but they would still have to extract them on their own. Note that this is an issue only for those using the Windows systems.

The ZIP file comes in to help with the speed of transferring files from the old to new Windows home directory servers. It helps deal with the fact that the TBs of instructional lab home directory data is mostly made up of very small files. We couldn’t do this migration work for all students in a single evening so we were working on multiple mechanisms to determine who to move immediately after implementing the change. This included students in courses known to have deadlines in the near future and students who requested to move immediately via a webform.
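To illustrate the bundling idea described above, here is a minimal sketch of packing a home directory's many small files into a single archive so that one large file is transferred instead of thousands of tiny ones. The function name `bundle_home` and the choice of DEFLATE compression are assumptions for the example; this is not the tooling that was actually under development.

```python
import zipfile
from pathlib import Path


def bundle_home(home: Path, archive: Path) -> int:
    """Pack every regular file under `home` into one ZIP; return the file count."""
    count = 0
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(home.rglob("*")):
            # Skip directories, and skip the archive itself if it lives inside `home`.
            if path.is_file() and path.resolve() != archive.resolve():
                # Store paths relative to the home directory so the archive
                # extracts cleanly into the new location.
                zf.write(path, arcname=path.relative_to(home).as_posix())
                count += 1
    return count
```

A caller would pass the old home directory and a destination for the archive, for example `bundle_home(Path("old_home"), Path("old_home.zip"))`, where both paths are placeholders.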

This scheme keeps the labs open but has a number of issues that we believe would all end up being disruptive, many even if we were able to create the ZIP files for everyone in a single evening. The list of potential / probable issues includes:

Students accidentally delete or otherwise munge the ZIP file (e.g. by trying to work with it while it is still being created).

Students need assistance with the ZIP file and can’t work until they get it. We would have documentation and staff on-hand for some time after the transition but there would still be delays in students being able to work. This would be especially problematic in cases where files were needed for scheduled lab sessions.

Students need to request creation of ZIP file. Anyone not migrated the first evening would have to request creation of the ZIP file. We would provide several mechanisms to do this but it would still need to be done and create another delay before students who needed their old files could work. We could potentially continue to automatically create the ZIP files for students on subsequent evenings. This had not yet been fully worked out when we decided to recommend a different migration scheme.

Non-deterministic load on the current file server during operational hours. Many students are going to need their old files and given that we couldn’t migrate everyone during off hours, they would be triggering this to happen during the day and evening when the labs are open. This is believed to have a high likelihood of causing additional outages of the current server. This would negatively impact the Linux users and also everyone whose files were in the process of being migrated.



There are some other issues with this scheme that would require extra work by the IT staff (quota adjustments to deal with extra-large ZIP files) but these would not directly affect students or courses.

Windows Home Directory Migration Completed (as of 10:00am on October 13)

All Engineering computer labs re-opened at 10am on Sunday, October 13 as scheduled. The remote Linux access servers have been accessible since 11:00pm on Saturday, October 12.

All college-wide and departmental labs, including remote access servers, were taken offline starting at 9:30pm Friday, October 11, 2013 in order to implement changes to address the ongoing performance problems with the labs. Additional information is available about:

Continued Linux Home Directory Problems Following Windows Migration

Following the successful migration of Windows home directories, Linux home directories stored on the old server continued to experience problems. To address this, Linux file access was taken offline at midnight on October 25 and Linux home directories were migrated to two new Linux file servers.

Linux Home Directory Migration Completed (as of 8:00am on October 26)

The migration of Linux home directories to new file servers successfully completed by 8am on Saturday, October 26 as scheduled. The remote access servers were back in service by shortly after 8am and all Linux labs will be open for their normally scheduled hours. Details about the migration and issues of which to be aware:

Linux home directories were only migrated for users with at least 5 minutes of recorded Linux client usage as of October 13. This was necessary in order to complete the migration of active users in time to open the labs as scheduled this morning. Anyone who was not migrated and requires assistance should email ews [at] illinois.edu. See below for how to access the old home directory contents on your own.

Empty directories were created for users whose directories were not migrated. Anyone missing a home directory, or having any issues with their new home directory, should email ews [at] illinois.edu or talk to one of the student workers on-duty in the labs.

Read-only access to the home directories on the old file server is available through the path /ewshomeold/?/NetID where ? is the first letter of the NetID.

The ews-restore command does not currently work due to its dependencies on the old file server. It will be updated on Monday to work against the old home directories, primarily so that files moved from Linux to Windows during the October 12 home directory migration can still be recovered. Originally the ability to do this was only going to be available through the end of October. We are making the changes necessary to allow this functionality through the end of this semester.

There may be some brief, unannounced outages for the next few days while we adjust some configuration parameters on the new file servers. We will try to make changes only during off-hours but will do so during the day if necessary to ensure good performance of the new servers.

Snapshots are not currently enabled on the new file servers and it will take the CITES TSM service a while to do an initial backup of the new servers. Please be cautious, as file recovery will likely not be possible for at least a week. One of the features of the old file system (ZFS) that actually did work well was snapshots. The new servers are using XFS and we need to test the snapshots feature of it under Linux before enabling it.

We are going to leave in place through the end of the weekend the workstations converted to standalone mode as a partial workaround. They will be converted back to regular mode on Monday.

We are working to verify that all Linux workstations have picked up the configuration changes necessary to mount the new home directories. It is possible that a few did not pick up the changes and will need to be either rebooted or have manual reconfiguration work done to them.

All non-user home directory shares (e.g. /home/class) are still using the old file server. We are continuing to work on migrating this data to new file servers, with priority being given to /home/class. We should be able to switch /home/class over this weekend. It will be end of day Monday before all of the shares can be migrated.
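As a small illustration of the read-only path pattern for old home directories noted above (/ewshomeold/?/NetID), the following sketch builds the path for a given NetID. The function name `old_home_path` and the sample NetID `jdoe2` are made up for the example.

```python
from pathlib import PurePosixPath


def old_home_path(netid: str) -> PurePosixPath:
    """Return /ewshomeold/<first letter of NetID>/<NetID>."""
    netid = netid.strip()
    if not netid:
        raise ValueError("NetID must not be empty")
    return PurePosixPath("/ewshomeold") / netid[0] / netid


if __name__ == "__main__":
    # "jdoe2" is a hypothetical NetID used only for illustration.
    print(old_home_path("jdoe2"))
```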

Partial Workaround (from October 23 status update)

NOTE: This setup will remain in place through at least the weekend of October 26-27. Assuming the new servers perform as expected, the converted workstations listed below will be reverted back to a standard configuration on Monday, October 28.

We have set up a number of Linux workstations in Everitt, Grainger, and Siebel that use a local-to-the-system home directory instead of the troubled network storage. This allows working on the local system while avoiding most of the performance issues caused by the problematic server. A link is provided to the user's networked home directory so that those files can still be accessed.

Once files are copied to the local system (from the networked home directory, via SVN, from a thumb drive, or cloud storage) it should be possible to do most work without being impacted by the server issues. It is critical that any work done be copied back to another location and not left on the local workstation. Files left on a workstation will not be accessible from any other workstation. In addition, files left may be cleaned out in order to keep the local disk from filling up.

The locations of the modified workstations:

252 Everitt: 4 stations immediately upon entering the lab

57 Grainger: 10 stations along the wall starting from the student consultant station

222 Siebel: all 36 stations

The modified workstations have a different background, with a picture of an island, and a warning notice that "This system uses a local home directory!!". Launching a terminal window will also provide instructions on how to access the networked directory.

This is NOT a permanent solution. It is being provided in an effort to allow students to work while a real solution is implemented.

Q&A About the Outage

Will the course websites also be affected?

No. All course websites should be accessible throughout the outage of the lab computers.

Is there any way to have some labs still open?

We may be able to accommodate some use cases even during the full outage. For example, there is a conference at Siebel Center that is scheduled to use a lab during the outage for a programming contest. They typically use the systems in a specialized standalone mode, and we can make it fully standalone.

Will Subversion / SVN still be available during the planned outages?

Yes. Subversion uses different storage and will stay up during the EWS lab outage.

If we use both Windows and Linux lab machines on a regular basis, would it be possible for us to have our files remain on the Linux servers and also be copied to the Windows servers so that we do not have to start with an empty Windows home?

You should have received an email the afternoon of Wednesday, October 9, explaining which option was selected for your home directory storage plus a reference to an online form at http://go.illinois.edu/EWSmigration to allow you to override that selection. At this time, the only option is a move, not a copy, of your home directory data. This is a deliberate design choice of the migration in order to reduce the strain on the Linux file server. A secondary consideration is that different sets of files are likely used on each platform (e.g. cs225 homework on Linux; ece391 in Windows) so this also reduces unneeded duplication.

For those who use both Windows and Linux machines regularly, we suggest you migrate your files to Windows. Post-migration, on the Linux side, you can use the ews-restore command to easily copy back the files you need in that environment.

Guides on moving data between the Windows and Linux home directories are being written and will soon be posted to the EWS and performance websites.

Where should I go if the lab is shutdown? Homework is due the following week and programs that are essential (Matlab and Creo) I only have access in EWS labs!

A number of courses will be giving extensions to assignments, especially if the due dates are very near this weekend. Some software packages, such as MATLAB, are available on the CITES ICS labs. MATLAB is also available for download at the CITES WebStore and is free for use by students on computers connected to the University network (either directly or via the CITES VPN service). Other software packages may be available via WebStore as well. If you have questions about a particular package, please email ews [at] illinois.edu. We are aware, however, that there are many licensed packages that simply won't be available for use while the labs are down. The only additional promise we can make is that we are continuing to work to reduce the amount of time needed during the outage and that we will open the labs early if the migrations complete early.

Can you release an image of the EWS machines so that students can install on their personal machines via a VM or on a partition so that we have something to work with?

Unfortunately this cannot be done at this time. The images we use contain a tremendous amount of licensed software that would not be legal to run on a personal system. As part of the more comprehensive long term changes to the lab environment, we may look at creating and actively maintaining some "slim" VMs that students could use on their own systems.

Would I have to back up my data in my student account before the system is revamped?

You do not need to do this. However, you may wish to copy your data to your own storage if there is work you can do with it using other computers (such as your own) during the outage.

Q&A About the Changes (Short Term and Long Term)

How will the proposed segregation of Windows and Linux directories affect classes distributing files to students?

It depends on the class and how they are distributing files. Courses providing materials via the web or by Subversion will be unaffected by the segregation. As with the home directories, there will now be separate course materials shares for the Linux and Windows platforms. We can assist instructors with getting their materials to the right locations.

Why not move the Linux home directories to new servers as well?

We have significant experience with Windows Server, including with all of the functionality we need for the Windows home directories. We do not believe this is a good platform to support NFS shares for the Linux clients, however.

There are multiple operating system and hardware options for a new Linux home directory server (e.g. RedHat, Oracle Solaris, an appliance such as NetApp), but some of these solutions lack functionality that we require for the lab environment. We lack sufficient experience with the options that might work in our lab environment to trust them for mission critical production services without thoroughly testing them first. The time required to prepare these solutions put them out of range for the immediate short-term solution.

That said, we are going to be working on determining alternatives as soon as the initial changes are put in place, not only because we have to for the long term solution but because we cannot be certain that the changes we are making will fully resolve the performance problems for the Linux workstations.

One additional factor is that we would need a much longer outage to migrate the Linux home directories in the same manner as we are proposing for the Windows home directories. Despite there being fewer students who primarily use Linux, the home directory space usage of those students is considerably higher. We would need at least another full day of outage to move them all off as well if we used the same migration scheme.

What will the quotas be on each home directory?

Users will have a 3GB quota on their Linux home directory and a separate 3GB quota on their new Windows home directory. These will remain the quotas for at least as long as the short-term solution is in service. Prior to the implementation of the short-term solution, users have had a 3GB quota for a single home directory shared between Windows and Linux systems. Users will now have 6GB total between their two home directories. It will not be possible to trade-off quota between the two.

Will there be any changes to how the amount of storage provided is displayed? Because as things are, the drive shows there being lots of free space when it is actually full.

Windows displays the total free space on network shares with no regard for an individual's quota. This can be confusing, especially if you are near your quota as the network drive itself likely does have a lot of free space left.

While there appears to be a great deal of space available, there are thousands of EWS users and that space does end up getting used as individuals consume more of their quota. As part of the long term solution work, we will be looking at quota sizes. It is likely that the long term solution will have higher base quotas when implemented.

A student question did reveal that we have not provided a way for students to check their quota on Windows systems. We are going to create a portal tool that will let everyone check their EWS storage quota and usage for both their Linux and Windows home directories.

Will our home directories still be available remotely like they are now if we map \\winhomes.ews.illinois.edu\NETID as a network drive on our personal Windows machines?

It will still be possible to access your Windows home directory, but not your Linux home directory, as a network drive. However, the path is likely to change. New paths will be published by the end of the outage.

Will all of the same remote access options (to Windows and Linux EWS machines) still be available after the scheduled outage this weekend?

Yes. All remote access services will resume after the scheduled outage. They will also have the split home directories, just as the lab systems will.

When will the issues that caused the recent performance problems be solved? Will this be a problem around finals?

The changes being implemented during the extended outage the weekend of October 11-13 are being made to address these performance problems. There should be significant improvement in performance from this Sunday, October 13 forward (i.e. well before finals). We will be continuing to closely monitor performance after the changes are completed and will make additional modifications if necessary but no additional extended outages are expected.

What is the current plan for users wanting to keep EWS Windows and Linux files synchronized in the short term?

Users wanting to do this will need to handle the synchronization themselves. We do not currently have documentation written on how to use existing tools such as rsync to do this. We are adding this to our action items list. If anyone wants to contribute such documentation, we would be happy to post it to the http://ews.illinois.edu/ website.
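As a starting point for such documentation, here is a minimal sketch that only assembles an rsync command line for a one-way copy between two directories. It does not run anything. The helper name `rsync_command` and both directory paths are hypothetical placeholders, not real EWS paths; the rsync options used (`-a`, `-v`, `--dry-run`, `--exclude`) are standard ones. The default is a dry run so the list of files that would be copied can be reviewed first.

```python
import shlex
from typing import List, Sequence


def rsync_command(src: str, dest: str, excludes: Sequence[str] = (),
                  dry_run: bool = True) -> List[str]:
    """Build (but do not run) an rsync command copying src's contents into dest."""
    cmd = ["rsync", "-av"]  # -a: archive mode, -v: list files as they are copied
    if dry_run:
        cmd.append("--dry-run")  # report what would change without copying
    cmd += [f"--exclude={pattern}" for pattern in excludes]
    # A trailing slash on the source tells rsync to copy the directory's
    # contents rather than the directory itself.
    cmd += [src.rstrip("/") + "/", dest.rstrip("/") + "/"]
    return cmd


if __name__ == "__main__":
    # Both paths below are made-up placeholders for illustration.
    cmd = rsync_command("/path/to/linux_home/coursework",
                        "/path/to/windows_home/coursework",
                        excludes=[".bashrc", ".mozilla"])
    print(shlex.join(cmd))
```

Excluding Linux-specific files such as .bashrc and .mozilla follows the same reasoning the migration used for files that are not useful on the Windows side.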

What can we do about users that are using excessive resources (eg users using 99% CPU) on remlnx? Oftentimes these users create issues for all others on the system, and nothing can be done about it.

There are already per-user memory and CPU constraints on the remote Linux servers to lessen the impact that one user can have on other services. In the short term, we are also doing more manual monitoring to identify and shut down processes that these constraints are not adequate to deal with. The labs have been more stable the last few weeks because of this than they likely would have been otherwise.

Part of the long term solution re-architecting work we'll be doing will be looking at technical ways to either a) prevent these situations from happening in the first place or b) automate detection and correction of these types of issues. However, just looking at load average or memory usage does not necessarily identify the cause of problems. Any automated solution needs to avoid false positive situations and only shut down truly rogue processes.

Why isn't there a common response from classes to outages in EWS? Each professor seems to take their own decision on whether or not to extend deadlines. However, these are department wide outages, shouldn't the decision be taken by the department?

It is not unusual for these types of decisions to be left to the individual instructor. While the outages we've had do impact multiple courses, not all courses are impacted in the same manner. Some assignments are less critical and easier to reschedule. Some courses have extensive demos pre-scheduled that are difficult to reschedule. Some courses may have exams scheduled that interfere with rescheduling assignments. There are many other factors that all lead to the need for each instructor to have latitude with deciding what to do for their own course.

Is there a long term plan to put into place a system that will be an ideal solution instead of just placing a "band-aid" on the problem?

Absolutely. The email sent by Chuck Thompson on September 20 to all College faculty, lecturers, and students included the following work that would be done to develop a long term solution:

Engage instructors to better understand what they need EWS to be able to support for their courses.

Engage students to better understand their work patterns and expectations for EWS.

Bring in additional outside expertise to assist with designing our long term storage solution.

There will be additional activities beyond these, all with the same goal: ensure that we do permanently address the lab performance issues. The information site http://ews.illinois.edu/performance was set up to document progress on both the short term and long term solutions. It will be updated at least weekly until the final long term solution is implemented.

Note that the short-term solution is more than just a "band-aid". We do believe it will make substantial improvements. We are, however, confident that it will not cover all use cases, some of which are only projected at the moment. The long term solution is as much about jumping ahead of the needs curve as it is about putting a permanent solution in place.

Do we have the necessary funding to provide a more permanent solution to the currently experienced problem?

Yes. Funding has never been the problem. Determining the right technical implementation for our strikingly complex and continuously changing instructional needs has been the primary challenge. The continuing issues this semester called for more aggressive action to first resolve the immediate performance problems and then determine the permanent solution. This is the path we are now on.

How do you plan to deal with the ever-increasing enrollment in Engineering majors that make heavy use of the EWS network filesystem?

Specifics will be worked out during the development of the long term solution. The general strategy includes:

Re-architecting the home directory storage infrastructure to have more capacity.

Increasing our engagement with faculty and instructors to identify known intended course changes that may impact the labs.

Implementing additional course-specific infrastructure for courses known to have above average needs.

Determining where we can and should provide alternative ways for students to complete their assignments. For example, all students today can use MATLAB on their own machines as long as those systems are connected to the campus network.

If you have suggestions, please send them to ews [at] illinois.edu.

Q&A About Other Lab Issues

Would EWS be able to put Microsoft Project in a computer lab that can be reserved by RSOs? On the EWS website it says that all EWS computers have Microsoft Project installed on them. I have found this is not the case. I know Engineering Hall has Microsoft Project, but reserving that space is not possible for RSOs.

Student organizations are allowed to reserve all labs. Some labs such as the Engineering Hall labs are often heavily reserved during the day for courses which do have priority. There are also some other events that take priority in the evening. The reservation form at https://illinois.edu/fb/sec/9401394 has been updated to explicitly note that student groups are allowed to make reservations.

Licensing restrictions combined with few requests are the reason why Microsoft Project is not installed on all systems. The licensing issue is a technical issue with Microsoft that may be able to be resolved at some point. Project is installed in the lab in 406B1 Engineering Hall and, as noted above, RSOs can reserve this lab.

What if I need to install and use some other program on an EWS station? How do I get permission for that?

If there is a software application required for a course on an EWS workstation, your instructor should request it by using the form at https://illinois.edu/fb/sec/7075535. There is also a link to this form in the lower left corner of http://ews.illinois.edu labeled "Software Requests" under Instructor Tools. In order to maintain the security of our lab environment, we cannot give students permission to install software on lab computers.

Can we please have the old EWS setup back at Siebel? Now with two computers on one desk the monitors are too close up, my eyes hurt after even a short while.

The changes were made due to a significant number of requests for more lab computers at Siebel (in particular for more Linux systems). This feedback will be discussed with CS administration. Anyone else with similar feedback is strongly encouraged to email ews [at] illinois.edu. The trade-off for addressing this issue is, unfortunately, likely a reduction in the number of installed computers.

Why is Dropbox not installed on EWS computers?

We do not currently install the sync clients for any cloud storage service. As part of the long term solution work we are going to be investigating the use of cloud storage, whether as primary storage or simply as another alternative for students. For example, Indiana University has created a cloud storage integration service that they use with their lab environment. They are intending to begin offering this application to other institutions in the near future and Engineering IT is already in discussions with CITES about collaborating on a pilot implementation.