Imagine being able to take a virtual tour of the cosmos from your living room. Not just a flat, two-dimensional tour, but an experience so engrossing that you can see the entire sky at once and then zoom into detailed views of distant galaxies. The Terapixel project from Microsoft Research makes all of that possible by creating the largest and clearest image of the night sky ever produced: a terapixel image, now available in the WorldWide Telescope and Bing Maps.

The Terapixel project began with data from the Digitized Sky Survey, a collection of thousands of images taken over a period of 50 years by two ground-based survey telescopes: the Palomar telescope in California, United States, and the UK Schmidt telescope in New South Wales, Australia. The Palomar telescope photographed the Northern sky, and the Southern sky down to around 30 degrees south. The UK Schmidt telescope photographed the rest of the Southern sky. Each photograph covers an area of the cosmos six and a half degrees square. For each section of the sky, the Digitized Sky Survey provides two separate images containing the blue and red color intensities. The images themselves are monochromatic; the data simply represents the intensity of blue or red.
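Because each region of sky is recorded only as a red plate and a blue plate, any color rendering has to synthesize a green channel from those two measurements. The sketch below shows one simple way this could be done on small in-memory plates. The function name `combine_red_blue` and the choice of green as the mean of red and blue are illustrative assumptions, not the formula used by the project.

```python
def combine_red_blue(red, blue):
    """Combine two monochrome plates into rows of (R, G, B) tuples.

    `red` and `blue` are equally sized 2D lists of 0-255 intensities.
    ASSUMPTION: green is synthesized as the integer mean of red and blue.
    This is only an illustration, not the project's documented method.
    """
    same_shape = len(red) == len(blue) and all(
        len(red_row) == len(blue_row) for red_row, blue_row in zip(red, blue)
    )
    if not same_shape:
        raise ValueError("red and blue plates must have the same shape")
    return [
        [(r, (r + b) // 2, b) for r, b in zip(red_row, blue_row)]
        for red_row, blue_row in zip(red, blue)
    ]


# Two tiny 2x2 plates standing in for a scanned red/blue pair.
red_plate = [[200, 100], [50, 0]]
blue_plate = [[100, 100], [150, 255]]

rgb = combine_red_blue(red_plate, blue_plate)
print(rgb[0][0])  # (200, 150, 100)
```

A real pipeline would first have to align the two plates so that the same pixel refers to the same point on the sky; this sketch assumes that has already happened.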

The telescope imaging process introduced artifacts into the plates, such as varying levels of brightness, noise, and color saturation, as well as vignetting (a darkening of the edges and corners of each plate). These needed correction in order to generate a clear and seamless image. Terapixel programmatically removed these anomalies, stitched and smoothed the images, and then created image pyramids for visualization in WorldWide Telescope (WWT).
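To make the idea of vignetting correction concrete, here is a minimal sketch that darkens a synthetic plate toward its corners and then undoes the darkening by dividing each pixel by the modeled gain. The quadratic radial falloff, the `strength` parameter, and both function names are assumptions chosen for illustration; the text does not describe the correction model Terapixel actually applied.

```python
import math


def vignette_gain(x, y, width, height, strength):
    """Relative brightness at pixel (x, y) under a quadratic radial falloff.

    ASSUMPTION: gain = 1 - strength * (r / r_max)^2, where r is the distance
    from the plate center and r_max is the center-to-corner distance.
    Plates must be larger than 1x1 so that r_max is non-zero.
    """
    if not 0.0 <= strength < 1.0:
        raise ValueError("strength must be in [0, 1)")
    cx, cy = (width - 1) / 2, (height - 1) / 2
    r_max = math.hypot(cx, cy)
    r = math.hypot(x - cx, y - cy)
    return 1.0 - strength * (r / r_max) ** 2


def correct_vignetting(plate, strength):
    """Divide every pixel by its modeled gain to flatten the plate."""
    height, width = len(plate), len(plate[0])
    return [
        [plate[y][x] / vignette_gain(x, y, width, height, strength)
         for x in range(width)]
        for y in range(height)
    ]


# Simulate a uniformly bright 5x5 plate that has been vignetted.
WIDTH = HEIGHT = 5
STRENGTH = 0.4
TRUE_LEVEL = 100.0
observed = [
    [TRUE_LEVEL * vignette_gain(x, y, WIDTH, HEIGHT, STRENGTH)
     for x in range(WIDTH)]
    for y in range(HEIGHT)
]

restored = correct_vignetting(observed, STRENGTH)
print(round(observed[0][0], 6), round(restored[0][0], 6))
```

In this toy setup the corner pixel drops to about 60% of the center brightness and is restored to the original level. On real plates the falloff would have to be estimated from the data rather than known in advance.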

Developers used Trident, the scientific workflow tool developed by Microsoft Research, to create and manage all of the workflows within the project. Each stage of the process is a Trident workflow activity, from the initial data preparation to sending the terapixel image to the WorldWide Telescope. Given the large amount of data and computation involved, programmers used DryadLINQ and .NET Parallel Extensions to manage code running in parallel on the multi-core machines of a Windows High Performance Computing cluster. With a 64-node cluster (512 cores), they were able to compute the final Terapixel image from the raw digitized data in a little more than half a day.

Once the original files are decompressed, they undergo a series of programmatic changes to correct the vignetting problem. The red and blue plates are then aligned astrometrically and combined to form a new color image, which also contains metadata that maps it to sky coordinates. The next step is to stitch the color images together into a spherical image and smooth the seams of that image. Terapixel uses the global image optimization program developed by Hugues Hoppe and Dinoj Surendran of Microsoft Research and Michael Kazhdan of Johns Hopkins. The gradients across the image boundaries are set to zero, resulting in a seamless spherical panorama.

The result of the Terapixel project is a full-color, 24-bit RGB terapixel image of the night sky. The artifacts of the original telescope imaging process have been programmatically removed. The resulting image can be viewed in the WorldWide Telescope and in Bing Maps.

Terapixel is a showcase for Microsoft technologies in many-core computing, in high-performance and data-intensive distributed computing, and in scientific workflow management. It demonstrates how technologies such as Windows HPC, .NET Parallel Extensions, DryadLINQ, and Trident can be used to create new possibilities for data-intensive research in astronomy, bioinformatics, and environmental science.
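The seam-removal step, in which gradients across image boundaries are set to zero, can be illustrated with a one-dimensional toy example. The sketch below joins two intensity strips that differ by a brightness offset, zeroes the single gradient that straddles the boundary, and reintegrates. The function name `remove_seam_1d` is hypothetical, and this is a deliberately simplified stand-in, not the global optimization program named above.

```python
def remove_seam_1d(left, right):
    """Join two non-empty 1D intensity strips with no step at the boundary."""
    if not left or not right:
        raise ValueError("both strips must be non-empty")
    values = left + right
    # Forward differences of the naive concatenation.
    gradients = [b - a for a, b in zip(values, values[1:])]
    # The difference that straddles the boundary is the seam; force it to zero.
    gradients[len(left) - 1] = 0
    # Reintegrate from the first sample to recover intensities.
    result = [values[0]]
    for gradient in gradients:
        result.append(result[-1] + gradient)
    return result


left_strip = [10, 12, 14]
right_strip = [20, 22, 24]  # same slope, but offset in brightness
seamless = remove_seam_1d(left_strip, right_strip)
print(seamless)  # [10, 12, 14, 14, 16, 18]
```

In one dimension the modified gradients can be integrated exactly, so the whole right strip simply shifts to meet the left one. In two dimensions the modified gradient field generally cannot be integrated exactly, which is why such methods are typically posed as a least-squares optimization over the whole image.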

By the Numbers

Raw data: 1791 pairs of red-light and blue-light images acquired from two telescopes, scanned into 23,040×23,040 or 14,000×14,000 images.

Windows HPC Cluster: The high-performance computational platform used to run Terapixel, consisting of 64 compute nodes, each a quad-core Intel Xeon CPU with 16 GB RAM and 1.7 TB of storage.

Generation of RGB plates: Processing time: 5 hours. Input: 417 GB (compressed, 4 TB uncompressed). Output: 790 GB (approx. 500 MB/plate).

Stitch images into a spherical image: Processing time: 3 hours.

Optimize image to remove seams: Processing time: 4 hours 15 minutes.

Move data off the cluster: 2.5 hours (1 Gbps link).

Output: 1025 pyramid files; total size: 802 GB.

Terapixel is a reference implementation to derive similar data-intensive solutions, not only in astronomy but also in other domains such as bioinformatics and environmental sciences.
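As a quick consistency check on the "By the Numbers" figures, the short script below adds up the stage times and derives two quantities the summary only implies: the average size of an RGB plate and the best-case transfer time over the 1 Gbps link. It assumes decimal units (1 GB = 1000 MB = 8 gigabits) and one RGB plate per red/blue pair; the variable names are illustrative.

```python
# Figures quoted in the "By the Numbers" summary.
stage_hours = {
    "generate RGB plates": 5.0,
    "stitch into spherical image": 3.0,
    "optimize to remove seams": 4.25,  # 4 hours 15 minutes
}
transfer_hours = 2.5
plate_pairs = 1791
rgb_output_gb = 790
pyramid_output_gb = 802
link_gbps = 1

compute_hours = sum(stage_hours.values())
total_hours = compute_hours + transfer_hours

# ASSUMPTION: decimal units and one RGB plate per red/blue pair.
mb_per_plate = rgb_output_gb * 1000 / plate_pairs

# Best case: gigabytes -> gigabits, divided by the link rate, in hours.
ideal_transfer_hours = pyramid_output_gb * 8 / link_gbps / 3600

print(f"compute: {compute_hours} h, with transfer: {total_hours} h")
print(f"average RGB plate size: {mb_per_plate:.0f} MB")
print(f"ideal transfer time at line rate: {ideal_transfer_hours:.2f} h")
```

The three processing stages sum to 12.25 hours, in line with "a little more than half a day". The average plate comes to roughly 441 MB, consistent with the quoted approximation of 500 MB per plate, and the ideal transfer time of about 1.78 hours sits below the reported 2.5 hours, as expected for a link that does not sustain its full line rate.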