NSW State Records has promised to open source any digital preservation software it writes for an archive of "born-digital" records that will sit in a new Western Sydney data centre.

The digital archives project is the State Government's first attempt to permanently and centrally store a portion of digital files that are beyond "immediate business use".

"Up until now with these permanent value digital records, government agencies have been obliged to simply maintain them in their own systems," digital archives project manager Cassie Findlay told iTnews.

"There hasn't been anywhere to send [the files] because we haven't had that capacity."

The state will permanently store digital files in much the same way it does physical files, such as photographs, documents and volumes.

It will house the digital archive in a new purpose-built data centre at the Western Sydney Records Centre, which currently acts as a central store for the physical archives.

The data centre will house an IT platform consisting of Cisco servers and switches, VMware virtualisation and an EMC Isilon storage system. It is to be designed and implemented by Logicalis Australia.

NSW State Records will accept a "relatively modest ... 15-20 separate transfers" of born-digital files from various agencies to test the new preservation system.

"We have goals around getting a certain number of transfers of digital records done within the timeframe of the project to make sure everything's working as it should," Findlay said.

If all goes to plan, the digital archive will open to general transfers of digital files to the state archives from mid-2013, when the digital archives project officially ends.

"We want to ... sell the benefits to government of not having that ongoing responsibility to keep those records accessible and usable in-house but rather send them to us, we will preserve and look after them," Findlay said.

"It won't cost the government agency anything anymore to do that, and they'll also be part of this big pool of government information."

Open source software

The IT platform will run open source digital preservation software, either written by the project's resident programmer or built on work from the global "digital preservation scene".

"We're developing our own software based on standards and frameworks that are in place in other libraries and archives," Findlay said.

The department's own code will be made available via GitHub, where it has previously released an experimental API for its catalogue of physical items.

"We haven't put up a lot of stuff to do with digital preservation activities yet [on GitHub]," Findlay said.

Uploading code to GitHub could facilitate interoperability and sharing benefits if the software "becomes widely used".

It could also provide access to a potentially wider base of developers than the project could otherwise afford to engage.

"Like any government department we have limited resources to commit to this ongoing so we want to do it as collaboratively as we can," Findlay said.

"In digital preservation there are a lot of issues that can come up that you maybe didn't think of yourself or that come slightly unexpectedly, so if there are people out there that are also grappling with some of these things and developing similar software or using our tools then that's a help for us."

Non-proprietary architecture

The decision to go open source is partly about facilitating openness, a key challenge in any archival project.

Findlay said the NSW digital archives project aimed to preserve born-digital files "so they will be readable in 100 or 200 years' time".

"This digital archive will have to be around forever so [it has to be] as non-proprietary as possible," she said.

That affects the choice of both the systems in which files are stored and the formats in which they are kept.

"We can't set up a digital archives system that has too many dependencies on vendors or ongoing licensing costs," Findlay said.

"If you have a format that [is] proprietary and you don't have access to the underlying information to know how it's read and presented, then down the track if the software is less freely available, you've got a problem if [the] company goes out of business and no longer makes the software to read it.

"You [also] can't have a situation where you have digital information that relies on licensing and payment to corporations for it to continue to be read.

"If you're managing a digital archive with terabytes of data in all manner of formats from all different sorts of systems, there's no way you could take it on knowing you've got to somehow pay for it to be accessible."

Findlay said that file transfers made to the project would be evaluated against principles of openness. Files could be converted if issues arose, she said.

"There are tools in the digital preservation world to help us to do that," she said.

"For example, the National Archives [of Australia] developed a tool some years back called Xena which converts a range of formats into more open formats. A classic example is Word into ODF."
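
The evaluation Findlay describes can be pictured as a simple triage step. The Python sketch below is illustrative only: the format lists, the `triage` function and the extension-based check are assumptions made for this example, not the project's actual policy or tooling, and it does not call Xena. Real preservation tools generally inspect file contents rather than trusting file extensions.

```python
from pathlib import Path

# Hypothetical lists for illustration; not State Records' actual policy.
OPEN_FORMATS = {".odt", ".ods", ".odp", ".txt", ".csv", ".xml", ".png"}

# Hypothetical mapping from a proprietary extension to an open target format.
CONVERSION_TARGETS = {".doc": ".odt", ".xls": ".ods", ".ppt": ".odp"}


def triage(filenames):
    """Sort transferred files into keep / convert / review buckets."""
    plan = {"keep": [], "convert": [], "review": []}
    for name in filenames:
        ext = Path(name).suffix.lower()
        if ext in OPEN_FORMATS:
            # Already in an open format: store as received.
            plan["keep"].append(name)
        elif ext in CONVERSION_TARGETS:
            # Known proprietary format: record the proposed open-format name.
            target = str(Path(name).with_suffix(CONVERSION_TARGETS[ext]))
            plan["convert"].append((name, target))
        else:
            # Unknown format: leave the decision to an archivist.
            plan["review"].append(name)
    return plan


if __name__ == "__main__":
    print(triage(["minutes.DOC", "budget.ods", "plan.dwg"]))
```

Anything the sketch cannot classify goes to a "review" bucket rather than being converted automatically, on the assumption that an archivist would decide those cases.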

Searching the archive

Findlay said that the project team was working on a metadata management plan for the digital archives that would - among other things - affect the extent to which individuals and organisations are able to search across and harness this new store of government data.

"As well as the more technical preservation metadata that we'll need to manage format issues over time, we have certain metadata that we know we'll want to manage in a fairly structured way because it's to do with the legal aspects of managing the records," Findlay said.

"We're also looking at how we can manage, index and retrieve the metadata and the record contents in a very powerful way so people can analyse across large data sets and pull out - providing they're open access of course - information on a subject basis across a whole range of government departments, which is just impossible to do with older physical sets of records which are in boxes.

"It's exciting to have that possibility."
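
A rough picture of what subject-based retrieval across agencies might look like is sketched below in Python. Everything in it is hypothetical: the `ArchivedRecord` fields, the function names and the sample agencies are assumptions for illustration, and the project's metadata plan had not been published at the time of writing. The sketch only shows the idea of indexing open-access records by subject so that one query spans several agencies.

```python
from dataclasses import dataclass, field


@dataclass
class ArchivedRecord:
    """Hypothetical record shape; not the project's metadata schema."""
    title: str
    agency: str
    file_format: str   # stands in for technical preservation metadata
    open_access: bool  # stands in for access and legal metadata
    subjects: set = field(default_factory=set)


def build_subject_index(records):
    """Map each lower-cased subject to the open-access records tagged with it."""
    index = {}
    for rec in records:
        if not rec.open_access:
            continue  # closed records are never exposed through the index
        for subject in rec.subjects:
            index.setdefault(subject.lower(), []).append(rec)
    return index


def agencies_for(index, subject):
    """Return the sorted names of agencies holding open records on a subject."""
    return sorted({rec.agency for rec in index.get(subject.lower(), [])})


if __name__ == "__main__":
    records = [
        ArchivedRecord("Flood report", "Agency A", "odt", True, {"Flooding"}),
        ArchivedRecord("Levee plan", "Agency B", "ods", True, {"flooding"}),
        ArchivedRecord("Closed briefing", "Agency C", "odt", False, {"flooding"}),
    ]
    index = build_subject_index(records)
    print(agencies_for(index, "flooding"))
```

Closed records are filtered out when the index is built, which mirrors Findlay's proviso that cross-agency analysis applies only to records that are open access.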

Public access to stored records would still be negotiated with the agencies involved.

"Some will be open and available online very early, some will be closed for longer periods of time and there are processes for that," Findlay said.

Digitisation complementary

The digital archives project is separate from a long-running digitisation program in NSW.

Where the archives project is about taking in "born-digital" records for the first time, digitisation is about making electronic copies of "older and popular" physical documents held in storage in Sydney's west.

"The digitisation program we have is about us trying to get more of our legacy material in the state archives collection online and available," Findlay said.

"[Digital archives is] about having [agencies] send to us ... born-digital records, so they were never paper, that are required to be kept permanently as archives."

The digital archives project falls under the auspices of a broader NSW Government digital record keeping initiative called Future Proof, which has been running since 2008.

The initiative aims to foster good record keeping practices in NSW Government agencies "for their own purposes as well as down the track for the State Archives collection".