File formats

How do I save data?

A file format is the way information is encoded and structured in a file that stores digital research data or documentation. You are probably familiar with .docx, .xlsx, .pdf, .mp3, and .jpeg, even if you do not think much about what they mean or what they do. These extensions indicate the file format. The format specifies how information is encoded in the file and tells the computer which software can open and use it, which is why you cannot play a .docx as audio or read an .mp3 as text. However, the choice of format brings a set of complications.

Open formats are formats whose workings are documented, so we can see how the format works and easily open the files with freely available software. The format may or may not be “owned” or “proprietary” but, critically, the information underpinning it is public, or generally available under a largely restriction-free licence, so software can be developed that allows access to the files. Closed proprietary formats rely on software you have to buy, often where a commercial organisation locks away the operating manual and keeps the intellectual property as a trade secret, supporting the format only for as long as it sees fit. When that support stops, files are sentenced to a digital death unless you pay for the latest version of the software.

Open formats help make data “open”: not just available, but usable in a way that allows reuse, repurposing, and remixing.

Another way open formats help is with long-term accessibility. Storage formats quickly become obsolete and inaccessible. If files are in open formats, they can be transferred to contemporary formats (“migration”), or a way to access them can be engineered (“emulation”). Often the critical challenge in keeping digital objects accessible is not the technology but the Intellectual Property Rights attached to closed proprietary formats, which present a greater and often insurmountable obstacle. Closed formats also make it difficult to share data, informally or formally, or even with your future self, as many closed formats are platform-specific. In addition, closed formats by definition do not reveal what information they store about changes made to files. If you are a researcher dealing with sensitive data, this can be problematic because you cannot be confident about what is, or is not, being recorded in the file’s metadata.

Try to keep copies in open standard formats, or at least in formats widely used and accepted by your research community. A few common examples of open formats include:

  • Archiving: 7z (archiving and compression), MAFF (web page archiving), tar (archiving), ZIP (archiving and compression)
  • Databases: CSV (spreadsheets), NetCDF (scientific data)
  • Multimedia: DjVu (scanned images and documents), JPEG2000 (a standardized image format), PNG (standardized raster image format), SVG (standardized vector image), WebM (video and audio format)
  • Text: CSS (websites), HTML (websites), EPUB (open e-book standard), LaTeX (document markup language), Office Open XML (text format), OpenDocument (text format)
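
As a rough illustration of how the list above might be put to use, the Python sketch below flags files whose extension does not match one of the listed formats, so that you can decide whether to keep a copy in an open format. The extension set is an assumption made for this example (for instance, .nc for NetCDF, .jp2 for JPEG2000, .docx for Office Open XML, .odt for OpenDocument), and the function name files_to_review is hypothetical. It is not an authoritative or complete registry of open formats.

```python
from pathlib import Path

# Assumed mapping of common extensions to the open formats listed above.
# This set is illustrative only; adjust it to your own community's practice.
OPEN_EXTENSIONS = {
    ".7z", ".maff", ".tar", ".zip",                # archiving
    ".csv", ".nc",                                 # tabular and scientific data
    ".djvu", ".jp2", ".png", ".svg", ".webm",      # multimedia
    ".css", ".html", ".epub", ".tex", ".docx", ".odt",  # text
}


def files_to_review(names):
    """Return the file names whose extension is not in OPEN_EXTENSIONS."""
    return [n for n in names if Path(n).suffix.lower() not in OPEN_EXTENSIONS]


if __name__ == "__main__":
    # Extensions are compared case-insensitively, so "figure.PNG" passes.
    print(files_to_review(["results.csv", "figure.PNG", "notes.doc", "survey.sav"]))
```

A file extension is only a hint about the format, so a check like this is a starting point for a manual review rather than a guarantee.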

While you should bear the above in mind, there is an acknowledged difference between working data and preservation data. Using closed formats for working on your data is not encouraged, but it is recognised that there may be compelling reasons to use them in the short term. Where possible, though, we encourage the use of open formats for data collection and preservation copies.
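
One way to produce a preservation copy of tabular data is to write it out as CSV alongside your working files. The following is a minimal Python sketch using only the standard library; the function names save_preservation_copy and load_copy, the file name, and the sample values are all hypothetical and chosen for illustration.

```python
import csv
import tempfile
from pathlib import Path


def save_preservation_copy(header, rows, path):
    """Write a header and data rows to a UTF-8 encoded CSV file."""
    path = Path(path)
    # newline="" lets the csv module control line endings itself.
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def load_copy(path):
    """Read the CSV file back as a list of rows (every value is a string)."""
    with Path(path).open(newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        target = Path(folder) / "survey_preservation.csv"
        save_preservation_copy(["participant", "age"], [["p01", 34], ["p02", 51]], target)
        print(load_copy(target))
```

CSV stores every value as text and has no place for units or variable descriptions, so a copy like this is best kept together with separate documentation of what each column means.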

