Open Data Files: The Main Formats


Open data can be published in a variety of formats. We’ve discussed the basics of open data and mentioned some of our favorite sources; in this post we’ll focus on the formats themselves. Many formats are better known by their abbreviations, so each heading below gives both the full name and the abbreviation.

Comma Separated Values (CSV)

CSV files are valued for their compactness and portability. In a CSV, each row of a spreadsheet becomes one line of plain text, and the values in that row are separated by commas (hence the name). LiveStories works with CSVs, and users can create them from most spreadsheet programs. However, converting an Excel spreadsheet to a CSV can lose information from the original, such as formatting, formulas, and additional sheets, because a CSV stores only plain values. Protecting the structure of a CSV is important: a shifted column or a missing header row can make the file hard to interpret without the right documentation.
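
To make the row-and-comma layout concrete, here is a minimal sketch using Python's standard csv module. The column names and values are hypothetical sample data, not taken from any real dataset.

```python
import csv
import io

# Hypothetical sample rows; in a real CSV every value is stored as plain text.
rows = [
    {"county": "County A", "year": "2016", "uninsured_rate": "6.1"},
    {"county": "County B", "year": "2016", "uninsured_rate": "7.4"},
]

# Write a header line followed by one comma-separated line per row.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["county", "year", "uninsured_rate"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back; the header line supplies the key for each value.
parsed = list(csv.DictReader(io.StringIO(csv_text)))

print(csv_text)
```

The header line is the only thing telling a reader what each column means, which is why a damaged or missing header makes a CSV hard to interpret.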

Extensible Markup Language (XML)

XML is a markup language similar to HTML. Unlike HTML, which is made to display webpages, XML is designed to store and transport any kind of data. XML is popular because it is self-descriptive: each value is wrapped in tags that name what it is, so the structure of the data travels with the data itself. This makes XML files easier to share and collaborate on, since readers need less outside documentation to interpret them.
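
As a rough illustration of data stored in XML, the sketch below parses a small document with Python's standard xml.etree.ElementTree module. The tag names (dataset, record, county, and so on) and the values are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; every value sits inside a tag that names it.
xml_text = """<dataset>
  <record>
    <county>County A</county>
    <year>2016</year>
    <uninsured_rate>6.1</uninsured_rate>
  </record>
  <record>
    <county>County B</county>
    <year>2016</year>
    <uninsured_rate>7.4</uninsured_rate>
  </record>
</dataset>"""

root = ET.fromstring(xml_text)

# Turn each <record> element into a dictionary keyed by its child tag names.
records = [
    {child.tag: child.text for child in record}
    for record in root.findall("record")
]

for item in records:
    print(item)
```

No separate header or documentation is needed to find the county name here, because the tags carry that information.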

JavaScript Object Notation (JSON)

JSON is a text format based on the object syntax of the JavaScript programming language. When data needs to move between servers and applications, a program serializes it into JSON text, and the receiver parses that text back into data. JSON has an advantage as a text format: it is lightweight, and although it grew out of JavaScript, most programming languages can read and write it.
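
The round trip from data to JSON text and back can be sketched with Python's standard json module. The record below is hypothetical sample data.

```python
import json

# Hypothetical record; JSON supports strings, numbers, lists, and nested objects.
record = {
    "county": "County A",
    "year": 2016,
    "uninsured_rate": 6.1,
    "sources": ["survey"],
}

# Serialize: turn the in-memory data into a JSON string that can be sent or saved.
json_text = json.dumps(record)

# Parse: turn the JSON string back into data on the receiving side.
restored = json.loads(json_text)

print(json_text)
```

Unlike a CSV, the JSON text keeps numbers as numbers and allows nested values such as the list of sources.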

Text Files (.txt)

Most operating systems include a text editor that can be used to create notes. Since any string of text can encode data, some people distribute open data in this simple format. These files are straightforward and easy for computers to read, but they sometimes lack structural information, such as which character separates the values or what each column means. Copies can also differ between operating systems, for example in line endings or character encoding.
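
One common way copies of a text file end up differing is the line ending: Unix-like systems typically end lines with a line feed, while Windows typically uses a carriage return plus a line feed. The sketch below uses hypothetical tab-separated content and assumes UTF-8 encoding to show two copies that differ byte for byte yet hold the same lines.

```python
# Two hypothetical copies of the same tab-separated file.
unix_copy = b"county\tyear\nCounty A\t2016\n"
windows_copy = b"county\tyear\r\nCounty A\t2016\r\n"


def read_lines(raw):
    """Decode bytes as UTF-8 and split into lines, whatever the line ending."""
    return raw.decode("utf-8").splitlines()


# The raw bytes differ, but the decoded lines are identical.
print(unix_copy == windows_copy)
print(read_lines(unix_copy) == read_lines(windows_copy))
```

Reading files in a way that tolerates either line ending avoids treating such copies as different data.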

Resource Description Framework (RDF)

RDF isn’t a file type but a model for organizing metadata, and it can be written in XML or JSON, among other syntaxes. Metadata is data about data. For example, in a dataset about health insurance rates, the metadata would include such information as who created the dataset, and when. The W3C recommends RDF for promoting open data across the web, because RDF data is easy to combine and the things it describes are identified with URLs.
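
RDF describes things as subject, predicate, object statements. The sketch below models that idea with plain Python tuples rather than a real RDF library, so it is an illustration of the model and not an RDF implementation. The dataset URL, the agency name, and the date are hypothetical; the two predicate URLs come from the Dublin Core terms vocabulary.

```python
# Dublin Core terms namespace; the dataset URL below is a made-up example.
DCT = "http://purl.org/dc/terms/"
DATASET = "http://example.org/datasets/health-insurance-rates"

# Two hypothetical sources each publish statements about the same dataset.
publisher_graph = {
    (DATASET, DCT + "creator", "Example Health Agency"),
}
catalog_graph = {
    (DATASET, DCT + "creator", "Example Health Agency"),
    (DATASET, DCT + "created", "2016-01-01"),
}

# Because both sources use the same URL for the dataset, combining them is a
# set union; the duplicate statement collapses into one.
merged = publisher_graph | catalog_graph


def describe(graph, subject):
    """Collect every predicate and object stated about one subject."""
    return {p: o for s, p, o in graph if s == subject}


print(describe(merged, DATASET))
```

The shared URL is what lets statements from separate publishers line up without any manual matching.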

These formats are not the only ways people store and distribute data. Websites store plenty of data in their HTML, for example. And some people even save data as tables in PDFs, which is not particularly useful as far as open data goes, but may be enough to get the job done. If you are interested in distributing open data, keep in mind that accessibility and machine-readability are keystones of the concept.

Sources

This blog post is based on information from the Open Data Handbook and W3Schools.

Cover photo by Jessica Ruscello