As statisticians, we spend a lot of time reading and writing data files. We need data formats that are easy to read, portable between different languages and analysis packages and operating systems, reasonably compact, and durable: data files that we save now should be readable tomorrow and twenty years from now.
Let’s flesh out these features in a bit more depth. Not every data format has all of these features, but they’re worth thinking about.
The most common data type we encounter, but by no means the only one, is tabular data. Let’s start there.
You are all undoubtedly familiar with comma-separated values files. CSVs are very simple formats for tables: the file is plain text, every row is separated by a newline, and every column is separated by a comma. Hence:
number,title,professor
101,Intro to Mad Science,V Frankenstein
202,Engineering for Evil,H Doofenschmirtz
401,Monster Construction Seminar,M Shelley
Dead simple. Except this leads to a few problems: What if the value of a field contains a comma? What if the value of a field contains a newline character? The common method is to use quotation marks to delimit such fields, so we have
number,title,professor,description
401,Monster Construction Seminar,M Shelley,"A discussion of the tools, techniques, and materials to construct humanoid monsters.
Cross-listed under Biomedical Engineering 378."
But then what do we do if we want to put a quotation mark inside a field? The common rule is to escape it by doubling it, so the field is written as "foo ""bar"" baz" instead of "foo "bar" baz".
Unfortunately, common software does not actually agree on these rules, and so CSV files produced by some software can be hard to read in others. Some readers don’t like newlines in fields; some writers wrap every field in quotes regardless of whether it contains commas or quotes; some software doesn’t expect to see column headers. There is a standard defining CSV syntax (RFC 4180), but most software just implements its own favorite version of CSV instead.
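To see these quoting rules in action, here is a sketch using Python’s standard csv module (the file name courses.csv is made up): the writer applies the quoting and doubling automatically, and the reader undoes it.

import csv

# The writer quotes any field containing commas, quotes, or newlines, and
# doubles embedded quotation marks, so we never hand-roll the escaping.
with open("courses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["number", "title", "professor", "description"])
    writer.writerow([401, "Monster Construction Seminar", "M Shelley",
                     "A discussion of the tools, techniques, and materials "
                     "to construct humanoid monsters.\nCross-listed under "
                     "Biomedical Engineering 378."])

# The reader undoes the quoting and returns the original field values.
with open("courses.csv", newline="") as f:
    for row in csv.reader(f):
        print(row)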
CSV files are self-describing in the limited sense that the columns have names, but they do not prescribe the format of the individual fields. A date column with entries like 10-04-2018 might mean April 10th or October 4th. A column with entries like 2.31e13 might be a number written in scientific notation or the string 2.31e13. Timestamps might be written with or without time zones. You must know the formatting to correctly interpret a CSV file.
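A defensive habit when reading such files, sketched here with pandas (the file and column names are invented, and the formats are assumptions you would adjust to your own data), is to read every column as plain text and then convert each column yourself with an explicit format:

import pandas as pd

# Read everything as strings so nothing is guessed on our behalf...
df = pd.read_csv("observations.csv", dtype=str)

# ...then convert the columns we understand, stating the format explicitly.
df["date"] = pd.to_datetime(df["date"], format="%d-%m-%Y")  # declare day-month-year
df["count"] = pd.to_numeric(df["count"])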
Also, be very careful about opening CSV files in spreadsheet programs. Microsoft Excel will very helpfully try to recognize the format of columns (date? text? currency?) and convert columns, leading to a plague of scientific articles where gene names like SEPT2 (Septin 2) were helpfully converted to dates like 2006/09/02 and record identifiers like 2310009E13 were converted to floating-point numbers.
I recommend:

- Write dates and times in an unambiguous format like ISO 8601, e.g. 2019-02-18T19:04:41Z, and include the time zone. (I have seen data files that do not specify their time zone, and if you just guess “local time” you get invalid times during the hour skipped by Daylight Savings in the spring – turns out they used a different time zone without saying so.)
- Put one value in each column. (I have seen a location column that’s formatted as (longitude, latitude); don’t do that!)
- Mark missing values with a designated marker like NA or null, and make sure that can’t be confused for a real value. (Don’t make Mr Null suffer any more than he already has. It’s also common for values like 999 to mean “missing” or “invalid”, but then people forget to throw out those numbers.)

You sometimes still see tab-separated values files, which are like CSVs except with tabs. TSVs are simpler than CSVs because instead of dealing with escaping commas, they just ban tab characters in fields.
Frustratingly, the basic ASCII character set – first published in 1963 and almost universally used – contains special Record Separator and Unit Separator characters one could use for delimited text files without worrying about escaping, but nobody has ever used them and text editors don’t know how to display them.
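If you’re curious, here is a tiny Python sketch of what that could look like, reusing the course row from earlier; the separators are just ASCII control characters 30 and 31:

# ASCII's built-in delimiters, which nothing uses:
RS = "\x1e"   # Record Separator (ASCII 30), meant to go between records (rows)
US = "\x1f"   # Unit Separator (ASCII 31), meant to go between units (fields)

record = US.join(["401", "Monster Construction Seminar", "M Shelley"])
print(record.split(US))   # ['401', 'Monster Construction Seminar', 'M Shelley']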
Parquet is a new-ish project to define a tabular data format that can be widely used in different programming languages. It is columnar: instead of the file containing one row of data at a time, as in CSV, it contains one column of data at a time. This is ideal for situations when you want to operate on entire columns at once; a CSV file would require you to process the entire file byte-by-byte, tracking commas and quotation marks, whereas Parquet lets you just read one chunk directly.
(For large files, or those you want to append new rows to, Parquet splits the rows into “row groups”: you get one column at a time for the first N rows of data, then the columns again for the next N rows, and so on. This also makes reading subsets of the rows quite fast.)
Parquet encodes data in binary, includes compression, and has libraries available for several languages. It’s widely used for Spark, Hadoop, and all the related software.
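As a rough sketch of what this looks like in practice from pandas (this assumes the pyarrow library is installed; the file and data are invented):

import pandas as pd

df = pd.DataFrame({"number": [101, 202, 401],
                   "professor": ["V Frankenstein", "H Doofenschmirtz", "M Shelley"]})

# Column names and types are stored in the file, and the data is compressed.
df.to_parquet("courses.parquet")

# Because the layout is columnar, we can read just the columns we need
# without scanning everything else.
professors = pd.read_parquet("courses.parquet", columns=["professor"])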
The Hierarchical Data Format was designed for high-performance supercomputing where we need huge data files of giant multidimensional arrays. The format is hierarchical: Every file contains datasets (multidimensional arrays of data) and groups, which contain datasets and other groups.
Datasets and groups have names and are addressed with the usual file-path notation, so a dataset “bar” inside a group “foo” is written /foo/bar.
Datasets and groups can have attributes: small, named pieces of data attached to them to describe their meaning. This is great for making data files self-describing.
Datasets are encoded in binary, and large datasets can be encoded in chunks, so that a reader doesn’t have to read the entire file into RAM to extract pieces of it (unlike a CSV).
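Here is a minimal sketch with the h5py library showing groups, datasets, and attributes together; the names and values are invented:

import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment1")            # a group, rather like a directory
    temps = grp.create_dataset(
        "temperatures", data=np.random.normal(20, 2, size=1000))
    temps.attrs["units"] = "degrees Celsius"       # attributes describe the data
    temps.attrs["sensor"] = "rooftop thermistor 3"
    grp.attrs["collected"] = "2019-02-18T19:04:41Z"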
I have custom Python code which fits a very big, very slow model with expectation-maximization. On a large dataset, even with parallelization, it can take twenty minutes to an hour to fit and use several gigabytes of RAM.

I don’t want to accidentally close Python and lose the model fit. Nor do I want to save a model fit, come back four months later, and forget what data or parameters I used to make it. So the EMFit class below can dump an entire fit, the data, the fitted parameters, and the settings, to an HDF5 file, and load it back again later:
class EMFit:
    """
    Represents an EM fit and operations on it.
    ...
    """

    ## ...

    def dump(self, f):
        """Dump the fit to the file f."""
        f = h5py.File(f, "w")

        data = f.create_group("data")
        self.data.dump(data)

        params = f.create_group("params")
        params.attrs["min_dist2"] = self.min_dist2
        params.attrs["min_t"] = self.min_t
        params.attrs["iterations"] = self.iterations
        params.attrs["loglik"] = self.loglik
        params.attrs["equal_sigma_eta"] = self.equal_sigma_eta
        params.attrs["num_params"] = self.num_params

        self.f.dump(params)

        f.close()

    @classmethod
    def from_file(cls, f):
        """Load a saved fit from the file f."""
        f = h5py.File(f, "r")

        data = cd.CrimeData.from_hdf(f["data"])

        p = f["params"]
        efp = EMFitParams(omega=p["omega"][()], sigma2=p["sigma2"][()],
                          eta2=p["eta2"][()], theta=np.asarray(p["theta"]),
                          alpha=np.asarray(p["alpha"]))

        ef = cls(data=data, min_dist2=p.attrs["min_dist2"],
                 min_t=p.attrs["min_t"],
                 equal_sigma_eta=bool(p.attrs["equal_sigma_eta"]),
                 autorun=False)
        ef.iterations = p.attrs["iterations"]
        ef.loglik = p.attrs["loglik"]
        ef.num_params = p.attrs["num_params"]
        ef.f = efp

        f.close()
        return ef
HDF5 comes with a set of command-line utilities for inspecting HDF5 files, so we can look at a file produced with this code:
alex@rosmarus:~/crime$ h5ls -r fits/burglary-all.h5
/                        Group
/data                    Group
/data/Ms                 Dataset {38538}
/data/predictors         Dataset {9}
/data/responses          Dataset {1}
/data/ts                 Dataset {38538}
/data/xy                 Dataset {38538, 2}
/params                  Group
/params/alpha            Dataset {10}
/params/eta2             Dataset {SCALAR}
/params/omega            Dataset {SCALAR}
/params/sigma2           Dataset {SCALAR}
/params/theta            Dataset {10}
These files hence contain the fit parameters, the data used to generate the fit, and attributes recording the settings used to create it. No longer do you have to encode the model settings in the filenames of your output files; they can all be spelled out as attributes inside the HDF5 file.
HDF isn’t the only format that can do this, of course, and it does require writing a bunch of code to encode the relevant settings, but it is very useful to store results this way.
HDF5 libraries are available for nearly any programming language you might use: a few highlights are h5py and PyTables for Python, the rhdf5 and hdf5r packages for R, and MATLAB’s built-in HDF5 support.
A related problem to tabular data storage is serialization. Serialization usually refers to taking data structures in memory – variables and objects in your program – and turning them into bytes that can be written to the hard drive.
Serialization applies to more than just tabular data. Can I serialize a Python object? An R list? A dictionary? Closures?
Languages usually have their own serialization formats to do this.
Python has pickle, which can “pickle” (almost) any Python object into a file.
import pickle

with open("foo.pickle", "wb") as f:
    pickle.dump(some_big_object, f)

with open("foo.pickle", "rb") as f:
    reconstituted_object = pickle.load(f)
Pickle files are stored in a binary format used only by Python, so you may not be able to read them from any other language. The format also sometimes changes with new Python versions (there have been five different protocol versions so far).
To quote the manual:
The following types can be pickled:

- None, True, and False
- integers, floating point numbers, complex numbers
- strings, bytes, bytearrays
- tuples, lists, sets, and dictionaries containing only picklable objects
- functions defined at the top level of a module (using def, not lambda)
- built-in functions defined at the top level of a module
- classes that are defined at the top level of a module
- instances of such classes whose __dict__ or the result of calling __getstate__() is picklable (see section Pickling Class Instances for details).
You can pickle classes and functions, but they are pickled by name: to unpickle them, Python has to be able to import the module in which they were defined. You can’t pickle a function and send it to someone who does not have that function’s code. You also can’t pickle things like psycopg2 connection objects, open file handles, and other temporary resources.
Pickling is not suitable for sharing data online. Notice the warning at the top of the manual page:
Warning: The pickle module is not secure against erroneous or maliciously constructed data. Never unpickle data received from an untrusted or unauthenticated source.
The pickle data format (at least the text-based version) is a simple stack programming language, which is why it’s so flexible and able to store so many kinds of data. But here’s a simple example of a pickle shellcode:
cos
system
(S'/bin/sh'
tR.
Saved as canonical.pickle, this instructs Pickle to load os and call os.system with /bin/sh as the argument. Try it at the Python prompt and you get:
>>> import pickle
>>> pickle.load(open('canonical.pickle', 'rb'))
sh-3.2$
Oops! Arbitrary code execution.
So probably don’t use Pickle to post data online, and don’t download random Pickle files from the Web.
R has its own native serialization format as well: the RData file, produced by save and saveRDS.
save is probably what you’re most familiar with. It saves R objects to a file, just like pickle does; it can take the objects directly as arguments, or a character vector of their names via its list argument. Its counterpart load reads the file and puts those objects back into your environment. That is:
## In one source file:
foo <- 4
save(foo, file = "bar.Rdata")
## In a different source file
load("bar.Rdata")
foo #=> 4
This is dangerous: if the RData file contains variables with the same name as variables in your current workspace, the ones in your workspace will be overwritten. attach reads an RData file but puts it into a new environment in the search path, so it does not overwrite any data.
The same serialization is available through saveRDS and readRDS, which serialize and load a single object; readRDS returns the object rather than sticking it in your environment.
RData files use a custom binary format and are not intended to be portable to other software. They include gzip compression, so large datasets can be stored in smaller files than the equivalent CSV. But since RData files are not portable, you should think carefully before using them to share data with other people.
There’s a fundamental conflict in serialization: ideally we’d like to serialize any kind of object in our programming language, but since every programming language has different kinds of fundamental objects and data structures, serialization formats have to differ between languages.
General-purpose serialization formats usually make a compromise by supporting a restricted subset of data types.
JSON (JavaScript Object Notation) is one of the most common. JSON has only a few fundamental data types:

- numbers
- strings
- the special values true, false, and null
- lists (arrays) of values
- objects, which are collections of key-value pairs

Objects may contain lists, lists may contain other lists, and so on.
JSON is written as plain text. The example from Wikipedia shows all the features:
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 27,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    {
      "type": "home",
      "number": "212 555-1234"
    },
    {
      "type": "office",
      "number": "646 555-4567"
    },
    {
      "type": "mobile",
      "number": "123 456-7890"
    }
  ],
  "children": [],
  "spouse": null
}
It’s an object containing key-value pairs, some of which contain lists of objects, and so on.
Languages often have libraries to convert arbitrary objects into JSON; R’s jsonlite, for example, can convert most types of objects to and from JSON, and Python’s json module supports most basic Python data types and lets you extend it to convert new kinds of objects to JSON.
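For instance, a quick sketch with Python’s json module (the record here is made up, echoing the example above):

import json

record = {"firstName": "John", "age": 27, "children": [], "spouse": None}

text = json.dumps(record, indent=2)   # serialize to a JSON string
restored = json.loads(text)           # parse it back into a dictionary
assert restored == record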
One flaw of JSON is that there is no schema: no way to define what fields and values specific kinds of JSON files should include. If you have JSON records for students, for example, there’s no native way to specify that a student record must contain specific fields with specific types, so your JSON decoder will have to recover gracefully if the file it reads has different names, missing fields, extra fields, or whatever. JSON Schema is a draft proposal to fix this problem.
YAML is another serialization language that’s intended to be more human readable, using indentation instead of braces. It’s also a superset of JSON.
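To give a flavor, here is roughly how part of the example above could be written as YAML (a sketch, not a complete translation):

firstName: John
lastName: Smith
isAlive: true
age: 27
address:
  streetAddress: 21 2nd Street
  city: New York
phoneNumbers:
  - type: home
    number: 212 555-1234
children: []
spouse: null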
XML is another option, particularly if you’re making data that needs to be read in the 1990s. It has extensive features to create and enforce schemas.
When you publish a paper, you should (if ethical and legal) make the data behind it available. This aids reproducibility, and also gives a way for other authors to cite you if they build on your data.
Journals handle this in different ways. Reproducibility is increasingly expected, so some journals are starting to require that data be published, or at least that authors promise to share data on request (though many authors ignore this promise).
Data files can sometimes be large or unwieldy, making it inconvenient to post them online. Journals sometimes have file size limits for supplementary materials, and their supplements may be deposited in a weird way (the Annals of Applied Statistics used to post data supplements on StatLib, which has since gone offline, and JRSS just tells you to go to one web page with a list of hundreds of datasets, instead of sending you directly to the data).
There are special-purpose data repositories you can use instead, such as Zenodo, Dryad, and ICPSR.
“Why not post it on my website?”, you might ask. And why not just put the data on GitHub? Or share a Dropbox link?
Because a key requirement of data archiving is that it should be durable. We should be able to retrieve the dataset twenty years from now. It should survive hard drive failures or your account expiring or commercial services being bought by evil corporations.
Your website does not meet these criteria. Nor does your hard drive. Studies have shown that data disappears rapidly with age: “The odds of a data set being reported as extant fell by 17% per year”, often because scientists lost their hard drives or the data was stored on an obsolete medium (like a Zip disk) that they could no longer read.
Data that is part of the scientific record should be stored in a purpose-built repository that keeps reliable backups and is committed to keeping it available for decades, as the above repositories are. It should not depend on you not spilling tea on your laptop.