Management of Research Data – a Shell+Python+Excel+R Approach

I am a computer science researcher, usually working on both Windows and Linux system. Windows is the place where I do the document work, like reading paper, browsing the internet, writing papers with LaTex… Linux is where I run and generate experimental results.

The Chaos

After years of messy data management and recent data chaos, I decided to sort things out. The motivation is when you are RESEARCHING, you tune the parameters and you look at the data, you tune the parameter again and you look the data, and you tune the parameters and you… Then you get lots of output from the experiments you ran. They have lots different parameters. In the later run you may have more parameters when you think you’d better tune more to see if there is any difference…

The Elements

When you do research, you have these on your desktop related to your experiments:

  • An driver application: this can be a benchmark, a real application, or something you just wrote
  • A platform: in system research, this might be the one that your research is trying to improve. This can be a file system, a program with the algorithm you come up with. The driver application runs on the platform.
  • A driver script: most of time, this is a shell script by which you invoke the driver application to run the platform. This script usually start a platform, then start a driver application, redirect the output to a file.
  • Raw result files.
  • Result parser: this can be a Shell script, a Python script or something else. You collect useful data from the raw result files and put them out in a more formatted way (rows and columns).
  • Formatted result files: the file with data generated by result parser.
  • Excel data files: they have the data from formatted result files. In Excel, it is easier to browse and manipulate the data. But sometimes you still need to copy the data back to Linux to use scripts to process them.
  • Data explanation: raw data is not enough. Without an explanation, it is nothing. The position of the data document is tricky. It can be within the formatted result files, the Excel data files. Or the formatted data can be in the data explanation. The data explanation can be in Linux or Windows…
    • Dictionary: A dictionary for all the words you created. For example, explain the parameter in the data document.
    • Link: as illustrated in the figure above by the dotted lines, those elements should be link
  • Plotting script.
  • Graphs: they are generated from the formatted data.

Suggestion

Driver script/driver application/platform
  1. Always make parameter changing easy. You don’t want to change a parameter and recompile the application every time you run.
  2. The commit numbers should be parameters of the formatted data. It is easy, just treat them as parameters and put them to raw result file by driver script.
    1. You can automate it by command like $git log |grep commit | head -n 1|cut -d ‘ ‘ -f 2.
  3. Using Linux environment variable to pass parameters to application or platform is a good method if passing by arguments is not possible.
    1. In driver script, use something like $export MYPARAMETER1=1000 to set the variable.
    2. In the application, get the variable by parameter1=getenv(“MYPARAMETER1”);
  4. Each run the of driver application is recommended generating one line in the out put.
  5. Each driver script is recommended generating only one raw data file. If you have more than one file, the parser needs to do more work.
  6. Always use shell script to automate your test. You don’t want to wait and check if this run has finished all the time and start a new run. You want to start multiple runs, go watch TV and come back later.
Raw data files
  1. Should have all the parameter, commits. The standard of a good raw data file is that you can figure out what exactly you the driver application and platform are.
  2. Different data files from same parameters should have something different, say, job ID.
Result parser
  1. Use Python to parse data. Python is easier and more powerful than shell.
  2. The result parser should ALWAYS generate formatted data files in the standard R format, which can be imported to R directly without any human effort. The formatted data files should have header (name for each column).
  3. Always have a result parser. You don’t want to type the result file name and grep different results manually every time. You want to type one command and everything come out.
Formatted data file
  1. Consider formatted data files as temporary. Put them to Excel immediately as they are generated.
Excel data file
  1. Have one sheet in Excel as a Dictionary.
  2. If the meaning of a word in the dictionary has changed, use a new word in the future.
  3. Let the Excel file be the hub of all experiment data. Consider all that data temporary (and they are). In the Excel file, one sheet has results from one driver script run. In the sheet, you should have the first a few lines reserved for data explanations, graph and plotting script of that graph.
Misc
  1. Name any useful file accordingly and carefully. Put the file name as a word and explain it in the dictionary if possible.
  2. Save all the useful files to a centerized sheet in Excel. You can insert any file by Insert Object.
Tips:

Use a good output format in the executable. If you have a bad output format, every time you modify some output, you have to modify the output parser script. If your output is already formatted as:

Headline ...... HEADERMARKER
Dataline ...... DATAMARKER
it is much easier for you to get the data your want.
Experience:
  • Be sure to describe your data in datahub right away after you got the data.  Otherwise you will feel very hard to know what it is after a day or two. Do this even you thing the data might not be valid. If you find it invalid, you can simply go back to the datahub and marked the data as invalid.
  • For large files, it is OK to put it elsewhere (not in the Dropbox or the same folder of the data hub). But be sure to use a non-temporary folder to host the data.
  • Have one Excel data sheet for one experiment data set. With location of the dataset (if it is big), plots/analysis and the scripts of the analysis.
  • Data analysis may be able to “paste and run”, without any code reading. Don’t you have the experience that, once it fails to run then you give up? Make you life easier for later not now. Make your experiment/data analysis easily reproducible. In R, define one function for one dataset. You can define embedded functions inside this function.
  • Presentation slide also can be a good datahub. In the slide, you can put the descriptions of the experiments, the graphs plotted, some marks on the graphs, data location, the data generator info. In the slide comment, you can put the self-contained copy-paste-capable script to regenerate the graph.
  • The data generator, data, analyzer (e.g. R script) and analysis doc (datahub) should be kept as close as possible.