Casual Notebooks and Rigid Scripts: Understanding Data Science Programming

This webpage contains supplementary material for the paper "Casual Notebooks and Rigid Scripts: Understanding Data Science Programming" by Krishna Subramanian, Nur Hamdan, and Jan Borchers. The four supplementary materials are below: 

  1. Details of our formative study participants and the data we collected from them.
  2. Details of our Grounded Theory analysis, including the codebook and our procedure.
  3. Phases in data science programming.
  4. Design recommendations.

For questions, please contact the authors.

1. Details of our formative study participants and the data we collected from them

| Participant | Experience with the scripting language (in years) | Domain (scripting language) | Modalities | IDE(s) used | Interview & walkthrough (in minutes) | Observation (in minutes) |
|---|---|---|---|---|---|---|
| P01 | 1 | Significance testing (R) | Scripts | RStudio | 15 | 45 |
| P02 | | Significance testing (R) | Scripts | RStudio | 20 | 35 |
| P03 | 1 | Significance testing (R) | Scripts | RStudio | 20 | 25 |
| P04 | | Significance testing (R) | Scripts | RStudio | 15 | 30 |
| P05 | 2 | Machine learning (Python) | Notebooks | Jupyter notebooks | 20 | 20 |
| P06 | | 3D data processing (Python) | Scripts | Blender | 40 | |
| P07 | | Significance testing (R) | Scripts | RStudio | 30 | 40 |
| P08 | | Machine learning (Python) | Both | PyCharm | 30 | 50 |
| P09 | 0.5 | Financial analysis (R) | Both | RStudio | 40 | 60 |
| P10 | | Machine learning (Python); Significance testing (R) | Both | RStudio, Jupyter notebooks | 60 | 45 |
| P11 | | 3D data processing (Python); Numerical analysis (MATLAB) | Scripts | PyCharm | 20 | 30 |
| P12 | | Equation modeling (R); Significance testing (R) | Scripts | RStudio | 55 | |
| P13 | | Machine learning (Python); Significance testing (Python, R) | Both | RStudio, Jupyter notebooks, PyCharm | 55 | 30 |
| P14 | | Machine learning (Python, R, MATLAB) | Both | PyCharm, Jupyter notebooks, RStudio, MATLAB | 30 | |
| P15 | | Machine learning (Python) | Both | TextMate, Jupyter notebooks | 40 | |
| P16 | 10 | Machine learning (Python, R) | Both | Spyder, Jupyter notebooks, RStudio | 60 | |
| P17 | | Machine learning (Python, MATLAB) | Both | MATLAB, Jupyter notebooks, Sublime | 45 | |
| P18 | | Numerical analysis (MATLAB) | Scripts | Vim, MATLAB | 40 | |
| P19 | | Numerical analysis (MATLAB, Python) | Scripts | MATLAB | 60 | |
| P20 | | Machine learning (Python) | Scripts | PyCharm | 45 | |
| P21 | | Significance testing (R); Machine learning (Python) | Scripts | RStudio, PyCharm | 55 | |

 

2. Details of our Grounded Theory analysis

 

See this document (PDF).

3. Phases in data science programming

 


We identified four phases in data science programming:

I. Data collection and cleaning

In this first phase, participants collect and clean data. Data is mostly collected by someone else (P02, P03, P06, P08, P09, and P12–19), sometimes by the participants themselves (P01, P04, P07, P10, P16, P20, and P21), and otherwise drawn from public datasets (P05, P11, and P16). After collecting data, participants prepare the dataset for analysis, e.g., by converting it to the right format and removing outliers. Data cleaning is time-consuming and recurs throughout the analysis.
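
To make this concrete, here is a minimal, hypothetical sketch of such a preparation step in Python with pandas; the library choice, file names, column names, and outlier rule are our own illustrative assumptions, not taken from any participant's code:

```python
import pandas as pd

# Hypothetical example: load a raw CSV, fix types, and remove outliers.
df = pd.read_csv("raw_measurements.csv")

# Convert columns to the formats expected by the analysis.
df["participant"] = df["participant"].astype("category")
df["completion_time"] = pd.to_numeric(df["completion_time"], errors="coerce")

# Drop missing values and completion times more than 3 standard deviations from the mean.
df = df.dropna(subset=["completion_time"])
mean, std = df["completion_time"].mean(), df["completion_time"].std()
df = df[(df["completion_time"] - mean).abs() <= 3 * std]

df.to_csv("cleaned_measurements.csv", index=False)
```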

II. Experimentation

Participants often experiment with different approaches to obtain insights from their data. These approaches are implemented as "quick and dirty" prototypes in the source code, often in a less modular, less reusable manner. In this phase, the data science workflow is highly iterative and unpredictable: experiments lead to comparisons, comparisons generate more ideas to explore, and so on. To compare approaches, our participants used various criteria, e.g., code metrics such as execution time and memory usage, but also domain-dependent criteria such as statistical power (P04) or the results of a fitting function (P10).
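
The following hypothetical Python sketch illustrates the "quick and dirty" style of such a comparison, here using execution time as the criterion; the two approaches, the data, and the 1% cutoff are our own illustrative assumptions, not participant code:

```python
import time

import numpy as np

# Hypothetical throwaway comparison of two candidate approaches, written inline
# rather than as reusable functions, as described above.
data = np.random.rand(1_000_000)
k = int(len(data) * 0.01)  # keep the smallest 1% of values

t0 = time.perf_counter()
result_a = np.sort(data)[:k]               # approach A: full sort
t_a = time.perf_counter() - t0

t0 = time.perf_counter()
result_b = np.partition(data, k)[:k]       # approach B: partial sort
t_b = time.perf_counter() - t0

print(f"approach A: {t_a:.3f}s, approach B: {t_b:.3f}s")
```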

III. Code refinement

To prepare the source code of an analysis for dissemination, participants (a) improve the readability of the source code by adding documentation and pruning "scratchpad" code (P02–P04, P07, P09, P10, and P13), (b) refactor source code, either in place or into a new script file or computational notebook, to improve code quality and reusability (P02–P04, P08, P10, P13, and P15), and (c) extend source code so that it works with a wider range of input (P08 and P13; this was more common in machine learning).
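
As a hypothetical illustration of steps (a)–(c), a scratchpad snippet might be refactored into a documented, reusable function that tolerates a wider range of input; the function name, file name, and parameters below are our own illustrative assumptions:

```python
import pandas as pd

# Before refinement (scratchpad style): hard-coded file name and column,
# repeated inline wherever it was needed.
#   df = pd.read_csv("study1.csv")
#   df = df[df["completion_time"] < 60]

# After refinement: documented, reusable, and usable with other datasets and columns.
def load_trials(path: str, time_column: str = "completion_time",
                max_seconds: float = 60.0) -> pd.DataFrame:
    """Load trial data and drop trials that exceeded the time limit."""
    df = pd.read_csv(path)
    if time_column not in df.columns:
        raise ValueError(f"Column {time_column!r} not found in {path}")
    return df[df[time_column] < max_seconds]
```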

IV. Dissemination and storage

The final phase in data science programming is to disseminate the results of the analysis to the outside world, e.g., as a research publication, or to store them for later reuse. All participants reported disseminating the insights from their analysis as well as storing their source code for later reuse. Many also disseminate source code (P05, P08, P09, P11, and P13–19), although in some situations only as snippets or pseudo-code (P08, P09, and P13–17).

 

4. Design recommendations

 

Resolving short-term issues

In the short term, data workers face several issues as a result of using both modalities. Source code often gets cloned within or across files, requiring data workers to maintain the clones, e.g., through linked editing. Constantly switching modalities also means that data workers have to re-establish their understanding of the source code and the state of their experiments after every switch. This problem is exacerbated by sparse documentation and the prevalence of unstructured exploratory code. Relatedly, it can be difficult for data workers to navigate to and find prior code, a frequent task when beginning new experiments during experimentation, as well as when refining source code for dissemination or storage. Automatically generated source code visualizations, such as Code Bubbles, Code Thumbnails, and TRACTUS, could help with code understanding, navigation, and task resumption.

Improving reproducibility

Scientific claims gain credibility when the researcher provides enough detail for an independent researcher to reproduce the findings. When refining code for dissemination, data workers often rewrite their exploratory code. In doing so, they need to capture not just the source code that produced the reported approach, but also the approaches that were explored earlier and led to the final approach. Data workers are, after all, responsible for the decisions they make in their work!

Through our observations and interviews, we found that many participants (P02–P04, P10, and P13) primarily record only the source code that produces the results they will use. For example, P02, an HCI researcher, used an R script to capture his analysis. The script contained the source code for loading the dataset, subsetting the data, and the significance tests, including post-hoc analysis. While it also contained comments capturing the rationale for test selection, e.g., "distributions not normal" and "parametric test is not valid", it did not capture the source code for these checks. This lowers the reproducibility of the analysis.
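
P02's script was written in R; as a hypothetical illustration (using Python with NumPy and SciPy for consistency with the other sketches, and with invented data and variable names), recording the check itself next to the rationale might look like this:

```python
import numpy as np
from scipy import stats

# Hypothetical paired measurements standing in for the analyzed data.
times_a = np.array([12.1, 10.4, 15.3, 11.8, 14.0, 13.2, 16.5, 12.9])
times_b = np.array([11.0,  9.8, 14.1, 12.0, 13.1, 12.5, 15.2, 11.7])

# Record the check that justified the test choice, not only a comment stating its outcome.
_, p_normal = stats.shapiro(times_a - times_b)   # normality check on paired differences
if p_normal < 0.05:
    # "distributions not normal": a parametric test is not valid,
    # so fall back to a non-parametric alternative.
    statistic, p_value = stats.wilcoxon(times_a, times_b)
else:
    statistic, p_value = stats.ttest_rel(times_a, times_b)
print(statistic, p_value)
```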

Participants are aware of the importance of reproducible research, but do not want to clutter their scripts, as this might hamper code navigation. Furthermore, as discussed earlier, hidden dependencies are prevalent in notebooks. Some participants reported encountering notebooks that no longer execute because not all dependencies had been copied over when migrating code. Potential solutions include dependency managers, e.g., Drake, and tools that help find code snippets, e.g., Code Gathering Tools and Verdant.
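
As a hypothetical illustration of such a hidden dependency, the copied cell below assumes a variable that was defined in a cell that was not migrated; the variable and file names are our own illustrative assumptions:

```python
# Copied notebook cell: relies on `df`, which an earlier, non-migrated cell created.
try:
    summary = df.groupby("condition")["completion_time"].mean()
except NameError as error:
    # The defining cell was left behind in the original notebook:
    #   import pandas as pd
    #   df = pd.read_csv("cleaned_measurements.csv")
    print(f"Hidden dependency: {error}")
```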