Krishna's PhD thesis: "Lowering the Barriers to Hypothesis-Driven Data Science"
Abstract
Data science is common in both academia and industry. One frequent use of data science is to validate hypotheses, in which the analyst uses significance-based hypothesis testing to draw insights about a population distribution from experimental data. Apart from data scientists, who are professionally trained in data science and highly skilled, many non-professional analysts also carry out data analysis. These non-professionals, whom we refer to as data workers, are domain experts who lack expertise in data science, such as academic researchers, project managers, and sales managers.
Through interviews, observations, online surveys, and content analyses, we aim to understand data workers' workflows across important tasks in hypothesis testing: learning theoretical and practical statistics, selecting statistical procedures, using data science programming IDEs to experiment with ideas in source code, refining and refactoring source code, and disseminating findings from an analysis.
We present our findings grouped into two phases of performing data science tasks:
- Preparing to perform data science tasks: We discuss our findings about the impact of formal training on real-world statistical practice; trade-offs among information sources used for selecting statistical procedures; perceived complexity and uncertainty in statistical procedure selection; and reluctance among data workers to adopt alternative methods of analysis. Based on these findings, we present design recommendations and two artifacts to improve data workers' workflows. Our artifacts include Statsplorer, a web-based tool that helps data workers kickstart an analysis and learn about common issues in statistical practice, such as over-testing, overlooking assumptions, and selecting the appropriate test; and StatPlayground, an interactive simulation tool for self-learning or teaching statistical concepts and statistical procedure selection.
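To make one of these issues concrete, here is a minimal sketch (illustrative only, not code from Statsplorer; uses SciPy with hypothetical sample data) of assumption-aware test selection: the normality assumption is checked before deciding between a parametric and a non-parametric two-sample test.

```python
import numpy as np
from scipy import stats

# Hypothetical experimental data for two independent groups.
rng = np.random.default_rng(seed=0)
group_a = rng.normal(loc=5.0, scale=1.0, size=40)
group_b = rng.normal(loc=5.5, scale=1.0, size=40)

# Check the normality assumption with a Shapiro-Wilk test
# before selecting the statistical procedure.
normal_a = stats.shapiro(group_a).pvalue > 0.05
normal_b = stats.shapiro(group_b).pvalue > 0.05

if normal_a and normal_b:
    # Assumption holds: independent-samples t-test.
    result = stats.ttest_ind(group_a, group_b)
else:
    # Assumption violated: non-parametric Mann-Whitney U test.
    result = stats.mannwhitneyu(group_a, group_b)

print(f"p-value: {result.pvalue:.4f}")
```

Tools like Statsplorer aim to surface exactly this kind of decision point, which data workers otherwise overlook when they jump straight to a familiar test.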
- Performing data science tasks: Our findings include an overview of data workers' workflows when performing hypothesis testing using programming IDEs, which follow an exploratory programming workflow; and a comparison of existing interfaces for data science programming, namely computational notebooks, scripts, and consoles, with a discussion of how well each supports the steps of hypothesis testing. To improve data workers' workflows when performing data science tasks, we contribute design recommendations and two artifacts. Our artifacts include StatWire, an experimental hybrid programming interface that encourages data workers to write high-quality source code; and Tractus, an interactive visualization that lowers the cost of working with experimental source code.
Based on our work, we present four takeaways that can be used by researchers, software developers, and educators to lower the barriers to hypothesis testing.
Defense Talk
Supplements
From the phase "Preparing to perform data science"
- Details of interviews, survey, and content analysis from studies to understand how data workers select statistical procedures
- Back-end code and video preview from StatPlayground
- Details of interview/observational study used to understand how data workers use various programming interfaces
From the phase "Performing data science"
- Details of observational study used to understand exploratory analysis workflow, Tractus GitHub source, and details of Tractus' parser evaluation
- Details of code corpus analysis, StatWire GitHub source, and StatWire demo video