Code injection vulnerabilities have been exploited in repeated attacks on US election systems, in the theft of sensitive financial data, and in the theft of millions of credit card numbers. My collaborators and I have created a new approach for detecting injection vulnerabilities in applications by harnessing the combined power of human developers' test suites and automated dynamic analysis. Our new approach, RIVULET, monitors the execution of developer-written functional tests to detect information flows that may be vulnerable to attack (using my taint tracking system, Phosphor). Then, RIVULET uses a white-box test generation technique to repurpose those functional tests to check whether any vulnerable flow could actually be exploited. When applied to the version of Apache Struts exploited in the 2017 Equifax attack, RIVULET quickly identified the vulnerability, leveraging only the tests that existed in Struts at that time. We compared RIVULET to the state-of-the-art static vulnerability detector Julia on benchmarks, finding that RIVULET produced both fewer false positives and fewer false negatives than Julia. We also used RIVULET to detect previously unknown vulnerabilities in Jenkins and iTrust.
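To make this concrete, the following is a minimal, hypothetical sketch (not RIVULET's API, and not real Struts code) of the kind of source-to-sink flow that taint tracking surfaces while a functional test runs, alongside a parameterized alternative that would not be exploitable:

```java
// Hypothetical example only: an untrusted value flows into a query string unsanitized,
// which is the kind of source-to-sink flow a taint tracker would flag during a test run.
public class InjectionFlowExample {
    // Simulated untrusted source, standing in for an HTTP request parameter.
    static String untrustedParam() {
        return "1 OR 1=1"; // an attacker-controlled value
    }

    // Vulnerable: the tainted value is concatenated directly into the query.
    static String buildQueryUnsafe(String id) {
        return "SELECT * FROM users WHERE id = " + id;
    }

    // Safe: the value would be bound as a parameter (e.g. via PreparedStatement),
    // so the tainted data never alters the query structure.
    static String buildQuerySafe() {
        return "SELECT * FROM users WHERE id = ?";
    }

    public static void main(String[] args) {
        System.out.println(buildQueryUnsafe(untrustedParam())); // source -> sink flow
        System.out.println(buildQuerySafe());
    }
}
```

For a flow like the unsafe one above, RIVULET's second phase would rerun the covering functional test with attack payloads substituted for the untrusted input, reporting the flow only if it is actually exploitable.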
Our ongoing work in this area aims to detect more kinds of vulnerabilities with even less reliance on developer-provided tests.
Whenever a developer pushes changes to a repository, tests are run to check whether those changes broke some functionality. Ideally, every new test failure would be due to the latest changes that the developer made, and the developer could focus on debugging these failures. Unfortunately, some failures are not due to the latest changes, but due to flaky tests. A flaky test is a test that can non-deterministically pass or fail when run on the same version of the code --- flaky tests might also pass when they should have failed. For most modern applications, flaky tests are inevitable. For instance, consider a system test at Google that involves loading a page that has an ad embedded in it. If the ad serving system is overloaded and unable to serve an ad within a time limit, the test might be served the page without any ads. In this case, the test runner may not be able to distinguish between a broken ad server (which might not serve ads to any client) and a functional ad server that simply dropped the request.
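As an illustration only (this hypothetical JUnit test is not drawn from Google's code or from any project we studied), a test along these lines can pass or fail purely based on timing:

```java
import static org.junit.Assert.assertTrue;
import org.junit.Test;

// Hypothetical flaky test: its outcome depends on scheduling, not on the code under test.
public class AdServerTest {
    private static volatile boolean adServed = false;

    @Test
    public void pageIncludesAd() throws Exception {
        Thread worker = new Thread(() -> simulateAdRequest());
        worker.start();
        worker.join(50); // wait at most 50ms for the "ad server"
        // If the worker did not finish in time, the page renders without an ad and the
        // assertion fails -- even though nothing in the system is actually broken.
        assertTrue("expected an ad on the page", adServed);
    }

    private static void simulateAdRequest() {
        try {
            // Simulated ad-server latency; under load this can exceed the 50ms budget.
            Thread.sleep((long) (Math.random() * 100));
            adServed = true;
        } catch (InterruptedException ignored) { }
    }
}
```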
My work on flaky tests began with the problem of test order dependencies: tests that can unexpectedly fail if they are run in a different order. This early work considered how to efficiently isolate tests to prevent this flakiness and how to precisely detect which tests depend on each other (ElectricTest and PraDet), allowing developers to determine which orderings will result in flakiness; an illustrative example of such a dependency follows this paragraph. Looking at flaky tests more broadly than just test order dependencies, my collaborators and I created DeFlaker, which can mark a test outcome as flaky immediately upon failure (without rerunning the test) by using code coverage results. Further considering the relationship between coverage and flaky tests, we conducted a very large-scale analysis of code coverage, examining how coverage of individual statements may be non-deterministic and change over time.
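The following hypothetical pair of JUnit tests (not taken from any studied project) illustrates the kind of order dependency that ElectricTest and PraDet detect: the second test passes when run alone or first, but fails if the first test has already polluted the shared state:

```java
import static org.junit.Assert.assertEquals;
import org.junit.Test;

// Hypothetical order-dependent tests: they communicate through shared static state.
class Config {
    static String mode = "default";
}

public class OrderDependentTests {
    @Test
    public void testEnablesDebugMode() {
        Config.mode = "debug";              // pollutes state shared with other tests
        assertEquals("debug", Config.mode);
    }

    @Test
    public void testAssumesDefaultMode() {
        // Passes if run first; fails if testEnablesDebugMode already ran in this JVM.
        assertEquals("default", Config.mode);
    }
}
```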
If tests non-deterministically cover different statements, then downstream testing techniques like mutation testing, program repair, and fault localization can be confounded. We found that even when tests do not appear flaky (e.g. their outcome is always "pass"), the set of lines covered by each test may vary non-deterministically (we found 22% of statements across 30 projects to have flaky coverage). As we reported in our ISSTA 2019 paper, this change in coverage can result in a wide variance in mutation scores.
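To illustrate flaky coverage, here is a hypothetical test (not one of our study's subjects) that always passes yet covers different statements from run to run, depending on a race with a background thread:

```java
import static org.junit.Assert.assertNotNull;
import org.junit.Test;

// Hypothetical test with flaky coverage: it always passes, but which branch of
// lookup() it covers depends on whether the cache warm-up thread wins the race.
public class FlakyCoverageExample {
    private static final java.util.Map<String, String> cache =
            new java.util.concurrent.ConcurrentHashMap<>();

    static String lookup(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                  // covered only when the warm-up ran first
        }
        return "computed-" + key;           // covered otherwise
    }

    @Test
    public void lookupAlwaysReturnsAValue() throws Exception {
        Thread warmUp = new Thread(() -> cache.put("k", "cached-k"));
        warmUp.start();
        String result = lookup("k");        // races with the warm-up thread
        warmUp.join();
        assertNotNull(result);              // passes either way; coverage differs
    }
}
```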
Most recently, we performed a longitudinal study of test flakiness, tracing the origin of 245 flaky tests, and presented this study at OOPSLA 2020. For each flaky test, we evaluated every revision of its project from the revision that first introduced the test to the revision in which the test first became flaky. We found that 75% of the tests that we studied were already flaky when they were first added to the project; the remaining 25% became flaky only after they were added.
We have several active projects underway that aim to help developers cope with flaky tests.
We have built one-of-a-kind JVM-based runtime systems for dynamic taint tracking and checkpoint-rollback that have enabled many new research contributions in software engineering and security. These systems-oriented contributions address engineering problems that arose while we were working to solve (developer-facing) software engineering problems. Both of these systems are designed to be extremely portable (using only public APIs to interface with the JVM) and extremely performant, allowing them to be embedded as part of a larger tool (in our ongoing and future work).
Dynamic taint tracking is a form of information flow analysis that identifies relationships between data during program execution. Inputs to the program are labeled with a marker (``tainted''), and these markers are propagated through data flow. Traditionally, dynamic taint tracking is used for information flow control or for detecting code-injection attacks. Without a performant, portable, and accurate tool for dynamic taint tracking in Java, software engineering research that relies on it is restricted. In Java, associating metadata (such as taint tags) with arbitrary variables is very difficult: previous techniques relied on customized JVMs or symbolic execution environments to maintain this mapping, limiting their portability and their applicability to large, complex real-world software. To close this gap, we created Phosphor (OOPSLA 2014), which provides taint tracking within the Java Virtual Machine (JVM) without requiring any modifications to the language interpreter, VM, or operating system, and without requiring any access to source code.
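The sketch below illustrates the idea of propagating taint labels through data flow; it is purely conceptual and is not Phosphor's API (Phosphor attaches and propagates tags transparently via bytecode instrumentation, with no wrapper types):

```java
import java.util.HashSet;
import java.util.Set;

// Conceptual sketch of taint propagation, not Phosphor's API: labels attached at a
// source are carried along through operations and checked at a sensitive sink.
public class TaintSketch {
    static final class Tainted {
        final String value;
        final Set<String> labels;
        Tainted(String value, Set<String> labels) {
            this.value = value;
            this.labels = labels;
        }
        // Data flow: the result of an operation carries the labels of its operands.
        Tainted concat(Tainted other) {
            Set<String> merged = new HashSet<>(labels);
            merged.addAll(other.labels);
            return new Tainted(value + other.value, merged);
        }
    }

    static Tainted source(String raw) {      // label data entering the program
        Set<String> labels = new HashSet<>();
        labels.add("http-parameter");
        return new Tainted(raw, labels);
    }

    static void sink(Tainted query) {        // inspect labels before a sensitive operation
        if (!query.labels.isEmpty()) {
            System.out.println("tainted data reached sink: " + query.labels);
        }
    }

    public static void main(String[] args) {
        Tainted id = source("1 OR 1=1");
        Tainted query = new Tainted("SELECT * FROM users WHERE id = ", new HashSet<>()).concat(id);
        sink(query); // reports the http-parameter label carried through the concatenation
    }
}
```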
Checkpoint/rollback (CR) tools capture the state of an application and store it in some serialized form, allowing the application to later resume execution by returning to that same state. CR tools have been employed to support many tasks, including fault tolerance, input generation and testing, and process migration. Prior approaches to JVM checkpointing required a specialized, custom JVM, making them difficult to use in practice. Our goal is to provide efficient, fine-grained, and incremental checkpoint support within the JVM, using only commercial, stock, off-the-shelf, state-of-the-art JVMs (e.g. Oracle HotSpot and OpenJDK). Guided by key insights into the behavior of the JVM's Just-In-Time (JIT) compiler and the typical object memory layout, we created CROCHET: Checkpoint ROllbaCk with lightweight HEap Traversal for the JVM (ECOOP 2018). CROCHET is a system for in-JVM checkpoint and rollback that provides copy-on-access semantics for individual variables (on the heap and stack), imposes very low steady-state overhead, and requires no modifications to the JVM.
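The following is a conceptual sketch of copy-on-access checkpointing, not CROCHET's implementation (CROCHET instruments field and array accesses inside stock JVMs rather than relying on explicit bookkeeping like this):

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of copy-on-access checkpoint/rollback: old values are saved lazily,
// the first time a field is written after a checkpoint, so untouched state costs nothing.
public class CheckpointSketch {
    private final Map<String, Integer> fields = new HashMap<>();
    private Map<String, Integer> undoLog = null;   // populated lazily after a checkpoint

    void checkpoint() {
        undoLog = new HashMap<>();                  // nothing is copied at checkpoint time
    }

    void write(String field, int value) {
        // Copy-on-access: record the old value only on the first post-checkpoint write.
        if (undoLog != null && !undoLog.containsKey(field)) {
            undoLog.put(field, fields.getOrDefault(field, 0));
        }
        fields.put(field, value);
    }

    void rollback() {
        fields.putAll(undoLog);                     // restore only the fields that changed
        undoLog = null;
    }

    public static void main(String[] args) {
        CheckpointSketch state = new CheckpointSketch();
        state.write("x", 1);
        state.checkpoint();
        state.write("x", 42);
        state.rollback();
        System.out.println(state.fields.get("x"));  // prints 1
    }
}
```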