PhD Defense Slides » Saswat Padhi

Data-Driven Learning of

Invariants and Specifications

Saswat Padhi

UCLA logo University of California, Los Angeles, USA

DISPLAY$ \newcommand{\argmin}{\mathop{\text{arg}\,\text{min}}} $DISPLAY

Software Reliability

Research Artifacts

CompCert is the only compiler we have tested for which Csmith cannot find wrong-code errors. This is not for lack of trying: we have devoted about six CPU-years to the task.

— X. Yang et al. “Finding and Understanding Bugs in C Compilers.” PLDI’11

Industrial Products

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

— Disclaimer from the GNU Public License (GPLv3)

Code reviews, extensive testing, continuous integration/deployment (CI/CD), reproducible builds, quality assurance (QA) … Not enough!

What limits the adoption of software verification?
How can we bridge this gap between research and practice?

Verification

Stating these formal properties accurately
is extremely onerous, even for experts

crafting a correct specification (especially one using an obscure formal system) is often much more difficult than writing the program to be proved (even one written in an obscure programming language)

— B. Hailpern et al. “Software Debugging, Testing, and Verification.” IBM Systems Journal ' 02

Existing tools provide little or no support in generating correct specifications and invariants

Research Focus

Drastically reduce the cost of building verified software by automated synthesis of formal properties

$^\star$ inspired by two emergent techniques — CEGIS (A. Solar-Lezama. STTT ' 13) and SyGuS (R. Alur et al. FMCAD ' 13)

Overview of Projects

Specification Synthesis

PIE

PLDI 2016$\,^\star$

Feature learning for black-box precondition inference

Adopted by CVC4, Alive-Infer (used by LLVM)

FlashProfile

OOPSLA 2018

Input specifications for
data-processing workflows

Released as part of Microsoft PROSE SDK

Invariant Synthesis

LoopInvGen

PLDI 2016$\,^\star$

Feature learning for white-box loop invariant inference

Winning solver at SyGuS - Comp 2017, 2018

Hanoi

PLDI 2020

Representation invariants for data-type implementations

Received an ACM SIGPLAN Distinguished Paper Award

Overfitting in Synthesis

CAV 2019

Mitigating a fundamental trade-off in data-driven synthesis

$^\star$ presented in the same PLDI 2016 paper

FlashProfile

S. Padhi, P. Jain, D. Perelman, O. Polozov, S. Gulwani, T. Millstein · OOPSLA 2018 · UCLA logo Microsoft logo

Real-life datasets are often incomplete and inconsistent

Reference ID
PMC9035311
doi:␣10.1016/S1387- 7003(03)00113-8
ISBN:␣2-287-34069-6
⋮ ⋮ ⋮
ISBN:␣0-006-08903-1
PMC9473786
⋮ ⋮ ⋮
doi:␣␣10.13039/100005795
ISBN:␣1-158-23466-X
not_available
PMC9035311

data format$\,_1$
data format$\,_2$
data format$\,_3$
data format$\,_4$

Process a large dataset

Write a script manually

Programming by examples

Manually discover and handle all data formats

Provide a representative
set of examples

FlashProfile

Data Profiles

Reference ID
PMC9035311
doi:␣10.1016/S1387- 7003(03)00113-8
ISBN:␣2-287-34069-6
⋮ ⋮ ⋮
ISBN:␣0-006-08903-1
PMC9473786
⋮ ⋮ ⋮
doi:␣␣10.13039/100005795
ISBN:␣1-158-23466-X
not_available
PMC9035311

Microsoft SSDT

1024 PMC D+
204 ISBN: ␣ 0- D D D - D D D D D - D
113 .*
110 doi: ␣+ 10. D D D D D / D+

Ataccama One

1024 W N
267 W : ␣ D - N - N - D
110 W : ␣ N . N / N
34 W : ␣ D - N - N - L
11 W : ␣ N . N / L N - N ( N ) N - D
5 W _ W

FlashProfile

1024 PMC D7
267 ISBN: ␣ D - D3 - D5 - D
110 doi: ␣+ 10.13039/ D+
34 ISBN: ␣ D - D3 - D5 -X
11 doi: ␣+ 10.1016/ U D4 - D4 ( D2 ) D5 - D
5 not_available

1024 PMC D7
301 ISBN: ␣ [ISBN]
121 doi: ␣+ [DOI]
5 not_available

FlashProfile

Key Challenges

The space of profiles is inherently ambiguous …

Is “1824” more similar to “1879” or “1924”?

Character-wise similarity measures have a fixed bias
- e.g. “1824” is more similar to “1924”
A fixed set of base patterns also results in an a priori bias
- e.g. inseparable given only D+
Must allow custom patterns for domain-specific bias
- e.g. “1824” is more similar to “1879” given [1800s] = 18.*

… and is extremely large.

Exponentially many ways of partitioning a dataset

Exponentially many ways of generalizing strings to patterns

FlashProfile

Formalization

Profiling = Clustering + Synthesis

DISPLAY$ \mathsf{Synth} : \mathtt{String[\,]} \to \mathtt{Pattern[\,]} \quad;\quad \mathsf{Cost} : \mathtt{Pattern} \times \mathtt{String[\,]} \to \mathtt{Real} $DISPLAY

DISPLAY$ \mathsf{Profile}(S,k) \triangleq \big\{ \argmin\limits_{P \,\raise{0.125ex}{\in}\, \mathsf{Synth}(S_i)} \! \mathsf{Cost}(P,S_i) : S_i \in \mathsf{Clusters}(S,k) \big\} $DISPLAY

$ \mathsf{Clusters}(S,k) \triangleq \argmin\limits_{\substack{ \small \{ S_1, \ldots, S_k \} \\ \text{s.t.} \: S = \bigsqcup\limits_{i=1}^{k} S_i }} \: \sum\limits^{k}_{i=1} \Omega(S_i) $ … $\Omega(S_i):$ objective function

$\Omega$ should capture pattern-based similarity within $S_i$

DISPLAY$ \Omega_\text{ideal}(S_i) \triangleq \hspace{-0.5em}\min\limits_{\,P \,\raise{0.125ex}{\in}\, \mathsf{Synth}(S_i)} \mathsf{Cost}(P,S_i) $DISPLAY

$\mathsf{Cost}$ of the best pattern describing the entire cluster

FlashProfile

Tractable Approximation

Clustering with $\Omega_\text{ideal}$ requires $O(2^{|S|})$ synthesis calls

A tractable objective that works well in practice:

DISPLAY$ \Omega_\text{CL}(S_i) \triangleq\!\! \max_\limits{\,x,y\,\raise{0.125ex}{\in}\,S_i} \,\min_\limits{\,P \,\raise{0.125ex}{\in}\, \mathsf{Synth}({x,y})} \mathsf{Cost}(P,{x,y}) $DISPLAY

Worst-case $\mathsf{Cost}$ of the best pattern describing a pair of strings
Inspired by the widely used$^\star$ complete-linkage clustering method
Only need to evaluate all pairs of strings; $ O(|S|^2) $ synthesis calls

Still prohibitively expensive for real-life datasets …
1. Profile a small random subset, add to the cumulative profile
2. Recursively profile strings that do not match the cumulative profile
3. Repeatedly merge two most similar partitions if profile size $> k$

$^\star$ See survey by A. K. Jain et al. CSUR ' 99

FlashProfile

Architecture

Agglomerative Hierarchical Clustering (AHC)
- An internal node corresponds to a subset of the dataset (annotated with a pattern)
- Split hierarchy at depth $k$ for $k$ partitions

(an example hierarchy)

A small regex-like grammar for patterns
- Top-down recursive synthesis
  - Patterns match the prefix of all strings
  - Recursively learn patterns for unmatched suffixes
  - Accept if no leftover suffixes, else reject
- Implemented using Microsoft PROSE

DISPLAY$ \begin{array}{rl} \text{Constant strings} & \;\text{Character classes} \\[0.25em] \text{Regular exprs.} & \;\text{String predicates} \end{array} $DISPLAY

(kinds of base patterns in FlashProfile)

FlashProfile

Results & Impact

Profiling time: little correlation with dataset size; increases with string length

Soundness and relative completeness on any user-defined patterns

Can significantly improve programming-by-example (PBE) workflows
- Reduces the number of examples required for synthesis
- FlashFill achieves minimal examples for 84% of benchmarks

Publicly released as Matching.Text C# library within the Microsoft PROSE framework

Automated Feature Learning

S. Padhi, R. Sharma, T. Millstein · PLDI 2016 · UCLA logo Microsoft logo

Users require prior knowledge
Limits expressiveness of the technique

Solution: Learn features automatically on demand

PIE

Black-box precondition inference

LoopInvGen

White-box loop invariant inference

PIE

Precondition Inference

Given Code and Post,
find Pre s.t. {Pre} Code {Post}

Limitations of Static Analyses

Source code unavailable
Source code not analyzable
Postcondition not analyzable

Guess precondition from test executions

PIE

Issue with Fixed Features

Test case	Feature Set
(s, i, l)	i ≥ 0	l ≥ 0
("pie", 1, -1)
("xy", 2, 3)
("a", -1, 3)
("siva", 1, 2)

Code: substring(s, i, l)
Extract a substring of length l from index i of string s
Post: no exceptions

Consider feature set:

{ i ≥ 0, l ≥ 0 }

No separator exists
over the given features!

Conflict: Same feature vector for a positive and a negative test

PIE

Conflict Resolution

Test case	Feature Set
(s, i, l)	i ≥ 0	l ≥ 0	f
("pie", 1, -1)
("xy", 2, 3)
("a", -1, 3)
("siva", 1, 2)

A new feature must be added to resolve a conflict

A new feature never creates new conflicts

Feature Synthesis Spec
Input	Output
("xy", 2, 3)	false
("siva", 1, 2)	true

Grammar-based exhaustive enumeration in increasing order of expression size

Learned f : i + l ≤ s.length

Learned Pre : i ≥ 0 ∧ l ≥ 0 ∧ f

PIE

Architecture

PIE

Properties

Black-box use of synthesizer and Boolean function learner

Sufficiency and Necessity

Inferred preconditions are guaranteed to be both sufficient and necessary at least up to given test cases

Strong Convergence

If a sufficient and necessary precondition exists in the synthesizer’s grammar, then PIE will find it in finite time

PIE

PIE Verified PIE

CEGIS

Counterexample-Guided Inductive Synthesis
(A. Solar-Lezama. STTT ' 13)

Preserves strong convergence

Source code must be loop-free

LoopInvGen

Verifying Loopy Code

Given Pre, Code and Post,
validate {Pre} Code {Post}

Using Dijkstra’s weakest precondition,

DISPLAY$ \large\mathsf{Pre} \Rightarrow \mathbf{WP}(\mathsf{Code}, \mathsf{Post}) $DISPLAY

In presence of unbounded control flow, computing $\mathbf{WP}$ requires inductive invariants

  int square(n) {
      assume (n > 0);
      s = 1;
      k = r = 0;
      while(k != n) {
          r += s;
          s += 2;
          k++;
      }
      assert (r == n*n);
  }

A sufficient loop invariant: DISPLAY$ r = k^2 \wedge s = 2k+1 $DISPLAY

Require a loop invariant $\mathcal{I}$ such that

$\mathcal{I}$ holds at the entry to the loop
$\mathcal{I}$ is preserved after each iteration
$\mathcal{I} \Rightarrow \mathsf{Post}$ holds just after exiting the loop

Prior data-driven approaches required a fixed set of features to be provided a priori

LoopInvGen

Architecture

LoopInvGen: Loop invariant inference using precondition inference

inspired by HOLA (Dillig et al. OOPSLA ' 13),
which used logical abduction instead

assume $P$
while $G$ do $S$ done
assert $Q$

$ \mathcal{I} \wedge \neg G \Rightarrow Q $

$ \{ \, \mathcal{I} \wedge G \, \} \;\;S\;\; \{ \, \mathcal{I} \, \} $

$ P \Rightarrow \mathcal{I} $

PIE + LoopInvGen

Results & Impact

PIE: Learning specifications for OCaml libraries$^\star$
- Postconditions: emptiness, no exceptions
- Found incorrect documentation for 2 functions in AVLTree
- Generated sufficient and necessary preconditions for first-order functions in AVLTree, List, String

PIE: Adopted by several state-of-the-art tools
- Alive-Infer (D. Menendez et al. PLDI ' 17)
  - Used by LLVM developers to verify peep-hole optimizations
- CVC4 (H. Barbosa et al. FMCAD ' 19)
  - Used within the invariant-synthesis engine

LoopInvGen: Won Inv track of SyGuS-Comp 2017 and 2018
- Won in all three categories: (a) most tasks solved, (b) fastest on most tasks, (c) smallest solution for most tasks
- Runner up in SyGuS-Comp 2019

$^\star$ batteries, a widely used alternative to OCaml’s standard library

Overfitting in Synthesis

S. Padhi, T. Millstein, A. Nori, R. Sharma · CAV 2019 · UCLA logo Microsoft logo

How does the choice of the grammar
affect the performance of synthesizers?

Theoretical results & experiments
that demonstrate overfitting

Practical mitigation strategies
inspired by ML techniques

Overfitting

State of The Art

Grammars : $ 6 $ commonly used arithmetic grammars
Benchmarks : $ 180 $ invariant-synthesis tasks over integers
Tools : $ 5 $ participants from SyGuS-Comp'18

(Timeout = $ 30 $ mins per benchmark per tool per grammar)

With more expressive power, every tool fails on many benchmarks it could previously solve!

But is this performance degradation
simply because of larger search spaces?

Overfitting

On increasing expressiveness for LoopInvGen:	Increase	No Change
Increase in Examples Required $ \ \Rightarrow\ $ ________ in Synth. Time	$ 79 \,\% $	$ 6 \,\% $
Increase in Synth. Time $ \ \Rightarrow\ $ ________ in Examples Required	$ 27 \,\% $	$ 67 \,\% $

Synthesizers not only spend more time

searching for a function within a large space,
but also collecting more examples from the verifier

Machine Learning: Learned function closely fits the training data, but does not correctly generalize beyond it

Program Synthesis: Learned expression is consistent with the given examples, but does not satisfy the specification

How do we formalize overfitting in synthesis?
What can we do to mitigate it?

Overfitting

Theoretical Contributions

Formal definition of learnability for example-guided synthesis:

A specification $ \phi $ is $ m $-learnable by a synthesizer $ \mathcal{S} $ if there exists some sequence of $ m $ examples with which $ \mathcal{S} $ learns an expression $e$ s.t. $e \models \phi $

(weaker than ML, which requires learning from any $m$ i.i.d. examples)

Perfect learning is impossible even with this weaker notion:

No Free Lunch: Let $X$ and $Y$ be arbitrary domains, $\mathcal{E}$ be an arbitrary grammar, and $m \in \mathbb{Z}$ be an arbitrary number of examples s.t. $0 \leq m < |X|$. Then, either:

$\mathcal{E}$ captures only a small number$^\star$ of semantically distinct $X \to Y$ expressions

or, for every synthesizer $\mathcal{S}$, there exists a specification $\phi$ for an $X \to Y$ function that admits a solution in $\mathcal{E}$, but is not $m$-learnable by $\mathcal{S}$

The underlying cause for the non $m$-learnability is overfitting:

Given a grammar $\mathcal{E}$, a specification $\phi$, and a sequence $M$ of examples,

degree of overfitting $\Omega_\mathcal{E}(\phi, M)$ = $|\{ e \in \mathcal{E} : e \text{ is consistent with } M \;\wedge\; e \not\models \phi \}|$

$^\star$ The exact bounds for finite and infinite $X$ and $Y$ cases can be found in our CAV ' 19 paper

Overfitting

PLearn

A mitigation technique inspired by ensemble methods$^\star$

run several synthesizer instances each with a different grammar
provably lower overall degree of overfitting

Easy to implement thin wrapper
Agnostic to underlying synthesizer

Extremely resource intensive
Many redundant computations

$^\star$ T. G. Dietterich “Ensemble Methods in Machine Learning” MCS 2000

Overfitting

Hybrid Enumeration

A relation $ \lhd $ on $ \mathcal{E}_{1,\ldots,n} \times \mathbb{N} $ is said to be a well order
if $ \forall \mathcal{E}_1, \mathcal{E}_2, k_1, k_2 : [\mathcal{E}_1 \subseteq \mathcal{E}_2 \ \wedge \ k_1 < k_2] \Rightarrow (\mathcal{E}_1,k_1) \lhd (\mathcal{E}_2,k_2) $

Hybrid enumeration — an efficient algorithm for this 2D search over component-based$^\star$ grammars $\mathcal{E}_1, \ldots, \mathcal{E}_n$ using an arbitrary well order $ \lhd $

$^\star$ S. Jha et al. “Oracle-Guided Component-Based Program Synthesis” ICSE 2010

Overfitting

Key Results

Grammar	$ {\small \textbf{median}}\Big[\frac{\tau(\textbf{P})}{\tau(\textbf{H})}\Big] $	$ {\small \textbf{median}}\Big[\frac{\tau(\textbf{H})}{\tau(\textbf{L})}\Big] $
Equalities	$ 1.00 $	$ 1.00 $
Inequalities	$ 1.91 $	$ 1.04 $
Octagons	$ 2.84 $	$ 1.03 $
Polyhedra	$ 3.72 $	$ 1.01 $
Polynomials	$ 4.62 $	$ 1.00 $
Peano	$ 5.49 $	$ 0.97 $

(On $ 180 $ benchmarks with $ 30 $ mins timeout per benchmark per tool per grammar)

Hybrid enumeration is resilient to increasing expressiveness
- Approaches the ideal failure rate with PLearn

Negligible impact on total time $ \tau = $ wall-clock time $ \times $ # threads
- Approaches the original running time of LoopInvGen

LoopInvGen+HE solves $2$ benchmarks that no other synthesizer could in SyGuS-Comp 2018

Recent Work

Hanoi

PLDI 2020

Inferring representation invariants for verifying data-type implementations

Jura

(under review)

Automatically identifying data-frames and formulas within spreadsheets

Recent Work

Hanoi

A. Miltner, S. Padhi, T. Millstein, D. Walker · PLDI 2020 · Princeton logo UCLA logo

(* A signature for Set ADT *)

module type SET = sig
  type t
  val empty : t
  val insert : t -> int -> t
  val delete : t -> int -> t
  val lookup : t -> int -> bool
end

A specification for the Set ADT:

DISPLAY$ \begin{array}{rcl} (\phi\ s) & \triangleq & \forall i : \text{int} \\[0.125em] & & \neg (\mathtt{lookup}\ \mathtt{empty}\ i) \\[0.125em] & \wedge & \neg (\mathtt{lookup}\ (\mathtt{delete}\ s\ i)\ i) \\[0.125em] & \wedge & (\mathtt{lookup}\ (\mathtt{insert}\ s\ i)\ i) \end{array} $DISPLAY

(* An implementation for Set using List *)

module ListSet : SET = struct
  type t = int list

  let empty = []

  let rec lookup l x =
    match l with
    | [] -> false
    | hd :: tl -> (hd = x) || (lookup tl x)

  let insert l x =
    if (lookup l x) then l else (x :: l)

  let rec delete l x =
    match l with
    | [] -> []
    | hd :: tl -> if (hd = x) then tl
                  else (hd :: (delete tl x))
end

How do we automatically verify that the ListSet implementation satisfies $\phi$?

(* A representation invariant for ListSet *)

let rec rep_inv : int list -> bool = function
  | [] -> true
  | hd :: tl -> (not (lookup tl hd)) && (rep_inv tl)

Hanoi can handle arbitrary first-order modules
Work on higher-order modules is on going

The first data-driven approach that guarantees soundness and completeness

Recent Work

Jura

S. Padhi, D. Rouhana, O. Polozov, B. Zorn, et al. · Microsoft logo UCLA logo

How do we automatically
recover missing formulas?

Multiple tables, unclear bounds for fomulas
Header, subheaders, metadata …
Inconsistent formatting

(2) Example-based synthesis
within individual tables

Conclusion

Specifications and invariants are difficult to state accurately
- Learn from readily available program-related data
- Automatic learning and user-driven refinement

Require strong guarantees and learning at large scale
- Program synthesis techniques for formal reasoning
- Data-driven insights from machine learning

Program Synthesis/Repair

Recovering spreadsheet formulas
Automatic synthesis of small bug fixes

Proof Synthesis/Repair

Representation invariants for ADTs
Synthesis within proof assistants
Automatic proof maintainance

Connections with ML + HCI

Symbolic optimizations + enumeration
Better regularization in synthesis
Interaction models for guiding synthesis

Data-Driven Learning of

Invariants and Specifications

Software Reliability

Research Artifacts

Industrial Products

Verification

Research Focus

Overview of Projects

Specification Synthesis

PIE

FlashProfile

Invariant Synthesis

LoopInvGen

Hanoi

Overfitting in Synthesis

FlashProfile

Data Profiles

Key Challenges

Formalization

Tractable Approximation

Architecture

Results & Impact

Automated Feature Learning

PIE

LoopInvGen

Precondition Inference

Limitations of Static Analyses

Issue with Fixed Features

Conflict Resolution

Architecture

Properties

Sufficiency and Necessity

Strong Convergence

PIE Verified PIE

CEGIS

Verifying Loopy Code

Architecture

Results & Impact

Overfitting in Synthesis

State of The Art

Overfitting

Theoretical Contributions

PLearn

Hybrid Enumeration

Key Results

Recent Work

Hanoi

Jura

Hanoi

Jura

Conclusion

Program Synthesis/Repair

Proof Synthesis/Repair

Connections with ML + HCI

Links

Contact: padhi@cs.ucla.edu