Avoid materializing initialization data at all CFG nodes

The current move/init analysis is elegant and illustrates the expressive power of datalog. However, this elegance comes at the cost of efficiency, as it requires instantiating at least one fact (`maybe_uninitialized` and/or `maybe_initialized`) for every move path at every point in the CFG. This becomes intractable for large bodies like the one in #181 which requires several billion initialization facts.

Using a dense relation store (like Souffle's binary radix trie) will alleviate some of the pain by reducing the memory required per fact, and I'm kind of doubtful that `rustc` will ever use a datalog engine to compute move errors. However, I'm still interested in expressing this part of the compiler efficiently in datalog, if only to see how far we can take the "everything is horn clauses" mantra.

The idea is to take advantage of the sparsity of the `path_accessed_at` relation and compute initializedness *inside a basic block* only when necessary. All intra-block CFG nodes have exactly one predecessor (the previous statement in the block), so their initializedness is purely a function of their immediate predecessor and any operations within that node. If we were to compute the cumulative effect of all nodes inside a basic block and do our fixpoint algorithm on blocks instead of individual nodes, we could use that information to compute the state for intrablock nodes on-demand.

This is not groundbreaking; in fact it's pretty dull. This is the first optimization discussed in any introduction to dataflow analysis. However, implementing it in datalog poses some new challenges that require (AFAIK) some non-standard extensions to overcome.

First, let's assume that we have some mapping from CFG nodes to their corresponding basic block and statement index (a `Location`), as well as the control-flow edges for each basic block and  the typical move/init input data in terms of those locations:

```prolog
.type Location = [ block: Block, stmt: unsigned ]

.decl node_loc(node: Node, loc: Location)
.input node_loc

.decl bb_edge(from: Block, to: Block)
.input bb_edge

.decl path_assigned_at(path: Path, loc: Location)
.decl path_moved_at(path: Path, loc: Location)
.decl path_accessed_at(path: Path, loc: Location)
.input path_accessed_at
.input path_assigned_at
.input path_moved_at
```

From there, we can compute the effect of each basic block on the initializedness of each move path in the program by looking for an assignment to (or move from) a path that comes after any other moves from (or assignments to) that path. Unfortunately , doing this without computing the full initializedness at each statement or negating the `path_assigned_at` and `path_accessed_at` relations (which are very sparse) requires either universal quantification within rule bodies or unbound variables within negated clauses, both of which are outside the realm of traditional datalog. Luckily, Soufflé has an extension called [aggregates](https://souffle-lang.github.io/aggregates#aggregates), which allow more complex functions on certain relations.

```prolog
.decl path_becomes_uninit_in_block(path: Path, bb: Block)

path_becomes_uninit_in_block(path, bb) :-
    path_moved_at(path, [bb, uninit_at]),
    count : { path_assigned_at(path, [bb, init_at]), init_at > uninit_at } = 0. // ∀i s.t. i > uninit_at (!path_assigned_at(path, [bb, i]))

.decl path_becomes_init_in_block(path: Path, bb: Block)

path_becomes_init_in_block(path, bb) :-
    path_assigned_at(path, [bb, init_at]),
    count : { path_moved_at(path, [bb, uninit_at]), uninit_at > init_at } = 0.
```

Here, we use the `count` aggregate to count the number of `path_assigned_at` relations for this move path in the same basic block that come after a `path_moved_from` relation. If none exist (`count = 0`), we know that the path will be moved from at the exit of the basic block.

Now that we have our block transfer functions, it's time to use them to compute the entry state for each block. A path is initialized at block exit if it is initialized at block entry and not uninitialized inside the block, or if it is initialized inside the block. This is just your typical fixpoint algorithm.

```prolog
.decl path_maybe_init_at_block_exit(path: Path, bb: Block) inline

path_maybe_init_at_block_exit(path, bb) :-
    path_becomes_init_in_block(path, bb).

path_maybe_init_at_block_exit(path, bb) :-
    path_maybe_init_at_block_entry(path, bb),
    !path_becomes_uninit_in_block(path, bb).


.decl path_maybe_uninit_at_block_exit(path: Path, bb: Block) inline

path_maybe_uninit_at_block_exit(path, bb) :-
    path_becomes_uninit_in_block(path, bb).

path_maybe_uninit_at_block_exit(path, bb) :-
    path_maybe_uninit_at_block_entry(path, bb),
    !path_becomes_uninit_in_block(path, bb).


.decl path_maybe_init_at_block_entry(path: Path, bb: Block)

path_maybe_init_at_block_entry(path, bb) :-
    path_maybe_init_at_block_exit(path, pred),
    bb_edge(pred, bb).

.decl path_maybe_uninit_at_block_entry(path: Path, bb: Block)

path_maybe_uninit_at_block_entry(path, bb) :-
    path_maybe_uninit_at_block_exit(path, pred),
    bb_edge(pred, bb).
```

Finally, we can use this data to compute the intrablock initialization data (using the same `count` trick) and move errors. Note the `inline` on the `path_maybe_uninit_at_location` relation. That means that we don't compute the relation at every point, only the ones deemed significant by `move_error` (the predecessors of `path_accessed_at` points). In fact, that relation *must* be inlined into a rule that constrains `stmt`, since it is not grounded otherwise.

```prolog
.decl path_maybe_uninit_at_location(path: Path, loc: Location) inline

path_maybe_uninit_at_location(path, [bb, stmt]) :-
    path_maybe_uninit_at_block_entry(path, bb),
    count : { path_assigned_at(path, [bb, init_at]), init_at <= stmt } = 0.

path_maybe_uninit_at_location(path, [bb, stmt]) :-
    path_moved_at(path, [bb, uninit_at]),
    uninit_at <= stmt,
    count : { path_assigned_at(path, [bb, init_at]), init_at > uninit_at, init_at <= stmt} = 0.


.decl move_error(path: Path, node: Node)
.output move_error

move_error(path, node) :-
    path_accessed_at(path, node),
    cfg_edge(pred, node),
    node_loc(pred, pred_loc),
    path_maybe_uninit_at_location(path, pred_loc).
```

Unfortunately, this doesn't compile in Soufflé today due to shortcomings around aggregates and how they interact with inlining and record types. However, it can be made to work by manually inlining `path_maybe_uninit_at_location` into `move_error` using a disjunction and splitting `Location` into its subfields in all the input relations.

Phew. With that, we've emulated a simple dataflow optimization in datalog, and avoided storing initializedness data at all points in the CFG. However, I've not tested to see whether this method is actually faster than the naive one; I don't know how optimized the `count` queries will be. It will be fast in `datafrog`, since each `count` is just a range query on a static `Relation`.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Avoid materializing initialization data at all CFG nodes #182

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Avoid materializing initialization data at all CFG nodes #182

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions