Part 1: Writing a GCC plugin in Python

This post describes the implementation of pysmtgcc. See “GCC Translation Validation” for background information.

I chose to use Python for the plugin for a few reasons:

  • The main aim of this experiment was to get some experience translating GIMPLE to SMT, but I had not decided exactly what I wanted to do with it. Some of my ideas involved parsing output from disassembler, etc., which is much easier done in Python than in C++.
  • The Z3 Python API is convenient to use.
  • The gcc-python-plugin looked nifty, and I had planned for a long time to try it out.

This project later grew a bit more than I had planned1, so Python may not have been the right choice…

GIMPLE IR

The GCC middle-end IR is called “GIMPLE”, and you can see the IR by passing -fdump-* flags to the compiler. For example, -fdump-tree-all writes the IR to a file after each non-IPA GIMPLE pass. The IR is usually pretty-printed using a C-like syntax, but you can see the raw representation by adding “-raw” (i.e.,-fdump-tree-all-raw).

GIMPLE statements

The GIMPLE operations are documented in gimple.def, and the tree objects containing information in the GIMPLE statements (types, operands, etc.) are documented in tree.def.

There are many GIMPLE operations, but most encode optional information (branch prediction information, OpenMP attributes, etc.), so I only needed to handle

  • GIMPLE_ASSIGN – assignment
  • GIMPLE_CALL – function call
  • GIMPLE_COND – conditional jump
  • GIMPLE_LABEL – label (for switch statements)
  • GIMPLE_PHI – phi node
  • GIMPLE_RETURN – function return value
  • GIMPLE_SWITCH – multiway branch

GIMPLE_ASSIGN

GIMPLE_ASSIGN is the GIMPLE instruction doing all the work that is not related to control flow2. It represents an assignment, such as

a = b + c;

which in the raw form is written as

gimple_assign <plus_expr, a_5, b_2, c_4, NULL>

Load/store instructions are represented by a unary GIMPLE_ASSIGN where the input/output reference a variable or memory location. All other GIMPLE_ASSIGN statements must have SSA names or constants for the input/output.

Using gcc-python-plugin

Registering a pass

Creating and registering a pass is done as

class MyPass(gcc.GimplePass):
    def execute(self, fun):
        # Do something with fun.
        ...

ps = MyPass(name="mypass")
ps.register_after("cfg")

This creates a GIMPLE pass where the execute function is called once for each function in the translation unit.

It is also possible to register callback functions that are called for various events – pysmtgcc does this in plugin1.py as it must do work for each pass the compiler runs. See the gcc-python-plugin manual for details.

Iterating over the IR

The next step is iterating over the GIMPLE instructions in the IR. This is usually done as

for bb in fun.cfg.basic_blocks:
    for stmt in bb.gimple:
        # Do something with stmt.
        ...

I needed to iterate over the basic blocks in reverse post order, so I added support for this in my fork of gcc-python-plugin

for bb in fun.cfg.reverse_post_order:
   ...

GIMPLE statements

Each kind of GIMPLE statement is wrapped in a Python class. For example, GIMPLE_ASSIGN is represented as gcc.GimpleAssign where the operation and arguments are available as member variables.

Much of the work done in many passes consists of matching and inspecting instructions, which for pysmtgcc looks like

if isinstance(stmt, gcc.GimpleAssign):
    if stmt.exprcode == gcc.PlusExpr:
        if isinstance(stmt.lhs.type, gcc.IntegerType):
            # Handle integer addition.
            ...
        elif isinstance(stmt.lhs.type, gcc.RealType):
            # Handle floating point addition.
            ...
        else:
            ...
    elif stmt.exprcode == gcc.MultExpr:
        ...
    ...
elif isinstance(stmt, gcc.GimpleReturn):
    ...

Example – a spellchecker pass

The Python API makes it easy to prototype passes that iterate over the program to detect various problems. For example, the following pass3 implements a -Wall warning that complains about misspelled strings:

import gcc
import enchant

spellingdict = enchant.Dict("en_US")

def spellcheck_node(node, loc):
    if isinstance(node, gcc.StringCst):
        words = node.constant.split()
        for word in words:
            if not spellingdict.check(word):
                gcc.warning(
                    loc,
                    f"Possibly misspelt word in string constant: {word}",
                    gcc.Option("-Wall"),
                )

class SpellcheckingPass(gcc.GimplePass):
    def execute(self, fun):
        for bb in fun.cfg.basic_blocks:
            for stmt in bb.gimple:
                stmt.walk_tree(spellcheck_node, stmt.loc)

ps = SpellcheckingPass(name="spellchecker")
ps.register_after("cfg")

Note: execute is called for each function, so the spellchecker pass does only check strings found in functions – using the Python plugin to check global variables is much more annoying:

  • We could iterate over the global variables as
    for var in gcc.get_variables():
        # Do something with var.
    

    when handling the first function, but it does not work for translation units not having functions.

  • There is no convenient walk_tree function for variables, so we need to handle all cases ourselves. For example,
    char s1[] = "Hello worrld";
    char *s2 = "Helllo world";
    

    are different (s2 has an extra gcc.AddrExpr) so we must do something like

    for var in gcc.get_variables():
        if var.decl.initial:
            node = var.decl.initial
            if isinstance(node, gcc.AddrExpr):
                node = node.operand
            spellcheck_node(node, var.decl.location)
    

    Handling strings in structure/array initializers must (recursively) iterate over the elements.

Problems

Using gcc-python-plugin worked reasonably well, but there are problems:

  • Many instructions from tree.def are not fully implemented, so I had to extend gcc-python-plugin (or, in some cases, return “not implemented”).
  • The plugin crashes when trying to compile some files4.

So my conclusion is that gcc-python-plugin is good for doing quick experiments, but probably not the right choice for developing a product quality tool.


  1. The plan was only to implement a few instructions, only handle 32-bit integers, etc. But the primitive tool found a GCC bug, so it seemed like it could actually be useful if some more functionality was added. 

  2. Control flow is discussed in part 8 of this blog series. 

  3. Adapted from a gcc-python-plugin example pass. 

  4. Some of the crashes may be due to my modifications of gcc-python-plugin, but not all of them. 

Written on October 20, 2022