Grammars, ASTs, Linters, and Solhint Plugins

Solidity has become the de-facto standard language of smart contract programming, although this has been challenged over the years with various other languages such as Vyper. The EVM, itself being a unique target with it’s concepts of “gas” and other various builtins, has been very successful as a smart contract programming environment, and to this day still dominates the space of smart contract blockchains, though new paradigms such as Cosmos chains have mounted a valiant challenge.

At a compute level, the EVM itself is not unique. As a stack machine it is similar to other virtual machines, such as the Java Virtual Machine (JVM) and the Pascal p-code machine. However, due to the extra concepts such as gas, ether, the 256 bit word size, the “built-in” elliptic curve operations, and so on, a new language made more sense than twisting an existing language to serve this new target.

It’s not hard to see that Solidity itself is inspired by various languages; it has a syntax that looks like a fusion of JavaScript and C++; the multiple inheritance of contracts is certainly reminiscent of C++ though one could argue that this is only syntactic and that contract inheritance is more like JavaScript’s prototypal inheritance.

In contrast, Vyper is more inspired by the Python syntax, doesn’t have class inheritance, modifiers, function overloading, operator overloading, or even recursion. The intent behind this apparent minimalism is to reduce chances of any kind of exploits when writing Vyper contracts. So the hypothesis is that a smaller featureset equates to a lower probability of writing vulnerable code.

The JavaScript inspiration is not only present in Solidity itself but in a very large amount of ancillary tooling. It is perhaps the accessibility of JavaScript, its prevalence as a “first language” for many coders, and in general it’s already extremely high popularity, that has caused this explosion in secondary tools in JavaScript itself.

With the emergence of various new languages for this virtual machine, a new host of tooling had to emerge with it as well. The compilers are obviously the first to emerge, being the implementations of the languages themselves. They transform the source code into bytecode, specifically EVM bytecode. This EVM bytecode is then deployed to the EVM blockchain in question and a smart contract “account” is created.

However, as all programmers know that is the bare minimum that is required to use a language. Things like formatters, linters, syntax highlighters, editor plugins, and so on are all required to make development in a particular language a non-obscure task. For EVM development, tools like HardHat (JavaScript), Truffle (JavaScript), Brownie (Python), and so on have become popular, since they provided a kind of unit-testing framework that allowed smart contract developers to test their code quickly without deploying to a testnet blockchain like Gorli, as well as basic automation tooling to do various things like set up environments.

In this post I want to talk specifically about linters, and the most popular linter for Solidity called solhint. Note that most of what is described in this post applies equally to pretty much any other language, but obviously solhint itself is Solidity specific.

Note that linters is really “just the beginning” as far as tooling goes for programming languages. There are far more complex and sophisticated tools, things like static analyzers, correctness provers, verifiers, etc. Maybe I’ll be able to cover those in a future post :-).

Linters

The word “lint” in programming comes from the English word lint, which refers to those “short, fine fibers that separate from the surface of cloth or yarn, especially during processing.” Lint rollers are then devices used to remove the lint from articles of clothing, which I’m sure almost everyone has used at some point. But instead of saying “lint roll your code” we removed the “roll” and just said “lint your code”.

Linters themselves are tools that suggest changes to your code that almost always do not change the semantics of the code being written. The exceptions to this are things like closing an open connection at the end of a function (or before each possible exit point) which may in rare cases change program semantics (i.e, more unlikely to not crash due to leaking too much resources). They work exclusively on the source code, i.e they do not interact with the compiled code and have no knowledge of the machine code that eventually is emitted by the compiler or interpreter. However, note that the authors of the linter may indeed have some knowledge about that and some of their rules may be suggested due to improving generated code output (i.e, reducing gas costs, as in the case of Solidity).

Abstract Syntax Trees and Grammars

Source code itself is simply an array of bytes that represents text. That as a data structure is not great to work with. Therefore linters work on a more structured representation, called the syntax tree or abstract syntax tree, or AST for short. The AST is a tree structure which typically holds the “program node” in the root node, and each child represents a syntactic structure nested within the parent’s syntactic structure. Programs (and lots of natural language documents like essays and books) are inherently tree-like and hierarchical. As an example, take the following text below, which is a very short book:

My Awesome Book

My First Chapter

This is a book about syntactic structures. This is the second sentence in a paragraph. This is the third. The fourth. And now the fifth.

Second paragraph. I only have two sentences.

My Second Chapter

This is the second chapter. Second sentence of chapter 2. Hello there world. Sentence 4 here I am!

The “syntax tree” of the above book could look something like this:

My Awesome Book: Book Title
`-- Chapters
    |-- My First Chapter: Chapter
    |   |-- Paragraph 1: Paragraph
    |   |   |-- Sentence 1: Sentence
    |   |   |-- Sentence 2: Sentence
    |   |   |-- Sentence 3: Sentence
    |   |   |-- Sentence 4: Sentence
    |   |   `-- Sentence 5: Sentence
    |   `-- Paragraph 2: Paragraph
    |       |-- Sentence 1: Sentence
    |       `-- Sentence 2: Sentence
    `-- My Second Chapter: Chapter
        `-- Paragraph 1: Paragraph
            |-- Sentence 1: Sentence
            |-- Sentence 2: Sentence
            |-- Sentence 3: Sentence
            `-- Sentence 4: Sentence

The syntactic structures here are the book title, which every book starts with. Then the chapters. Then each chapter has some paragraphs. Each paragraph has sentences. And so on. One could go even further and define what is called a “grammar” for books. This grammar would specify the exact syntactic structure of a book. A commonly used form for specifying grammars for computer languages is called the Backus-Naur Form (BNF for short). Here’s what it could look like for books:

<book> := <book-title> <chapters> <index>
<chapters> := <chapter> | <chapter> <chapters>
<chapter> := <chapter-title> <chapter-content>
<chapter-content> := <paragraph> | <paragraph> <chapter-content>
<paragraph> := <sentence> | <sentence> <paragraph>
<sentence> := <words> "."
<words> := <word> | <word> <words>

We read grammars starting from the first rule - which is usually the “Program” rule for programming languages. It defines the syntactic structure of the highest level document being produced by the grammar. This is also known as the “start rule”. In this specific case, the document is a book. A book consists of a book title, some chapters, then an index.

When defining what chapters are, we say that chapters are either a single chapter, or a single chapter followed by more chapters. Notice the recursion being applied here - we use the <chapters> rule in the definition of the <chapters> rule itself. This is normal in BNF and in many programming languages.

And similarly for the rest of the rules.

Now, of course, this is just an example. Natural languages are far more fluid than programming languages, and there are many kinds of documents - books, essays, research papers, poems, articles, and so on and so forth.

Programs are more rigid by design - having any kind of ambiguity doesn’t bode well for computers. We need to tell them exactly what they should do. There have been attempts at using natural language to guide computers but they have not reached the succinctness of programming languages, but that’s besides the point :-).

Walking the AST

Lets look at what the syntax tree of a C-like language could look like.

int my_function(int arg0, bool arg1) {
    if (arg0 > 10) {
        return 1;
    }
    // answer to everything in the universe
    return 42;
}

The simple function above may have an AST that looks like the following:

FunctionDeclaration
|-- ReturnType: int
|-- Name: my_function
|-- InputParameters
|   |-- VariableDeclaration
|   |   |-- Name: arg0
|   |   `-- Type: int
|   `-- VariableDeclaration
|       |-- Name: arg1
|       `-- Type: bool
`-- Body
    `-- Statements
        |-- IfStatement
        |   |-- Condition
        |   |   `-- BinaryExpression
        |   |       |-- Operator: >
        |   |       |-- LeftExpression
        |   |       |   `-- Identifier
        |   |       |       `-- Name: arg0
        |   |       `-- Right
        |   |           `-- IntegerLiteral: 10
        |   `-- Body
        |       `-- ReturnStatement
        |           `-- Expression
        |               `-- IntegerLiteral: 1
        `-- ReturnStatement
            `-- Expression
                `-- IntegerLiteral: 42

From the above AST you may notice the level of detail we’ve went to to explain the source code as a structured tree. At the root of the tree is the FunctionDeclaration node - our source code consists purely of this function. Everything else is a child of or a property of this function declaration. You can see as properties the ReturnType, Name, InputParameters, and Body of the function.

What’s nice about this structure is now we can actually write a program that easily processes it. This program can do what is called “walking the AST” - it starts at a node and traverses the tree to get to the syntactic elements it cares about.

As a simple example, lets say we want to emit a warning whenever function input parameters don’t have a leading underscore character. This may be a style choice we want to enforce, to clearly delineate input parameters from other variables that may be declared in a function’s body.

The general loop may look something like this, in Pythonic pseudocode:

for node in program.AST():
    if node.kind() == 'FunctionDeclaration':
        for var_decl in node.input_parameters():
            if not var_decl.name().startswith('_'):
                lint_ctx.emit_warning(
                    node.ctx(),
                    f'function parameter "{var_decl.name()}" must start with an underscore')

The general principle is that you walk the AST, you look for the exact pattern of syntactic structures that you are trying to warn against, and when that pattern is matched, you emit an error from the linter framework.

Linters for Dynamically Typed Languages

Note that the above (highly simplified) example is for a statically typed language. In such languages, the compiler does what is called type checking, which checks that various operations performed in the program conform to a predefined set of sensible rules. Such rules include things like operations such as “+”, “-“, and so on can only be performed when both operands are of the same type (or types convertible to each other, as is common in C-like languages).

The family of dynamically typed languages do not necessarily have this restriction, however. JavaScript is notorious for this, where you can pretty much add any type to any other type. As such, due to the JavaScript “compiler” not being very strict about these things, other tools, such as linters and static analyzers, have had to do the heavy lifting.

This isn’t always easy, however. Suppose you have this expression in JavaScript:

let c = a + b;
console.log("c is:", c);

If you give this code alone to a static analyzer, it is impossible to deduce what the types of a and b are. Even if you wanted to enforce that they be the same type, it doesn’t narrow down the possibilities very much.

Languages that compile to JavaScript have solved this problem via type annotations - which is basically what statically typed languages do (they do things like type inference as well, however).

Let’s give an example with a bit more information:

let a = () => {
    if (someValue > 12) {
        const someOtherVal = await api.get();
        return pow(someOtherValue, 2);
    }
    return DEFAULT_VAL;
};
// ... a bunch of code ...

let c = a + b;
console.log("c is:", c);

This is a bit more complex, but the static analyzer has more to work with. Lets say the SA knows that pow returns a Number. It then infers that DEFAULT_VAL could only “sensibly” be a Number as well (note that this doesn’t have to be the case - you can return different types from a function in dynamically typed languages). Then when it gets to the declaration of c, it sees the expression as Number + Unknown. At this point, it can’t infer any further, since in JavaScript you could add a number to anything. It could emit a warning at this point, saying to make sure that b is a number or to silence the warning if you know what you’re doing.

All of this can be done by walking the syntax tree, similar to what we did in the previous section, but we’re doing a lot more heavy lifting here. The luxury of statically typed languages (which is what Solidity is) means that we don’t have to worry about shooting ourselves in the foot at runtime, and that all these kinds of errors can be catched at compile time.

Solhint

And now we get to solhint. solhint is a linter for Solidity and a kind of linter framework for it as well. The way it works is basically what was described two sections previously - it walks the AST for the Solidity contract, and upon a subtree of the AST matching a condition, it emits a warning to the user to fix the problem.

Let’s take a look at an actual Solidity AST. This is with the help of astexplorer.net which gives a really nice interactive view of the ASTs of many programming languages. The program in question is the following:

pragma solidity ^0.4.18;
contract SimpleStore {
    function set(uint _value) public {
        value = _value;
    }

    function get() public constant returns (uint) {
        return value;
    }

    uint value;
}

And here is the AST. I’ve represented it in YAML rather than JSON and I’ve removed the range keys to make it easier to digest. If you want to produce the AST for your contract, you can paste it into the text area on AST explorer and select “Solidity” in the dropdown:

type: SourceUnit
children:
- type: PragmaDirective
  name: solidity
  value: "^0.4.18"
- type: ContractDefinition
  name: SimpleStore
  baseContracts: []
  subNodes:
  - type: FunctionDefinition
    name: set
    parameters:
    - type: VariableDeclaration
      typeName:
        type: ElementaryTypeName
        name: uint
      name: _value
      storageLocation:
      isStateVar: false
      isIndexed: false
    returnParameters:
    body:
      type: Block
      statements:
      - type: ExpressionStatement
        expression:
          type: BinaryOperation
          operator: "="
          left:
            type: Identifier
            name: value
          right:
            type: Identifier
            name: _value
    visibility: public
    modifiers: []
    isConstructor: false
    stateMutability:
  - type: FunctionDefinition
    name: get
    parameters: []
    returnParameters:
    - type: VariableDeclaration
      typeName:
        type: ElementaryTypeName
        name: uint
      name:
      storageLocation:
      isStateVar: false
      isIndexed: false
    body:
      type: Block
      statements:
      - type: ReturnStatement
        expression:
          type: Identifier
          name: value
    visibility: public
    modifiers: []
    isConstructor: false
    stateMutability: constant
  - type: StateVariableDeclaration
    variables:
    - type: VariableDeclaration
      typeName:
        type: ElementaryTypeName
        name: uint
      name: value
      expression:
      visibility: default
      isStateVar: true
      isDeclaredConst: false
      isIndexed: false
    initialValue:
  kind: contract

Notice the level of detail in this AST - all of these details can be used in the linter in order to construct your rules. Lets implement the following rule using solhint: all storage variables must be prefixed with s_.

Looking at the AST above, we notice that the StateVariableDeclaration node is what represents the uint value storage variable. We also notice that the name field is the name of the variable itself, which in this case is creatively named value. Lets use this knowledge to now write the plugin that will enforce the naming style we have just proposed.

To learn more about the Solidity grammar, head on over to https://docs.soliditylang.org/en/latest/grammar.html to explore all the grammar in it’s gory details.

Writing Solhint Plugins

As mentioned in the introduction, with solhint, you can write plugins which are invoked by the solhint tool while walking the AST. The pseudocode for how this works is the following:

for node in ast:
    switch node.type():
        case StateVariableDeclaration:
            for plugin in plugins:
                if plugin.method() == 'StateVariableDeclaration':
                    plugin.run(node)

Essentially, walk the AST, and everytime we hit a node, we invoke a plugin if it specifies this kind of node as the thing it wants to analyze. This will make more sense when we see the structure of a plugin.

Below is a scaffolding of what it looks like. Recall that the rule that we want to implement is: “all storage variables must be prefixed with s_.”.

class StateVarsSUnderscore {
    constructor(reporter, config) {
        this.ruleId = 'state-vars-s-underscore'
        this.reporter = reporter
        this.config = config
    }

    StateVariableDeclaration(ctx) {
        // rule implementation goes here
    }
}

The constructor takes in a reporter object, which is what we’ll use to report errors or warnings through the linter, and a config object, which is the solhint configuration specified as a JSON object.

We also have a single method, named after the syntactic structure we want to check, which in this case is StateVariableDeclaration. This method takes in a ctx, which is the AST node itself represented as a JSON object. This means that in this case, we have a JSON object with the fields variables and initialValue. Now our task seems straightforward - extract all the name fields in the declaration and check if they start with an s_. If not, emit a warning.

And here is what the plugin definition looks like:

class StateVarsSUnderscore {
    // .. constructor here ..

    StateVariableDeclaration(ctx) {
        const { variables } = ctx;
        for (let i = 0; i < variables.length; i++) {
            if (!variables[i].name.startsWith('s_')) {
                this.reporter.error(
                    variables[i],
                    this.ruleId,
                    `State variables must be prefixed with "s_"`)
            }
        }
    }
}

To explain in plain English what is written in code - we extract the variables field from ctx, which holds the StateVariableDeclaration AST node. Then, for each variable, we check that the name starts with s_. If not, we emit an error, providing the exact AST node (which has references to line + column in the source code), our rule ID, and the error message.

Testing our Plugin

Lets test the plugin we wrote above to make sure it works. To do this we’ll need to create an NPM package (no need to publish it for now, it can stay on your local machine). Note that when NPM prompts for the package name, make sure to prefix it with solhint-plugin - I creatively named mine solhint-plugin-solhint-plugins, but you can choose whichever name you please.

$ cd ~/projects
$ mkdir solhint-plugins
$ cd solhint-plugins && npm init

After going through NPM’s prompts, create a folder structure that looks like this:

$ mkdir plugins
$ touch index.js plugins/StateVarsSUnderscore.js

In StateVarsSUnderscore.js paste the code we produced above, referenced below for ease of copy/pasting:

// StateVarsSUnderscore.js
class StateVarsSUnderscore {
    constructor(reporter, config) {
        this.ruleId = 'state-vars-s-underscore';
        this.reporter = reporter;
        this.config = config;
    }

    StateVariableDeclaration(ctx) {
        const { variables } = ctx;
        for (let i = 0; i < variables.length; i++) {
            if (!variables[i].name.startsWith('s_')) {
                this.reporter.error(
                    variables[i],
                    this.ruleId,
                    `State variables must be prefixed with "s_"`);
            }
        }
    }
}

module.exports = StateVarsSUnderscore;

Note the module.exports - that’s important to let the JavaScript module system know that this type is being exported. Save the above file, and in index.js, write the following:

// index.js
const StateVarsSUnderscore = require('./plugins/StateVarsSUnderscore.js')

module.exports = [
    StateVarsSUnderscore,
]

Now, this plugin is ready for use. We just need to create another project with our Solidity contracts and install solhint and the plugin we just created.

Create another directory to house your contracts and install the necessary packages:

$ cd ..
$ mkdir solidity-contracts && cd solidity-contracts && npm init
$ mkdir -p contracts/v0.4
# Make sure to paste the contents of your contract below
$ touch contracts/v0.4/SimpleStore.sol
$ npm install solhint
$ npm install ../solhint-plugins
$ touch .solhint.json

In .solhint.json you can start with the following:

{
	"extends": "solhint:recommended",
	"plugins": ["solhint-plugins"],
	"rules": {
        "solhint-plugins/state-vars-s-underscore": "warn"
	}
}

Note the plugins key in the JSON above. Make sure to specify the name of your plugin package excluding the solhint-plugin- prefix in order for the solhint run to work. In the rules section, you will notice that we specify solhint-plugins/state-vars-s-underscore to warn. This is our rule ID, just scoped to our solhint plugin.

Finally, in your package.json of the contracts package, add the following package script:

{
  "scripts": {
    "solhint": "solhint \"./src/v0.4/*.sol\""
  }
}

At this point you can run:

$ npm run solhint

And here’s the output I get:

> solidity-project@1.0.0 solhint
> solhint "./src/v0.4/*.sol"


./src/v0.4/MyContract.sol
   4:18  warning  Rule is set with explicit type [var/s: uint]  explicit-types
   8:45  warning  Rule is set with explicit type [var/s: uint]  explicit-types
  12:5   warning  Explicitly mark visibility of state           state-visibility
  12:5   warning  State variables must be prefixed with "s_"    solhint-plugins/state-vars-s-underscore
  12:5   warning  Rule is set with explicit type [var/s: uint]  explicit-types

✖ 5 problems (0 errors, 5 warnings)

You can see the warning of our custom plugin: State variables must be prefixed with "s_" solhint-plugins/state-vars-s-underscore. Cool! It worked!

Conclusion

In conclusion, I hope this was informative. In upcoming posts I’d love to explore more Solidity tools; perhaps static analyzers. Until then, farewell and take care!

Written on October 31, 2023