Fun with Esprima and Static Analysis

Writing programs to do static analysis on Javascript is easier now than ever, thanks to Esprima - the Javascript parser library. In this post, I'll demonstrate how to use Esprima to do something fairly simple - detect accidentally leaked global variables in a program.

Accidentally leaked global variables happen when a value is assigned to an identifier without declaring the identifier as a variable first. For example, this g = 1a accidentally makes g a global, but var g = 1; and var g; g = 1; do not.

To accomplish the task at hand we need to do two things

  1. Find all assignment statements in the program
  2. For each of those assignments, find out if the assign variable has already been defined within the scope

But before we do all that, let's start with the basics.

The AST

Esprima's job is to turn a Javascript program into an AST - Abstract Syntax Tree. An AST is a tree representation of the source code that can be inspected programmatically. For example the AST for this program

6 * 7

Would be something like

AST Example

The output you get from esprima is a bit more detailed, but is nothing more than a nested Javascript object

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "BinaryExpression",
                "operator": "*",
                "left": {
                    "type": "Literal",
                    "value": 6,
                    "raw": "6"
                },
                "right": {
                    "type": "Literal",
                    "value": 7,
                    "raw": "7"
                }
            }
        }
    ]
}

To get an even better idea, I recommend you go play with the Esprima online parse for a bit. When you type in a valid Javascript program on the left, and you'll immediately see its AST on the right.

Traversing the AST

Back to the task at hand. If we want to find all the assignment statements, we'll need to be able to traverse the AST. AST traversing is slightly tricky because there isn't a unified interface for getting the children for a given node. For example, an VariableDeclaration has nested children nodes within declarations while an IfStatement has nested children within its consequent and its alternate. Moreover, some properties contain objects, some arrays, others contain strings.

Thanks to the power of Node modules, we'll just use the Estraverse module to do that for us. To use Estraverse:

estraverse.traverse(ast, {
  enter: function(node){
    console.log('Entered node', node.type);
  }
});

Find the Assignments

Now that we can traverse an AST, we need to know what to look for. What does an assignment look like? To find out, put into the online parser "g = 1" and we find that we just have to find all nodes of type AssignmentExpression, and then get at its left.name. Let's try this.

First, let's create a new directory for this project, and install both Esprima and Estraverse.

mkdir esprima_fun
cd esprima_fun
npm install esprima
npm install estraverse

Now, create a assignments.js and write

var fs = require('fs');
var esprima = require('esprima');
var estraverse = require('estraverse');

var filename = process.argv[2];
console.log('Processing', filename);
var ast = esprima.parse(fs.readFileSync(filename));
estraverse.traverse(ast, {
  enter: function(node){
    if (node.type === 'AssignmentExpression'){
      console.log('Encountered assignment to', node.left.name);
    }
  }
});

This program will take the input file name as its a parameter and report for each assignment in the program, which identifier was assigned to. As a test, make a file example1.js

g = 1;
f = 2;
var h = 1, i;

Run node assignments.js example1.js and the output should be

Processing example1.js
Encountered assignment to g
Encountered assignment to f

Full Source

Javascript Scoping 101

The next step is to find out whether a given identifier being assigned to has been defined as a variable within that scope. But first, let's revisit how variable scoping works in Javascript. In Javascript, other than the global scope - the default variable scope for code that doesn't belong to any function - the only thing that creates a new scope is a function.

var a = 1, b = 2; // a and b are in the global scope
function f(){
  var c;  // c is in the scope of f, 
          // which is a child of the global scope
  (function g(){
    var d = 'yo'; // d is in the scope
                  // of g, 
                  // which is a child of the scope of f
  }());
}

Here's a way to visualize the scope chain above

Scope Chain

Within the scope of g(), all 4 variables a, b, c, and d are defined, because each of them either live in the scope of g() or its parent or ancestor scope. Within f(), only a, b, and c are defined, and then in the global scope only a and b are defined.

Reverse-Engineering the Scope Chain

In order to figure out whether an identifier is defined within a particular scope, we need to not only collect all variables declared within the scope itself, but also of all its ancestors. In order to do this, we'll need to maintain our own scope chain as we traverse down the AST. This will be fun.

First, write a function that tells us whether a given node creates a new scope. We are interested in three types of nodes

  1. FunctionDeclarations - such as function f(){...}
  2. FunctionExpressions - such as var f = function(){...};
  3. the top level program, which always has type of Program.
function createsNewScope(node){
  return node.type === 'FunctionDeclaration' ||
    node.type === 'FunctionExpression' ||
    node.type === 'Program';
}

We'll use an array as a queue for storing the scope chain.

var scopeChain = [];

As we traverse each node, in the enter function, we add a new scope - for which we are just going to represent as an array of variable names - to the scope chain whenever we encounter a scope-creating node.

if (createsNewScope(node)){
  scopeChain.push([]);
}

When we see a variable declarator, we'll push it into the current scope

if (node.type === 'VariableDeclarator'){
  var currentScope = scopeChain[scopeChain.length - 1];
  currentScope.push(node.id.name);
}

Next, using the leave hook of Estraverse - called when all the children of a node have been traversed - we can find out what variables have been declared within that scope just by printing it out. We also dispose of the current scope using array's pop() method on the scope chain.

function leave(node){
  if (createsNewScope(node)){
    var currentScope = scopeChain.pop();
    console.log(currentScope);
  }
}

But we can make the output give more context here if we also print out information about the node. So let's instead write a printScope() function and substitute printScope(currentScope, node) for console.log(currentScope)

function printScope(scope, node){
  var varsDisplay = scope.join(', ');
  if (node.type === 'Program'){
    console.log('Variables declared in the global scope:', 
      varsDisplay);
  }else{
    if (node.id && node.id.name){
      console.log('Variables declared in the function ' + node.id.name + '():',
        varsDisplay);
    }else{
      console.log('Variables declared in anonymous function:',
        varsDisplay);
    }
  }
}

Now, if you run that script on this input program:

var a = 1, b = 2; // a and b are in the global scope
function f(){
  var c;  // c is in the scope of f, 
          // which is a child of the global scope
  (function g(){
    var d = 'yo'; // d is in the scope
                  // of g, 
                  // which is a child of the scope of f
  }());
}

You should get the output of

Variables declared in the function g(): d
Variables declared in the function f(): c
Variables declared in the global scope: a, b

Notice that the output starts with the innermost scope and ends with the outermost scope. This is because we finish collecting the variable declarations and print them out on leaving a node, rather than entering it.

Full Source

Putting it All Together

Now that we have the scope chain reverse-engineered, we can query it to see whether an variable was define when it was assigned a value. We'll make a function isVarDefined() like so

function isVarDefined(varname, scopeChain){
  for (var i = 0; i < scopeChain.length; i++){
    var scope = scopeChain[i];
    if (scope.indexOf(varname) !== -1){
      return true;
    }
  }
  return false;
}

After this, it's a matter of collecting all the assignment statements within the current scope in an array, and then checking all of them for defined-ness.

We'll need an array to collect all the assignments done within the current scope, because the check must happen on leaving a node - the point when the scope chain fully materializes.

var assignments = [];

Inside the enter function, we collect all the assignment expressions nodes into it

if (node.type === 'AssignmentExpression'){
  assignments.push(node.left.name);
}

Then in the leave function, we check for any global variable leaks

if (createsNewScope(node)){
  checkForLeaks(assns, scopeChain);
}

where checkForLeaks() is

function checkForLeaks(assignments, scopeChain){
  for (var i = 0; i < assignments.length; i++){
    if (!isVarDefined(assignments[i], scopeChain)){
      console.log('Detected leaked global variable:', assignments[i]);
    }
  }
}

If you run the it against the following input program

e = 1;
f = 2;
var h = 1, f;
function m(){
  k = 2;
}

You should get

Detected leaked global variable: e
Detected leaked global variable: k

Full Source

Line Numbers

But, wait, there's more! Esprima can also give location information about each node in the AST, so that we can output the line number for each of the leaks we found. We just turn on this feature by passing in {loc: true} as the second argument to esprima.parse(). See the docs for all esprima options.

Next we need to modify the code to keep track of all the leaking nodes so that we can actually get at their locations when we need to. So we change the line

assns.push(node.left.name);

to

assns.push(node);

And then we modify checkForLeaks() to print out the line information:

function checkForLeaks(assignments, scopeChain){
  for (var i = 0; i < assignments.length; i++){
    var assignment = assignments[i];
    var varname = assignment.left.name;
    if (!isVarDefined(varname, scopeChain)){
      console.log('Leaked global', varname, 
        'on line', assignment.loc.start.line);
    }
  }
}

Full Source

Homework

There are some additional problems with our code still:

  1. a function statement implicitly declares a variable - these are currently not tracked
  2. a function's parameters all become variables within the function's scope - these are currently not tracked

Your assignment is to make the global variable detector work with these two cases as well. To submit your homework, put your solution on Github and post it in the comments.

Extra Credit!

I can tell that you can't get enough of this static analysis stuff!! Here are a couple more problems to satisfy that appetite

  1. detect unused variables in a program.
  2. detect referenced-but-undeclared globals - these would be globals variables that are used by the program but not declared by the program. This is not necessarily a problem, the DOM API exposes itself via global variables such as window and document, for example, but tracking these sometimes can help find typos.

As with the homework, please post your solutions in the comments.

blog comments powered by Disqus