Using ML to analyze ASTs for Smart Contract Exploit Detection

Paper Summary: “A Novel Machine Learning-Based Analysis Model for Smart Contract Vulnerability”

Mar 28, 2022

Paper by: Yingjie Xu, Gengran Hu, Lin You, Chengtang Cao

Abstract:

Let’s use ML for smart contract vulnerability detection. First, we build an Abstract-Syntax-Tree on labelled datasets. Then we build it on a bunch of unlabeled datasets. We measure the structural similarity, turn that into a feature vector, and then use a KNN to classify vulnerabilities.

This paper was able to detect 8 types of vulnerabilities with accuracy, recall, and precision all over 90%, beating both existing non-ML tools and ML methods (SOTA).

Introduction:

Existing tools in smart contract vulnerability detection are:

Static Analysis: Data flow analysis, static symbolic execution.

Symbolic execution = normal execution, but with symbolic variables in place of concrete numbers (normal execution is the special case where every variable has a concrete value). These tools need lots of patterns to match against, so they are time-consuming and limited.

Dynamic Analysis: Fuzzing, Taint tracking.

Fuzzing = generating test inputs and running them through the program to find vulnerabilities. It requires expert knowledge and gets complicated, because the fuzzer has to interact with a blockchain environment.
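To make the fuzzing idea concrete, here is a minimal Python sketch. The toy transfer function, the 8-bit word size, and the invariant are all hypothetical stand-ins (nothing here comes from the paper); a real fuzzer would drive a contract deployed on a test chain, which is where the complexity comes from.

```python
import random

UINT8_MAX = 2**8 - 1  # tiny word size so the bug is easy to hit (assumption)

def transfer(balance: int, value: int) -> int:
    """Toy, buggy 'contract' function: subtracts without a bounds check."""
    return (balance - value) & UINT8_MAX  # wraps around like unsigned math

def fuzz(trials: int = 1000, seed: int = 0):
    """Throw random inputs at transfer() and return the first invariant violation."""
    rng = random.Random(seed)
    for _ in range(trials):
        balance = rng.randint(0, UINT8_MAX)
        value = rng.randint(0, UINT8_MAX)
        new_balance = transfer(balance, value)
        # Invariant: spending can never leave you with more than you started with.
        if new_balance > balance:
            return balance, value, new_balance
    return None

finding = fuzz()
print("violating input (balance, value, new_balance):", finding)
```

The invariant only breaks when value is larger than balance, so the fuzzer is effectively rediscovering the underflow by brute force.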

This paper takes a different approach:

We take the AST from the code, thus no need to execute the smart contracts. But we don’t have to pattern match like the static analysis. We extract features from the AST and then use a classification algorithm to classify the results. This process is much quicker than existing methods.

Other ML approaches take the opcodes and analyze those. Some draw features from the AST and CFG (control flow graph) and use ML to classify them. But those methods extract features manually, whereas this paper proposes an automatic way by looking at the shared child nodes.

We build ASTs from labelled vulnerability datasets & smart contracts in the wild, compare them, and then train models that reach 90%+ accuracy, recall, and precision, beating existing tools.

Background:

Here are 8 popular exploits:

Re-Entrancy: If a contract has a function that sends ETH to an address and it is left unprotected, an attacker contract can create a fallback function (which is triggered whenever ETH is deposited into that account) → This fallback function can just keep calling the withdraw function again before the first call has finished updating the balance, draining the contract
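The re-entrancy loop can be simulated in plain Python. This is only a sketch: the Bank and Attacker classes, the method names, and the amounts are hypothetical, and Python method calls stand in for ETH transfers and the fallback function.

```python
class Bank:
    """Toy vulnerable contract: pays out before updating the balance."""

    def __init__(self):
        self.balances = {}
        self.vault = 0  # total ETH held by the contract

    def deposit(self, account, amount):
        self.balances[account] = self.balances.get(account, 0) + amount
        self.vault += amount

    def withdraw(self, account):
        amount = self.balances.get(account, 0)
        if amount == 0 or self.vault < amount:
            return
        self.vault -= amount
        account.receive(self, amount)  # external call happens first...
        self.balances[account] = 0     # ...state is updated too late


class Attacker:
    """Toy attacker; receive() plays the role of the fallback function."""

    def __init__(self):
        self.stolen = 0

    def receive(self, bank, amount):
        self.stolen += amount
        bank.withdraw(self)  # re-enter while our balance is still non-zero


bank = Bank()
bank.deposit("victim", 9)   # someone else's funds
attacker = Attacker()
bank.deposit(attacker, 1)   # attacker only puts in 1
bank.withdraw(attacker)
print("attacker deposited 1, walked away with", attacker.stolen)
```

Moving the balance update above the external call (or adding a re-entrancy lock) stops the loop after the first payout.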

Arithmetic: Integer overflows and underflows can occur. In the example contract, when we’re checking balances[msg.sender] - _value → if balances[msg.sender] is 0 and _value is 1, the unsigned subtraction wraps around to a huge number (that’s bigger than 0), so the check passes and the attacker is able to grab money
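A quick way to see the wraparound is to mimic 256-bit unsigned subtraction in Python. This sketch assumes the check has the form balances[msg.sender] - _value >= 0; the helper name sub_uint256 is made up for illustration.

```python
UINT256_MOD = 2**256

def sub_uint256(a: int, b: int) -> int:
    """Unchecked unsigned subtraction, wrapping modulo 2**256."""
    return (a - b) % UINT256_MOD

balance, value = 0, 1
result = sub_uint256(balance, value)

# An unsigned result can never be negative, so this check always passes.
check_passes = result >= 0

print(result == 2**256 - 1)  # the "huge number"
print(check_passes)
```

Comparing the operands directly (balance >= value) avoids the problem, and Solidity 0.8 and later reverts on overflow and underflow by default.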

Access Control: When someone who shouldn’t can become an administrator and start screwing with the contract’s values. In the example contract, the trans function requires that you’re the owner, but the initContract function is public, so anyone can call it and become the owner

Denial of Service: When someone monkeys with the contract values and cripples the contract so that it can no longer function. In the example contract, someone can set an extremely high value for largestWinner and render the contract useless

Unchecked Low-Level Calls: Some low-level calls won’t revert the transaction when they fail. For example, if a contract that doesn’t accept ETH called the withdraw function, it would still change the balances & etherLeft variables, even though msg.sender.send failed
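Here is a small Python sketch of the difference between ignoring and checking the result of a failed send. The send stand-in, the state dictionary, and the account name are hypothetical; only the variable names balances and etherLeft come from the description above.

```python
def send(recipient_accepts_eth: bool) -> bool:
    """Stand-in for a low-level send: reports failure by returning False."""
    return recipient_accepts_eth

def withdraw_unchecked(state: dict, sender: str, amount: int, accepts_eth: bool) -> None:
    send(accepts_eth)                    # BUG: return value is ignored
    state["balances"][sender] -= amount  # bookkeeping changes even if send failed
    state["etherLeft"] -= amount

def withdraw_checked(state: dict, sender: str, amount: int, accepts_eth: bool) -> None:
    if not send(accepts_eth):            # check the result and bail out on failure
        raise RuntimeError("send failed, reverting")
    state["balances"][sender] -= amount
    state["etherLeft"] -= amount

buggy = {"balances": {"alice": 10}, "etherLeft": 100}
withdraw_unchecked(buggy, "alice", 10, accepts_eth=False)
print("unchecked:", buggy)  # balances changed although no ETH moved

safe = {"balances": {"alice": 10}, "etherLeft": 100}
try:
    withdraw_checked(safe, "alice", 10, accepts_eth=False)
except RuntimeError:
    pass  # the "transaction" reverted, so state is untouched
print("checked:  ", safe)
```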

Bad Randomness: Randomness that is actually predictable — if a seed was set, we can see it, and if we know the function that generates it, then we know the outcomes
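The predictability is easy to reproduce with any seeded generator. In this sketch, Python's random module stands in for the contract's generator, and draw_winner and public_seed are made-up names; the point is only that a known seed plus a known function gives a known outcome.

```python
import random

def draw_winner(seed: int, players: int) -> int:
    """Deterministic 'random' pick: same seed and same function, same result."""
    rng = random.Random(seed)
    return rng.randrange(players)

public_seed = 12345  # stand-in for a seed anyone can read (assumption)

contract_pick = draw_winner(public_seed, players=10)
attacker_guess = draw_winner(public_seed, players=10)  # attacker reruns the same code

print(contract_pick == attacker_guess)
```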

Front Running: People can pay higher gas fees in order to get their transactions processed first. We can also see everything in the mempool. Thus if someone submitted a solution to a puzzle, an attacker could copy the solution from the mempool and front-run them by paying a higher gas fee

Short Addresses: The EVM automatically pads incomplete values with extra 0’s. For example if to = 0xab0, and amount = 1000 → encoding = ab0001000. But if we drop the trailing 0 from the address: to = 0xab, and amount = 1000 → encoding = ab001000, which gets padded to ab0010000 and is read as to = 0xab0, amount = 010000. This increases the transfer amount
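The padding trick can be replayed with string operations. This sketch uses the same toy sizes as the numbers above (3-digit addresses, 6-digit amounts); real ABI encoding works on 32-byte words, and the encode and decode helpers are hypothetical.

```python
ADDR_DIGITS = 3    # toy address width (assumption)
AMOUNT_DIGITS = 6  # toy amount width (assumption)

def encode(to: str, amount: str) -> str:
    """Concatenate the fields without validating the address length (the bug)."""
    return to + amount.rjust(AMOUNT_DIGITS, "0")

def decode(payload: str):
    """Right-pad short payloads with zeros, then split at fixed offsets."""
    padded = payload.ljust(ADDR_DIGITS + AMOUNT_DIGITS, "0")
    return padded[:ADDR_DIGITS], padded[ADDR_DIGITS:]

honest = encode("ab0", "1000")
short = encode("ab", "1000")  # attacker drops the trailing 0 of the address

print(honest, decode(honest))
print(short, decode(short))
```

Both payloads decode to the same address, but the short one shifts the amount one digit to the left. Rejecting addresses of the wrong length before encoding closes the hole.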

Next, we talk about 2 existing static analysis tools:

SmartCheck: It’s a static analysis tool. Solidity source code → ANTLR + Solidity grammar → XML parse tree (it’s an intermediate representation (IR)). Then XPath patterns are matched against the IR to find vulnerabilities. But we need patterns to find the vulnerabilities, and since patterns aren’t perfect or sometimes don’t exist, we miss vulnerabilities

Oyente: Static analysis tool. It symbolically executes the EVM bytecode to find vulnerabilities [not explained how it works]
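The pattern-matching style of analysis can be sketched with Python's built-in ElementTree, which supports a subset of XPath. The XML layout, attribute names, and the single pattern below are invented for illustration; they are not SmartCheck's actual IR or rule set.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML intermediate representation of a contract.
IR = """
<contract name="Wallet">
  <function name="withdraw">
    <call member="send" checked="false"/>
  </function>
  <function name="deposit">
    <call member="transfer" checked="true"/>
  </function>
</contract>
"""

# One XPath pattern per known vulnerability (made-up rule).
PATTERNS = {
    "unchecked-send": ".//call[@member='send'][@checked='false']",
}

def scan(ir_xml: str, patterns: dict) -> list:
    """Return (pattern name, function name) for every function that matches."""
    root = ET.fromstring(ir_xml)
    findings = []
    for name, xpath in patterns.items():
        for fn in root.iter("function"):
            if fn.findall(xpath):
                findings.append((name, fn.get("name")))
    return findings

findings = scan(IR, PATTERNS)
print(findings)
```

Anything without an entry in PATTERNS goes unreported, which is exactly the limitation described above.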

Methodology:

First, we create ASTs of the contracts we want to analyze (B) and of known malicious smart contracts (A) → We extract feature vectors from the shared child nodes, and then label the vectors (by using other tools to label the original contracts of interest) → Then we pop them into our KNN & SGD models.

AST: It’s an abstract representation of the source code. It takes the structure of the code and converts it into a tree. ASTs have rich details about the contracts.

We then collect the dataset of malicious contracts, convert them into ASTs, then compare them against the contracts of interest → We find the shared child nodes between the ASTs.

We grab all the nodes from both ASTs, then find all the common nodes in both ASTs and slap them into a vector. Then find the number of common child-nodes in each AST and turn that into a vector.
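One possible reading of this step, sketched in Python. The tuple-based AST, the node type names, and the choice to count occurrences per shared node type are my assumptions, not the paper's exact procedure.

```python
from collections import Counter

# An AST node is (node_type, [children]); the node types are illustrative.
def node_counts(node) -> Counter:
    """Count how often each node type occurs in the tree."""
    node_type, children = node
    counts = Counter([node_type])
    for child in children:
        counts += node_counts(child)
    return counts

def shared_node_features(ast_a, ast_b):
    """Return the shared node types plus one count vector per AST."""
    counts_a, counts_b = node_counts(ast_a), node_counts(ast_b)
    common = sorted(set(counts_a) & set(counts_b))
    return common, [counts_a[t] for t in common], [counts_b[t] for t in common]

known_bad = ("FunctionDefinition", [
    ("FunctionCall", [("MemberAccess", [])]),
    ("Assignment", []),
])
target = ("FunctionDefinition", [
    ("IfStatement", [
        ("FunctionCall", [("MemberAccess", [])]),
        ("FunctionCall", []),
    ]),
])

common, vec_a, vec_b = shared_node_features(known_bad, target)
print(common)
print(vec_a, vec_b)
```

The resulting vectors are what would be labeled and handed to the classifier.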

Labeling is done by using Slither and Ethainter. For classification we use KNN (K-Nearest-Neighbors) and SGD (Stochastic Gradient Descent).
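For intuition, KNN fits in a few lines of plain Python. The feature vectors, labels, and k below are made up; this is a sketch of the general technique, not the paper's model or data, and the SGD classifier is not shown.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k=3):
    """Label a query vector by majority vote among its k nearest neighbors."""
    distances = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Made-up feature vectors (e.g. counts of shared AST nodes) and labels.
train_vectors = [[5, 1, 0], [4, 2, 0], [6, 1, 1], [0, 1, 4], [1, 0, 5], [0, 2, 6]]
train_labels = ["reentrancy"] * 3 + ["arithmetic"] * 3

pred_a = knn_predict(train_vectors, train_labels, [5, 2, 0])
pred_b = knn_predict(train_vectors, train_labels, [1, 1, 5])
print(pred_a, pred_b)
```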

Experiment:

The dataset came from SmartBugs, SolidiFI, and SmartBugs Wild. SmartBugs & SolidiFI come with pre-labeled vulnerabilities, while SmartBugs Wild does not, so we use the existing tools to label those contracts and use them as data.

We can see that the ML-based models beat existing methods, and KNN beats SGD.
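As a reminder of what the three reported metrics measure, here is a short Python sketch for the binary case (vulnerable = 1, safe = 0). The labels and predictions are invented for illustration and are not the paper's results.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for 0/1 labels.

    Assumes at least one predicted positive and one actual positive,
    otherwise the divisions below would fail.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),  # share of all predictions that are right
        "precision": tp / (tp + fp),         # share of flagged contracts that are truly vulnerable
        "recall": tp / (tp + fn),            # share of vulnerable contracts that were flagged
    }

y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up ground truth
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]  # made-up model output

metrics = binary_metrics(y_true, y_pred)
print(metrics)
```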

Related Work:

Conclusion:

This work introduces an automatic way of using ML to analyze ASTs to detect smart contract vulnerabilities. It was able to achieve over 90% on all metrics, which beats existing methods.

However, this paper could be improved by pinpointing the concrete problem in the code, i.e. which line of code the vulnerability occurs on. It could also be expanded to other languages and chains. Additionally, the approach needs a base set of malicious smart contracts → its quality and quantity will affect training performance.

If you want to find out more: Read the paper here!

Thanks for reading! I’m Dickson, currently working on Deus Ex Securitas, where we aspire to achieve superhuman level performance in smart contract exploit detection!