Rust Performance Guide for Hayabusa Developers¶

Author¶

Fukusuke Takahashi

English translation¶

About this document¶

Hayabusa (English: "peregrine falcon") is a fast forensics analysis tool developed by the Yamato Security group in Japan. It is written in Rust so that it can hunt threats as fast as a peregrine falcon. Rust is a fast language in itself, but there are many pitfalls that can result in slow speeds and high memory usage. We created this document based on actual performance improvements in Hayabusa (see the changelog here), but these techniques should be applicable to other Rust programs as well. We hope you can benefit from the knowledge we have gained through trial and error.

Speed improvement¶

Change the memory allocator¶

Simply changing the default memory allocator may improve speed significantly. For example, according to these benchmarks, the following two memory allocators

are much faster than the default memory allocator. We were able to confirm a significant speed improvement by changing our memory allocator from jemalloc to mimalloc, so mimalloc has been the default since version 1.8.0. (mimalloc does, however, use slightly more memory than jemalloc.)

Before ¶

# Not applicable. (You do not need to declare anything to use the default memory allocator.)

After ¶

You only need to perform the following 2 steps in order to change the global memory allocator:

Add the mimalloc crate to the Cargo.toml file's [dependencies] section:

[dependencies]
mimalloc = { version = "*", default-features = false }

Declare mimalloc as the global allocator with #[global_allocator] somewhere in the program:
```
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;
```
That is all you need to do to change the memory allocator.

Effectiveness (Real example from a Pull Request) ¶

How much speed improves will depend on the program, but in the following example

chg: build.rs(for vc runtime) to rustflags in config.toml and replace default global memory allocator with mimalloc. #777

changing the memory allocator to mimalloc resulted in a 20-30% performance increase on Intel CPUs. (For some reason, the performance increase was not as significant on ARM-based macOS devices.)

Reduce IO processing in loops¶

Disk IO processing is much slower than processing in memory. Therefore, it is desirable to avoid IO processing as much as possible, especially in loops.

Before ¶

The example below shows a file being opened and read one million times in a loop:

use std::fs;

fn main() {
    for _ in 0..1000000 {
        let f = fs::read_to_string("sample.txt").unwrap();
        f.len();
    }
}

After ¶

By opening the file outside of the loop as follows

use std::fs;

fn main() {
    let f = fs::read_to_string("sample.txt").unwrap();
    for _ in 0..1000000 {
        f.len();
    }
}

there will be about a 1000 times speed increase.
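A rough way to check a figure like this on your own machine is to time both versions with std::time::Instant. The following is only a sketch: the helper names and the scratch file name are hypothetical, and the measured ratio will depend on the OS file cache and the file size.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, Instant};

/// Reads the file on every iteration (the "Before" pattern).
fn total_len_read_in_loop(path: &Path, iterations: usize) -> io::Result<usize> {
    let mut total = 0;
    for _ in 0..iterations {
        total += fs::read_to_string(path)?.len();
    }
    Ok(total)
}

/// Reads the file once, before the loop (the "After" pattern).
fn total_len_read_once(path: &Path, iterations: usize) -> io::Result<usize> {
    let contents = fs::read_to_string(path)?;
    let mut total = 0;
    for _ in 0..iterations {
        total += contents.len();
    }
    Ok(total)
}

/// Runs `f` once and returns its result together with the elapsed wall-clock time.
fn timed<T>(f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed())
}

fn main() -> io::Result<()> {
    // Hypothetical scratch file in the OS temp directory.
    let path = std::env::temp_dir().join("io_in_loop_sample.txt");
    fs::write(&path, "hello world!\n".repeat(1_000))?;

    let (slow, slow_time) = timed(|| total_len_read_in_loop(&path, 1_000));
    let (fast, fast_time) = timed(|| total_len_read_once(&path, 1_000));

    // Both versions must compute the same value; only the cost differs.
    assert_eq!(slow?, fast?);
    println!("read in loop: {slow_time:?}, read once: {fast_time:?}");

    fs::remove_file(&path)?;
    Ok(())
}
```

Build with --release when comparing timings, since debug builds can distort the ratio.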

Effectiveness (Real example from a Pull Request) ¶

In the following example, IO processing that had been performed for each detection result was moved outside of the loop:

Improve speed by removing IO process before insert_message() #858

This resulted in a speed improvement of about 20%.

Avoid regular expression compilation in loops¶

Regular expression compilation is a very costly process compared to regular expression matching. Therefore, it is advisable to avoid regular expression compilation as much as possible, especially in loops.

Before ¶

For example, the following code compiles the same regular expression 100,000 times in a loop:

extern crate regex;
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    for _ in 0..100000 {
        if Regex::new(match_str).unwrap().is_match(text){ // Regular expression compilation in a loop
            println!("matched!");
        }
    }
}

After ¶

By doing a regular expression compilation outside the loop, as shown below

extern crate regex;
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    let r = Regex::new(match_str).unwrap(); // Compile the regular expression outside the loop
    for _ in 0..100000 {
        if r.is_match(text) {
            println!("matched!");
        }
    }
}

the updated code is about 100 times faster.

Effectiveness (Real example from a Pull Request) ¶

In the following example, regular expression compilation is performed outside the loop and cached.

cache regex for allowlist and regexes keyword. #174

This resulted in significant speed improvements.
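When the patterns are only known at runtime, the compiled form can be cached in a map keyed by the pattern string. The sketch below shows the idea with the standard library only. Compiled is a hypothetical stand-in for regex::Regex (its compile() plays the role of Regex::new()), and PatternCache is an illustrative name, not Hayabusa's actual implementation.

```rust
use std::collections::HashMap;

/// Stand-in for a compiled regular expression (hypothetical; in real code
/// this would be `regex::Regex`, and `compile` would be `Regex::new`).
struct Compiled {
    needle: String,
}

impl Compiled {
    fn compile(pattern: &str) -> Compiled {
        Compiled { needle: pattern.to_string() }
    }

    fn is_match(&self, text: &str) -> bool {
        text.contains(self.needle.as_str())
    }
}

/// Compiles each distinct pattern once and reuses it afterwards.
#[derive(Default)]
struct PatternCache {
    compiled: HashMap<String, Compiled>,
    compile_count: usize,
}

impl PatternCache {
    fn is_match(&mut self, pattern: &str, text: &str) -> bool {
        if !self.compiled.contains_key(pattern) {
            // Cache miss: pay the compilation cost exactly once per pattern.
            self.compile_count += 1;
            self.compiled
                .insert(pattern.to_string(), Compiled::compile(pattern));
        }
        self.compiled[pattern].is_match(text)
    }
}

fn main() {
    let mut cache = PatternCache::default();
    let mut hits = 0;
    for _ in 0..100_000 {
        if cache.is_match("456", "1234567890") {
            hits += 1;
        }
    }
    println!("hits: {hits}, compilations: {}", cache.compile_count);
}
```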

Use buffered IO¶

Without buffering, file IO is slow because every read or write goes to the OS. With buffered IO, operations are performed through buffers in memory, reducing the number of system calls and improving speed.

Before ¶

For example, in the following process, write occurs 1,000,000 times.

use std::fs::File;
use std::io::Write;

fn main() {
    let mut f = File::create("sample.txt").unwrap();
    for _ in 0..1000000 {
        // Each call is passed to the OS as a separate write
        f.write_all(b"hello world!").unwrap();
    }
}

After ¶

By using BufWriter as follows

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() {
    let f = File::create("sample.txt").unwrap();
    let mut writer = BufWriter::new(f);
    for _ in 0..1000000 {
        // Writes are collected in memory and passed to the OS in larger chunks
        writer.write_all(b"hello world!").unwrap();
    }
    writer.flush().unwrap();
}

there is about a 50 times speed improvement.
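To see where the gain comes from, the number of writes that reach the underlying writer can be counted. The sketch below replaces the file with a hypothetical CountingSink, where each write() call stands in for one system call. The exact count in the buffered case depends on BufWriter's buffer size, so only the relative difference is meaningful.

```rust
use std::io::{self, BufWriter, Write};

/// A sink that discards data but counts how often `write` is called.
/// Each call stands in for one system call on a real file.
#[derive(Default)]
struct CountingSink {
    write_calls: usize,
    bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.write_calls += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Writes the same small record `n` times, then flushes.
fn write_records<W: Write>(mut out: W, n: usize) -> io::Result<()> {
    for _ in 0..n {
        out.write_all(b"hello world!")?;
    }
    out.flush()
}

/// Every record reaches the sink as its own `write` call.
fn count_unbuffered(n: usize) -> io::Result<CountingSink> {
    let mut sink = CountingSink::default();
    write_records(&mut sink, n)?;
    Ok(sink)
}

/// Records are collected by `BufWriter` and reach the sink in larger chunks.
fn count_buffered(n: usize) -> io::Result<CountingSink> {
    let mut sink = CountingSink::default();
    write_records(BufWriter::new(&mut sink), n)?;
    Ok(sink)
}

fn main() -> io::Result<()> {
    let unbuffered = count_unbuffered(1_000_000)?;
    let buffered = count_buffered(1_000_000)?;
    println!(
        "unbuffered: {} writes, buffered: {} writes ({} bytes each way)",
        unbuffered.write_calls, buffered.write_calls, buffered.bytes
    );
    Ok(())
}
```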

Effectiveness (Real example from a Pull Request) ¶

The method described above was implemented here

Feature/improve output#253 #285

and has resulted in significant speed improvements in output processing.

Use standard String methods instead of regular expressions¶

While regular expressions can cover complex matching patterns, they are slower than standard String methods. Therefore, it is faster to use standard String methods for simple string matching such as the following. (Regex::is_match() searches anywhere in the text, so the equivalent patterns are anchored with ^ and $.)

Starts-with matching (Regex: ^foo.*) -> String::starts_with()
Ends-with matching (Regex: .*foo$) -> String::ends_with()
Contains matching (Regex: .*foo.*) -> String::contains()

Before ¶

For example, the following code performs ends-with matching in a regular expression one million times.

extern crate regex;
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = ".*abc$"; // Anchored with $ so that it only matches at the end of the text
    let r = Regex::new(match_str).unwrap();
    for _ in 0..1000000 {
        if r.is_match(text) {
            println!("matched!");
        }
    }
}

After ¶

By using String::ends_with() as follows

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    for _ in 0..1000000 {
        if text.ends_with(match_str) {
            println!("matched!");
        }
    }
}

processing will be 10 times faster.

Effectiveness (Real example from a Pull Request) ¶

Since Hayabusa requires case-insensitive string comparison, we use to_lowercase() and then apply the above method. Even then, in the following examples

speed has improved by about 15% compared to before.
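The case-insensitive approach described above could look roughly like the following sketch. The function names are hypothetical and this is not Hayabusa's actual code. The first function lowercases the text and then uses ends_with(). The second is an ASCII-only variant that compares bytes in place and avoids the temporary allocation made by to_lowercase().

```rust
/// Case-insensitive suffix check that lowercases the text on every call.
/// `suffix_lower` is assumed to have been lowercased once, outside any loop.
fn ends_with_lowercased(text: &str, suffix_lower: &str) -> bool {
    text.to_lowercase().ends_with(suffix_lower)
}

/// ASCII-only variant that compares bytes in place and does not allocate.
/// Non-ASCII bytes are compared exactly, without any case folding.
fn ends_with_ignore_ascii_case(text: &str, suffix: &str) -> bool {
    let (text, suffix) = (text.as_bytes(), suffix.as_bytes());
    text.len() >= suffix.len()
        && text[text.len() - suffix.len()..].eq_ignore_ascii_case(suffix)
}

fn main() {
    let text = "C:\\Windows\\System32\\CMD.EXE";
    // Lowercase the pattern once, before the loop.
    let suffix_lower = "\\CMD.exe".to_lowercase();

    let mut matched = 0;
    for _ in 0..1_000_000 {
        if ends_with_ignore_ascii_case(text, &suffix_lower) {
            matched += 1;
        }
    }
    println!("matched {matched} times");
    println!("allocating variant agrees: {}", ends_with_lowercased(text, &suffix_lower));
}
```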

Filter by string length¶

Depending on the characteristics of the strings being handled, adding a simple filter may reduce the number of string matching attempts and speed up the process. If the strings you compare vary in length and usually differ in length from the pattern, you can speed up the process by using string length as a primary filter.

Before ¶

For example, the following code attempts one million regular expression matches.

extern crate regex;
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    let r = Regex::new(match_str).unwrap();
    for _ in 0..1000000 {
        if r.is_match(text) {
            println!("matched!");
        }
    }
}

After ¶

By using String::len() as a primary filter, as shown below

extern crate regex;
use regex::Regex;

fn main() {
    let text = "1234567890";
    let match_str = "abc";
    let r = Regex::new(match_str).unwrap();
    for _ in 0..1000000 {
        if text.len() == match_str.len() { // Primary filter by string length
            if r.is_match(text) {
                println!("matched!");
            }
        }
    }
}

speed will improve by about 20 times.

Effectiveness (Real example from a Pull Request) ¶

In the following example, the above method is used.

Improving speed by adding string length match before regular expression match #883

This improved speed by about 15%.
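The effect of the filter can be made visible by counting how often the expensive comparison actually runs. In the sketch below, expensive_equals_ignore_case() is a hypothetical stand-in for a regular expression match, and the input is assumed to be ASCII, because lowercasing non-ASCII text can change its byte length and would make the length check invalid.

```rust
/// Stand-in for an expensive comparison (hypothetical; in the text above this
/// role is played by a regular expression match).
fn expensive_equals_ignore_case(text: &str, pattern_lower: &str) -> bool {
    text.to_lowercase() == pattern_lower
}

/// Returns (number of matches, number of expensive comparisons performed).
/// Assumes ASCII input, where lowercasing never changes the byte length.
fn count_matches(candidates: &[&str], pattern_lower: &str) -> (usize, usize) {
    let mut matches = 0;
    let mut expensive_calls = 0;
    for text in candidates {
        // Primary filter: a different length can never be an exact match.
        if text.len() != pattern_lower.len() {
            continue;
        }
        expensive_calls += 1;
        if expensive_equals_ignore_case(text, pattern_lower) {
            matches += 1;
        }
    }
    (matches, expensive_calls)
}

fn main() {
    let candidates = ["CMD.EXE", "powershell.exe", "cmd.exe", "net.exe", "rundll32.exe"];
    let (matches, expensive_calls) = count_matches(&candidates, "cmd.exe");
    println!(
        "{matches} matches, {expensive_calls} expensive comparisons for {} candidates",
        candidates.len()
    );
}
```

The filter only pays off when many candidates differ in length from the pattern. If most candidates have the same length, it adds a check without saving any work.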

Do not compile with codegen-units=1¶

Many articles on performance optimization with Rust advise adding codegen-units = 1 under the [profile.release] section. Because the default is to compile in parallel, this slows down compilation, but in theory it should produce more optimized and faster code. However, in our testing, Hayabusa actually ran slower with this option turned on and took longer to compile, so we leave it off. The executable is about 100 KB smaller, so the option may still be worthwhile for embedded systems where storage space is limited.

Reducing memory usage¶

Avoid unnecessary use of clone(), to_string(), and to_owned()¶

Using clone() or to_string() is an easy way to resolve compilation errors related to ownership. However, these calls copy data, so they usually increase memory usage and should be avoided where possible. It is always best to first see if you can replace them with low-cost references.

Before ¶

For example, if you want to iterate the same Vec multiple times, you can use clone() to eliminate compilation errors.

fn main() {
    let lst = vec![1, 2, 3];
    for x in lst.clone() { // In order to eliminate compile errors
        println!("{x}");
    }

    for x in lst {
        println!("{x}");
    }
}

After ¶

However, by using references as shown below, you can remove the need to use clone().

fn main() {
    let lst = vec![1, 2, 3];
    for x in &lst { // Eliminate compile errors with a reference
        println!("{x}");
    }

    for x in lst {
        println!("{x}");
    }
}

By removing the clone() usage, memory usage is reduced by up to 50%.
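The same idea applies to function parameters and return values. As a minimal sketch (the function names are hypothetical), a function that only reads its input can borrow a slice, and one that returns part of its input can return a borrowed &str instead of a cloned String:

```rust
/// Borrows the slice, so the caller keeps ownership and nothing is copied.
fn total_len(items: &[String]) -> usize {
    items.iter().map(|s| s.len()).sum()
}

/// Returns a borrowed `&str` instead of cloning the element into a new `String`.
fn longest(items: &[String]) -> Option<&str> {
    items.iter().map(String::as_str).max_by_key(|s| s.len())
}

fn main() {
    let names = vec![
        String::from("Security.evtx"),
        String::from("System.evtx"),
        String::from("Microsoft-Windows-Sysmon.evtx"),
    ];

    // `names` can be passed to several functions without clone(),
    // because each call only borrows it.
    println!("total length: {}", total_len(&names));
    println!("longest: {:?}", longest(&names));
    println!("still usable afterwards: {} items", names.len());
}
```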

Effectiveness (Real example from a Pull Request) ¶

In the following example, by replacing unnecessary clone(), to_string(), and to_owned() usage,

Reduce used memory and Skipped rule author, detect counts aggregation when --no-summary option is used #782

we were able to significantly reduce memory usage.

Use Iterator instead of Vec¶

Vec keeps all elements in memory, so it uses a lot of memory in proportion to the number of elements. If processing one element at a time is sufficient, then using an Iterator instead will use much less memory.

Before ¶

For example, the following return_lines() function reads a file of about 1 GB and returns a Vec:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn return_lines() -> Vec<String> {
    let f = File::open("sample.txt").unwrap();
    let buf = BufReader::new(f);
    buf.lines()
        .map(|l| l.expect("Could not parse line"))
        .collect()
}

fn main() {
    let lines = return_lines();
    for line in lines {
        println!("{}", line)
    }
}

After ¶

Instead, you should return a type that implements the Iterator trait, as follows:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn return_lines() -> impl Iterator<Item=String> {
    let f = File::open("sample.txt").unwrap();
    let buf = BufReader::new(f);
    buf.lines()
        .map(|l| l.expect("Could not parse line"))
        // Return the Iterator here instead of calling collect()
}

fn main() {
    let lines = return_lines();
    for line in lines {
        println!("{}", line)
    }
}

Or if the type is different depending on which branch is taken, you can return a Box<dyn Iterator<Item = T>> as follows:

use std::fs::File;
use std::io::{BufRead, BufReader};

fn return_lines(need_filter: bool) -> Box<dyn Iterator<Item = String>> {
    let f = File::open("sample.txt").unwrap();
    let buf = BufReader::new(f);
    if need_filter {
        let result = buf.lines()
            .filter_map(|l| l.ok())
            .map(|l| l.replace("A", "B"));
        return Box::new(result);
    }
    let result = buf.lines()
        .map(|l| l.expect("Could not parse line"));
    Box::new(result)
}

fn main() {
    let lines = return_lines(true);
    for line in lines {
        println!("{}", line)
    }
}

Memory usage drops significantly from 1 GB to only 3 MB.
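Because an iterator is lazy, a consumer that stops early or only aggregates never holds more than one line at a time. The sketch below makes the function generic over BufRead so that it can be tried with in-memory data. The names lines_lazy and count_containing are hypothetical, and unlike the examples above it stops silently at the first read error instead of panicking.

```rust
use std::io::{BufRead, Cursor};

/// Yields lines one at a time; nothing is read until the caller asks for it.
/// Iteration ends at the first read error (an assumption made for brevity).
fn lines_lazy<R: BufRead>(reader: R) -> impl Iterator<Item = String> {
    reader.lines().map_while(Result::ok)
}

/// Counts matching lines while holding only the current line in memory.
fn count_containing<R: BufRead>(reader: R, needle: &str) -> usize {
    lines_lazy(reader).filter(|line| line.contains(needle)).count()
}

fn main() {
    // In-memory stand-in for a large file.
    let data = "alpha\nbeta\ngamma\nalphabet\n";

    // Only the first two lines are ever read and allocated.
    let first_two: Vec<String> = lines_lazy(Cursor::new(data)).take(2).collect();
    println!("{first_two:?}");

    println!("lines containing \"alpha\": {}", count_containing(Cursor::new(data), "alpha"));
}
```

With a real file, pass a BufReader wrapping the File instead of the Cursor.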

Effectiveness (Real example from a Pull Request) ¶

The following example uses the method described above:

Reduce memory usage when reading JSONL file #921

When tested on a 1.7GB JSON file, memory decreased by 75%.

Use the compact_str crate when handling short strings¶

When dealing with a large number of short strings of up to 24 bytes (the inline capacity of CompactString on 64-bit targets), the compact_str crate can be used to reduce memory usage.

Before ¶

In the example below, the Vec holds 10 million strings.

fn main() {
    let v: Vec<String> = vec![String::from("ABCDEFGHIJKLMNOPQRSTUV"); 10000000];
    // do some kind of processing
}

After ¶

It is better to replace them with a CompactString:

use compact_str::CompactString;

fn main() {
    let v: Vec<CompactString> = vec![CompactString::from("ABCDEFGHIJKLMNOPQRSTUV"); 10000000];
    // do some kind of processing
}

By doing this, memory usage is reduced by around 50%.
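The saving can be estimated with the standard library alone. A String always occupies a fixed-size handle (pointer, capacity, length) plus a separate heap buffer for its contents, whereas compact_str documents CompactString as being the same size as String while keeping short contents inside that handle. The sketch below, with the hypothetical helper estimated_bytes(), adds up both parts for a Vec<String>. It ignores allocator overhead, so real usage will be somewhat higher.

```rust
use std::mem::size_of;

/// Rough memory estimate for a slice of `String`s: the fixed-size handles
/// plus the heap buffers they own. Allocator overhead is not included.
fn estimated_bytes(strings: &[String]) -> usize {
    let handles = strings.len() * size_of::<String>();
    let heap: usize = strings.iter().map(|s| s.capacity()).sum();
    handles + heap
}

fn main() {
    let v: Vec<String> = vec![String::from("ABCDEFGHIJKLMNOPQRSTUV"); 1_000];

    let handles_only = v.len() * size_of::<String>();
    let total = estimated_bytes(&v);

    println!("size_of::<String>() = {} bytes", size_of::<String>());
    println!("handles only:         {handles_only} bytes");
    println!("handles + heap:       {total} bytes");
    // An inline string type of the same handle size would need roughly
    // `handles_only` bytes, since the heap part disappears.
    println!("heap part:            {} bytes", total - handles_only);
}
```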

Effectiveness (Real example from a Pull Request) ¶

In the following example, short strings are handled with CompactString:

To reduce ram usage and performance, Replaced String with other crate #793

This reduced memory usage by about 20%.

Delete unnecessary fields in long-lived structures¶

Structures that are kept in memory for as long as the process is running can affect overall memory usage. In Hayabusa, the following structures (as of version 2.2.2), in particular, are retained in large numbers.

The removal of fields associated with the above structures had some effect on reducing overall memory usage.

Before ¶

For example, the DetectInfo struct was, until version 1.8.1, defined as follows:

#[derive(Debug, Clone)]
pub struct DetectInfo {
    pub rulepath: CompactString,
    pub ruletitle: CompactString,
    pub level: CompactString,
    pub computername: CompactString,
    pub eventid: CompactString,
    pub detail: CompactString,
    pub record_information: CompactString,
    pub ext_field: Vec<(CompactString, Profile)>,
    pub is_condition: bool,
}

After ¶

By deleting the record_information field as follows

#[derive(Debug, Clone)]
pub struct DetectInfo {
    pub rulepath: CompactString,
    pub ruletitle: CompactString,
    pub level: CompactString,
    pub computername: CompactString,
    pub eventid: CompactString,
    pub detail: CompactString,
    // remove record_information field
    pub ext_field: Vec<(CompactString, Profile)>,
    pub is_condition: bool,
}

a reduction in memory usage of several bytes per detection result record was achieved.
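The per-record effect of removing a field can be checked with std::mem::size_of. The structs below are simplified, hypothetical stand-ins for DetectInfo that use String instead of CompactString. The sketch only measures the fixed-size part of each record, so any heap data owned by the removed field would be saved on top of that.

```rust
use std::mem::size_of;

/// Simplified stand-in for the struct before the change (hypothetical).
#[allow(dead_code)]
struct RecordBefore {
    rulepath: String,
    detail: String,
    record_information: String,
    is_condition: bool,
}

/// The same struct with the `record_information` field removed.
#[allow(dead_code)]
struct RecordAfter {
    rulepath: String,
    detail: String,
    is_condition: bool,
}

/// Fixed-size bytes saved across `records` records (heap contents excluded).
fn saved_bytes(records: usize) -> usize {
    records * (size_of::<RecordBefore>() - size_of::<RecordAfter>())
}

fn main() {
    println!("before: {} bytes per record", size_of::<RecordBefore>());
    println!("after:  {} bytes per record", size_of::<RecordAfter>());
    println!("saved for 1,500,000 records: {} bytes", saved_bytes(1_500_000));
}
```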

Effectiveness (Real example from a Pull Request) ¶

In the following example, when tested against data where the number of detection result records was about 1.5 million,

we were able to achieve about a 300MB reduction in memory usage.

Benchmarking¶

Use the memory allocator's statistics function¶

Some memory allocators maintain their own memory usage statistics. For example, in mimalloc, the mi_stats_print_out() function can be called to obtain memory usage.

How to obtain statistics ¶

Prerequisites: You need to be using mimalloc as explained in the Change the memory allocator section.

In Cargo.toml's dependencies section, add the libmimalloc-sys crate:

[dependencies]
libmimalloc-sys = { version = "*",  features = ["extended"] }

Wherever you want to print the memory usage statistics, call mi_stats_print_out() inside an unsafe block as shown below. With None as the output function, the statistics are written to mimalloc's default output, which is standard error.

use libmimalloc_sys::mi_stats_print_out;
use std::ptr::null_mut;

fn main() {
    // Write the following code where you want to measure memory usage
    unsafe {
        mi_stats_print_out(None, null_mut());
    }
}

The upper left peak/reserved value is the maximum memory usage.

Example ¶

The above implementation was applied in the following:

add --debug option for printing mimalloc memory stats #822

In Hayabusa, if you add the --debug option, memory usage statistics will be outputted at the end.
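When mimalloc is not in use, for example while testing with the default allocator, a rough equivalent can be sketched with the standard library alone by wrapping std::alloc::System in a custom global allocator. TrackingAlloc and the helper functions below are hypothetical names. The counters only cover requested sizes, not allocator overhead or memory reserved from the OS, so the numbers are not directly comparable with mimalloc's statistics.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::hint::black_box;
use std::sync::atomic::{AtomicUsize, Ordering};

static LIVE_BYTES: AtomicUsize = AtomicUsize::new(0);
static PEAK_BYTES: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical wrapper around the system allocator that tracks usage.
struct TrackingAlloc;

unsafe impl GlobalAlloc for TrackingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            // Update the live byte count and remember the highest value seen.
            let live = LIVE_BYTES.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK_BYTES.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE_BYTES.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static TRACKER: TrackingAlloc = TrackingAlloc;

/// Bytes currently allocated through the tracking allocator.
fn live_bytes() -> usize {
    LIVE_BYTES.load(Ordering::Relaxed)
}

/// Highest value `live_bytes()` has reached so far.
fn peak_bytes() -> usize {
    PEAK_BYTES.load(Ordering::Relaxed)
}

fn main() {
    // black_box keeps the optimizer from removing the unused allocation.
    let data = black_box(vec![0u8; 1 << 20]); // 1 MiB
    println!("live while allocated: {} bytes", live_bytes());
    drop(data);
    println!("live after drop:      {} bytes", live_bytes());
    println!("peak:                 {} bytes", peak_bytes());
}
```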

Use Windows' performance counter¶

Resource usage can also be checked from statistics provided by the OS. When doing so, note the following two points.

Influence from anti-virus software (Windows Defender): Only the first run after a build is slowed down by the scan, so results from the second and subsequent runs are suitable for comparison. (Or you can disable your anti-virus for more accurate results.)
Influence from file caching: Runs after the first one since OS startup are faster because evtx files and other file IO are served from the file cache in memory, so the first run after the OS boots is better suited for benchmarks.

How to obtain ¶

Prerequisites: The following procedure is only valid for environments where PowerShell 7 is already installed on Windows.

Restart the OS

Run PowerShell 7's Get-Counter command which will continuously record the performance counter every second to a CSV file. (If you would like to measure resources other than those listed below, this article is a good reference.)

Get-Counter -Counter "\Memory\Available MBytes",  "\Processor(_Total)\% Processor Time" -Continuous | ForEach {
     $_.CounterSamples | ForEach {
         [pscustomobject]@{
             TimeStamp = $_.TimeStamp
             Path = $_.Path
             Value = $_.CookedValue
         }
     }
 } | Export-Csv -Path PerfMonCounters.csv -NoTypeInformation

Execute the process you want to measure.

Example ¶

The following contains an example procedure for measuring performance with Hayabusa.

Example of obtaining Windows performance counters

Use heaptrack¶

heaptrack is a sophisticated memory profiler available for Linux and macOS. By using heaptrack, you can thoroughly investigate bottlenecks.

How to obtain ¶

Prerequisites: Below is the procedure for Ubuntu 22.04. You cannot use heaptrack on Windows.

Install heaptrack with the following two commands.

sudo apt install heaptrack
sudo apt install heaptrack-gui

Remove the following mimalloc code from Hayabusa. (You cannot use heaptrack's memory profiler with mimalloc.)
https://github.com/Yamato-Security/hayabusa/blob/v2.2.2/src/main.rs#L32-L33
https://github.com/Yamato-Security/hayabusa/blob/v2.2.2/src/main.rs#L59-L60
https://github.com/Yamato-Security/hayabusa/blob/v2.2.2/src/main.rs#L632-L634
Replace the [profile.release] section in Hayabusa's Cargo.toml file with the following:
```
[profile.release]
debug = true
```
Build a release build: cargo build --release
Run heaptrack hayabusa csv-timeline -d sample -o out.csv

Now when Hayabusa finishes running, heaptrack's results will automatically open in a GUI application.

Examples ¶

An example of heaptrack's results is shown below. The Flame Graph and Top-Down tabs allow you to visually check functions with high memory usage.

Contributions¶

This document is based on findings from actual improvement cases in Hayabusa. If you find any errors or techniques that can improve performance, please send us an issue or pull request.