This post was written in 2018. Tools and versions mentioned may be outdated, though the underlying ideas still hold.

Intro

Quick write-up of what I actually learned from this Alibaba Cloud Security algorithm competition.

Refactored my autoclf framework — added debug and evaluation support
Dug into how LGB (LightGBM) works under the hood; in the runs below it beat XGB (XGBoost) everywhere I ran both
Did statistical analysis on malware API call sequences to figure out distinguishing features
Picked up some new awk tricks for data wrangling
Spent time on hyperparameter tuning and tracking down data quality issues

Preprocessing

Each row in the raw data is a sandbox API call made by a malware sample. The fields are: file ID, file type, API name, thread ID, return value, and sequence number.

The file is massive: 1GB compressed, 14GB unzipped. By my estimate pandas would need at least 64GB of RAM just to read it, so that’s a hard no. I went straight to awk and split the rows so that each file_id gets its own file.

awk -F "," '{print $0 > ("FILE" $1)}' train.csv

Which gives you one file per file ID in the working directory, each named FILE followed by the ID.

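There’s one catch though: some machines will throw a “too many open files” error, because awk keeps every output file open until it exits. The workaround is to close each file after writing, but it makes things significantly slower. Note the switch from > to >> in the version below: once a file has been closed, reopening it with > would truncate it.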

awk -F "," '{print $0 >> ("FILE" $1); close("FILE" $1)}' train.csv
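A possible middle ground, sketched here under the assumption that rows for the same file ID mostly sit next to each other in train.csv: close the output file only when the file ID changes, so awk holds one descriptor at a time without paying for a close on every row. The sample rows and the temp directory are made up for illustration.

```shell
#!/bin/sh
# Sketch: keep one output file open at a time; close it only when the file ID changes.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
cd "$workdir"

# Made-up rows in the column order described above:
# file_id,label,api,tid,return_value,index
cat > sample.csv <<'EOF'
1,0,NtClose,100,0,0
1,0,LdrLoadDll,100,0,1
2,5,NtClose,200,0,0
1,0,NtClose,101,0,2
EOF

awk -F "," '
    $1 != (prev "") {                         # string comparison: file ID changed
        if (prev != "") close("FILE" prev)    # release the previous descriptor
        prev = $1
    }
    { print $0 >> ("FILE" $1) }               # append, so a reopened file keeps its rows
' sample.csv

file1_rows=$(wc -l < FILE1 | tr -d ' ')
file2_rows=$(wc -l < FILE2 | tr -d ' ')
echo "FILE1: $file1_rows rows, FILE2: $file2_rows rows"
```

Because it appends, running it twice over the same directory duplicates rows, so start from an empty output directory.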

After splitting, you also need to move the files into category-specific folders:

# Assumes the split files sit in FILELIST/ and traintype/0 through traintype/5 already exist.
for fpath in "$PWD"/FILELIST/*
do
    # Every row of a file carries the same label in field 2, so the first row is enough.
    ftype=$(head -n 1 "$fpath" | awk -F, '{print $2}')
    case $ftype in
        0) echo "Normal";      mv "$fpath" traintype/0 ;;
        1) echo "Ransomware";  mv "$fpath" traintype/1 ;;
        2) echo "Miner";       mv "$fpath" traintype/2 ;;
        3) echo "DDoS Trojan"; mv "$fpath" traintype/3 ;;
        4) echo "Worm";        mv "$fpath" traintype/4 ;;
        5) echo "Infector";    mv "$fpath" traintype/5 ;;
    esac
done

I also extracted per-file API call statistics and stored them separately:

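# Run from inside traintype/; writes per-file counts to 0summary/ through 5summary/.
for i in {0..5}
do
    echo "$i"
    mkdir -p "${i}summary"
    for fpath in "$i"/*
    do
        echo "$fpath"
        # Count each (API name, thread ID, return value) triple, most frequent first.
        awk -F , '{print $3" "$4" "$5}' "$fpath" | sort | uniq -c | sort -bgr > "${i}summary/$(basename "$fpath")"
    done
done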

Full shell scripts in the references.

Features and Statistical Analysis

Six classes total: 0-Normal / 1-Ransomware / 2-Miner / 3-DDoS Trojan / 4-Worm / 5-Infector. A quick note on approaches: a teammate went with TF-IDF features and treated the whole thing as a text classification problem, and it worked pretty well.

The three main approaches I tried:

Train on API call statistics — the summary files from above
Extract the API call sequence per file, deduplicate, compare against class-0 (normal), then train
Treat it as text classification with TF-IDF features
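A minimal sketch of approach 2, assuming "compare against class-0" means keeping the API names that never show up in a normal sample. The file names and rows below are made up, and the real pipeline may have compared sequences differently.

```shell
#!/bin/sh
# Sketch: deduplicate API names per file, then keep those absent from a normal sample.
set -eu
export LC_ALL=C   # sort and comm must agree on ordering
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Made-up rows: file_id,label,api,tid,return_value,index
cat > "$workdir/normal.csv" <<'EOF'
10,0,LdrLoadDll,100,0,0
10,0,NtClose,100,0,1
10,0,LdrLoadDll,100,0,2
EOF
cat > "$workdir/sample.csv" <<'EOF'
20,1,LdrLoadDll,300,0,0
20,1,CryptHashData,300,0,1
20,1,NtClose,300,0,2
20,1,CryptHashData,300,0,3
EOF

# Print each API name once, in order of first appearance.
dedup_apis() { awk -F, '!seen[$3]++ { print $3 }' "$1"; }

dedup_apis "$workdir/normal.csv" | sort > "$workdir/normal.apis"
dedup_apis "$workdir/sample.csv" | sort > "$workdir/sample.apis"

# comm -23 keeps lines found only in the first file.
only_in_sample=$(comm -23 "$workdir/sample.apis" "$workdir/normal.apis")
echo "$only_in_sample"
```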

For model selection, the usual workflow is to run a bunch of algorithms first, pick the top performers, then tune them with grid search and ensemble the results. That said, for me personally this competition was more about building intuition for malware analysis than chasing a leaderboard score. Anyway, here are the log_loss results from approaches 1 and 2:

summary\log_loss	0/1	0/2	0/3	0/4	0/5
xgb	0.00594	0.014	0.013	0.00359	0.11094
lgb	-	0.012	0.006	-	0.05

Each column pits class 0 against one malware class. Where XGB’s log_loss was already tiny (0/1 and 0/4) I didn’t bother re-running with LGB. Running all the summaries together gives a combined log_loss around 0.13, which sounds decent locally, but it blew up to over 1.2 on the public leaderboard. Yeah. Approach 2 (deduplicated API sequences) didn’t look great either:

apis\method	xgb	lgb
log_loss	0.77228245	0.6040165

To be fair, I only used 1/5 of the class-0 samples as negatives — using all of them would probably help. Other teams using neural networks were hovering around 0.1.
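For reference, log_loss is the mean negative log of the probability the model gave to the true class. The awk sketch below assumes a made-up input format (true label, then one probability per class) and is not the competition's scoring script.

```shell
#!/bin/sh
# Sketch: multi-class log loss from rows of "true_label p_class0 p_class1 ...".
set -eu

log_loss() {
    awk 'BEGIN { eps = 1e-15 }
    {
        p = $($1 + 2)              # probability assigned to the true class
        if (p < eps) p = eps       # clip so log(0) never occurs
        total -= log(p)
        n++
    }
    END { printf "%.4f\n", total / n }'
}

# Two fairly confident, correct predictions over two classes.
confident=$(printf '%s\n' '0 0.9 0.1' '1 0.2 0.8' | log_loss)

# One prediction that spreads probability evenly over six classes.
uniform=$(printf '%s\n' '3 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667' | log_loss)

echo "confident: $confident, uniform: $uniform"
```

Guessing uniformly across six classes scores ln 6, about 1.79, which gives a sense of how little a leaderboard score above 1.2 carries over from a local 0.13.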

Approach 3 ended up being the winner. My teammate rode TF-IDF features to rank 6 on the A-leaderboard with a loss around 0.06. We pushed that to 0.049, but slid to 8th on the B-leaderboard. When I asked around later, it turned out the training set had a suspicious pattern: malicious samples were clustered around consecutive file IDs such as file1122 and file1123, and some samples were duplicates produced by re-running the same malware. Anyway, enough about leaderboard drama.
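To make the TF-IDF idea concrete, here is a rough awk sketch that treats each file ID as a document and each API name as a term. It uses the plain textbook weighting (term frequency times log(N/df)), which is an assumption; my teammate's actual pipeline and any library smoothing may differ. The rows are made up.

```shell
#!/bin/sh
# Sketch: textbook TF-IDF over API names, one "document" per file ID.
set -eu
export LC_ALL=C
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

# Made-up rows: file_id,label,api,tid,return_value,index
cat > "$sample" <<'EOF'
1,0,NtClose,100,0,0
1,0,LdrLoadDll,100,0,1
2,1,NtClose,300,0,0
2,1,CryptHashData,300,0,1
2,1,CryptHashData,300,0,2
EOF

tfidf=$(awk -F, '
    {
        if (!($1 in len)) ndocs++          # first row of a new document
        len[$1]++                          # total calls in this document
        if (tf[$1, $3]++ == 0) df[$3]++    # first time this API appears in this document
    }
    END {
        for (key in tf) {
            split(key, part, SUBSEP)       # part[1] = file ID, part[2] = API name
            weight = (tf[key] / len[part[1]]) * log(ndocs / df[part[2]])
            printf "%s %s %.4f\n", part[1], part[2], weight
        }
    }
' "$sample" | sort)
echo "$tfidf"
```

NtClose appears in both documents, so its weight drops to zero, while CryptHashData, which only the second file calls, gets the highest weight.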

The interesting thing is that you can actually get solid results without using any security domain knowledge at all — pure statistical patterns work. But we should still analyze things from a security lens. Each malware type also has sub-variants: DDoS trojans, for example, range from process-killing types to memory-flooding types. Most malware code is tight and minimal — typically drops a DLL into the system first, then unpacks the payload. Below are some quick observations and tricks per class.

Note: pie charts show aggregate API call statistics across all samples in that class; scatter plots show 2D and 3D views of API call sequences for individual samples.

0-Normal

API sequences tend to repeat on a timer — basically behavioral patterns that show up again later. The overall API call distribution looks fairly uniform.

1-Ransomware

Lots of file I/O. Over 80% of samples call into the Crypt family (CryptHashData and similar), though not at a high frequency. You also see a lot of LdrLoadDll and NtDeviceIoControlFile.
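A figure like that Crypt-family share can be checked with a few lines of awk. The sketch below assumes "Crypt family" means any API whose name starts with Crypt, and runs on made-up rows rather than the competition data.

```shell
#!/bin/sh
# Sketch: share of files that call at least one Crypt* API.
set -eu
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

# Made-up rows: file_id,label,api,tid,return_value,index
cat > "$sample" <<'EOF'
20,1,LdrLoadDll,300,0,0
20,1,CryptHashData,300,0,1
21,1,NtDeviceIoControlFile,310,0,0
21,1,CryptAcquireContextW,310,0,1
22,1,NtClose,320,0,0
EOF

crypt_share=$(awk -F, '
    !($1 in files)                 { files[$1]; total++ }       # count each file once
    $3 ~ /^Crypt/ && !($1 in hits) { hits[$1]; with_crypt++ }   # file has a Crypt* call
    END { printf "%d/%d\n", with_crypt, total }
' "$sample")
echo "files calling Crypt* APIs: $crypt_share"
```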

2-Miner

Very distinctive signature: heavy multi-threading, lots of network operations, and file creation (to sync blockchain data). I first noticed this while doing the dedup-based feature selection, then confirmed it visually. One edge case: smarter miners stick to a single TID, so they load just one core and keep a lower profile.
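The multi-threading signal is easy to turn into a per-file feature. This is a sketch on made-up rows: it counts distinct thread IDs per file, and any cutoff for "heavy" is left open because I never fixed one.

```shell
#!/bin/sh
# Sketch: number of distinct thread IDs seen in each file.
set -eu
export LC_ALL=C
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

# Made-up rows: file_id,label,api,tid,return_value,index
cat > "$sample" <<'EOF'
30,2,NtCreateFile,400,0,0
30,2,NtCreateFile,401,0,1
30,2,NtClose,400,0,2
30,2,NtCreateFile,402,0,3
31,2,NtCreateFile,500,0,0
31,2,NtClose,500,0,1
EOF

tid_counts=$(awk -F, '
    !(($1, $4) in seen) { seen[$1, $4]; tids[$1]++ }   # first time this TID shows up in this file
    END { for (f in tids) print f, tids[f] }
' "$sample" | sort)
echo "$tid_counts"
```

File 31 is the single-TID case mentioned above: it would slip past a simple thread-count rule, so this feature works best alongside the others.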

3-DDoS Trojan

This class took the most analysis time. Flavors include process terminators, memory flooders, and even the “popup-window spam until the system dies” variant. For a full taxonomy see DDOS type.

The dominant API calls are GetSystemMetrics, NtClose, NtAllocateVirtualMemory, plus a bunch of registry operations. The clearest signal is multi-threaded behavior: the same API call sequence showing up across different TIDs. Compared to miners, which also use multiple threads, DDoS trojans call these APIs at a much higher volume.
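One way to measure "the same API call sequence across different TIDs" is to rebuild each thread's sequence and count how many threads share an identical one. The sketch assumes the input is a single split file with rows already in call order; the rows here are made up.

```shell
#!/bin/sh
# Sketch: size of the largest group of TIDs running an identical API sequence.
set -eu
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT

# Made-up rows for one file: file_id,label,api,tid,return_value,index
cat > "$sample" <<'EOF'
40,3,GetSystemMetrics,600,0,0
40,3,GetSystemMetrics,601,0,1
40,3,NtClose,600,0,2
40,3,NtAllocateVirtualMemory,602,0,3
40,3,NtClose,601,0,4
EOF

shared=$(awk -F, '
    { seq[$4] = seq[$4] " " $3 }       # append the API name to the sequence of this TID
    END {
        for (t in seq) groups[seq[t]]++            # how many TIDs share each sequence
        for (s in groups) if (groups[s] > best) best = groups[s]
        print best + 0
    }
' "$sample")
echo "largest group of TIDs with identical sequences: $shared"
```

Exact matching is strict. Real threads often differ by a call or two, so a looser comparison (prefixes or n-grams) might be worth trying, though I have not tested that here.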

4-Worm

Key behavior: self-replication, spreads to other machines. The 2D scatter plots aren’t that readable, but the 3D view is interesting — each TID sequence looks like it’s growing to match the most complete TID, with visible breaks in the sequence. Unlike miners and DDoS where multiple threads run the same API sequence in parallel, here the other TIDs seem to be converging toward the longest one.

5-Infector

Honestly the hardest class for me. I kept trying to analyze it like a generic virus, forgetting this specifically means file infectors. I couldn’t find a clear pattern, even though there are 3000+ samples. The loss on this class is the highest of all six — and it’s dragging the overall score down. One thing I did notice from the 3D scatter plots: rotating to different angles consistently gives a good overlay across samples. Not sure if that’s meaningful or just noise. Still needs more investigation.

Misc

There’s something genuinely elegant about how this malware is written. I remember reading the Mirai source code a while back and thinking the same thing.
Plenty of other directions to explore here.
The pie charts were made in Google Sheets. Ideas about how to evade model-based detection are in my notes.
If I hooked into the kernel to capture syscall data directly, could I use that for analysis? Would the volume be too overwhelming? Maybe start with hard statistical thresholds first?
