Solving CTFs with AI Agents
Its 2025. The dawn of AGI is upon us. Binaries are just tokens. Code is just tokens. Symbolic execution is a brittle memory of the past. Manual reverse engineering is an ancient art form. Embrace the future… embrace … tokens.
nfuncs
I played DEF CON 33 Quals this year with SuperDiceCode (placing 2nd thanks to my strong teammates!).
I spent a lot of time on the nfuncs
challenge, written by fish, a known symbolic execution lover and one of the core devs of angr. (note: this writeup may be blasphemous for those who prefer static analysis tools with strong guarantees, you may want to avert your eyes)
The n<challenge>
naming scheme is a callback to the original ncuts
challenge. It hints that the objective here is not to solve one hard problem, but rather to automatically solve thousands of mini problems. In previous years, these mini problems have all been fairly easy to solve by hand, but it would be too time consuming and tedious to solve them all.
Unsurprisingly, this year was the same, where the objective was to decode a huge binary consisting of a Matryoshka-like unpacking process.
This writeup summarizes my attempt at solving the challenge in a fairly general way using an o3-mini
powered agent. This solution was enough for nfuncs1
, but proved too slow for nfuncs2
. We realized the encoding scheme was repetitive enough to solve with a heuristic-based script too late, and thus only managed to decode 25% of part 2 by the end of the CTF.
However, I think this type of solution path is interesting. o3-mini
was more capable than I expected, and as more models come out, costs and latencies are reduced, I think these types of solutions will become much more common for automation tasks.
Overview
The challenge provides us with a 3 GB Windows executable (which Windows 11 even refuses to run).
After loading it into Binary Ninja (hint: disable automatic linear sweep!), we see that most of the binary is just random data, except for a few functions including this one:
This function decrypts sub_140001510 by first calling VirtualProtect to make memory writable. It then decrypts the memory using an XOR key and calls VirtualProtect again to revert memory to read-only. Finally, it jumps to the newly decrypted function.
Decoding this by hand, we uncover sub_140001510 :
This function does some preparation, then calls _read 8 times, and validates the 64-bit value that is read in, printing either “:)” ( 0x293a ) or “:(” ( 0x283a ) depending on whether the value is correct.
In this case, the lookup table
s
performs a mapping from x
to x + 0xd
. In the main branch at
14000185f
, we see each byte, mapped through this lookup table, results in the sequence 965d5b541a172717
. Subtracting 0xd
from each byte, we get the input \x89PNG\x0d\x0a\x1a\x0a
(which happens to be the start of a PNG file).
Next, on line 1400018c8 , it performs some bit manipulation on this value to generate an XOR key, and uses it to decrypt the next function.
Looking at the next function, we see a similar pattern:
Here, on line
14000f0e1
, we find a direct comparison to 0000000d49484452
(\x00\x00\x00\rIHDR
), a continuation of the PNG file, describing the size and type of the first chunk (the IHDR chunk).
We also see the same bit manipulation pattern as before on line 14000f148 , where the XOR key is generated.
By the fourth function, we see some different structures:
Here there is a similar user input check at 140027687 . But now, the XOR key is 16 bytes (indicated by the 0xf at 140027730 ). Also the XOR key is now generated by a cryptographic-looking subroutine sub_1400273a0 , seen below:
In general, we observe that the binary contains functions which read 8 bytes, validate the user input, and then use the input somehow to generate an XOR key (either 8 or 16 bytes), which is then used to decrypt the next function:

Importantly, we can also tell if the XOR key is correct by looking at whether the decompiled function code is sensible. For example, the following is a function decoded with an incorrect XOR key:
In particular, we will check if there is a call to VirtualProtect as an oracle for whether the XOR key is correct.
Thinking with tokens
From manually decoding the first few functions, we learn that the overall objective of the challenge is to decode each function consecutively, solving 8-bytes of a PNG file along the way. There is some variety in the function templates, but overall they have a similar structure.
A year or two ago, I might have tried dynamic tracing or symbolic execution to hook the functions and then solve for the user input. But that sounds boring…
Instead of all that, we’re going to build an agent that will solve each function for us!
Behold:
Flow
Based on our initial observations, most functions are self-contained and consist of a separate user-input check, followed by an XOR key derivation. We split the flow up into two main steps:
- In the first step, we instruct the LLM to analyze the
_read
calls, deduce the constraints on the user input, and determine the correct 64-bit integer. - In the second step, we then instruct the LLM to figure out the XOR key derivation and produce an XOR key (along with its size).
We can then validate the output of the agent by performing the decryption and checking if the decompiled function is sensible (i.e. contains a call to VirtualProtect ).
In both steps, we also provide the LLM with the ability to invoke the run_python
tool which will run arbitrary Python code in a jupyter-notebook style cell. LLMs are not good at precise mathematical reasoning, but are capable of writing Python programs to compute things when instructed to do so.
Prompts
The full first prompt is the following:
You are an expert CTF player. Your task is to understand a function in a decompiled binary for a CTF challenge. The function consists of two steps: 1. 8-bytes of user input are read into a buffer and treated as a 64-bit little-endian integer. 2. This integer is manipulated in a series of arithmetic operations, to produce an XOR key. 3. The XOR key is used to deobfuscate another part of memory. For now, work on the first step: 1. Identify the `_read` calls to find where the user input is read into a buffer. 2. There should be 8 bytes read, but the buffer is treated as a single 64-bit little-endian integer. 3. Determine the correct user input (as a 64-bit integer). You have access to the `run_python` tool, which executes a python notebook cell and returns the output. Do not attempt to perform any calculations yourself, let the `run_python` tool do that. Do not perform any kind of brute force search, the validation is quite simple, just figure out how to recover the user input directly. Each byte of the user input should be constrained to a specific value. If the partial condition is (x ==) or (x != ), then that byte should be . You can perform a search if you need to, but check each byte individually on the partial condition. When you have identified the correct user input, run the `UserInput` tool to submit your answer. Provide the user input as a little-endian hex integer with the 0x prefix. [CODE] {code} [/CODE]
We inject the HLIL (as text) inside the [CODE]
tags. Also if the function has any subroutine calls, we include their HLIL inside a [HELPER_FUNCTIONS]
tag appended to the end of the prompt.
Once the Agent performs the UserInput
call, indicating they have recovered the user input, we then append the second prompt:
Now that you have identified the correct user input, you need to determine the XOR key. The XOR key will be used in the loop after the VirtualProtect call. Work step by step: 1. Analyze the code to identify how the user input is used to calculate the XOR key. 2. Carefully examine the logic and translate it to equivalent python code 1 to 1. 3. Use the `run_python` tool to run the python code and calculate the XOR key. If there are any helper functions, be careful to keep track how the function arguments and return values are used. When you have identified the correct XOR key, run the `XORKey` tool to submit your answer. Provide the XOR key as a little-endian hex integer with the 0x prefix. Also return the key_size (number of bytes) as an integer. Check the & condition inside the unpacking loop to determine how many bytes are used. If it is helpful, here is the same code represented in MLIL. Make sure to pay attention to the number and order of arithmetic operations. [MLIL] {mlil} [/MLIL]
We observed that sometimes, the agent wasn’t quite able to translate nested expressions correctly into python so we also provide the MLIL representation of the function (where each expression like rol
and ror
is on a new line) to improve reasoning ability.
Examples
Here are some example traces of the agent solving the first 5 functions with a variety of schemes:
Func 1: Func 1 - Shifted input + Bit manipulation XOR
Func 2: Func 2 - Direct input + Bit manipulation XOR
Func 3: Func 3 - Shifted input + Bit manipulation XOR
Func 4: Func 4 - Direct input + Subroutine S-box XOR key expansion (agent was able to discover S-box on its own)
Func 5: Func 5 - Bit checking input + Subroutine S-box XOR key expansion (agent wrote a python-based solver to solve the user input check and then used the same S-box key expansion to solve the XOR key)
What was really cool to see was that the o3-mini
was able to deduce that a function like the following was implementing AES S-box generation and then using that to expand the user input into a 16-bit XOR key. (See Func 4 and 5 above for an example)
It was able to then generate Python code like the following to solve it:
def aes_sbox():
# Standard AES S-box
return [
0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5,
0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76,
0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0,
0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0,
# (... rest of the S-box)
]
def generate_key(user_input):
# user_input is a 64-bit integer, little endian representation was given.
# Step 1: compute arg2 = user_input + 0x6274eb3f
arg2 = user_input + 0x6274eb3f
result = arg2
s = aes_sbox()
out_bytes = []
# Generate 32 bytes as per loop
for i in range(32):
idx = result & 0xff
b = s[idx]
out_bytes.append(b)
result = (result >> 4) ^ b
return out_bytes
# Given user input:
user_input = 0x0e778d0000000608
key_schedule = generate_key(user_input)
# The XOR loop later uses only 16 bytes: index = i mod 16 into key_schedule
xor_key = key_schedule[:16]
# Represent xor_key as a little-endian hex integer.
# Since xor_key is a sequence of 16 bytes in order, that order is little-endian order.
val = 0
for i, b in enumerate(xor_key):
val |= (b << (8 * i))
print(f"XOR key (little-endian hex): 0x{val:032x}")
print(f"key_size: {len(xor_key)}")
Takeaways
During the actual CTF, we used this agent to solve around the first 100 functions (left it running overnight Saturday), at a cost of around $20.
However, after examining the decoded functions, we realized that many of them were repetitive enough to solve with a heuristic-based script and ended up switching to that for the rest of the challenge which was much faster (and cheaper).
The agent did obviate the need to actually manually reverse engineer any decoding schemes. I ended up just stealing one its solutions for the S-box key expansion and for the bit checking user input (Func 4 above).
Later in part two of the challenge, a few new schemes were introduced, including an embedded RC4-based user input checker, which o3-mini
was able to solve by recognizing the RC4 pattern and writing the correct Python code to decode it.
Code
You can find the code for the agent here: hgarrereyn/nfuncs-agent.
This is messy CTF proof-of-concept code, but it may serve useful as an example.