Debugging OXO agents

This article is a follow-up to the previous article on how to write an OXO Agent. In the following sections we will share tips and tricks to make the task of writing & debugging agents a simple task.

OXO’s engine & Agents logs

OXO’s engine accepts a --debug flag, that will set all loggers, OXO’s, and any third-party package to debug level. This will display a lot of information about what is happening, some of which are very useful, for example, you can answer questions like was the network created successfully, are the agent’s configurations the right ones, are system agents healthy, etc.

Suppose you are only interested in investigating a specific agent or a set of agents. In that case, you can use the --follow agent/<organisation>/<agent_name> flag to only display logging messages for a specific agent. You can follow as many as you want; each will be color coded differently for easier readability.

Here are both flags in action:

oxo --debug scan run --agent agent/ostorlab/subfinder --follow agent/ostorlab/subfinder --agent agent/ostorlab/dnsx --follow agent/ostorlab/dnsx domain-name ostorlab.co

--follow flag

You are basically debugging docker containers

Since the agents are docker services working collectively to scan an asset, one can make use of docker commands to help investigate:

docker service ls: Lists running services. Useful to see which agents are up and which are down.
docker service logs <service>: Displays log messages of the agent docker service.
docker service inspect <service>: Displays metadata of the service’s state, configurations, etc.
docker stats <container>: Displays information about the state of the container like CPU%, memory & IO usage.

Remember that you can always get a shell inside the agent with:
docker exec -it <container> bash/sh, for more control over how you want to drive your investigation.

In some cases, you won’t have a clear error message, but you know something is wrong, maybe the message is not emitted or processed, maybe it was emitted, but to the wrong queue, maybe you added a unicity check to not process the same message twice, but it is still getting processed again and again, or it could be a completely different reason.

RabbitMQ

The agents use RabbitMQ as a message broker, to send and receive messages. One thing we can do is to get a shell into the RabbitMQ container and use the RabbitMQ cli commands to:

Get the list of queues with columns specified, you can add and remove them depending on your need:

rabbitmqadmin list queues name node messages messages_ready messages_unacknowledged message_stats.publish_details.rate message_stats.deliver_get_details.rate message_stats.publish

queues stats

Display the list of bindings: bindings are rules for routing messages to respective queues:

rabbitmqadmin list bindings

RabbitMQ bindings

Test queues to respond within the timeout. Lists those who did not respond:

rabbitmqctl list_unresponsive_queues

Another approach you can take when it comes it the RabbitMQ service is to use the --mq-exposed-ports flag and specify the port of the RabbitMQ management user interface to be exposed, for example :

oxo scan --mq-exposed-ports 15672:15672 run --agent ...

Then you can access the interface with 127.0.0.1:15672 with 'guest' as the default username and password.

RabbitMQ management UI

You can achieve the same as the command line we saw in the previous section like listing queues, exchanges, the number, and rate of delivered or processed messages or just get an overall idea about the state of the service.

RabbitMQ queues management UI

Redis

Redis does not expose a management user interface, however since it is intuitive, we do not really need that. With basic commands we can answer questions like:

What keys are persisted?

KEYS <regex>

you can use * to show all the keys or any regular expression to limit the output.
Does a key exist?
For strings, bytes and sets use:

EXISTS <key>

For hashmaps use:

HEXISTS key fields

What’s the value type of a specific key?

TYPE <key>

What's the value of a specific key? Depending on the type: For string/bytes values

GET <key>

For hash-maps values

HGET <key> <field>

Or to list all the values

HGETALL <key>

For all values of a set

SMEMBERS <key>

Tracing

Distributed tracing is a way to get a bird's view into how a request is flowing from one system to another. It can help us troubleshoot messages as they move from one agent to another.

OXO comes with instrumentation already implemented for both process & emit messages methods. In other terms you can see when a message was processed, with which selector, the content of its data, and also the value of its output or emitted results

To enable the tracing of messages between agents, we can run the classic oxo scan run command with the --tracing flag. This will expose jaeger; a user interface for monitoring and tracing transactions between distributed services. In simple terms, it serves the purpose of visualizing the chain of events happening between microservices, in our case between Agents. You can follow a message being processed by Agent1, then the results of the processing emitted to Agent2 and so on. You can see different attributes like time, duration, which selector was used, the value of the message..etc.

Before explaining the results, let’s talk briefly about the main concepts of OpenTelemetry.

Trace: The path taken by requests, in our case the messages, as they propagate between multi-service architectures, in our case between agents.
Span: A tracked operation performed by or to the request. In our case, for example, the message being processed.
Context: Metadata that helps track spans when they move between the services. It is basically the traceID and spanID that help us link the flow of the message between the agent.

So by running :

oxo scan --tracing run --agent agent/ostorlab/subfinder --agent agent/ostorlab/dnsx domain-name ostorlab.co

Subfinder and DNSx agents will be launched on ostorlab.co and a browser will open at: http://127.0.0.1:16686 with the jaeger user interface.
The other agents get spawned after the jaeger service, so it is normal to not see anything at first.

We can see the 2 agents in the services dropdown, if we take the example of the Subfinder agent:
We see our first domain message ostorlab.co as the root span, and its children underneath, describing the path of what happened. The first message was processed, and 10 other messages were emitted, each with a specific subdomain: docs.ostorlab.co, blog.ostorlab.co.

Subfinder trace overview

Each of those were picked up by the DNSx agent, processed and each emitted DNS records with extra information.

Subfinder trace details

Unit tests

At the beginning of the agent development cycle, it is preferred to make use of the power of unit tests to make sure your code is actually working and achieving its purpose. It is an efficient way to eliminate bugs or weird behaviors coming directly from the agent code and not from external sources or from the interaction of your agent with the rest of the system.

Ostorlab offers pytest fixtures to help cover the whole process in unit tests, for example:

agent_run_mock: This fixture patches the agent low-level methods like the agent service health check, initializing the RabbitMQ server, connecting to it, consuming and emitting messages. It also returns an AgentRunInstance instance with the follwing attributes:

agent_run_mock.raw_messages: List
agent_run_mock.emitted_messages: List

So a check like:

assert len(agent_run_mock.emitted_messages) > 0

tells us that there was some emitted message. We can access the first element for example with:

agent_run_mock.emitted_messages[0]

which is a Message class with the following attributes:

message.selector
message.data

These attributes can help us verify that we have used the correct selector for example or that the message data is what we expected.

agent_persist_mock: This fixture patches the Agent Redis persistence component and returns a dictionary mimicking Redis, with the keys as Redis keys, and the values respecting the types stored in Redis. This is useful, for example to test agents that inherit from the persist_mixin to implement a unicity check for the received messages, to not process the exact message multiple times, or when you need a counter that can be used by all replicas of an agents.

assert agent_persist_mock.get('<key>') is not None
assert agent_persist_mock.get('<global_counter>') == 42

Persisting Messages

When running a scan, you might want to persist some messages that were processed in local. This can be achieved by including the Nebula Agent in the list of agents to run.

The Nebula Agent serves as a tool for persisting messages to local files, allowing you to specify the desired file type (e.g., json). This functionality proves invaluable for debugging purposes, as it provides visibility into the messages that have been processed.

To run scan with the Nebula Agent and persist the messages to a file:

ostorlab scan run --follow agent/ostorlab/nebula --agent agent/ostorlab/nebula --agent agent/ostorlab/subfinder --agent agent/ostorlab/dnsx domain-name ostorlab.co

The Nebula Agent will persist the messages to files based on the selector in the ~/.ostorlab/messages directory:

Persisted messages

By default, all types of messages are persisted. However, you can specify the types of messages you want to persist by adding the in_selectors parameter in the agent group definition:

kind: AgentGroup
description: Test Group for documentation.
agents:
  - key: agent/ostorlab/nebula
    args:
      - name: file_type
        type: string
        value: "json"

    in_selectors:
      - v3.asset
      - v3.fingerprint

  - key: agent/ostorlab/subfinder
    args: [ ]

  - key: agent/ostorlab/dnsx
    args: [ ]

Conclusion

We have seen different aspects of how to debug agents. From using basic commands to display logs, check the RabbitMQ queues, and verify Redis data exist. We have also seen how to follow message traces through their life cycle and how we can make use of the offered pytest fixtures to test our code even before spawning the agent.