Using HTTP Pull as a collector

Overview

HTTP Puller integrations are parametrized and organized by vendor, categorized by product/API.

Inside each endpoint, you will find a YAML configuration. This configuration is used in the Onum HTTP Puller action to start feeding that information into the platform. Check the articles under this section to learn more about configurations specific to each vendor.

Deconstructing a YAML

Here we will learn what each parameter of the YAML means, and how they correspond to the settings in the HTTP Pull Listener.

The YAML is used for pulling alerts via an API and typically uses:

  • A Temporal Window to enable the use of a time-based query window for filtering results.

  • Authentication using a token to authenticate the connection.

  • The first phase (Enumeration) enables an initial listing phase to get identifiers (e.g., alert IDs), paginating through the results.

  • The second phase (Collection) then fetches full alert details using the alert IDs from the enumeration phase.

  • Standard JSON response mapping is used to output the results.

Only the Collection phase is mandatory; the rest of the fields are optional.

Let's take a closer look at each phase below.


Temporal window

A temporal window is a defined time range used to filter or limit data retrieval in queries or API requests. It specifies the start and end time for the data you want to collect or analyze. This YAML uses a temporal window of 5 minutes, in RFC3339 format, with an offset of 0, in UTC timezone.

Parameter
Description

Duration*

Add the duration in milliseconds that the window will remain open for.

Offset*

How far back from the current time the window starts.

Time Zone*

This value is usually automatically set to your current time zone. If not, select it here.

Format*

Choose between Epoch or RFC3339 for the timestamp format.

Temporal Window example
withTemporalWindow: true
temporalWindow:
  duration: 5m
  offset: 0
  tz: UTC
  format: RFC3339

In Onum, toggle ON the Temporal Window selector and enter the information in the corresponding fields:

  • Duration* - 5m

  • Offset* - 0s

  • TZ* - this will set automatically according to your current timezone.

  • Format* - RFC3339
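To make the window arithmetic concrete, here is a minimal Python sketch of how the from and to values could be derived. It assumes that the window ends at the current time minus the offset and starts one duration earlier; the function name and this exact interpretation of the offset are assumptions, not Onum's documented behavior.

```python
from datetime import datetime, timedelta, timezone


def temporal_window(now, duration, offset):
    """Return (from, to) as RFC3339 strings in UTC.

    Assumption: the window ends `offset` before `now` and
    starts `duration` before that end.
    """
    window_end = now - offset
    window_start = window_end - duration
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC3339, UTC, second precision
    return window_start.strftime(fmt), window_end.strftime(fmt)


# A fixed "now" keeps the example reproducible.
now = datetime(2024, 12, 1, 12, 0, 0, tzinfo=timezone.utc)
window_from, window_to = temporal_window(now, timedelta(minutes=5), timedelta(0))
print(window_from, window_to)
```

With a duration of 5m and an offset of 0, the sketch yields a window covering the last five minutes before the fixed reference time.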


Authentication phase

If your connection requires authentication, enter the credentials here.

Parameter
Description

Authentication Type*

Choose the authentication type and enter the details.

Authentication credentials

The options provided will vary depending on the type chosen to authenticate your API. This must match the type configured on the API side so that the API can recognize the request.

Choose between the options below.

Basic
  • Username* - the user sending the request.

  • Password* - the password, e.g. ${secrets.password}

API Key

Enter the following:

  • API Key - API keys are usually stored in developer portals, cloud dashboards, or authentication settings. Set it as a secret, e.g. ${secrets.api_key}

  • Auth injection:

    • In* - Enter the incoming format of the API: Header or Query.

    • Name* - The header name or parameter name where the api key will be sent.

    • Prefix - Enter a prefix if required.

    • Suffix - Enter a suffix if required.
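The auth injection settings above can be pictured with a short Python sketch. It only illustrates the idea of placing a credential in a header or a query parameter with an optional prefix and suffix; the function name, the sample header name and the sample key are hypothetical.

```python
def inject_credential(value, where, name, prefix="", suffix=""):
    """Return (headers, query_params) with the credential placed as configured."""
    rendered = f"{prefix}{value}{suffix}"
    headers, query_params = {}, {}
    if where == "header":
        headers[name] = rendered
    elif where == "query":
        query_params[name] = rendered
    else:
        raise ValueError(f"unsupported injection target: {where!r}")
    return headers, query_params


# Hypothetical values: an API key sent in a custom header, then as a query parameter.
headers, _ = inject_credential("my-api-key", "header", "X-API-Key")
_, query_params = inject_credential("my-api-key", "query", "api_key")
print(headers, query_params)
```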

Token

Token Retrieve Based Authentication

  • Request -

    • Method* - Choose between GET or POST

    • URL* - Enter the URL to send the request to.

  • Headers - Add as many headers as required.

    • Name

    • Value

  • Query Params - Add as many query parameters as required.

    • Name

    • Value

  • Token Path* - Enter the path used to retrieve the authentication token from the response.

  • Auth injection:

    • In* - Enter the incoming format of the API: Header or Query.

    • Name* - The header name or parameter name where the token will be sent.

    • Prefix - Enter a connection prefix if required.

    • Suffix - Enter a connection suffix if required.

withAuthentication: true
authentication:
  type: token
  token:
    request:
      method: POST
      url: ${parameters.domain}/oauth2/token
      headers:
        - name: Content-Type
          value: application/x-www-form-urlencoded
      bodyType: urlEncoded
      bodyParams:
        - name: grant_type
          value: client_credentials
        - name: client_id
          value: '${secrets.client_id}'
        - name: client_secret
          value: '${secrets.client_secret}'
    tokenPath: ".access_token"
    authInjection:
      in: header
      name: Authorization
      prefix: 'Bearer '
      suffix: ''

Example

  • Type - Token. Token authentication is a method of authenticating API requests by using a secure token, usually passed in an HTTP header.

  • Request

    • method - POST - Sends a POST request to obtain an access token.

    • url - ${parameters.domain}/oauth2/token - The OAuth token endpoint. ${parameters.domain} is a placeholder for the value entered in the Parameters section.

    • headers - these headers are key-value pairs that provide additional information to the server when making a request.

      • name - Content-Type

      • value - application/x-www-form-urlencoded - Indicates that the request body is formatted as URL-encoded key-value pairs (standard for OAuth token requests).

    • Body type - urlEncoded - Specifies that the request body format is URL-encoded (like key=value&key2=value2).

      • Body params

        • name - grant_type - Required by OAuth 2.0 to specify the type of grant being requested.

        • value - client_credentials - Used for server-to-server authentication without a user.

        • name - client_id

        • value - ${secrets.client_id} - This is a dynamic variable pulled from the value entered in the Secrets setting.

        • name - client_secret

        • value - ${secrets.client_secret} - This is a dynamic variable pulled from the value entered in the Secrets setting.

    • Token path - Extracts the access token from the JSON response of an authentication request. It's a JSONPath-like expression used to locate the token in the response body.

Toggle ON the Authentication option.

  • Auth injection - This part defines how and where to inject the authentication token (typically an access token) into the requests after it has been retrieved, for example, from an OAuth token endpoint.

    • in - header - The token should be injected into the HTTP header of the request. This is the most common method for passing authentication tokens.

    • Name - Authorization - The name of the header that will contain the token. Most APIs expect this to be Authorization.

    • prefix - 'Bearer ' - The text added before the token value. Bearer is the standard prefix for OAuth 2.0 tokens.

    • suffix - '' - Text added after the token value. In this case, it's empty, so nothing is appended.
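The token flow described above can be sketched in Python without any network access: build the URL-encoded body, read the token from a parsed JSON response using the token path, then apply the prefix. The helper names and sample values are hypothetical, and the dotted-path lookup is a simplified stand-in for the JSONPath-like expression, not the platform's own evaluator.

```python
from urllib.parse import urlencode


def build_token_body(client_id, client_secret):
    """URL-encode the body params for a client_credentials token request."""
    return urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    })


def extract_token(response, token_path):
    """Follow a dotted path such as '.access_token' through a parsed JSON object."""
    value = response
    for key in token_path.split("."):
        if key:  # skip the empty segment produced by the leading dot
            value = value[key]
    return value


body = build_token_body("my-id", "s3cret")
# Hypothetical parsed response from the token endpoint.
token_response = {"access_token": "abc123", "expires_in": 3600}
token = extract_token(token_response, ".access_token")
auth_header = {"Authorization": "Bearer " + token + ""}  # prefix + token + suffix
print(body)
print(auth_header)
```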

HMAC

Signs the queries using a secret key that is used by the server to authenticate and validate integrity.

Token Retrieve Based Authentication

Request

  • Generate ID - Toggle ON to generate.

  • Generate Timestamp

    • Timezone* - this field is automatically filled using your current timezone.

    • Format* - the format for the timestamp syntax (Seconds, Epoch, Epoch Timestamp, RFC1123, RFC1123Z, RFC3339 or custom). Selecting custom opens the Go time format option, where you can write your custom syntax e.g. 2 Jan 2006 15:04:05

  • Generate content hash

    • Content hash

      • Hashing algorithm* - select the hash operation to carry out on the content.

      • Encoding* - choose the encoding method.

    • Hashing

      • Hashing algorithm* - select the hash operation to carry out on the content.

      • Encoding* - choose the encoding method.

      • Secret key* - the secret key used to sign the data, e.g. ${secrets.secretKey}

      • Data to sign* - how to generate the string that will be signed, e.g. "${request.method}\n${request.contentHash}\napplication/json\n${request.relativeUrl}\n${request.timestamp}"

  • Headers to be added to the request (name & value).

withAuthentication: true
authentication:
  type: hmac
  hmac:
    request:
      generateTimestamp: true
      timestamp:
        tz: UTC
        format: EpochMillis
    hash:
      secretKey: ${secrets.apiSecret}
      algorithm: hmac_sha256
      encoding: hex
      dataToSign: "${secrets.apiKey}${request.body}${request.timestamp}"
    headers:
      x-logtrust-apikey: ${secrets.apiKey}
      x-logtrust-timestamp: ${request.timestamp}
      x-logtrust-sign: ${hmac.hash}

Example: Authenticate HTTP requests to Microsoft Azure using the HMAC-SHA256 scheme.

Learn how to calculate the HMAC for this API here.

withAuthentication: true
authentication:
  type: hmac
  hmac:
    request:
      generateTimestamp: true
      timestamp:
        tz: UTC
        format: RFC1123
      generateContentHash: true
      contentHash:
        algorithm: sha256
        encoding: base64
    hash:
      algorithm: hmac_sha256
      encoding: base64
      secretKey: ${secrets.secretKey}
      dataToSign: "${request.method}\n${request.relativeUrl}\n${request.timestamp};${request.host};${request.contentHash}"
    headers:
      - name: x-ms-date
        value: ${request.timestamp}
      - name: x-ms-content-sha256
        value: ${request.contentHash}
      - name: Authorization
        value: "HMAC-SHA256 Credential=${secrets.accessKeyId}&SignedHeaders=x-ms-date;host;x-ms-content-sha256&Signature=${hmac.hash}"

  • Type - HMAC.

Request Parameters

  • Generate Timestamp

    • Timezone - UTC

    • Format - RFC1123

  • Generate Content Hash

    • Algorithm - sha256

    • Encoding - base64

Hash

Base64-encoded HMACSHA256 of the String-To-Sign.

  • Algorithm - hmac_sha256

  • Encoding - base64

  • Secret Key - ${secrets.secretKey} This variable is retrieved from the secrets parameter.

  • Data To Sign - A canonical representation of the request with the format HTTP_METHOD + '\n' + path_and_query + '\n' + signed_headers_values: ${request.method}\n${request.relativeUrl}\n${request.timestamp};${request.host};${request.contentHash}

Headers

  • Name - x-ms-date can be used when the agent cannot directly access the Date request header or when a proxy modifies it. If both x-ms-date and Date are provided, x-ms-date takes precedence.

  • Value - ${request.timestamp}

  • Name - x-ms-content-sha256 Base64-encoded SHA256 hash of the request body. It must be provided even if there is no body.

  • Value - ${request.contentHash}

  • Name - Authorization Required by the HMAC-SHA256 scheme.

  • Value - HMAC-SHA256 Credential=${secrets.accessKeyId}&SignedHeaders=x-ms-date;host;x-ms-content-sha256&Signature=${hmac.hash}
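For readers who want to see the signing steps end to end, here is a minimal Python sketch that follows the same string-to-sign layout using only the standard library. The host, path, credential and secret are hypothetical, and the secret is used as raw bytes; some services expect the secret to be Base64-decoded first, so check the target API's documentation before relying on this.

```python
import base64
import hashlib
import hmac


def content_hash(body: bytes) -> str:
    """Base64-encoded SHA256 hash of the request body (also for an empty body)."""
    return base64.b64encode(hashlib.sha256(body).digest()).decode("ascii")


def sign(secret_key: bytes, data_to_sign: str) -> str:
    """Base64-encoded HMAC-SHA256 of the string-to-sign."""
    digest = hmac.new(secret_key, data_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical request values.
method = "GET"
relative_url = "/kv?api-version=1.0"
host = "example.test"
timestamp = "Wed, 21 Oct 2015 07:28:00 GMT"  # RFC1123

body_hash = content_hash(b"")
data_to_sign = f"{method}\n{relative_url}\n{timestamp};{host};{body_hash}"
signature = sign(b"hypothetical-secret", data_to_sign)
authorization = (
    "HMAC-SHA256 Credential=hypothetical-id"
    "&SignedHeaders=x-ms-date;host;x-ms-content-sha256"
    f"&Signature={signature}"
)
print(body_hash)
print(authorization)
```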

Example 2: API HMAC Authentication for Oracle

See here for how to calculate the API HMAC in Oracle.

  • Type - HMAC.

Request Parameters

  • Generate ID

    • Type - uuid

  • Generate Timestamp

    • Timezone - UTC

    • Format - Epoch

  • Generate Content Hash

    • Algorithm - sha1

    • Encoding - base64 - The binary hash result will be encoded in Base64 for transmission.

Hash

Base64-encoded HMACSHA256 of the String-To-Sign.

  • Algorithm - hmac_sha256

  • Encoding - base64

  • Secret Key - ${secrets.secretKey} This variable is retrieved from the secrets parameter.

  • Data To Sign - ${request.method}\n${request.contentHash}\napplication/json${request.timestamp}\n${request.relativeUrl} - This is the canonical string-to-sign, made up of:

    • ${request.method} - HTTP method (e.g., GET, POST)

    • ${request.contentHash} - Base64 SHA-1 hash of the request body

    • "application/json" - Hardcoded content type

    • ${request.timestamp} - Epoch UTC timestamp

    • ${request.relativeUrl} - The relative path and query string

      The \n means each element is separated by a newline.

Headers

  • Name - x-ct-authorization

  • Value - CTApiV2Auth ${parameters.publicKey}:${hmac.hash}

    • CTApiV2Auth - Authentication scheme name.

    • ${parameters.publicKey} - Public key or access ID.

    • ${hmac.hash} - The generated HMAC-SHA256 signature from the hash section.

  • Name - x-ct-timestamp

  • Value - ${request.timestamp} - the same Epoch UTC timestamp generated earlier.

withAuthentication: true
authentication:
  type: hmac
  hmac:
    request:
      generateId: true
      idType: uuid
      generateTimestamp: true
      timestamp:
        tz: UTC
        format: Epoch
      generateContentHash: true
      contentHash:
        algorithm: sha1
        encoding: base64
    hash:
      algorithm: hmac_sha256
      encoding: base64
      secretKey: ${secrets.secretKey}
      dataToSign: "${request.method}\n${request.contentHash}\napplication/json${request.timestamp}\n${request.relativeUrl}"
    headers:
      - name: x-ct-authorization
        value: CTApiV2Auth ${parameters.publicKey}:${hmac.hash}
      - name: x-ct-timestamp
        value: ${request.timestamp}

Retry

Toggle ON to allow for retries and to configure the specifics.

Parameter
Description

Retry Type*

  • Fixed - Retries the failed operation after a constant, fixed interval, i.e. the same amount of time between each retry attempt.

    • Interval* - enter the amount of time to wait e.g. 5s.

  • Exponential - Retries the failed operation after increasingly longer intervals to avoid overwhelming the service. The delay grows with each retry attempt.

    • Initial delay* - The delay before the first retry attempt, which avoids retrying immediately after a failure. For example, with an increasing factor of 2, an initial delay of 2s gives a retry pattern of 2s, 4s, 8s, 16s, etc.

    • Maximum delay* - The maximum wait time allowed between retries, which prevents the retry delay from growing indefinitely. For example, with an increasing factor of 2, an initial delay of 2s and a maximum delay of 10s give a delay progression of 2s, 4s, 8s, 10s, 10s, etc.

    • Increasing factor* - The multiplier used to calculate the next delay interval, determining how quickly the delay grows after each failed attempt.
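The exponential progression can be reproduced with a few lines of Python. This is only an illustration of the arithmetic, assuming the delay is multiplied by the increasing factor after each attempt and capped at the maximum delay; the function name is hypothetical.

```python
def backoff_delays(initial, factor, maximum, attempts):
    """Return the wait time (in seconds) before each retry attempt."""
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, maximum))  # never wait longer than the maximum
        delay *= factor                     # grow the delay for the next attempt
    return delays


print(backoff_delays(initial=2, factor=2, maximum=10, attempts=5))  # [2, 4, 8, 10, 10]
```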

Retry after response header

Used to define how long to wait before making another request when the server responds with, for example, HTTP 429 Too Many Requests or HTTP 503 Service Unavailable.

  • Header - The name of the response header that carries the wait time, e.g. Retry-After.

  • Format - The format of the header value (Seconds, Epoch, Epoch Timestamp, RFC1123, RFC1123Z, RFC3339).

    • e.g. wait 120 seconds: Retry-After: 120

    • e.g. RFC1123 date: Retry-After: Wed, 21 Oct 2015 07:28:00 GMT
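As an illustration, the following Python sketch turns a Retry-After value into a number of seconds to wait. It covers only two of the listed formats (Seconds and RFC1123), and the function name is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now):
    """Convert a Retry-After value (seconds or RFC1123 date) into seconds to wait."""
    value = header_value.strip()
    if value.isdigit():
        return float(value)  # Seconds format, e.g. "120"
    retry_at = parsedate_to_datetime(value)  # RFC1123 date format
    if retry_at.tzinfo is None:
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    return max(0.0, (retry_at - now).total_seconds())  # never negative


now = datetime(2015, 10, 21, 7, 26, 0, tzinfo=timezone.utc)
print(retry_after_seconds("120", now))
print(retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now))
```

Both calls describe the same two-minute wait, once as a plain number of seconds and once as an absolute date.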


Throttling

Use throttling to intentionally limit the rate at which the HTTP requests are sent to the API or service.

Throttling Type*

The client itself controls and limits the rate at which it sends requests.

Parameter
Description

Client type*

How to manage the rate of requests.

  • Rate - the client is restricted by the data transfer rate or request rate over time.

    • Maximum requests* - The maximum number of requests (or amount of data) to make within a specified time interval.

    • Call interval* - The sliding or fixed window of time used to calculate the rate.

    • Number of burst requests* - The number of requests that can temporarily exceed the normal rate before throttling kicks in. This allows short bursts of traffic to accommodate sudden spikes without immediate blocking, e.g. if the max rate is 10 requests/sec and burst is 5, the client could make up to 15 requests instantly, after which throttling slows it down.

  • Fixed delay - The client waits a fixed time after each request before making the next one. Instead of limiting by rate (requests per second) or volume, it just inserts a pause/delay between requests.

    • Call interval* - The sliding or fixed window of time used to calculate the delay.
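One common way to implement rate-based throttling with a burst allowance is a token bucket. The Python sketch below is an assumption about how such a limiter could behave, chosen so that it reproduces the 10 requests/sec plus burst of 5 example above; it is not a description of the platform's internal algorithm. Time is passed in explicitly so the behavior is deterministic.

```python
class RateLimiter:
    """Token bucket: refills at max_requests per interval, holds max_requests + burst."""

    def __init__(self, max_requests, interval_seconds, burst):
        self.rate = max_requests / interval_seconds  # tokens added per second
        self.capacity = max_requests + burst
        self.tokens = float(self.capacity)
        self.last = 0.0

    def allow(self, now):
        """Return True if a request may be sent at time `now` (in seconds)."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


limiter = RateLimiter(max_requests=10, interval_seconds=1.0, burst=5)
sent_instantly = sum(limiter.allow(0.0) for _ in range(20))
print(sent_instantly)  # 15
```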


Enumeration phase

The enumeration phase is an optional step in data collection or API integration workflows, where the system first retrieves a list of available items (IDs, resource names, keys, etc.) before fetching detailed data about each one.

Identify the available endpoints, methods, parameters, and resources exposed by the API. This performs initial data discovery to feed the collection phase and makes the results available to the Collection Phase via variable interpolation (inputs.*).

The enumeration phase can use the following variables:

  • ${parameters.xxx}

  • ${secrets.xxx}

  • ${temporalWindow.xxx} (if configured)

  • ${pagination.xxx} Pagination variables

Parameter
Description

Pagination Type*

Select one from the drop-down. Pagination type is the method used to split and deliver large datasets in smaller, manageable parts (pages), and how those pages can be navigated during discovery.

Each pagination method manages its own state and exposes specific variables that can be interpolated in request definitions (e.g., URL, headers, query params, body).

None

  • Description: No pagination; only a single request is issued.

  • Exposed Variables: None

PageNumber/PageSize

  • Description: Pages are indexed using a page number and fixed size.

  • Configuration:

    • pageSize: page size

  • Exposed Variables:

    • ${pagination.pageNumber}

    • ${pagination.pageSize}

Offset/Limit

  • Description: Uses offset and limit to fetch pages of data.

  • Configuration:

    • Limit: max quantity of records per request

  • Exposed Variables:

    • ${pagination.offset}

    • ${pagination.limit}

From/To

  • Description: Performs pagination by increasing a window using from and to values.

  • Configuration:

    • limit: max quantity of records per request

  • Exposed Variables:

    • ${pagination.from}

    • ${pagination.to}

Web Linking (RFC 5988)

  • Description: Parses the Link header to find the rel="next" URL.

  • Exposed Variables: None

Next Link at Response Header

  • Description: Follows a link found in a response header.

  • Configuration:

    • headerName: header name that contains the next link

  • Exposed Variables: None

Next Link at Response Body

  • Description: Follows a link found in the response body.

  • Configuration:

    • nextLinkSelector: path to next link sent in response payload

  • Exposed Variables: None

Cursor

  • Description: Extracts a cursor value from each response to request the next page.

  • Configuration:

    • cursorSelector: path to the cursor sent in response payload

  • Exposed Variables:

    • ${pagination.cursor}

Output

Parameter
Description

Select*

A JSON selector expression to pick a part of the response e.g. '.data'.

Filter

A JSON expression to filter the selected elements. Example: '.films | index("Tangled")'.

Map

A JSON expression to transform each selected element into a new event. Example: '{characterName: .name}'.

Output Mode*

Choose between

  • Element: emits each transformed element individually as an event.

  • Collection: emits all transformed items as a single array/collection as an event.
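The select, filter, map and output mode steps can be pictured as a small pipeline. The Python sketch below mimics them with plain functions instead of JSON expressions; the sample response and the function name are hypothetical, and the lambdas stand in for '.data', '.films | index("Tangled")' and '{characterName: .name}'.

```python
def run_output(response, select, keep, transform, output_mode):
    """Apply select -> filter -> map, then emit events according to the output mode."""
    selected = select(response)
    transformed = [transform(item) for item in selected if keep(item)]
    if output_mode == "element":
        return transformed    # one event per transformed element
    if output_mode == "collection":
        return [transformed]  # a single event holding the whole array
    raise ValueError(f"unknown output mode: {output_mode!r}")


# Hypothetical API response.
response = {"data": [
    {"name": "Rapunzel", "films": ["Tangled"]},
    {"name": "Mulan", "films": ["Mulan"]},
]}
select = lambda r: r["data"]
keep = lambda item: "Tangled" in item["films"]
transform = lambda item: {"characterName": item["name"]}

print(run_output(response, select, keep, transform, "element"))
print(run_output(response, select, keep, transform, "collection"))
```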

Enumeration example
enumerationPhase:
  paginationType: offsetLimit
  limit: 100
  request:
    responseType: json
    method: GET
    url: ${parameters.domain}/alerts/queries/alerts/v2
    queryParams:
      - name: offset
        value: ${pagination.offset}
      - name: limit
        value: ${pagination.limit}
      - name: filter
        value: created_timestamp:>'${temporalWindow.from}'+created_timestamp:<'${temporalWindow.to}'
  output:
    select: ".resources"
    map: "."
    outputMode: collection

  • Pagination type - offset/Limit - Uses classic pagination with offset and limit to page through results, fetching data in batches (pages). limit determines page size, offset determines where to start.

  • Limit - Retrieves up to 100 records per request. This value is used in the limit query parameter to control batch size.

  • Request - Describes the API request that will be sent during enumeration.

    • Response type - Specifies the expected response format. Here, the system expects a JSON response.

    • Method - The HTTP method to use for this request. GET is used to retrieve data from the server.

    • URL - ${parameters.domain} is a placeholder variable that will be replaced by the domain value you entered in the Parameters section.

Query params - These are query string parameters appended to the URL.

  • ${pagination.offset} controls where to start in the dataset. Used for pagination.

  • ${pagination.limit} is replaced with the limit value you entered for the number of records to retrieve per request (100).

  • filter returns only alerts created within a specific time window. ${temporalWindow.from} and ${temporalWindow.to} are dynamically filled in with RFC3339 or epoch timestamps, depending on what you have configured.

output - Describes how to extract and interpret the results from the JSON response.

  • select - .resources - Looks for a field named resources in the response JSON. This is where the array of items lives.

  • map - . - Each item under .resources is returned as-is. No transformation or remapping.

  • outputMode - collection - The result is treated as a collection (array) of individual items. Used when you expect multiple items and want to pass them along for further processing.
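As a rough illustration of offset/limit pagination, the Python sketch below pages through an in-memory dataset. The fetch function is a hypothetical stand-in for the HTTP request, and stopping when a page comes back shorter than the limit is an assumption about the stop condition, not documented behavior.

```python
def paginate_offset_limit(fetch_page, limit):
    """Collect all items by repeatedly requesting `limit` records from a growing offset."""
    offset = 0
    items = []
    while True:
        page = fetch_page(offset, limit)
        items.extend(page)
        if len(page) < limit:  # assumption: a short page means there is no more data
            break
        offset += limit
    return items


# Hypothetical dataset of 250 alert IDs served in pages.
DATA = [f"alert-{i}" for i in range(250)]
requested_offsets = []


def fake_fetch(offset, limit):
    requested_offsets.append(offset)
    return DATA[offset:offset + limit]


alert_ids = paginate_offset_limit(fake_fetch, limit=100)
print(len(alert_ids), requested_offsets)  # 250 [0, 100, 200]
```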


Collection phase

The collection phase in an HTTP Puller is the part of the process where the system actively pulls or retrieves data from an external API using HTTP requests.

The collection phase is mandatory. This is where the final data retrieval happens (either directly or using IDs/resources generated by an enumeration phase).

The collection phase involves gathering actual data from an API after the enumeration phase has mapped out endpoints, parameters, and authentication methods. It supports dynamic variable resolution via the variable resolver and can use data exported from the Enumeration Phase, such as:

  • ${parameters.xxx}

  • ${secrets.xxx}

  • ${temporalWindow.xxx}

  • ${inputs.xxx} (from Enumeration Phase)

  • ${pagination.xxx}*

Inputs

In collection phases, you can define variables to be used elsewhere in the configuration (for example, in URLs, query parameters, or request bodies). Each variable definition has the following fields:

Parameter
Description

Name

The variable name (used later as ${inputs.name} in the configuration).

Source

Usually "input", indicating the value comes from the enumeration phase’s output.

Expression

A JSON expression applied to the input to extract or transform the needed value.

Format

Controls how the variable is converted to a string (see Variable Formatting below), e.g. json.
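To show how placeholders such as ${inputs.name} end up in a request, here is a minimal Python sketch of variable interpolation. The regular expression, function name and sample values are hypothetical; the platform's own resolver may support more syntax than this.

```python
import re

# Matches placeholders of the form ${scope.name}
PLACEHOLDER = re.compile(r"\$\{(\w+)\.(\w+)\}")


def interpolate(template, scopes):
    """Replace every ${scope.name} with the matching value from `scopes`."""
    def replace(match):
        scope, name = match.group(1), match.group(2)
        return str(scopes[scope][name])  # raises KeyError for unknown variables
    return PLACEHOLDER.sub(replace, template)


scopes = {
    "inputs": {"dataRegionURL": "https://api.example.test", "tenantId": "t-1"},
    "temporalWindow": {"from": "2024-12-01T11:55:00Z"},
}
url = interpolate("${inputs.dataRegionURL}/siem/v1/events?from_date=${temporalWindow.from}", scopes)
print(url)
```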

Retry

Toggle ON to allow for retries and to configure the specifics.

Parameter
Description

Retry Type*

  • Fixed - Retries the failed operation after a constant, fixed interval, i.e. the same amount of time between each retry attempt.

    • Interval* - enter the amount of time to wait e.g. 5s.

  • Exponential - Retries the failed operation after increasingly longer intervals to avoid overwhelming the service. The delay grows with each retry attempt.

    • Initial delay* - The delay before the first retry attempt, which avoids retrying immediately after a failure. For example, with an increasing factor of 2, an initial delay of 2s gives a retry pattern of 2s, 4s, 8s, 16s, etc.

    • Maximum delay* - The maximum wait time allowed between retries, which prevents the retry delay from growing indefinitely. For example, with an increasing factor of 2, an initial delay of 2s and a maximum delay of 10s give a delay progression of 2s, 4s, 8s, 10s, 10s, etc.

    • Increasing factor* - The multiplier used to calculate the next delay interval, determining how quickly the delay grows after each failed attempt.

Retry after response header

Used to define how long to wait before making another request when the server responds with, for example, HTTP 429 Too Many Requests or HTTP 503 Service Unavailable.

  • Header - The name of the response header that carries the wait time, e.g. Retry-After.

  • Format - The format of the header value (Seconds, Epoch, Epoch Timestamp, RFC1123, RFC1123Z, RFC3339).

    • e.g. wait 120 seconds: Retry-After: 120

    • e.g. RFC1123 date: Retry-After: Wed, 21 Oct 2015 07:28:00 GMT

Throttling

Use throttling to intentionally limit the rate at which the HTTP requests are sent to the API or service.

Throttling Type*

The client itself controls and limits the rate at which it sends requests.

Parameter
Description

Client type*

How to manage the rate of requests.

  • Rate - the client is restricted by the data transfer rate or request rate over time.

    • Maximum requests* - The maximum number of requests (or amount of data) to make within a specified time interval.

    • Call interval* - The sliding or fixed window of time used to calculate the rate.

    • Number of burst requests* - The number of requests that can temporarily exceed the normal rate before throttling kicks in. This allows short bursts of traffic to accommodate sudden spikes without immediate blocking, e.g. if the max rate is 10 requests/sec and burst is 5, the client could make up to 15 requests instantly, after which throttling slows it down.

  • Fixed delay - The client waits a fixed time after each request before making the next one. Instead of limiting by rate (requests per second) or volume, it just inserts a pause/delay between requests.

    • Call interval* - The sliding or fixed window of time used to calculate the delay.

Parameter
Description

Pagination Type*

Choose how the API organizes and delivers large sets of data across multiple pages—and how that affects the process of systematically collecting or extracting all available records.

Output

Parameter
Description

Select*

A JSON selector expression to pick a part of the response e.g. '.data'.

Filter

A JSON expression to filter the selected elements. Example: '.films | index("Tangled")'.

Map

A JSON expression to transform each selected element into a new event. Example: '{characterName: .name}'.

Output Mode*

Choose between

  • Element: emits each transformed element individually as an event.

  • Collection: emits all transformed items as a single array/collection as an event.

Collection example

Let's say you have the following SIEM Integration events from Sophos.

collectionPhase:
  paginationType: cursor
  cursorSelector: ".next_cursor"
  initialRequest:
    method: GET
    url: "${inputs.dataRegionURL}/siem/v1/events"
    headers:
      - name: Accept
        value: application/json
      - name: Accept-Encoding
        value: gzip, deflate
      - name: X-Tenant-ID
        value: "${inputs.tenantId}"
    queryParams:
      - name: from_date
        value: "${temporalWindow.from}"
    bodyParams: []
  nextRequest:
    method: GET
    url: "${inputs.dataRegionURL}/siem/v1/events"
    headers:
      - name: Accept
        value: application/json
    queryParams:
      - name: cursor
        value: "${pagination.cursor}"
    bodyParams: []
  output:
    select: ".result"
    filter: "."
    map: "."
    outputMode: element

  • Pagination type - cursor. If you select the cursor type, you retrieve the data in chunks (pages) using a cursor token, which points to the position in the dataset where the next page of results should start.

    • Cursor selector - The cursor selector tells the HTTP Puller where to find the cursor value in the API response so it can be saved and used in the next request e.g. .next_cursor

  • Initial request - Fetches the first set of results. The response includes the cursor token (e.g. a timestamp or ID).

    • method - GET to fetch the results.

    • url - The URL is composed of various elements:

      • ${inputs.dataRegionURL} - this variable is taken from the output of the enumeration phase, as defined in the Inputs section.

      • /siem/v1/ - API base path, indicating that you're calling version 1 of the SIEM API.

      • events - the specific endpoint being accessed (event-related data).

  • headers - these headers are key-value pairs that provide additional information to the server when making a request.

    • name - Accept

    • value - application/json tells the server that the client expects the response to be in JSON format, a standard HTTP header used for content negotiation.

  • Next request - send the cursor token back to the server using a parameter (e.g., ?cursor=abc123) to get the next page of results. The server returns the next chunk of data and a new cursor.

    Repeat until there is no more data or the server returns a has_more: false flag.

  • Output

    • select - .result - Selects the part of the response to extract. This is a JSONPath-like expression that tells the puller where to find the list or array of items in the response.

    • map - . - Maps each selected item as-is, keeping each object unchanged. If you needed to restructure or extract specific fields from each item, you would replace . with a field mapping (e.g., .id or { "id": .id, "name": .username }).

    • output mode - element - Controls the output format. Each item from the select result is emitted individually. This is useful for event stream processing, where each object (e.g., an alert or event) is treated as a separate record. The alternative, collection, emits all items together as a single array.
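The cursor loop described above can be sketched in Python as follows. The fetch function is a hypothetical stand-in for the initial and next requests; the result and next_cursor field names follow the example configuration, while has_more is only the illustrative flag mentioned in the text.

```python
def paginate_cursor(fetch_page):
    """Collect events page by page, passing the cursor from each response to the next call."""
    events = []
    cursor = None  # no cursor on the initial request
    while True:
        response = fetch_page(cursor)
        events.extend(response["result"])
        cursor = response.get("next_cursor")
        if not cursor or not response.get("has_more", True):
            break
    return events


# Hypothetical server responses, keyed by the cursor that was sent.
PAGES = {
    None: {"result": ["event-1", "event-2"], "next_cursor": "abc123", "has_more": True},
    "abc123": {"result": ["event-3"], "next_cursor": "def456", "has_more": False},
}

all_events = paginate_cursor(lambda cursor: PAGES[cursor])
print(all_events)  # ['event-1', 'event-2', 'event-3']
```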

Examples

1. Basic GET Puller

Here's a simple example of using the HTTP Puller collector with parameters for a basic GET request. No authentication, no pagination, just pulling JSON data from an API endpoint. Keep Config as YAML, Temporal window, Authentication and Enumeration phase as OFF.

  • Collection phase

    • Pagination type - none - Indicates that you only need one request to retrieve all data at once.

    • Request

      • Response type - json - Tells the puller to expect a JSON response.

      • Method: GET - Performs a basic HTTP GET request.

      • URL: Constructed from parameters.domain and parameters.path: https://{{parameters.domain}}{{parameters.path}}

    • Headers: Set standard headers and include the API key.

    • Output:

      • Select: .logs - Tells the system where to find the list of log entries in the response.

      • Output mode: element - Each object inside .logs will be extracted as a separate output element, e.g.

        {
          "logs": [
            { "timestamp": "2024-12-01T12:00:00Z", "event": "user_login" },
            { "timestamp": "2024-12-01T12:05:00Z", "event": "file_upload" }
          ]
        }

2. Make an HTTP request using offset and limit pagination

Instead of displaying the results in a scrollable list, we will use offset/limit pagination to fetch data in pages.

  • Pagination type - offset/Limit - We control how many records are returned at a time (limit) and choose where to start each request (offset or skip parameter).

  • Zero Index - false

  • Limit* - 50

  • Request - The request to be repeated, with offset and limit automatically incremented per iteration.

  • Response type* - Json

  • Method* - GET

  • URL* - https://example.com/items

  • Query params The API supports pagination through query parameters:

    • Name - skip

    • Value - ${pagination.offset} - the number of records to skip before returning results

    • Name - limit

    • Value - ${pagination.limit} uses the limit entered (50) as the maximum number of records to return in one request.

httpRequest:
  type: "offsetLimit"
  offsetLimit:
    limit: 50
    isZeroIndex: false
    request:
      method: "GET"
      url: "https://example.com/items"
      queryParams:
        skip: "${pagination.offset}"
        limit: "${pagination.limit}"

This example defines a data extraction workflow that:

  1. Enumerates through a paginated API endpoint using responseBodyLink.

  2. Filters and transforms specific data from the paginated results.

  3. Collects further data based on the enumerated output using individual requests.

It also uses a temporal window to scope or schedule the data extraction process.

# Temporal window (optional)
# Generated variables: $temporalWindow.from, $temporalWindow.to
temporalWindow:
  duration: 5m
  offset: 10m
  tz: UTC
  format: RFC3339
enumeration:
  type: "httpRequest"
  httpRequest:
    type: "responseBodyLink"
    responseBodyLink:
      nextLinkSelector: ".info.nextPage"
      request:
        method: "GET"
        url: "https://api.cyberintel.dev/iocs"
        headers:
          Accept: "application/json"
      stopCondition:
        type: bodyExpression
        bodyExpression:
          expression: "(.data | length) == 50"
  output:
    select: '.data'
    filter: '.threatType == "Ransomware"'
    map: '._id'
    outputMode: "element"
collection:
  variables:
    - name: id
      source: input
      expression: "."
  type: "httpRequest"
  httpRequest:
    type: "none"
    none:
      request:
        method: "GET"
        url: "https://api.cyberintel.dev/iocs/${id}"
        headers:
          Accept: "application/json"
  output:
    select: ".data"
    filter: ""
    map: "{iocName: .name}"
    outputMode: "element"
    callback: "saveToFile"

Enumeration

The enumeration defines how to gather data in a paginated manner from the Cyber Threat Intelligence API using the responseBodyLink pagination strategy.

  • Pagination Type - Next Link At Response Body (responseBodyLink in the YAML).

  • Selector - The next page link is found using the selector ".info.nextPage". This means the response is expected to contain a field info.nextPage with the URL of the next page of results.

For example, the response might look like:

{
  "info": {
    "nextPage": "https://api.cyberintel.dev/iocs?page=2"
  },
  "data": [ ... ]
}

  • Response type - JSON

  • Method - GET, to fetch the data.

  • URL - The initial URL for the request is "https://api.cyberintel.dev/iocs", where the IOCs are listed.

  • Headers - The Accept header specifies that the response should be in JSON format.

Stop Condition

  • StopCondition - Defines when to stop pagination.

  • Type - bodyExpression: the stop condition is evaluated against the response body.

  • BodyExpression - Pagination stops once the expression (.data | length) == 50 is true, that is, when the .data array of a response contains exactly 50 items. This suggests that a full page holds 50 IOCs. Bear in mind that paginated APIs usually mark the end of the dataset with a page of fewer than 50 items, so check the expression against the behavior of your API when adapting this example.

Output

  • Select - The .data array from the response is selected for further processing. This array contains the actual IOC data.

  • Filter - The filter expression '.threatType == "Ransomware"' selects only those IOCs where the threatType is "Ransomware". This is how we focus on ransomware-related indicators.

  • Map - The map expression '._id' extracts the ._id field from each IOC that passed the filter. This results in a list of IOC IDs that match the ransomware threat type.

  • Output Mode - element indicates that each IOC ID (element) is treated as an individual item, rather than as a group or array.

Result: After processing the pages, we will have a list of ransomware IOC IDs.
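The enumeration logic can be sketched in Python as a loop that follows the next-page link and applies select, filter and map to every page. This is a simplified illustration, not the platform's implementation: the page contents are invented, requests are replaced by an in-memory lookup, and the loop simply ends when no next link is present instead of evaluating the stopCondition.

```python
# Invented pages keyed by URL, standing in for the API responses.
PAGES = {
    "https://api.cyberintel.dev/iocs": {
        "info": {"nextPage": "https://api.cyberintel.dev/iocs?page=2"},
        "data": [
            {"_id": "a1b2", "threatType": "Ransomware"},
            {"_id": "c3d4", "threatType": "Phishing"},
        ],
    },
    "https://api.cyberintel.dev/iocs?page=2": {
        "info": {"nextPage": None},
        "data": [{"_id": "e5f6", "threatType": "Ransomware"}],
    },
}


def enumerate_ids(start_url, fetch):
    """Follow .info.nextPage links, keeping the _id of each ransomware IOC."""
    ids, url = [], start_url
    while url:
        body = fetch(url)
        for ioc in body["data"]:                    # select: .data
            if ioc["threatType"] == "Ransomware":   # filter
                ids.append(ioc["_id"])              # map: ._id
        url = body["info"].get("nextPage")          # nextLinkSelector
    return ids


ransomware_ids = enumerate_ids("https://api.cyberintel.dev/iocs", PAGES.__getitem__)
print(ransomware_ids)
```

Because the output mode is element, each ID in the resulting list would be handed to the collection phase individually.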

Collection

Once the enumeration process gathers a list of IOC IDs related to ransomware, the collection section is responsible for retrieving more detailed information for each of those IOCs.

Variables - This section defines the variables used in the collection step.

  • Name - id: The variable id represents each individual IOC ID from the enumeration output.

  • Source - The source: input means that the IDs come from the output of the previous enumeration step.

  • Expression - expression: "." simply takes each item from the input (the IOC IDs).

HTTP Request for Detailed IOC Information

  • Pagination type: The type is "none", meaning that no pagination is applied; a single request is made for each IOC ID.

  • Response type: JSON.

  • Method: The HTTP method is GET, to fetch detailed information about each IOC.

  • URL: The URL for each IOC is dynamic, with the IOC ID substituted in the URL (${id}). For example, if id = "a1b2", the URL would be https://api.cyberintel.dev/iocs/a1b2.

  • Headers: The Accept: "application/json" header ensures the response is in JSON format.

Output Selection, Mapping, and Callback

  • Select: This selects the .data field from the response, which contains the detailed information for the IOC.

  • Filter: No additional filtering is applied.

  • Map: The map expression "{iocName: .name}" creates a new object with the iocName key, mapping it to the .name of the IOC from the response.

  • Output Mode: outputMode: "element" means each IOC’s name will be treated as an individual output item.

  • Callback: The callback: "saveToFile" triggers an external action to save the result to a file.

Result: Each IOC name (or other information, if mapped) will be saved to a file.
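As a rough Python illustration of this collection step, the sketch below builds one URL per IOC ID and reduces each response to the iocName object. The detail records, the IOC names and the helper names are invented for the example, and the lookup replaces the real HTTP request; the saveToFile callback is not modeled.

```python
# Invented detail responses, keyed by IOC ID.
DETAILS = {
    "a1b2": {"data": {"name": "example-ioc-1", "threatType": "Ransomware"}},
    "e5f6": {"data": {"name": "example-ioc-2", "threatType": "Ransomware"}},
}


def build_url(ioc_id):
    """Substitute the id variable into the URL, as ${id} does in the YAML."""
    return f"https://api.cyberintel.dev/iocs/{ioc_id}"


def fake_fetch(url):
    """Stand-in for the GET request: look the ID up in DETAILS."""
    return DETAILS[url.rsplit("/", 1)[-1]]


def collect(ioc_ids, fetch):
    """Make one request per ID and keep only the mapped object."""
    results = []
    for ioc_id in ioc_ids:
        body = fetch(build_url(ioc_id))
        detail = body["data"]                        # select: .data
        results.append({"iocName": detail["name"]})  # map: {iocName: .name}
    return results


collected = collect(["a1b2", "e5f6"], fake_fetch)
print(collected)
```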

4. Single Collection Phase with cursor

This YAML defines a cursor-based pagination method to retrieve movie data from a GraphQL API, applying a temporal window filter and fetching movies with an id greater than 10.

collection:
  type: "httpRequest"
  httpRequest:
    type: "cursor"
    cursor:
      cursorSelector: ".data.moviesConnection.pageInfo.endCursor"
      initialRequest:
        method: POST
        url: "https://us-east-1-shared-usea1-02.cdn.hygraph.com/content/clpvcopq3aavs01usft1idkgj/master"
        headers:
          Accept: "application/json"
        queryParams:
          from: "${temporalWindow.from}"
          to: "${temporalWindow.to}"
        bodyType: "raw"
        bodyRaw: |
          {
            "query": "query ExampleQuery { moviesConnection(first: 2) { edges { node { id } } pageInfo { hasNextPage startCursor endCursor } } }",
            "operationName": "ExampleQuery"
          }
      nextRequest:
        method: POST
        url: "https://us-east-1-shared-usea1-02.cdn.hygraph.com/content/clpvcopq3aavs01usft1idkgj/master"
        headers:
          Accept: "application/json"
        bodyType: "raw"
        bodyRaw: |
          {
            "query": "query ExampleQuery { moviesConnection(first: 2, after: ${pagination.cursor}) { edges { node { id } } pageInfo { hasNextPage startCursor endCursor } } }",
            "operationName": "ExampleQuery"
          }
  output:
    select: "."
    filter: ".id > 10"
    map: "{id: .id, title: .title, status: .status}"
    outputMode: "element"
    callback: "saveToFile"

  • Pagination Type - cursor. Indicates that the request uses cursor-based pagination to navigate through the data. Cursor pagination handles large datasets efficiently by retrieving data in chunks (pages); each request carries a cursor that identifies the next chunk to fetch.

  • Selector - .data.moviesConnection.pageInfo.endCursor. The selector extracts the cursor value from this path in the response body. The cursor is then used to request the next page of data.

Initial Request Configuration:

The initial request fetches the first page of data.

  • Method - POST. This is the usual method for GraphQL queries, where the body of the request contains the query and variables.

  • URL - https://us-east-1-shared-usea1-02.cdn.hygraph.com/content/clpvcopq3aavs01usft1idkgj/master. This is the endpoint of the GraphQL server, where the query will be sent.

  • Headers - Accept: "application/json". Tells the server to send back the response in JSON format.

  • Query Params - from: "${temporalWindow.from}" to: "${temporalWindow.to}"

    from and to are dynamically populated with the values defined in the temporalWindow section.

    These parameters are used to restrict the query to a specific time range, e.g. data collected from a certain period.

  • Body Type - raw. The body contains a custom format (e.g., a GraphQL query) and is passed as-is in the HTTP request.

  • Body Content - The body of the POST request, specified as a GraphQL query.

{
  "query": "query ExampleQuery { moviesConnection(first: 2) { edges { node { id } } pageInfo { hasNextPage startCursor endCursor } } }",
  "operationName": "ExampleQuery"
}

  • moviesConnection(first: 2) fetches the first 2 movies from the API.

  • edges contains the actual movie data.

  • pageInfo contains the pagination details: hasNextPage indicates whether there are more pages, and startCursor and endCursor are the cursors used to navigate.

Next Request:

  • Method - POST

  • URL - https://us-east-1-shared-usea1-02.cdn.hygraph.com/content/clpvcopq3aavs01usft1idkgj/master. The URL for the next request is the same as for the initial request, but the body now includes a cursor to fetch the next page.

  • Body Type - raw

  • Body Content -

{
  "query": "query ExampleQuery { moviesConnection(first: 2, after: ${pagination.cursor}) { edges { node { id } } pageInfo { hasNextPage startCursor endCursor } } }",
  "operationName": "ExampleQuery"
}

  • after: ${pagination.cursor} is replaced with the cursor value extracted from the previous response, so the request fetches the next page.

  • first: 2 fetches the next 2 movies.

  • hasNextPage indicates whether there are more pages to retrieve.

  • startCursor and endCursor provide the cursors for pagination.

Output Configuration

Select - . Specifies that the entire response body should be used as the output. The . symbol represents the entire data object in the response, meaning all returned data will be available for further processing.

Filter - .id > 10. Filters the output data so that only movies with an id greater than 10 are included in the final output.

Map - {id: .id, title: .title, status: .status}. Maps the data to a new structure, extracting only the id, title, and status fields.

  • This transforms the data into a simpler format, which is useful for later stages (e.g., saving or processing).

  • For each movie, it extracts just the id, title, and status.

Output Mode - element means that the data will be treated as individual elements (e.g., each movie is processed individually rather than as a collection).
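The cursor loop described in this example can be sketched in Python as follows. It is a simplified model under several assumptions: the pages and cursor values are invented, the GraphQL requests are replaced by an in-memory lookup keyed by cursor, the loop stops when hasNextPage is false (the platform's actual stop rule is not stated here), and the edges are flattened into nodes before the filter is applied.

```python
# Invented pages, keyed by the cursor that requests them (None = initial request).
PAGES = {
    None: {"data": {"moviesConnection": {
        "edges": [{"node": {"id": 9}}, {"node": {"id": 11}}],
        "pageInfo": {"hasNextPage": True, "endCursor": "cur-2"},
    }}},
    "cur-2": {"data": {"moviesConnection": {
        "edges": [{"node": {"id": 12}}],
        "pageInfo": {"hasNextPage": False, "endCursor": None},
    }}},
}


def fetch_all(fetch):
    """Request pages until there is no further cursor to follow."""
    nodes, cursor = [], None
    while True:
        conn = fetch(cursor)["data"]["moviesConnection"]
        nodes.extend(edge["node"] for edge in conn["edges"])
        cursor = conn["pageInfo"]["endCursor"]  # cursorSelector
        if not conn["pageInfo"]["hasNextPage"] or cursor is None:
            break
    return nodes


movies = fetch_all(PAGES.__getitem__)
kept = [m for m in movies if m["id"] > 10]  # filter: .id > 10
print(kept)
```

The first call uses no cursor, which corresponds to initialRequest; every later call corresponds to nextRequest with ${pagination.cursor} filled in.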

5. Enumeration (collection output) + Collection (POST with bodyRaw)

The temporal window defines a 5-minute slice of time that ends 10 minutes before the current time.

Enumeration step:

  • Makes a paginated GET to /posts.

  • Extracts IDs from posts within the time window.

  • Produces a collection of IDs.

Collection step:

  • Uses those IDs in a POST request.

  • Filters, maps, and outputs enriched objects (id, title, status).

  • Saves results to a file.

# Temporal window (optional)
temporalWindow:
  duration: 5m
  offset: 10m
  tz: UTC
  format: RFC3339

enumeration:
  type: "httpRequest"
  httpRequest:
    type: "page"
    page:
      pageSize: 50
      request:
        method: "GET"
        url: "https://api.fake-rest.refine.dev/posts"
        headers:
          Accept: "application/json"
        queryParams:
          from: "${temporalWindow.from}"
          to: "${temporalWindow.to}"
          _page: "${pagination.pageNumber}"
          _per_page: "${pagination.pageSize}"
  output:
    select: '.'
    # filter: '.language == 3'
    map: '{id: .id}'
    outputMode: "collection"

collection:
  variables:
    - name: ids
      source: input
      expression: "."
      format: "json"
  type: "httpRequest"
  httpRequest:
    type: "none"
    none:
      request:
        method: "POST"
        url: "https://api.fake-rest.refine.dev/posts"
        headers:
          Accept: "application/json"
        bodyType: "raw"
        bodyRaw: |
          {
            "ids": ${inputs.ids}
          }
  output:
    select: "."
    filter: ".id > 10"
    map: "{id: .id, title: .title, status: .status}"
    outputMode: "element"
    callback: "saveToFile"

  • Duration - 5m. The window size is 5 minutes.

  • Offset - 10m. Shifts the window back 10 minutes from “now”, so if the current UTC time is 12:00, the range is 11:45 to 11:50.

  • Time zone - UTC

  • Format - RFC3339. The output format for timestamps (e.g., 2025-08-20T12:00:00Z).

The variables ${temporalWindow.from} and ${temporalWindow.to} get auto-populated with these calculated times.
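The arithmetic behind these two variables can be reproduced with a short Python sketch. It follows the worked example above (the window ends offset before now and starts duration before that); the function name temporal_window is hypothetical, and the input time is assumed to be in UTC.

```python
from datetime import datetime, timedelta, timezone


def temporal_window(now, duration, offset):
    """Return (from, to) as RFC3339 strings; the window ends `offset` before `now`."""
    end = now - offset
    start = end - duration
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # valid only because `now` is assumed to be UTC
    return start.strftime(fmt), end.strftime(fmt)


now = datetime(2025, 8, 20, 12, 0, tzinfo=timezone.utc)
window_from, window_to = temporal_window(
    now, duration=timedelta(minutes=5), offset=timedelta(minutes=10)
)
print(window_from, window_to)
```

For 12:00 UTC this prints 2025-08-20T11:45:00Z and 2025-08-20T11:50:00Z, matching the range given above.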

Enumeration

  • Pagination type - page number/page size

  • Page size - 50. Fetches 50 records per request.

  • Request

    • Response type - JSON

    • Method - GET

    • URL - https://api.fake-rest.refine.dev/posts

    • Query Params

      1. from: "${temporalWindow.from}" Inserts the start timestamp of the time window.

      ${temporalWindow.from} is computed automatically from your temporalWindow configuration. For example, if now = 12:00 UTC, offset = 10m, and duration = 5m, then temporalWindow.from = 11:45 UTC (start). In the request, this becomes something like:

      ?from=2025-08-20T11:45:00Z

      2. to: "${temporalWindow.to}" Inserts the end timestamp of the time window. For example, temporalWindow.to = 11:50 UTC (end). In the request, this becomes:

      &to=2025-08-20T11:50:00Z

      So together, from and to tell the API:

      “Only give me records between 11:45 and 11:50 UTC.”

      3. _page: "${pagination.pageNumber}" This is a built-in pagination variable.

      ${pagination.pageNumber} auto-increments as the system makes repeated requests to fetch all pages: the first request sends _page=1, the second request sends _page=2, and so on.

      This ensures you don’t just get the first batch, but all results page by page.

      4. _per_page: "${pagination.pageSize}" Controls how many records to fetch per page.

      This pulls from your earlier configuration:

      page:
        pageSize: 50

      So each request includes:

      &_per_page=50

  • Select - '.' selects the entire JSON response.

  • Filter - commented out in this example; if enabled, it would keep only records where .language == 3.

  • Map - extracts only {id: .id} for each record.

  • Output Mode - collection outputs an array of items (instead of single elements).

[
  {"id": 1},
  {"id": 2},
  {"id": 3}
]
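The page number/page size loop and the collection output mode can be modeled with the following Python sketch. It works on an invented in-memory list instead of the real API, starts at page 1 as in the example above, and assumes that an empty page marks the end of the data; the function names are hypothetical.

```python
def fetch_page(records, page_number, page_size):
    """Stand-in for GET /posts?_page=N&_per_page=M over an in-memory list."""
    start = (page_number - 1) * page_size
    return records[start:start + page_size]


def enumerate_collection(records, page_size):
    """Walk every page and return all mapped items as one array."""
    ids, page_number = [], 1
    while True:
        page = fetch_page(records, page_number, page_size)
        if not page:  # assumption: an empty page means there is no more data
            break
        ids.extend({"id": r["id"]} for r in page)  # map: {id: .id}
        page_number += 1                           # ${pagination.pageNumber}
    return ids


# Invented data: 120 posts, so pages of 50, 50 and 20 records.
records = [{"id": i, "title": f"Post {i}"} for i in range(1, 121)]
collection = enumerate_collection(records, page_size=50)
print(len(collection), collection[:3])
```

Unlike the element mode, the whole array is passed on as a single item, which is what allows the next phase to send all IDs in one request.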

Collection (POST with BodyRaw)

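  • Variables - ids is read from the enumeration output (source: input). The expression "." takes the full collection, and format: "json" keeps it as JSON (an array of IDs).

  • Pagination Type - None. The YAML sets type: "none", so a single request is sent.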

  • Method - POST, to send data.

  • URL - https://api.fake-rest.refine.dev/posts

  • Body Type - raw, a freeform JSON payload.

  • Body Content - sends the IDs collected in the enumeration: "ids": ${inputs.ids}

  • Select - "." takes the full response.

  • Filter - ".id > 10" keeps only posts with an ID greater than 10.

  • Map - reduces each record to {id, title, status}.

  • Output Mode - element outputs individual objects, one at a time.
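As an illustration of how the raw body and the output settings fit together, the Python sketch below renders the body from a list of IDs and then applies the filter and map to a response. It rests on assumptions: the response records are invented, the helper names are hypothetical, and the placeholder is written as ${ids} because Python's string.Template does not accept the dotted name ${inputs.ids} used in the YAML.

```python
import json
from string import Template

# Raw body with a placeholder where the enumeration output is inserted.
BODY_TEMPLATE = Template('{\n  "ids": ${ids}\n}')


def render_body(ids):
    """Substitute the enumeration output into the raw body as JSON text."""
    return BODY_TEMPLATE.substitute(ids=json.dumps(ids))


def shape_output(posts):
    """Apply filter `.id > 10` and map `{id, title, status}` to each post."""
    return [
        {"id": p["id"], "title": p["title"], "status": p["status"]}
        for p in posts
        if p["id"] > 10
    ]


body = render_body([{"id": 1}, {"id": 2}, {"id": 3}])
print(body)

# Invented response records, used only to exercise the filter and map.
posts = [
    {"id": 5, "title": "Old post", "status": "draft", "language": 3},
    {"id": 11, "title": "New post", "status": "published", "language": 1},
]
shaped = shape_output(posts)
print(shaped)
```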


Integrations

Read the following articles to learn how to use the HTTP Pull listener to integrate data with other providers.
