Skip to content

[apm-data] Set event.dataset if empty for logs #129074

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Jun 17, 2025

Conversation

carsonip
Copy link
Member

@carsonip carsonip commented Jun 6, 2025

For APM logs, set event.dataset to data_stream.dataset if event.dataset is empty, to satisfy Anomaly Detection's requirement to have event.dataset in every logs-* data stream.

Fixes elastic/apm-server#16397

@carsonip carsonip force-pushed the apm-logs-event-dataset branch from 011910c to 71b8459 Compare June 6, 2025 16:29
@carsonip carsonip added v8.19.0 auto-backport Automatically create backport pull requests when merged labels Jun 6, 2025
@carsonip carsonip marked this pull request as ready for review June 9, 2025 16:17
@carsonip carsonip requested a review from a team as a code owner June 9, 2025 16:17
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Jun 9, 2025
@carsonip carsonip added the :Data Management/Data streams Data streams and their lifecycles label Jun 9, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Data Management Meta label for data/management team label Jun 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Jun 9, 2025
@carsonip carsonip added needs:triage Requires assignment of a team area label >bug and removed needs:triage Requires assignment of a team area label labels Jun 9, 2025
@elasticsearchmachine
Copy link
Collaborator

Hi @carsonip, I've created a changelog YAML for you.

@carsonip
Copy link
Member Author

carsonip commented Jun 9, 2025

Testing notes

Tested with adding the processor to default pipeline.

DELETE _data_stream/logs-apm.app.foo-bar

POST /logs-apm.app.foo-bar/_doc
{
  "@timestamp": 1,
  "data_stream": {
    "type": "logs",
    "dataset": "apm.app.foo",
    "namespace": "bar"
  }
}

POST /logs-apm.app.foo-bar/_doc
{
  "@timestamp": 1,
  "data_stream": {
    "type": "logs",
    "dataset": "apm.app.foo",
    "namespace": "bar"
  },
  "event": {
    "dataset": "ds"
  }
}

GET /logs-apm.app.foo-bar/_search

Output:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": ".ds-logs-apm.app.foo-bar-2025.06.09-000001",
        "_id": "_WCQVZcBbmWL1cSKlTSi",
        "_score": 1,
        "_source": {
          "@timestamp": 1,
          "data_stream": {
            "namespace": "bar",
            "type": "logs",
            "dataset": "apm.app.foo"
          },
          "event": {
            "ingested": "2025-06-09T16:40:28Z",
            "dataset": "apm.app.foo"
          }
        }
      },
      {
        "_index": ".ds-logs-apm.app.foo-bar-2025.06.09-000001",
        "_id": "_mCQVZcBbmWL1cSKoTRm",
        "_score": 1,
        "_source": {
          "@timestamp": 1,
          "data_stream": {
            "namespace": "bar",
            "type": "logs",
            "dataset": "apm.app.foo"
          },
          "event": {
            "ingested": "2025-06-09T16:40:31Z",
            "dataset": "ds"
          }
        }
      }
    ]
  }
}

@carsonip carsonip force-pushed the apm-logs-event-dataset branch from 3a8135a to 7add32d Compare June 11, 2025 13:17
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a question about sending docs to logs-apm.* without explicitly setting data_stream.dataset. Not sure if that's a thing we need to cater for.

- set:
if: "ctx.data_stream?.dataset != null"
field: event.dataset
value: "{{{data_stream.dataset}}}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: should we use copy_from and ignore_empty_value, and remove the if and value?
(See https://www.elastic.co/docs/reference/enrich-processor/set-processor)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will data_stream.dataset be automatically set for the logs-apm.* data stream documents, if they're not specified in _source? If that's the case, I'm not sure they will be in ctx.*, and we may need to have a fallback.

Can you add test cases for that?

Copy link
Member Author

@carsonip carsonip Jun 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will data_stream.dataset be automatically set for the logs-apm.* data stream documents, if they're not specified in _source?

No, at least not visible to the log ingest pipeline.

If that's the case, I'm not sure they will be in ctx.*

Correct, they will not be in ctx.

we may need to have a fallback.

Do you mean a fallback to a non-null event.dataset in the case where both data_stream.dataset and event.dataset is missing? I'm inclined to not perform any special case handling to make things up because it sounds more like a misconfiguration issue. Unless the user explicitly remove the DS fields in an intermediate ingest pipeline, DS fields should always be present in docs from apm-server (even if reroute processor is used).

In any case, I've added test in 957d9ef to confirm this behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean a fallback to a non-null event.dataset in the case where both data_stream.dataset and event.dataset is missing? I'm inclined to not perform any special case handling to make things up because it sounds more like a misconfiguration issue.

I'm OK with this. I don't think it's a real thing we need to handle, even if it's technically possible.

@carsonip carsonip requested a review from axw June 12, 2025 09:21
@carsonip carsonip requested review from 1pkg and isaacaflores2 June 16, 2025 08:42
@dakrone
Copy link
Member

dakrone commented Jun 16, 2025

Is there a reason why you switched away from apm@pipeline for the final pipeline? When I look at this it appears that these logs will no longer be doing the processing in that file, instead only doing the event.dataset configuration from the new final pipeline.

Can this be considered a breaking change in behavior?

@carsonip
Copy link
Member Author

Is there a reason why you switched away from apm@pipeline for the final pipeline? When I look at this it appears that these logs will no longer be doing the processing in that file, instead only doing the event.dataset configuration from the new final pipeline.

The new logs-apm@pipeline in turn calls apm@pipeline, just like the other final pipelines e.g. metrics-apm@pipeline and traces-apm@pipeline. It is needed just because we needed log-specific processor in the final pipeline and we cannot just add it to apm@pipeline directly.

@dakrone
Copy link
Member

dakrone commented Jun 16, 2025

Ahh okay, that makes sense. I missed the call out to the original apm@pipeline (I am not used to reading pipelines in yaml…). Thanks for the explanation.

@carsonip carsonip requested a review from simitt June 16, 2025 16:33
Copy link
Member

@axw axw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks, and sorry for the delay

@carsonip carsonip merged commit 466afba into elastic:main Jun 17, 2025
18 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 129074

@carsonip
Copy link
Member Author

💚 All backports created successfully

Status Branch Result
8.19

Questions ?

Please refer to the Backport tool documentation

carsonip added a commit to carsonip/elasticsearch that referenced this pull request Jun 17, 2025
For APM logs, set event.dataset to data_stream.dataset if event.dataset is empty, to satisfy Anomaly Detection's requirement to have event.dataset in every logs-* data stream.

(cherry picked from commit 466afba)

# Conflicts:
#	test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
carsonip added a commit that referenced this pull request Jun 17, 2025
For APM logs, set event.dataset to data_stream.dataset if event.dataset is empty, to satisfy Anomaly Detection's requirement to have event.dataset in every logs-* data stream.

(cherry picked from commit 466afba)

# Conflicts:
#	test/framework/src/main/java/org/elasticsearch/test/rest/ESRestTestCase.java
@rubvs rubvs mentioned this pull request Jun 19, 2025
17 tasks
@rubvs
Copy link
Contributor

rubvs commented Jun 23, 2025

Tested manually with steps outlined in #129074 (comment)

With version 9.0.3
    "hits": [
      {
        "_index": ".ds-logs-apm.app.foo-bar-2025.06.23-000001",
        "_id": "JNXnnZcBQ3xVa0c-8bvn",
        "_score": 1,
        "_source": {
          "@timestamp": 1,
          "data_stream": {
            "namespace": "bar",
            "type": "logs",
            "dataset": "apm.app.foo"
          },
          "tags": [
            "_geoip_database_unavailable_GeoLite2-City.mmdb"
          ]
        }
      },
      {
        "_index": ".ds-logs-apm.app.foo-bar-2025.06.23-000001",
        "_id": "xU_nnZcBCJmh9scm-w5_",
        "_score": 1,
        "_source": {
          "@timestamp": 1,
          "data_stream": {
            "namespace": "bar",
            "type": "logs",
            "dataset": "apm.app.foo"
          },
          "event": {
            "dataset": "ds"
          },
          "tags": [
            "_geoip_database_unavailable_GeoLite2-City.mmdb"
          ]
        }
      }
    ]
With version 9.1.0
    "hits": [
      {
        "_index": ".ds-logs-apm.app.foo-bar-2025.06.23-000001",
        "_id": "hFznnZcBHYim8ksPeDdl",
        "_score": 1,
        "_source": {
          "@timestamp": 1,
          "data_stream": {
            "namespace": "bar",
            "type": "logs",
            "dataset": "apm.app.foo"
          },
          "event": {
            "dataset": "apm.app.foo"
          },
          "tags": [
            "_geoip_database_unavailable_GeoLite2-City.mmdb"
          ]
        }
      },
      {
        "_index": ".ds-logs-apm.app.foo-bar-2025.06.23-000001",
        "_id": "hVznnZcBHYim8ksPjTfk",
        "_score": 1,
        "_source": {
          "@timestamp": 1,
          "data_stream": {
            "namespace": "bar",
            "type": "logs",
            "dataset": "apm.app.foo"
          },
          "event": {
            "dataset": "ds"
          },
          "tags": [
            "_geoip_database_unavailable_GeoLite2-City.mmdb"
          ]
        }
      }
    ]

By inspecting the outputs above, version 9.0.3 does not include the event.dataset field, while 9.1.0 does include it. Concluding that the data_stream.dataset field is copied properly.

"event": {
    "dataset": "apm.app.foo"
},

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged backport pending >bug :Data Management/Data streams Data streams and their lifecycles Team:Data Management Meta label for data/management team v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

OTLP ingested logs do not have event.dataset field
7 participants