[Mimecast] - Refactored 'siem_logs' & 'cloud_integrated_logs' data streams to improve memory usage #16308
base: main
Conversation
…o andrew's proof of concept
Pinging @elastic/security-service-integrations (Team:Security-Service Integrations)
🚀 Benchmarks report: To see the full report, comment with …
),
},
},
"want_more": false,
I am wondering if we are going to get into a non-recoverable state if one of these blobs ages out (Mimecast claims to retain data for 7 days). Assuming the API returns a 404, maybe it should move to the next blob (send an error event, remove the item from cursor.blobs, continue to the next). WDYT?
I'm just concerned about whether a 404 is strictly enforced on Mimecast's end in such a scenario. If a temporary issue causes a 404, we would be acting on a false positive.
CEL already retries 5 times by default; what if we increase waitMin and waitMax to greater values and then, on a non-200, deterministically remove the faulty blob? That way we keep a retry fallback, and if it still does not succeed, we effectively remove the blob from the list permanently and emit an error event, but continue to process the rest of the list.
> on a non-200, we deterministically remove the faulty blob
I disagree with the non-200. I think that is too wide of a condition. I would limit it to HTTP 404 to begin with because this is the conventional status code for this state, but if we learn more, we can expand the status codes as required (e.g. HTTP 409).
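To make the idea concrete, here is a minimal sketch of how the CEL program could skip an aged-out blob. The names `resp` and `blobs`, and the error-event shape, are illustrative assumptions, not the PR's actual code:

```
// Hypothetical sketch: if the API reports the blob as gone (HTTP 404),
// emit an error event, drop the blob from the cursor, and move on.
resp.StatusCode == 404 ?
    state.with({
        "events": [{"error": {"message": "skipping blob that is no longer retained by the API"}}],
        // Carry only the remaining blobs forward in the cursor.
        "cursor": {"blobs": tail(blobs)},
        "want_more": size(blobs) > 1,
    })
:
    state.with({
        // ... normal processing of blobs[0] ...
    })
```

If other status codes turn out to warrant the same handling, the condition can simply be widened, e.g. `resp.StatusCode == 404 || resp.StatusCode == 409`.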
Rémy suggests the following commit message (with my modification):
mimecast: refactor CEL code to prevent memory issues and OOM failures
The existing CEL implementation in siem_logs data stream was causing
out-of-memory failures during processing. The refactored approach uses
batch processing techniques to reduce memory consumption.
The cloud_integrated_logs data stream received similar optimizations
due to code sharing with siem_logs.
Based on Andrew's POC implementation:
https://github.com/andrewkroh/integrations/commits/mimecast-siem-batch-per-execution/
I've replaced code "coupling" with code "sharing", since there is no coupling.
packages/mimecast/changelog.yml (Outdated)
- description: Refactored the CEL program for siem_logs data stream to improve memory usage.
  type: bugfix
  link: https://github.com/elastic/integrations/pull/16308
- description: Refactored the CEL program for cloud_integrated_logs data stream to improve memory usage.
  type: bugfix
  link: https://github.com/elastic/integrations/pull/16308
Suggested change: combine the two entries into one:
- description: Refactored the CEL program for cloud_integrated_logs and siem_logs data streams to improve memory usage.
  type: bugfix
  link: https://github.com/elastic/integrations/pull/16308
| "access_token": token.access_token, | ||
| "expires": token.expires, | ||
| }, | ||
| "want_more": size(tail(blobs)) != 0 || size(work_list) != 0, |
| "want_more": size(tail(blobs)) != 0 || size(work_list) != 0, | |
| "want_more": size(blobs) > 1 || size(work_list) != 0, |
in both files
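As a quick illustration (not from the PR), the two conditions agree for the lists the program will see, while the suggested form avoids building the intermediate `tail(blobs)` list:

```
// For blobs == ["a", "b"]: size(tail(blobs)) != 0 is true,  and size(blobs) > 1 is true.
// For blobs == ["a"]:      size(tail(blobs)) != 0 is false, and size(blobs) > 1 is false.
// This expression evaluates to true:
(size(tail(["a", "b"])) != 0) == (size(["a", "b"]) > 1)
```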
💚 Build Succeeded
cc @ShourieG
Type of change
Proposed commit message
Checklist
changelog.yml file.
Note: Skipped the httpjson test in 'siem_logs', as a fix for the health-status-degraded issue did not seem to fit within this PR.
Author's Checklist
How to test this PR locally
Related issues
Screenshots