Log analyzer not resilient to bad data
Summary
I have observed four ways in which the log analyzer will loop over messages indefinitely, never ack'ing them (removing them from the queue), and sometimes never getting to new messages because the task exits upon encountering the old bad message (order of receipt does not appear to be guaranteed).
Steps to Reproduce
Unrecognized AWS account
- Upload the following to the s3 bucket associated with your cloudigrade instance: no.such.account.json.gz (a boto3 upload sketch follows these steps).
- Observe the tracebacks in the celery logs. Observe that it keeps raising the same error until you manually delete the message from the queue.
- After deleting the message from the queue, you can move on to the next example of bad data.
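A minimal sketch of that upload with boto3; the bucket name and object key below are placeholders for whatever CloudTrail bucket your cloudigrade instance watches:

import boto3

s3 = boto3.client('s3')
s3.upload_file(
    'no.such.account.json.gz',           # attached file from this issue
    'my-cloudigrade-cloudtrail-bucket',  # placeholder bucket name
    'AWSLogs/no.such.account.json.gz',   # placeholder object key
)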
Non-gzipped file
- Upload the attached non-gzipped file to the s3 bucket associated with your cloudigrade instance (or generate one with the sketch that follows these steps).
- Observe the tracebacks in the celery logs. Observe that it keeps raising the same error until you manually delete the message from the queue.
- After deleting the message from the queue, you can move on to the next example of bad data.
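If you would rather generate the bad data yourself, a minimal sketch, assuming the same placeholder bucket as above:

import boto3

# Any file that is not gzip-compressed will do; gzip.decompress() only needs
# the magic bytes to be wrong to raise "Not a gzipped file".
with open('not-gzipped.json', 'w') as f:
    f.write('{"Records": []}')

boto3.client('s3').upload_file(
    'not-gzipped.json',
    'my-cloudigrade-cloudtrail-bucket',  # placeholder bucket name
    'AWSLogs/not-gzipped.json',          # placeholder object key
)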
Mal-formed JSON
- Upload the attached gzipped mal-formed JSON file to the s3 bucket associated with your cloudigrade instance (or generate one with the sketch that follows these steps).
- Observe the tracebacks in the celery logs. Observe that it keeps raising the same error until you manually delete the message from the queue.
- After deleting the message from the queue, you can move on to the next example of bad data.
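Again, a minimal sketch for generating the bad data yourself, assuming the same placeholder bucket:

import gzip

import boto3

# Properly gzipped bytes that are not valid JSON: gzip.decompress() succeeds,
# but json.loads() raises JSONDecodeError.
with gzip.open('malformed.json.gz', 'wt') as f:
    f.write('this is not json')

boto3.client('s3').upload_file(
    'malformed.json.gz',
    'my-cloudigrade-cloudtrail-bucket',  # placeholder bucket name
    'AWSLogs/malformed.json.gz',         # placeholder object key
)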
Instance no longer exists at time of message receipt
See #457 (closed)
Expected Result
- Log analyzer can recover from bad data: log the problem, ack (delete) the offending message, and keep processing other messages (a rough sketch follows).
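A rough sketch of the behavior I would expect, using hypothetical names (analyze_log_message, ack) rather than cloudigrade's actual task code:

import gzip
import json
import logging

logger = logging.getLogger(__name__)

def analyze_log_message(raw_bytes, ack):
    """Hypothetical resilient variant: `ack` deletes the message from the queue."""
    try:
        log = json.loads(gzip.decompress(raw_bytes).decode('utf-8'))
        # ... normal EC2 instance event parsing would happen here ...
    except (OSError, json.JSONDecodeError):
        # Bad data: log it, but do not leave the message on the queue.
        logger.exception('Skipping unparseable CloudTrail log')
    finally:
        ack()  # always ack so the analyzer never loops over the same message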
Actual Result
- Messages stay in the queue, and the log analyzer has to fumble through them on every run, possibly causing it to miss work it is supposed to be doing.
Unrecognized AWS account
When CloudTrail data ends up in the s3 bucket of a cloudigrade instance that no longer has an AwsAccount with a matching AWS account number, the task raises the error below every time it reads the queue, because the message is never ack'd and execution never reaches the code that deletes messages. (A sketch of one possible mitigation follows the traceback.)
This is a pain point for testing because we are deleting accounts. It could also happen in production if we ever decide to remove an account but the customer does not delete their CloudTrail.
[2018-08-06 20:29:23,381: ERROR/ForkPoolWorker-307] Task analyzer.tasks.analyze_log[9789adb3-fe16-4789-ab09-5323c7d7dff3] raised unexpected: DoesNotExist('AwsAccount matching query does not exist.',)
Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/base.py", line 467, in run
return task._orig_run(*args, **kwargs)
File "/opt/cloudigrade/util/aws/helper.py", line 189, in wrapped
result = original_function(*args, **kwargs)
File "/opt/cloudigrade/analyzer/tasks.py", line 62, in analyze_log
instances = _parse_log_for_ec2_instance_events(log)
File "/opt/cloudigrade/analyzer/tasks.py", line 96, in _parse_log_for_ec2_instance_events
account = AwsAccount.objects.get(aws_account_id=account_id)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/django/db/models/query.py", line 399, in get
self.model._meta.object_name
account.models.AwsAccount.DoesNotExist: AwsAccount matching query does not exist.
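One possible way to tolerate this, sketched against the lookup in _parse_log_for_ec2_instance_events (illustrative only, not a patch):

import logging

from account.models import AwsAccount

logger = logging.getLogger(__name__)

def get_account_or_none(account_id):
    # Skip records for accounts cloudigrade no longer knows about instead of
    # letting DoesNotExist kill the whole task.
    try:
        return AwsAccount.objects.get(aws_account_id=account_id)
    except AwsAccount.DoesNotExist:
        logger.warning('No AwsAccount for aws_account_id=%s; skipping', account_id)
        return None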
Non-gzipped file
When a non-gzipped file ends up in the s3 bucket, the task raises the error below every time it reads the queue, because the message is never ack'd and execution never reaches the code that deletes messages. (A sketch of one possible mitigation follows the traceback.)
This would occur if CloudTrail does something wild, which could possibly happen.
[2018-08-06 20:40:15,963: ERROR/ForkPoolWorker-2] Task analyzer.tasks.analyze_log[7f7223fe-effe-434e-8e47-1d2793a1ff63] raised unexpected: OSError("Not a gzipped file (b'ba')",)
Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/base.py", line 467, in run
return task._orig_run(*args, **kwargs)
File "/opt/cloudigrade/util/aws/helper.py", line 189, in wrapped
result = original_function(*args, **kwargs)
File "/opt/cloudigrade/analyzer/tasks.py", line 57, in analyze_log
logs.append(aws.get_object_content_from_s3(bucket, key))
File "/opt/cloudigrade/util/aws/s3.py", line 32, in get_object_content_from_s3
content = gzip.decompress(object_bytes).decode('utf-8')
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 532, in decompress
return f.read()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 276, in read
return self._buffer.read(size)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 463, in read
if not self._read_gzip_header():
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 411, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'ba')
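A sketch of how get_object_content_from_s3 could treat this as bad data instead of letting the OSError propagate (illustrative only, not a patch):

import gzip
import logging

logger = logging.getLogger(__name__)

def decompress_or_none(object_bytes):
    try:
        return gzip.decompress(object_bytes).decode('utf-8')
    except OSError:  # "Not a gzipped file" is a plain OSError on Python 3.6
        logger.warning('S3 object is not gzipped; treating it as bad data')
        return None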
Mal-formed JSON
When bad JSON is properly gzipped and found in the s3 bucket, the task raises the error below every time it reads the queue, because the message is never ack'd and execution never reaches the code that deletes messages. (A sketch of one possible mitigation follows the traceback.)
This would occur if CloudTrail does something wild, which could possibly happen.
[2018-08-06 21:00:17,403: ERROR/ForkPoolWorker-2] Task analyzer.tasks.analyze_log[2a2e1883-7a64-4d5e-8abe-f4a6bf52c4c4] raised unexpected: JSONDecodeError('Expecting value: line 1 column 1 (char 0)',)
Traceback (most recent call last):
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
R = retval = fun(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
return self.run(*args, **kwargs)
File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/base.py", line 467, in run
return task._orig_run(*args, **kwargs)
File "/opt/cloudigrade/util/aws/helper.py", line 189, in wrapped
result = original_function(*args, **kwargs)
File "/opt/cloudigrade/analyzer/tasks.py", line 62, in analyze_log
instances = _parse_log_for_ec2_instance_events(log)
File "/opt/cloudigrade/analyzer/tasks.py", line 87, in _parse_log_for_ec2_instance_events
log = json.loads(log)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/__init__.py", line 354, in loads
return _default_decoder.decode(s)
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
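And a sketch of decoding defensively and checking the expected top-level shape before looking for EC2 events (illustrative only, not a patch):

import json

def parse_records(log_text):
    # Bad or unexpected JSON yields an empty list so the caller can still
    # ack the message and move on.
    try:
        log = json.loads(log_text)
    except json.JSONDecodeError:
        return []
    return log.get('Records', []) if isinstance(log, dict) else []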
Additional context
Files for reproducing these problems are attached with descriptive names. Ping me if you have any questions!