
Log analyzer not resilient to bad data

Summary

I have observed four ways in which the log analyzer will loop over the same messages indefinitely, never ack'ing them (removing them from the queue), and sometimes never getting to new messages because it exits upon encountering the old bad message (order of receipt does not seem to be guaranteed).

Steps to Reproduce

Unrecognized AWS account

Upload the following to the s3 bucket associated with your cloudigrade instance:

no.such.account.json.gz

  1. Observe the tracebacks in the celery logs. Note that the same error recurs until you manually delete the message from the queue.

  2. After deleting the message from the queue, you can move on to the next example of bad data.
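
For reference, the upload step can be scripted with boto3. This is a minimal sketch, assuming default AWS credentials; the bucket name is a placeholder for whatever bucket your cloudigrade instance actually watches:

import boto3

def upload_bad_file(path, bucket='my-cloudigrade-cloudtrail-bucket'):
    """Upload a local file so the log analyzer's queue notices it."""
    s3 = boto3.client('s3')
    # Use the local file name as the object key for simplicity.
    s3.upload_file(path, bucket, path)

upload_bad_file('no.such.account.json.gz')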

Non-gzipped file

Upload the following to the s3 bucket associated with your cloudigrade instance:

not_gzipped.txt

  1. Observe the tracebacks in the celery logs. Note that the same error recurs until you manually delete the message from the queue.

  2. After deleting the message from the queue, you can move on to the next example of bad data.

Malformed JSON

Upload the following to the s3 bucket associated with your cloudigrade instance:

bad.json.gz

  1. Observe the tracebacks in the celery logs. Note that the same error recurs until you manually delete the message from the queue.

  2. After deleting the message from the queue, you can move on to the next example of bad data.

Instance no longer exists at time of message receipt

See #457 (closed)

Expected Result

  • Log analyzer can recover from bad data

Actual Result

  • Messages stay in the queue, and the log analyzer has to fumble through them on every run, possibly causing it to miss work it is supposed to be doing. A sketch of a more defensive message loop follows immediately below.
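
The general shape of a fix might look like the sketch below. This is an illustration only, not the actual cloudigrade code; process_message and delete_message are hypothetical stand-ins for the analyzer's existing per-message handling and its SQS delete call.

import logging

logger = logging.getLogger(__name__)

def analyze_messages(messages, process_message, delete_message):
    """Process each message independently and always ack it."""
    for message in messages:
        try:
            process_message(message)
        except Exception:
            # Log and continue instead of raising, so one poison message
            # cannot block processing of the rest of the queue.
            logger.exception('Skipping bad message: %s', message)
        finally:
            # Delete/ack no matter what so the message is not re-delivered.
            delete_message(message)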

Unrecognized AWS account

When CloudTrail data ends up in the s3 bucket of a cloudigrade instance that no longer has an AwsAccount with a matching AWS account number, the task fails with the traceback below. It will do this every time it looks at the queue because the message is never ack'd; it never gets to the code that deletes the messages. A sketch of one possible fix follows the traceback.

This is a pain point for testing because we are deleting accounts. It could also happen during normal operation if we decide to remove an account but the customer does not delete their CloudTrail.

[2018-08-06 20:29:23,381: ERROR/ForkPoolWorker-307] Task analyzer.tasks.analyze_log[9789adb3-fe16-4789-ab09-5323c7d7dff3] raised unexpected: DoesNotExist('AwsAccount matching query does not exist.',)
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/base.py", line 467, in run
    return task._orig_run(*args, **kwargs)
  File "/opt/cloudigrade/util/aws/helper.py", line 189, in wrapped
    result = original_function(*args, **kwargs)
  File "/opt/cloudigrade/analyzer/tasks.py", line 62, in analyze_log
    instances = _parse_log_for_ec2_instance_events(log)
  File "/opt/cloudigrade/analyzer/tasks.py", line 96, in _parse_log_for_ec2_instance_events
    account = AwsAccount.objects.get(aws_account_id=account_id)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/site-packages/django/db/models/query.py", line 399, in get
    self.model._meta.object_name
account.models.AwsAccount.DoesNotExist: AwsAccount matching query does not exist.
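
One possible shape of a fix, sketched here as an assumption rather than the current code: treat an unknown account as "nothing to do" for that record instead of letting DoesNotExist escape the task.

import logging

from account.models import AwsAccount  # import path taken from the traceback above

logger = logging.getLogger(__name__)

def get_known_account(account_id):
    """Return the matching AwsAccount, or None when cloudigrade no longer
    tracks that AWS account, so the caller can skip its events."""
    try:
        return AwsAccount.objects.get(aws_account_id=account_id)
    except AwsAccount.DoesNotExist:
        logger.warning('Unrecognized AWS account %s; skipping.', account_id)
        return None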

Non-gzipped file

When a non-gzipped file ends up in the s3 bucket, the task fails with the traceback below. It will do this every time it looks at the queue because the message is never ack'd; it never gets to the code that deletes the messages. A sketch of one possible fix follows the traceback.

This would occur if CloudTrail does something wild, which could possibly happen.

[2018-08-06 20:40:15,963: ERROR/ForkPoolWorker-2] Task analyzer.tasks.analyze_log[7f7223fe-effe-434e-8e47-1d2793a1ff63] raised unexpected: OSError("Not a gzipped file (b'ba')",)
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/base.py", line 467, in run
    return task._orig_run(*args, **kwargs)
  File "/opt/cloudigrade/util/aws/helper.py", line 189, in wrapped
    result = original_function(*args, **kwargs)
  File "/opt/cloudigrade/analyzer/tasks.py", line 57, in analyze_log
    logs.append(aws.get_object_content_from_s3(bucket, key))
  File "/opt/cloudigrade/util/aws/s3.py", line 32, in get_object_content_from_s3
    content = gzip.decompress(object_bytes).decode('utf-8')
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 532, in decompress
    return f.read()
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 276, in read
    return self._buffer.read(size)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 463, in read
    if not self._read_gzip_header():
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/gzip.py", line 411, in _read_gzip_header
    raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'ba')
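
Again only as a sketch: on Python 3.6, gzip.decompress raises OSError for content that is not gzipped, so get_object_content_from_s3 (or its caller) could tolerate such objects along these lines.

import gzip
import logging

logger = logging.getLogger(__name__)

def decompress_if_possible(object_bytes, key):
    """Return the decoded log content, or None when the object is not gzipped."""
    try:
        return gzip.decompress(object_bytes).decode('utf-8')
    except OSError:
        logger.warning('S3 object %s is not gzipped; ignoring it.', key)
        return None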

Malformed JSON

When malformed JSON is properly gzipped and found in the s3 bucket, the task fails with the traceback below. It will do this every time it looks at the queue because the message is never ack'd; it never gets to the code that deletes the messages. A sketch of one possible fix follows the traceback.

This would occur if CloudTrail does something wild, which could possibly happen.

[2018-08-06 21:00:17,403: ERROR/ForkPoolWorker-2] Task analyzer.tasks.analyze_log[2a2e1883-7a64-4d5e-8abe-f4a6bf52c4c4] raised unexpected: JSONDecodeError('Expecting value: line 1 column 1 (char 0)',)
Traceback (most recent call last):
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/trace.py", line 641, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/rh/rh-python36/root/usr/lib/python3.6/site-packages/celery/app/base.py", line 467, in run
    return task._orig_run(*args, **kwargs)
  File "/opt/cloudigrade/util/aws/helper.py", line 189, in wrapped
    result = original_function(*args, **kwargs)
  File "/opt/cloudigrade/analyzer/tasks.py", line 62, in analyze_log
    instances = _parse_log_for_ec2_instance_events(log)
  File "/opt/cloudigrade/analyzer/tasks.py", line 87, in _parse_log_for_ec2_instance_events
    log = json.loads(log)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/rh/rh-python36/root/usr/lib64/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
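
A similar sketch, again an assumption rather than the current code, for tolerating bad JSON in _parse_log_for_ec2_instance_events:

import json
import logging

logger = logging.getLogger(__name__)

def parse_log_if_possible(log):
    """Return the parsed CloudTrail log, or None when the content is not
    valid JSON, so the caller can drop the record cleanly."""
    try:
        return json.loads(log)
    except json.JSONDecodeError:
        logger.warning('Could not parse CloudTrail log as JSON; skipping.')
        return None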

Additional context

Files for reproducing these problems are attached with descriptive names. Ping me if you have any questions!
