Commit 62799a96 authored by Vesa Vertainen

moved feature_extractor to proprietary repository

parent 21ec578b
test:
stage: test
image: python:3.7-slim
script:
- cd feature_extractor
- pip3 install -r requirements.txt
- apt-get update -qy
- apt-get install -y jq
- jq --arg CENUID "$CENSYS_UID" '.Censys.uid = $CENUID' API.json > tmp.$$.json && mv tmp.$$.json API.json
- jq --arg CENKEY "$CENSYS_KEY" '.Censys.key = $CENKEY' API.json > tmp.$$.json && mv tmp.$$.json API.json
- python3 analyze_parallel.py --infile input2
- ls ./output
- cat ./output/*
only:
- master
\ No newline at end of file
# What
# THREAT_FEEDS
Tools for collecting daily snapshots of publicly available TI feeds and analysis
of how dynamic and timely they are.
......@@ -34,3 +34,8 @@ creates
dynamicity.txt
timeliness.txt
~~~~
# FEATURE_EXTRACTOR
See the [feature_extractor repository](https://gitlab.com/cincan/feature_extractor)
{
"AbuseIPDB" :{
"key":""
},
"Censys": {
"uid":"",
"key":""
},
"GoogleSafebrowsing" : {
"key":""
},
"OTXQuery" : {
"key":""
},
"PhishTank" : {
"key":""
},
"Shodan" : {
"key":""
},
"VirusTotal" : {
"key":""
},
"MISPWarningLists" : {
"path":"/path/to/misp-warninglists/"
}
}
\ No newline at end of file
#docker run -v $(pwd):/data cincan/feature_extractor --injsonl /data/jsonl_input --path /data/
#cincan run cincan/feature_extractor --injsonl jsonl_input --path ./
FROM python:3.6-slim-buster
LABEL MAINTAINER=cincan.io
COPY . /wp1/feature_extractor
RUN apt update && apt install -y \
&& cd wp1/feature_extractor/ \
&& pip install -r requirements.txt \
&& adduser --shell /sbin/nologin --disabled-login --gecos "" appuser \
&& chown -R appuser:appuser /wp1
USER appuser
WORKDIR /home/appuser
ENTRYPOINT ["/usr/local/bin/python","/wp1/feature_extractor/analyze_parallel.py","--confpath", "/wp1/feature_extractor"]
## Table of Contents
* [Table of Contents](#table-of-contents)
* [Overview](#overview)
* [Installation](#installation)
* [Configuration](#configuration)
* [active_analyzers](#active_analyzers)
* [API.json](#apijson)
* [Track.json](#trackjson)
* [Usage](#usage)
# Overview
Feature Extractor is a tool for running [Cortex-analyzers](https://github.com/TheHive-Project/Cortex-Analyzers) from the command line, parsing the results, and printing them to a configurable HTML report.
The motivation for this tool is to speed up the work of analysts by helping them decide the relevance of various IoCs (indicators of compromise).
### Cortex-analyzers
Cortex-analyzers are tools designed to be used in conjunction with the [Cortex-platform](https://github.com/TheHive-Project/Cortex/blob/master/README.md). An analyzer takes a predefined type of IoC as input and returns a JSON file containing the results of the analysis. Typically an analyzer queries the API of some service.
### Purpose of the project
The purpose of this project is to streamline the process of analyzing IoCs. Cortex Analyzers do a good job of standardizing the process of querying different APIs. However, they do fairly little processing of the output.
This is where Feature Extractor comes in. It lets analysts select which analyzers are run and feed one or several IoCs to the tool. The tool then runs the selected analyzers on them and generates a streamlined, configurable report as a standalone HTML file for easy viewing and sharing of the results.
# Installation
Installation assumes that the user has Python 3.6 or newer and related packages (e.g. pip) installed.
### Installation instructions
Clone the git repository
~~~
git clone https://gitlab.com/CinCan/wp1.git
~~~
navigate to the correct folder
~~~
cd wp1/feature_extractor/
~~~
and run
~~~
pip3 install -r requirements.txt
~~~
# Configuration
Configuration of the tool consists of the following three files, which are straightforward to set up.
~~~
./active_analyzers
./API.json
./Track.json
~~~
The contents of the files are as follows:
- The active_analyzers file contains the list of currently activated Cortex analyzers. Here the user selects which analyzers are used.
- The API.json file contains configuration options required by the analyzers, such as API keys and file paths.
- The Track.json file determines which fields of the analyzers' output are parsed into the report.
The format of each file is described below.
### active_analyzers
Activate the analyzers you want to use by adding them to the file
~~~
./active_analyzers
~~~
The file consists of lines of the form
~~~
folder/analyzer
~~~
where `folder` is the name of the folder containing the analyzer and `analyzer` is the name of the analyzer's JSON file without the .json extension.
#### Example file
```
Threatcrowd/Threatcrowd
VirusTotal/VirusTotal_GetReport
VirusTotal/VirusTotal_Scan
EmergingThreats/EmergingThreats_DomainInfo
CinCanTools/CinCanTools
```
An analyzer can be disabled by removing it from the file or by commenting it out with #.
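The file format described above can be read with a few lines of Python. This is only a minimal sketch, not Feature Extractor's own parser: the function name `parse_active_analyzers` is hypothetical, and it assumes that a `#` at the start of a line is what disables an analyzer.

```python
def parse_active_analyzers(text):
    """Return (folder, analyzer) pairs from the contents of an active_analyzers file."""
    pairs = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and analyzers disabled with a leading '#' (assumed convention).
        if not line or line.startswith("#"):
            continue
        # Split 'folder/analyzer' on the first slash.
        folder, _, analyzer = line.partition("/")
        pairs.append((folder, analyzer))
    return pairs


example = """Threatcrowd/Threatcrowd
#VirusTotal/VirusTotal_Scan
VirusTotal/VirusTotal_GetReport
"""
print(parse_active_analyzers(example))
```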
### API.json
Contains the configuration for the analyzers. After changing active_analyzers, API.json can be updated with the command
~~~
./analyze.py -u
~~~
This reads the configuration JSON files of the active analyzers and appends the configuration fields they require to API.json.
_The user can then add the required API keys and file paths to the file._
#### Example file
~~~
{
"Censys": {
"uid": "",
"key": ""
},
"PhishTank" : {
"key" : ""
},
"Shodan" : {
"key" : ""
},
"VirusTotal" : {
"key" : ""
},
"MISPWarningLists" : {
"path" : "/path/to/misp-warninglists/"
}
}
~~~
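Keys do not have to be typed into the file by hand; they can also be injected from the environment. The following is a hedged sketch of that idea for the Censys entry. The helper name `fill_censys_credentials` is hypothetical, and the environment variable names `CENSYS_UID` and `CENSYS_KEY` are assumptions, not something Feature Extractor itself reads.

```python
import json
import os


def fill_censys_credentials(config, env=os.environ):
    """Return a copy of an API.json structure with Censys credentials taken from env."""
    # A JSON round trip gives a deep copy, so the caller's dict is left untouched.
    updated = json.loads(json.dumps(config))
    censys = updated.setdefault("Censys", {})
    censys["uid"] = env.get("CENSYS_UID", "")  # assumed variable name
    censys["key"] = env.get("CENSYS_KEY", "")  # assumed variable name
    return updated


config = {"Censys": {"uid": "", "key": ""}, "Shodan": {"key": ""}}
filled = fill_censys_credentials(
    config, env={"CENSYS_UID": "my-uid", "CENSYS_KEY": "my-key"}
)
print(json.dumps(filled, indent=2))
```

In practice the result would be written back to API.json with `json.dump`; the sketch leaves that out so that it has no side effects.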
### Track.json
The configuration is best explained by example. Let's go through setting up the parsing for some fields found in analyzers.
The basic format of Track.json is the following
~~~
{
"tracked" : [
{
"analyzer": "AbuseIPDB",
"service" : "AbuseIPDB",
"fields" : [
{
"name" : "Attack categories",
"expression" : "full.values.[*].categories_strings[*]",
"type": "multiset",
"datatypes": ["ip"]
}
]
},
{
"analyzer" : "CinCanTools",
"service" : "CinCanTools",
"fields": [
{
"name": "Majestic Top 10k",
"expression": "full.majestic_top10k",
"datatypes": ["domain"]
}
]
},
{
"analyzer": "MISPWarningLists2",
"service": "MISPWarningLists2",
"fields" : [
{
"name": "Hits",
"expression": "full.results.[*].[name,description,version]",
"type": "object-array",
"datatypes" : ["domain"]
}
]
}
]
}
~~~
Note: analyzer == `folder` and service == `analyzer` from the `active_analyzers` file.
Tracked fields are described by an object of the form
~~~
{
"name": "Hits",
"expression": "full.results.[*].[name,description,version]",
"type": "object-array",
"datatypes" : ["domain"]
}
~~~
The keys are as follows:
- "name" - Name displayed for the field in the report
- "expression" - Expression in JSONPath format. The basics of JSONPath syntax are described in [JSONPath - XPath for JSON](https://goessner.net/articles/JsonPath/)
- "type" - Determines how the results returned by the expression are displayed. Supported display types are None, multiset, object, object-array and subobject.
- "datatypes" - Input types for which the field is parsed. Accepted datatypes are `ip, domain, url, fqdn, hash` and `mail`.
#### Display types
Supported display types are None, multiset, object, object-array and subobject.
- None - If no type is defined, the result is displayed as a simple string
- multiset - For counting occurrences of atomic values of a specific key inside the objects of a JSON array
- object - For displaying a single object in an HTML table
- subobject - For displaying a subset of the fields of a single object (can also be used to sort the fields of an object)
- object-array - For displaying an array of JSON objects in an HTML table
Let's go through each type by parsing some fields from the following analyzer output.
Note: There are no restrictions on the format of the output file, other than that it must conform to the [JSON specification](https://tools.ietf.org/html/rfc7159).
Output:
~~~
{
"full" : {
"ip_address" : "1.1.1.1",
"location" : {
"province": "Occitanie",
"city": "Cazeres",
"country": "France",
"longitude": 1.0863,
"registered_country": "France",
"postal_code" : "31220",
"country_code" : "FR",
"latitude" : 43.2071,
"timezone" : "Europe/Paris",
"continent" : "Europe"
},
"results" : [
{
"type" : "SSH scanner",
"confidence" : "high"
},
{
"type" : "Port scanner",
"confidence" : "low"
},
{
"type" : "SSH scanner",
"confidence" : "high"
}
],
"passive_dns": [
{
"last": "2019-02-15T23:00:00+00:00",
"hostname": "android.ydns.eu",
"address": "89.234.157.254",
"first": "2019-02-15T23:00:00+00:00",
"asset_type": "hostname"
},
{
"last": "2018-05-19T22:00:00+00:00",
"hostname": "89.234.157.254",
"address": "89.234.157.254",
"first": "2018-05-19T22:00:00+00:00",
"asset_type": "hostname"
},
{
"last": "2019-02-07T23:00:00+00:00",
"hostname": "marylou.nos-oignons.net",
"address": "89.234.157.254",
"first": "2017-04-19T06:13:35+00:00",
"asset_type": "hostname"
}
]
}
}
~~~
##### None
Adding the object
~~~
{
"name" : "IP",
"expression" : "full.ip_address",
"datatypes" : ["x", "y", "z"]
}
~~~
to the `fields` array of the corresponding analyzer in Track.json adds the line
~~~
IP: 1.1.1.1
~~~
to the report.
##### multiset
Adding the object
~~~
{
"name" : "Types",
"expression" : "full.results.[*].type",
"type" : "multiset",
"datatypes" : ["x", "y", "z"]
}
~~~
to the `fields` array of the corresponding analyzer in Track.json adds the following result to the report
~~~
Types: SSH scanner: 2, Port scanner : 1
~~~
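The counting behind the multiset type can be reproduced with `collections.Counter`. The snippet below is an illustrative sketch, not the tool's implementation: the helper `multiset` is a hypothetical name, and the list is the `results` array from the sample output above, already extracted from the JSON.

```python
from collections import Counter

# The "results" array from the sample analyzer output.
results = [
    {"type": "SSH scanner", "confidence": "high"},
    {"type": "Port scanner", "confidence": "low"},
    {"type": "SSH scanner", "confidence": "high"},
]


def multiset(objects, key):
    """Count how often each value of `key` occurs across a list of objects."""
    return Counter(obj[key] for obj in objects if key in obj)


counts = multiset(results, "type")
# most_common() orders the values from most to least frequent.
line = "Types: " + ", ".join(
    f"{value}: {count}" for value, count in counts.most_common()
)
print(line)
```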
##### object
Adding the object
~~~
{
"name" : "Location",
"expression" : "full.location",
"type" : "object",
"datatypes" : ["x", "y", "z"]
}
~~~
to the `fields` array of the corresponding analyzer in Track.json adds an HTML table to the report
![image](/uploads/b36a88601c2b0e6fbeca2edabd91e438/image.png)
##### subobject
Adding the object
```
{
"name" : "Location",
"expression" : "full.location.[city,postal_code,province,country,country_code,longitude,latitude]",
"type": "subobject",
"datatypes" : ["x", "y", "z"]
}
```
to the `fields` array of the corresponding analyzer in Track.json adds an HTML table to the report
![image](/uploads/96c1308fcfdcf9cb0cca0e4210b45a73/image.png)
Note: This can also be used to sort the fields.
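The selection and ordering that subobject performs can be pictured as a dictionary comprehension over the requested keys. This is a sketch under assumptions: `subobject` is a hypothetical helper rather than the tool's code, and it relies on dictionaries preserving insertion order, which Python guarantees from version 3.7.

```python
# A shortened copy of the "location" object from the sample analyzer output.
location = {
    "province": "Occitanie",
    "city": "Cazeres",
    "country": "France",
    "postal_code": "31220",
    "country_code": "FR",
}


def subobject(obj, keys):
    """Keep only the listed keys, in the listed order; missing keys are skipped."""
    return {key: obj[key] for key in keys if key in obj}


selected = subobject(location, ["city", "postal_code", "province", "country"])
print(selected)
```

Because the output follows the order of the key list rather than the order in the source object, the same expression doubles as a way to sort fields.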
##### object-array
Adding the object
~~~
{
"name": "Passive DNS",
"expression" : "full.passive_dns.[*].[hostname,first,last]",
"type": "object-array",
"datatypes" : ["x", "y", "z"]
}
~~~
to the `fields` array of the corresponding analyzer in Track.json adds an HTML table to the report
![image](/uploads/d66955e035ec9a1df1af5e3256505369/image.png)
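To make the object-array type concrete, the sketch below turns a list of objects into a bare HTML table with one column per selected key. It is an assumption-laden illustration, not the markup Feature Extractor actually generates: `object_array_to_html` is a hypothetical name and the table has no styling.

```python
from html import escape


def object_array_to_html(objects, keys):
    """Render the selected keys of each object as one row of an HTML table."""
    header = "".join(f"<th>{escape(key)}</th>" for key in keys)
    rows = []
    for obj in objects:
        # Missing keys become empty cells; values are escaped before embedding.
        cells = "".join(f"<td>{escape(str(obj.get(key, '')))}</td>" for key in keys)
        rows.append(f"<tr>{cells}</tr>")
    return f"<table><tr>{header}</tr>{''.join(rows)}</table>"


passive_dns = [
    {
        "last": "2019-02-15T23:00:00+00:00",
        "hostname": "android.ydns.eu",
        "address": "89.234.157.254",
        "first": "2019-02-15T23:00:00+00:00",
    }
]
table = object_array_to_html(passive_dns, ["hostname", "first", "last"])
print(table)
```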
# Usage
There are a few ways to run the tool. To quickly analyze one IoC, run
~~~
./analyze.py datatype:data
~~~
where `datatype` is one of `ip, domain, url, fqdn, hash` or `mail`
and `data` is the corresponding IoC.
#### input file
Alternatively, the analysis can be run on several IoCs at once from a file
~~~
./analyze.py -f file
~~~
The file has to consist of lines of the form `datatype:data`.
Example:
~~~
ip:1.1.1.1
domain:example.com
hash:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
~~~
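A line of this form can be validated before it is handed to the tool. The following is a minimal sketch, assuming the six datatypes listed above are the only valid ones; `parse_ioc_line` is a hypothetical helper, not part of analyze.py. It splits on the first colon only, so that URLs such as `url:https://example.com` keep their own colons.

```python
VALID_DATATYPES = {"ip", "domain", "url", "fqdn", "hash", "mail"}


def parse_ioc_line(line):
    """Split a 'datatype:data' line into its two parts and validate the datatype."""
    # partition() splits on the first ':' only, leaving later colons in the data.
    datatype, sep, data = line.strip().partition(":")
    if not sep or datatype not in VALID_DATATYPES or not data:
        raise ValueError(f"expected datatype:data, got {line!r}")
    return datatype, data


for example_line in ["ip:1.1.1.1", "domain:example.com", "url:https://example.com/a"]:
    print(parse_ioc_line(example_line))
```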
#### ioc_parser file
Alternatively, Feature Extractor accepts CSV files from [ioc_parser](https://github.com/armbues/ioc_parser) with
~~~
./analyze.py -i file
~~~
#### Help
~~~
./analyze.py --help
~~~
#### Notes
- The misp-warninglists path should be given in the form printed by the `pwd` command (an absolute path)
- Requires Python 3.6 or newer
{
"tracked" : [
{
"analyzer": "AbuseIPDB",
"service" : "AbuseIPDB",
"fields" : [
{
"name" : "Attack categories",
"expression" : "full.values.[*].categories_strings[*]",
"type": "multiset",
"datatypes": ["ip"]
}
]
},
{
"analyzer" : "Censys",
"service" : "Censys",
"fields" : [
{
"name" : "Message",
"expression" : "full.message",
"datatypes" : ["ip", "domain", "hash"]
},
{
"name" : "Open ports",
"expression" : "full.ip.ports[*]",
"datatypes" : ["ip"]
},
{
"name" : "Protocols",
"expression" : "full.ip.protocols[*]",
"datatypes" : ["ip"]
},
{
"name" : "Timestamp",
"expression" : "full.ip.updated_at",
"datatypes" : ["ip"]
},
{
"name" : "Location",
"expression" : "full.ip.location.[city,postal_code,province,country,country_code,longitude,latitude]",
"type": "subobject",
"datatypes" : ["ip"]
},
{
"name" : "Autonomous system",
"expression" : "full.ip.autonomous_system.[name,description,rir,path,asn,routed_prefix,country_code]",
"type": "subobject",
"datatypes" : ["ip"]
},
{
"name" : "Alexa rank",
"expression":"full.website.alexa_rank",
"datatypes": ["domain"]
},
{
"name" : "Open ports",
"expression" : "full.website.ports[*]",
"datatypes": ["domain"]
},
{
"name" : "Protocols",
"expression" : "full.website.protocols[*]",
"datatypes": ["domain"]
},
{
"name" : "Timestamp",
"expression" : "full.website.updated_at",
"datatypes": ["domain"]
}
]
},
{
"analyzer" : "CinCanTools",
"service" : "CinCanTools",
"fields": [
{
"name": "Majestic Top 10k",
"expression": "full.majestic_top10k",
"datatypes": ["domain"]
}
]
},
{
"analyzer" : "GoogleSafebrowsing",
"service" : "GoogleSafebrowsing",
"fields" : [
{
"name" : "Status",
"expression" : "summary.taxonomies.[0].level",
"datatypes" : ["domain"]
},
{
"name" : "Result",
"expression" : "summary.taxonomies.[0].value",
"datatypes" : ["domain"]
}
]
},
{
"analyzer" : "GreyNoise",
"service" : "GreyNoise",
"fields" : [
{
"name": "Status",
"expression" : "full.status",
"datatypes" : ["ip"]
},
{
"name" : "Records",
"expression" : "full.returned_count",
"datatypes" : ["ip"]
},
{
"name" : "Categories",
"expression" : "full.records.[*].category",
"type": "multiset",
"datatypes" : ["ip"]
},
{
"name" : "Types",
"expression" : "full.records.[*].name",
"type": "multiset",
"datatypes" : ["ip"]
}
]
},
{
"analyzer" : "DShield",
"service" : "DShield_lookup",
"fields" : [
{
"name" : "Summary",
"expression" : "summary.taxonomies[0].value",
"datatypes" : ["ip"]
},
{
"name" : "Reputation",
"expression" : "full.reputation",
"datatypes" : ["ip"]
},
{
"name" : "Threat feeds",
"expression" : "full.threatfeeds",
"type": "object",
"datatypes" : ["ip"]
}
]
},
{
"analyzer" :"OTXQuery",
"service" : "OTXQuery",
"fields": [
{
"name": "Passive DNS",
"expression" : "full.passive_dns.[*].[hostname,first,last]",
"type": "object-array",
"datatypes" : ["ip"]
},
{
"name": "Whois",
"expression" : "full.whois",
"type": "url",
"datatypes" : ["domain"]
},
{
"name": "Location",
"expression" : "full.[city,country_name,country_code,continent_code]",
"type": "subobject",
"datatypes" : ["domain"]
},
{
"name" : "ASN",