Pandas internal format for the ML part
This page documents the format generated by the Transform, Aggregate and Dimensionality reduction scripts. The format builds on Pandas DataFrames and can be fed directly into both the ML training and the classification (inference) steps.
The format is generated after the Feature Extraction scripts have run over all files. The JSON files they produce are then combined into a single DataFrame.
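The combining step could look roughly like the sketch below. It assumes that each JSON file holds one flat object mapping feature names to numeric values for a single APK, plus an integer `id` field. The directory layout, the `id` field and the `load_feature_files` helper are hypothetical; the source does not specify them.

```python
import json
from pathlib import Path

import pandas as pd


def load_feature_files(directory):
    """Combine per-APK JSON feature files into one DataFrame (hypothetical helper)."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, encoding="utf-8") as fh:
            records.append(json.load(fh))
    df = pd.DataFrame.from_records(records)
    # Assumption: every record carries an integer "id" identifying the APK.
    df = df.set_index("id")
    df.index.name = "ID"
    # Assumption: all features are numeric, and a feature missing from
    # one APK's file means "not present", so it is filled with 0.
    return df.fillna(0).astype(int)
```

Filling missing features with 0 is a design choice of this sketch: it keeps every bit-vector column strictly 0/1 even when an APK's file never mentions that feature.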
import pandas as pd
# (... do the preprocessing, transforms, aggregations etc. ...)
df = pd.DataFrame(data)
df.columns
> Index(['permissions.internet', 'permissions.bluetooth', 'permissions....', <...here is a long list of perms...>,
'nr_of_tracking_domains_found', 'nr_of_known_clean_domains_found',
'nr_of_unkown_domains_found',
'metaDataName.UMENGAPPKEY', <...lots of API keys / metaDataNames ... >,
'broadcastReceiverIntentFilterActionNames.XXX', <... lots of broadcast receiver names ...>,
'dependency: beautifulsoup',
'dependency: statistics', 'dependency: functools',
'dependency: known-tracking-library', 'label'],
dtype='object')
df.head()
>     permissions.internet  permissions.bluetooth  permissions.webcam  nr_of_tracking_domains_found  nr_of_known_clean_domains_found  nr_of_unkown_domains_found  dependency: beautifulsoup  dependency: statistics  dependency: functools  dependency: known-tracking-library  label
  ID
  0                      1                      0                   0                             6                                3                           3                          0                       1                      0                                   0      0
  1                      1                      0                   0                             6                                3                           3                          0                       1                      0                                   0      0
  2                      1                      0                   0                             6                                3                           3                          0                       1                      0                                   0      0
Each row represents one APK. The row index (ID) is the ID of the APK, which is an arbitrary integer.
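Because each row is one APK and the target sits in the label column, separating the features from the target is a single column drop. The sketch below uses a tiny hand-written frame with a subset of the columns; the values and the `split_features_and_label` helper are illustrative assumptions, not part of the pipeline.

```python
import pandas as pd


def split_features_and_label(df, label_column="label"):
    """Return (X, y): all feature columns, and the label column (illustrative helper)."""
    X = df.drop(columns=[label_column])
    y = df[label_column]
    return X, y


# Toy frame in the documented layout (made-up values, subset of columns).
df = pd.DataFrame(
    {
        "permissions.internet": [1, 1, 0],
        "nr_of_tracking_domains_found": [6, 0, 2],
        "dependency: known-tracking-library": [0, 0, 1],
        "label": [1, 0, 1],
    },
    index=pd.Index([0, 1, 2], name="ID"),
)

X, y = split_features_and_label(df)
```

Both `X` and `y` keep the APK ID as their index, so predictions can be traced back to the APK they belong to.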
Columns
Categorical
- permissions.*: a bit vector of the permissions used (according to the MANIFEST): 0 = permission not used, 1 = permission used
- dependency.*: a bit vector which denotes whether a given library is used. Note that these columns are named with a colon and a space ('dependency: <library>') rather than a dot.
- metaDataName.*: a bit vector which denotes whether a given API key name is present in the APK
- broadcastReceiverIntentFilterActionNames.*: a bit vector which denotes whether a given broadcast receiver channel name is present.
- label: 0 = clean, 1 = contains trackers
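One way such bit-vector columns could be built from per-APK lists of names is sketched below. The `to_bit_vector` helper and the sample permission lists are assumptions made for illustration; the actual Transform scripts may work differently.

```python
import pandas as pd


def to_bit_vector(names_per_apk, prefix):
    """Turn {apk_id: [name, ...]} into 0/1 columns named '<prefix><name>' (illustrative)."""
    # Every name seen in any APK becomes one column.
    vocabulary = sorted({name for names in names_per_apk.values() for name in names})
    rows = {
        apk_id: {prefix + name: int(name in names) for name in vocabulary}
        for apk_id, names in names_per_apk.items()
    }
    df = pd.DataFrame.from_dict(rows, orient="index")
    df.index.name = "ID"
    return df


# Made-up input: permissions found in the manifest of each APK.
permissions = {0: ["internet"], 1: ["internet", "bluetooth"], 2: []}
bits = to_bit_vector(permissions, prefix="permissions.")
```

The same helper would cover the other families by changing the prefix, for example `"metaDataName."` or `"dependency: "`.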
Continuous
- nr_of_tracking_domains_found: integer. Number of domains which were in the list of known tracking domains
- nr_of_known_clean_domains_found: integer. Number of domains which were in the Alexa-Top-N list but not in the tracking domains list
- nr_of_unkown_domains_found: integer. Number of domains which were in neither list.
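Taken together, the three definitions put every domain found in an APK into exactly one counter. A minimal sketch, assuming that both lists are available as Python sets and that each distinct domain is counted once; the `count_domains` helper and the sample domains are hypothetical.

```python
def count_domains(found_domains, tracking_domains, clean_domains):
    """Classify each distinct domain into exactly one of the three counters."""
    found = set(found_domains)  # assumption: duplicates are counted once
    tracking = found & tracking_domains
    # "Clean" means on the Alexa-Top-N list but NOT on the tracking list.
    clean = (found & clean_domains) - tracking_domains
    unknown = found - tracking_domains - clean_domains
    return {
        "nr_of_tracking_domains_found": len(tracking),
        "nr_of_known_clean_domains_found": len(clean),
        "nr_of_unkown_domains_found": len(unknown),
    }


# Made-up example lists.
tracking_list = {"tracker.example", "ads.example"}
alexa_top_n = {"ads.example", "news.example"}
counts = count_domains(
    ["tracker.example", "ads.example", "news.example", "other.example"],
    tracking_list,
    alexa_top_n,
)
```

In this example `ads.example` appears on both lists and is counted as a tracking domain only, which matches the "but not in the tracking domains list" rule above.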