Pandas internal format for the ML part

This page documents the format produced by the Transform, Aggregate and Dimensionality reduction scripts. The format builds on pandas DataFrames and can be fed directly into the ML training and classification (inference) steps.

The format is generated once the Feature Extraction JSON scripts have run over all files. Their JSON output files are then combined into a single DataFrame.


import pandas as pd

# (... do the preprocessing, transforms, aggregations etc. ...)

# `data` holds the combined per-APK features
df = pd.DataFrame(data)
df.columns

> Index(['permissions.internet', 'permissions.bluetooth', 'permissions....', <...here is a long list of perms...>,
       'nr_of_tracking_domains_found', 'nr_of_known_clean_domains_found',
       'nr_of_unkown_domains_found',
       'metaDataName.UMENGAPPKEY', <...lots of API keys / metaDataNames ... >,
       'broadcastReceiverIntentFilterActionNames.XXX', <... lots of broadcast receiver names ...>,
       'dependency: beautifulsoup',
       'dependency: statistics', 'dependency: functools',
       'dependency: known-tracking-library', 'label'],
      dtype='object')

df.head()
> permissions.internet 	permissions.bluetooth 	permissions.webcam 	nr_of_tracking_domains_found 	nr_of_known_clean_domains_found 	nr_of_unkown_domains_found 	dependency: beautifulsoup 	dependency: statistics 	dependency: functools 	dependency: known-tracking-library 	label
ID 											
0 	1 	0 	0 	6 	3 	3 	0 	1 	0 	0 	0
1 	1 	0 	0 	6 	3 	3 	0 	1 	0 	0 	0
2 	1 	0 	0 	6 	3 	3 	0 	1 	0 	0 	0

Each row represents one APK. The row index (ID) is the ID of the APK, which is an arbitrary integer.
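As a minimal sketch of the row layout, assuming the per-APK features are already available as one dictionary per APK ID: the variable `features_by_apk` and its values are made up for illustration and are not part of the actual scripts.

```python
import pandas as pd

# Hypothetical per-APK feature records, keyed by APK ID (values are made up).
features_by_apk = {
    0: {"permissions.internet": 1, "nr_of_tracking_domains_found": 6, "label": 0},
    1: {"permissions.internet": 1, "permissions.bluetooth": 1,
        "nr_of_tracking_domains_found": 2, "label": 1},
}

# One row per APK; the outer dict keys become the row index.
df = pd.DataFrame.from_dict(features_by_apk, orient="index")
df.index.name = "ID"

# Assumption: a feature missing from an APK's record means "not present",
# so the resulting gaps are filled with 0.
df = df.fillna(0).astype(int)
```

Filling missing features with 0 is an assumption of this sketch; it matches the bit-vector meaning of the categorical columns but may not be appropriate for every column.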

Columns

Categorical

permissions.*: a bit vector of used permissions (according to the MANIFEST): 0 = permission not used, 1 = permission used
dependency: *: a bit vector denoting whether a certain library is used.
metaDataName.*: a bit vector denoting the presence of a certain API key name in the APK.
broadcastReceiverIntentFilterActionNames.*: a bit vector denoting the presence of a certain broadcast receiver channel name.
label: 0 = clean, 1 = contains trackers
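The bit-vector columns above could be derived roughly as follows. This is only a sketch: `permissions_by_apk` is a hypothetical input (a list of permission names per APK ID), and the real scripts may build these columns differently.

```python
import pandas as pd

# Hypothetical input: the permissions found for each APK ID.
permissions_by_apk = {
    0: ["internet"],
    1: ["internet", "bluetooth"],
    2: [],
}

# Every permission seen in any APK becomes one column.
all_permissions = sorted({p for perms in permissions_by_apk.values() for p in perms})

# 1 = permission used by this APK, 0 = not used.
rows = {
    apk_id: {f"permissions.{p}": int(p in perms) for p in all_permissions}
    for apk_id, perms in permissions_by_apk.items()
}
bits = pd.DataFrame.from_dict(rows, orient="index")
bits.index.name = "ID"
```

The same pattern would apply to the dependency, metaDataName and broadcastReceiverIntentFilterActionNames columns by changing the column prefix.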

Continuous

nr_of_tracking_domains_found: integer. Number of domains which were in the list of known tracking domains.
nr_of_known_clean_domains_found: integer. Number of domains which were in the Alexa-Top-N list but not in the tracking domains list.
nr_of_unkown_domains_found: integer. Number of domains which were in neither list.
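The three counts can be illustrated with plain set operations. This is a sketch under assumptions: the function name `count_domains` is hypothetical, both lists are given as Python sets, and each distinct domain is counted once (the page does not say whether repeated occurrences are counted).

```python
def count_domains(domains, tracking_domains, clean_domains):
    """Count distinct domains per category (hypothetical helper)."""
    found = set(domains)  # assumption: each domain counts once per APK
    tracking = found & tracking_domains
    # "Clean" means: in the top-N list but not in the tracking list.
    clean = (found & clean_domains) - tracking_domains
    unknown = found - tracking_domains - clean_domains
    return {
        "nr_of_tracking_domains_found": len(tracking),
        "nr_of_known_clean_domains_found": len(clean),
        "nr_of_unkown_domains_found": len(unknown),
    }


# Made-up example domains; "both.example" is on both lists and counts as tracking.
counts = count_domains(
    ["tracker.example", "clean.example", "both.example", "other.example"],
    tracking_domains={"tracker.example", "both.example"},
    clean_domains={"clean.example", "both.example"},
)
```

With this definition the three counts always add up to the number of distinct domains found in the APK.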
