Filtering based on a large subset of photo ids
I wanted to download the subset of yfcc100m consisting of the 15 million images that were used to train CLIP. Information about the subset can be found here. In my experience, the filter function in this package was too slow/impractical when filtering on a large list (15 million) of photo ids. After I contacted @jfolz by email for advice, he suggested an alternative approach: dump the rows from the database and perform the filtering on larger chunks of the data. I got it working and will share my approach here for others.
Step 1: Dump rows from the database to Python, filter them, and write them to a new SQLite db
We create a new database called "yfcc15m_dataset.db". generate_rows_from_db() from yfcc100m.convert_metadata can be used to extract rows from the original database efficiently. Due to memory restrictions I filtered and wrote to the new DB after every 1 million rows extracted from the original database. You can tweak that chunk size to your own preferences.
import pandas as pd
import sqlite3
from yfcc100m.convert_metadata import generate_rows_from_db
from datadings.tools import yield_threaded
from tqdm import tqdm

df = pd.read_table(
    "yfcc100m_subset_data.tsv", header=None, names=["line_number", "identifier", "hash"]
)
clip_photoids = df["identifier"].tolist()  # 15 million ids
del df

rows = generate_rows_from_db("yfcc100m_dataset.sql", "yfcc100m_dataset")
gen = yield_threaded(rows)

conn = sqlite3.connect("yfcc15m_dataset.db")
c = conn.cursor()
c.execute(
    """CREATE TABLE IF NOT EXISTS yfcc15m_dataset
    (photoid integer primary key, uid text, unickname text,
    datetaken text, dateuploaded text, capturedevice text, title text,
    description text, usertags text, machinetags text, longitude text,
    latitude text, accuracy integer, pageurl text, downloadurl text,
    licensename text, licenseurl text, serverid integer, farmid integer,
    secret text, secretoriginal text, ext text, marker integer)"""
)
chunk = []
for i, row in enumerate(tqdm(gen, total=100000000)):
    chunk.append(row)
    # Filter and write to the new DB after every 1 million rows.
    if (i + 1) % 1000000 == 0:
        df = pd.DataFrame(
            chunk,
            columns=[
                "photoid",
                "uid",
                "unickname",
                "datetaken",
                "dateuploaded",
                "capturedevice",
                "title",
                "description",
                "usertags",
                "machinetags",
                "longitude",
                "latitude",
                "accuracy",
                "pageurl",
                "downloadurl",
                "licensename",
                "licenseurl",
                "serverid",
                "farmid",
                "secret",
                "secretoriginal",
                "ext",
                "marker",
            ],
        )
        # Keep only the rows whose photoid is in the CLIP subset.
        df = df[df["photoid"].isin(clip_photoids)].reset_index(drop=True)
        for record in df.itertuples(index=False, name=None):
            c.execute(
                """INSERT INTO yfcc15m_dataset VALUES
                (?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                ?, ?, ?, ?, ?, ?, ?, ?, ?, ?,
                ?, ?, ?)""",
                record,
            )
        conn.commit()
        chunk = []
# 100 million rows split evenly into chunks of 1 million,
# so no unwritten rows remain in `chunk` after the loop.
c.close()
conn.close()
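To confirm the filtering worked, a quick sanity check is to count the rows in the new database; the number should match the size of your id list (about 15 million). A minimal sketch, assuming the file and table names used above:

import sqlite3

# Count the rows written to the filtered database; should be roughly 15 million.
conn = sqlite3.connect("yfcc15m_dataset.db")
c = conn.cursor()
c.execute("SELECT COUNT(*) FROM yfcc15m_dataset")
print(c.fetchone()[0])
conn.close()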
Step 2: Change name of db file and table
Move the old db file to a different folder as a backup before renaming the new database file.
import sqlite3

"""
Move yfcc100m_dataset.sql to another folder before running this script.
We rename the yfcc15m_dataset table because the yfcc100m package is hardcoded
to read a db file called yfcc100m_dataset.sql containing a table named
yfcc100m_dataset.
"""
conn = sqlite3.connect("yfcc15m_dataset.db")
c = conn.cursor()
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(c.fetchall())
c.execute("ALTER TABLE yfcc15m_dataset RENAME TO yfcc100m_dataset;")
conn.commit()
c.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(c.fetchall())  # New name should be printed
conn.close()
I then manually renamed the database file in the file explorer from yfcc15m_dataset.db to yfcc100m_dataset.sql.
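If you prefer to do the rename programmatically instead of in the file explorer, a minimal sketch (assuming the original yfcc100m_dataset.sql has already been moved away as a backup) could be:

import shutil

# Give the filtered database the file name the yfcc100m package expects.
# Assumes the original yfcc100m_dataset.sql was already moved elsewhere as a backup.
shutil.move("yfcc15m_dataset.db", "yfcc100m_dataset.sql")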
Step 3: Create metadata shards and download images
To make metadata shards from your newly created database, you can run convert_metadata in the terminal as outlined in the regular package documentation, but with the option --skip-verification added.
python -m yfcc100m.convert_metadata <input dir> -o <meta dir> --skip-verification
After the metadata shards are created, images can be downloaded as usual:
python -m yfcc100m.download <meta dir> -o <zip dir>
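For example, with the renamed yfcc100m_dataset.sql in the current directory (I assume &lt;input dir&gt; is the folder containing that file) and using meta and zip as output directory names (zip is just a name I picked; adjust to taste), the two commands could look like this:

python -m yfcc100m.convert_metadata . -o meta --skip-verification
python -m yfcc100m.download meta -o zip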
Step 4 (optional): Write metadata+key to csv
If you want to join the "key" of each image (i.e. where it is stored) with the existing metadata, you are in luck, because this data already exists as part of the metadata shards created earlier. To save it to a tabular format, there are convenience functions in yfcc100m which can help. Here is how I did it:
import pandas as pd
import os
from pathlib import Path
from tqdm import tqdm
from yfcc100m.tools import load_metadata

# Load every metadata shard and collect its records into a DataFrame.
meta_shards = os.listdir("meta")
df_list = []
for shard in tqdm(meta_shards):
    metadata_generator = load_metadata(Path("meta"), Path(shard).stem)
    df_list.append(pd.DataFrame(metadata_generator))

df = pd.concat(df_list)
# Keep the columns of interest; "key" tells you where each image is stored.
df = df[["photoid", "uid", "title", "description", "dateuploaded", "downloadurl", "ext", "key"]]
df = df.reset_index(drop=True)
df.to_csv("yfcc15m.csv.gz", index=False)
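As a quick check, the resulting file can be read back with pandas; the row count should again match the size of the subset:

import pandas as pd

# Read the compressed CSV back in; the row count should be roughly 15 million.
df = pd.read_csv("yfcc15m.csv.gz")
print(len(df))
print(df.head())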
Another convenient option would be to use the datadings format created by the yfcc100m package author.