1. Overview

This site collects the following related artifacts:

  1. CodeQL databases built from open-source projects.
  2. CodeQL analysis results for those projects.

The intended uses are:

  1. Provide ready-made databases for CodeQL query development.
  2. Provide large ready-made analysis result sets for work on presenting and disseminating results that may number in the millions.
  3. Provide databases for the MRVA project.

Future work includes summary pages that explain the workflow and data locations, and that compare full tracing builds with --build-mode=none.

The corpus is based on deployed open-source software rather than arbitrary GitHub repositories. The databases were built from a random sample of 7000 Debian projects drawn from a much larger Debian package population. Of those 7000 projects, 2487 successfully produced a C/C++ CodeQL database using full build tracing. Of those projects, 843 also successfully produced a C/C++ CodeQL database using --build-mode=none, where CodeQL reads the source tree without tracing a build. No other languages are included yet.

This document follows the standard, but often under-explained, end-to-end CodeQL path:

  1. Convert source code into CodeQL database bundles. The database is the artifact that CodeQL queries examine.
  2. Apply query suites to one or more CodeQL databases. Each analyzed database produces one SARIF file containing zero or more results.
  3. Build reports and comparisons from the SARIF results.

The same path applies to one database and to large corpora, but the scale changes the engineering problem.

  1. Single database: choose one source tree, build one database, and run one query suite. If 20 queries produce results, the output is small enough for direct review and direct SARIF inspection.
  2. Corpus scale: choose 2487 source trees, build two databases per project when comparing full build tracing (BMF) with --build-mode=none (BMN), and run one query suite. If 20 queries produce results per database, that is about 2487 * 2 * 20 = 99480 result groups before considering multiple findings per query. Direct SARIF inspection is no longer practical; SQL ingestion becomes useful.
  3. Large recurring scale: choose 120000 source trees, build one database per project, and run one query suite weekly. If 20 queries produce results per database, one run can produce about 120000 * 20 = 2400000 result groups. Keeping recent weekly runs plus monthly snapshots quickly reaches tens of millions of result groups. At that point, a high-performance SQL engine such as DuckDB is required for comparison and reporting.
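
The estimates above are products of projects, databases per project, and queries that produce results. A minimal shell sketch of that arithmetic follows. The function name result_groups and the retention figures (8 weekly runs plus 12 monthly snapshots) are illustrative assumptions, not values taken from this site.

```shell
# result_groups PROJECTS DBS_PER_PROJECT QUERIES_WITH_RESULTS
# Prints the number of (database, query) result groups for one run.
result_groups() {
  echo $(( $1 * $2 * $3 ))
}

result_groups 2487 2 20      # corpus scale, BMF and BMN: 99480
result_groups 120000 1 20    # large recurring scale, one run: 2400000

# Hypothetical retention policy: 8 weekly runs plus 12 monthly snapshots.
retained_runs=$(( 8 + 12 ))
echo $(( $(result_groups 120000 1 20) * retained_runs ))   # 48000000
```

Under that assumed retention policy the stored total is 48 million result groups, which is consistent with the "tens of millions" estimate.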

The following sections describe the general workflow used across the source repository corpus. For two concrete end-to-end examples, see the detailed QEMU and DPDK directories. They follow the same create/analyze workflow, with additional logging for CPU, RAM, disk use, package events, SQLite import, and gnuplot graph generation.

2. From code to CodeQL DB

Start with a source tree, referred to here as source. From it, create a CodeQL database, codeql_DB_type, where type is either full build tracing (bmf) or build-mode=none (bmn). The resulting database directory is zipped so it can be analyzed later or moved between hosts.

The two database creation modes are:

  • bmf, build-mode full: CodeQL traces a real build with codeql database create ... --command='…' …. Successful database bundles are under data/codeql-db-zips-bm-full/.
  • bmn, build-mode none: CodeQL scans the visible C/C++ source without running a build, using codeql database create ... --build-mode=none .... Successful database bundles are under data/codeql-db-zips-bm-none/.

2.1. Results for Full Build Trace

This is labeled bmf, for build-mode full. CodeQL observes a real build, so the database reflects files and compiler options that actually participated in that build.

codeql database create DB_DIR \
  --language=cpp \
  --source-root=SOURCE_DIR \
  --command='debian/rules build' \
  --threads="$THREADS" \
  --ram="$RAM_MB"

codeql database bundle --output PROJECT.zip DB_DIR

The corpus-level BMF database bundles are available at data/codeql-db-zips-bm-full/; a paginated searchable listing follows. There are 2487 successful BMF DB zip bundles.

2.2. Results for build-mode=none

This is labeled bmn, for build-mode=none. CodeQL does not run the build; it decides what to extract from the source tree alone.

The core CodeQL command used to build every one of these DBs is:

codeql database create \
  --language=cpp \
  --source-root="$SOURCE_ROOT" \
  --threads="$THREADS" \
  --ram="$RAM_MB" \
  --build-mode=none \
  "$DB_PATH"

The corpus-level BMN database bundles are available at data/codeql-db-zips-bm-none/; a paginated searchable listing follows. There are 843 successful BMN DB zip bundles.
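
The two creation commands differ only in one option, so a single wrapper can cover both modes. The following is a minimal sketch, not the script used to build this corpus. The function name create_and_bundle, the BUILD_CMD variable, and the default values for THREADS and RAM_MB are assumptions made for illustration.

```shell
# create_and_bundle MODE SOURCE_DIR DB_DIR ZIP
# MODE is "bmf" (trace a real build) or "bmn" (--build-mode=none).
# The body runs in a subshell, so its variables stay local and "exit"
# only ends this function call.
create_and_bundle() (
  mode=$1 source_dir=$2 db_dir=$3 zip=$4

  # Replace the positional parameters with the mode-specific option.
  case "$mode" in
    bmf) set -- --command="${BUILD_CMD:-debian/rules build}" ;;
    bmn) set -- --build-mode=none ;;
    *)   echo "unknown mode: $mode" >&2; exit 2 ;;
  esac

  # THREADS and RAM_MB defaults are placeholders, not corpus settings.
  codeql database create "$db_dir" \
    --language=cpp \
    --source-root="$source_dir" \
    --threads="${THREADS:-1}" \
    --ram="${RAM_MB:-8192}" \
    "$@" || exit 1

  # Bundle only after a successful create, so failed builds leave no zip.
  codeql database bundle --output "$zip" "$db_dir"
)
```

A call such as create_and_bundle bmn "$SOURCE_ROOT" "$DB_PATH" PROJECT.zip would then produce one BMN bundle, and the same call with bmf would trace the build instead.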

3. From CodeQL DB to results

After database creation, DBs from both modes are analyzed with a CodeQL query suite; standard preinstalled query suites useful for C/C++ include:

  • cpp-security-and-quality.qls: broad security and quality suite; the one used here.
  • cpp-security-extended.qls: security-focused suite with extended checks.
  • cpp-code-scanning.qls: default code-scanning-oriented suite.

Individual query packs or query paths under the installed CodeQL checkout can be used in place of a suite.

Here, CodeQL version 2.23.6 was used.

The analysis steps are identical for bmf and bmn; only the DB zip and output SARIF directory change.

  1. Choose the DB zip.
  2. Extract it to a temporary directory.
  3. Find the directory containing codeql-database.yml.
  4. Run codeql database analyze with the selected suite.
  5. Store the SARIF and retain stdout, stderr, resource logs, and status.

The core CodeQL command used to analyze every CodeQL DB is:

codeql database analyze \
  --format=sarif-latest \
  --rerun \
  -j"$THREADS" \
  --ram="$RAM_MB" \
  --output "$SARIF" \
  -- "$DB_DIR" cpp-security-and-quality.qls
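
The five analysis steps can be wrapped around this command in a small script. The sketch below is one possible arrangement, not the corpus tooling: the function names find_db_dir and analyze_zip, the log file names, and the THREADS and RAM_MB defaults are assumptions, and resource logging (CPU, RAM, disk) is left out. It also assumes unzip is installed.

```shell
# find_db_dir EXTRACT_DIR
# Step 3: print the directory that contains codeql-database.yml.
find_db_dir() (
  yml=$(find "$1" -type f -name codeql-database.yml | head -n 1)
  [ -n "$yml" ] || exit 1
  dirname "$yml"
)

# analyze_zip DB_ZIP SARIF LOG_DIR
# Steps 2 to 5 for one DB zip (step 1 is choosing DB_ZIP).
analyze_zip() (
  db_zip=$1 sarif=$2 log_dir=$3
  work=$(mktemp -d) || exit 1
  trap 'rm -rf "$work"' EXIT            # remove the extracted DB on exit
  mkdir -p "$log_dir"

  unzip -q "$db_zip" -d "$work" || exit 1        # step 2
  db_dir=$(find_db_dir "$work") || exit 1        # step 3

  # Step 4: same command as above, with stdout and stderr captured.
  status=0
  codeql database analyze \
    --format=sarif-latest \
    --rerun \
    -j"${THREADS:-1}" \
    --ram="${RAM_MB:-8192}" \
    --output "$sarif" \
    -- "$db_dir" cpp-security-and-quality.qls \
    >"$log_dir/stdout.log" 2>"$log_dir/stderr.log" || status=$?

  # Step 5: keep the exit status next to the logs.
  echo "$status" >"$log_dir/status"
  exit "$status"
)
```

Recording the status in a file keeps failed analyses visible later, which helps when the number of SARIF files differs from the number of DB zips.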

3.1. SARIF files produced for bm-full

The work here used the C/C++ cpp-security-and-quality suite. The BMF SARIF outputs are available at data/bmf-db-cpp-security-and-quality-sarifs/. There are 2486 BMF C/C++ cpp-security-and-quality SARIF files, one fewer than the 2487 BMF database bundles.
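
For SQL ingestion, each SARIF file first has to be flattened into rows. The sketch below shows one way to do that with jq, which is assumed to be installed. The function name sarif_to_csv and the choice of columns (file name, rule id, source path, start line) are illustrative assumptions, not the schema used on this site.

```shell
# sarif_to_csv SARIF_FILE
# Prints one CSV row per result: sarif file, rule id, source path, start line.
# Missing fields become empty strings, so every row has four columns.
sarif_to_csv() {
  jq -r --arg file "$(basename "$1")" '
    .runs[]?
    | .results[]?
    | [ $file,
        (.ruleId // ""),
        (.locations[0].physicalLocation.artifactLocation.uri // ""),
        (.locations[0].physicalLocation.region.startLine // "")
      ]
    | @csv
  ' "$1"
}
```

Running the function over every file in a SARIF directory and concatenating the output gives a single CSV that SQLite or DuckDB can import.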

3.2. SARIF files produced for bm-none

This section also used the C/C++ cpp-security-and-quality suite. The BMN SARIF outputs are available at data/bmn-db-cpp-security-and-quality-sarifs/.

There are 832 BMN C/C++ cpp-security-and-quality SARIF files, significantly fewer than the 2486 produced from the full-build databases.
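
The SARIF counts are slightly lower than the bundle counts (2486 versus 2487 for BMF, 832 versus 843 for BMN). A quick way to find the affected projects is to compare the two directories by name. The sketch below assumes that PROJECT.zip pairs with PROJECT.sarif; that naming rule and the function name missing_sarifs are assumptions and may need adjusting to the actual file names.

```shell
# missing_sarifs ZIP_DIR SARIF_DIR
# Prints project names that have a DB bundle but no SARIF file.
missing_sarifs() (
  zips=$(mktemp) || exit 1
  sarifs=$(mktemp) || exit 1
  trap 'rm -f "$zips" "$sarifs"' EXIT

  # Strip the extensions so both lists hold bare project names.
  for f in "$1"/*.zip; do
    [ -e "$f" ] && basename "$f" .zip
  done | sort >"$zips"
  for f in "$2"/*.sarif; do
    [ -e "$f" ] && basename "$f" .sarif
  done | sort >"$sarifs"

  # comm -23 keeps the lines that appear only in the first sorted list.
  comm -23 "$zips" "$sarifs"
)
```

If the naming assumption holds, missing_sarifs data/codeql-db-zips-bm-full data/bmf-db-cpp-security-and-quality-sarifs would be expected to list 2487 - 2486 = 1 project, and the BMN directories 843 - 832 = 11 projects.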

4. Comparisons

Future work.