1. Overview
This site collects two related artifacts:
- CodeQL databases built from open-source projects.
- CodeQL analysis results for those projects.
The intended uses are:
- Provide ready-made databases for CodeQL query development.
- Provide large ready-made analysis result sets for work on presenting and disseminating results that may number in the millions.
- Provide databases for the MRVA project.
Future work includes summary pages that explain the workflow, data locations, and
comparisons of full tracing builds to --build-mode=none.
The corpus is based on deployed open-source software rather than arbitrary
GitHub repositories. The databases were built from a random sample of 7000
Debian projects drawn from a much larger Debian package population. Of those
7000 projects, 2487 successfully produced a C/C++ CodeQL database using full
build tracing. Of those projects, 843 also successfully produced a C/C++
CodeQL database using --build-mode=none, where CodeQL reads the source tree
without tracing a build. No other languages are included yet.
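The yields implied by these counts can be checked with a few lines of shell. This is only an illustrative sketch: the variable and function names are made up here and are not part of the corpus tooling.

```shell
#!/bin/sh
# Counts taken from the paragraph above; names are illustrative only.
sampled=7000   # Debian projects in the random sample
bmf_ok=2487    # projects that produced a database with full build tracing
bmn_ok=843     # of those, projects that also produced a --build-mode=none database

# pct NUM DEN: print NUM/DEN as a percentage with one decimal place.
pct() { LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f", 100 * n / d }'; }

bmf_rate=$(pct "$bmf_ok" "$sampled")   # share of the sample that built with tracing
bmn_rate=$(pct "$bmn_ok" "$bmf_ok")    # share of traced successes that also built as BMN

echo "full build tracing: ${bmf_rate}% of ${sampled} sampled projects"
echo "build-mode=none:    ${bmn_rate}% of ${bmf_ok} traced successes"
```

Roughly a third of the sample survives each step, so the paired full-tracing and build-mode=none set is about 12% of the original 7000 projects.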
This document follows the standard, but often under-explained, end-to-end CodeQL path:
- Convert source code into CodeQL database bundles. The database is the artifact that CodeQL queries examine.
- Apply query suites to one or more CodeQL databases. Each analyzed database produces one SARIF file containing zero or more results.
- Build reports and comparisons from the SARIF results.
The same path applies to one database and to large corpora, but the scale changes the engineering problem.
- Single database: choose one source tree, build one database, and run one query suite. If 20 queries produce results, the output is small enough for direct review and direct SARIF inspection.
- Corpus scale: choose 2487 source trees, build two databases per project when comparing BMF (full build tracing) and BMN (build-mode=none), and run one query suite. If 20 queries produce results per database, that is about 2487 * 2 * 20 = 99480 result groups before considering multiple findings per query. Direct SARIF inspection is no longer practical; SQL ingestion becomes useful.
- Large recurring scale: choose 120000 source trees, build one database per project, and run one query suite weekly. If 20 queries produce results per database, one run can produce about 120000 * 20 = 2400000 result groups. Keeping recent weekly runs plus monthly snapshots quickly reaches tens of millions of result groups. At that point, a high-performance SQL engine such as DuckDB is required for comparison and reporting.
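The estimates for the three scales can be reproduced with shell arithmetic. The sketch below is illustrative: the function name is hypothetical, and the retention policy (8 weekly runs plus 12 monthly snapshots) is an assumption chosen only to show how "tens of millions" arises, not a policy stated for this corpus.

```shell
#!/bin/sh
# result_groups PROJECTS DBS_PER_PROJECT QUERIES_WITH_RESULTS
# Prints the number of (database, query) result groups. Each group may hold
# many individual findings, so this is a lower bound on the row count.
result_groups() { echo $(( $1 * $2 * $3 )); }

single=$(result_groups 1 1 20)         # one source tree, one database
corpus=$(result_groups 2487 2 20)      # BMF and BMN databases per project
weekly=$(result_groups 120000 1 20)    # one weekly run over the large corpus

# ASSUMPTION: keep 8 recent weekly runs plus 12 monthly snapshots.
retained=$(( weekly * (8 + 12) ))

echo "single database: $single"
echo "corpus scale:    $corpus"
echo "one weekly run:  $weekly"
echo "retained runs:   $retained"
```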
The following sections describe the general workflow used across the source repository corpus. For two concrete end-to-end examples, see the detailed QEMU and DPDK directories. They follow the same create/analyze workflow, with additional logging for CPU, RAM, disk use, package events, SQLite import, and gnuplot graph generation.
2. From code to CodeQL DB
Start with a source tree, referred to here as source. Create a CodeQL database
codeql_DB_type from source, where type is either full build tracing or
build-mode=none. The resulting database directory is zipped so it can be
analyzed later or moved between hosts.
The two database creation modes are:
- bmf, build-mode full: CodeQL traces a real build with codeql database create ... --command='…' …. Successful database bundles are under data/codeql-db-zips-bm-full/.
- bmn, build-mode none: CodeQL scans visible C/C++ source without running the build, using codeql database create ... --build-mode=none .... Successful database bundles are under data/codeql-db-zips-bm-none/.
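The two modes differ only in one argument, which a small wrapper can make explicit. The following is a minimal sketch, not the script used to build the corpus: the function name create_db, the DRY_RUN switch, the BUILD_CMD variable, and the default thread and RAM values are all assumptions made for illustration. Only the codeql flags themselves come from the commands shown in this document.

```shell
#!/bin/sh
# create_db MODE DB_DIR SOURCE_DIR
# MODE is bmf (full build tracing) or bmn (--build-mode=none).
# With DRY_RUN=1 (the default here) the command is printed, not executed.
create_db() {
  mode=$1 db_dir=$2 source_dir=$3
  # Arguments shared by both modes; default values are illustrative.
  set -- codeql database create "$db_dir" \
    --language=cpp \
    --source-root="$source_dir" \
    --threads="${THREADS:-4}" \
    --ram="${RAM_MB:-8192}"
  case "$mode" in
    bmf) set -- "$@" --command="${BUILD_CMD:-debian/rules build}" ;;
    bmn) set -- "$@" --build-mode=none ;;
    *)   echo "unknown mode: $mode" >&2; return 1 ;;
  esac
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    # Create the database, then zip it for later analysis or transfer.
    "$@" && codeql database bundle --output "$db_dir.zip" "$db_dir"
  fi
}

# Dry-run demonstration with placeholder paths.
create_db bmf /tmp/demo-db-bmf /tmp/demo-src
create_db bmn /tmp/demo-db-bmn /tmp/demo-src
```

Keeping the shared arguments in one place makes it harder for the two database sets to drift apart in anything other than the build mode, which matters when the results are compared later.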
2.1. Results for Full Build Trace
This is labeled bmf, for build-mode full. CodeQL observes a real build, so
the database reflects files and compiler options that actually participated in
that build.
    codeql database create DB_DIR \
      --language=cpp \
      --source-root=SOURCE_DIR \
      --command='debian/rules build' \
      --threads="$THREADS" \
      --ram="$RAM_MB"
    codeql database bundle --output PROJECT.zip DB_DIR
The corpus-level BMF database bundles are available at data/codeql-db-zips-bm-full/; a paginated searchable listing follows. There are 2487 successful BMF DB zip bundles.
2.2. Results for build-mode=none
This is labeled bmn, for build-mode=none. CodeQL does not run the build; it
infers extraction from the source tree.
The core CodeQL command used to build every one of these DBs is:

    codeql database create \
      --language=cpp \
      --source-root="$SOURCE_ROOT" \
      --threads="$THREADS" \
      --ram="$RAM_MB" \
      --build-mode=none \
      "$DB_PATH"
The corpus-level BMN database bundles are available at data/codeql-db-zips-bm-none/; a paginated searchable listing follows. There are 843 successful BMN DB zip bundles.
3. From CodeQL DB to results
After database creation, DBs from both modes are analyzed with a CodeQL query suite; standard preinstalled query suites useful for C/C++ include:
- cpp-security-and-quality.qls: broad security and quality suite used here.
- cpp-security-extended.qls: security-focused suite with extended checks.
- cpp-code-scanning.qls: default code-scanning-oriented suite.
- Individual query packs or query paths under the installed CodeQL checkout.
Here, CodeQL version 2.23.6 was used.
The analysis steps are identical for bmf and bmn; only the DB zip and
output SARIF directory change.
- Choose the DB zip.
- Extract it to a temporary directory.
- Find the directory containing codeql-database.yml.
- Run codeql database analyze with the selected suite.
- Store the SARIF and retain stdout, stderr, resource logs, and status.
The core CodeQL command used to analyze every CodeQL DB is:

    codeql database analyze \
      --format=sarif-latest \
      --rerun \
      -j"$THREADS" \
      --ram="$RAM_MB" \
      --output "$SARIF" \
      -- "$DB_DIR" cpp-security-and-quality.qls
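The analysis steps and the core command can be combined into one function per DB zip. This is a hedged sketch rather than the corpus script: the function names, the log file names, and the default thread and RAM values are hypothetical, unzip is assumed to be installed, and resource logging (CPU, RAM, disk) is left out.

```shell
#!/bin/sh
# find_db_root DIR
# Print the directory under DIR that contains codeql-database.yml.
# Returns non-zero if no such file exists.
find_db_root() {
  yml=$(find "$1" -type f -name codeql-database.yml | head -n 1)
  [ -n "$yml" ] && dirname "$yml"
}

# analyze_zip DB_ZIP SARIF_OUT LOG_DIR
# Extract one DB zip, analyze it, and keep stdout, stderr, and exit status.
analyze_zip() {
  db_zip=$1 sarif=$2 log_dir=$3
  work=$(mktemp -d) || return 1
  mkdir -p "$log_dir"
  unzip -q "$db_zip" -d "$work" || { rm -rf "$work"; return 1; }
  db_dir=$(find_db_root "$work") || { rm -rf "$work"; return 1; }
  codeql database analyze \
    --format=sarif-latest \
    --rerun \
    -j"${THREADS:-4}" \
    --ram="${RAM_MB:-8192}" \
    --output "$sarif" \
    -- "$db_dir" cpp-security-and-quality.qls \
    >"$log_dir/stdout.log" 2>"$log_dir/stderr.log"
  status=$?
  echo "$status" >"$log_dir/status"   # retained even when analysis fails
  rm -rf "$work"
  return "$status"
}
```

A call such as analyze_zip PROJECT.zip PROJECT.sarif logs/PROJECT would then cover one project; searching for codeql-database.yml instead of assuming a fixed path inside the zip keeps the function independent of how the bundle names its top-level directory.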
3.1. SARIF files produced for bm-full
The work here used the C/C++ cpp-security-and-quality suite. The BMF SARIF
outputs are available at
data/bmf-db-cpp-security-and-quality-sarifs/.
There are 2486 BMF C/C++ cpp-security-and-quality SARIF files.
3.2. SARIF files produced for bm-none
This section also used the C/C++ cpp-security-and-quality suite. The BMN SARIF
outputs are available at data/bmn-db-cpp-security-and-quality-sarifs/.
There are 832 BMN C/C++ cpp-security-and-quality SARIF files, significantly
fewer than the 2486 produced from the full-build databases.
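Counting SARIF files says nothing about how many results each file holds. A quick per-file count can be taken with jq, as sketched below. This assumes jq is installed and that the files end in .sarif; both function names are hypothetical. It relies only on the standard SARIF layout, where a top-level runs array holds one results array per run.

```shell
#!/bin/sh
# sarif_result_count FILE
# Print the total number of results across all runs in one SARIF file.
# A run with no results array counts as zero.
sarif_result_count() {
  jq '[.runs[]?.results | length] | add // 0' "$1"
}

# sarif_dir_summary DIR
# Print one "count<TAB>file" line per SARIF file in DIR.
sarif_dir_summary() {
  for f in "$1"/*.sarif; do
    [ -e "$f" ] || continue   # skip the unexpanded glob when DIR has no SARIF files
    printf '%s\t%s\n' "$(sarif_result_count "$f")" "$f"
  done
}
```

Running sarif_dir_summary on data/bmf-db-cpp-security-and-quality-sarifs and on data/bmn-db-cpp-security-and-quality-sarifs, then sorting numerically, gives a first view of where the two modes differ before any SQL ingestion.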
4. Comparisons
Future work.