Capitol Trades Data Pipeline

An automated data engineering pipeline that scrapes, processes, and serves congressional stock trading disclosures through a normalized database and RESTful API.

📖 Overview

This project is a fully automated web scraping system that collects trade disclosure data from Capitol Trades, enriches it with real-time market data from financial APIs, and stores it in a normalized PostgreSQL database. The data is exposed through a secure REST API with role-based access control, and visualized through an interactive web dashboard.

The system runs as a distributed scraping network across multiple Raspberry Pi devices, with scheduled jobs ensuring data freshness and comprehensive logging for monitoring and debugging.

🛠️ Tech Stack

| Category | Technology |
| --- | --- |
| Language | Python 3.8+ |
| Web Scraping | Playwright (Chromium headless browser) |
| Database | PostgreSQL 14+ with psycopg2 |
| API Framework | FastAPI with Pydantic validation |
| External APIs | FinnHub (stocks), CoinMarketCap (crypto) |
| Scheduling | Cron jobs / systemd timers |
| Logging | Python logging module with file handlers |
| Infrastructure | Raspberry Pi cluster (distributed scraping) |

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                        DATA COLLECTION LAYER                            │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────────┐  │
│  │   Playwright    │  │    FinnHub      │  │    CoinMarketCap        │  │
│  │   Scraper       │  │    Stock API    │  │    Crypto API           │  │
│  │ (Capitol Trades)│  │   (Prices)      │  │   (Prices/Tickers)      │  │
│  └────────┬────────┘  └────────┬────────┘  └────────────┬────────────┘  │
│           │                    │                        │               │
│           └────────────────────┼────────────────────────┘               │
│                                ▼                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                        DATA PROCESSING LAYER                            │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                     Python ETL Pipeline                          │   │
│  │  • Date parsing & normalization    • Ticker validation           │   │
│  │  • Fuzzy matching (difflib)        • Duplicate detection         │   │
│  │  • Options parsing (calls/puts)    • Error handling & retries    │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                ▼                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                        DATA STORAGE LAYER                               │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                    PostgreSQL Database                           │   │
│  │  ┌────────────┐  ┌────────────┐  ┌────────────┐  ┌────────────┐  │   │
│  │  │ politicians│  │  parties   │  │  chambers  │  │  tickers   │  │   │
│  │  │ (dimension)│  │ (dimension)│  │ (dimension)│  │ (dimension)│  │   │
│  │  └────────────┘  └────────────┘  └────────────┘  └────────────┘  │   │
│  │  ┌────────────────────────────────────────────────────────────┐  │   │
│  │  │              normalized_trades (fact table)                │  │   │
│  │  │  politician_id │ party_id │ ticker_id │ traded_date │ ...  │  │   │
│  │  └────────────────────────────────────────────────────────────┘  │   │
│  │  ┌────────────────────────────────────────────────────────────┐  │   │
│  │  │                    prices (time-series)                    │  │   │
│  │  └────────────────────────────────────────────────────────────┘  │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                ▼                                        │
├─────────────────────────────────────────────────────────────────────────┤
│                         API & PRESENTATION                              │
│  ┌─────────────────────────┐      ┌─────────────────────────────────┐   │
│  │      FastAPI REST       │      │        Flask Dashboard          │   │
│  │  • GET/POST endpoints   │      │  • Interactive tables           │   │
│  │  • API key auth         │      │  • Charts & visualizations      │   │
│  │  • Role-based access    │      │  • Pagination & filtering       │   │
│  └─────────────────────────┘      └─────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

🗄️ Database Schema

The database follows a normalized star schema design, separating dimension tables (politicians, parties, chambers, tickers) from the central fact table (normalized_trades). This architecture enables efficient queries, reduces data redundancy, and supports complex analytics.

Dimension Tables

| Table | Purpose |
| --- | --- |
| politicians | Unique list of politicians involved in trades |
| parties | Political party affiliations (Democrat, Republican, etc.) |
| chambers | Legislative chambers (House, Senate) |
| tickers | Unified asset registry (stocks, crypto, other) |
| owners | Trade attribution (self, spouse, child) |
| trade_types | Transaction types (Purchase, Sale, Exchange) |
| trade_sizes | Value ranges ($1K-$15K, $15K-$50K, etc.) |
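
The dollar-range labels in trade_sizes can be mapped to numeric bounds for range filtering. A minimal sketch, assuming labels follow the `$1K-$15K` pattern shown above (the `parse_trade_size` helper itself is hypothetical):

```python
import re

# Hypothetical helper: convert a trade_sizes label such as "$1K-$15K"
# into numeric (low, high) dollar bounds. The label format is an
# assumption based on the example ranges in this README.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}

def parse_trade_size(label):
    """Return (low, high) dollar bounds for a label like '$1K-$15K'."""
    matches = re.findall(r"\$(\d+(?:\.\d+)?)([KM])", label)
    if len(matches) != 2:
        raise ValueError(f"Unrecognized trade size label: {label!r}")
    low, high = (int(float(num) * _MULTIPLIERS[suffix]) for num, suffix in matches)
    return low, high

print(parse_trade_size("$1K-$15K"))  # (1000, 15000)
```

Storing the parsed bounds alongside the label would let the API filter trades by minimum value without string comparisons.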

Fact Tables

| Table | Purpose |
| --- | --- |
| normalized_trades | Central trade records with foreign keys to all dimensions |
| option_trade_details | Extended metadata for options trades (strike, expiry, call/put) |
| prices | Time-series price data for all tracked tickers |
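
The fact-to-dimension join at the heart of this schema can be sketched in a few lines. The snippet below uses `sqlite3` in place of PostgreSQL so it is self-contained and runnable; the table and column names follow the schema above, while the sample data is illustrative:

```python
import sqlite3

# Self-contained star-schema sketch: sqlite3 stands in for PostgreSQL,
# and only two dimension tables are shown. Sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE politicians (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE tickers     (id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE normalized_trades (
    id            INTEGER PRIMARY KEY,
    politician_id INTEGER REFERENCES politicians(id),
    ticker_id     INTEGER REFERENCES tickers(id),
    traded_date   TEXT
);
INSERT INTO politicians VALUES (1, 'Jane Doe');
INSERT INTO tickers     VALUES (1, 'AAPL');
INSERT INTO normalized_trades VALUES (1, 1, 1, '2024-03-01');
""")

# Resolve foreign keys back to human-readable values with a join.
row = conn.execute("""
    SELECT p.name, t.symbol, nt.traded_date
    FROM normalized_trades nt
    JOIN politicians p ON p.id = nt.politician_id
    JOIN tickers     t ON t.id = nt.ticker_id
""").fetchone()
print(row)  # ('Jane Doe', 'AAPL', '2024-03-01')
```

Because each politician, party, and ticker string is stored once in a dimension table, the fact table stays compact and updates to a name touch a single row.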

🕷️ Scraper Pipeline

The scraper uses Playwright to automate a headless Chromium browser, navigating through paginated trade listings and extracting structured data. The pipeline includes configurable retry logic, duplicate detection, and automatic ticker validation.

🔄 Data Collection

  • Headless browser automation
  • Pagination handling (100 per page)
  • Configurable retry logic (3 attempts)
  • Rate limiting to avoid blocks
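
The retry behavior above can be sketched as a small wrapper. `fetch_page` is a stand-in for the real Playwright page-load routine, and the linear backoff policy is an assumption:

```python
import time

# Minimal sketch of the configurable retry logic. `fetch_page` is a
# hypothetical callable wrapping the Playwright navigation; the backoff
# schedule shown here is illustrative, not the project's actual policy.
MAX_RETRIES = 3

def fetch_with_retries(fetch_page, page_number, delay=2.0):
    last_error = None
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return fetch_page(page_number)
        except Exception as exc:  # in practice, catch Playwright-specific errors
            last_error = exc
            time.sleep(delay * attempt)  # simple linear backoff between attempts
    raise RuntimeError(
        f"page {page_number} failed after {MAX_RETRIES} attempts"
    ) from last_error
```

Sleeping between attempts doubles as a crude rate limiter, which also helps avoid blocks.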

🧹 Data Processing

  • Date parsing with fuzzy matching
  • Ticker validation via FinnHub
  • Options contract parsing
  • Duplicate trade detection
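
The fuzzy matching step can be illustrated with `difflib`, which the ETL pipeline uses: scraped labels are noisy, so they are snapped to known canonical values. The cutoff value and `normalize_trade_type` helper below are assumptions:

```python
import difflib

# Sketch of fuzzy matching with difflib: snap a noisy scraped label to
# the nearest canonical trade type. The 0.6 cutoff is illustrative.
KNOWN_TRADE_TYPES = ["Purchase", "Sale", "Exchange"]

def normalize_trade_type(raw):
    candidates = difflib.get_close_matches(
        raw.strip().title(), KNOWN_TRADE_TYPES, n=1, cutoff=0.6
    )
    return candidates[0] if candidates else None

print(normalize_trade_type("purchse"))  # 'Purchase' despite the typo
```

The same pattern applies to date strings and politician names, where disclosure formatting varies between filings.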

📝 Logging & Monitoring

  • Separate log files per component
  • Debug, info, and error levels
  • Insert tracking and auditing
  • Runtime performance metrics

import logging

# Scraper configuration
MAX_RETRIES = 3
BASE_URL = "https://www.capitoltrades.com/trades?pageSize=100&page="
STOP_ON_DUPLICATE = False  # False: scrape past known trades; True: halt at first duplicate

# Logging setup: one named logger per component
capitol_scraper_logger = logging.getLogger("capitol_scraper_normalized")
db_inserts_logger = logging.getLogger("db_inserts")
debug_logger = logging.getLogger("normalized_insert")

🔐 REST API

The data is exposed through a secure FastAPI backend with role-based access control. API keys grant different permission levels, enabling both read-only dashboards and read-write integrations for automated systems.

Authentication

# All requests require an API key header
headers = {
    "Content-Type": "application/json",
    "api-key": "YOUR_API_KEY"
}

# Roles: admin, read-write, read-only
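
On the server side, role-based access control boils down to mapping each key to a role and each role to a permission set. A minimal sketch of that model (the backend actually uses FastAPI dependencies; the key values and permission names here are illustrative, while the role names come from this README):

```python
# Illustrative role/permission model behind the API keys. Role names
# match this README; the key store and permission sets are invented.
ROLE_PERMISSIONS = {
    "admin":      {"read", "write", "manage_keys"},
    "read-write": {"read", "write"},
    "read-only":  {"read"},
}

API_KEYS = {"demo-key-123": "read-only"}  # hypothetical key store

def authorize(api_key, required_permission):
    """Return the caller's role, or raise if the key lacks the permission."""
    role = API_KEYS.get(api_key)
    if role is None:
        raise PermissionError("Unknown API key")
    if required_permission not in ROLE_PERMISSIONS[role]:
        raise PermissionError(f"Role {role!r} lacks {required_permission!r}")
    return role

print(authorize("demo-key-123", "read"))  # 'read-only'
```

A read-only key passing this check can serve dashboards, while write endpoints such as /insert_trade require a read-write or admin key.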

Available Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /get_capitol_trades | GET | Retrieve trades with filters, pagination, sorting |
| /get_stock_tickers | GET | List all tracked stock symbols |
| /get_crypto_tickers | GET | List all tracked cryptocurrency symbols |
| /get_prices | GET | Historical price data for any ticker |
| /insert_trade | POST | Add new trade records (read-write only) |
| /batch_insert | POST | Bulk insert multiple records (read-write only) |

Query Examples

# Get trades by a specific politician
GET /get_capitol_trades?fields=politician,trade_ticker&filter=politician=Nancy%20Pelosi

# Get trades within a date range
GET /get_capitol_trades?fields=trade_ticker,traded_date&date_from=2024-01-01&date_to=2024-12-31

# Get trades by type with pagination
GET /get_capitol_trades?filter=type=Purchase&page=1&limit=50
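
Query strings like the ones above are easiest to build with `urllib.parse.urlencode`, which also handles the percent-encoding. A sketch, assuming a hypothetical base URL (the endpoint and parameter names come from this README):

```python
from urllib.parse import urlencode

# Build the paginated-filter query shown above. The base URL is a
# placeholder; endpoint and parameter names follow this README.
BASE_URL = "https://api.example.com"

def trades_url(**params):
    return f"{BASE_URL}/get_capitol_trades?{urlencode(params)}"

url = trades_url(filter="type=Purchase", page=1, limit=50)
print(url)  # .../get_capitol_trades?filter=type%3DPurchase&page=1&limit=50
```

In practice the request would also carry the `api-key` header shown in the Authentication section.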

📊 Technical Specifications

| Metric | Value |
| --- | --- |
| Python Version | 3.8+ |
| PostgreSQL Version | 14+ |
| Scrape Page Size | 100 trades per page |
| Max Retry Attempts | 3 per page |
| API Key Roles | admin, read-write, read-only |
| Ticker Categories | stock, crypto, other |
| Scheduling | Cron (configurable intervals) |
| Log Rotation | Per-component log files |

✅ Implemented Features

✓ Playwright Browser Automation
✓ Headless Chromium Scraping
✓ FinnHub Stock API Integration
✓ CoinMarketCap Crypto API
✓ Normalized Star Schema
✓ Dimension & Fact Tables
✓ Options Trade Parsing
✓ Fuzzy Date Matching
✓ Duplicate Detection
✓ REST API with Auth
✓ Role-Based Access Control
✓ Batch Insert Support
✓ Comprehensive Logging
✓ Cron Job Scheduling
✓ Distributed Scraping

🚀 Future Enhancements