LLM Prefilter Usage Guide

The LLM prefilter automatically identifies irrelevant job postings and routes them to a separate filtered_jobs table, keeping only relevant jobs in the main job_listings table.

Quick Start

1. Job Search with Prefiltering

# Enable prefiltering during job ingestion
python -m flows.ingest "Python Developer" "Remote" linkedin 50 --prefilter --prefilter-threshold 0.45

# Use a custom threshold (higher = stricter: 0.0 keeps every job, 1.0 filters nearly all of them)
python -m flows.ingest "Data Engineer" "San Francisco" linkedin 25 --prefilter --prefilter-threshold 0.6

What happens:
- Jobs with relevance score ≥ threshold → saved to job_listings
- Jobs with relevance score < threshold → saved to filtered_jobs
- Prefilter reasoning stored in ai_model_reasoning field
- Raw prefilter data stored in raw_payload["prefilter"]
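The routing rule above can be sketched in a few lines of Python. This is an illustration only, not the project's actual code: the `Job` class, the `route_job` function, and the keys stored under `raw_payload["prefilter"]` are hypothetical, and the `[prefilter]` prefix on the reasoning text is an assumption based on the monitoring query later in this guide.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical stand-in for a job record; field names follow the text above."""
    role_title: str
    ai_model_reasoning: str = ""
    raw_payload: dict = field(default_factory=dict)


def route_job(job: Job, score: float, reasoning: str, threshold: float = 0.45) -> str:
    """Attach prefilter metadata to the job and return the destination table name."""
    # Assumed format: reasoning text tagged with a "[prefilter]" prefix.
    job.ai_model_reasoning = f"[prefilter] {reasoning}"
    # Assumed payload keys; the guide only says raw data lives under "prefilter".
    job.raw_payload["prefilter"] = {"score": score, "threshold": threshold}
    # Scores at or above the threshold are kept; everything else is filtered.
    return "job_listings" if score >= threshold else "filtered_jobs"


kept = route_job(Job("Python Developer"), score=0.45, reasoning="matches query")
dropped = route_job(Job("Forklift Operator"), score=0.10, reasoning="unrelated role")
print(kept, dropped)
```

Note that a score exactly equal to the threshold is kept, so raising the threshold makes the filter stricter.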

2. Autonomous Filtering of Existing Jobs

# Sweep existing jobs and filter irrelevant ones
# (--dry-run only previews what would be filtered; nothing is moved)
python -m flows.prefilter sweep --threshold 0.45 --limit 500 --dry-run

# Actually move jobs (remove --dry-run)
python -m flows.prefilter sweep --threshold 0.45 --limit 500 --days-back 30

# Process in smaller batches
python -m flows.prefilter sweep --threshold 0.5 --batch-size 10 --limit 100

What happens:
- Loads unprocessed jobs from the job_listings table
- Applies LLM prefiltering to each job
- Jobs scoring below the threshold: moved from job_listings → filtered_jobs
- Jobs scoring at or above the threshold: updated with prefilter metadata, stay in job_listings
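A minimal sketch of the sweep loop described above, assuming jobs are plain dicts with an `id` key and that scoring is supplied as a callable. The `sweep` and `batched` functions and the shape of the returned report are hypothetical; the parameters simply mirror the `--threshold`, `--batch-size`, `--limit`, and `--dry-run` flags.

```python
from typing import Callable, Iterator


def batched(items: list[dict], size: int) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sweep(
    jobs: list[dict],
    score_fn: Callable[[dict], float],
    threshold: float = 0.45,
    batch_size: int = 10,
    limit: int = 500,
    dry_run: bool = True,
) -> dict:
    """Classify up to `limit` jobs and report which ones would be kept or filtered."""
    kept, filtered = [], []
    for batch in batched(jobs[:limit], batch_size):
        for job in batch:
            # Same rule as ingestion: at or above the threshold stays in job_listings.
            target = kept if score_fn(job) >= threshold else filtered
            target.append(job["id"])
    # In a dry run nothing is moved; otherwise every filtered job would be moved.
    return {"kept": kept, "filtered": filtered, "moved": 0 if dry_run else len(filtered)}


jobs = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.2}, {"id": 3, "score": 0.45}]
report = sweep(jobs, score_fn=lambda job: job["score"], batch_size=2)
print(report)
```

Running with `dry_run=True` first and inspecting the `filtered` list is a cheap way to check a threshold before any rows are moved.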

Configuration

Enable by Default (config/settings.toml)

[prefilter]
enabled = true
threshold = 0.45
timeout = 90
concurrent = 4

Optional Stoplist (disabled by default)

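# Add these keys to the existing [prefilter] table; TOML does not allow declaring the same table twice in one file
[prefilter]
use_stoplist = true
stoplist_path = "config/job_filter.toml"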

Database Tables

- job_listings: Relevant jobs (score ≥ threshold)
- filtered_jobs: Irrelevant jobs (score < threshold)
- Both tables have identical schema
- Duplicate detection works across both tables
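The two properties above (identical schema, duplicate detection across both tables) can be illustrated with an in-memory SQLite database. Everything beyond the table names is an assumption: the database engine, the `job_url` key used to identify duplicates, the two-column schema, and the helper functions `is_duplicate` and `move_to_filtered` are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical minimal schema; the real tables have more columns.
for table in ("job_listings", "filtered_jobs"):
    conn.execute(f"CREATE TABLE {table} (job_url TEXT PRIMARY KEY, role_title TEXT)")
conn.execute("INSERT INTO job_listings VALUES ('https://example.com/1', 'Python Developer')")
conn.execute("INSERT INTO filtered_jobs VALUES ('https://example.com/2', 'Forklift Operator')")


def is_duplicate(conn: sqlite3.Connection, job_url: str) -> bool:
    """Return True if the URL already exists in either table."""
    row = conn.execute(
        "SELECT 1 FROM job_listings WHERE job_url = ? "
        "UNION ALL SELECT 1 FROM filtered_jobs WHERE job_url = ? LIMIT 1",
        (job_url, job_url),
    ).fetchone()
    return row is not None


def move_to_filtered(conn: sqlite3.Connection, job_url: str) -> None:
    """Move one row from job_listings to filtered_jobs in a single transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        # SELECT * works here only because both tables share the same schema.
        conn.execute(
            "INSERT INTO filtered_jobs SELECT * FROM job_listings WHERE job_url = ?",
            (job_url,),
        )
        conn.execute("DELETE FROM job_listings WHERE job_url = ?", (job_url,))


print(is_duplicate(conn, "https://example.com/2"))
move_to_filtered(conn, "https://example.com/1")
```

Checking both tables before inserting keeps a previously filtered job from reappearing in job_listings on a later ingest.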

Examples

# High-precision filtering (fewer false positives)
python -m flows.ingest "AI Engineer" "Remote" --prefilter --prefilter-threshold 0.7

# Lenient filtering (lower threshold keeps more jobs, including borderline ones)
python -m flows.ingest "Software Engineer" "NYC" --prefilter --prefilter-threshold 0.3

# Dry run to see what would be filtered
python -m flows.prefilter sweep --dry-run --limit 50

# Clean up last week's jobs
python -m flows.prefilter sweep --days-back 7 --threshold 0.5

Monitoring

Check prefilter results:

-- Count jobs by table
SELECT 'relevant' as type, COUNT(*) FROM job_listings
UNION ALL
SELECT 'filtered' as type, COUNT(*) FROM filtered_jobs;

-- View prefilter reasoning
SELECT role_title, ai_model_reasoning 
FROM job_listings 
WHERE ai_model_reasoning LIKE '%[prefilter]%';
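When tuning the threshold, the two counts from the first query can be reduced to a single filter rate. The helper below is a hypothetical convenience, and the counts in the usage line are made-up illustration values, not real results.

```python
def filter_rate(relevant: int, filtered: int) -> float:
    """Fraction of all processed jobs that ended up in filtered_jobs."""
    total = relevant + filtered
    # Avoid division by zero when neither table has rows yet.
    return filtered / total if total else 0.0


# Illustrative counts only.
rate = filter_rate(relevant=120, filtered=80)
print(f"{rate:.0%} of jobs were filtered")
```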