What ADA-IG is

ADA-IG (Accessible Data Archive — Instagram) is a self-hosted, read-only Instagram data ingestion and monitoring platform. It connects to Instagram via the official Meta Graph API for authorized Business or Creator accounts, periodically ingests profile, post, and comment data, tracks field-level changes over time, and exposes everything through a citation-first search interface and a clean operator dashboard.

Every record is traceable back to the raw source artifact and the run that produced it. No data is ever invented, inferred beyond what the API returns, or collected through unsupported means. The platform is intended to give operators a reliable, auditable window into account activity — nothing more.

What it is not

Tech stack

Python 3.12     runtime
FastAPI         web framework
Uvicorn         ASGI server
SQLAlchemy 2    ORM
Alembic         migrations
PostgreSQL 16   database
pgvector        vector search
Jinja2          templates
Pydantic        settings + schemas
Docker Compose  local orchestration
Playwright      optional public capture

Design constraints

🔗
Provenance-first
Every record links to the source run and raw artifact that produced it. You can always trace a field value back to the exact API response.
👁
Read-only
The platform never writes to Instagram. No post creation, no DMs, no comment moderation, no follow/unfollow.
🚧
No graph expansion
Monitors only accounts explicitly added as targets. No follower harvesting, no friends-of-friends, no bulk discovery.
🔑
Official API lane first
Defaults to the Meta Graph API. The optional Playwright public-capture lane must be explicitly enabled and is off by default.
📦
Versioned snapshots
Every ingestion creates a new snapshot. The platform never overwrites history — it appends and diffs.
🔍
Citation-first search
Every search result includes the run_id, artifact_id, and entity_id that sourced it. No result is anonymous.
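
The "versioned snapshots" and "provenance-first" constraints can be illustrated with a small in-memory sketch. This is not the platform's own code: the class and function names here (Snapshot, FieldDiff, SnapshotHistory, diff_snapshots) are hypothetical, and the real platform stores these records through SQLAlchemy models rather than Python lists. The sketch only shows the idea of appending a new snapshot, never overwriting an old one, and recording each changed field together with the run that observed it.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Snapshot:
    """Point-in-time state of one entity, tied to its source run and artifact."""
    entity_id: str
    run_id: str
    artifact_id: str
    fields: dict[str, Any]


@dataclass(frozen=True)
class FieldDiff:
    """One changed field: old value, new value, and the run that saw the change."""
    entity_id: str
    run_id: str
    field_name: str
    old_value: Any
    new_value: Any


def diff_snapshots(old: Snapshot, new: Snapshot) -> list[FieldDiff]:
    """Compare two snapshots field by field; fields missing on one side count as None."""
    diffs = []
    for name in sorted(old.fields.keys() | new.fields.keys()):
        before, after = old.fields.get(name), new.fields.get(name)
        if before != after:
            diffs.append(FieldDiff(new.entity_id, new.run_id, name, before, after))
    return diffs


@dataclass
class SnapshotHistory:
    """Append-only history (hypothetical): snapshots are added, never replaced."""
    _snapshots: list[Snapshot] = field(default_factory=list)

    def append(self, snapshot: Snapshot) -> list[FieldDiff]:
        previous = self._snapshots[-1] if self._snapshots else None
        self._snapshots.append(snapshot)
        if previous is None:
            return []  # first snapshot: nothing to compare against
        return diff_snapshots(previous, snapshot)

    def all(self) -> tuple[Snapshot, ...]:
        return tuple(self._snapshots)
```

Because every FieldDiff carries the run_id of the newer snapshot, a changed value can be followed back to the run, and from the snapshot to the artifact, that produced it.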

Service architecture

All services run in Docker Compose. Each has a single, bounded responsibility.

            Browser
               │  HTTP :18080
               ▼
┌─────────────────────────────┐
│             api             │  FastAPI + Uvicorn
│    (FastAPI + Jinja2 UI)    │  Serves the dashboard UI and REST API
└──────────────┬──────────────┘
               │  SQLAlchemy
               ▼
┌─────────────────────────────┐
│             db              │  PostgreSQL 16 + pgvector extension
│ (PostgreSQL 16 + pgvector)  │  Stores all normalized entities and vectors
└─────────────────────────────┘
               ▲
               │  SQLAlchemy
┌──────────────┴──────────────┐
│           worker            │  Background ingestion loop
│    (background ingestor)    │  Periodically fetches and processes data
└─────────────────────────────┘

┌─────────────────────────────┐
│           migrate           │  One-shot Alembic runner
│   (alembic upgrade head)    │  Runs on startup, exits when done
└─────────────────────────────┘

┌─────────────────────────────┐
│         playwright          │  Optional public capture (off by default)
│ (--profile public-capture)  │  Requires PUBLIC_CAPTURE_ENABLED=true
└─────────────────────────────┘

Data flow

1. Trigger
A run is started by the manual demo button (POST /api/runs/manual-demo), an inbound OpenClaw webhook (POST /api/hooks/openclaw/run), or the worker bootstrap loop.
2. Ingest
The official ingestor calls the Meta Graph API (or Playwright, when public capture is enabled). Raw JSON responses are written to disk under storage/artifacts/ and recorded as SourceArtifact rows in the database.
3. Normalize
pipeline.py calls normalizer.py, which maps the raw artifact payload to Profile, Post, and Comment ORM records and creates a new versioned snapshot for each.
4. Diff
diff_engine.py compares each entity's new snapshot against its previous one and writes an EntityDiff record for every changed field (old value, new value, run reference).
5. Embed
chunker.py splits text fields (bio, captions, comments) into overlapping chunks. embedder.py converts each chunk to a vector and writes EmbeddingChunk rows for pgvector indexing.
6. Search
The query is embedded, then pgvector runs a cosine similarity search across EmbeddingChunk. Every result carries provenance: run_id, artifact_id, and entity_id.
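
The Embed and Search steps can be sketched in plain Python to show how overlapping chunks, a deterministic stub embedder, and cosine ranking fit together. Everything here is an assumption made for illustration: the function names, the chunk sizes, and the token-hashing scheme are hypothetical, and the actual dev-hash-embedder may work differently. In the platform itself the similarity ranking is performed by pgvector inside PostgreSQL, not in application code.

```python
import hashlib
import math
from dataclasses import dataclass


def chunk_text(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into character windows of `size` that overlap by `overlap`."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def hash_embed(text: str, dims: int = 256) -> list[float]:
    """Toy embedding (assumed scheme): hash each token into a bucket, then L2-normalize."""
    vector = [0.0] * dims
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass(frozen=True)
class Chunk:
    """A text chunk with its vector and the provenance triple every result must carry."""
    text: str
    vector: list[float]
    run_id: str
    artifact_id: str
    entity_id: str


def index_text(text: str, run_id: str, artifact_id: str, entity_id: str) -> list[Chunk]:
    return [Chunk(c, hash_embed(c), run_id, artifact_id, entity_id) for c in chunk_text(text)]


def search(query: str, chunks: list[Chunk], top_k: int = 3) -> list[tuple[float, Chunk]]:
    """Embed the query and return the top_k chunks by cosine similarity."""
    query_vector = hash_embed(query)
    scored = [(cosine_similarity(query_vector, c.vector), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Each returned Chunk still holds run_id, artifact_id, and entity_id, which is the property the Search step describes: a hit is never separated from its provenance.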

File structure

openclaw-instagram-data-platform/
├── app/
│   ├── main.py  # FastAPI app factory; mounts all routers and UI routes
│   ├── api/routes/  # One file per REST resource
│   │   ├── health.py  # /healthz and /readyz probes
│   │   ├── runs.py  # Ingestion run listing and detail
│   │   ├── profiles.py  # Profile listing
│   │   ├── posts.py  # Post listing
│   │   ├── comments.py  # Comment listing
│   │   ├── changes.py  # Field-level diff listing
│   │   ├── search.py  # Vector similarity search
│   │   ├── replay.py  # Re-normalize an existing run's artifacts
│   │   ├── hooks.py  # Inbound OpenClaw automation webhook
│   │   ├── settings.py  # Settings form save + JSON read
│   │   └── operator.py  # Seed, wipe, summary, connection status
│   ├── core/  # App infrastructure, no business logic
│   │   ├── config.py  # Pydantic Settings; reads .env
│   │   ├── db.py  # SQLAlchemy engine and session factory
│   │   ├── logging.py  # Structured logging setup
│   │   ├── security.py  # Guardrail checks (public capture, playwright)
│   │   ├── storage.py  # Artifact file read/write abstraction
│   │   ├── env_validation.py  # Startup .env sanity checks
│   │   ├── metrics.py  # Internal counters
│   │   └── queue.py  # In-process task queue
│   ├── models/  # SQLAlchemy ORM models (one per table)
│   │   ├── source_target.py  # Monitored account (profile_key, status)
│   │   ├── source_run.py  # One ingestion run (status, trigger, timestamps)
│   │   ├── source_artifact.py  # Raw JSON blob record (hash, size, path)
│   │   ├── profile.py  # Canonical Instagram account
│   │   ├── profile_snapshot.py  # Point-in-time profile state
│   │   ├── post.py  # Canonical Instagram post
│   │   ├── post_snapshot.py  # Point-in-time post state (caption, likes)
│   │   ├── comment.py  # Individual comment
│   │   ├── entity_diff.py  # Field-level change (old_value, new_value)
│   │   ├── embedding_chunk.py  # Text chunk + pgvector embedding
│   │   ├── run_event.py  # Timestamped event within a run
│   │   ├── dead_letter.py  # Records that failed processing
│   │   ├── enums.py  # Shared enums (EntityType, RunStatus, etc.)
│   │   └── common.py  # Shared base model and UUID/timestamp mixins
│   ├── services/ingestion/  # The full ingestion pipeline
│   │   ├── pipeline.py  # Orchestrates a complete ingestion run end-to-end
│   │   ├── normalizer.py  # Maps raw API payload to ORM models + snapshots
│   │   ├── diff_engine.py  # Compares snapshots, writes EntityDiff records
│   │   ├── artifact_writer.py  # Writes raw JSON to disk and SourceArtifact to DB
│   │   ├── official_ingestor.py  # Meta Graph API fetcher
│   │   └── public_capture_ingestor.py  # Playwright-based public capture (opt-in)
│   ├── services/retrieval/  # Search pipeline
│   │   ├── search.py  # Query entry point and result assembly
│   │   ├── search_service.py  # pgvector cosine similarity query
│   │   ├── chunker.py  # Splits text into overlapping embedding chunks
│   │   └── embedder.py  # Generates vectors (dev: hash stub, prod: model)
│   ├── services/dashboard/
│   │   ├── queries.py  # Raw ORM queries for dashboard data
│   │   └── query_service.py  # Aggregation and summary logic
│   ├── settings_service.py  # Reads and writes .env settings from the browser
│   ├── operator_service.py  # Seed demo data, wipe platform data, count summary
│   ├── connection_service.py  # Evaluates Instagram API credential health
│   ├── integrations/
│   │   ├── instagram/client.py  # Meta Graph API HTTP client (stub, incomplete)
│   │   └── playwright/capture.py  # Playwright browser automation (stub)
│   ├── workers/
│   │   ├── main.py  # Worker container entrypoint
│   │   ├── runner.py  # Task execution loop
│   │   └── tasks/  # Background task definitions
│   ├── schemas/  # Pydantic request/response schemas
│   │   ├── common.py  # Shared types (pagination, provenance)
│   │   ├── runs.py  # Run request/response shapes
│   │   └── search.py  # Search result schema
│   └── ui/templates/  # Jinja2 HTML templates
│       ├── base.html  # Navigation, global CSS, layout shell
│       ├── index.html  # Dashboard (KPIs, profiles, posts, runs)
│       ├── profile.html  # Profile drilldown (posts, comments, diffs)
│       ├── settings.html  # Platform settings (tabbed form + operator controls)
│       └── about.html  # This page
├── migrations/versions/  # Alembic migration scripts
├── docker/
│   ├── api.Dockerfile  # Image for api and migrate services
│   ├── worker.Dockerfile  # Image for the background worker
│   └── playwright.Dockerfile  # Image for optional public capture
├── scripts/
│   ├── dev_seed.py  # CLI script to seed demo data into the DB
│   ├── demo_payloads.py  # Sample Instagram API payloads for testing
│   └── rebuild.sh  # Full image rebuild + migration + restart
├── compose.yaml  # Docker Compose service definitions
├── pyproject.toml  # Python dependencies and tool config (ruff, pytest)
└── alembic.ini  # Alembic migration configuration

API reference

Every endpoint this platform exposes, grouped by resource.

Health & readiness 2 endpoints
GET /healthz Basic liveness probe. Returns {"status":"ok"}. Used by Docker and load balancers to confirm the process is running.
GET /readyz Readiness probe. Verifies the database connection is alive before reporting ready. Returns {"status":"ok"} or an error with the reason.
Ingestion runs 3 endpoints
GET /api/runs List all ingestion runs ordered by start time descending. Returns status, trigger type, timestamps, and error summary for each run.
GET /api/runs/{run_id} Detail for a single run by UUID. Includes full status, all associated events, and links to the artifacts produced.
POST /api/runs/manual-demo Trigger a demo ingestion run using built-in fixture payloads. Does not require live Meta credentials. Useful for testing the full pipeline end-to-end.
Profiles 1 endpoint
GET /api/profiles List all tracked profiles joined with their current snapshot. Returns username, follower count, bio, and ingestion metadata.
Posts 1 endpoint
GET /api/posts List all ingested posts joined with their current snapshot. Returns caption, like count, comment count, media type, and provenance.
Comments 1 endpoint
GET /api/comments List all captured comments. Returns author name, comment text, timestamp, and the post it belongs to.
Changes (diffs) 1 endpoint
GET /api/changes List all detected field-level diffs across all entities, ordered by creation time. Each record includes entity type, entity ID, field name, old value, new value, and the run that produced the change.
Search 1 endpoint
GET /api/search?q= Vector similarity search across all indexed text (bios, captions, comments). The query is embedded and compared against stored EmbeddingChunk vectors using pgvector cosine similarity. Every result includes run_id, artifact_id, and entity_id for full provenance.
Replay 1 endpoint
POST /api/replay/{run_id} Re-run the normalize → diff → embed pipeline against the existing raw artifacts for a given run, without re-fetching from Instagram. Useful when normalizer or diff logic is updated and you want to reprocess historical data.
Automation hooks 1 endpoint
POST /api/hooks/openclaw/run Inbound webhook that triggers an ingestion run from an external automation system. Requires the X-OpenClaw-Secret header to match the configured OPENCLAW_WEBHOOK_SECRET value. Returns the created run ID.
Settings 2 endpoints
GET /api/settings Returns the current platform configuration as JSON. Sensitive values (app secret, encryption key) are masked. Useful for automation systems that need to read current config state.
POST /settings Browser form submission that saves updated settings to .env. Accepts a multipart form body with all platform config fields. On success, redirects to /settings?saved=1.
Operator 6 endpoints
GET /api/operator/summary Returns row counts for every entity table (targets, runs, artifacts, profiles, posts, comments, diffs, embedding chunks, run events) plus artifact file count and storage path.
GET /api/operator/connection-status Evaluates the current Instagram Graph API credential state. Returns status (connected, ready_for_auth, needs_credentials), whether keys are present, auth state, and the latest run event.
POST /settings/actions/seed-demo Seeds a full demo dataset into the database using fixture payloads. Safe to run multiple times. Redirects to /settings?action=seeded.
POST /settings/actions/test-config Runs a live connection test against the configured Instagram Graph API credentials and records the result. Redirects to /settings?action=tested.
POST /settings/actions/restart-services No-op that redirects to /settings?action=restart-needed to display the restart command. Actual restart must be done from the host shell.
POST /settings/actions/wipe-data Deletes all platform data (runs, artifacts, profiles, posts, comments, diffs, embeddings, run events, and stored artifact files) while preserving .env. Requires the exact confirmation phrase in the form body. Redirects to /settings?action=wiped on success or ?action=wipe-error on phrase mismatch.
UI pages 4 pages
GET / Main dashboard. KPI row (profiles, posts, comments, runs, changes, artifacts), profile table, recent posts, recent runs, change feed, and recent comments.
GET /profiles/{profile_id} Profile drilldown page. Shows current snapshot, snapshot history, all posts with captions and engagement, per-post comments, top commenter leaderboard, hashtag cloud, and field-level change history.
GET /settings Platform settings page. Tabbed form for credentials and runtime config, live connection status, operator controls (seed/wipe), and collapsible help and data footprint panels.
GET /about This page. Platform overview, architecture diagram, file structure, full API reference, data flow, and design constraints.
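
The automation hook in the reference above accepts a request only when the X-OpenClaw-Secret header matches OPENCLAW_WEBHOOK_SECRET. A minimal sketch of such a check follows. The function name is_authorized is hypothetical, and two behaviours are assumptions rather than documented facts: rejecting every request when no secret is configured, and comparing in constant time with hmac.compare_digest.

```python
import hmac
from typing import Mapping, Optional

# Header name taken from the API reference; HTTP header names are case-insensitive.
SECRET_HEADER = "x-openclaw-secret"


def is_authorized(headers: Mapping[str, str], configured_secret: Optional[str]) -> bool:
    """Return True only if the supplied header equals the configured secret."""
    if not configured_secret:
        return False  # assumption: fail closed when no secret is configured
    lowered = {name.lower(): value for name, value in headers.items()}
    supplied = lowered.get(SECRET_HEADER, "")
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(supplied.encode("utf-8"), configured_secret.encode("utf-8"))
```

In a FastAPI route this kind of check would typically run before any run is created, returning an error status on failure; the exact status code the platform uses is not stated here.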

Environment variables

Variable                 Purpose                                          Default
DATABASE_URL             PostgreSQL DSN for SQLAlchemy
ARTIFACT_ROOT            Local path where raw JSON artifacts are written  /data/artifacts
META_APP_ID              Meta developer app ID for Instagram Graph API
META_APP_SECRET          Meta developer app secret
META_REDIRECT_URI        OAuth callback URL registered in the Meta app
OPENCLAW_WEBHOOK_SECRET  Shared secret for the inbound automation hook
ENCRYPTION_KEY           Key for future stored token/secret protection
EMBEDDING_MODEL          Embedder to use for vector search                dev-hash-embedder
PUBLIC_CAPTURE_ENABLED   Enables Playwright public capture scaffold       false
LOG_LEVEL                Uvicorn and app log verbosity                    INFO
APP_ENV                  Deployment environment label                     dev
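
As a rough illustration of how the documented defaults could be applied, the sketch below reads a subset of these variables with the standard library only. The platform itself uses Pydantic settings, so this is a stand-in, not the real config.py: the Settings dataclass, load_settings, and the accepted spellings of a true boolean are all assumptions. Variables with no default in the table are left as None here rather than given invented values.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Subset of the documented variables (hypothetical stand-in for Pydantic settings)."""
    database_url: Optional[str]
    artifact_root: str
    embedding_model: str
    public_capture_enabled: bool
    log_level: str
    app_env: str


def _parse_bool(raw: str) -> bool:
    # Assumption: these spellings count as true; anything else is false.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, applying the documented defaults."""
    return Settings(
        database_url=env.get("DATABASE_URL"),  # no default listed in the table
        artifact_root=env.get("ARTIFACT_ROOT", "/data/artifacts"),
        embedding_model=env.get("EMBEDDING_MODEL", "dev-hash-embedder"),
        public_capture_enabled=_parse_bool(env.get("PUBLIC_CAPTURE_ENABLED", "false")),
        log_level=env.get("LOG_LEVEL", "INFO"),
        app_env=env.get("APP_ENV", "dev"),
    )
```

Passing a plain dict instead of os.environ, for example load_settings({"PUBLIC_CAPTURE_ENABLED": "true"}), makes the defaults easy to check in isolation.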

Version

Version         V1.1
Scope           Read-only: ingest, monitor, search, diff
API lane        Meta Instagram Graph API (official)
Auth status     Meta OAuth flow not yet implemented; pending V1.2
Public capture  Scaffold present, disabled by default
Search          pgvector cosine similarity, dev-hash-embedder in dev mode