Claude Opus 4.7 for Data Catalog Auto-Tagging
Learn how Claude Opus 4.7 auto-classifies tables, columns, and PII at scale. Technical deep-dive on implementing AI-driven data governance.
Understanding Data Catalog Auto-Tagging and Why It Matters
Data governance at scale is a grinding problem. You’ve got hundreds of tables, thousands of columns, and no clear inventory of what’s sensitive, what’s redundant, and what actually powers your analytics. Manual tagging is slow. Regex-based classification is brittle. And every time your schema evolves, you’re back to square one.
That’s where Claude Opus 4.7 comes in. With its expanded context window, improved instruction following, and native support for complex reasoning tasks, Claude Opus 4.7 can systematically classify your entire data catalog—tables, columns, PII markers, data lineage hints—without the operational overhead of purpose-built data governance platforms.
This isn’t a marketing claim. It’s a practical engineering problem that D23’s managed Apache Superset platform and teams building embedded analytics need to solve. When you’re embedding dashboards and self-serve BI into products, you need fast, reliable metadata. When you’re managing data across multiple teams, you need consistent tagging. When you’re handling customer data, you need PII detection that actually works.
This article walks through how to implement Claude Opus 4.7-powered auto-tagging in your data stack, from schema parsing to batch processing to integration with your analytics platform.
What Makes Claude Opus 4.7 Suited for Data Catalog Tasks
Claude Opus 4.7 is Anthropic’s latest flagship model, and it brings several capabilities specifically useful for data governance work. Understanding these capabilities is the foundation for building a robust tagging system.
Extended Context Window and Document Reasoning
The 1M token context window in Claude Opus 4.7 means you can feed it an entire data schema—hundreds of table and column definitions, sample values, documentation strings—in a single request. This matters because context is signal. When Claude sees a column named user_phone alongside a column named user_email and another named customer_ssn, it can infer that all three are personally identifiable information. A column called zip_code in isolation might be ambiguous; in context with a full customer table schema, it’s clearly PII.
According to Anthropic’s official documentation on Claude Opus 4.7, the model excels at document reasoning and structured analysis—exactly what you need for parsing database schemas, understanding relationships, and making consistent classification decisions across large datasets.
Improved Instruction Following and Structured Output
Data tagging requires consistency. You need Claude to apply the same taxonomy, the same confidence thresholds, the same rules every time. Claude Opus 4.7 has significantly improved instruction-following capabilities. You can define a detailed schema for tags, specify exactly how to handle edge cases, and expect reliable adherence to that specification across thousands of classification tasks.
This is critical for production systems. When you’re feeding tags back into your managed Apache Superset instance or your embedded analytics platform, you need predictable, parseable output. Claude Opus 4.7’s improved ability to follow complex, multi-step instructions means fewer parsing errors and less post-processing.
Vision and Multimodal Understanding
While the core task is text-based, Claude Opus 4.7’s enhanced vision capabilities matter in practice. If you’re working with data dictionaries stored as PDFs, images of schema diagrams, or screenshots of data lineage tools, Claude can ingest those directly. DataCamp’s analysis of Opus 4.7 highlights that multimodal improvements extend to better reasoning over complex visual layouts, which translates to understanding data architecture diagrams and legacy documentation.
Agentic Workflows and Iterative Refinement
Claude Opus 4.7 supports agentic patterns—the model can call tools, receive results, and refine its classification based on new information. In a data tagging context, this means Claude can look up additional metadata, check for naming conventions in your codebase, or validate classifications against known PII lists before returning final tags. Heroku’s guide on managed inference with Claude Opus 4.7 details how to structure these workflows for enterprise operations.
Building a Data Catalog Auto-Tagging System: Architecture and Design
Implementing Claude Opus 4.7-powered tagging requires thought about data flow, schema design, and integration points. Here’s how to structure it.
Step 1: Schema Extraction and Normalization
Your first task is to extract metadata from your data warehouse or data lake. This includes:
- Table names and descriptions (from comments, documentation, or metadata stores)
- Column names, data types, and descriptions
- Sample values (anonymized if necessary) from each column
- Frequency and recency information (when the table was last modified, how often it’s queried)
- Existing tags or classifications (if you have partial metadata already)
- Business context (which team owns the table, what product area it supports)
Normalization is crucial. If your schema comes from multiple sources—Postgres, Snowflake, BigQuery, S3 data lake—you need a common format. Build a schema extractor that produces consistent JSON or YAML output.
Example normalized schema for a single table:
{
  "table_name": "customers",
  "database": "analytics",
  "owner": "data_team",
  "description": "Core customer dimension table",
  "columns": [
    {
      "name": "customer_id",
      "data_type": "bigint",
      "description": "Unique customer identifier",
      "sample_values": ["12345", "12346", "12347"],
      "nullability": "not null",
      "is_primary_key": true
    },
    {
      "name": "email",
      "data_type": "varchar",
      "description": "Customer email address",
      "sample_values": ["john@example.com", "jane@example.com"],
      "nullability": "nullable"
    },
    {
      "name": "phone",
      "data_type": "varchar",
      "description": "Customer phone number",
      "sample_values": ["555-1234", "555-5678"],
      "nullability": "nullable"
    },
    {
      "name": "ssn",
      "data_type": "varchar",
      "description": "Social security number (encrypted)",
      "sample_values": null,
      "nullability": "nullable"
    }
  ]
}
Notice that for sensitive columns like SSN, we don’t include sample values. Claude Opus 4.7 can still classify based on the column name, description, and context—it doesn’t need to see actual PII to identify it.
Step 2: Defining Your Tagging Taxonomy
Before you send anything to Claude, define what tags you want applied. This is your classification schema. A typical taxonomy might include:
Data Classification Tags:
- public — safe to expose in dashboards, reports, external tools
- internal — restricted to internal teams, shouldn’t be embedded in customer-facing analytics
- confidential — restricted to specific teams or roles
- sensitive — requires encryption, masking, or audit logging
PII and Privacy Tags:
- pii_direct — directly identifies an individual (name, email, phone, SSN)
- pii_quasi — could identify an individual in combination with other data (zip code, birth date, employer)
- pii_sensitive — financial, health, or government ID information
- pii_none — confirmed non-PII
Data Quality and Lineage Tags:
- source_raw — unprocessed data from operational systems
- source_processed — cleaned, deduplicated, or enriched
- derived — calculated or aggregated from other tables
- deprecated — scheduled for removal
Domain and Business Tags:
- customer — relates to customer master data
- financial — revenue, pricing, transaction data
- operational — internal process and system data
- marketing — campaign, engagement, behavioral data
Your taxonomy should be specific enough to be useful but not so granular that it becomes unmanageable. Start with 20-30 tags across these categories.
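The taxonomy above can be pinned down in code so that classifications are validated before they reach your metadata store. This is a minimal sketch; the names `TAXONOMY` and `validate_tags` are illustrative, not part of any library.

```python
# Sketch: the article's taxonomy as a controlled vocabulary, plus a
# validator that reports any tag outside it. Catching out-of-vocabulary
# tags early prevents a single model slip from polluting the catalog.
TAXONOMY = {
    "data_classification": {"public", "internal", "confidential", "sensitive"},
    "pii": {"pii_direct", "pii_quasi", "pii_sensitive", "pii_none"},
    "quality": {"source_raw", "source_processed", "derived", "deprecated"},
    "domain": {"customer", "financial", "operational", "marketing"},
}

def validate_tags(category: str, tags: list) -> list:
    """Return the tags that are NOT part of the taxonomy for a category."""
    allowed = TAXONOMY.get(category, set())
    return [t for t in tags if t not in allowed]
```

Running the validator after every model response turns taxonomy drift into a hard error instead of a silent data-quality problem.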
Step 3: Crafting the Claude Prompt
Your prompt to Claude Opus 4.7 is the core of the system. It needs to:
- Explain the tagging taxonomy clearly
- Provide examples of how tags should be applied
- Define confidence thresholds and edge cases
- Specify the output format exactly
- Include relevant context about your business
Here’s a template:
You are a data governance expert. Your task is to classify database tables and columns using a predefined taxonomy.
TAXONOMY:
Data Classification:
- public: Safe for external dashboards and tools
- internal: Restricted to internal teams
- confidential: Restricted to specific roles
- sensitive: Requires encryption or masking
PII Classification:
- pii_direct: Directly identifies a person (name, email, phone, SSN, passport, driver's license)
- pii_quasi: Could identify someone in combination with other data (zip code, birth date, employer, job title)
- pii_sensitive: Financial (bank account, credit card), health, or government IDs
- pii_none: Not personally identifiable
Data Quality:
- source_raw: Unprocessed operational data
- source_processed: Cleaned or deduplicated
- derived: Calculated or aggregated
- deprecated: Scheduled for removal
Business Domain:
- customer: Customer master data
- financial: Revenue, pricing, transactions
- operational: Internal process data
- marketing: Campaign and engagement data
RULES:
1. Every column must receive exactly one PII classification tag.
2. Every table must receive at least one data classification tag.
3. Only assign tags you are confident about (>70% confidence). If unsure, assign 'uncertain' and explain.
4. Consider column names, descriptions, data types, and sample values.
5. In a customer table, email + phone + SSN are clearly pii_direct. Zip code in the same table is pii_quasi.
6. Encrypted or hashed columns should be tagged based on what they contain, not their appearance.
7. Aggregate tables (sums, counts) are not PII, even if derived from PII.
EXAMPLES:
Table: users
Columns:
- user_id (bigint): Unique identifier → public, source_raw, customer
- email (varchar): User email → pii_direct, internal
- password_hash (varchar): Hashed password → pii_sensitive, internal
- signup_date (date): Account creation date → public, customer
- ip_address (varchar): Last login IP → pii_quasi, internal
Table: daily_revenue
Columns:
- date (date): Report date → public, financial
- total_revenue (decimal): Sum of all transactions → public, financial, derived
- transaction_count (integer): Number of transactions → public, financial, derived
Now classify the following table:
[TABLE SCHEMA HERE]
Respond in JSON format:
{
  "table": "table_name",
  "classifications": {
    "table_tags": ["tag1", "tag2"],
    "columns": [
      {
        "name": "column_name",
        "pii_tag": "pii_direct|pii_quasi|pii_sensitive|pii_none",
        "quality_tag": "source_raw|source_processed|derived|deprecated",
        "business_tag": "customer|financial|operational|marketing",
        "data_classification": "public|internal|confidential|sensitive",
        "confidence": 0.95,
        "reasoning": "Brief explanation"
      }
    ]
  }
}
This prompt is detailed but not overwhelming. It gives Claude Opus 4.7 clear rules, examples, and expected output format. The improvements in instruction following from Claude Opus 4.7 mean it will adhere to this structure reliably.
Step 4: Batch Processing and API Integration
For a production system, you’ll process your catalog in batches. Don’t send your entire schema in one request—that’s inefficient and error-prone. Instead:
- Group tables by domain or owner — Process all customer tables together, then financial tables, etc. This gives Claude relevant context without overwhelming the request.
- Use the Anthropic API with appropriate batching — The Anthropic API supports batch processing for cost optimization; see the Anthropic API documentation for details.
- Cache repeated context — If you’re processing many tables from the same schema, use prompt caching to avoid re-sending the taxonomy and rules.
- Implement retry and review logic — Retry transient API failures, and flag low-confidence results (< 70%) for manual validation by a human reviewer.
- Store results in your metadata store — Write tags back to your data catalog system, whether that’s a dedicated metadata database, your data warehouse’s schema metadata, or a separate metadata service.
Here’s a Python pseudocode example:
import anthropic
import json
from typing import List, Dict

client = anthropic.Anthropic()

def classify_table_batch(tables: List[Dict], taxonomy: str) -> List[Dict]:
    """Classify a batch of tables using Claude Opus 4.7."""
    results = []
    for table in tables:
        schema_json = json.dumps(table, indent=2)
        message = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            messages=[
                {
                    "role": "user",
                    "content": f"{taxonomy}\n\nClassify this table:\n\n{schema_json}"
                }
            ]
        )
        # Parse Claude's response
        response_text = message.content[0].text
        classification = json.loads(response_text)
        results.append(classification)
    return results

# Usage
taxonomy_prompt = """You are a data governance expert..."""
tables_to_classify = load_tables_from_warehouse()
classifications = classify_table_batch(tables_to_classify, taxonomy_prompt)
store_classifications(classifications)
The key is making this repeatable and auditable. Every classification should include reasoning, confidence scores, and a timestamp. If you need to update your taxonomy or retrain later, you’ll want to know which classifications came from which version of the system.
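The prompt-caching optimization mentioned earlier can be sketched as follows. The large, static taxonomy goes in a system block marked for caching, so only the per-table schema varies between requests. This assumes the Anthropic Python SDK's `messages.create` parameters and its `cache_control` block format; the model id follows this article's naming, and `build_request` is an illustrative helper.

```python
# Sketch: construct request kwargs that reuse a cached taxonomy block.
# Only the construction is shown; client.messages.create(**req) would
# submit it. Assumes the Anthropic SDK's cache_control block shape.
def build_request(taxonomy: str, schema_json: str) -> dict:
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 2048,
        # cache_control marks the (large, unchanging) taxonomy for reuse
        # across many classification calls.
        "system": [
            {"type": "text", "text": taxonomy,
             "cache_control": {"type": "ephemeral"}}
        ],
        "messages": [
            {"role": "user",
             "content": f"Classify this table:\n\n{schema_json}"}
        ],
    }
```

Because the cached prefix must be byte-identical between calls, keep the taxonomy string immutable and put anything table-specific in the user message.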
Handling PII Detection at Scale
PII detection is the highest-stakes classification task. A missed sensitive column could lead to compliance violations. A false positive on a non-sensitive column creates unnecessary restrictions. Claude Opus 4.7 handles this well, but you need to structure it carefully.
Multi-Layer PII Detection Strategy
Don’t rely on Claude alone. Use a layered approach:
Layer 1: Pattern Matching — Use regex or exact matching for obvious cases. Column names matching ^(ssn|social_security|credit_card|cvv|password|api_key) are definitely PII. This is fast and deterministic.
Layer 2: Claude Classification — For columns that don’t match obvious patterns, send them to Claude Opus 4.7. This catches semantic PII—columns that don’t have obvious names but contain sensitive data based on description or context.
Layer 3: Human Review — Flag low-confidence classifications and columns that Claude marks as ‘uncertain’. Have a data steward review these manually. This is expensive but necessary for compliance.
Layer 4: Continuous Monitoring — Once tags are applied, monitor query access patterns. If a column tagged as non-PII is frequently accessed by non-data teams or exported to external systems, flag it for review.
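Layer 1 can be sketched directly from the pattern in the text. The regex list here is a starting point, not an exhaustive rule set, and `layer1_screen` is an illustrative name.

```python
import re

# Layer 1 sketch: deterministic name matching for obviously sensitive
# columns, run before anything is sent to the model. Everything that
# doesn't match falls through to Layer 2 (Claude classification).
OBVIOUS_PII = re.compile(
    r"^(ssn|social_security|credit_card|cvv|password|api_key)",
    re.IGNORECASE,
)

def layer1_screen(column_names):
    """Split columns into (definitely_pii, needs_model_review)."""
    definite, review = [], []
    for name in column_names:
        (definite if OBVIOUS_PII.match(name) else review).append(name)
    return definite, review
```

Note that a column like `email` deliberately falls through to Layer 2: semantic cases are exactly what the model layer is for.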
Example: Detecting Quasi-Identifiers
Quasi-identifiers are particularly tricky. A zip code alone isn’t PII. But a zip code + birth date + gender can re-identify someone. Claude Opus 4.7’s context awareness helps here:
Table: customer_demographics
Columns:
- customer_id (bigint)
- birth_date (date)
- zip_code (varchar)
- gender (varchar)
- state (varchar)
Claude's reasoning:
"This table contains multiple quasi-identifiers. Individually, zip_code and gender are low-risk. But together with birth_date, they form a quasi-identifier set. This combination can re-identify individuals in many cases. Tag this table as containing pii_quasi data, and recommend that birth_date + zip_code + gender not be exposed together in customer-facing analytics."
This is exactly the kind of reasoning that Claude Opus 4.7’s improved instruction following and document reasoning enable. It’s not just pattern matching—it’s understanding relationships and context.
Handling Encrypted and Hashed Data
A column containing $2b$12$... (bcrypt hash) or AES-256-encrypted values should be tagged based on what the column contains, not what it looks like. Claude Opus 4.7 can infer this from the column description and context:
Column: password_hash
Description: "Bcrypt-hashed password"
Data type: varchar
Sample values: ["$2b$12$...", "$2b$12$..."]
Claude's classification:
- pii_tag: pii_sensitive (because it's a hashed password)
- data_classification: sensitive (requires encryption at rest and audit logging)
The key is including enough context in your schema extraction. Include column descriptions that explicitly state encryption methods, hashing algorithms, and data sensitivity.
Integration with Analytics Platforms and Data Catalogs
Once you’ve classified your catalog, you need to act on those classifications. This is where the tags become operationally useful.
Integration with D23 and Embedded Analytics
If you’re using D23’s managed Apache Superset platform for embedded analytics or self-serve BI, you can use tags to:
- Control dashboard visibility — Only show non-sensitive columns in customer-facing dashboards
- Enforce column-level access control — Restrict sensitive columns to specific roles
- Auto-generate data dictionaries — Use tags and classifications to create documentation
- Power text-to-SQL filters — When users ask “show me customer revenue,” the text-to-SQL engine can exclude columns tagged as pii_direct or pii_sensitive
D23’s API-first architecture makes it straightforward to push classifications as metadata. You can write tags to Superset’s metadata layer via the API, and they’ll be enforced across all dashboards and queries.
Metadata Store Integration
Store classifications in a dedicated metadata service or data catalog:
- Apache Atlas — Open-source data governance and metadata management
- Collibra — Commercial metadata platform with governance workflows
- Alation — Data catalog with AI-powered classification
- Custom metadata database — PostgreSQL or similar with a simple schema for tags and classifications
Your metadata store should support:
- Versioning (track classification changes over time)
- Lineage (which tables depend on which other tables)
- Access control (who can see classifications, who can modify them)
- Audit logging (what changed, when, and by whom)
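For the custom-metadata-database option, the requirements above (versioning, audit fields) boil down to a small schema. This sketch uses in-memory SQLite purely to show the shape; a production store would live in PostgreSQL or similar, and all table and column names here are illustrative.

```python
import sqlite3

# Minimal sketch of a versioned, audited column-classification store.
DDL = """
CREATE TABLE column_classifications (
    table_id         TEXT NOT NULL,
    column_name      TEXT NOT NULL,
    pii_tag          TEXT NOT NULL,
    confidence       REAL,
    taxonomy_version TEXT NOT NULL,  -- which taxonomy produced this tag
    classified_at    TEXT DEFAULT CURRENT_TIMESTAMP,
    classified_by    TEXT,           -- model id or a human reviewer
    PRIMARY KEY (table_id, column_name, taxonomy_version)
)
"""

def new_store() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(DDL)
    return conn
```

Keying the primary key on `taxonomy_version` keeps every generation of classifications side by side, which is what makes later audits and re-classification diffs possible.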
Automating Downstream Actions
Once tags are stored, automate enforcement:
def enforce_classification_rules(table_name: str, classifications: Dict):
    """Apply classification rules to a table in your data warehouse."""
    if 'pii_direct' in classifications['pii_tags']:
        # Encrypt at rest
        apply_table_encryption(table_name)
        # Restrict access
        restrict_access(table_name, allowed_roles=['data_team', 'compliance'])
        # Enable audit logging
        enable_audit_logging(table_name)

    if 'sensitive' in classifications['data_classification']:
        # Mask in non-production environments
        apply_masking(table_name, environment='staging')
        # Require approval for exports
        require_export_approval(table_name)

    if 'deprecated' in classifications['quality_tags']:
        # Schedule for deletion
        schedule_deletion(table_name, days_until_deletion=90)
        # Notify owners
        notify_table_owner(table_name, message="This table is deprecated and will be deleted in 90 days")
This automation is critical. Manual enforcement doesn’t scale. Once Claude Opus 4.7 has classified your catalog, you need systems that automatically apply those classifications.
Real-World Challenges and Solutions
Implementing this at scale isn’t frictionless. Here are common challenges and how to address them.
Challenge 1: Inconsistent or Missing Descriptions
Many data warehouses have tables and columns with minimal documentation. If your schema looks like:
Table: t_cust_012
Columns: id, col_1, col_2, col_3, col_4
Claude Opus 4.7 can’t infer much. Solution: Before running classification, invest in documentation. Run a quick script to extract sample values, infer data types, and identify potential PII based on patterns. Feed this enriched metadata to Claude.
Alternatively, use Claude to generate suggested descriptions based on sample values:
Column: col_2
Data type: varchar
Sample values: ["john.doe@company.com", "jane.smith@company.com", "bob.jones@company.com"]
Claude's suggestion: "This column contains email addresses. Description: 'Customer email address for contact and communication purposes.'"
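The description-suggestion step can be sketched as a small prompt builder; the actual model call would follow the same pattern as the classification code earlier. The prompt wording and the helper name are illustrative.

```python
# Sketch: build a prompt asking the model to propose a description for a
# poorly documented column, from its name, type, and sample values alone.
def build_description_prompt(column: dict) -> str:
    samples = ", ".join(column.get("sample_values", [])[:5])
    return (
        "Suggest a one-sentence description for this database column.\n"
        f"Name: {column['name']}\n"
        f"Data type: {column['data_type']}\n"
        f"Sample values: {samples}\n"
        "Respond with the description only."
    )
```

Capping the sample values keeps the prompt small when this runs across hundreds of undocumented columns.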
Challenge 2: Business Logic That Violates Classification Rules
Sometimes your business has legitimate reasons to store PII in unusual places. Example: A table of test users for QA purposes that includes real-looking email addresses and phone numbers, but they’re fake.
Solution: Allow overrides and exceptions in your classification system. If a human reviews a classification and disagrees with Claude, they should be able to override it with a reason. Track these overrides and use them to refine your prompt over time.
Challenge 3: Schema Evolution and Drift
Your data warehouse isn’t static. Tables are added, columns are renamed, and old data is archived. Your classification system needs to handle this.
Solution:
- Run classification regularly — Monthly or quarterly, depending on change velocity
- Track classification versions — Store which version of your taxonomy was used for each classification
- Detect schema changes — Flag new columns and modified tables for re-classification
- Grandfather existing classifications — Don’t reclassify everything every time. Only reclassify changed columns.
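The change-detection step above can be implemented by fingerprinting each table's normalized schema and re-classifying only tables whose fingerprint changed. The helper names are illustrative.

```python
import hashlib
import json

# Sketch: fingerprint a normalized table schema so that schema drift is
# detected cheaply. Sorting keys makes cosmetic reordering a no-op.
def schema_fingerprint(table: dict) -> str:
    canonical = json.dumps(
        {"table": table["table_name"],
         "columns": [(c["name"], c["data_type"]) for c in table["columns"]]},
        sort_keys=True,
    )
    return hashlib.sha256(canonical.encode()).hexdigest()

def tables_needing_reclassification(tables, known_fingerprints):
    """known_fingerprints maps table_name -> last classified fingerprint."""
    return [t["table_name"] for t in tables
            if known_fingerprints.get(t["table_name"]) != schema_fingerprint(t)]
```

This is what makes "grandfathering" cheap: unchanged tables hash to the same value and are skipped entirely.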
Challenge 4: Cross-Database and Multi-Team Consistency
If you have multiple data warehouses (Snowflake, BigQuery, Redshift) or multiple teams managing data, you need consistent classifications across all of them.
Solution: Maintain a single source of truth for your taxonomy and classification rules. Use Claude Opus 4.7 consistently across all databases. Store all classifications in a central metadata service. Regularly audit for inconsistencies.
Advanced Patterns: Agentic Classification
Claude Opus 4.7 supports agentic workflows—the model can call tools, receive results, and refine its classification. This opens up more sophisticated patterns.
Tool-Augmented Classification
Instead of Claude just reading your schema, give it tools to:
- Look up table lineage — Query your data warehouse to find upstream dependencies
- Check access logs — See which roles/users access each table
- Validate against compliance frameworks — Check if a column matches GDPR, HIPAA, or CCPA requirements
- Cross-reference with known PII lists — Query a database of known sensitive values
Example:
def classify_with_tools(table_schema: Dict) -> Dict:
    """Classify a table using Claude with tool access."""
    # Tool definitions use JSON Schema for input_schema, per the API spec
    tools = [
        {
            "name": "lookup_lineage",
            "description": "Find upstream tables and transformations",
            "input_schema": {
                "type": "object",
                "properties": {"table_name": {"type": "string"}},
                "required": ["table_name"]
            }
        },
        {
            "name": "check_access_logs",
            "description": "See which roles access this table",
            "input_schema": {
                "type": "object",
                "properties": {
                    "table_name": {"type": "string"},
                    "days": {"type": "integer"}
                },
                "required": ["table_name"]
            }
        },
        {
            "name": "validate_compliance",
            "description": "Check against GDPR/HIPAA/CCPA requirements",
            "input_schema": {
                "type": "object",
                "properties": {
                    "column_name": {"type": "string"},
                    "data_type": {"type": "string"}
                },
                "required": ["column_name"]
            }
        }
    ]
    response = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        tools=tools,
        messages=[
            {
                "role": "user",
                "content": f"Classify this table, using tools as needed: {json.dumps(table_schema)}"
            }
        ]
    )
    # Process tool calls and refine classification
    # ...
Caylent’s deep dive on Claude Opus 4.7’s agentic capabilities provides detailed guidance on implementing these patterns.
Iterative Refinement with Human Feedback
Use an agentic loop to refine classifications based on human feedback:
- Claude classifies a table
- A human reviews and disagrees with one classification
- Claude receives the feedback and re-classifies with the new context
- The refined classification is stored
- Over time, Claude learns patterns from your feedback
This requires careful prompt engineering to avoid Claude just agreeing with whatever feedback it receives. But when done right, it creates a self-improving system.
Cost and Performance Considerations
Running Claude Opus 4.7 at scale has cost and latency implications. Here’s how to optimize.
Cost Optimization
- Batch processing — Use Anthropic’s batch API for non-urgent classifications. Batch requests are 50% cheaper than real-time API calls.
- Prompt caching — If you’re classifying many tables with the same taxonomy and rules, cache the system prompt. This reduces costs for repeated requests.
- Confidence-based filtering — Only send low-confidence results to Claude for refinement. Use cheaper pattern matching for obvious cases.
- Right-sizing context — Don’t send your entire 1M token context window if you only need 10K tokens. Be precise about what you include.
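The batch-processing bullet above can be sketched as a request builder for Anthropic's Message Batches API. Only the request construction is shown; submitting the list via the batches endpoint would follow. The model id follows this article's naming, and the helper name is illustrative.

```python
import json

# Sketch: prepare one batch entry per table for the Message Batches API.
# custom_id lets you match results back to tables when the batch completes.
def build_batch_requests(schemas, taxonomy):
    return [
        {
            "custom_id": f"{s['schema_name']}.{s['table_name']}",
            "params": {
                "model": "claude-opus-4-7",
                "max_tokens": 2048,
                "messages": [{
                    "role": "user",
                    "content": f"{taxonomy}\n\nClassify:\n{json.dumps(s)}",
                }],
            },
        }
        for s in schemas
    ]
```

Batch results arrive asynchronously, which is fine here: catalog classification is a background job, not a request-path operation.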
Latency Optimization
- Parallel processing — Process multiple tables concurrently. Claude Opus 4.7 can handle concurrent requests.
- Batch grouping — Group related tables to reduce the number of API calls. Instead of 100 calls for 100 tables, make 10 calls for 10 batches of 10 tables.
- Caching — Use prompt caching to avoid re-processing the same taxonomy and rules.
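The parallel-processing bullet can be sketched with a bounded thread pool, which keeps concurrency below your API rate limits. `classify_fn` stands in for a call into the classification pipeline shown earlier.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: classify tables concurrently with a bounded worker pool.
# pool.map preserves input order, so results line up with the input list.
def classify_concurrently(tables, classify_fn, max_workers=5):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(classify_fn, tables))
```

Threads are the right tool here because the work is I/O-bound (waiting on API responses), not CPU-bound.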
Monitoring and Observability
Track:
- Classification latency — How long does each table take to classify?
- Cost per table — How much are you spending to classify your catalog?
- Accuracy metrics — How often do humans agree with Claude’s classifications?
- Confidence distribution — What percentage of classifications are high-confidence vs. uncertain?
Use these metrics to refine your approach over time.
Putting It All Together: A Complete Example
Let’s walk through a complete example from schema extraction to classification to enforcement.
Scenario
You’re a mid-market SaaS company with a Snowflake data warehouse. You have ~500 tables across 10 schemas. You want to classify all of them using Claude Opus 4.7, then enforce access control in your D23 embedded analytics platform and data warehouse.
Step 1: Extract Schema
import snowflake.connector
import json

def extract_snowflake_schema():
    conn = snowflake.connector.connect(
        user='data_engineer',
        password='...',
        account='xy12345.us-east-1',
        warehouse='compute'
    )
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM information_schema.tables WHERE table_schema != 'INFORMATION_SCHEMA'")
    tables = cursor.fetchall()
    schemas = []
    for table in tables:
        # Tuple positions depend on your warehouse's information_schema layout
        schema_name = table[1]
        table_name = table[2]
        # Get columns
        cursor.execute(f"""SELECT column_name, data_type, comment FROM information_schema.columns
            WHERE table_schema = '{schema_name}' AND table_name = '{table_name}'""")
        columns = cursor.fetchall()
        # Get sample values
        cursor.execute(f"SELECT * FROM {schema_name}.{table_name} LIMIT 5")
        samples = cursor.fetchall()
        schema_obj = {
            "table_name": table_name,
            "schema_name": schema_name,
            "description": table[3] or "",
            "columns": [
                {
                    "name": col[0],
                    "data_type": col[1],
                    "description": col[2] or "",
                    "sample_values": [str(s[i]) for s in samples[:3]] if samples else []
                }
                for i, col in enumerate(columns)
            ]
        }
        schemas.append(schema_obj)
    return schemas
Step 2: Classify with Claude
def classify_schemas(schemas: List[Dict]) -> List[Dict]:
    taxonomy = """You are a data governance expert..."""
    classifications = []
    for i, schema in enumerate(schemas):
        if i % 10 == 0:
            print(f"Classifying table {i} of {len(schemas)}...")
        schema_json = json.dumps(schema, indent=2)
        message = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=2048,
            messages=[
                {
                    "role": "user",
                    "content": f"{taxonomy}\n\nClassify:\n{schema_json}"
                }
            ]
        )
        classification = json.loads(message.content[0].text)
        classification['table_id'] = f"{schema['schema_name']}.{schema['table_name']}"
        classifications.append(classification)
    return classifications
Step 3: Store in Metadata Database
import psycopg2

def store_classifications(classifications: List[Dict]):
    conn = psycopg2.connect("dbname=metadata user=metadata_admin")
    cursor = conn.cursor()
    for classification in classifications:
        # Store table-level classifications
        cursor.execute("""
            INSERT INTO table_classifications (table_id, data_classification, quality_tag, business_tag, updated_at)
            VALUES (%s, %s, %s, %s, NOW())
            ON CONFLICT (table_id) DO UPDATE SET
                data_classification = EXCLUDED.data_classification,
                quality_tag = EXCLUDED.quality_tag,
                business_tag = EXCLUDED.business_tag,
                updated_at = NOW()
        """, (
            classification['table_id'],
            classification['classifications']['table_tags'][0],
            classification['classifications']['table_tags'][1] if len(classification['classifications']['table_tags']) > 1 else None,
            classification['classifications']['table_tags'][2] if len(classification['classifications']['table_tags']) > 2 else None
        ))
        # Store column-level classifications
        for column in classification['classifications']['columns']:
            cursor.execute("""
                INSERT INTO column_classifications (table_id, column_name, pii_tag, data_classification, confidence, reasoning)
                VALUES (%s, %s, %s, %s, %s, %s)
                ON CONFLICT (table_id, column_name) DO UPDATE SET
                    pii_tag = EXCLUDED.pii_tag,
                    data_classification = EXCLUDED.data_classification,
                    confidence = EXCLUDED.confidence,
                    reasoning = EXCLUDED.reasoning
            """, (
                classification['table_id'],
                column['name'],
                column['pii_tag'],
                column['data_classification'],
                column['confidence'],
                column['reasoning']
            ))
    conn.commit()
    cursor.close()
    conn.close()
Step 4: Enforce in Data Warehouse
import time

def enforce_classifications(classifications: List[Dict]):
    conn = snowflake.connector.connect(...)
    cursor = conn.cursor()
    for classification in classifications:
        table_id = classification['table_id']
        schema_name, table_name = table_id.split('.')
        # If table contains PII, restrict access
        has_pii = any(
            col['pii_tag'] != 'pii_none'
            for col in classification['classifications']['columns']
        )
        if has_pii:
            # Create a role for PII access
            role_name = f"role_{schema_name}_{table_name}_pii"
            cursor.execute(f"CREATE ROLE IF NOT EXISTS {role_name}")
            cursor.execute(f"GRANT SELECT ON {schema_name}.{table_name} TO {role_name}")
            cursor.execute(f"GRANT ROLE {role_name} TO ROLE data_team")
            # Cap time-travel retention (Snowflake encrypts data at rest by default)
            cursor.execute(f"ALTER TABLE {schema_name}.{table_name} SET DATA_RETENTION_TIME_IN_DAYS = 90")
        # If table is deprecated, schedule for deletion
        if 'deprecated' in classification['classifications']['table_tags']:
            cursor.execute(f"ALTER TABLE {schema_name}.{table_name} RENAME TO {table_name}_deprecated_{int(time.time())}")
    conn.close()
Step 5: Sync to Analytics Platform
import os
import requests

def sync_to_d23(classifications: List[Dict]):
    """Push classifications to D23 for enforcement in embedded analytics."""
    d23_api_key = os.getenv('D23_API_KEY')
    headers = {'Authorization': f'Bearer {d23_api_key}'}
    for classification in classifications:
        table_id = classification['table_id']
        # Update table metadata in D23
        payload = {
            "table_id": table_id,
            "tags": classification['classifications']['table_tags'],
            "column_permissions": [
                {
                    "column_name": col['name'],
                    "visible_to_roles": get_roles_for_classification(col['data_classification']),
                    "pii_tag": col['pii_tag']
                }
                for col in classification['classifications']['columns']
            ]
        }
        response = requests.post(
            'https://d23.io/api/v1/metadata/tables',
            json=payload,
            headers=headers
        )
        if response.status_code != 200:
            print(f"Failed to sync {table_id}: {response.text}")
This end-to-end flow takes you from raw schema to enforced classifications across your data infrastructure.
Monitoring, Iteration, and Continuous Improvement
Your classification system isn’t a one-time project. It needs ongoing monitoring and refinement.
Accuracy Metrics
Track how often humans agree with Claude’s classifications:
- Precision — Of the columns Claude tagged as PII, how many actually were?
- Recall — Of all actual PII columns, how many did Claude catch?
- F1 score — Harmonic mean of precision and recall
Aim for >95% precision and >90% recall on PII detection. For other classifications, >80% is acceptable.
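These metrics can be computed directly from human review results. Here `predicted` and `actual` are sets of column identifiers tagged as PII by the model and by reviewers respectively; the function name is illustrative.

```python
# Sketch: precision, recall, and F1 for PII detection, computed against
# human-reviewed ground truth.
def pii_metrics(predicted: set, actual: set) -> dict:
    tp = len(predicted & actual)  # columns both model and humans flagged
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For PII, recall failures (missed sensitive columns) are usually costlier than precision failures, so weight your review effort accordingly.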
Feedback Loop
When humans override Claude’s classifications, use that feedback to refine your prompt:
- Collect override data (original classification, human override, reason)
- Periodically review override patterns
- Update your prompt or taxonomy to address common disagreements
- Re-classify affected tables with the new prompt
- Measure improvement
Regression Testing
When you update your prompt or taxonomy, test it on a representative sample of tables you’ve already classified. Make sure your changes don’t break existing, correct classifications.
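A regression check like this can be sketched as a diff against a golden sample. Here `classify_fn` stands in for a call through the full Claude pipeline, and `regression_diff` is an illustrative helper name.

```python
# Sketch: re-run the classifier on a golden sample and report columns
# whose PII tag changed from the previously accepted classification.
def regression_diff(golden: dict, classify_fn) -> list:
    """golden maps column id -> expected pii_tag; returns (id, expected, got)."""
    changed = []
    for column_id, expected in golden.items():
        got = classify_fn(column_id)
        if got != expected:
            changed.append((column_id, expected, got))
    return changed
```

An empty diff means the prompt or taxonomy change is safe to roll out; a non-empty one is a review queue, not necessarily a bug, since some changes are intended.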
Conclusion
Claude Opus 4.7 is a powerful tool for automating data catalog classification at scale. Its extended context window, improved instruction following, and agentic capabilities make it well-suited for PII detection, business classification, and data quality tagging.
Implementing this requires careful attention to schema extraction, taxonomy design, prompt engineering, and integration with your data infrastructure. But the payoff is significant: a complete, accurate, automatically-maintained data catalog that drives governance, security, and analytics decisions across your organization.
If you’re using D23 for embedded analytics, self-serve BI dashboards, or managing data across multiple teams, this approach gives you the metadata foundation you need to scale safely and efficiently. Combined with D23’s API-first architecture and AI-powered analytics capabilities, Claude-driven classification becomes a core component of your data governance strategy.
Start small—pick one schema or domain, classify it with Claude Opus 4.7, and validate the results. Then expand to your entire catalog. The system gets better with feedback, and the operational benefits compound quickly.