Transforming unstructured course catalog data from 500+ institutions into standardized, actionable information using Amazon Bedrock
The Challenge
DegreeData, a Vermont-based education data company, provides standardized course and program information to universities, transfer evaluation services, and education technology platforms. Their core business depends on maintaining accurate, up-to-date course catalog data from hundreds of academic institutions.
The challenges were significant:
- Inconsistent formats – Each institution publishes course catalogs in different formats—PDFs, web pages, spreadsheets—with no standardization.
- Massive volume – Processing catalogs from 500+ institutions, each containing thousands of courses, required enormous manual effort.
- Time-intensive processing – Each catalog took 8-12 hours of manual work to parse, validate, and standardize.
- Data quality issues – Manual processing introduced errors and inconsistencies that affected downstream customers.
- Seasonal bottlenecks – Academic calendar cycles created predictable but overwhelming workload spikes.
DegreeData needed a solution that could handle the variety and volume of academic data while maintaining the accuracy their customers depend on.
The Solution
Horus Technologies designed an AI-driven data processing pipeline using Amazon Bedrock to automate the transformation of unstructured course catalog data into DegreeData's standardized schema.
Architecture Overview
The solution combines generative AI with structured data validation:
- Amazon S3 – Ingestion point for source catalogs in various formats (PDF, HTML, Excel).
- Amazon Bedrock – Foundation models parse unstructured content and extract course information intelligently.
- AWS Lambda – Serverless functions handle validation logic and data transformation.
- Amazon RDS – PostgreSQL database stores standardized course data in DegreeData's schema.
- AWS Step Functions – Orchestrates the multi-stage processing pipeline with error handling.
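The stage ordering above can be sketched as a simple sequential driver. The Python below is a hypothetical, simplified stand-in for the Step Functions state machine, which in the real pipeline adds retries, timeouts, and per-stage error handling:

```python
def run_pipeline(catalog_key: str, stages) -> dict:
    """Run extraction stages in order over a shared state dict,
    stopping and recording the error if any stage fails.

    `stages` is an ordered list of (name, callable) pairs; each callable
    takes and returns the accumulated state. Hypothetical structure,
    standing in for the Step Functions state machine.
    """
    state = {"source": catalog_key, "errors": []}
    for name, stage in stages:
        try:
            state = stage(state)
        except Exception as exc:  # Step Functions handles retry/catch here.
            state["errors"].append(f"{name}: {exc}")
            break
    return state
```

Each stage (parsing, extraction, validation, load) maps to one state in the actual state machine; the `errors` list mirrors the pipeline's error-handling branch.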
How It Works
- Source catalogs are uploaded to S3, triggering the processing pipeline.
- Amazon Bedrock analyzes each document, understanding structure regardless of format.
- The AI extracts key fields: course codes, titles, descriptions, credit hours, prerequisites, and learning outcomes.
- Lambda functions validate extracted data against business rules and flag anomalies.
- Validated data is transformed into DegreeData's standard schema and written to the database.
- Quality reports are generated for human review of edge cases.
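As a sketch of the validation step, here is the kind of rule check a Lambda function might apply. The field names and rules are illustrative assumptions, not DegreeData's actual business rules:

```python
import re

def validate_course(course: dict) -> list:
    """Check one extracted course record against simple business rules.
    Returns a list of flags; an empty list means the record passed.
    Hypothetical rules for illustration only."""
    flags = []
    # Course codes like "CHEM 101": 2-5 letters, optional space, 3 digits.
    if not re.fullmatch(r"[A-Z]{2,5} ?\d{3}[A-Z]?", course.get("code", "")):
        flags.append("suspicious_course_code")
    # Credit hours outside a plausible range usually mean a parsing error.
    credits = course.get("credit_hours")
    if not isinstance(credits, (int, float)) or not 0 < credits <= 12:
        flags.append("credit_hours_out_of_range")
    # A missing or very short title is typically an extraction failure.
    if len(course.get("title", "")) < 3:
        flags.append("missing_title")
    return flags
```

Records that return a non-empty flag list would feed the quality reports mentioned above rather than being written straight to the database.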
The breakthrough was Bedrock's ability to understand academic content in context. The AI recognizes that "CHEM 101" is a course code, "Introduction to Chemistry" is a title, and "3 credits" indicates credit hours, regardless of how each institution formats this information.
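A sketch of how such an extraction request can be framed for Bedrock's Converse API follows. The prompt wording and output schema are illustrative assumptions, not DegreeData's production prompts:

```python
def build_extraction_request(catalog_text: str) -> dict:
    """Build the message payload for a Bedrock Converse call asking the
    model to return course records as JSON (illustrative prompt)."""
    instructions = (
        "Extract every course from the catalog text below. Return a JSON "
        "array of objects with keys: code, title, description, "
        "credit_hours, prerequisites. Use null for missing fields."
    )
    return {
        "messages": [{
            "role": "user",
            "content": [{
                "text": f"{instructions}\n\n<catalog>\n{catalog_text}\n</catalog>"
            }],
        }],
        # Temperature 0 keeps structured extraction deterministic.
        "inferenceConfig": {"temperature": 0.0, "maxTokens": 4096},
    }
```

The returned dict maps onto boto3's `bedrock-runtime` client, e.g. `client.converse(modelId=..., **request)`, with the model's JSON reply parsed in the next pipeline stage.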
Results and Impact
The implementation delivered exceptional results, exceeding DegreeData's initial projections:
- Processing time – reduced from 8-12 hours per catalog to 1.5-2 hours
- Annual labor savings – reduced manual processing effort
- Data accuracy – improved over the manual baseline
- Return on investment – achieved within 14 months
Scale Achieved
- 500+ institutions processed simultaneously during peak academic periods
- Parallel processing – Multiple catalogs processed concurrently without bottlenecks
- Faster updates – Course data refreshed more frequently, improving customer value
- Reduced errors – Automated validation catches inconsistencies that human reviewers previously missed
Technology Deep Dive
Why Generative AI for Academic Data?
Traditional approaches to this problem—rule-based parsers, template matching, or basic OCR—failed because:
- No two catalogs are alike – Each institution has unique formatting, terminology, and structure.
- Context matters – Understanding that "Prerequisites: MATH 101 or equivalent" requires comprehension, not just text matching.
- Edge cases are common – Cross-listed courses, variable credits, and complex prerequisite trees require intelligent interpretation.
Amazon Bedrock's foundation models excel at this type of unstructured-to-structured transformation because they understand academic content semantically, not just syntactically.
Prompt Engineering for Education Data
Horus Technologies developed specialized prompts that guide the AI to extract education-specific information accurately:
- Course identification patterns across different numbering systems
- Credit hour variations (semester hours, quarter hours, contact hours)
- Prerequisite parsing with logical operators (AND, OR, concurrent enrollment)
- Program and degree mapping to standard taxonomies
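Two of these cases can be illustrated in miniature. The helpers below are simplified sketches: real prerequisite clauses need a full parser (parentheses, "concurrent enrollment", "or equivalent"), and the quarter-to-semester conversion uses the standard 2/3 ratio:

```python
import re

def parse_prerequisites(text: str) -> dict:
    """Parse a flat prerequisite clause such as 'MATH 101 or MATH 105'
    into a one-level boolean tree (simplified sketch)."""
    clause = text.strip()
    for op, key in ((" and ", "all_of"), (" or ", "any_of")):
        if op in clause.lower():
            parts = re.split(op, clause, flags=re.IGNORECASE)
            return {key: [p.strip() for p in parts]}
    return {"all_of": [clause]}

def to_semester_hours(value: float, unit: str) -> float:
    """Normalize credit values; quarter hours convert at the 2/3 ratio."""
    return round(value * 2 / 3, 2) if unit == "quarter" else float(value)
```

In the production pipeline, this normalization logic lives behind the prompts: the model extracts the raw clause and unit, and deterministic code standardizes them.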
Business Transformation
Beyond the quantitative metrics, the project transformed how DegreeData operates:
- From reactive to proactive – Team can now focus on expanding institutional coverage rather than processing backlogs.
- Improved customer relationships – Faster data updates mean customers always have current information.
- New product opportunities – The speed improvement enabled new real-time data products that weren't previously feasible.
- Competitive advantage – DegreeData can now onboard new institutions faster than competitors.
Lessons Learned
This project provided valuable insights for applying generative AI to data processing challenges:
- Domain expertise matters – Understanding education data structures was essential for effective prompt engineering.
- Validation is critical – AI extraction requires robust validation layers to catch errors before they reach production.
- Human oversight enhances quality – The system flags low-confidence extractions for human review, combining AI speed with human judgment.
- Iterative improvement – Processing accuracy improved over time as edge cases were identified and prompts refined.
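The human-oversight point reduces to a simple routing rule. The `confidence` field and threshold below are hypothetical, standing in for whatever per-record score the pipeline produces:

```python
def route_extraction(record: dict, threshold: float = 0.85) -> str:
    """Send low-confidence or flagged records to human review and
    auto-accept the rest. The threshold is an assumed tuning parameter."""
    confident = record.get("confidence", 0.0) >= threshold
    clean = not record.get("flags")
    return "auto_accept" if confident and clean else "human_review"
```

Tuning the threshold trades reviewer workload against the risk of bad records reaching customers; as prompts improved, more records could clear it automatically.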
Is This Right for Your Organization?
If your organization processes large volumes of semi-structured or unstructured data from multiple sources, generative AI can likely deliver similar efficiency gains. Good candidates include:
- Data aggregation businesses that consolidate information from many sources
- Organizations with document processing bottlenecks
- Companies where manual data entry is a significant cost center
- Businesses needing to scale processing capacity without proportional staffing increases
Horus Technologies specializes in building intelligent data processing solutions on AWS. Our team includes former AWS engineers who understand how to architect systems that scale with your business needs.