Koji for Education | Course evaluation and quality management software

What quality assurance teams told us, and how we answer it.

What we heard

Response rates in freefall

7 out of 150 students · self-selection bias

Surveys that close with zero responses

3-response minimum · small master's cohorts

Eight pages of comments, summarised by hand

TAs filtering in Copilot · non-constructive comments

Same courses score poorly for years

QA supplies data · no system for follow-through

Privacy officers block every AI pilot

AVG, not just GDPR · data sovereignty

How Koji answers it

Conversations students actually finish

2x response rate · last-lecture timing

Open text that's safe to read

auto-moderation · quality-scored

Triangulated, not a single number

panels + response groups + pass rates

Closes the feedback loop

you said / we did · action tracking

Clears your privacy officer's review

AVG · EU hosting · SURFconext

The platform, in four pillars

EvaSys stops at data delivery. Koji owns the full quality cycle.

Conversations, not forms. AI-powered interviews with follow-up questions, in text or voice across 30+ languages, that students actually finish.

Triangulated quality. Course evaluations, panel discussions, student response groups, peer review, and pass rates in one picture.

Closes the feedback loop. "You said / we did" back to students, with action tracking for educational directors.

Compliance-first. AVG/GDPR, EU data residency, SURFconext SSO, and no training on student data.

How we built it, and why

Every capability exists because quality assurance officers, educational directors, and reform working groups told us it was needed.

Conversations, not questionnaires

An AI interviewer conducts each evaluation as an adaptive conversation with follow-up questions, in text or voice across 30+ languages. Students open up more in a conversational format than they ever do on a static form.

Why we built it this way

Online response rates have collapsed. One instructor received 7 responses from 150 students. Self-selection bias means only the highly motivated or very angry respond. In our pilot, conversations doubled response rates and lifted completion by 50%. The strongest lever, running the evaluation in the last 10 minutes of the final lecture, quadrupled responses, so the platform is built around exactly that in-class moment.

A short instrument, and open feedback that's finally safe to read

Five to seven core questions on a 5-point scale, with conversational depth where it matters. Open comments are moderated, summarised, and quality-checked before anyone sees them.

Why we built it this way

Quality assurance teams told us the most valuable feedback is qualitative, but raw text is overwhelming. Some courses generate eight pages of comments that TAs filter by hand using Copilot. A minority of comments are non-constructive or abusive. We filter inappropriate content, show a summary by default (raw on request), prompt students to be respectful, and make clear anonymity can be forfeited for abuse.

Quality scoring on every response

Each conversation is scored on relevance, depth, coverage, and completion. Only responses scoring 3 or higher are included in reports, so they reflect signal, not noise.
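
As a rough illustration of the inclusion rule above, the sketch below averages four per-dimension scores and keeps conversations that reach 3 or higher. The 1-to-5 scale, the unweighted mean, and every name in the code are assumptions made for the example; this page does not say how Koji combines the four dimensions.

```python
from dataclasses import dataclass
from statistics import mean

INCLUSION_THRESHOLD = 3.0  # "3 or higher" comes from the text; the scale is assumed


@dataclass
class ConversationScore:
    """Hypothetical per-dimension scores, assumed to run from 1 to 5."""
    relevance: float
    depth: float
    coverage: float
    completion: float

    def overall(self) -> float:
        # Assumption: a plain unweighted mean of the four dimensions.
        return mean([self.relevance, self.depth, self.coverage, self.completion])


def included_in_report(scores):
    """Keep only conversations whose overall score meets the threshold."""
    return [s for s in scores if s.overall() >= INCLUSION_THRESHOLD]


conversations = [
    ConversationScore(4, 4, 3, 5),  # mean 4.0, kept
    ConversationScore(2, 1, 2, 3),  # mean 2.0, dropped
    ConversationScore(3, 3, 3, 3),  # mean 3.0, kept (threshold is inclusive)
]
kept = included_in_report(conversations)
print(len(kept))  # 2
```

A weighted mean or a per-dimension minimum would be equally plausible readings of the rule; the filter step stays the same either way.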

Why we built it this way

EvaSys requires a minimum of three responses to generate a report, so many surveys close with zero usable data, especially in small master's cohorts. Quality scoring means a handful of genuine conversations beats a pile of half-finished forms, and courses of five still produce actionable insights.

Triangulation, not a single number

Koji combines the course evaluation with panel discussions, student response group feedback, peer review, instructor self-reflection, learning-outcome data, and pass rates into a single course-quality picture.

Why we built it this way

Quantitative scores alone don't reveal much. A course can score a six or an eight, but quality assurance officers still can't tell whether the issue is the slides, the deadlines, or the course manual. The platform is built around triangulation, exactly the direction reform working groups are now mandating.

Governed access and contextual interpretation

Role-based views: instructors see full results; educational directors and programme directors get contextualised summaries with response rate, class size and trajectory; heads of department see development-oriented summaries, never raw scores; HR sees data only if the instructor shares it.
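
The access rules above amount to a lookup from role to view, plus one consent check for HR. A minimal sketch, in which the role names, view labels, and the `instructor_shared` flag are illustrative assumptions rather than Koji's actual data model:

```python
from enum import Enum
from typing import Optional


class Role(Enum):
    INSTRUCTOR = "instructor"
    EDUCATIONAL_DIRECTOR = "educational_director"
    PROGRAMME_DIRECTOR = "programme_director"
    HEAD_OF_DEPARTMENT = "head_of_department"
    HR = "hr"


# View each role gets by default, following the description in the text.
DEFAULT_VIEW = {
    Role.INSTRUCTOR: "full_results",
    Role.EDUCATIONAL_DIRECTOR: "contextualised_summary",
    Role.PROGRAMME_DIRECTOR: "contextualised_summary",
    Role.HEAD_OF_DEPARTMENT: "development_summary",  # never raw scores
}


def view_for(role: Role, instructor_shared: bool = False) -> Optional[str]:
    """Return the view a role may open, or None when access is denied."""
    if role is Role.HR:
        # HR sees data only if the instructor has chosen to share it.
        return "shared_by_instructor" if instructor_shared else None
    return DEFAULT_VIEW[role]


print(view_for(Role.HEAD_OF_DEPARTMENT))          # development_summary
print(view_for(Role.HR))                          # None
print(view_for(Role.HR, instructor_shared=True))  # shared_by_instructor
```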

Why we built it this way

Faculty councils are explicitly decoupling evaluations from promotion and tenure, primarily because of bias concerns. Evaluations focus on course improvement, not faculty assessment. The product encodes who sees what, always with context, and states plainly that this is student experience, not teaching quality, and not an input to personnel decisions.

Close the feedback loop

A "You said / we did" slide shows each new cohort what changed because of last year's feedback, and the platform tracks whether educational directors and instructors acted on recommendations.

Why we built it this way

Students stop filling in evaluations because they never see the effects of their feedback. Quality assurance teams are stuck as data suppliers while the same courses score poorly for years with the same comments. Closing the loop is the one capability no incumbent offers, and the most durable way to sustain response rates over time.

Longitudinal, cross-course analytics

Query across years, courses, programmes and cohorts, for example "show all responses about English proficiency in this course over five years", and surface trends with alerts for persistently weak courses.
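
To make the example query concrete, here is a small sketch of filtering stored responses by course, theme, and year range. The record shape, the field names, and the sample data are all invented for illustration; they are not Koji's schema or real student feedback.

```python
from dataclasses import dataclass, field


@dataclass
class Response:
    """Hypothetical record shape; field names are illustrative."""
    course: str
    year: int
    themes: set = field(default_factory=set)
    quote: str = ""


def query(responses, course, theme, first_year, last_year):
    """Return responses for one course and theme within an inclusive year range."""
    return [
        r for r in responses
        if r.course == course
        and theme in r.themes
        and first_year <= r.year <= last_year
    ]


# Invented sample data.
responses = [
    Response("Statistics 1", 2020, {"english proficiency"}, "Slides were hard to follow in English."),
    Response("Statistics 1", 2023, {"deadlines"}, "Two deadlines fell in the same week."),
    Response("Statistics 1", 2024, {"english proficiency"}, "Tutorial explanations were clearer this year."),
    Response("Marketing", 2024, {"english proficiency"}, "Guest lecture was difficult to understand."),
]

hits = query(responses, "Statistics 1", "english proficiency", 2020, 2024)
print([r.year for r in hits])  # [2020, 2024]
```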

Why we built it this way

EvaSys can't cross-reference responses or query history, so quality assurance officers rebuild it by hand in Excel every quarter. Dashboards get finalised, then the data sits there. We made historical querying and trend tracking a first-class capability.

Built for accreditation and comparability

NVAO, AACSB and EQUIS-ready reporting with anchor items that preserve year-on-year comparability even as the instrument is reformed. Visible markers wherever the methodology changed, and continuous-improvement evidence for your six-year accreditation cycle and three-year midterm review.

Why we built it this way

Reforming a survey breaks historical baselines, a genuine accreditation risk. Quality assurance teams need to demonstrate that student voices are heard and that feedback drives improvement. The platform keeps the trend line intact and the change auditable.

Compliance-first, EU-sovereign

AVG/GDPR by design, EU data residency, SURFconext SSO, DPIA-ready documentation, automatic PII redaction, human oversight of AI outputs, and a hard commitment never to train models on student data.

Why we built it this way

Institutions want to use AI, but privacy officers create legal barriers. Dutch universities are cautious about data going to American companies, even when it's not personally identifiable. Everything is built to clear a privacy officer's review and a SURF DPIA, not to contest one.

How universities use Koji for Education

Full data migration from EvaSys — historical evaluations, course structures, and baselines
Actionable data from every course, including cohorts of five where EvaSys closes empty
Anchor items preserve year-on-year comparability, even through instrument reform

Course evaluation reform

Roll out a reformed, improvement-focused evaluation instrument across a faculty or institution, aligned with your working group's recommendations.

Qualitative feedback at scale

Turn open-ended responses into structured, traceable insights with verbatim quotes, instead of having TAs summarise eight pages of comments by hand.

Accreditation and programme review

Generate continuous-improvement evidence for NVAO, AACSB and EQUIS, including your six-year accreditation cycle and three-year midterm review.

Quality assurance workflow

Automate the evaluation cycle from distribution through reporting: fact sheets, year reports, and one-page course summaries, ready for educational directors.

Close the feedback loop

Show each new cohort what changed because of last year's feedback. Track whether recommendations were acted on, across every course and programme.

Cross-faculty benchmarking

A consistent quality picture across every programme and faculty, with cross-referencing and trend analysis shareable at inter-faculty QA meetings.

2x response rates · vs. static surveys (piloted)

+50% completion · students finish the conversation

30+ languages · text & voice

5–8 min per student · adaptive interview

Why teams switch

Incumbents stop at data delivery. Koji owns the whole Plan-Do-Check-Act quality cycle.

| Dimension | EvaSys / Qualtrics | Koji |
| --- | --- | --- |
| Data collection | Static questionnaires | Conversational AI with follow-up questions |
| Response rates | 25–30% | 2x (piloted) |
| Open-ended analysis | TAs summarise by hand | Auto coding and theming, with verbatim quotes |
| Close the feedback loop | Not provided | "You said / we did" + action tracking |
| Triangulation | Not provided | Panels, response groups, pass rates built in |
| Quality scoring | Minimum response threshold | Relevance · depth · coverage · completion |
| Cross-reference and query history | Not supported | Query across years, courses, cohorts |
| Voice support | | 30+ languages |
| EU data sovereignty | German / US hosting | EU residency, AVG-compliant |
| Pricing | Opaque site licences | Per-conversation |

26 research biases. Addressed by design.

Decades of research on student evaluations of teaching (SET) document systemic biases in course evaluations. Static surveys cannot fix them. Conversational AI, quality scoring, and triangulation can.

Collection biases

Who responds and who doesn't

Non-response bias

Students who skip evaluations differ systematically from those who respond, skewing results.

Conversational format doubles response rates. In-class timing during the final lecture captures the full cohort, not just the motivated few.

Self-selection bias

Only students with strong opinions (positive or negative) volunteer, producing a bimodal distribution that misrepresents the majority.

Higher participation means less self-selection. Quality scoring filters extreme outliers so a handful of angry or enthusiastic responses cannot dominate.

Survivorship bias

Students who dropped the course never evaluate it, inflating perceived quality.

Evaluations can be timed before drop deadlines or run at multiple points. The platform flags low response-to-enrolment ratios so QA teams see the gap.

Small sample bias

A single outlier in a 10-person seminar moves the mean dramatically, yet results carry the same weight as a 200-person lecture.

No minimum response threshold. Quality scoring means five genuine conversations produce actionable insights where EvaSys would close the evaluation empty.
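
The arithmetic behind the small-sample problem is easy to check. In the sketch below, every student gives a 4 on a 5-point scale except one who gives a 1; the ratings are invented for illustration only.

```python
def mean_with_one_outlier(class_size: int, typical: float = 4.0, outlier: float = 1.0) -> float:
    """Mean rating when all students but one give `typical` and one gives `outlier`."""
    total = typical * (class_size - 1) + outlier
    return total / class_size


seminar = mean_with_one_outlier(10)    # (9 * 4 + 1) / 10
lecture = mean_with_one_outlier(200)   # (199 * 4 + 1) / 200

print(round(seminar, 3))  # 3.7
print(round(lecture, 3))  # 3.985
```

The same single outlier pulls the 10-person seminar down by 0.3 points and the 200-person lecture by only 0.015.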

Survey fatigue

Students evaluating multiple courses give progressively less thoughtful, more uniform responses.

Five-to-eight-minute adaptive conversations instead of repetitive Likert grids. Each course feels like a fresh conversation, not the sixth identical form.

Mode effects

Online vs. paper, in-class vs. at-home produce different response patterns and rates.

One consistent format — conversational AI in text or voice — administered in-class during the final lecture. Removes the mode-switching noise.

Response biases

How answers get distorted

Social desirability bias

Students give answers they think are socially appropriate rather than honest, especially about their own effort.

Students are more candid with an AI interviewer than with a human or a form they suspect the professor will read. The AI is not a person to impress.

Acquiescence bias

A tendency to agree with statements regardless of content. 'Strongly agree' becomes the default on Likert scales.

Open-ended conversation replaces agree/disagree scales. Follow-up questions probe for specifics — there is no agree button to click.

Halo effect

A positive impression of the instructor bleeds into ratings of unrelated dimensions. A charismatic lecturer gets high marks for syllabus clarity.

The AI asks about specific dimensions separately with targeted follow-ups. It is harder to give blanket ratings when each topic is explored individually.

Recency bias

Events near the end of the semester — a difficult final, a great last lecture — disproportionately influence the evaluation.

The AI prompts for reflections across the full course timeline, not just recent events. Follow-up questions surface experiences from early and mid-semester.

Central tendency bias

Respondents avoid extreme ends of rating scales, clustering around the midpoint and compressing variance.

No Likert scales to cluster on. Quality scores are derived from conversation substance — depth, relevance, coverage — not from a numerical scale the student selects.

Anchoring bias

Exposure to a peer's opinion or a prior year's published score shifts responses toward that reference point.

Each conversation starts fresh with no prior ratings, benchmarks, or peer responses visible. The student responds to the AI's questions, not to a pre-anchored scale.

Confirmation bias

Students who formed an early impression selectively recall evidence that confirms it, ignoring contradictory experiences.

Follow-up questions probe for nuance and counterexamples. The AI asks about specific course elements rather than letting a single impression colour everything.

Attribution bias

Students attribute learning outcomes to the instructor rather than their own effort. Poor grades become 'bad teaching' rather than insufficient study.

The AI can probe what the student did differently, not just what the instructor did. Quality scoring weights substantive observations over attributional claims.

Instrument biases

How the tool itself distorts results

Leading question bias

Questions like 'How effectively did the instructor...' presuppose effectiveness and push responses upward.

The AI generates neutral, adaptive questions from the evaluation brief. No embedded assumptions. Follow-ups are driven by the student's own words.

Framing effects

Positively framed items ('well-prepared') produce different distributions than negatively framed items ('unprepared') measuring the same construct.

Conversational format avoids fixed positive/negative framing. The AI adapts question phrasing based on the conversation flow, not a static template.

Order effects

Placing demographic or grade-expectation questions first primes identity- or outcome-related thinking that contaminates subsequent ratings.

Adaptive conversation flow with no fixed question sequence. The AI follows the student's narrative rather than a predetermined order.

Construct validity failure

Many SET instruments measure student satisfaction or entertainment value rather than teaching effectiveness or learning.

The evaluation brief defines what to measure. Quality scores assess depth and relevance of feedback, not satisfaction ratings. Triangulation with learning outcomes separates teaching quality from likeability.

Demographic biases

Systematic distortions by instructor identity

Gender bias

Female instructors receive systematically lower SET scores than male instructors teaching identical content, even in randomised experiments.

AI analyses feedback content, not the instructor's identity. Standardised analysis pipeline treats all courses equally. Reports surface what students said, not how they rated a person.

Racial and ethnic bias

Instructors from underrepresented groups receive lower evaluations independent of teaching quality, compounding for women of colour.

Same mechanism — the AI evaluates the substance of student feedback, not the instructor's demographics. No Likert scores that encode implicit bias.

Native language bias

Non-native-speaking instructors receive lower 'clarity' ratings that bleed into global effectiveness via the halo effect.

Students respond in their own language (30+). The AI interviews in the student's preferred language, removing accent perception from the equation entirely.

Expected grade bias

Students expecting higher grades give higher evaluations — one of the most robust findings in SET research, incentivising grade inflation.

Quality scoring weights the substance of what students say, not their sentiment. Triangulation with actual pass rates and learning outcomes separates grade satisfaction from teaching quality.

Analysis biases

How results get misread

Quantification bias

Reducing rich qualitative feedback to numerical averages loses critical information. A 3.8 vs. 4.0 difference is treated as meaningful when it is noise.

Reports present qualitative themes with traceable verbatim quotes alongside quality scores. No score-only summaries. Decision-makers see what students actually said.

Ecological fallacy

Drawing conclusions about individual instructors from aggregate departmental averages.

Per-course, per-dimension reporting with explicit drill-down. No false aggregation across incomparable contexts.

Reporting bias

Institutions selectively report or publicise favourable data. Departments suppress low-response-rate evaluations.

Automated, standardised reports generated for every course. No selective reporting — the system produces the same output structure regardless of results.

Normalisation failures

Comparing raw scores across departments, course levels, or class sizes without adjustment treats incomparable numbers as equivalent.

Quality scores are absolute (depth, relevance, coverage), not relative. Cross-department comparisons are explicit about context, and discipline-level benchmarking accounts for structural differences.

Questions decision-makers ask

The objections we hear most, answered directly.

Isn't this just another survey tool?

No. EvaSys and Qualtrics stop at data delivery. Koji owns the full quality cycle: conversational collection; quality-scored qualitative analysis; triangulation with panel discussions, student response groups, and pass rates; governed role-based access; and a closed feedback loop back to students. The evaluation is the starting point, not the product.

We already use EvaSys. What does switching look like?

Koji is a full replacement for EvaSys, not an add-on. We handle the complete data migration (historical evaluations, course structures, and reporting baselines) so you keep year-on-year comparability from day one. Most institutions start with a single-faculty pilot running alongside the existing contract, then roll out institution-wide once the pilot confirms results. The transition is managed, not disruptive.

Our privacy officer will never approve an AI tool for student data.

We built Koji to pass a privacy officer's review, not to contest one. All data stays in the EU (AWS Frankfurt), with full AVG/GDPR compliance, SURFconext SSO, and DPIA-ready documentation. LLM inference runs through your university's own enterprise AI account, so student data never passes through a third-party provider Koji controls. We do not train models on student data, and we publish our sub-processor register, DPA, and AI governance framework publicly.

Students write personal information in responses. How is PII handled?

PII is automatically redacted during processing. Names, student numbers, and identifying details are stripped before data reaches reports. At the National Student Survey scale, manual PII cleaning is impossible. Koji handles this automatically, with a documented redaction pipeline that your data protection officer can audit.

Won't students game it or write abuse?

Every conversation is quality-scored on relevance, depth, coverage, and completion. Only responses scoring 3 or higher are included in reports, so low-effort or off-topic responses are filtered out. Open comments are moderated and summarised before anyone sees them, with a respect prompt at the start. Students are also more honest with an AI interviewer than when speaking directly to professors, reducing social desirability bias.

Does this replace our panel discussions and student response groups?

No. It makes them more effective. Panel discussions and student response group feedback become first-class data sources that triangulate the evaluation. The platform can surface themes from the evaluation to inform panel agendas, and where low response rates make survey scores unreliable, panel input fills the gap.

We don't have capacity for another complex tool.

Setting up an evaluation takes 10 to 20 minutes. Upload your course syllabus, and Koji's agent helps you create a professional evaluation brief, even if you're not a trained researcher. Distribution, collection, moderation, analysis, and reporting are fully automated. Quality assurance teams have told us the platform replaces hours of manual comment summarising, Excel wrangling, and fact-sheet preparation per course, per quarter.

How is Koji for Education priced?

Per conversation. You pay only for quality-scored conversations that meet the inclusion threshold, not for seat licences or campus-wide contracts. Text conversations and voice conversations are priced separately. There are no setup fees, and pricing scales linearly, so a five-student master's course costs proportionally less than a three-hundred-student bachelor's course.

Does it work in Dutch and other languages?

Yes. The platform supports 30+ languages for both text and voice conversations, including Dutch, English, German, French, and Spanish. Students can respond in their preferred language, and reports are generated in the language of your choice. The AI interviewer adapts to the student's language automatically.

We've reformed our evaluation instrument. Do we lose year-on-year comparison?

No. Anchor items preserve year-on-year comparability even as the instrument changes. The platform marks methodological breaks visibly so trends stay honest, and generates continuous-improvement evidence your quality assurance team needs for NVAO accreditation and midterm reviews.

From student feedback to continuous improvement

You set the scope. Our agent co-designs the evaluation brief.

Koji interviews students in real conversations. With follow-up questions a form never asks.

Actionable recommendations with verbatim quotes, surfaced automatically.

2x your response rates while hearing what questionnaires miss.