What quality assurance teams told us, and how we answer it.
What we heard
Response rates in freefall
Surveys that close with zero responses
Eight pages of comments, summarised by hand
Same courses score poorly for years
Privacy officers block every AI pilot
How Koji answers it
Conversations students actually finish
Open text that's safe to read
Triangulated, not a single number
Closes the feedback loop
Clears your privacy officer's review
The platform, in four pillars
EvaSys stops at data delivery. Koji owns the full quality cycle.
Conversations, not forms. AI-powered interviews with follow-up questions, in text or voice across 30+ languages, that students actually finish.
Triangulated quality. Course evaluations, panel discussions, student response groups, peer review, and pass rates in one picture.
Closes the feedback loop. "You said / we did" back to students, with action tracking for educational directors.
Compliance-first. AVG/GDPR, EU data residency, SURFconext SSO, and no training on student data.
How we built it, and why
Every capability exists because quality assurance officers, educational directors, and reform working groups told us it was needed.
Conversations, not questionnaires
An AI interviewer conducts each evaluation as an adaptive conversation with follow-up questions, in text or voice across 30+ languages. Students open up more in a conversational format than they ever do on a static form.
Online response rates have collapsed. One instructor received 7 responses from 150 students. Self-selection bias means only the highly motivated or very angry respond. In our pilot, conversations doubled response rates and lifted completion by 50%. The strongest lever, running the evaluation in the last 10 minutes of the final lecture, quadrupled responses, so the platform is built around exactly that in-class moment.
A short instrument, and open feedback that's finally safe to read
Five to seven core questions on a 5-point scale, with conversational depth where it matters. Open comments are moderated, summarised, and quality-checked before anyone sees them.
Quality assurance teams told us the most valuable feedback is qualitative, but raw text is overwhelming. Some courses generate eight pages of comments that teaching assistants (TAs) filter by hand using Copilot. A minority of comments are non-constructive or abusive. We filter inappropriate content, show a summary by default (raw on request), prompt students to be respectful, and make clear that anonymity can be forfeited in cases of abuse.
Quality scoring on every response
Each conversation is scored on relevance, depth, coverage, and completion. Only responses scoring 3 or higher are included in reports, so reports reflect signal, not noise.
EvaSys requires a minimum of three responses to generate a report, so many surveys close with zero usable data, especially in small master's cohorts. Quality scoring means a handful of genuine conversations beats a pile of half-finished forms, and courses of five still produce actionable insights.
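As a rough illustration of the inclusion rule above, the sketch below scores each conversation on the four dimensions and keeps only those at or above the threshold. The 1-to-5 scale, the use of a plain mean as the overall score, and every name (`ConversationScore`, `overall`, `reportable`) are assumptions made for this example, not a description of Koji's actual scoring model.

```python
from dataclasses import dataclass
from statistics import mean

THRESHOLD = 3  # from the text: responses scoring 3 or higher are reported


@dataclass(frozen=True)
class ConversationScore:
    # Assumption: each dimension is rated on a 1-5 scale.
    relevance: int
    depth: int
    coverage: int
    completion: int

    def overall(self) -> float:
        # Assumption: the overall score is the plain mean of the four dimensions.
        return mean([self.relevance, self.depth, self.coverage, self.completion])


def reportable(scores: list[ConversationScore]) -> list[ConversationScore]:
    """Keep only conversations whose overall score meets the threshold."""
    return [s for s in scores if s.overall() >= THRESHOLD]


# Hypothetical scores for a small cohort, for illustration only.
cohort = [
    ConversationScore(4, 4, 3, 5),  # genuine, complete conversation
    ConversationScore(2, 1, 1, 2),  # half-finished and off-topic
    ConversationScore(3, 3, 3, 3),  # exactly at the threshold
]
kept = reportable(cohort)
print(len(kept))
```

Under these assumptions a small cohort still yields a report as long as some conversations clear the threshold, because inclusion depends on the quality of each conversation rather than on a minimum response count.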
Triangulation, not a single number
Koji combines the course evaluation with panel discussions, student response group feedback, peer review, instructor self-reflection, learning-outcome data, and pass rates into a single course-quality picture.
Quantitative scores alone don't reveal much. A course can score a six or an eight, but quality assurance officers still can't tell whether the issue is the slides, the deadlines, or the course manual. The platform is built around triangulation, exactly the direction reform working groups are now mandating.
Governed access and contextual interpretation
Role-based views: instructors see full results; educational directors and programme directors get contextualised summaries with response rate, class size and trajectory; heads of department see development-oriented summaries, never raw scores; HR sees data only if the instructor shares it.
Faculty councils are explicitly decoupling evaluations from promotion and tenure, primarily because of bias concerns. Evaluations focus on course improvement, not faculty assessment. The product encodes who sees what, always with context, and states plainly that this is student experience, not teaching quality, and not an input to personnel decisions.
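The "who sees what" rules above can be pictured as a simple lookup. The sketch below is a minimal illustration, not Koji's implementation: the `Role` and `View` names and the `view_for` function are hypothetical, and the view HR receives once an instructor shares results is an assumption, since the text only says that HR sees data if the instructor shares it.

```python
from enum import Enum


class Role(Enum):
    INSTRUCTOR = "instructor"
    EDUCATIONAL_DIRECTOR = "educational_director"
    PROGRAMME_DIRECTOR = "programme_director"
    HEAD_OF_DEPARTMENT = "head_of_department"
    HR = "hr"


class View(Enum):
    FULL_RESULTS = "full_results"
    CONTEXTUALISED_SUMMARY = "contextualised_summary"  # with response rate, class size, trajectory
    DEVELOPMENT_SUMMARY = "development_summary"        # never raw scores
    NONE = "none"


# Mapping taken from the role descriptions in the text.
DEFAULT_VIEW = {
    Role.INSTRUCTOR: View.FULL_RESULTS,
    Role.EDUCATIONAL_DIRECTOR: View.CONTEXTUALISED_SUMMARY,
    Role.PROGRAMME_DIRECTOR: View.CONTEXTUALISED_SUMMARY,
    Role.HEAD_OF_DEPARTMENT: View.DEVELOPMENT_SUMMARY,
    Role.HR: View.NONE,
}


def view_for(role: Role, shared_by_instructor: bool = False) -> View:
    """Return the view a role may see for one course evaluation."""
    if role is Role.HR:
        # HR sees data only if the instructor shares it. Which view HR then
        # gets is not stated in the text; a development summary is assumed.
        return View.DEVELOPMENT_SUMMARY if shared_by_instructor else View.NONE
    return DEFAULT_VIEW[role]


print(view_for(Role.HEAD_OF_DEPARTMENT).value)
print(view_for(Role.HR).value)
```

Keeping the mapping in one table makes the access policy easy to audit: anyone reviewing it can see at a glance that no role other than the instructor defaults to full results.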
Close the feedback loop
A "You said / we did" slide shows each new cohort what changed because of last year’s feedback, and the platform tracks whether educational directors and instructors acted on recommendations.
Students stop filling in evaluations because they never see the effects of their feedback. Quality assurance teams are stuck as data suppliers while the same courses score poorly for years with the same comments. Closing the loop is the one capability no incumbent offers, and the most durable way to sustain response rates over time.
Longitudinal, cross-course analytics
Query across years, courses, programmes and cohorts, for example "show all responses about English proficiency in this course over five years", and surface trends with alerts for persistently weak courses.
EvaSys can't cross-reference responses or query history, so quality assurance officers rebuild it by hand in Excel every quarter. Dashboards get finalised, then the data sits there. We made historical querying and trend tracking a first-class capability.
Built for accreditation and comparability
NVAO, AACSB and EQUIS-ready reporting with anchor items that preserve year-on-year comparability even as the instrument is reformed. Visible markers wherever the methodology changed, and continuous-improvement evidence for your six-year accreditation cycle and three-year midterm review.
Reforming a survey breaks historical baselines, a genuine accreditation risk. Quality assurance teams need to demonstrate that student voices are heard and that feedback drives improvement. The platform keeps the trend line intact and the change auditable.
Compliance-first, EU-sovereign
AVG/GDPR by design, EU data residency, SURFconext SSO, DPIA-ready documentation, automatic PII redaction, human oversight of AI outputs, and a hard commitment never to train models on student data.
Institutions want to use AI, but privacy officers create legal barriers. Dutch universities are cautious about data going to American companies, even when it's not personally identifiable. Everything is built to clear a privacy officer's review and a SURF DPIA, not to contest one.
How universities use Koji for Education
- Full data migration from EvaSys — historical evaluations, course structures, and baselines
- Actionable data from every course, including cohorts of five where EvaSys closes empty
- Anchor items preserve year-on-year comparability, even through instrument reform
Course evaluation reform
Roll out a reformed, improvement-focused evaluation instrument across a faculty or institution, aligned with your working group's recommendations.
Qualitative feedback at scale
Turn open-ended responses into structured, traceable insights with verbatim quotes, instead of having TAs summarise eight pages of comments by hand.
Accreditation and programme review
Generate continuous-improvement evidence for NVAO, AACSB and EQUIS, including your six-year accreditation cycle and three-year midterm review.
Quality assurance workflow
Automate the evaluation cycle from distribution through reporting: fact sheets, year reports, and one-page course summaries, ready for educational directors.
Close the feedback loop
Show each new cohort what changed because of last year's feedback. Track whether recommendations were acted on, across every course and programme.
Cross-faculty benchmarking
A consistent quality picture across every programme and faculty, with cross-referencing and trend analysis shareable at inter-faculty QA meetings.
Why teams switch
Incumbents stop at data delivery. Koji owns the whole Plan-Do-Check-Act quality cycle.
26 research biases.
Addressed by design.
Decades of research on student evaluations of teaching (SET) document systemic biases in course evaluations. Static surveys cannot fix them. Conversational AI, quality scoring, and triangulation can.
Collection biases
Who responds and who doesn't
Non-response bias
Students who skip evaluations differ systematically from those who respond, skewing results.
Self-selection bias
Only students with strong opinions (positive or negative) volunteer, producing a bimodal distribution that misrepresents the majority.
Survivorship bias
Students who dropped the course never evaluate it, inflating perceived quality.
Small sample bias
A single outlier in a 10-person seminar moves the mean dramatically, yet the results carry the same weight as those from a 200-person lecture.
Survey fatigue
Students evaluating multiple courses give progressively less thoughtful, more uniform responses.
Mode effects
Online versus paper and in-class versus at-home administration produce different response patterns and rates.
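To make the small sample bias in this group concrete, the sketch below computes the mean rating when every student gives the same score except one outlier. The ratings are hypothetical values on a 5-point scale, chosen only to show the arithmetic, and `mean_with_one_outlier` is an illustrative helper.

```python
def mean_with_one_outlier(class_size: int, typical: float, outlier: float) -> float:
    """Mean rating when everyone gives `typical` except one student who gives `outlier`."""
    return ((class_size - 1) * typical + outlier) / class_size


# Hypothetical ratings: everyone answers 4, one student answers 1.
seminar = mean_with_one_outlier(10, typical=4, outlier=1)
lecture = mean_with_one_outlier(200, typical=4, outlier=1)

print(f"10-person seminar:  {seminar:.3f}")   # 3.700
print(f"200-person lecture: {lecture:.3f}")   # 3.985
```

The same single rating shifts the seminar mean by 0.3 points and the lecture mean by 0.015 points, a factor of 20, which is why the two averages should not be read with equal weight.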
Response biases
How answers get distorted
Social desirability bias
Students give answers they think are socially appropriate rather than honest, especially about their own effort.
Acquiescence bias
A tendency to agree with statements regardless of content. 'Strongly agree' becomes the default on Likert scales.
Halo effect
A positive impression of the instructor bleeds into ratings of unrelated dimensions. A charismatic lecturer gets high marks for syllabus clarity.
Recency bias
Events near the end of the semester — a difficult final, a great last lecture — disproportionately influence the evaluation.
Central tendency bias
Respondents avoid extreme ends of rating scales, clustering around the midpoint and compressing variance.
Anchoring bias
Exposure to a peer's opinion or a prior year's published score shifts responses toward that reference point.
Confirmation bias
Students who formed an early impression selectively recall evidence that confirms it, ignoring contradictory experiences.
Attribution bias
Students attribute learning outcomes to the instructor rather than their own effort. Poor grades become 'bad teaching' rather than insufficient study.
Instrument biases
How the tool itself distorts results
Leading question bias
Questions like 'How effectively did the instructor...' presuppose effectiveness and push responses upward.
Framing effects
Positively framed items ('well-prepared') produce different distributions than negatively framed items ('unprepared') measuring the same construct.
Order effects
Placing demographic or grade-expectation questions first primes identity- or outcome-related thinking that contaminates subsequent ratings.
Construct validity failure
Many SET instruments measure student satisfaction or entertainment value rather than teaching effectiveness or learning.
Demographic biases
Systematic distortions by instructor identity
Gender bias
Female instructors receive systematically lower SET scores than male instructors teaching identical content, even in randomised experiments.
Racial and ethnic bias
Instructors from underrepresented groups receive lower evaluations independent of teaching quality, compounding for women of colour.
Native language bias
Non-native speakers receive lower 'clarity' ratings that bleed into global effectiveness via the halo effect.
Expected grade bias
Students expecting higher grades give higher evaluations — one of the most robust findings in SET research, incentivising grade inflation.
Analysis biases
How results get misread
Quantification bias
Reducing rich qualitative feedback to numerical averages loses critical information. A 3.8 vs. 4.0 difference is treated as meaningful when it is often within sampling noise.
Ecological fallacy
Drawing conclusions about individual instructors from aggregate departmental averages.
Reporting bias
Institutions selectively report or publicise favourable data. Departments suppress low-response-rate evaluations.
Normalisation failures
Comparing raw scores across departments, course levels, or class sizes without adjustment treats incomparable numbers as equivalent.