When the analysis layer outpaces the data infrastructure
The democratization of analysis since the arrival of AI-assisted tools is real and in many ways welcome. But as it lowers the barrier to institutional research, it also has structural consequences.
This article is the fourth and final part of a series on educational data infrastructure and AI-supported learning. The previous articles established which improvements to data infrastructure are required for universities to answer institutional research questions about process-oriented and AI-supported learning. We now build on the technical foundation by taking a governance perspective on how to scale this institutionally.
As an institutional researcher with years of experience, I’m noticing a change in how institutional education data is used. The arrival of AI-assisted analysis tools (large language models that can write database queries from plain-language descriptions, generate scripts from data tables, or interpret datasets without formal analytical training) is lowering the barrier to institutional data analysis in ways that are both genuinely useful and structurally consequential. Take a policy advisor who previously depended on a central data team to answer a question about student retention patterns: with sufficient access and the right tools, they can now attempt that analysis independently. Or a programme director curious about course-level dropout rates, who no longer needs to file a request and wait three weeks. This democratization of analysis is in many ways welcome. But it is happening faster than most data governance frameworks were designed to accommodate.
In a decentralized analysis landscape, the role of data definitions becomes more important. In a centralized model, a data team mediates between raw institutional data and the answers that decision-makers receive. That mediation creates definitional consistency, often invisibly for end users: it ensures that “international student” means the same thing in a retention analysis as it does in an intake report, that “first year” is calculated from the same reference date across departments, and that credits from exchange programmes are included or excluded on a consistent basis. When analysis moves outside that mediation layer, those definitions become assumptions made implicitly by whoever is running the analysis, shaped by whatever the underlying data system happens to contain and however the querying tool happens to interpret it.
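To make this concrete, here is a minimal sketch of how two implicit definitions of “international student” can diverge on the same records. Everything in it is an assumption for illustration: the `Student` fields, the sample records, and the two candidate definitions (by nationality and by country of prior education) are hypothetical and not taken from any actual institutional system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Student:
    student_id: str
    nationality: str              # hypothetical field: country code of nationality
    prior_education_country: str  # hypothetical field: where the entry diploma was obtained


# Hypothetical sample records, invented for this illustration.
STUDENTS = [
    Student("s1", "NL", "NL"),
    Student("s2", "DE", "DE"),
    Student("s3", "DE", "NL"),  # foreign national with a domestic diploma
    Student("s4", "NL", "BE"),  # domestic national with a foreign diploma
]


def international_by_nationality(students, home="NL"):
    """Definition A (assumed): international means a non-home nationality."""
    return {s.student_id for s in students if s.nationality != home}


def international_by_prior_education(students, home="NL"):
    """Definition B (assumed): international means prior education abroad."""
    return {s.student_id for s in students if s.prior_education_country != home}


by_nationality = international_by_nationality(STUDENTS)
by_prior_education = international_by_prior_education(STUDENTS)

# Students that count as international under exactly one of the two definitions.
disputed = by_nationality ^ by_prior_education

print(len(by_nationality), len(by_prior_education), sorted(disputed))
```

Both functions return the same headline count, yet they disagree on which students are in the population. A retention analysis built on one definition and an intake report built on the other would be describing different groups without either author noticing.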
AI-assisted analysis scales this problem rather than solving it. A language model generating a query against a definitionally inconsistent dataset will produce a result that looks authoritative (i.e. formatted, labeled, often accompanied by a chart) but that may be comparing populations defined differently across faculties, drawing on fields populated inconsistently across systems, or missing sub-populations entirely because of how a filter was applied. The practice that has come to be called vibe coding in software development circles (i.e. generating functional analyses or applications through conversational AI prompting, often by users without deep technical training) shows exactly this pattern at the application level: outputs that run without error and return plausible-looking numbers, but that carry assumptions the author was unaware of making. The parallel in institutional data analysis is direct. A 2026 EDUCAUSE case study documenting one American university’s experience with AI deployment on institutional data captured this dynamic in one line: “AI does not solve governance, it exposes it.” Institutions that deployed AI tools expecting them to make sense of fragmented, inconsistently defined data found instead that the fragmentation became visible at scale, in the hands of users who had no frame of reference for evaluating what they were seeing.
The three previous articles in this series have built toward a conclusion about what this means. The event log is the raw material for process-oriented educational analysis. The dimensional model is the semantic governance layer that makes that raw material interpretable consistently. Process mining is the analytical technique that operates on both to reveal actual student trajectories rather than assumed ones. Causal inference is the additional layer required before pattern observations can be treated as explanations. And real-time signals are the downstream possibility that becomes achievable once the foundational layers are stable. AI tools in the analysis layer make the distance between those layers and the end user shorter, but they do not make the layers optional. A well-governed dimensional model with clearly defined conformed dimensions (e.g. shared institutional definitions of what a student is, what an enrolment is, what an average grade consists of) is what makes AI-assisted analysis by non-experts reliable across the institution rather than locally plausible.
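As a rough sketch of what a conformed dimension buys in practice, the example below lets two separate analyses (intake and retention) look up “international” in one shared student dimension instead of each deriving it themselves. The table names, fields, and records are hypothetical and invented for illustration; a real dimensional model would live in a governed database rather than in Python dictionaries.

```python
# Hypothetical conformed student dimension: the institutional definition of
# "international" is applied once, here, and reused by every analysis.
STUDENT_DIM = {
    "s1": {"is_international": False},
    "s2": {"is_international": True},
    "s3": {"is_international": True},
}

# Hypothetical fact tables owned by different teams, both keyed on student_id.
INTAKE_FACTS = [
    {"student_id": "s1", "intake_year": 2024},
    {"student_id": "s2", "intake_year": 2024},
    {"student_id": "s3", "intake_year": 2024},
]
RETENTION_FACTS = [
    {"student_id": "s1", "retained": True},
    {"student_id": "s2", "retained": False},
    {"student_id": "s3", "retained": True},
]


def international_rows(facts):
    """Filter any fact table through the shared dimension, not a local rule."""
    return [row for row in facts if STUDENT_DIM[row["student_id"]]["is_international"]]


intake_population = {row["student_id"] for row in international_rows(INTAKE_FACTS)}
retention_population = {row["student_id"] for row in international_rows(RETENTION_FACTS)}

# Retention rate among international students, using the shared definition.
retained_flags = [row["retained"] for row in international_rows(RETENTION_FACTS)]
retention_rate = sum(retained_flags) / len(retained_flags)

print(intake_population == retention_population, retention_rate)
```

The design choice is that neither analysis contains its own rule for who counts as international. Whether the query is written by a data engineer or generated by an AI tool, it resolves to the same population, which is what makes results comparable across reports.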
This creates a specific challenge for central data and institutional research teams. The traditional value proposition of those teams (i.e. access to data and the expertise to interpret it) is under pressure as access becomes easier and AI lowers the interpretive barrier for non-specialists. Teams that respond by positioning themselves as approval gates will find that pressure compounding, as the tools available to work around central teams continue to improve. The more durable response is a shift in value proposition: from being the team that produces analyses to being the team that makes decentralized analysis reliable. In practice, that means investing in data model documentation that is legible to non-technical users, in governed environments where AI-assisted querying operates within defined semantic boundaries, and in sustained engagement with faculty and policy users.
Questions about data infrastructure (what event data is captured, how it is governed, which shared definitions apply across faculties, and how analytical layers are sequenced) are policy questions. So let’s get to what this means for institutional leadership and policy leads. As I stated in the first article (“Before AI can transform universities, fix the data”), the governance model that gave Dutch universities their autonomy also gave them the freedom to build incompatible systems. That was a coherent arrangement when the primary demands on those systems were standardized output measures for external reporting. The questions emerging from AI-supported and process-oriented learning are fundamentally different. They concern trajectories, sequences, and causal relationships, and they will increasingly be asked by staff working outside central data teams. Whether this shift will create enough shared urgency to reconsider the governance arrangements that produced today’s fragmentation remains uncertain. What is becoming increasingly difficult to ignore, however, is the growing cost of leaving that fragmentation unresolved.
This cost mechanism has two compounding components. The first is analytical debt: every analysis produced against a fragmented or definitionally inconsistent infrastructure becomes an institutional record (a report, a policy decision, a programme evaluation) built on undocumented assumptions. As analyses accumulate, they become reference points for subsequent ones, which inherit the original definitions without anyone knowing it. The analytical layer builds on itself and inconsistencies compound, which becomes a structural problem when AI tools increase the volume of analysis. The second component is the infrastructure gap relative to the pace of pedagogical change. AI-supported and adaptive learning tools are entering higher education, generating new categories of data and new pressure to evaluate what they are doing. Every semester that passes without an event log infrastructure in place is a semester of timestamped interactions, assessment attempts, and engagement patterns that cannot be retrospectively reconstructed, even though they form the historical baseline against which future process models are calibrated and against which early warning signals become interpretable. These two components interact in a specific way: the arrival of AI analysis tools is creating demand for the kind of process-level institutional insight that depends on historical event data, at precisely the moment when that data is not being captured in usable form. Delaying the infrastructure decision therefore lets the analytical debt accumulate while the historical baseline goes uncaptured.
